<a href="https://colab.research.google.com/github/r0brt/Daan_Semesterarbeit/blob/master/HaemmerliRobert_Daan_QLearning_FrozenLake.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Frozen Lake
![alt text](http://deeplizard.com/images/frozen%20lake%20winter.jpg)

### Setting up Frozen Lake


#### Libraries

In [0]:
import numpy as np
import gym
import random
import time
from IPython.display import clear_output

#### Creating the environment

Next, to create our environment, we call `gym.make()` and pass it the name of the environment we want to set up as a string. We'll be using the environment FrozenLake-v0. The available environments and their names are listed on Gym’s website: https://gym.openai.com/envs/#classic_control

In [0]:
env = gym.make("FrozenLake-v0")

#### Creating the Q-table

We’re now going to construct our Q-table, and initialize all the Q-values to zero for each state-action pair.

Remember, the number of rows in the table is equal to the size of the state space in the environment, and the number of columns is equal to the size of the action space. We can get these sizes from `env.observation_space.n` and `env.action_space.n`, and then use them to build the Q-table and fill it with zeros.


In [0]:
action_space_size = env.action_space.n
state_space_size = env.observation_space.n

q_table = np.zeros((state_space_size, action_space_size))

In [4]:
print(q_table)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
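
To make the shape of the table more concrete, here is a small sketch of how a row and column of the Q-table relate to the lake. It assumes the default 4x4 FrozenLake map, where states are numbered row by row and the four actions are encoded as 0 = Left, 1 = Down, 2 = Right, 3 = Up. The helper names and `demo_q_table` are made up for this illustration, and plain Python lists stand in for the NumPy array.

```python
# Assumption: the default 4x4 FrozenLake map, with states numbered row by row.
N_COLS = 4
ACTION_NAMES = ["Left", "Down", "Right", "Up"]  # FrozenLake's action encoding 0..3

def state_to_cell(state, n_cols=N_COLS):
    """Convert a flat state index into a (row, column) grid position."""
    return divmod(state, n_cols)

def cell_to_state(row, col, n_cols=N_COLS):
    """Convert a (row, column) grid position back into a flat state index."""
    return row * n_cols + col

# A 16 x 4 table of zeros as nested lists, mirroring the shape of q_table.
demo_q_table = [[0.0] * len(ACTION_NAMES) for _ in range(N_COLS * N_COLS)]

# Hypothetical value: pretend the agent has learned that moving Right
# from state 14 (the cell just left of the goal) is valuable.
demo_q_table[14][2] = 0.5

# The greedy action for a state is the column with the largest Q-value.
best_action = max(range(len(ACTION_NAMES)), key=lambda a: demo_q_table[14][a])
print(state_to_cell(14), ACTION_NAMES[best_action])
```

Each row of the table holds the four Q-values for one cell of the lake, and picking the largest entry in a row is what `np.argmax` does during exploitation.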


#### Initializing Q-learning parameters

First, with **num_episodes**, we define the total number of episodes we want the agent to play during training. Then, with **max_steps_per_episode**, we define a maximum number of steps that our agent is allowed to take within a single episode. So, if by the one-hundredth step, the agent hasn’t reached the frisbee or fallen through a hole, then the episode will terminate with the agent receiving zero points.

Next, we set our **learning_rate**, which was represented by the symbol α in the previous post. Then we set our **discount_rate**, which was represented by the symbol γ.

Now, the last four parameters are all related to the exploration-exploitation trade-off we talked about last time in regard to the epsilon-greedy policy. We’re initializing our **exploration_rate** to 1, setting the **max_exploration_rate** to 1, and setting the **min_exploration_rate** to 0.01. The max and min are just bounds on how large or small our exploration rate can be.
Remember, the exploration rate was represented with the symbol ϵ when we discussed it previously.

Lastly, we set the **exploration_decay_rate** to 0.001 to determine the rate at which the exploration_rate will decay.

In [0]:
num_episodes = 10000
max_steps_per_episode = 100

learning_rate = 0.1
discount_rate = 0.99

exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001
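
To get a feel for what these four exploration values imply, the sketch below prints the exploration rate at a few episodes. The schedule itself is an assumption: an exponential decay from the max toward the min, which is one common way to use a decay rate of this kind. The function name `decayed_exploration_rate` is made up for this illustration.

```python
import math

# Values copied from the parameter cell above.
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001

def decayed_exploration_rate(episode):
    """Assumed schedule: exponential decay from the max toward the min."""
    span = max_exploration_rate - min_exploration_rate
    return min_exploration_rate + span * math.exp(-exploration_decay_rate * episode)

# Print the exploration rate at the start, partway through, and at the end of training.
for episode in (0, 1000, 5000, 9999):
    print(episode, round(decayed_exploration_rate(episode), 4))
```

Under this assumed schedule the agent starts out acting almost entirely at random and ends up close to the minimum rate, so it mostly exploits what it has learned by the final episodes.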

#### The Q-learning algorithm training loop 

`Update Q-table for Q(s,a)`:  
The new Q-value for this state-action pair is a weighted sum of our old value and the “learned value.” That is, the new Q-value equals the old Q-value times one minus the learning rate, plus the learning rate times the learned value, where the learned value is the reward we just received from our last action plus the discounted estimate of the optimal future Q-value for the next state-action pair.
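
In symbols, this update can be written as follows. Here α is the learning rate and γ the discount rate, while r (the reward), s′ (the next state) and a′ (an action available in the next state) are notation introduced only for this formula.

$$Q^{new}(s, a) = (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)$$

A minimal sketch of the same arithmetic for a single state-action pair is shown below. The function `q_update` and the transition values are hypothetical, chosen only to make the numbers easy to follow.

```python
def q_update(old_value, reward, max_next_value, learning_rate=0.1, discount_rate=0.99):
    """Weighted sum of the old Q-value and the learned value."""
    learned_value = reward + discount_rate * max_next_value
    return (1 - learning_rate) * old_value + learning_rate * learned_value

# Hypothetical transition: the Q-value is still 0, the step yields a reward of 1,
# and every Q-value of the next state is 0.
first = q_update(old_value=0.0, reward=1.0, max_next_value=0.0)

# The same transition seen a second time, starting from the updated value.
second = q_update(old_value=first, reward=1.0, max_next_value=0.0)

print(round(first, 4), round(second, 4))
```

With a learning rate of 0.1, each update moves the stored value only a tenth of the way toward the learned value, which is why the Q-values build up gradually over many episodes.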

In [0]:
# list to hold all of the rewards
rewards_all_episodes = []

# Q-learning algorithm
for episode in range(num_episodes):
    # initialize new episode params
    state = env.reset()
    done = False                 # keeps track of whether or not our episode is finished
    rewards_current_episode = 0  # keep track of the rewards
    
    for step in range(max_steps_per_episode): 
        # Exploration-exploitation trade-off
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:  # exploit the environment and choose the action that has the highest Q-value
            action = np.argmax(q_table[state,:]) 
        else:                                              # agent will explore the environment, and sample an action randomly
            action = env.action_space.sample()
            
        # Take new action
        new_state, reward, done, info = env.step(action)
        
        # Update Q-table for Q(s,a)
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))
        
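        # Set new state
        state = new_state
        # Add new reward
        rewards_current_episode += reward

        # Stop the episode once the agent reaches the goal or falls through a hole
        if done:
            break

    # Exploration rate decay
    # (assumed schedule: exponential decay from the max toward the min exploration rate)
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

    # Add current episode reward to total rewards list
    rewards_all_episodes.append(rewards_current_episode)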