# Frozen Lake
![alt text](http://deeplizard.com/images/frozen%20lake%20winter.jpg)

**Starting point:**  
A frozen lake laid out as a 4x4 grid that the agent has to cross.
The lake is not completely frozen: it has individual holes (H) that must be avoided.

 ![alt text](https://www.analyticsindiamag.com/wp-content/uploads/2018/03/Frozen-Lake.png) ![alt text](https://analyticsindiamag.com/wp-content/uploads/2018/03/description.png)

  
**Possible actions**  
left  |  down  |  right  |  up  
(BUT: the ice is slippery, so the agent does not always move in the intended direction: *is_slippery=True*)
  
  
**Rewards**  
On success (G): +1  
Otherwise (F/H): 0
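
To make the effect of the slippery ice concrete, the sketch below enumerates where a single action can lead. It is not Gym's implementation. It assumes the slip model usually described for FrozenLake (the agent moves in the intended direction or in one of the two perpendicular directions, each with probability 1/3), the action encoding 0 = left, 1 = down, 2 = right, 3 = up, and states numbered 0 to 15 row by row. The helper names `move` and `slippery_outcomes` are made up for this illustration.

```python
# Illustrative sketch under the assumptions stated above, not Gym's code.
GRID_SIZE = 4  # 4x4 lake, states numbered 0..15 row by row (assumed encoding)

LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3  # assumed action encoding
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def move(state, action):
    """Return the state reached by moving one cell, staying put at the border."""
    row, col = divmod(state, GRID_SIZE)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), GRID_SIZE - 1)
    col = min(max(col + d_col, 0), GRID_SIZE - 1)
    return row * GRID_SIZE + col


def slippery_outcomes(state, action):
    """List the three equally likely next states under the assumed slip model."""
    candidates = [(action - 1) % 4, action, (action + 1) % 4]
    return [move(state, a) for a in candidates]


print(slippery_outcomes(0, RIGHT))   # from the start cell, trying to go right
print(slippery_outcomes(14, DOWN))   # from the cell left of the goal, trying to go down
```

Under these assumptions, choosing "down" in state 14 can still slide the agent sideways onto the goal (state 15), which is why the intended direction and the direction actually taken have to be kept apart when reading the Q-table.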

### Setting up Frozen Lake


#### Libraries

In [0]:
import numpy as np
import gym
import random
import time
from IPython.display import clear_output

#### Creating the environment

Next, to create our environment, we call `gym.make()` and pass the name of the environment we want to set up as a string. We'll be using the environment FrozenLake-v0. All available environments and their names are listed on Gym’s website: https://gym.openai.com/envs/#classic_control. Note that this notebook targets the older Gym API; newer Gym and Gymnasium releases register the environment as FrozenLake-v1 and changed the return values of `reset()` and `step()`.

In [13]:
env = gym.make("FrozenLake-v0")
env.render()


SFFF
FHFH
FFFH
HFFG


#### Creating the Q-table

We’re now going to construct our Q-table, and initialize all the Q-values to zero for each state-action pair.

Remember, the number of rows in the table is equal to the size of the state space in the environment, and the number of columns is equal to the size of the action space. We can get this information using `env.observation_space.n` and `env.action_space.n`, as shown below. We can then use it to build the Q-table and fill it with zeros.


In [0]:
action_space_size = env.action_space.n
state_space_size = env.observation_space.n

q_table = np.zeros((state_space_size, action_space_size))

In [20]:
print(q_table)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


#### Initializing Q-learning parameters

First, with **num_episodes**, we define the total number of episodes we want the agent to play during training. Then, with **max_steps_per_episode**, we define a maximum number of steps that our agent is allowed to take within a single episode. So, if by the one-hundredth step, the agent hasn’t reached the frisbee or fallen through a hole, then the episode will terminate with the agent receiving zero points.

Next, we set our **learning_rate**, which is written mathematically as α. Then, we set our **discount_rate**, which is written as γ.

Now, the last four parameters all relate to the exploration-exploitation trade-off we discussed earlier in connection with the epsilon-greedy policy. We initialize our **exploration_rate** to 1 and set the **max_exploration_rate** to 1 and the **min_exploration_rate** to 0.01. The max and min are just bounds on how large or small our exploration rate can be.
Remember, the exploration rate was represented with the symbol ϵ when we discussed it previously.

Lastly, we set the **exploration_decay_rate** to 0.001 to determine the rate at which the exploration_rate will decay.

In [0]:
num_episodes = 10000
max_steps_per_episode = 100

learning_rate = 0.1
discount_rate = 0.99

exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001
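
To get a feeling for these exploration parameters, the following sketch evaluates the decay formula that the training loop applies after each episode and pairs it with an epsilon-greedy choice. The function names `exploration_rate_at` and `epsilon_greedy` are hypothetical helpers introduced here, and the plain-Python argmax stands in for `np.argmax`.

```python
import math
import random

# Values copied from the parameter cell above.
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001


def exploration_rate_at(episode):
    """Exponential decay from the max rate toward the min rate."""
    span = max_exploration_rate - min_exploration_rate
    return min_exploration_rate + span * math.exp(-exploration_decay_rate * episode)


def epsilon_greedy(q_row, exploration_rate, rng=random):
    """Pick a random action with probability exploration_rate, else the greedy one."""
    if rng.uniform(0, 1) > exploration_rate:
        return max(range(len(q_row)), key=q_row.__getitem__)  # exploit
    return rng.randrange(len(q_row))  # explore


# Print the exploration rate at a few points of the training run.
for episode in (0, 1000, 5000, 10000):
    print(episode, round(exploration_rate_at(episode), 4))
```

With these values the agent explores on every step at the start, still explores on roughly 37% of steps after 1,000 episodes, and is very close to the 0.01 floor by the end of the 10,000 episodes.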

### Training Q-learning algorithm

`Update Q-table for Q(s,a)`:  
The new Q-value for this state-action pair is a weighted sum of the old value and the “learned value”: the old Q-value times one minus the learning rate, plus the learning rate times the learned value. The learned value is the reward we just received for our last action plus the discounted estimate of the optimal future Q-value for the next state-action pair.
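
As a small worked example of this update rule, the sketch below applies it to single numbers instead of a whole table. The function name `q_update` is a hypothetical helper; the default learning rate and discount rate are the values set in the parameter cell.

```python
def q_update(old_value, reward, next_state_values, learning_rate=0.1, discount_rate=0.99):
    """Blend the old Q-value with the learned value for one state-action pair."""
    learned_value = reward + discount_rate * max(next_state_values)
    return (1 - learning_rate) * old_value + learning_rate * learned_value


# First time the goal is reached: old value 0, reward 1, next state still all zeros.
first = q_update(0.0, 1.0, [0.0, 0.0, 0.0, 0.0])
# A later step with no reward still pulls value backwards from the next state.
second = q_update(0.0, 0.0, [0.0, first, 0.0, 0.0])
print(first, second)
```

The first update moves the value from 0 to 0.1, one learning-rate-sized step toward the reward of 1. The second shows how a state that never yields a reward itself still acquires a small positive value (about 0.0099) once its successor has one.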

In [23]:
# list to hold all of the rewards
rewards_all_episodes = []
nbr_success = 0

# Q-learning algorithm
for episode in range(num_episodes):
    # initialize new episode params
    state = env.reset()
    done = False                 # keeps track of whether or not our episode is finished
    rewards_current_episode = 0  # keep track of the rewards
    
    for step in range(max_steps_per_episode): 
        # Exploration-exploitation trade-off
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:  # exploit the environment and choose the action that has the highest Q-value
            action = np.argmax(q_table[state,:]) 
        else:                                              # agent will explore the environment, and sample an action randomly
            action = env.action_space.sample()
            
        # Take new action
        new_state, reward, done, info = env.step(action)
        
        # Update Q-table for Q(s,a)
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))
                
        # Set new state
        state = new_state
        
        # Add new reward   
        rewards_current_episode += reward
        if nbr_success == 0 and rewards_current_episode > 0:
            nbr_success = nbr_success + 1
            print("\n*** Success ", nbr_success, "***")
            print("Episode: ", episode)
            print("Step: ", step)
            print("Reward: ", reward)
            print("Reward curr. Episode: ", rewards_current_episode)
            print(q_table)
            
        if done:
            break


    # Exploration rate decay
    # Exploration rate decays exponentially with the episode index,
    # from max_exploration_rate toward min_exploration_rate
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate*episode) 
    
    # Add current episode reward to total rewards list
    rewards_all_episodes.append(rewards_current_episode)

    
    
# Calculate and print the average reward per thousand episodes
# (np.split needs an integer number of sections, hence the floor division)
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000

print("\n********Average reward per thousand episodes********\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000


*** Success  1 ***
Episode:  41
Step:  8
Reward:  1.0
Reward curr. Episode:  1.0
[[0.  0.  0.  0. ]
 [0.  0.  0.  0. ]
 [0.  0.  0.  0. ]
 [0.  0.  0.  0. ]
 [0.  0.  0.  0. ]
 [0.  0.  0.  0. ]
 [0.  0.  0.  0. ]
 [0.  0.  0.  0. ]
 [0.  0.  0.  0. ]
 [0.  0.  0.  0. ]
 [0.  0.  0.  0. ]
 [0.  0.  0.  0. ]
 [0.  0.  0.  0. ]
 [0.  0.  0.  0. ]
 [0.  0.1 0.  0. ]
 [0.  0.  0.  0. ]]

********Average reward per thousand episodes********

1000 :  0.046000000000000034
2000 :  0.20900000000000016
3000 :  0.3960000000000003
4000 :  0.5450000000000004
5000 :  0.6210000000000004
6000 :  0.6640000000000005
7000 :  0.6460000000000005
8000 :  0.6590000000000005
9000 :  0.7050000000000005
10000 :  0.6810000000000005
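
Because the only non-zero reward is the +1 for reaching the goal, each of these averages can be read as the share of episodes in that block that ended on the goal. The sketch below computes the same kind of per-block average in plain Python. The name `average_per_chunk` is a hypothetical helper, and unlike `np.split` it also accepts a final block that is shorter than the others.

```python
def average_per_chunk(rewards, chunk_size=1000):
    """Average reward for each consecutive block of chunk_size episodes."""
    averages = []
    for start in range(0, len(rewards), chunk_size):
        chunk = rewards[start:start + chunk_size]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Toy data: four episodes, averaged in blocks of two.
print(average_per_chunk([0.0, 1.0, 1.0, 1.0], chunk_size=2))
```

Applied to `rewards_all_episodes` with the default block size, this should give the same ten averages as the cell above, up to floating-point rounding.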


In [24]:
# Print updated Q-table
print("\n\n******** Q-table after training (10'000 episodes) ********\n")
print(q_table)



******** Q-table after training (10'000 episodes) ********

[[0.50997946 0.4966136  0.49878094 0.49532026]
 [0.27384592 0.3808344  0.26419844 0.43415148]
 [0.38306658 0.27323237 0.25088774 0.24586175]
 [0.09470931 0.05914075 0.04501136 0.06261971]
 [0.53688988 0.36940758 0.31826316 0.41978232]
 [0.         0.         0.         0.        ]
 [0.32054728 0.12992356 0.21714304 0.09305219]
 [0.         0.         0.         0.        ]
 [0.32985344 0.38290721 0.35341727 0.57962875]
 [0.48301355 0.65443127 0.41008896 0.30256534]
 [0.56254638 0.42200073 0.41783667 0.38171221]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.42509461 0.53037661 0.72644422 0.61615195]
 [0.7609492  0.85978589 0.78532778 0.78981493]
 [0.         0.         0.         0.        ]]
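
One way to read this table is to look up the highest-valued action in each row. The sketch below does that for three rows copied from the output above. `greedy_actions` and `ACTION_NAMES` are hypothetical names, and the action order (left, down, right, up for columns 0 to 3) is assumed from the list of possible actions at the top.

```python
ACTION_NAMES = ["Left", "Down", "Right", "Up"]  # assumed order of columns 0..3


def greedy_actions(q_table):
    """Return the name of the highest-valued action for every state row."""
    policy = []
    for row in q_table:
        best = max(range(len(row)), key=row.__getitem__)
        policy.append(ACTION_NAMES[best])
    return policy


# Three rows copied from the trained table above (states 0, 13 and 14).
sample_rows = [
    [0.50997946, 0.4966136, 0.49878094, 0.49532026],
    [0.42509461, 0.53037661, 0.72644422, 0.61615195],
    [0.7609492, 0.85978589, 0.78532778, 0.78981493],
]
print(greedy_actions(sample_rows))
```

For the full table, `greedy_actions(q_table.tolist())` would list the preferred action of all 16 states. Rows that are still all zeros (the holes and the goal, where the episode ends) fall back to the first action.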


### Watch the agent play the game

In [26]:
# Watch our agent play Frozen Lake by playing the best action 
# from each state according to the Q-table

for episode in range(2):
    # initialize new episode params
    state = env.reset()
    done = False
    print("*****EPISODE ", episode+1, "*****\n\n\n\n")
    time.sleep(1)

    for step in range(max_steps_per_episode):        
        # Show current state of environment on screen    
        clear_output(wait=True)
        env.render()
        time.sleep(0.3)
        
        # Choose action with highest Q-value for current state  
        action = np.argmax(q_table[state,:])   
        
        # Take new action
        new_state, reward, done, info = env.step(action)
        
        if done:
            clear_output(wait=True)
            env.render()
            if reward == 1:
                # Agent reached the goal and won episode
                print("****You reached the goal!****")
                time.sleep(3)                
            else:
                # Agent stepped in a hole and lost episode      
                print("****You fell through a hole!****")
                time.sleep(3)
                clear_output(wait=True)
            break
            
        # Set new state
        state = new_state
        
env.close()

  (Down)
SFFF
FHFH
FFFH
HFFG
****You reached the goal!****
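
Two rendered episodes say little about how reliably the learned policy reaches the goal. A possible complement is to play many episodes greedily without rendering and count the successes, as sketched below. `evaluate_greedy` is a hypothetical helper that assumes the older Gym API used in this notebook, and `CorridorEnv` is a made-up three-state stand-in so that the example runs without Gym installed.

```python
def evaluate_greedy(env, q_table, episodes=100, max_steps=100):
    """Fraction of episodes in which the greedy policy ends with a reward of 1.

    Assumes the older Gym API used in this notebook: reset() returns the
    state and step() returns (state, reward, done, info).
    """
    wins = 0
    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):
            row = q_table[state]
            action = max(range(len(row)), key=lambda a: row[a])
            state, reward, done, _info = env.step(action)
            if done:
                wins += int(reward == 1)
                break
    return wins / episodes


class CorridorEnv:
    """Hypothetical stand-in environment: states 0-1-2, goal at state 2."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves one cell to the right, anything else stays in place.
        if action == 1:
            self.state += 1
        done = self.state == 2
        return self.state, float(done), done, {}


toy_q_table = [[0.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
print(evaluate_greedy(CorridorEnv(), toy_q_table, episodes=5))
```

Called as `evaluate_greedy(env, q_table)` before `env.close()`, the same function would estimate the success rate of the trained agent on the slippery lake.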
