# Reinforcement learning approach to the Frozen Lake game #
## A simple illustration of the Q-learning algorithm ##
We are going to teach a computer to play the game "Frozen Lake" which, conveniently enough, is provided by the gym environment:

In [1]:
import numpy as np
import gym
import random
env = gym.make("FrozenLake-v0")

The goal of the game is simple: one starts at the top left of the screen and needs to reach the bottom right, avoiding the holes:
![](Frozen-Lake.png)
The player must decide at each step which move to make, but beware: the lake is slippery, so even if you choose to move left, from time to time you will find yourself moving in a different direction.
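
To make the slipperiness concrete, here is a minimal sketch of how we understand the slippery dynamics in gym: the requested direction and the two perpendicular ones are each taken with probability $1/3$. The helpers `slippery_outcomes` and `slip` are hypothetical names of ours, not part of gym, and the uniform split is an assumption about the default map.

```python
import random

# Action encoding used by FrozenLake: 0=Left, 1=Down, 2=Right, 3=Up
ACTION_NAMES = ["Left", "Down", "Right", "Up"]

def slippery_outcomes(action):
    """Directions that can actually happen when `action` is requested."""
    return [(action - 1) % 4, action, (action + 1) % 4]

def slip(action, rng=random):
    """Sample the direction actually taken (assumed uniform over the three)."""
    return rng.choice(slippery_outcomes(action))

# Asking for "Left" can also send us up or down, but never right.
print([ACTION_NAMES[a] for a in slippery_outcomes(0)])
```

This is why the learned table has to reason about expected rewards rather than about a fixed path.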

As can easily be checked, there are $4$ possible actions and $16$ possible states:

In [2]:
action_size = env.action_space.n
state_size = env.observation_space.n
print("Number of possible actions: %d, number of possible states :%d." % (action_size,state_size))

Number of possible actions: 4, number of possible states :16.
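
The $16$ states are simply the cells of the $4\times 4$ grid, numbered row by row from the top left. A small sketch of the conversion follows; the helper names `state_to_cell` and `cell_to_state` are ours, and the grid width of 4 is an assumption tied to the default map.

```python
N_COLS = 4  # assumed width of the default 4x4 map

def state_to_cell(state, n_cols=N_COLS):
    """Convert a state index (0..15) into (row, column) on the grid."""
    return divmod(state, n_cols)

def cell_to_state(row, col, n_cols=N_COLS):
    """Convert (row, column) back into the state index."""
    return row * n_cols + col

# The start is the top-left cell, the goal the bottom-right one.
print(state_to_cell(0), state_to_cell(15))
```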


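Our goal will be to construct the Q-table that gives us, for each state $s$ and each action $a$, the total discounted reward we can expect in the future. In other words, the ideal table should be
$$
Q^*(s,a)=R^0(a)+\sum_{t=1}^{\infty} \gamma^t R^t
$$
where $R^0(a)$ is the immediate reward if action $a$ is taken in state $s$, $R^t$ is the reward obtained at time $t$ when the best possible action is chosen at every later step, and $\gamma$ is the discount factor (between $0$ and $1$), which makes distant rewards count less.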
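
To get a feeling for the discounted sum, here is a minimal sketch that evaluates it for one fixed sequence of rewards. The function name `discounted_return` and the reward sequence are made up for the illustration; they are not produced by the environment.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] for t = 0, 1, 2, ..."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# A hypothetical game: no reward for five moves, then the goal (reward 1).
example_rewards = [0, 0, 0, 0, 0, 1]
print(discounted_return(example_rewards, gamma=0.99))
```

With a discount factor close to $1$ a late reward is barely penalised, while a smaller $\gamma$ would strongly favour short paths.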

Since we do not know this table *a priori*, we start with a random guess $Q$:

In [3]:
qtable = np.random.uniform(0,1e-4,(state_size, action_size))
print(qtable)

[[2.23373536e-05 9.53030158e-05 8.95416017e-05 6.67212136e-05]
 [6.37936388e-05 5.36921572e-05 5.81251633e-05 1.14374951e-05]
 [6.35668159e-05 4.66022420e-05 1.59235050e-05 7.21301648e-05]
 [3.91223360e-05 2.69637943e-05 5.47249711e-06 6.32151651e-05]
 [7.11248386e-05 4.48782366e-05 9.37446329e-05 6.05084050e-05]
 [6.80195147e-05 6.98881962e-06 8.28213306e-05 7.79682486e-06]
 [7.96609453e-05 1.50792366e-05 2.90486844e-05 3.53425389e-05]
 [2.53335290e-05 8.53179739e-05 8.39578114e-06 4.90978506e-06]
 [2.93531871e-05 3.03221281e-05 3.66774591e-05 6.14993506e-05]
 [5.99103996e-05 4.91679620e-05 6.30646577e-05 6.03383140e-05]
 [1.89577306e-05 7.13617019e-05 5.22561255e-05 4.81692791e-05]
 [1.65522189e-05 2.77649000e-06 9.70695996e-05 7.15228114e-05]
 [7.50908402e-05 7.08843014e-05 9.41159714e-05 2.91452041e-06]
 [2.51453352e-05 3.12565910e-05 3.99746151e-05 5.72117333e-05]
 [2.51211832e-05 9.69122843e-06 5.74488511e-05 2.32887949e-05]
 [1.39831530e-06 3.16944752e-05 4.25885418e-05 6.787026

To learn this table (the Q-learning part) we need to derive the Bellman equation. It follows from the remark that for the ideal table one has:
$$Q^*(s,a)=R^0(a)+\sum_{t=1}^{\infty} \gamma^t R^t= R^0(a) + \gamma \sum_{t=1}^{\infty} \gamma^{t-1} R^t = R^0(a) + \gamma \left[R^1  + \sum_{\tau=1}^{\infty} \gamma^{\tau} R^{1+\tau}\right]$$

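Since $Q^*(s,a)$ is time-translation invariant, the bracket has the same form as $Q^*$ itself, evaluated at the next state with the best next action. We can thus write:
$$Q^*(s,a)= R^0(a) + \gamma \left[R_{\rm best}^1  + \sum_{\tau=1}^{\infty} \gamma^{\tau} R^{1+\tau}\right]$$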
and this leads to the **Bellman equation**:
$$Q^*(s,a)=  R^0(a) + \gamma \max_{a'} Q^*(s'|a,a')$$

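Given this identity, we will use the update rule:
$$
Q^{t+1}(s,a)=(1-\delta)\, Q^{t}(s,a)+ \delta\left[R^0(a) + \gamma \max_{a'} Q^{t}(s'|a,a')\right]
$$
where $\delta$ is the learning rate and $s'$ is the state reached from $s$ after taking action $a$. Since $Q^*$ is unknown, the current estimate $Q^t$ stands in for it on the right-hand side.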
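
The update blends the old estimate with a new target, namely the reward plus the discounted best value of the next state. It can be sketched as a small function. This is a minimal illustration, assuming the table is a NumPy array indexed as `[state, action]`; the name `q_update` and the toy numbers are ours, chosen only for the example.

```python
import numpy as np

def q_update(qtable, state, action, reward, new_state, delta, gamma):
    """Apply one Q-learning update in place and return the new entry."""
    # Target: immediate reward plus discounted best value of the next state
    target = reward + gamma * np.max(qtable[new_state, :])
    # Blend the old estimate and the target with weight delta
    qtable[state, action] = (1 - delta) * qtable[state, action] + delta * target
    return qtable[state, action]

# Toy table with 2 states and 2 actions (made-up numbers).
toy_q = np.array([[0.0, 0.0],
                  [0.5, 1.0]])
q_update(toy_q, state=0, action=1, reward=0.0, new_state=1, delta=0.5, gamma=0.9)
print(toy_q[0, 1])
```

Even though the move itself earns no reward, the entry grows because the next state already looks promising: this is how the reward at the goal propagates backwards through the table.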

The important point, before updating this table, is to strike a balance between exploration and exploitation when we play the game. Of course, we ultimately want to play according to the Q-table $Q^*$ (*exploitation*), but since our table $Q$ is essentially random at the beginning, we should also allow random moves from time to time (*exploration*).

We shall do this less and less over time: every time we play a new game (a new episode) we should trust our table a little more, and so we set the exploration rate as
$$
\epsilon^t = \epsilon_{\min} + (\epsilon_{\max} - \epsilon_{\min})e^{-n_{\rm episode} \lambda} 
$$
where $\lambda$ is a decay rate.
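
To see how fast exploration fades, the schedule can be evaluated directly. The sketch below is only an illustration: the function name `exploration_rate` is ours, and the numerical values passed to it are assumptions for this example.

```python
import math

def exploration_rate(n_episode, eps_min, eps_max, decay):
    """Exponentially decaying exploration rate after n_episode games."""
    return eps_min + (eps_max - eps_min) * math.exp(-decay * n_episode)

# Illustrative values (assumptions for this sketch).
for n in (0, 100, 1000, 10000):
    eps = exploration_rate(n, eps_min=0.001, eps_max=1.0, decay=0.001)
    print(n, round(eps, 6))
```

The rate starts at $\epsilon_{\max}$ and approaches $\epsilon_{\min}$ without ever going below it, so a small amount of exploration always remains.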

Let us set up all these parameters:

In [4]:
total_episodes = 20000      # Total episodes (number of games played)
learning_rate = 0.5         # Learning rate in Bellman equation (delta)
max_steps = 99              # Max steps per episode
gamma = 0.99                # Discounting rate in the Q-table

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.001            # Minimum exploration probability 
decay_rate = 0.001             # Exponential decay rate for exploration prob

We are now ready to write the learning algorithm:

In [5]:
# List of rewards
rewards = []

# For each episode/game, we play:
for episode in range(total_episodes):
    # Reset the environment
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
    
    # Now we play until the game ends or until it lasts too long
    for step in range(max_steps):
        # First we decide whether to exploit the table or to explore:
        exp_exp_tradeoff = random.uniform(0, 1)
        # If this number is greater than epsilon --> exploitation (take the biggest Q value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state,:])
        # Otherwise make a random choice --> exploration
        else:
            action = env.action_space.sample()

        # Now we take the action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, done, info = env.step(action)

        # Finally we perform the Bellman update...
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
        #... and update the reward for this game.
        # Note that here, we only get a reward 1 if we eventually reach the goal!
        total_rewards += reward
        
        # The new state becomes the current state
        state = new_state
        
        # If done (we fell into a hole or reached the goal): finish the episode
        if done:
            break
        # otherwise we continue to play
        
    # Reduce epsilon after each game/episode
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) 
    # update rewards
    rewards.append(total_rewards)
    
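    # Every 100 games, print the average reward of the games played since the
    # last report (the label says "total reward") and start a new batch
    if episode % 100 == 0:
        av_rewards = sum(rewards) / len(rewards)
        print("Game numer %d, epsilon=%f,  total reward:%f" %(episode,epsilon,av_rewards))
        rewards = []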

print(qtable)

Game numer 0, epsilon=1.000000,  total reward:0.000000
Game numer 100, epsilon=0.904933,  total reward:0.010000
Game numer 200, epsilon=0.818912,  total reward:0.020000
Game numer 300, epsilon=0.741077,  total reward:0.000000
Game numer 400, epsilon=0.670650,  total reward:0.030000
Game numer 500, epsilon=0.606924,  total reward:0.010000
Game numer 600, epsilon=0.549263,  total reward:0.060000
Game numer 700, epsilon=0.497089,  total reward:0.030000
Game numer 800, epsilon=0.449880,  total reward:0.050000
Game numer 900, epsilon=0.407163,  total reward:0.090000
Game numer 1000, epsilon=0.368512,  total reward:0.100000
Game numer 1100, epsilon=0.333538,  total reward:0.060000
Game numer 1200, epsilon=0.301893,  total reward:0.070000
Game numer 1300, epsilon=0.273259,  total reward:0.150000
Game numer 1400, epsilon=0.247350,  total reward:0.050000
Game numer 1500, epsilon=0.223907,  total reward:0.140000
Game numer 1600, epsilon=0.202695,  total reward:0.150000
Game numer 1700, epsilon=0

Game numer 14100, epsilon=0.001001,  total reward:0.800000
Game numer 14200, epsilon=0.001001,  total reward:0.750000
Game numer 14300, epsilon=0.001001,  total reward:0.750000
Game numer 14400, epsilon=0.001001,  total reward:0.790000
Game numer 14500, epsilon=0.001001,  total reward:0.580000
Game numer 14600, epsilon=0.001000,  total reward:0.750000
Game numer 14700, epsilon=0.001000,  total reward:0.680000
Game numer 14800, epsilon=0.001000,  total reward:0.770000
Game numer 14900, epsilon=0.001000,  total reward:0.760000
Game numer 15000, epsilon=0.001000,  total reward:0.690000
Game numer 15100, epsilon=0.001000,  total reward:0.640000
Game numer 15200, epsilon=0.001000,  total reward:0.750000
Game numer 15300, epsilon=0.001000,  total reward:0.710000
Game numer 15400, epsilon=0.001000,  total reward:0.710000
Game numer 15500, epsilon=0.001000,  total reward:0.700000
Game numer 15600, epsilon=0.001000,  total reward:0.610000
Game numer 15700, epsilon=0.001000,  total reward:0.5900

We can now watch our little Q-table play the game, this time using only greedy moves taken from the table (no exploration):

In [6]:
# Start a fresh game and keep track of the initial state
state = env.reset()
env.render()
for step in range(max_steps):
    print ("t=%d " % (step))
    # Take the action (index) that has the maximum expected future reward given that state
    action = np.argmax(qtable[state,:])
    new_state, reward, done, info = env.step(action)
    if done:
        env.render()            
        # We print the number of steps it took.
        print("Number of steps", step)
        break
    # otherwise we move
    state = new_state
    env.render()

In the output below, square brackets mark the player's current position (a red highlight in the original terminal rendering).

[S]FFF
FHFH
FFFH
HFFG
t=0 
  (Up)
[S]FFF
FHFH
FFFH
HFFG
t=1 
  (Left)
[S]FFF
FHFH
FFFH
HFFG
t=2 
  (Left)
SFFF
[F]HFH
FFFH
HFFG
t=3 
  (Left)
SFFF
FHFH
[F]FFH
HFFG
t=4 
  (Up)
SFFF
[F]HFH
FFFH
HFFG
t=5 
  (Left)
[S]FFF
FHFH
FFFH
HFFG
t=6 
  (Left)
SFFF
[F]HFH
FFFH
HFFG
t=7 
  (Left)
SFFF
[F]HFH
FFFH
HFFG
t=8 
  (Left)
SFFF
FHFH
[F]FFH
HFFG
t=9 
  (Up)
SFFF
FHFH
[F]FFH
HFFG
t=10 
  (Up)
SFFF
FHFH
[F]FFH
HFFG
t=11 
  (Up)
SFFF
FHFH
F[F]FH
HFFG
t=12 
  (Down)
SFFF
FHFH
FFFH
H[F]FG
t=13 
  (Right)
SFFF
FHFH
FFFH
HF[F]G
t=14 
  (Down)
SFFF
FHFH
FFFH
HFF[G]
Number of steps 14
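
A single game only shows part of what was learned. To look at the whole strategy at once, one can display the preferred move in every cell. The sketch below is a hedged illustration: the helper `greedy_policy_grid` and the stand-in table `demo_qtable` are ours, the action encoding (0=Left, 1=Down, 2=Right, 3=Up) is the one we understand FrozenLake to use, and the $4\times 4$ shape is an assumption about the default map.

```python
import numpy as np

# Action encoding used by FrozenLake: 0=Left, 1=Down, 2=Right, 3=Up
ARROWS = np.array(["<", "v", ">", "^"])

def greedy_policy_grid(qtable, n_rows=4, n_cols=4):
    """Return the greedy action of every state as a grid of arrows."""
    best_actions = np.argmax(qtable, axis=1)
    return ARROWS[best_actions].reshape(n_rows, n_cols)

# Stand-in table so the sketch runs on its own; pass the trained qtable instead.
demo_qtable = np.zeros((16, 4))
demo_qtable[:, 2] = 1.0   # pretend "Right" is best everywhere
print(greedy_policy_grid(demo_qtable))
```

Calling `greedy_policy_grid(qtable)` on the trained table gives the learned strategy. The arrows shown on holes and on the goal carry no meaning, since the game ends there and those rows of the table are never updated.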


By playing with the parameters, we should be able to obtain an algorithm that plays the game quite well.