# Q-learning for beginners

## Train an AI to solve the Frozen Lake environment

Feb 13, 2022 • Maxime Labonne

[Q-learning for beginners](https://mlabonne.github.io/blog/reinforcement%20learning/q-learning/frozen%20lake/gym/tutorial/2022/02/13/Q_learning.html)

In [1]:
import gym
import random
import numpy as np
import pandas as pd
from timeit import default_timer as timer

In [2]:
# Initialize the environment
environment = gym.make("FrozenLake-v1", is_slippery=False)
environment.reset()

# Initialize the Q-table
nb_states = environment.observation_space.n
nb_actions = environment.action_space.n

shape = (nb_states, nb_actions)
qtable = np.zeros(shape)

shape  # 16 tiles/states, 4 actions (0=Left, 1=Down, 2=Right, 3=Up)

(16, 4)
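
As a small aid for reading the Q-table, here is a hedged sketch of how a row index maps to a tile on the grid. It assumes the default 4x4 map, where states are numbered row by row starting from the top-left tile. `state_to_cell` and `ACTION_NAMES` are illustrative names, not part of gym.

```python
# Action codes as used in this notebook: Left 0, Down 1, Right 2, Up 3
ACTION_NAMES = {0: "LEFT", 1: "DOWN", 2: "RIGHT", 3: "UP"}

def state_to_cell(state, n_cols=4):
    """Convert a state index into (row, column), assuming row-major numbering."""
    return divmod(state, n_cols)

print(state_to_cell(0))    # (0, 0)
print(state_to_cell(14))   # (3, 2)
print(ACTION_NAMES[2])     # RIGHT
```

Under that assumption, `qtable[14, 2]` holds the value of moving RIGHT from the tile in the last row, third column.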

In [3]:
df = pd.DataFrame(qtable)
df

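|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 |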


In [4]:
# Randomly choose an action, just for the heck of it.
seq = ["LEFT", "DOWN", "RIGHT", "UP"]
random.choice(seq)

'DOWN'

In [5]:
# Left 0, Down 1, Right 2, Up 3
environment.action_space.sample()

3

In [6]:
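# THE BASICS: take one step in the environment

# Random action
action = environment.action_space.sample()

# Implement action; step() returns (observation, reward, terminated, truncated, info)
new_state, reward, terminated, truncated, info = environment.step(action)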

# Display reward
reward

0.0

# Q-learning

[Buckle up.](https://mlabonne.github.io/blog/reinforcement%20learning/q-learning/frozen%20lake/gym/tutorial/2022/02/13/Q_learning.html#%F0%9F%A4%96-III.-Q-learning)

We need to update the value of our state-action pairs (each cell in the Q-table) considering:

1. The reward for reaching the next state, and 
2. The highest possible value in the next state.

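Roughly speaking, the new value is the current one + the reward + the highest value in the next state. The training code further down refines this with two hyperparameters: a learning rate `alpha`, which controls how far each update moves the current value, and a discount factor `gamma`, which shrinks the contribution of the next state's value. The full update is `Q(s, a) = Q(s, a) + alpha * (reward + gamma * max(Q(s')) - Q(s, a))`.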
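
To make the update concrete, here is a minimal sketch of it applied by hand to a toy table. It assumes the values `alpha = 0.5` and `gamma = 0.9` used in the training cell further down. The helper name `q_update` is hypothetical and not part of gym or the original notebook.

```python
import numpy as np

def q_update(qtable, state, action, reward, new_state, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update in place and return the updated value."""
    # Target: immediate reward plus the discounted best value of the next state
    target = reward + gamma * np.max(qtable[new_state])
    # Move the current estimate a fraction alpha of the way toward the target
    qtable[state, action] += alpha * (target - qtable[state, action])
    return qtable[state, action]

qtable = np.zeros((16, 4))

# Tile 14, action 2 (RIGHT), lands on the goal (tile 15) with reward 1
first = q_update(qtable, state=14, action=2, reward=1.0, new_state=15)
second = q_update(qtable, state=14, action=2, reward=1.0, new_state=15)
print(first, second)  # 0.5 0.75

# Tile 13, action 2 (RIGHT), lands on tile 14: no reward, but tile 14 now has value
third = q_update(qtable, state=13, action=2, reward=0.0, new_state=14)
print(third)
```

The third call shows how value propagates backward from the goal: tile 13 receives no reward itself, yet its value becomes positive because the next state already has one.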

# Training

So training our agent in code means:

1. Choosing a random action (using `action_space.sample()`) if the values in the current state are all zeros. Otherwise, we take the action with the highest value in the current state using `np.argmax()`
2. Implementing this action by moving in the desired direction with `step(action)`
3. Updating the value of the original state with the action we took, using information about the new state and the reward given by `step(action)`


In [7]:
start_time = timer()

# Re-initialize Q-table
qtable = np.zeros((environment.observation_space.n, environment.action_space.n))

# Hyperparameters
episodes = 1000  # Total number of episodes
alpha = 0.5  # Learning rate
gamma = 0.9  # Discount factor

# List of outcomes to plot
outcomes = []

print('Q-table before training:')
print(qtable)

# Training
for _ in range(episodes):
    state, info = environment.reset()  # reset() returns (observation, info)
    done = False

    # By default, we consider our outcome to be a failure
    outcomes.append("Failure")

    # Until the agent gets stuck in a hole or reaches the goal, keep training it
    while not done:
        # Choose the action with the highest value in the current state
        if np.max(qtable[state]) > 0:
            action = np.argmax(qtable[state])
        else:
            # If there's no best action (only zeros), take a random one
            action = environment.action_space.sample()

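        # Implement this action and move the agent in the desired direction
        new_state, reward, terminated, truncated, info = environment.step(action)
        # End the episode on a hole/goal (terminated) or on the step limit (truncated)
        done = terminated or truncated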

        # Update Q(s,a)
        qtable[state, action] = qtable[state, action] + \
                                alpha * (reward + gamma * np.max(qtable[new_state]) - qtable[state, action])

        # Update our current state
        state = new_state

        # If we have a reward, it means that our outcome is a success
        if reward:
            outcomes[-1] = "Success"

end_time = timer()
total_time = end_time - start_time
print(f"\nTrain time: {total_time:.3f} seconds")

print()
print('===========================================')
print('Q-table after training:')
print(qtable)
print("\noutcomes len:", len(outcomes))

Q-table before training:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

Train time: 0.217 seconds

Q-table after training:
[[0.      0.59049 0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.      0.6561  0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.      0.      0.729   0.     ]
 [0.      0.      0.81    0.     ]
 [0.      0.9     0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.      0.      0.      0.     ]
 [0.      0.      1.      0.     ]
 [0.      0.      0.      0.     ]]

outcomes len: 1000


# Evaluating trained agent on 100 episodes

Calculate the percentage of times the agent managed to reach the goal (success rate).

In [8]:
episodes = 100
nb_success = 0

# Evaluation
for _ in range(episodes):
    state, info = environment.reset()
    done = False

    # Do until agent gets stuck or reaches goal
    while not done:
        # Choose action with highest value
        if np.max(qtable[state]) > 0:
            action = np.argmax(qtable[state])
        else:
            # If no best action, take random one
            action = environment.action_space.sample()

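        # Implement action and move agent
        new_state, reward, terminated, truncated, info = environment.step(action)
        # Stop on a hole/goal (terminated) or on the step limit (truncated)
        done = terminated or truncated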

        # Update current state
        state = new_state

        # Tally up the reward
        nb_success += reward

print("\nnum successes:", nb_success)

print("\nepisodes:", episodes)

print(f"\nSuccess rate = {nb_success / episodes * 100}%")



num successes: 100.0

episodes: 100

Success rate = 100.0%


# Visualize the agent moving on the map

In [9]:
# import time

# import gym
# import numpy as np
# from IPython.display import clear_output

# environment = gym.make("FrozenLake-v1", is_slippery=False, render_mode="human")

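# nb_states = environment.observation_space.n
# nb_actions = environment.action_space.n
# NOTE: the next line resets the Q-table to zeros, so the agent below moves at random.
# Skip it to replay the trained Q-table instead.
# qtable = np.zeros((nb_states, nb_actions))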

# state, p = environment.reset()
# environment.render()

# done = False
# sequence = []

# while not done:
#     if np.max(qtable[state]) > 0:
#         action = np.argmax(qtable[state])
#     else:
#         action = environment.action_space.sample()

#     # Add the action to the sequence
#     sequence.append(action)

#     new_state, reward, done, t, info = environment.step(action)

#     state = new_state

#     # Update the render
#     clear_output(wait=True)
#     environment.render()
#     time.sleep(1)

# print(f"Sequence = {sequence}")

# environment.close()

# Sequence = [1, 3, 0, 2, 0, 1, 1, 0, 1]

# Epsilon-Greedy algorithm

With our previous approach, the agent always chooses the action with the highest value.

We want to allow our agent to either:

1. Take the action with the highest value (exploitation)
2. Choose a random action to try to find even better ones (exploration)
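
The two options above can be sketched as one small function. This is only an illustration: `choose_action` is a hypothetical helper, the Q-values are made up, and the generator is seeded purely so the sketch is repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only for repeatability

def choose_action(q_row, epsilon, rng):
    """Return a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))  # exploration: any action
    return int(np.argmax(q_row))              # exploitation: best known action

q_row = np.array([0.0, 0.59, 0.0, 0.0])  # toy values: action 1 (DOWN) looks best

greedy = [choose_action(q_row, epsilon=0.0, rng=rng) for _ in range(5)]
explore = [choose_action(q_row, epsilon=1.0, rng=rng) for _ in range(1000)]

print(greedy)                # [1, 1, 1, 1, 1]
print(sorted(set(explore)))  # the distinct actions tried while exploring
```

With `epsilon = 0` the agent behaves like the previous approach, and with `epsilon = 1` it ignores the Q-table entirely. Values in between trade one off against the other.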

## Implement a linear decay

In [10]:
start_time = timer()

# Reset qtable
qtable = np.zeros((environment.observation_space.n, environment.action_space.n))

# Hyper-parameters
episodes = 1000  # Total number of episodes
alpha = 0.5  # Learning rate
gamma = 0.9  # Discount factor

# Epsilon-greedy parameters
epsilon = 1.0  # Probability of taking a random action (starts fully random)
epsilon_decay = 0.001  # Fixed amount subtracted from epsilon after each episode

outcomes = []

print('Q-table before training:')
print(qtable)

# Training
for _ in range(episodes):
    state, info = environment.reset()  # reset() returns (observation, info)
    done = False

    outcomes.append("Failure")

    while not done:
        # Generate random number between 0 and 1
        rnd = np.random.random()

        # Instead of checking np.max(qtable[state]) > 0 as before:
        # if the random number is below epsilon, take a random action
        if rnd < epsilon:
            action = environment.action_space.sample()
        else:
            # Otherwise, take the index of the highest of the 4 values in this state
            action = np.argmax(qtable[state])

        new_state, reward, terminated, truncated, info = environment.step(action)
        # End the episode on a hole/goal (terminated) or on the step limit (truncated)
        done = terminated or truncated

        # Update Q(s,a)
        qtable[state, action] = qtable[state, action] + \
                                alpha * (reward + gamma * np.max(qtable[new_state]) - qtable[state, action])

        state = new_state

        if reward:
            outcomes[-1] = "Success"

    # Linear decay: lower epsilon by a fixed amount after each episode, never below 0
    epsilon = max(epsilon - epsilon_decay, 0)

end_time = timer()

print()
print('===========================================')
print('Q-table after training:')
print(qtable)

total_time = end_time - start_time
print(f"\nTrain time: {total_time:.3f} seconds")

Q-table before training:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

Q-table after training:
[[0.531441   0.59049    0.59049    0.531441  ]
 [0.531441   0.         0.6561     0.59033723]
 [0.58560004 0.729      0.55250441 0.65566461]
 [0.63222237 0.         0.33930303 0.11399433]
 [0.59049    0.6561     0.         0.531441  ]
 [0.         0.         0.         0.        ]
 [0.         0.81       0.         0.64356913]
 [0.         0.         0.         0.        ]
 [0.6561     0.         0.729      0.59049   ]
 [0.6561     0.81       0.81       0.        ]
 [0.729      0.9        0.         0.72899988]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.80999991 0.9        0.72899999]
 [0.81       0.9        1.         0.81      ]
 [0.

### Not bad!

More of the table is filled in, because the agent explored more state-action pairs. But how well does it perform?

In [12]:
episodes = 100
nb_success = 0

# Evaluation
for _ in range(episodes):
    state, info = environment.reset()
    done = False

    # Until the agent falls into a hole, reaches the goal, or hits the step limit, keep playing
    while not done:
        # Choose the action with the highest value in the current state
        action = np.argmax(qtable[state])

        # Implement this action and move the agent in the desired direction
        new_state, reward, terminated, truncated, info = environment.step(action)
        # Without the truncated check, a greedy agent stuck in a loop would never stop
        done = terminated or truncated

        # Update our current state
        state = new_state

        # When we get a reward, it means we solved the game
        nb_success += reward

# Let's check our success rate!
print(f"Success rate = {nb_success / episodes * 100}%")


Success rate = 100.0%


## Nice!

# Slippery frozen lake?

In [13]:
environment = gym.make("FrozenLake-v1", is_slippery=True)
environment.reset()

# RESET Q-table
qtable = np.zeros((environment.observation_space.n, environment.action_space.n))

# Hyper-parameters
episodes = 1000  # Total number of episodes
alpha = 0.5  # Learning rate
gamma = 0.9  # Discount factor
epsilon = 1.0  # Amount of randomness in the action selection
epsilon_decay = 0.001  # Fixed amount to decrease

# List of outcomes to plot
outcomes = []

print('Q-table before training:')
print(qtable)

# TRAIN
for _ in range(episodes):
    state, info = environment.reset()  # reset() returns (observation, info)
    done = False

    # By default, we consider our outcome to be a failure
    outcomes.append("Failure")

    # Until the agent gets stuck in a hole or reaches the goal, keep training it
    while not done:
        # Generate a random number between 0 and 1
        rnd = np.random.random()

        # If random number < epsilon, take a random action
        if rnd < epsilon:
            action = environment.action_space.sample()
        else:
            # Else, take the action with the highest value in the current state
            action = np.argmax(qtable[state])

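        # Implement this action and move the agent in the desired direction
        new_state, reward, terminated, truncated, info = environment.step(action)
        # End the episode on a hole/goal (terminated) or on the step limit (truncated)
        done = terminated or truncated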

        # Update Q(s,a)
        qtable[state, action] = qtable[state, action] + \
                                alpha * (reward + gamma * np.max(qtable[new_state]) - qtable[state, action])

        # Update our current state
        state = new_state

        # If we have a reward, it means that our outcome is a success
        if reward:
            outcomes[-1] = "Success"

    # Update epsilon
    epsilon = max(epsilon - epsilon_decay, 0)

print()
print('===========================================')
print('Q-table after training:')
print(qtable)

# EVALUATE
episodes = 100
nb_success = 0

for _ in range(episodes):
    state, info = environment.reset()
    done = False

    # Until the agent falls into a hole, reaches the goal, or hits the step limit, keep playing
    while not done:
        # Choose the action with the highest value in the current state
        action = np.argmax(qtable[state])

        # Implement this action and move the agent in the desired direction
        new_state, reward, terminated, truncated, info = environment.step(action)
        # Without the truncated check, a greedy agent stuck in a loop would never stop
        done = terminated or truncated

        # Update our current state
        state = new_state

        # When we get a reward, it means we solved the game
        nb_success += reward

# Let's check our success rate!
print(f"Success rate = {nb_success / episodes * 100}%")


Q-table before training:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

Q-table after training:
[[0.0747668  0.02380324 0.02559581 0.02514055]
 [0.01810044 0.01297042 0.01014918 0.02246237]
 [0.01734818 0.02262023 0.02817611 0.02821268]
 [0.00440419 0.01710293 0.01713482 0.02780391]
 [0.08572783 0.02313136 0.02033653 0.02205256]
 [0.         0.         0.         0.        ]
 [0.00141386 0.00151028 0.0019543  0.00171982]
 [0.         0.         0.         0.        ]
 [0.05542625 0.04201732 0.01927871 0.18337886]
 [0.08799223 0.35035972 0.04802661 0.09908065]
 [0.5502953  0.07968777 0.05927805 0.0775592 ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.12861417 0.182925   0.61739371 0.0638344 ]
 [0.40841728 0.95570518 0.39422477 0.39612477]
 [0.

# Nice!

Success rate is 90%.

What if it weren't, or what if we want to push it even higher?

You can tweak the hyper-parameters...

Maybe implement exponential decay for the epsilon-greedy algorithm too...
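
As a rough sketch of that idea, the snippet below compares the linear decay used above with one possible exponential schedule. The multiplier `rate = 0.995` and the minimum `floor = 0.01` are assumptions chosen for illustration, not values from the original notebook, and both function names are hypothetical.

```python
def linear_decay(epsilon, decay=0.001, floor=0.0):
    """Subtract a fixed amount per episode (the schedule used in this notebook)."""
    return max(epsilon - decay, floor)

def exponential_decay(epsilon, rate=0.995, floor=0.01):
    """Multiply by a constant factor per episode (assumed rate and floor)."""
    return max(epsilon * rate, floor)

eps_lin = eps_exp = 1.0
history = []
for episode in range(1000):
    eps_lin = linear_decay(eps_lin)
    eps_exp = exponential_decay(eps_exp)
    history.append((eps_lin, eps_exp))

# Compare the two schedules at a few points during training
for episode in (99, 499, 999):
    lin, exp = history[episode]
    print(f"episode {episode + 1}: linear={lin:.3f} exponential={exp:.3f}")
```

To try it in the training loop, the line that updates `epsilon` at the end of each episode would call `exponential_decay(epsilon)` instead. The exponential schedule drops quickly at first and then flattens out, so the agent starts exploiting earlier while keeping a little exploration at the floor.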

FYI: slightly modifying the hyperparameters can completely destroy the results.

This is a quirk of reinforcement learning: hyperparameters are quite moody, and it is important to understand their meaning if you want to tweak them.

It's always good to test and try new combinations to build your intuition and become more efficient.