### Goal:
###### We choose the Taxi-v3 environment from Gymnasium, a standard API for reinforcement learning that comes with a diverse collection of reference environments.
###### We use the Q-learning algorithm to move the taxi to the passenger’s location, pick up the passenger, move to the passenger’s desired destination, and drop off the passenger. Once the passenger is dropped off, the episode ends.

![](taxi-v3_gymnasium.gif)

### Description
There are four designated pick-up and drop-off locations (Red, Green, Yellow and Blue) in the 5x5 grid world. The taxi starts off at a random square and the passenger at one of the designated locations.

The goal is to move the taxi to the passenger’s location, pick up the passenger, move to the passenger’s desired destination, and drop off the passenger. Once the passenger is dropped off, the episode ends.

The player receives a positive reward for successfully dropping off the passenger at the correct location, and negative rewards for incorrect pick-up or drop-off attempts and for each step in which no other reward is received.
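
As a rough illustration of the reward scheme, the sketch below adds up the return of an episode that ends with a successful drop-off. The numeric values (-1 per step, -10 for an illegal pick-up or drop-off, +20 for delivery) are the ones listed in the Gymnasium Taxi documentation; the function name `episode_return` and its parameters are hypothetical and only serve this example.

```python
# Reward values as listed in the Gymnasium Taxi documentation.
STEP_REWARD = -1        # every step that earns no other reward
ILLEGAL_REWARD = -10    # illegal pick-up or drop-off attempt
DELIVERY_REWARD = 20    # successful drop-off, which ends the episode


def episode_return(n_steps, n_illegal=0):
    """Total reward of an episode that ends with a successful drop-off.

    n_steps counts every action taken, including the final drop-off;
    n_illegal counts the illegal pick-up/drop-off attempts among them.
    """
    n_plain = n_steps - 1 - n_illegal  # ordinary moves and the legal pick-up
    return n_plain * STEP_REWARD + n_illegal * ILLEGAL_REWARD + DELIVERY_REWARD


print(episode_return(12))               # 11 * -1 + 20 = 9
print(episode_return(12, n_illegal=1))  # 10 * -1 - 10 + 20 = 0
```

Under these assumptions, a 12-step episode with no illegal action yields a return of 9.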

For more details please visit the link: https://gymnasium.farama.org/environments/toy_text/taxi/
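
The linked documentation describes 500 discrete states: 25 taxi positions, 5 passenger locations (the four designated locations plus "in the taxi"), and 4 destinations. The sketch below shows how such a state can be packed into a single integer and unpacked again, following the encoding formula given in that documentation. The helper names `encode` and `decode` are chosen for this example and are not part of the notebook.

```python
def encode(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four state components into one integer in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination


def decode(state):
    """Inverse of encode: (taxi_row, taxi_col, passenger_location, destination)."""
    state, destination = divmod(state, 4)
    state, passenger_location = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_location, destination


state = encode(taxi_row=3, taxi_col=1, passenger_location=2, destination=0)
print(state)          # 328
print(decode(state))  # (3, 1, 2, 0)
```

This integer is what indexes the rows of a Q table with one row per state.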


### Q-learning equation:
![](Q_learning_equation.png)
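
For readers who cannot view the image, the standard Q-learning update rule can be written as follows. This is a transcription under the assumption that the image shows the usual form of the rule, which is also the form used in the training loop of this notebook.

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

Here $s$ is the current state, $a$ the chosen action, $r$ the reward received, $s'$ the next state, $\alpha$ the learning rate, and $\gamma$ the discount factor.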


### Import libraries

In [2]:
import pickle  # used later to load the saved Q matrix
import time

import numpy as np
import pygame
import gymnasium as gym

pygame 2.5.0 (SDL 2.28.0, Python 3.9.13)
Hello from the pygame community. https://www.pygame.org/contribute.html


In [3]:
pygame.init()

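# Work with the Taxi-v3 environment
env = gym.make('Taxi-v3', render_mode='human')

# Constants
alpha = .8         # learning rate
gamma = .9         # discount factor
epochs = 1000      # number of training episodes
final_reward = 20  # reward for a successful drop-off, which ends the episode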
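
To get a feel for what `alpha` and `gamma` do, here is a minimal sketch of a single Q-learning update on one table entry, using the same constants as defaults. The function `q_update` and its parameter names are hypothetical helpers for this illustration; the notebook performs the same arithmetic inline in the training loop.

```python
def q_update(q_sa, reward, max_next_q, alpha=0.8, gamma=0.9):
    """One Q-learning update for a single (state, action) entry."""
    td_target = reward + gamma * max_next_q     # estimated return of this step
    return q_sa + alpha * (td_target - q_sa)    # move q_sa towards the target


# First visit of an ordinary step: the table is all zeros and the step costs -1.
print(q_update(q_sa=0.0, reward=-1, max_next_q=0.0))   # -0.8

# Successful drop-off from a fresh entry: reward 20.
print(q_update(q_sa=0.0, reward=20, max_next_q=0.0))   # 16.0
```

With `alpha = 0.8` each update moves the entry 80% of the way towards the new target, so early estimates change quickly.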

In [4]:
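def Qmax(state, info):
    '''Return the maximum Q value and its associated action for a given state.

    Only the actions allowed by info['action_mask'] are considered.
    The Q matrix is read from the global variable Q.
    '''
    possible_actions = np.where(info['action_mask'] == 1)[0]
    Qmax_value = max(Q[state, possible_actions])

    # If several actions share the maximum value, the last one is kept
    for action in possible_actions:
        if Q[state, action] == Qmax_value:
            chosen_action = action

    return Qmax_value, chosen_action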
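
The selection logic of `Qmax` can be tried out without the environment. The sketch below is a plain-Python stand-in that works on a single row of Q values and a mask, resolving ties in favour of the highest action index as `Qmax` does. The name `masked_greedy` and the sample values are made up for this illustration.

```python
def masked_greedy(q_row, action_mask):
    """Return (best_value, best_action) among actions whose mask entry is 1.

    Ties are resolved in favour of the highest action index.
    """
    allowed = [a for a, ok in enumerate(action_mask) if ok == 1]
    best_value = max(q_row[a] for a in allowed)
    best_action = [a for a in allowed if q_row[a] == best_value][-1]
    return best_value, best_action


q_row = [0.0, -0.8, 0.0, 5.0, 0.0, 0.0]   # one row of a Q table (6 actions)
mask = [1, 1, 1, 0, 0, 0]                 # only actions 0, 1 and 2 are allowed

print(masked_greedy(q_row, mask))  # (0.0, 2)
```

Action 3 has the largest value in the row but is masked out, so it is never selected.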

### Training the Q-learning algorithm

In [5]:
nbr_states = env.observation_space.n
nbr_actions = env.action_space.n

Q = np.zeros((nbr_states, nbr_actions))

start_time = time.time()  # to measure the training time
for epoch in range(epochs):
    init_state, info = env.reset()
    state = init_state
    
    total_reward = 0
    reward = 0

    while reward != final_reward:

        action = Qmax(state, info)[1]
        next_state, reward, _, _, info = env.step(action)
        Qmax_value = Qmax(next_state, info)[0]

        Q[state, action] = Q[state, action] + alpha * (reward + gamma * Qmax_value - Q[state, action])

        state = next_state
        total_reward += reward
    
    print(f'Total rewared after epoch {epoch + 1}: {total_reward}')

end_time = time.time()
train_time = end_time - start_time
print(f'The training time is {train_time} seconds')

env.close()

Total rewared after epoch 1: -554
Total rewared after epoch 2: -386
Total rewared after epoch 3: -492
Total rewared after epoch 4: -242
Total rewared after epoch 5: -140
Total rewared after epoch 6: -51
Total rewared after epoch 7: -85
Total rewared after epoch 8: -34
Total rewared after epoch 9: -218
Total rewared after epoch 10: -225
Total rewared after epoch 11: -294
Total rewared after epoch 12: -59
Total rewared after epoch 13: -10
Total rewared after epoch 14: -51
Total rewared after epoch 15: -131
Total rewared after epoch 16: -114
Total rewared after epoch 17: -87
Total rewared after epoch 18: -55
Total rewared after epoch 19: -146
Total rewared after epoch 20: -159
Total rewared after epoch 21: -383
Total rewared after epoch 22: -157
Total rewared after epoch 23: -72
Total rewared after epoch 24: -183
Total rewared after epoch 25: 4
Total rewared after epoch 26: -198
Total rewared after epoch 27: -93
Total rewared after epoch 28: 3
Total rewared after epoch 29: -40
...

Total rewared after epoch 245: 4
Total rewared after epoch 246: 7
Total rewared after epoch 247: 3
Total rewared after epoch 248: 6
Total rewared after epoch 249: 6
Total rewared after epoch 250: 9
Total rewared after epoch 251: 8
Total rewared after epoch 252: 7
Total rewared after epoch 253: 9
Total rewared after epoch 254: 3
Total rewared after epoch 255: 10
Total rewared after epoch 256: 7
Total rewared after epoch 257: 6
Total rewared after epoch 258: 9
Total rewared after epoch 259: 11
Total rewared after epoch 260: 8
Total rewared after epoch 261: -2
Total rewared after epoch 262: 7
Total rewared after epoch 263: 10
Total rewared after epoch 264: 7
Total rewared after epoch 265: 13
Total rewared after epoch 266: 12
Total rewared after epoch 267: 7
Total rewared after epoch 268: 10
Total rewared after epoch 269: 12
Total rewared after epoch 270: 14
Total rewared after epoch 271: 5
Total rewared after epoch 272: 6
Total rewared after epoch 273: 3
Total rewared after epoch 274: -8


Total rewared after epoch 492: 10
Total rewared after epoch 493: 10
Total rewared after epoch 494: 11
Total rewared after epoch 495: 8
Total rewared after epoch 496: 11
Total rewared after epoch 497: 5
Total rewared after epoch 498: 7
Total rewared after epoch 499: 6
Total rewared after epoch 500: 13
Total rewared after epoch 501: 3
Total rewared after epoch 502: 6
Total rewared after epoch 503: 8
Total rewared after epoch 504: 7
Total rewared after epoch 505: 9
Total rewared after epoch 506: 4
Total rewared after epoch 507: 3
Total rewared after epoch 508: 8
Total rewared after epoch 509: 7
Total rewared after epoch 510: 5
Total rewared after epoch 511: 5
Total rewared after epoch 512: 7
Total rewared after epoch 513: 8
Total rewared after epoch 514: 8
Total rewared after epoch 515: 5
Total rewared after epoch 516: 10
Total rewared after epoch 517: 8
Total rewared after epoch 518: 9
Total rewared after epoch 519: 9
Total rewared after epoch 520: 7
Total rewared after epoch 521: 9
...

Total rewared after epoch 739: 8
Total rewared after epoch 740: 12
Total rewared after epoch 741: 12
Total rewared after epoch 742: 13
Total rewared after epoch 743: 9
Total rewared after epoch 744: 12
Total rewared after epoch 745: 10
Total rewared after epoch 746: 6
Total rewared after epoch 747: 8
Total rewared after epoch 748: 10
Total rewared after epoch 749: 8
Total rewared after epoch 750: 6
Total rewared after epoch 751: 8
Total rewared after epoch 752: 9
Total rewared after epoch 753: 5
Total rewared after epoch 754: 5
Total rewared after epoch 755: 5
Total rewared after epoch 756: 13
Total rewared after epoch 757: 9
Total rewared after epoch 758: 4
Total rewared after epoch 759: 6
Total rewared after epoch 760: 7
Total rewared after epoch 761: 6
Total rewared after epoch 762: 5
Total rewared after epoch 763: 5
Total rewared after epoch 764: 8
Total rewared after epoch 765: 12
Total rewared after epoch 766: 8
Total rewared after epoch 767: 4
Total rewared after epoch 768: 9
...

Total rewared after epoch 986: 6
Total rewared after epoch 987: 8
Total rewared after epoch 988: 5
Total rewared after epoch 989: 8
Total rewared after epoch 990: 5
Total rewared after epoch 991: 6
Total rewared after epoch 992: 5
Total rewared after epoch 993: 3
Total rewared after epoch 994: 10
Total rewared after epoch 995: 10
Total rewared after epoch 996: 5
Total rewared after epoch 997: 5
Total rewared after epoch 998: 8
Total rewared after epoch 999: 6
Total rewared after epoch 1000: 10
The training time is 6113.090679645538 seconds
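
One way to read the log above is to compare the average return at the start and at the end of training. The sketch below does this for the first and last 15 epochs, with the values copied by hand from the printed output; the variable names are only for this example.

```python
from statistics import mean

# Values copied from the training output above.
first_15 = [-554, -386, -492, -242, -140, -51, -85, -34,
            -218, -225, -294, -59, -10, -51, -131]            # epochs 1-15
last_15 = [6, 8, 5, 8, 5, 6, 5, 3, 10, 10, 5, 5, 8, 6, 10]    # epochs 986-1000

print(round(mean(first_15), 1))  # -198.1
print(round(mean(last_15), 1))   # 6.7
```

The average return rises from about -198 over the first 15 epochs to about 6.7 over the last 15.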


### Test the algorithm

In [13]:
# For the test, the Q matrix was saved beforehand in a pickle file,
# so the algorithm does not need to be retrained each time.
pygame.init()

env = gym.make('Taxi-v3', render_mode='human')

init_state, info = env.reset()
state = init_state

# Load the Q matrix from the pickle file
with open('Q_matrix', 'rb') as file:
    Q = pickle.load(file)

reward = 0
total_reward = 0

while reward != final_reward:
    action = Qmax(state, info)[1]
    next_state, reward, _, _, info = env.step(action)
    state = next_state
    total_reward += reward

print(f'Total rewared is: {total_reward}')

env.close()

Total rewared is: 9
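
The test cell loads the Q matrix from a file named `Q_matrix`, but the saving step is not shown in this notebook. A minimal sketch of how such a file could be written and read back with `pickle` is given below. The small nested list `Q_demo` is a stand-in for the trained table (an assumption for this example; the notebook's `Q` is a NumPy array, which `pickle` can serialize in the same way), and a temporary directory is used so that no existing file is overwritten.

```python
import os
import pickle
import tempfile

# Stand-in for the trained Q table (hypothetical values).
Q_demo = [[0.0, -0.8], [16.0, 0.0]]

# A temporary directory keeps the example from overwriting a real Q_matrix file.
path = os.path.join(tempfile.mkdtemp(), 'Q_matrix')

with open(path, 'wb') as file:   # save after training
    pickle.dump(Q_demo, file)

with open(path, 'rb') as file:   # load before testing
    Q_loaded = pickle.load(file)

print(Q_loaded == Q_demo)  # True
```

Opening the file in a `with` block closes it automatically, even if loading fails.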
