### Goal:
###### We use the Taxi-v3 environment from Gymnasium, a standard API for reinforcement learning that ships with a diverse collection of reference environments.
###### We will use the Q-Learning algorithm to move the taxi to the passenger’s location, pick up the passenger, move to the passenger’s desired destination, and drop off the passenger. Once the passenger is dropped off, the episode ends.

![](taxi-v3_gymnasium.gif)

### Description
There are four designated pick-up and drop-off locations (Red, Green, Yellow and Blue) in the 5x5 grid world. The taxi starts off at a random square and the passenger at one of the designated locations.

The goal is to move the taxi to the passenger’s location, pick up the passenger, move to the passenger’s desired destination, and drop off the passenger. Once the passenger is dropped off, the episode ends.

The player receives a positive reward for successfully dropping off the passenger at the correct location, and negative rewards for incorrect pick-up or drop-off attempts and for each step on which no other reward is received.

For more details please visit the link: https://gymnasium.farama.org/environments/toy_text/taxi/
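
As a rough illustration of how this grid world maps to the integer states used later as row indices of the Q matrix, the sketch below follows the encoding described in the Gymnasium documentation: 25 taxi positions, 5 passenger locations (the four designated squares, or inside the taxi) and 4 destinations, giving 500 states. The helper names `encode_state` and `decode_state` are hypothetical and are not used elsewhere in this notebook.

```python
# Hypothetical helpers illustrating the Taxi-v3 state encoding
# (formula taken from the Gymnasium Taxi documentation).
# passenger_location: 0=Red, 1=Green, 2=Yellow, 3=Blue, 4=in taxi
# destination:        0=Red, 1=Green, 2=Yellow, 3=Blue

def encode_state(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four components into a single integer in [0, 499]."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination


def decode_state(state):
    """Inverse of encode_state: unpack an integer state into its components."""
    state, destination = divmod(state, 4)
    state, passenger_location = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_location, destination


# Taxi at row 3, column 1, passenger waiting at Yellow, destination Red
state = encode_state(3, 1, 2, 0)
print(state, decode_state(state))
```

Each integer state therefore identifies one complete situation, which is why a plain table indexed by state and action is enough for Q-learning here.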


### Q-learning equation:
![](Q_learning_equation.png)
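
In text form, the update applied in the training loop of this notebook is the standard Q-learning rule:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

In this notebook the maximum is taken only over the actions allowed by the environment's action mask. The sketch below applies the rule to single numbers to show how one step changes a Q value. The function name `q_update` is hypothetical, and the default `alpha` and `gamma` are the values set in the constants cell.

```python
def q_update(q_sa, reward, max_next_q, alpha=0.8, gamma=0.9):
    """Return the updated Q(s, a) after observing one transition."""
    td_target = reward + gamma * max_next_q   # best estimate of the return
    td_error = td_target - q_sa               # how far the old value was off
    return q_sa + alpha * td_error


# First visit to a state-action pair: Q starts at 0 and the step costs -1
print(q_update(q_sa=0.0, reward=-1, max_next_q=0.0))

# Successful drop-off: reward +20, and the terminal state's Q values stay at 0
print(q_update(q_sa=0.0, reward=20, max_next_q=0.0))
```

With a learning rate of 0.8, each update moves the old value most of the way toward the new target, so recent experience dominates the estimate.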


### Import libraries

In [6]:
import numpy as np
import pygame
import gymnasium as gym
import time
import pickle

In [3]:
pygame.init()

# Work with the Taxi-v3 environment
env = gym.make('Taxi-v3', render_mode='human')

# Constants
alpha = .8
gamma = .9
epochs = 1000
final_reward = 20

In [4]:
def Qmax(state, info):
    ''' Return the maximum Q value, and its associated action,
    for a given state (only actions allowed by the action mask are considered)'''
    
    possible_actions = np.where(info['action_mask'] == 1)[0]
    Qmax_value = max(Q[state, possible_actions])
    
    for action in possible_actions:
        if Q[state, action] == Qmax_value:
            chosen_action = action
            
    return Qmax_value, chosen_action
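
For comparison, the same masked selection can be written without the explicit loop. This is only a sketch: the function name `masked_greedy` is hypothetical, it takes a single row of the Q matrix rather than reading the global `Q`, and it breaks ties by taking the first best action, whereas `Qmax` above keeps the last one.

```python
import numpy as np


def masked_greedy(q_row, action_mask):
    """Return (best Q value, best action) among the actions allowed by the mask."""
    allowed = np.flatnonzero(action_mask == 1)        # indices of permitted actions
    best = allowed[np.argmax(q_row[allowed])]         # first allowed action with max Q
    return float(q_row[best]), int(best)


# Toy example with 6 actions; actions 2, 4 and 5 are masked out
q_row = np.array([0.5, 2.0, -1.0, 2.0, 3.0, 0.0])
mask = np.array([1, 1, 0, 1, 0, 0], dtype=np.int8)
print(masked_greedy(q_row, mask))
```

Action 4 has the largest Q value in the toy row but is masked out, so the choice falls on an allowed action instead.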

### Training the Q-learning algorithm

In [2]:
nbr_states = env.observation_space.n
nbr_actions = env.action_space.n

Q = np.zeros((nbr_states, nbr_actions))

start_time = time.time()  # record the start time to measure the training duration
for epoch in range(epochs):
    init_state, info = env.reset()
    state = init_state
    
    total_reward = 0
    reward = 0

    # The +20 reward is given only for a successful drop-off, which ends the episode
    while reward != final_reward:

        action = Qmax(state, info)[1]
        next_state, reward, _, _, info = env.step(action)
        Qmax_value = Qmax(next_state, info)[0]

        Q[state, action] = Q[state, action] + alpha * (reward + gamma * Qmax_value - Q[state, action])

        state = next_state
        total_reward += reward
    
    print(f'Total reward after epoch {epoch + 1}: {total_reward}')

end_time = time.time()
train_time = end_time - start_time
print(f'The training time is {train_time} seconds')

env.close()
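
The test section loads the Q matrix from a file named `Q_matrix`, but the cell that writes this file is not shown in the notebook. A minimal sketch of how it could be saved after training is given below; the helper names `save_q_matrix` and `load_q_matrix` are hypothetical, and only the file name comes from the test cell.

```python
import pickle

import numpy as np


def save_q_matrix(Q, path='Q_matrix'):
    """Write the Q matrix to disk with pickle."""
    with open(path, 'wb') as file:
        pickle.dump(Q, file)


def load_q_matrix(path='Q_matrix'):
    """Read a Q matrix previously written by save_q_matrix."""
    with open(path, 'rb') as file:
        return pickle.load(file)


# Usage after training, where Q is the trained (nbr_states, nbr_actions) array:
# save_q_matrix(Q)
```

Since `Q` is a plain NumPy array, `np.save` and `np.load` would work equally well; pickle is used here only to match the loading code in the test cell.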




### Test the algorithm

In [10]:
# For the test, we load the Q-value matrix that was previously saved in a pickle file,
# so the algorithm does not need to be retrained each time.
pygame.init()

env = gym.make('Taxi-v3', render_mode='human')

init_state, info = env.reset()
state = init_state

# Load the Q matrix from the pickle file
with open('Q_matrix', 'rb') as file:
    Q = pickle.load(file)

reward = 0
total_reward = 0

while reward != final_reward:
    action = Qmax(state, info)[1]
    next_state, reward, _, _, info = env.step(action)
    state = next_state
    total_reward += reward

print(f'Total reward is: {total_reward}')

env.close()

Total reward is: 5
