# Taxi-v3

Gym provides different game environments which we can plug into our code and train an agent with. The library takes care of the API for providing all the information that our agent requires, like possible actions, score, and current state. We just need to focus on the algorithm part of our agent.

We'll be using the Gym environment called Taxi-v3, from which all of the details explained in the previous notebook were pulled. The objectives, rewards, and actions are all the same.

## 1. Install Gym(nasium)

We need to install Gymnasium first (Gym from OpenAI is no longer supported). Docs: https://gymnasium.farama.org/  

Executing the following should work.

In [1]:
pip install gymnasium

Note: you may need to restart the kernel to use updated packages.
Collecting gymnasium
  Downloading gymnasium-0.29.1-py3-none-any.whl.metadata (10 kB)
Collecting cloudpickle>=1.2.0 (from gymnasium)
  Downloading cloudpickle-3.0.0-py3-none-any.whl.metadata (7.0 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium)
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Downloading gymnasium-0.29.1-py3-none-any.whl (953 kB)
Downloading cloudpickle-3.0.0-py3-none-any.whl (20 kB)
Installing collected packages: farama-notifications, cloudpickle, gymnasium
Successfully installed cloudpickle

In [2]:
pip install "gymnasium[all]"

^C
Note: you may need to restart the kernel to use updated packages.


In [None]:
pip install "gymnasium[toy-text]"

## 2. Gym's interface

Once installed, we can load the game environment and render what it looks like. When we're all done, we can close the rendering and the environment with `env.close()`. This releases the environment's resources, so only call it when you want to start from scratch again.
We can also print the action and state space.

In [None]:
import gymnasium as gym
import time

For now, let's just render the environment for 10 seconds and then close the window again.

In [None]:
env = gym.make("Taxi-v3", render_mode="human") # load the game environment
state, info = env.reset() # reset to make sure we start with a randomly chosen start state

print("Action Space: {}".format(env.action_space)) # 6 different actions we can take - discrete actions, no continuous actions (no 'range') 
print("State Space: {}\n".format(env.observation_space)) # 500 different states

print("Current state: %d" % env.unwrapped.s)

env.render() # visualize the environment - we only need to call render once
time.sleep(10)
env.close()


Reminder of our problem:

- The filled square represents the taxi, which is yellow without a passenger and green with a passenger.
- The pipe ("|") represents a wall which the taxi cannot cross.
- R, G, Y, B are the possible pickup and dropoff locations. The blue letter represents the current passenger pick-up location, and the purple letter is the current destination. In the illustration, these colors are reversed.

- As verified by the prints, we have an Action Space of size 6 (south, north, east, west, pickup and dropoff) and a State Space of size 500 (25 taxi positions × 5 passenger locations, including "in the taxi", × 4 destination locations).
- The current state is randomly chosen (a state between 0 and 499).

## 3. Back to our illustration

We can actually take our illustration, encode its state, and give it to the environment to render in Gym.

<img src="./resources/taxi.png" style="height: 350px"/>

Recall that we have the taxi at row 3, column 1, our passenger is at location 2 (=Y) (out of the 4 possible locations), and our destination is location 0 (=R). Using the Taxi-v3 state encoding method, we can do the following:

In [None]:
env = gym.make("Taxi-v3", render_mode="human")
env.reset()

# now, let's manually encode the state
state = env.unwrapped.encode(3, 1, 2, 0) # (taxi row, taxi column, passenger location, destination location)
print("State:", state)

env.unwrapped.s = state

env.render() # visualize the environment - we only need to call render once
time.sleep(10)
env.close()
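
To see where the state number comes from, here is a minimal sketch of the arithmetic, assuming the usual Taxi-v3 layout of 5 rows, 5 columns, 5 passenger locations (R, G, Y, B, in the taxi) and 4 destinations. The helper `encode_taxi_state` is a hypothetical name used for illustration; it is not part of Gymnasium, where `env.unwrapped.encode` does this for us.

```python
def encode_taxi_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into a single index between 0 and 499."""
    state = taxi_row                     # 5 possible rows
    state = state * 5 + taxi_col         # 5 possible columns
    state = state * 5 + passenger_loc    # 4 locations + "in the taxi"
    state = state * 4 + destination      # 4 possible destinations
    return state


# The illustration: taxi at row 3, column 1, passenger at Y (2), destination R (0)
print(encode_taxi_state(3, 1, 2, 0))
```

For the illustration this works out to ((3 × 5 + 1) × 5 + 2) × 4 + 0 = 328, which should match the value printed by the cell above.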


## 4. Initial State - Exercise

Generate the state where the taxi is in the lower right corner. The passenger is in the taxi and the destination is B. What is the color of the taxi right now?

## 5. Step-method

The agent performs an action by using the step-method:

```python
state, reward, done, truncated, info = env.step(action)
```

Remember: There are 6 actions (0=south, 1=north, 2=east, 3=west, 4=pickup and 5=dropoff).

The step-method returns:

- state: the new state
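- reward: the reward received for the action, which tells us whether it was beneficial or not
- done: indicates whether we have successfully dropped off the passenger at the destination, which ends the episode
- truncated: indicates whether the episode was cut off early, for example because the time limit was reached or the agent went out of bounds
- info: additional information for debugging purposes (in Taxi-v3, the transition probability and an action mask)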

Let's go back to the illustration's state and try the different actions.

<img src="./resources/taxi.png" style="height: 250px"/>

First, let's make sure we start in the same position as in the image, and then take a step west (to the left). We'll hit a wall, so we should remain in the same state, and we should still receive the usual -1 penalty because the step did not bring us to the destination.

In [None]:
env = gym.make("Taxi-v3", render_mode="human") # load the game environment again
state, info = env.reset() # reset to start from a clean environment

state = env.unwrapped.encode(3, 1, 2, 0) # (taxi row, taxi column, passenger index, destination index)
print("Initial state:", state)
env.unwrapped.s = state

env.render() # visualize the environment - we only need to call render once
time.sleep(10)
state, reward, done, truncated, info = env.step(3) # go west
time.sleep(10)
env.close()

print("New state: %d, reward: %d" % (state, reward)) # -1 penalty for the time step; the wall keeps the taxi in place, so the state is unchanged

Now, let's do it again, but go north instead of west. So we won't hit a wall, but still, we haven't reached our destination.

In [None]:
env = gym.make("Taxi-v3", render_mode="human") # load the game environment again
state, info = env.reset() # reset to start from a clean environment

state = env.unwrapped.encode(3, 1, 2, 0) # (taxi row, taxi column, passenger index, destination index)
print("Initial state:", state)
env.unwrapped.s = state

env.render() # visualize the environment - we only need to call render once
time.sleep(10)
state, reward, done, truncated, info = env.step(1) # go north
time.sleep(10)
env.close()

print("New state: %d, reward: %d" % (state, reward)) # -1 penalty for the time step; the taxi moves up one row

## 5. Step-method - Exercise

Generate the state again where the taxi is in the lower right corner. The passenger is in the taxi and the destination is B.

Next, carry out the two steps: first drive the taxi to the destination location, and next drop off the passenger. 
Render the environment at each step, print the new state number, the reward and done.

## 6. The Reward Table

When the Taxi environment is created, an initial reward table called *P* is created as well. **We can think of it as a matrix that has the number of states as rows and the number of actions as columns, i.e. a states × actions matrix**.

Since every state is in this matrix, we can see the default reward values assigned for state 479.

In [None]:
env.unwrapped.s = 479
env.unwrapped.P[479]

This dictionary has the structure `{action: [(probability, nextstate, reward, done)]}`.

A few things to note:

- The keys 0-5 correspond to the actions (south, north, east, west, pickup, dropoff) the taxi can perform in our current state.
- In this environment, the transition probability is always 1.0, so every action has exactly one outcome.
- The nextstate is the state we would be in if we took the action at this index of the dictionary.
- In this particular state, all the movement actions have a -1 reward, the pickup action has a -10 reward, and the dropoff action has a +20 reward.
- done tells us when we have successfully dropped off a passenger at the right location (action 5 in this state). Each successful dropoff is the end of an episode.
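
To understand why dropoff is the rewarded action in state 479, we can decode the state number back into its four components. This is a minimal sketch that simply reverses the packing arithmetic, assuming 5 rows, 5 columns, 5 passenger locations and 4 destinations; `decode_taxi_state` is a hypothetical helper name, not a Gymnasium function.

```python
def decode_taxi_state(state):
    """Split a state index back into (taxi_row, taxi_col, passenger_loc, destination)."""
    state, destination = divmod(state, 4)     # last factor: 4 destinations
    state, passenger_loc = divmod(state, 5)   # then 5 passenger locations
    taxi_row, taxi_col = divmod(state, 5)     # what remains is row * 5 + column
    return taxi_row, taxi_col, passenger_loc, destination


print(decode_taxi_state(479))
```

State 479 decodes to taxi row 4, column 3, passenger location 4 (in the taxi) and destination 3 (B). The taxi is standing on B with the passenger on board, so dropping off is the correct move there.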

## 7. Solving the environment without Reinforcement Learning

Let's see what will happen if we try to brute-force our way to solving the problem without RL.

We'll create a loop that runs until the episode ends: either one passenger reaches the destination (the received reward is 20 and `done` is true), or `truncated` becomes true because Taxi-v3's limit of 200 steps per episode is reached. The `env.action_space.sample()` method selects one random action from the set of all possible actions.

Let's see what happens.

In [None]:
env = gym.make("Taxi-v3", render_mode="human") # load the game environment again
state, info = env.reset() # reset to start from a clean environment

env.unwrapped.s = 328  # set environment to illustration's state

epochs = 0
penalties, reward = 0, 0

finished = False

while not finished:
    action = env.action_space.sample() 
    state, reward, done, truncated, info = env.step(action)

    if reward == -10:
        penalties += 1

    epochs += 1
    finished = done or truncated
env.close()
    
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))

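Not good. A random agent typically needs far more timesteps than the 200-step limit allows to deliver just one passenger, so most runs are cut off (`truncated`) before the passenger arrives, with lots of illegal pickups and dropoffs along the way.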

This is because we aren't learning from past experience. We can run this over and over, and it will never optimize. **The agent has no memory of which action was best for each state**, and providing that memory is exactly what Reinforcement Learning will do for us.

## 8. Q-learning

We are going to use a *simple* RL algorithm called Q-learning, which will give our agent some memory. Essentially, Q-learning lets the agent use the environment's rewards to learn, over time, the best action to take in a given state.

To store what it has learned, the Q-learning algorithm uses a Q-table. The Q-table is a matrix with a row for every state (500) and a column for every action (6). The matrix is first initialized to 0, and then its values are updated during training.
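
How a single value gets updated can be sketched as follows. This is a minimal illustration of the standard Q-learning rule, new value = (1 - alpha) × old value + alpha × (reward + gamma × best value of the next state), using the learning rate `alpha = 0.1` and discount factor `gamma = 0.6` from the training cell later in this notebook. The function name `q_update` and the numbers fed into it are assumptions chosen for illustration.

```python
def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.6):
    """Return the updated Q-value for one (state, action) entry."""
    target = reward + gamma * next_max          # reward now + discounted best future value
    return (1 - alpha) * old_value + alpha * target


# A fresh entry (0.0), a plain move (-1 reward), and a next state that is still all zeros
first = q_update(0.0, -1, 0.0)
print(round(first, 4))

# The same entry once the next state has learned a best value of 20
second = q_update(first, -1, 20.0)
print(round(second, 4))
```

The entry first drops slightly below zero because every move costs -1, and it only starts to rise once good values from later states flow back through `next_max`.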

<img src="./resources/qtable.png" style="height: 650px"/>

The optimal action at every state is the action with the highest Q-value. So for state 328 the highest value is -1.971 (=North). For state 499 the highest value is 29 (=West). These actions indeed seem to be the best options.
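
Picking the best action from one row of the Q-table can be sketched like this. The row below is made up for illustration: only the value -1.971 for North comes from the table above, and the other five numbers are invented placeholders. The helper `best_action` is a hypothetical name; it picks the first index with the highest value, which is also how `np.argmax` breaks ties.

```python
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def best_action(q_row):
    """Return the index of the highest Q-value (the first one if there is a tie)."""
    return max(range(len(q_row)), key=lambda action: q_row[action])


# Illustrative row for state 328: only -1.971 (north) is taken from the Q-table image
example_row = [-2.3, -1.971, -2.4, -2.2, -10.5, -10.9]

action = best_action(example_row)
print(action, ACTION_NAMES[action])
```

Note that on an untrained row of all zeros every action ties, so this rule would simply return action 0 (south).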

In [None]:
env = gym.make("Taxi-v3", render_mode="human") # load the game environment again
state, info = env.reset() # reset to start from a clean environment

env.unwrapped.s = 328
env.render()
print("North is best?\n")
time.sleep(5)

env.unwrapped.s = 499
env.render()
print("West is best?")
time.sleep(5)

env.close()



## 9. Training the Agent


We can now create the training algorithm that will update this Q-table as the agent explores the environment over 100 000 episodes (of course, you don't need to write this algorithm yourself).

Note: when we create the environment this time, we will not render it, because rendering slows us down way too much, especially in the next cells, where we'll be training a Q-table. After this training, we'll use the 'human' rendering again.


Next, we'll initialize the Q-table to a 500×6 matrix of zeros.

In [None]:
env = gym.make("Taxi-v3", render_mode="rgb_array")
state, info = env.reset()

In [None]:
import numpy as np
q_table = np.zeros([env.observation_space.n, env.action_space.n])

In [None]:
%%time
"""Training the agent"""

import random
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state, info = env.reset()
    epochs, penalties, reward = 0, 0, 0
    finished = False
    
    while not finished:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values
        next_state, reward, done, truncated, info = env.step(action) 
        finished = done or truncated

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1
        
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")
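
The explore/exploit choice at the top of the training loop is often called epsilon-greedy action selection. Here is a minimal standalone sketch of it, using only the standard library so it runs without an environment. The helper name `epsilon_greedy` and the example row are assumptions for illustration; `rng.random() < epsilon` plays the same role as `random.uniform(0, 1) < epsilon` in the cell above.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the best known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                       # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])      # exploit


rng = random.Random(0)                     # seeded so the run is repeatable
q_row = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]     # made-up row where action 1 is best
picks = [epsilon_greedy(q_row, 0.1, rng) for _ in range(10_000)]
print(picks.count(1) / len(picks))         # fraction of times the best action was chosen
```

With `epsilon = 0.1` we would expect the best action to be chosen a little over 90% of the time: 90% from exploiting, plus the share of random picks that happen to land on it.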

Now that the Q-table has been established over 100 000 episodes, let's see what the Q-values are for our two states. Are North and West indeed the two most preferable moves for the two states?

In [None]:
print(q_table[328])
print(q_table[499])

## 10. Evaluating the agent

Let's evaluate the performance of our agent. We don't need to explore actions any further, so the next action is now always selected using the best Q-value. Training is done, so there is no more balancing between exploration and exploitation: we only exploit the knowledge gathered so far in the Q-table.

In [None]:
"""Evaluate agent's performance after Q-learning"""

total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    state, info = env.reset()
    epochs, penalties, reward = 0, 0, 0
    
    finished = False
    
    while not finished:
        action = np.argmax(q_table[state]) # take the action with the highest q-value
        state, reward, done, truncated, info = env.step(action)
        finished = done or truncated
        if reward == -10:
            penalties += 1

        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

We can see from the evaluation that the agent's performance improved significantly and it incurred no penalties, which means it performed the correct pickup/dropoff actions with 100 different passengers.

## 11. Solving the environment with Reinforcement Learning - Exercise

Now that our agent is trained, we can use our human rendering again and use our trained Q-table to solve the taxi problem with 328 as the initial state (by analogy with section **7. Solving the environment without Reinforcement Learning** of this notebook). Print the timesteps taken and penalties incurred. Pretty impressive, no?

## 12. Solving the environment with RL - Exercise

Solve the problem for these two initial states.

<img src="./resources/taxi1.png" style="height: 300px"/>
<img src="./resources/taxi2.png" style="height: 300px"/>

In [None]:
# solve the problem with the first/second initial state
# encode each state with state = env.unwrapped.encode(row, column, passenger, destination),
# then exploit the knowledge from the Q-table as before: pick np.argmax(q_table[state])

