# Reinforcement Q-Learning with OpenAI Gym
This short script is meant to help you get acquainted with implementing reinforcement Q-learning using OpenAI Gym.

### Problem Description: Self-Driving Taxi

#### State Space: There are four locations, labeled R, G, Y, and B on the grid, where the taxi can pick up and drop off a passenger
* The BLUE letter indicates the pickup location
* The MAGENTA letter indicates the drop off location
* The solid lines represent walls that the taxi cannot cross
* The filled rectangle represents the taxi itself (yellow when empty and green when carrying a passenger)

#### Action space: There are six possible actions
* Go south
* Go north
* Go east
* Go west
* Pickup
* Dropoff

#### Reward / Penalty:
* +20 points for a successful dropoff
* -1 point for every timestep taken
* -10 points for each illegal pickup or dropoff action

### Code Adapted From / Sources: 
* https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/
* https://github.com/felipelodur/Q-Learning-Taxi-v2/blob/master/Q-Learning-Taxi.ipynb

## Step 1: Setup

#### Make sure to have gym installed:
* conda install -c conda-forge gym <b>OR</b>
* pip install gym

In [1]:
import gym
import random

random.seed(42)

## Step 2: Initialize the Environment

In [2]:
taxi_env = gym.make("Taxi-v3").env
taxi_env.render()

+---------+
|R: | : :[35mG[0m|
| : | : : |
| : : : : |
| | : | :[43m [0m|
|Y| : |[34;1mB[0m: |
+---------+



## Step 3: Configure the Initial State

In [3]:
# (taxi row, taxi column, passenger index, destination index)
initial_state = taxi_env.encode(2, 3, 2, 0)

taxi_env.s = initial_state
taxi_env.render()

+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : :[43m [0m: |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
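
The state index returned by `taxi_env.encode` packs the four components (taxi row, taxi column, passenger index, destination index) into a single number. The following is a minimal sketch of that arithmetic in plain Python, assuming the 5 x 5 grid, 5 passenger locations, and 4 destinations used by this environment. `encode_state` and `decode_state` are illustrative names, not Gym functions.

```python
def encode_state(taxi_row, taxi_col, passenger_idx, destination_idx):
    """Pack the four state components into one index in [0, 500)."""
    index = taxi_row
    index = index * 5 + taxi_col         # 5 columns
    index = index * 5 + passenger_idx    # 5 passenger locations (R, G, Y, B, in taxi)
    index = index * 4 + destination_idx  # 4 destinations (R, G, Y, B)
    return index


def decode_state(index):
    """Invert encode_state and return (row, col, passenger, destination)."""
    index, destination_idx = divmod(index, 4)
    index, passenger_idx = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_idx, destination_idx


print(encode_state(2, 3, 2, 0))  # 268
print(decode_state(328))         # (3, 1, 2, 0)
```

Because every combination maps to a distinct index, the same number can later be used directly as a row index into a table of 500 states.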



In [4]:
# Reward Table

# {action: [(probability, nextstate, reward, done)]} for the current state
# For the action, ["Go south", "Go north", "Go east", "Go west", "Pickup", "Dropoff"]
taxi_env.P[initial_state]

{0: [(1.0, 368, -1, False)],
 1: [(1.0, 168, -1, False)],
 2: [(1.0, 288, -1, False)],
 3: [(1.0, 248, -1, False)],
 4: [(1.0, 268, -10, False)],
 5: [(1.0, 268, -10, False)]}
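
Each entry in this table maps an action to a list of `(probability, next_state, reward, done)` tuples. As a rough illustration of how to read it, the sketch below copies the output above into a plain dictionary and ranks the actions by immediate reward alone. `ACTION_NAMES`, `transitions`, and `tied` are names chosen for this example only.

```python
ACTION_NAMES = ["Go south", "Go north", "Go east", "Go west", "Pickup", "Dropoff"]

# Copied from the taxi_env.P[initial_state] output above.
transitions = {
    0: [(1.0, 368, -1, False)],
    1: [(1.0, 168, -1, False)],
    2: [(1.0, 288, -1, False)],
    3: [(1.0, 248, -1, False)],
    4: [(1.0, 268, -10, False)],
    5: [(1.0, 268, -10, False)],
}

# Expected immediate reward per action: sum of probability * reward.
expected_reward = {
    action: sum(prob * reward for prob, _next_state, reward, _done in outcomes)
    for action, outcomes in transitions.items()
}

best = max(expected_reward.values())
tied = [ACTION_NAMES[a] for a, r in expected_reward.items() if r == best]
print(tied)  # ['Go south', 'Go north', 'Go east', 'Go west']
```

All four moves tie at -1, so the immediate reward alone cannot tell the taxi which way to go. That gap is what the learned Q-values are meant to fill.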

## Step 4: Using a Brute-Force Approach

In [5]:
taxi_env.s = 328        # set environment to illustration's state

epochs = 0
penalties, reward = 0, 0

bf_frames = []          # stored data to use to show animation

done = False

while not done:
    action = taxi_env.action_space.sample()
    state, reward, done, info = taxi_env.step(action)

    if reward == -10:
        penalties += 1
    
    # Put each rendered frame into dict for animation
    bf_frames.append({
        'frame': taxi_env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
    })
    
    epochs += 1
    
    
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))

Timesteps taken: 1455
Penalties incurred: 491
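
The two counters above are enough to reconstruct the total reward of this random run, assuming the reward scheme from the problem description (-1 per ordinary step, -10 per illegal pickup or dropoff, +20 for the final dropoff) and that the episode ended with one successful dropoff. `episode_return` is a hypothetical helper written for this illustration.

```python
def episode_return(timesteps, penalties, step_cost=-1, penalty=-10, dropoff_bonus=20):
    """Total reward of one finished episode that ended with a successful dropoff."""
    # Exclude the penalised steps and the final dropoff from the ordinary steps.
    ordinary_steps = timesteps - penalties - 1
    return ordinary_steps * step_cost + penalties * penalty + dropoff_bonus


print(episode_return(timesteps=1455, penalties=491))  # -5853
```

A strongly negative total like this gives a baseline to compare a trained agent against.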


In [6]:
# Show the animation (takes a while)

from IPython.display import clear_output
from time import sleep

def print_frames(bf_frames):
    for i, frame in enumerate(bf_frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)
        
print_frames(bf_frames)

+---------+
|[35m[34;1m[43mR[0m[0m[0m: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Timestep: 1455
State: 0
Action: 5
Reward: 20


## Step 5: Using Reinforcement Learning (Q-Learning Algorithm)

#### Q-Table: 
The Q-table is a matrix with one row for every state (5 x 5 x 5 x 4 = 500) and one column for every action (6). Each cell stores the Q-value of a state-action pair: an estimate of the immediate reward plus the discounted future reward for taking that action in that state. Once training is finished, we can use the Q-values to pick the best action in a given state. The state count comes from:
* Taxi location: 5 x 5 (grid environment)
* Passenger location: 5 (R, G, Y, B, inside the car)
* Drop off location: 4 (R, G, Y, B) 

#### Q-Learning Process:
* Initialize the Q-table with all zeros.
* Start exploring: in the current state (S), select one of the possible actions (a).
* Travel to the next state (S') as a result of that action (a).
* Among all possible actions from the state (S'), find the one with the highest Q-value.
* Update the Q-table entry for (S, a) using the rule $Q(S, a) \leftarrow (1 - \alpha)\,Q(S, a) + \alpha\,\bigl(r + \gamma \max_{a'} Q(S', a')\bigr)$, where $r$ is the reward, $\alpha$ the learning rate, and $\gamma$ the discount factor.
* Set the next state as the current state.
* If the goal state is reached, end the episode and repeat the process.
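
The update step in the list above can be sketched as a small function. This is the same expression the training loop in this notebook uses, written in plain Python with one worked number. `q_update` is an illustrative name, and the values 0.1 and 0.6 are the learning rate and discount factor configured later in the notebook.

```python
def q_update(old_value, reward, next_max, alpha, gamma):
    """Blend the old estimate with the new target: reward + gamma * max Q(S', a')."""
    target = reward + gamma * next_max
    return (1 - alpha) * old_value + alpha * target


# First update from an all-zero table after an ordinary move (reward -1).
new_value = q_update(old_value=0.0, reward=-1, next_max=0.0, alpha=0.1, gamma=0.6)
print(round(new_value, 3))  # -0.1
```

With a small learning rate, each update moves the stored value only a fraction of the way toward the new target, which is why many episodes are needed before the table settles.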

### Setup

In [7]:
import numpy as np

# Initialize the Q-table (500 x 6 matrix of zeros)
q_table = np.zeros([taxi_env.observation_space.n, taxi_env.action_space.n])
q_table

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       ...,
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [8]:
# Hyperparameter Configuration

alpha = 0.1    # Learning Rate
gamma = 0.6    # Discount Factor
epsilon = 0.1  # Exploration
epochs = 10000
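
The `epsilon` value controls how often the agent explores instead of exploiting what it has learned. The following is a minimal sketch of epsilon-greedy selection in plain Python, using a list of Q-values rather than a NumPy row. `choose_action` and the Q-values in `q_row` are made up for this illustration.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.uniform(0, 1) < epsilon:
        return rng.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit (first max on ties)


rng = random.Random(42)
q_row = [-2.4, -2.4, -2.4, -2.3, -7.3, -8.6]  # made-up Q-values for one state
picks = [choose_action(q_row, epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count(3) / len(picks))  # roughly 0.9 + 0.1 / 6; the exact value depends on the seed
```

With `epsilon = 0.1`, about one step in ten is random, so the agent keeps trying actions whose Q-values are still inaccurate.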

### Train the Agent

In [9]:
# Train the agent
# Note: cleaned up from the original code (removed unused variables)

for run in range(epochs):
    state = taxi_env.reset()
    penalties, reward = 0, 0   # reset per-episode counters (do not overwrite the `epochs` hyperparameter)
    done = False
    
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = taxi_env.action_space.sample()   # Explore action space
        else:
            action = np.argmax(q_table[state])        # Exploit learned values

        next_state, reward, done, info = taxi_env.step(action) 
        
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state

print("Training Finished\n")

Training Finished



### Evaluate the Agent

In [10]:
# Now that the agent has finished training, let's check it against the initial state we set up in Step 3
taxi_env.s = initial_state
taxi_env.render()

+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : :[43m [0m: |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (Dropoff)


In [11]:
# Action = ["Go south", "Go north", "Go east", "Go west", "Pickup", "Dropoff"]
# Based on the result, the taxi should go WEST since that action has the highest Q-value
q_table[initial_state]

array([-2.41021921, -2.40348414, -2.41508538, -2.3639511 , -7.30334373,
       -8.66553637])
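
To read the best action off this row without scanning it by eye, the index of the largest Q-value can be mapped back to an action name. The sketch below uses the values printed above in a plain Python list. `ACTION_NAMES` and `best_action` are names chosen for this example.

```python
ACTION_NAMES = ["Go south", "Go north", "Go east", "Go west", "Pickup", "Dropoff"]

# Q-values for initial_state, copied from the output above.
q_row = [-2.41021921, -2.40348414, -2.41508538, -2.3639511, -7.30334373, -8.66553637]

# Index of the largest value, the same index np.argmax would return.
best_action = max(range(len(q_row)), key=q_row.__getitem__)
print(best_action, ACTION_NAMES[best_action])  # 3 Go west
```

The two illegal actions (Pickup and Dropoff) have clearly lower values than the four moves, which reflects the -10 penalty they incur in this state.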

In [12]:
# Show the Q-table in action, starting each trip from the state returned by reset()

from IPython.display import clear_output
from time import sleep

for trip_num in range(1, 6):
    state = taxi_env.reset()
   
    done = False
    trip_length = 0
    
    while not done and trip_length < 25:
        action = np.argmax(q_table[state])
        next_state, reward, done, info = taxi_env.step(action)
        clear_output(wait=True)
        
        print("Trip Number " + str(trip_num) + " Length/Step " + str(trip_length))
        print(taxi_env.render(mode='ansi'))
        
        sleep(.5)
        state = next_state
        trip_length += 1
        
    sleep(2)
        

Trip Number 5 Length/Step 12
+---------+
|[35m[34;1m[43mR[0m[0m[0m: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

