# Self-Driving Taxi using Q-Learning

## Summary

The goal of this project is to train a self-driving taxi agent that efficiently picks up passengers and drops them off at their intended destinations.

## Implementation

### Imports

The first step is to import the necessary libraries.

In [1]:
import numpy as np
import gym
import random
import time
from IPython.display import clear_output

### Set Up the Environment

The following command creates the taxi problem from OpenAI's `gym` library.

In [2]:
env = gym.make("Taxi-v3")
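
The cells that follow call `env.reset()` and `env.step()` using the return values of older `gym` releases: a bare state from `reset` and a 4-tuple from `step`. From `gym` 0.26 onward, `reset` returns `(state, info)` and `step` returns a 5-tuple with separate `terminated` and `truncated` flags. A minimal sketch of two adapter helpers is shown here; the names `unpack_reset` and `unpack_step` are hypothetical and are not part of `gym`.

```python
def unpack_reset(result):
    """Return the initial state from either reset() convention."""
    # Newer gym returns (state, info); older gym returns the state alone.
    if isinstance(result, tuple):
        return result[0]
    return result


def unpack_step(result):
    """Return (state, reward, done, info) from either step() convention."""
    if len(result) == 5:
        # Newer gym: (state, reward, terminated, truncated, info)
        state, reward, terminated, truncated, info = result
        return state, reward, terminated or truncated, info
    # Older gym: (state, reward, done, info)
    return result
```

With these helpers, `state = unpack_reset(env.reset())` and `new_state, reward, done, info = unpack_step(env.step(action))` should behave the same under either convention.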

Next, the q-table is created, with a q value for each (state, action) pair.

In [3]:
num_states = env.observation_space.n
print("Number of states: " + str(num_states))
num_actions = env.action_space.n
print("Number of actions: " + str(num_actions) + "\n")

q_table = np.zeros((num_states, num_actions))
print("Q Table:")
print(q_table)

Number of states: 500
Number of actions: 6

Q Table:
[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]


This problem has 500 states:

 - The taxi can be in any of the 5x5 = 25 cells on the grid
 - The passenger can be in any of 5 locations (4 pick-up locations, or inside the taxi)
 - There are 4 possible destinations

Therefore, the total number of states is 5x5x5x4 = 500.

There are 6 actions: `up`, `down`, `left`, `right`, `pick-up`, and `drop-off`.

Therefore, the q-table is a 500x6 matrix, with one row per state and one column per action.
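
The state count can be read as a mixed-radix number: one digit each for taxi row, taxi column, passenger location, and destination. The sketch below packs and unpacks such an index. The function names are hypothetical, and the component ordering is an assumption chosen to mirror how the Taxi environment is commonly described, so treat it as an illustration rather than the library's own code.

```python
def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into one index in [0, 500)."""
    # Mixed-radix encoding with bases 5 (row), 5 (column), 5 (passenger), 4 (destination)
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination


def decode_state(index):
    """Recover (taxi_row, taxi_col, passenger_loc, destination) from an index."""
    index, destination = divmod(index, 4)
    index, passenger_loc = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_loc, destination


# The largest component values map to the last row of the q-table
print(encode_state(4, 4, 4, 3))
print(decode_state(0))
```

Each index in this scheme corresponds to one row of the q-table, which is why the table needs 500 rows.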

### Parameters

Before the agent can learn the optimal policy by filling the q-table, the constants for q-learning must be defined.

In [4]:
num_episodes = 10000
max_steps = 100
lr = 0.2
discount_rate = 0.9

exploration_rate = 1
max_exploration = 1
min_exploration = 0.001
exploration_decay = 0.001
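
To get a feel for how these constants interact, the exploration rate can be evaluated at a few episode indices using the same exponential decay formula that appears in the training loop. This is a standalone sketch; the helper name `exploration_rate_at` is hypothetical, and `math.exp` stands in for `np.exp` so the snippet has no dependencies.

```python
import math

max_exploration = 1
min_exploration = 0.001
exploration_decay = 0.001


def exploration_rate_at(episode):
    """Exploration rate after exponential decay at the given episode index."""
    return min_exploration + (max_exploration - min_exploration) * math.exp(-exploration_decay * episode)


# Print the rate at the start, early on, midway, and at the final episode
for episode in (0, 1000, 5000, 9999):
    print(episode, round(exploration_rate_at(episode), 4))
```

With these values the agent starts out acting almost entirely at random and acts almost entirely greedily by the final episodes.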

### Q-Learning Algorithm

In [5]:
rewards = []

for episode in range(num_episodes):
    state = env.reset()
    done = False
    current_rewards = 0
    
    for step in range(max_steps):
        exploration_threshold = random.uniform(0,1)
        if exploration_threshold > exploration_rate:
            action = np.argmax(q_table[state,:])
        else:
            action = env.action_space.sample()
        
        new_state, reward, done, info = env.step(action)
        
        q_table[state,action] = q_table[state,action]*(1-lr) + lr*(reward+discount_rate*np.max(q_table[new_state,:]))
        state = new_state
        current_rewards += reward
        
        if done:
            break

    # Decay the exploration rate once per episode (it depends only on the episode index)
    exploration_rate = min_exploration + (max_exploration-min_exploration)*np.exp(-exploration_decay*episode)

    rewards.append(current_rewards)
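
The update line in the loop above blends the old q value with a new target: the immediate reward plus the discounted best q value of the next state. A minimal sketch of that single update, isolated from the environment, is shown here. The function name `q_update` and the sample inputs are hypothetical; the defaults reuse `lr` and `discount_rate` from the parameters cell.

```python
def q_update(old_q, reward, best_next_q, lr=0.2, discount_rate=0.9):
    """Return the updated q value for one (state, action) pair."""
    # Target: immediate reward plus the discounted best value of the next state
    target = reward + discount_rate * best_next_q
    # Move the old estimate a fraction lr of the way toward the target
    return old_q * (1 - lr) + lr * target


# A move with reward -1 from an all-zero table
print(q_update(old_q=0.0, reward=-1, best_next_q=0.0))
# A later visit, where the best next-state value has grown to 5.0
print(q_update(old_q=-0.2, reward=-1, best_next_q=5.0))
```

Because `lr` is 0.2, each update moves the stored value only one fifth of the way toward the target, which smooths out the effect of any single transition.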

### Average Rewards per 1000 Episodes

In [6]:
rewards_per_thousand_episodes = np.split(np.array(rewards), num_episodes // 1000)
count = 1000

print("Average reward per thousand episodes:")
for r in rewards_per_thousand_episodes:
    print(str(count) + ": " + str(sum(r/1000)))
    count += 1000

print("\nQ Table:")
print(q_table)

Average reward per thousand episodes:
1000: -209.72699999999958
2000: -11.373000000000006
3000: 2.88399999999999
4000: 6.043999999999971
5000: 7.135999999999967
6000: 7.65799999999995
7000: 7.921999999999971
8000: 7.968999999999959
9000: 7.813999999999964
10000: 7.8569999999999585

Q Table:
[[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00]
 [-1.40853083e+00  7.44485617e-03 -2.94392375e+00 -7.39848307e-01
   1.62261467e+00 -9.92263229e+00]
 [ 3.54940584e+00  5.07057193e+00  1.94780856e+00  5.08330596e+00
   7.71470000e+00 -3.56039024e+00]
 ...
 [-1.35848138e+00  9.23946047e+00  4.72674493e-01 -1.47743415e+00
  -6.98055308e+00 -6.46949164e+00]
 [-2.91217654e+00  1.46104537e+00 -2.83019682e+00 -2.71952020e+00
  -9.03397835e+00 -9.57891856e+00]
 [ 2.40898039e+00  4.91693769e+00  1.53562143e+00  1.69936248e+01
  -1.40986968e-01  7.33676919e-01]]
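
The averaging cell relies on `np.split`, which raises an error when the number of episodes does not divide evenly into the requested sections. As an alternative sketch, the hypothetical helper below averages consecutive blocks in plain Python and tolerates a shorter final block. The toy list stands in for the `rewards` list and is not real training data.

```python
def block_averages(values, block_size=1000):
    """Average consecutive blocks of values; the last block may be shorter."""
    averages = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        averages.append(sum(block) / len(block))
    return averages


# Toy data standing in for the per-episode rewards list
toy_rewards = [-10, -8, -6, 2, 4, 6, 8]
print(block_averages(toy_rewards, block_size=3))
```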


### Watch the Agent Play Taxi Driver

In [7]:
episodes = int(input("How many episodes would you like to see? "))
time.sleep(1)
rewards = []
#avg_rewards = 0

for episode in range(episodes):
    
    clear_output(wait=True)
    state = env.reset()
    env.render()
    done = False
    episode_reward = 0

    while not done:
    
        action = np.argmax(q_table[state,:])
        next_state, reward, done, info = env.step(action)
        state = next_state
        episode_reward += reward
    
        
        time.sleep(0.5)
        clear_output(wait=True)
        print('Episode: ' + str(episode+1))
        env.render()
        print('Episode Reward: ' + str(episode_reward))
        # print('Avg. Reward: ' + str(avg_rewards))
    
    time.sleep(1)
    rewards.append(episode_reward)
    # avg_rewards = (sum(rewards)/len(rewards))
    
env.close()

Episode: 4
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|[35m[34;1m[43mY[0m[0m[0m| : |B: |
+---------+
  (Dropoff)
Episode Reward: 7
