<a href="https://colab.research.google.com/github/shmanth/Taxi_v3_open_ai_gym_solution_Q_learning_sarsa/blob/master/TaxiV3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Solving an OpenAI Gym Environment (Taxi-v3)
In this Python demo, we'll try solving the classic cab-driver problem. The purpose of this notebook is to show how to solve OpenAI Gym environments. We'll demonstrate Q-learning & SARSA on the Taxi environment.

Let's now look at the problem statement.

The objective is to pick up the passenger at one location and drop them off at another in as few timesteps as possible. The environment is a 5x5 grid.

Image source: https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/

There are four locations (R, G, Y, B) marked in the image. The task is to pick up the passenger at one of these locations and drop them off at another. The reward is +20 for a successful drop-off, -1 for every timestep, and -10 for each illegal pick-up or drop-off action.
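The reward scheme above can be turned into a quick arithmetic check. The sketch below is only an illustration: the function name `episode_return` is hypothetical, and it assumes that the final drop-off step earns just the +20 (no additional -1) and that an illegal action earns -10 in place of the usual -1.

```python
# Reward constants as stated in the problem description.
DROPOFF_REWARD = 20
STEP_PENALTY = -1
ILLEGAL_PENALTY = -10


def episode_return(total_steps, illegal_actions=0):
    """Undiscounted return of an episode that ends with a successful drop-off.

    Assumptions (not taken from the notebook): the last step earns only the
    drop-off reward, and each illegal pick-up/drop-off replaces the -1 step
    penalty with -10.
    """
    if illegal_actions < 0 or total_steps < 1 + illegal_actions:
        raise ValueError("total_steps must cover the drop-off plus all illegal steps")
    ordinary_steps = total_steps - 1 - illegal_actions
    return (DROPOFF_REWARD
            + ordinary_steps * STEP_PENALTY
            + illegal_actions * ILLEGAL_PENALTY)


print(episode_return(16))                     # 20 - 15 = 5
print(episode_return(16, illegal_actions=1))  # 20 - 14 - 10 = -4
```

Under these assumptions, shorter routes and fewer illegal actions both raise the return, which is what the agent is trained to maximize.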

In [1]:
# Import routines
import numpy as np
import random
import time
import gym
from IPython.display import clear_output

### Calling Env

In [2]:
env = gym.make("Taxi-v3")  # Create the env

state = env.reset()
env.render() # Helps in visualizing the environment

print("Current state is:", state)

Current state is: 148


Rendering:

- yellow: taxi is unoccupied
- green: taxi is occupied by a passenger
- blue: passenger
- magenta: destination
- other letters (R, G, Y, B): pick-up and drop-off locations

### State Space
The state for this problem is the tuple (taxi_row, taxi_col, passenger_location, destination), which the environment encodes as a single integer. There are 5 rows, 5 columns and 4 destination locations. What about the passenger locations: 4 or 5?

If the passenger is not in the cab, they can only be at one of the four marked locations. We also need one additional value for the case where the passenger is inside the cab, so the passenger can be in any of 4+1 = 5 possible locations.

Therefore, the state space = 5x5x5x4 = 500
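To make the count of 500 concrete, here is a minimal sketch of how four components can be packed into one integer and unpacked again. It assumes the encoding described in the Gym Taxi documentation, `((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination`, with passenger locations 0-3 for R, G, Y, B and 4 for "in the taxi". The helper names `encode` and `decode` are written out here for illustration.

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into a single integer in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination


def decode(state):
    """Inverse of encode: (taxi_row, taxi_col, passenger_loc, destination)."""
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination


print(decode(148))         # (1, 2, 2, 0)
print(encode(1, 2, 2, 0))  # 148
```

Under this assumed encoding, state 148 corresponds to the taxi at row 1, column 2, with the passenger waiting at location 2 and the destination at location 0.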

In [3]:
# No. of possible states
state_size = env.observation_space.n
print("state_space: ", state_size)

state_space:  500


### Action Space
In any state, the taxi can move in one of the four directions or attempt a pick-up or drop-off (legal or illegal):

- 0: south
- 1: north
- 2: east
- 3: west
- 4: pickup
- 5: drop
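When reading logs or Q-table rows, it can help to map the integer actions back to names. The following is a small sketch; `TaxiAction` and `describe` are hypothetical helpers that mirror the list above and are not part of Gym.

```python
from enum import IntEnum


class TaxiAction(IntEnum):
    """Hypothetical helper: names for the six action indices listed above."""
    SOUTH = 0
    NORTH = 1
    EAST = 2
    WEST = 3
    PICKUP = 4
    DROP = 5


def describe(action):
    """Return a readable, lower-case name for an integer action index."""
    return TaxiAction(action).name.lower()


print(describe(0))      # south
print(len(TaxiAction))  # 6
```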

In [4]:
# No. of possible Actions
action_size = env.action_space.n
print("Action_space: ", action_size)

Action_space:  6


## Training
Let's now solve the given MDP using Q-learning & SARSA.

## Q-Learning
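Q-learning is an off-policy control algorithm. The agent acts with an exploratory behavior policy (epsilon-greedy here), but each update bootstraps from the highest Q-value available in the next state, regardless of which action the agent actually takes next.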
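The difference between off-policy and on-policy updates is easiest to see side by side. The sketch below shows one Q-learning update and one SARSA update on a toy table. The function names, the table size and the numbers are chosen only for illustration, and terminal-state handling is left out.

```python
import numpy as np


def q_learning_update(Q, state, action, reward, next_state, alpha, gamma):
    """Off-policy update: bootstrap from the best action in next_state."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])


def sarsa_update(Q, state, action, reward, next_state, next_action, alpha, gamma):
    """On-policy update: bootstrap from the action actually chosen next."""
    td_target = reward + gamma * Q[next_state, next_action]
    Q[state, action] += alpha * (td_target - Q[state, action])


# Toy table: 3 states, 2 actions (sizes chosen only for illustration).
Q = np.zeros((3, 2))
Q[2] = [1.0, 5.0]

# Q-learning uses max(Q[2]) = 5.0: 0.1 * (-1 + 0.8 * 5.0), approximately 0.3
q_learning_update(Q, state=0, action=1, reward=-1, next_state=2, alpha=0.1, gamma=0.8)
print(Q[0, 1])

# SARSA uses Q[2, 0] = 1.0: 0.1 * (-1 + 0.8 * 1.0), approximately -0.02
sarsa_update(Q, state=1, action=0, reward=-1, next_state=2, next_action=0, alpha=0.1, gamma=0.8)
print(Q[1, 0])
```

With the same reward and next state, the two rules move the estimate in different directions here because SARSA's target depends on the (possibly exploratory) next action, while Q-learning's target always uses the greedy one.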

In [5]:
Q_table = np.zeros((state_size, action_size))
print(Q_table)

[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]


In [6]:
episodes = 100000    # No. of episodes

# hyperparameters
learning_rate = 0.1 # Learning rate
gamma = 0.8         # Discount factor
epsilon = 0.1       # Exploration rate: probability of taking a random action

In [7]:
# Keeping the policy epsilon-greedy
def epsilon_greedy(state, table):
  z = np.random.random()
  if z > epsilon:
    action = np.argmax(table[state])
  else:
    action = env.action_space.sample()
  return action

In [8]:
start = time.time()
deltas = []  # Largest Q-value change observed in each episode

for episode in range(1, episodes + 1):
  state = env.reset()
  done = False
  biggest_change = 0

  if episode % 5000 == 0:
    print("Episode: {}".format(episode))

  while not done:
    action = epsilon_greedy(state, Q_table)
    new_state, reward, done, info = env.step(action)
    old_q = Q_table[state, action]
    # Q-learning update: bootstrap from the greedy value of the next state
    Q_table[state, action] += learning_rate * (reward + gamma * np.max(Q_table[new_state, :]) - old_q)
    biggest_change = max(biggest_change, np.abs(Q_table[state, action] - old_q))
    state = new_state

  deltas.append(biggest_change)
  # Stop early once an episode changes no Q-value by 1e-8 or more
  if deltas[-1] < 1e-8:
    break

end = time.time()
training_time = end - start
print("Time_taken in seconds: ", training_time)
print("maximum difference", deltas[-1])

Time_taken in seconds:  13.946112871170044
maximum difference 9.53623802146808e-09


In [9]:
Q_table[33]

array([-3.65576619, -3.70412821, -3.69712639, -3.71153713, -6.88146016,
       -7.17720704])
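A row of the Q-table is easier to interpret once it is reduced to the greedy action. The sketch below copies the values printed above into a standalone array and looks up the best action. The `action_names` list is an assumption that follows the action indices from the Action Space section.

```python
import numpy as np

# Q-values for state 33, copied from the output above.
q_row = np.array([-3.65576619, -3.70412821, -3.69712639,
                  -3.71153713, -6.88146016, -7.17720704])

# Assumed to match the action indices 0-5 listed in the Action Space section.
action_names = ["south", "north", "east", "west", "pickup", "drop"]

best = int(np.argmax(q_row))
print(best, action_names[best])  # 0 south
```

The pick-up and drop values are noticeably lower than the four movement values, which is consistent with the -10 penalty for illegal pick-up and drop-off attempts in this state.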

## Testing the Q-Table
Let's now test our Q-learning agent on a fresh episode with a new random starting state.

In [10]:
state = env.reset()
env.render()

In [11]:
done = False
cumulative_reward = 0

# Follow the greedy policy (no exploration) until the episode ends
while not done:
  best_action = np.argmax(Q_table[state, :])
  state, reward, done, info = env.step(best_action)
  cumulative_reward += reward

  time.sleep(0.5)
  clear_output(wait=True)
  env.render()
  print("episode_reward", cumulative_reward)
env.close()

episode_reward 5
