# P1: Solve the OpenAI Gym [Taxi V3](https://gym.openai.com/envs/Taxi-v3/) Environment
---

## Introduction
[OpenAI Gym](https://gym.openai.com/docs/) is a framework that provides RL environments of varying complexity behind a single standard API, which makes it easy to develop and benchmark RL algorithms. The [Taxi-v3](https://gym.openai.com/envs/Taxi-v3/) environment is a simple, text-based environment in which both actions and states (observations) are discrete.

In [1]:
import gym

The `gym.make()` API can be used to spawn any of the available environments by passing its full name.

In [2]:
taxi = gym.make('Taxi-v3')

The Taxi environment has 500 states and 6 possible actions.

In [3]:
taxi.action_space

Discrete(6)

In [4]:
taxi.observation_space

Discrete(500)
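
The 500 states follow from the task layout: 25 taxi positions on the 5 x 5 grid, 5 passenger locations (four landmarks plus "in the taxi"), and 4 destinations, so 25 x 5 x 4 = 500. The sketch below shows one way to pack these components into a single integer. The function names `encode_state` and `decode_state` are illustrative and not part of the Gym API; the environment exposes its own `encode` and `decode` methods, which this sketch is assumed to mirror.

```python
def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into one integer in [0, 500)."""
    state = taxi_row                    # 5 possible rows
    state = state * 5 + taxi_col        # 5 possible columns
    state = state * 5 + passenger_loc   # 4 landmarks + "in taxi"
    state = state * 4 + destination     # 4 landmarks
    return state


def decode_state(state):
    """Invert encode_state by peeling off components in reverse order."""
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination


print(encode_state(3, 1, 2, 0))  # 328
print(decode_state(328))         # (3, 1, 2, 0)
```

Because the encoding is a mixed-radix number, every combination of components maps to a distinct integer, which is why a flat Q-table with 500 rows is sufficient.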

The task and reward structure are described in the [documentation](https://github.com/openai/gym/blob/a5a6ae6bc0a5cfc0ff1ce9be723d59593c165022/gym/envs/toy_text/taxi.py#L25).

In [5]:
taxi.reset()
taxi.render()

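+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

In the colored rendering, the taxi is the highlighted (yellow) cell in the first column of the fourth row, the passenger waits at B (blue), and the destination is G (magenta).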



## Submission
- Submit your solution as a Jupyter notebook.
- Ensure that all cells in the notebook have been executed and that their output is visible.
- Ensure that your solution consistently reaches the average cumulative reward defined in the rubric (linked below).
- Post your solution on GitHub and share the link to your commit as a direct message in Slack.

**1**
Describe the methods and variables of the class `DiscreteEnv`, which is the parent class of the Taxi-v3 environment class.

**Variables:**

nS: number of states

nA: number of actions

action_space: The Space object corresponding to valid actions

observation_space: The Space object corresponding to valid observations

s: current state

lastaction: action taken in the last step

P: A dictionary defining transitions. For each key (state s and action a), the value is a list of tuples. Each tuple defines the probability of reaching a possible next state (from given state and action), the specified next state, the associated reward and whether the task is done: 
P[s][a] == [(probability, nextstate, reward, done), ...]

isd: An array of length nS defining the initial state distribution 
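
To make the `P` and `isd` formats concrete, here is a minimal sketch using a hypothetical two-state, two-action environment. The toy numbers and the helper names `expected_reward` and `reachable_states` are assumptions made for illustration only. In Taxi itself the transitions are deterministic, so each `P[s][a]` list is expected to hold a single tuple with probability 1.0.

```python
# Hypothetical transition table in the P[s][a] format described above.
P = {
    0: {
        0: [(1.0, 0, -1, False)],                      # stay in state 0
        1: [(0.8, 1, 10, True), (0.2, 0, -1, False)],  # usually reach state 1
    },
    1: {
        0: [(1.0, 1, 0, True)],
        1: [(1.0, 1, 0, True)],
    },
}

# Initial state distribution: one probability per state, summing to 1.
isd = [1.0, 0.0]


def expected_reward(P, s, a):
    """Probability-weighted one-step reward for taking action a in state s."""
    return sum(prob * reward for prob, _next_state, reward, _done in P[s][a])


def reachable_states(P, s, a):
    """States that can follow (s, a) with non-zero probability."""
    return sorted({next_state for prob, next_state, _r, _d in P[s][a] if prob > 0})


print(expected_reward(P, 0, 1))   # roughly 7.8 (0.8 * 10 + 0.2 * -1)
print(reachable_states(P, 0, 1))  # [0, 1]
```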

**Methods:**

step(self, action):

*   Runs one timestep of the environment's dynamics.
*   Returns:
    *   observation (object): the agent's observation of the current environment
    *   reward (float): the amount of reward returned after the previous action
    *   done (bool): whether the episode has ended, in which case further step() calls will return undefined results
    *   info (dict): auxiliary diagnostic information (helpful for debugging, and sometimes learning)


reset(self):

*   Resets the environment to an initial state sampled from the initial state distribution isd, and clears lastaction.
*   Returns:
    *   observation (object): the initial observation.


In [6]:
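import numpy as np

def epsilon_greedy_action_from_Q(Q, state, epsilon):
    # Q is a DataFrame with one row per state and one column per action.
    actions = Q.columns
    # Every action starts with probability epsilon/|A|.
    # (np.float was removed in NumPy 1.24; the built-in float is equivalent.)
    action_probs = np.asarray([epsilon/len(actions)]*len(actions), dtype=float)

    # The greedy action receives the remaining 1 - epsilon probability mass.
    greedy_action_index = np.argmax(Q.loc[state].values)
    action_probs[greedy_action_index] += 1-epsilon

    epsilon_greedy_action = np.random.choice(Q.columns, p=action_probs)

    return epsilon_greedy_action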
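
To see which distribution the function above samples from, the following standalone sketch computes only the action probabilities. It uses plain Python lists instead of a DataFrame, and the name `epsilon_greedy_probs` is hypothetical.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return the epsilon-greedy probability of each action."""
    n = len(q_values)
    probs = [epsilon / n] * n
    # Index of the first maximal Q-value (same tie-breaking as argmax).
    greedy = max(range(n), key=lambda a: q_values[a])
    probs[greedy] += 1 - epsilon
    return probs


print(epsilon_greedy_probs([0.0, 2.0, 1.0], 0.0))  # [0.0, 1.0, 0.0]
print(epsilon_greedy_probs([0.0, 2.0, 1.0], 1.0))  # uniform over the 3 actions
```

With `epsilon = 1` the policy is uniformly random, and with `epsilon = 0` it is purely greedy; intermediate values interpolate between the two.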

In [21]:
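import numpy as np
import pandas as pd

# Q-table: one row per state, one column per action, initialised to 0.0
# (floats, so that the fractional updates below do not change the dtype).
Q = pd.DataFrame.from_dict({s:{a:0.0 for a in range(taxi.nA)} for s in range(taxi.nS)}, orient='index')

HYPER_PARAMS = {'gamma':0.9}

n_episodes = 5000
max_episode_len = 100
epsilon = 1            # start fully exploratory
min_epsilon = 0.01
epsilon_decay = 0.99   # multiplicative decay applied after every episode
alpha = 0.1            # learning rate

rewards = np.zeros(n_episodes)

for i in range(n_episodes):
    taxi.reset()
    s0 = taxi.s
    done = False

    episode_reward = 0

    for step in range(max_episode_len):
        a0 = epsilon_greedy_action_from_Q(Q, s0, epsilon)
        out = taxi.step(a0)
        s1, reward, done = out[0], out[1], out[2]

        # Q-learning update: move Q(s0, a0) towards reward + gamma * max_a Q(s1, a).
        Q.loc[s0,a0] += alpha*(reward + HYPER_PARAMS['gamma']*Q.loc[s1].max() - Q.loc[s0,a0])
        episode_reward += reward
        s0 = s1

        # Sanity check: no illegal pickup/drop-off (reward -10) in the final 1000 episodes.
        assert (reward!=-10) or (i<n_episodes-1000)

        if done:
            break

    if i%100 == 0:
        print("reward for episode {}: {}".format(i, episode_reward))

    # Decay epsilon towards min_epsilon, then act greedily in the final 1000 episodes.
    # Testing i+1 makes epsilon zero before the first asserted episode starts.
    epsilon *= epsilon_decay
    epsilon = max(epsilon,min_epsilon) if i+1<n_episodes-1000 else 0

    rewards[i] = episode_reward
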

reward for episode 0: -397
reward for episode 100: -271
reward for episode 200: -118
reward for episode 300: -109
reward for episode 400: -100
reward for episode 500: -74
reward for episode 600: -2
reward for episode 700: -95
reward for episode 800: -73
reward for episode 900: -73
reward for episode 1000: 14
reward for episode 1100: -11
reward for episode 1200: -25
reward for episode 1300: 8
reward for episode 1400: 10
reward for episode 1500: -5
reward for episode 1600: 9
reward for episode 1700: 8
reward for episode 1800: -3
reward for episode 1900: 14
reward for episode 2000: 5
reward for episode 2100: -2
reward for episode 2200: 8
reward for episode 2300: 7
reward for episode 2400: 9
reward for episode 2500: 3
reward for episode 2600: 6
reward for episode 2700: 7
reward for episode 2800: 8
reward for episode 2900: 8
reward for episode 3000: 10
reward for episode 3100: 9
reward for episode 3200: 13
reward for episode 3300: 5
reward for episode 3400: 10
reward for episode 3500: 8
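
As a worked illustration of the training cell, the sketch below applies the same Q-learning update to a single transition and estimates when the epsilon schedule reaches its floor. The helper name `q_update` is hypothetical; the default values of `alpha` and `gamma` are taken from the cell above, and the step reward of -1 is the ordinary per-move reward in Taxi.

```python
import math


def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    """One Q-learning update for a single (state, action) entry."""
    td_target = reward + gamma * max_q_next
    return q_sa + alpha * (td_target - q_sa)


# First update from an all-zero table after an ordinary move (reward -1).
print(q_update(0.0, -1, 0.0))  # -0.1

# Smallest k for which 0.99**k falls below the floor of 0.01.
episodes_to_floor = math.ceil(math.log(0.01) / math.log(0.99))
print(episodes_to_floor)  # 459
```

Under these settings, exploration is at its minimum after roughly the first 460 episodes, which is consistent with the episode rewards turning positive around episode 1000 in the log above.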

In [20]:
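#import matplotlib.pyplot as plt
# Moving sum of rewards over 100-episode windows within the final 1000 (greedy) episodes.
windowed_rewards = np.convolve(rewards[-1000:], np.ones(100), 'valid')
# 5th and 95th percentiles of the 100-episode moving average.
np.quantile(windowed_rewards/100,[0.05, 0.95])
#plt.plot(windowed_rewards/100)
#plt.show()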

array([7.41, 8.53])
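
Convolving with a vector of ones in `'valid'` mode produces a sliding-window sum, so dividing by the window length gives a moving average. The dependency-free sketch below reproduces that computation on a tiny input; the function name `moving_average` is illustrative.

```python
def moving_average(values, window):
    """Mean of every full-length window, in order of window start."""
    running = sum(values[:window])
    sums = [running]
    for i in range(window, len(values)):
        # Slide the window: add the new element, drop the oldest one.
        running += values[i] - values[i - window]
        sums.append(running)
    return [s / window for s in sums]


print(moving_average([1, 2, 3, 4], 2))  # [1.5, 2.5, 3.5]
```

For 1000 rewards and a window of 100 this yields 901 averages, and the quantiles above summarize their spread.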

## Evaluation
The goal of the project is to reach a certain average cumulative reward over 100 episodes. To pass the project, you must meet all the requirements in the project [rubric](https://github.com/KnowchowHQ/rl-in-action/blob/master/C1-RL-Intro/W3OH/P1-rubric.md).