# P1: Solve the OpenAI Gym [Taxi V3](https://gym.openai.com/envs/Taxi-v3/) Environment
---

## Introduction
[OpenAI Gym](https://gym.openai.com/docs/) is a framework that provides RL environments of varying complexity behind a single standard API, making it easy to develop and benchmark RL algorithms. The [Taxi-v3](https://gym.openai.com/envs/Taxi-v3/) environment is a simple, text-based environment in which both actions and states (observations) are discrete.

In [1]:
import gym

The `gym.make()` API can be used to spawn any of the available environments by passing its full name.

In [2]:
taxi = gym.make('Taxi-v3')

The Taxi environment has 500 states and 6 possible actions.

In [3]:
taxi.action_space

Discrete(6)

In [4]:
taxi.observation_space

Discrete(500)
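
The 500 states correspond to 25 taxi positions × 5 passenger locations (four landmarks plus "in the taxi") × 4 destinations. As a minimal sketch of how such a tuple can be packed into a single index, mirroring the mixed-radix scheme behind the environment's `encode`/`decode` methods (the helper names below are illustrative and not part of Gym):

```python
def encode_state(taxi_row, taxi_col, pass_loc, dest_idx):
    """Pack (row, col, passenger location, destination) into one integer."""
    # Radices are 5 (row), 5 (col), 5 (passenger location), 4 (destination).
    return ((taxi_row * 5 + taxi_col) * 5 + pass_loc) * 4 + dest_idx


def decode_state(index):
    """Inverse of encode_state: unpack the integer back into its four parts."""
    index, dest_idx = divmod(index, 4)
    index, pass_loc = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, pass_loc, dest_idx


largest = encode_state(4, 4, 4, 3)
print(largest)                # 499, the last of the 500 states
print(decode_state(largest))  # (4, 4, 4, 3)
```

Every index from 0 to 499 maps to exactly one tuple, which is why a table with 500 rows is enough to hold a value for each state.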

The task and reward structure are described in the [documentation](https://github.com/openai/gym/blob/a5a6ae6bc0a5cfc0ff1ce9be723d59593c165022/gym/envs/toy_text/taxi.py#L25).

In [5]:
taxi.reset()
taxi.render()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+



In [6]:
HYPER_PARAMS = {'gamma':0.9}
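
The `gamma` value above is the discount factor: a reward received `t` steps in the future is weighted by `gamma**t`. The following is a small sketch of that computation, assuming the reward values from the linked environment documentation (-1 per step, +20 for a successful drop-off); the function name and the example episode are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a sequence of rewards r_0, r_1, ..."""
    total = 0.0
    # Work backwards so each step is one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: ten step penalties followed by the drop-off bonus.
episode_rewards = [-1] * 10 + [20]
print(discounted_return(episode_rewards, gamma=0.9))
print(discounted_return(episode_rewards, gamma=1.0))  # undiscounted sum: 10.0
```

With `gamma` below 1, the final bonus counts for less the longer the taxi takes to reach it, so shorter routes get higher returns.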

In [7]:
import numpy as np

def epsilon_greedy_action_from_Q(Q, state, epsilon):
    # Every action gets probability epsilon/|A|; the greedy action gets the remaining 1-epsilon on top.
    actions = Q.columns
    action_probs = np.full(len(actions), epsilon/len(actions), dtype=float)

    greedy_action_index = np.argmax(Q.loc[state].values)
    action_probs[greedy_action_index] += 1-epsilon

    epsilon_greedy_action = np.random.choice(actions, p=action_probs)

    return epsilon_greedy_action
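
To make the probabilities used by `epsilon_greedy_action_from_Q` concrete, here is a standard-library-only sketch of the same idea on a plain list of action values. The helper names and the example Q-values are hypothetical.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Spread epsilon uniformly over all actions, give the rest to the greedy one."""
    n = len(q_values)
    probs = [epsilon / n] * n
    greedy = max(range(n), key=lambda a: q_values[a])  # first maximum wins ties
    probs[greedy] += 1.0 - epsilon
    return probs


def sample_action(q_values, epsilon, rng=random):
    """Draw one action index according to the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q_row = [0.0, 0.5, -1.0, 0.2, 0.0, 0.0]  # made-up values for the 6 Taxi actions
print(epsilon_greedy_probs(q_row, epsilon=0.1))
print(sample_action(q_row, epsilon=0.0))  # 1, since epsilon=0 is purely greedy
```

With `epsilon=1` all six actions are equally likely, and with `epsilon=0` the greedy action is always chosen.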

In [14]:
def greedy_policy_from_returns_tbl(table):
    # Map every state (row) to the action (column) with the highest value.
    policy = {s: None for s in table.index}
    for state in table.index:
        greedy_action = table.loc[state].idxmax()
        policy[state] = greedy_action

    return policy

In [16]:
def pretty_print_policy(policy, drop_off=False, render=True):
    if render:
        taxi.render()
    # Use the passenger location and destination of the environment's current state.
    taxi_row, taxi_col, pass_idx, dest_idx = taxi.decode(taxi.s)
    if drop_off:
        pass_idx = 4  # passenger index 4 means the passenger is in the taxi
    # Actions: 0=south, 1=north, 2=east, 3=west, 4=pickup, 5=dropoff.
    direction_repr = {1:' 🡑 ', 2:' 🡒 ', 3:' 🡐 ', 0:' 🡓 ', None:' ⬤ ', 4:' O ', 5:' X '}
    for row in range(5):
        for col in range(5):
            state = taxi.encode(row, col, pass_idx, dest_idx)
            print(direction_repr[policy[state]],end='')
        print()

In [19]:
from tqdm import tqdm 
import numpy as np
import pandas as pd
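# One row per state, one column per action; float zeros so the updates below need no dtype upcast.
Q = pd.DataFrame.from_dict({s:{a:0.0 for a in range(taxi.nA)} for s in range(taxi.nS)}, orient='index')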

n_episodes = 5000
epsilon = 1
min_epsilon = 0.1
epsilon_decay = 0.9995

alpha = 0.0001

rewards = np.zeros(n_episodes)


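for i in tqdm(range(n_episodes)):
    taxi.reset()
    s0 = taxi.s
    done = False

    episode_reward = 0
    while not done:
        # Behave epsilon-greedily with respect to the current Q-table.
        a0 = epsilon_greedy_action_from_Q(Q,s0,epsilon)
        out = taxi.step(a0)
        s1 = out[0]
        reward = out[1]
        done = out[2]

        # Q-learning update: bootstrap from the best action value in the next state.
        Q.loc[s0,a0] += alpha*(reward + HYPER_PARAMS['gamma']*Q.loc[s1].max() - Q.loc[s0,a0])
        episode_reward += reward
        s0 = s1

    # Decay exploration down to min_epsilon, then act purely greedily for the last 1000 episodes.
    epsilon *= epsilon_decay
    epsilon = max(epsilon,min_epsilon) if i<n_episodes-1000 else 0

    rewards[i] = episode_reward
    # Every 500 episodes, print the greedy policy for the environment's current state.
    if (i+1)%500 == 0:
        print(f'Iteration {i+1}')
        policy = greedy_policy_from_returns_tbl(Q)
        pretty_print_policy(policy)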
        

policy_Q_Learning = greedy_policy_from_returns_tbl(Q)
print(policy_Q_Learning)

pretty_print_policy(policy_Q_Learning)




Iteration 500
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Pickup)
 🡑  🡒  🡐  🡐  🡓
 🡓  🡒  🡑  🡓  🡑
 🡐  🡐  🡐  🡒  🡒
 🡒  🡓  🡒  🡐  🡒
 🡐  🡓  🡒  🡑  🡓





Iteration 1000
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
 🡑  🡑  🡒  🡑  🡒
 🡐  🡒  🡒  🡑  🡐
 🡐  🡓  🡑  🡑  🡑
 🡓  🡒  🡒  🡓  🡒
 🡐  🡓  🡓  🡓  🡐





Iteration 1500
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
 🡐  🡓  🡓  🡒  🡓
 🡐  🡓  🡐  🡑  🡐
 🡒  🡓  🡓  🡓  🡒
 🡓  🡐  🡐  🡒  🡒
 🡓  🡒  🡐  🡓  🡒





Iteration 2000
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
 🡐  🡑  🡐  🡒  🡒
 🡑  🡓  🡒  🡓  🡒
 🡑  🡐  🡑  🡑  🡐
 🡐  🡓  🡓  🡑  🡓
 🡐  🡓  🡐  O  🡐



Iteration 2500
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
 🡐  🡐  🡓  🡐  🡒
 🡑  🡒  🡓  🡐  🡑
 🡓  🡑  🡐  🡓  🡐
 🡐  🡑  🡐  🡐  🡐
 🡓  🡐  🡒  🡐  🡐



Iteration 3000
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
 🡓  🡒  🡑  🡑  🡒
 🡐  🡑  🡐  🡓  🡒
 🡐  🡐  🡒  🡑  🡒
 🡐  🡒  🡓  🡒  🡒
 🡐  🡑  🡓  🡓  🡒





Iteration 3500
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
 🡐  🡑  🡐  🡒  🡒
 🡑  🡑  🡓  🡑  🡒
 🡐  🡓  🡒  🡓  🡒
 🡒  🡐  🡒  🡓  🡓
 🡒  🡐  🡓  🡒  🡐





Iteration 4000
+---------+
|R: | : :G|
| : | : : |
| :_: : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
 X  🡒  🡐  🡓  🡑
 🡑  🡒  🡐  🡒  🡓
 🡐  🡑  🡒  🡒  🡒
 🡓  🡑  🡓  🡒  🡓
 X  🡓  🡐  🡓  🡓





Iteration 4500
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
 🡑  🡐  🡑  🡐  🡒
 🡐  🡐  🡐  🡐  🡑
 🡓  🡐  🡓  🡑  🡐
 🡓  🡓  🡒  🡐  🡑
 O  🡐  🡒  🡓  🡒



Iteration 5000
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
 🡓  🡑  🡑  🡐  🡑
 🡐  🡒  🡒  🡒  🡑
 🡐  🡑  🡒  🡒  🡐
 🡓  🡓  🡒  🡒  🡓
 🡐  🡓  🡓  🡐  🡐
{0: 0, 1: 2, 2: 4, 3: 0, 4: 1, 5: 0, 6: 3, 7: 2, 8: 1, 9: 0, 10: 0, 11: 0, 12: 3, 13: 3, 14: 3, 15: 0, 16: 5, 17: 1, 18: 3, 19: 2, 20: 0, 21: 0, 22: 0, 23: 1, 24: 1, 25: 0, 26: 3, 27: 0, 28: 3, 29: 3, 30: 0, 31: 1, 32: 1, 33: 3, 34: 1, 35: 0, 36: 3, 37: 1, 38: 0, 39: 3, 40: 0, 41: 0, 42: 3, 43: 1, 44: 1, 45: 0, 46: 0, 47: 3, 48: 1, 49: 0, 50: 0, 51: 0, 52: 3, 53: 2, 54: 1, 55: 0, 56: 0, 57: 2, 58: 1, 59: 1, 60: 0, 61: 2, 62: 0, 63: 3, 64: 3, 65: 0, 66: 1, 67: 0, 68: 1, 69: 1, 70: 0, 71: 2, 72: 1, 73: 2, 74: 0, 75: 0, 76: 0, 77: 1, 78: 1, 79: 2, 80: 0, 81: 1, 82: 3, 83: 1, 84: 3, 85: 0, 86: 4, 87: 0, 88: 3, 89: 2, 90: 0, 91: 0, 92: 2, 93: 0, 94: 1, 95: 0, 96: 2, 97: 5, 98: 0, 99: 2, 100: 0, 101: 3, 
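
The update applied inside the training loop above is the tabular Q-learning rule: `Q(s, a) += alpha * (r + gamma * max Q(s', ·) - Q(s, a))`. As a small sketch of a single update on a plain dictionary (the function name, the two-state table, and the numbers are made up for illustration):

```python
def q_learning_update(q, state, action, reward, next_state, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Target: immediate reward plus the discounted best value of the next state.
    td_target = reward + gamma * max(q[next_state])
    td_error = td_target - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


# Hypothetical table with two states and two actions each.
q = {0: [0.0, 0.0], 1: [1.0, 3.0]}
err = q_learning_update(q, state=0, action=1, reward=-1,
                        next_state=1, alpha=0.5, gamma=0.9)
print(err, q[0][1])
```

The step size `alpha` scales how far the entry moves toward the target. With the notebook's `alpha = 0.0001`, the same transition would shift the entry by only 0.0001 times the TD error, so a state-action pair has to be visited many times before its value changes appreciably.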




In [None]:
import matplotlib.pyplot as plt
windowed_rewards = np.convolve(rewards[-1000:], np.ones(100), 'valid')
plt.plot(windowed_rewards/100)
plt.show()
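
The `np.convolve(..., np.ones(100), 'valid')` call followed by the division by 100 is a moving average over 100-episode windows. A dependency-free sketch of the same computation, for sequences at least as long as the window (the function name is illustrative):

```python
def rolling_mean(values, window):
    """Mean of every full-length window, sliding one element at a time."""
    if window <= 0 or window > len(values):
        return []
    running = sum(values[:window])
    means = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the new element, drop the oldest one.
        running += values[i] - values[i - window]
        means.append(running / window)
    return means


print(rolling_mean([1, 2, 3, 4], window=2))  # [1.5, 2.5, 3.5]
```

Applied to the last 1000 episode rewards with a window of 100, this yields 901 points, each being the average reward over 100 consecutive episodes.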

## Submission
- Submit your solution as a Jupyter notebook.
- Ensure that all cells in the notebook have been executed and that their output is visible.
- Ensure that your solution consistently reaches the average cumulative reward defined in the rubric (link below).
- Post your solution on GitHub and share the link to your commit as a direct message in Slack.

## Evaluation
The goal of the project is to reach a certain average (cumulative) reward over 100 episodes. To pass the project, you must meet all the requirements in the project [rubric](https://github.com/KnowchowHQ/rl-in-action/blob/master/C1-RL-Intro/W3OH/P1-rubric.md).