# Taxi v3 tutorial 
https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/

**env.reset**: Resets the environment and returns a random initial state.

**env.step(action)**: Step the environment by one timestep. Returns<br/>
- observation: Observations of the environment
- reward: If your action was beneficial or not
- done: Indicates if we have successfully picked up and dropped off a passenger, also called one episode<br/>
- info: Additional info such as performance and latency for debugging purposes<br/>
    
**env.render**: Renders one frame of the environment (helpful in visualizing the environment)

In [3]:
import gym

env = gym.make("Taxi-v3").env

env.render()

+---------+
|[35m[43mR[0m[0m: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |[34;1mB[0m: |
+---------+



In [4]:
# reset environment to a new, random state
env.reset()
env.render()

print("action space {}".format(env.action_space))
print("state space {}".format(env.observation_space))

+---------+
|[34;1mR[0m: | : :G|
| : | : : |
|[43m [0m: : : : |
| | : | : |
|Y| : |[35mB[0m: |
+---------+

action space Discrete(6)
state space Discrete(500)


- The **filled square** represents the taxi, which is yellow without a passenger and green with a passenger.
- The **pipe ("|")** represents a wall which the taxi cannot cross.
- **R, G, Y, B** are the possible pickup and destination locations. The **blue letter** represents the current passenger pick-up location, and the **purple letter** is the current destination.

There are 6 possible actions and 500 possible states.
- Each state is identified uniquely by assigning a number to every possible state.
- The RL agent learns to choose an action number from 0-5 (0=south, 1=north, 2=east, 3=west, 4=pickup, 5=dropoff).
- The 500 states encode the **taxi's location, the passenger's location, and the destination location**: 25 taxi positions (5 × 5 grid) × 5 passenger locations (R, G, Y, B, or inside the taxi) × 4 destinations.
- The optimal action for each state is the action that has the highest cumulative long-term reward.
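
The state numbering can be reproduced by hand. The sketch below assumes a mixed-radix layout over (taxi row, taxi column, passenger index, destination index) with sizes 5, 5, 5 and 4. `encode_state` and `decode_state` are illustrative helpers written for these notes, not gym functions, although they are meant to mirror what `env.encode` computes.

```python
def encode_state(taxi_row, taxi_col, passenger_idx, destination_idx):
    """Pack the four components into one integer in the range 0-499."""
    state = taxi_row
    state = state * 5 + taxi_col         # 5 columns
    state = state * 5 + passenger_idx    # 4 locations + "in taxi"
    state = state * 4 + destination_idx  # 4 destinations
    return state


def decode_state(state):
    """Inverse of encode_state: unpack an integer into its four components."""
    state, destination_idx = divmod(state, 4)
    state, passenger_idx = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_idx, destination_idx


print(encode_state(3, 1, 2, 0))  # 328
print(decode_state(248))         # (2, 2, 2, 0)
```

Both values agree with the `env.encode` results printed in the cells that follow.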

In [5]:
state = env.encode(3,1,2,0) #taxi at row 3 column1, passenger is at location2, destination location is 0
print ("state: ",state)

env.s = state
env.render()

state:  328
+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : : : |
| |[43m [0m: | : |
|[34;1mY[0m| : |B: |
+---------+



- `env.encode` generates the number (0-499) corresponding to the state: **328**
- `env.s` sets the environment's state manually

In [6]:
state = env.encode(2,2,2,0) #taxi at row 2 column 2, passenger is at location 2, destination location is 0
print ("state: ",state)

env.s = state
env.render()

state:  248
+---------+
|[35mR[0m: | : :G|
| : | : : |
| : :[43m [0m: : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+



- `env.encode` generates the number (0-499) corresponding to the state: **248**
- `env.s` sets the environment's state manually

In [7]:
# Reward Table
env.P[328]

{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]}

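This dictionary has the structure `{action: [(probability, nextstate, reward, done)]}`
- The keys 0-5 are the actions the taxi can perform in this state.
- `probability` is always 1.0 in this environment.
- `nextstate` is the state the agent ends up in if it takes that action.
- `reward`:
 - **-1** for every movement, **-10** for an illegal pickup or dropoff
 - **20** for a dropoff action with the passenger on board at the right destination
- `done` is True once the passenger has been dropped off at the right destination, which ends the episode.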
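
To make the structure concrete, the sketch below walks over a hand-copied version of the `env.P[328]` output from the cell above. The `ACTION_NAMES` list and the `transitions` variable are names introduced here for illustration only.

```python
# Hand-copied from the env.P[328] output above.
transitions = {
    0: [(1.0, 428, -1, False)],
    1: [(1.0, 228, -1, False)],
    2: [(1.0, 348, -1, False)],
    3: [(1.0, 328, -1, False)],
    4: [(1.0, 328, -10, False)],
    5: [(1.0, 328, -10, False)],
}

# Action numbering as listed earlier in these notes.
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

for action, outcomes in transitions.items():
    for probability, next_state, reward, done in outcomes:
        print(f"{ACTION_NAMES[action]:8s} -> state {next_state}, reward {reward}, done={done}")

# Actions that are penalized in this state (illegal pickup/dropoff).
illegal = [
    ACTION_NAMES[action]
    for action, outcomes in transitions.items()
    if any(reward == -10 for _, _, reward, _ in outcomes)
]
print(illegal)  # ['pickup', 'dropoff']
```

Action 3 (west) leaves the state unchanged at 328 because a wall blocks the move, yet it still costs -1.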

## Solving the environment without Reinforcement Learning

In [8]:
env.s=328

epochs = 0
penalties, reward = 0,0 
frames = []
done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)
    
    if reward == -10:
        penalties +=1
        
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
    )
    
    epochs +=1
    
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))

Timesteps taken: 121
Penalties incurred: 39


- `env.P` is the reward table shown above.
- `env.action_space.sample()` selects one random action from the action space.

In [9]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        # print(frame['frame'].getvalue())
        env.s = frame['state']
        env.render()
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)
        
print_frames(frames)

+---------+
|[35m[34;1m[43mR[0m[0m[0m: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Timestep: 121
State: 0
Action: 5
Reward: 20


With purely random actions, delivering one passenger to the right destination took 121 timesteps and 39 penalized pickup/dropoff attempts in this run; other runs can take hundreds or thousands of timesteps.

## <span style="color:red">Solving the environment with Reinforcement Learning</span>

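- `env.P` is the reward table the agent learns from.
- **Q-values** are stored for each `(state, action)` pair in the Q-table and updated from the rewards received, so the agent remembers whether an action was beneficial.
   - A Q-value represents the quality of an action taken from that state.
   - **Better Q-values** imply better chances of getting greater rewards.
   - Q-values are initialized to an arbitrary value.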

> $$Q(\text{state}, \text{action}) \leftarrow (1-\alpha)\,Q(\text{state}, \text{action}) + \alpha\left(\text{reward} + \gamma \max_{a} Q(\text{next state}, a)\right)$$
- α (alpha) is the learning rate (0 < α ≤ 1): the extent to which Q-values are updated in every iteration.
- γ (gamma) is the discount factor (0 ≤ γ ≤ 1): how much importance we want to give to future rewards.
- A γ close to 1 favors the long-term effective reward; γ = 0 makes the agent consider only the immediate reward.
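
A single update can be worked through by hand. The sketch below applies the rule to one step, using α = 0.1 and γ = 0.6 (the values used later in training); `q_update` is a helper name introduced here, and the inputs are chosen purely for illustration.

```python
def q_update(old_value, reward, next_max, alpha, gamma):
    """One Q-learning update for a single (state, action) pair."""
    return (1 - alpha) * old_value + alpha * (reward + gamma * next_max)


# First visit to a fresh table: every Q-value is still 0 and the move costs -1.
new_value = q_update(old_value=0.0, reward=-1, next_max=0.0, alpha=0.1, gamma=0.6)
print(new_value)  # -0.1

# With gamma = 0 the estimate of the next state is ignored entirely.
myopic = q_update(old_value=0.0, reward=-1, next_max=5.0, alpha=0.1, gamma=0.0)
print(myopic)
```

With α = 0.1, each update moves the stored value only a tenth of the way toward the new estimate, which is why many episodes are needed before the table settles.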

### Q-table ###
- The matrix has a row for every state (500) and a column for every action (6).
- It is initialized to 0, and the values are updated during training.
- It has the same dimensions as the reward table.

### Random & Optimal ###
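- Q-values tend to converge toward the most optimal action (exploitation), so the agent always takes the same route, possibly overfitting.
- ϵ ("epsilon") controls how often the agent picks a random action instead (exploration): a higher epsilon means more random actions and, on average, more penalties during training.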
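
The explore/exploit choice can be isolated into a small function. This is a minimal sketch assuming NumPy's `default_rng` generator in place of the `random` module; `choose_action` and the sample Q-values are illustrative and not part of the tutorial code.

```python
import numpy as np


def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy: random action with probability epsilon, else the best known one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))  # explore
    return int(np.argmax(q_row))              # exploit


rng = np.random.default_rng(0)
q_row = np.array([-2.4, -2.3, -2.5, -2.6, -10.6, -10.4])  # made-up values for one state

greedy = choose_action(q_row, epsilon=0.0, rng=rng)
print(greedy)  # 1, the index of the largest Q-value

# With epsilon = 1.0 every choice is random, so penalized actions 4 and 5 get sampled too.
random_picks = [choose_action(q_row, epsilon=1.0, rng=rng) for _ in range(1000)]
print(sum(a in (4, 5) for a in random_picks) / len(random_picks))
```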

### training algorithm ###
- The agent explores the env over 100,000 episodes.
- `while not done`: pick a random action (explore) or exploit the learned Q-values
 - decided by comparing `epsilon` with `random.uniform(0, 1)`, which **returns an arbitrary number between 0 and 1**
- execute the chosen action to obtain `next_state` and `reward`
- calculate the maximum Q-value over the actions available in `next_state` (`next_max`)
- update the Q-value for the current `(state, action)` pair to `new_value`


In [17]:
import numpy as np
q_table = np.zeros([env.observation_space.n, env.action_space.n])

In [18]:
%%time

""" Training the agent """

import random
from IPython.display import clear_output

#Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

#For plotting metrics
all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()
    epochs, penalties, reward = 0,0,0
    done = False
    
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() #Explore action space
        else:
            action = np.argmax(q_table[state]) #Exploit learned values
        
        next_state, reward, done, info = env.step(action)
        
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        
        new_value = (1-alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value
        
        if reward == -10:
            penalties +=1
        
        state = next_state
        epochs +=1
        
    if i % 100 == 0:  # report progress every 100 episodes
        clear_output(wait=True)
        print(f"Episode: {i}")
        
print("Training finished.\n")


Training finished. 

Wall time: 39.1 s


The Q-table has now been learned over 100,000 episodes.

For state 328, the highest Q-value belongs to action 1, "north" (about -2.27), so Q-learning has effectively learned the best action to take.

In [20]:
q_table[328]

array([ -2.41043841,  -2.27325184,  -2.38028529,  -2.35555588,
       -10.60012791, -10.41699021])
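
The row can be read off programmatically. The sketch below copies the six values printed above into a standalone array and maps them to action names; `ACTION_NAMES` is a list introduced here for illustration, following the action numbering given earlier.

```python
import numpy as np

ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Copied from the q_table[328] output above.
q_row = np.array([-2.41043841, -2.27325184, -2.38028529, -2.35555588,
                  -10.60012791, -10.41699021])

best_action = int(np.argmax(q_row))
print(best_action, ACTION_NAMES[best_action])  # 1 north

# Rank all actions from best to worst for this state.
ranking = [ACTION_NAMES[i] for i in np.argsort(q_row)[::-1]]
print(ranking)  # ['north', 'west', 'east', 'south', 'dropoff', 'pickup']
```

The two penalized actions sit far below the four movements, which is why the greedy policy avoids illegal pickups and dropoffs in this state.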

## Evaluating the agent performance
During evaluation the next action is always selected using the best Q-value, with no exploration. **Performance improved significantly and the agent incurred no penalties**: it performed correct pickup/dropoff actions for 100 different passengers.

In [21]:
"""Evaluate agent's performance after Q-learning"""

total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    
    done = False
    
    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1

        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

Results after 100 episodes:
Average timesteps per episode: 13.44
Average penalties per episode: 0.0


## Comparing Q-learning agent vs no Reinforcement Learning

Initially the agent makes errors, but once it has built a Q-table it can act wisely and maximize its rewards.

1. Average number of penalties per episode:
The smaller the number, the better the performance.

2. Average number of timesteps per trip:
The smaller the number, the closer the agent is to the minimum number of steps needed to reach the destination.

3. Average rewards per move:
The larger the reward, the more often the agent is doing the right thing.
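
The three metrics can be computed from per-episode totals. This is a sketch under stated assumptions: `summarize` is a hypothetical helper, and the episode tuples are made-up numbers (not results from these notes) that follow the reward scheme of -1 per move, -10 per penalty and +20 for the final dropoff.

```python
def summarize(episodes):
    """episodes: list of (timesteps, penalties, total_reward) tuples, one per episode."""
    n = len(episodes)
    total_steps = sum(steps for steps, _, _ in episodes)
    total_penalties = sum(penalties for _, penalties, _ in episodes)
    total_reward = sum(reward for _, _, reward in episodes)
    return {
        "avg_penalties_per_episode": total_penalties / n,
        "avg_timesteps_per_trip": total_steps / n,
        "avg_reward_per_move": total_reward / total_steps,
    }


# Made-up episodes: (timesteps, penalties, total_reward)
episodes = [(12, 0, 9), (14, 0, 7), (10, 1, 2)]
metrics = summarize(episodes)
print(metrics["avg_timesteps_per_trip"])  # 12.0
print(metrics["avg_reward_per_move"])     # 0.5
```

Tracking `total_reward` inside the evaluation loop would require accumulating `reward` at every step, which the evaluation cell above does not do.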


## Hyperparameters and optimizations

The values of alpha, gamma and epsilon used here were chosen by intuition and trial and error; there are better ways to come up with good values.

1. **α (the learning rate)** should **decrease** as you continue to gain a larger and larger knowledge base.
2. **γ (the discount factor)**: as you get closer and closer to the deadline, your preference for near-term reward should **increase**, as you won't be around long enough to get the long-term reward, which means your gamma should decrease.
3. **ϵ (the exploration rate)**: as we develop our strategy, we have less need of exploration and more exploitation to get more utility from our policy, so as **trials increase, epsilon should decrease.**
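
One common way to make a hyperparameter shrink over time is an exponential decay with a floor. The sketch below is only an illustration: `decayed`, the starting values, the floors and the decay rates are all assumptions chosen for the example, not values from the tutorial.

```python
def decayed(initial, minimum, decay_rate, episode):
    """Exponentially decay a hyperparameter per episode, never going below `minimum`."""
    return max(minimum, initial * decay_rate ** episode)


for episode in (0, 1_000, 10_000, 100_000):
    epsilon = decayed(initial=1.0, minimum=0.01, decay_rate=0.9995, episode=episode)
    alpha = decayed(initial=0.5, minimum=0.05, decay_rate=0.9999, episode=episode)
    print(f"episode {episode:>7}: epsilon={epsilon:.4f}, alpha={alpha:.4f}")
```

In the training loop, the fixed `epsilon` and `alpha` would be replaced by these calls at the start of each episode, so early episodes explore heavily and later ones mostly exploit.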
