# Taxi v3 tutorial 
https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/

**env.reset**: Resets the environment and returns a random initial state.

**env.step(action)**: Steps the environment forward by one timestep. Returns:
- observation: The observed state of the environment after the action
- reward: A number indicating whether the action was beneficial
- done: Indicates whether we have successfully picked up and dropped off a passenger, which ends one episode
- info: Additional information, such as performance and latency, for debugging purposes
    
**env.render**: Renders one frame of the environment (helpful in visualizing the environment)
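
The `reset`/`step` contract described above can be illustrated without gym itself. The sketch below uses a hypothetical `CorridorEnv` class (a stand-in written for this note, not part of gym) that returns the same four values from `step` as the older gym API used in this notebook.

```python
class CorridorEnv:
    """Hypothetical stand-in for a gym environment (not part of gym).

    The agent walks along cells 0..3 and the episode ends at cell 3.
    """

    GOAL = 3

    def __init__(self):
        self.position = 0

    def reset(self):
        # Put the agent back at the start and return the initial observation
        self.position = 0
        return self.position

    def step(self, action):
        # action 0 = move left, action 1 = move right (clamped to the corridor)
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.GOAL)
        done = self.position == self.GOAL
        reward = 20 if done else -1  # reward values chosen to mirror Taxi's
        info = {}
        return self.position, reward, done, info


env_demo = CorridorEnv()
observation = env_demo.reset()
total_reward = 0
steps = 0
done = False

while not done:
    observation, reward, done, info = env_demo.step(1)  # always move right
    total_reward += reward
    steps += 1

print(steps, total_reward)
```

The loop has the same shape as the ones used with Taxi: call `reset` once, then call `step` until `done` is true.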

In [2]:
import gym

env = gym.make("Taxi-v3").env

env.render()

+---------+
|R: | : :G|
| : | : : |
| : :[43m [0m: : |
| | : | : |
|[35mY[0m| : |[34;1mB[0m: |
+---------+



In [10]:
# Reset the environment to a new, random state
env.reset()
env.render()

print("action space {}".format(env.action_space))
print("state space {}".format(env.observation_space))

+---------+
|R: | : :G|
| : | : : |
| : : : : |
|[43m [0m| : | : |
|[35mY[0m| : |[34;1mB[0m: |
+---------+

action space Discrete(6)
state space Discrete(500)


- The **filled square** represents the taxi, which is yellow without a passenger and green with a passenger.
- The **pipe ("|")** represents a wall which the taxi cannot cross.
- **R, G, Y, B** are the possible pickup and destination locations. The **blue letter** represents the current passenger pick-up location, and the **purple letter** is the current destination.

The environment has an action space of size 6 and a state space of size 500.
- Each possible state is identified by a unique number.
- The agent learns to choose an action number from 0 to 5 (0 = south, 1 = north, 2 = east, 3 = west, 4 = pickup, 5 = dropoff).
- The 500 states encode the **taxi's location, the passenger's location, and the destination location**: 25 taxi positions (5 × 5 grid) × 5 passenger locations (R, G, Y, B, or inside the taxi) × 4 destinations.
- The optimal action for each state is the action that has the highest cumulative long-term reward.
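
To make the state numbering concrete, here is a minimal sketch of how four fields can be packed into one number from 0 to 499. It assumes the fields are multiplied out in the order taxi row, taxi column, passenger location, destination; the helper names `encode` and `decode` are written for this note and are a re-implementation for illustration, not the library's code.

```python
def encode(taxi_row, taxi_col, passenger_location, destination):
    # Mixed-radix encoding: 5 rows x 5 columns x 5 passenger locations x 4 destinations
    index = taxi_row
    index = index * 5 + taxi_col
    index = index * 5 + passenger_location
    index = index * 4 + destination
    return index


def decode(index):
    # Undo the encoding, peeling off the last-encoded field first
    index, destination = divmod(index, 4)
    index, passenger_location = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_location, destination


print(encode(3, 1, 2, 0))  # 328
print(decode(328))         # (3, 1, 2, 0)
print(encode(4, 4, 4, 3))  # 499, the largest state number
```

Under this assumption, `encode(3, 1, 2, 0)` works out to ((3 × 5 + 1) × 5 + 2) × 4 + 0 = 328.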

In [4]:
state = env.encode(3, 1, 2, 0)  # taxi at row 3, column 1; passenger at location 2; destination is location 0
print ("state: ",state)

env.s = state
env.render()

state:  328
+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : : : |
| |[43m [0m: | : |
|[34;1mY[0m| : |B: |
+---------+



- `env.encode` generates the number (0 to 499) corresponding to a state; here it is **328**.
- Assigning to `env.s` sets the environment's state manually.

In [12]:
state = env.encode(2, 2, 2, 0)  # taxi at row 2, column 2; passenger at location 2; destination is location 0
print ("state: ",state)

env.s = state
env.render()

state:  248
+---------+
|[35mR[0m: | : :G|
| : | : : |
| : :[43m [0m: : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+



- `env.encode` generates the number (0 to 499) corresponding to a state; here it is **248**.
- Assigning to `env.s` sets the environment's state manually.

In [13]:
# Reward Table
env.P[328]

{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]}

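This dictionary has the structure `{action: [(probability, nextstate, reward, done)]}`
- The keys 0-5 are the actions the taxi can perform from this state (south, north, east, west, pickup, dropoff).
- `probability` is always 1.0 in this environment.
- `nextstate` is the state the agent would be in after taking the action.
- `reward`:
  - **-1** for every movement, **-10** for an illegal pickup or dropoff
  - **20** for a dropoff action with the passenger on board at the right destination
- `done` indicates that the passenger has been dropped off at the right location, which ends the episode.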
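
As a small illustration of reading this structure, the sketch below copies the `env.P[328]` output shown above into a plain dictionary and unpacks each entry. The names `ACTION_NAMES`, `transitions_328`, and `describe` are introduced here for illustration only.

```python
# Action numbering as listed earlier in these notes
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Transition table for state 328, copied from the env.P[328] output
transitions_328 = {
    0: [(1.0, 428, -1, False)],
    1: [(1.0, 228, -1, False)],
    2: [(1.0, 348, -1, False)],
    3: [(1.0, 328, -1, False)],
    4: [(1.0, 328, -10, False)],
    5: [(1.0, 328, -10, False)],
}


def describe(transitions):
    # Build one readable line per (action, outcome) pair
    lines = []
    for action, outcomes in transitions.items():
        for probability, next_state, reward, done in outcomes:
            lines.append(
                f"{ACTION_NAMES[action]:>7}: p={probability}, "
                f"next_state={next_state}, reward={reward}, done={done}"
            )
    return lines


for line in describe(transitions_328):
    print(line)

# Actions that leave the taxi in state 328 (a wall, or an illegal pickup/dropoff)
blocked = [ACTION_NAMES[a] for a, outs in transitions_328.items() if outs[0][1] == 328]
# Actions that incur the -10 penalty
penalized = [ACTION_NAMES[a] for a, outs in transitions_328.items() if outs[0][2] == -10]
print(blocked, penalized)
```

Moving west from state 328 costs -1 but leaves the state unchanged because of the wall, while pickup and dropoff are both illegal here and cost -10.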

## Solving the environment without Reinforcement Learning

In [14]:
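env.s = 328  # start from the state encoded earlier

epochs = 0
penalties, reward = 0, 0
frames = []  # one entry per timestep, used for the animation
done = False

while not done:
    # Pick a random action instead of a learned one
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    # Illegal pickup/dropoff actions are penalized with -10
    if reward == -10:
        penalties += 1

    # Record the rendered frame and the transition that produced it
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
    )

    epochs += 1

print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))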

Timesteps taken: 2324
Penalties incurred: 778


- `env.P` is the reward table.
- `env.action_space.sample()` automatically selects one random action from the set of all possible actions.

In [18]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        # print(frame['frame'].getvalue())
        env.s = frame['state']
        env.render()
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)
        
print_frames(frames)

+---------+
|[35m[34;1m[43mR[0m[0m[0m: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Timestep: 2324
State: 0
Action: 5
Reward: 20


This approach takes thousands of timesteps and makes many wrong dropoffs just to deliver one passenger to the right destination.

## <span style="color:red">Solving the environment with Reinforcement Learning</span>

- `env.P` is the reward table.
- **Q-values** `(state, action)`: the agent uses the rewards it receives to update values in the Q-table, so that it remembers whether an action was beneficial.
   - A Q-value represents the quality of an action taken from that state.
   - **Better Q-values** imply better chances of getting greater rewards.
   - Q-values are initialized to an arbitrary value.

> $$Q(\text{state}, \text{action}) \leftarrow (1 - \alpha)\, Q(\text{state}, \text{action}) + \alpha \left( \text{reward} + \gamma \max_{a} Q(\text{next state}, a) \right)$$
- α (alpha) is the learning rate (0 < α ≤ 1): the extent to which Q-values are updated on each iteration.
- γ (gamma) is the discount factor (0 ≤ γ ≤ 1): how much importance we want to give to future rewards.
- A γ close to 1 captures the long-term effective reward; a γ of 0 makes the agent consider only the immediate reward.
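
A single application of the update rule can be worked through numerically. The sketch below is illustrative only: the values `alpha = 0.1` and `gamma = 0.6` are assumptions chosen for the example (these notes do not fix them), and `q_update` is a helper name introduced here.

```python
def q_update(q_old, reward, max_next_q, alpha, gamma):
    # (1 - alpha) * old value + alpha * (reward + discounted best next value)
    return (1 - alpha) * q_old + alpha * (reward + gamma * max_next_q)


alpha = 0.1  # assumed learning rate
gamma = 0.6  # assumed discount factor

# A movement step (reward -1) from a state whose Q-values are all still 0
after_move = q_update(q_old=0.0, reward=-1, max_next_q=0.0, alpha=alpha, gamma=gamma)
print(after_move)  # approximately -0.1

# A successful dropoff (reward 20) applied to an entry currently at -0.1
after_dropoff = q_update(q_old=-0.1, reward=20, max_next_q=0.0, alpha=alpha, gamma=gamma)
print(after_dropoff)  # approximately 1.91
```

With a small α, each update moves the stored value only a fraction of the way toward the new estimate, so a single reward of 20 lifts the entry to about 1.91 rather than to 20.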

#### Q-table ###
- The matrix has a row for every state (500) and a column for every action (6).
- It is initialized to 0, and the values are updated during training.
- It has the same dimensions as the reward table.
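
A minimal sketch of such a table, using plain Python lists so that it runs without extra packages (with NumPy, `np.zeros((500, 6))` would give the same shape). The helper `best_action` and the value written into the table are hypothetical and only show how a row is read.

```python
NUM_STATES = 500
NUM_ACTIONS = 6

# One row per state, one column per action, all initialized to 0
q_table = [[0.0] * NUM_ACTIONS for _ in range(NUM_STATES)]


def best_action(q_table, state):
    # Index of the largest Q-value in the row; ties resolve to the lowest index
    row = q_table[state]
    return max(range(len(row)), key=row.__getitem__)


print(len(q_table), len(q_table[0]))

# Hypothetical learned value: pretend "north" (action 1) looks good in state 328
q_table[328][1] = 0.5
print(best_action(q_table, 328))
```

Once the table has been trained, picking the action for a state is just a lookup of the row followed by an argmax.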