# Learning Reinforcement Learning

I'm learning reinforcement learning from the tutorial linked below. My current understanding is that reinforcement learning trains an agent to choose the best action in each state of the state space. Two things are still unclear to me: what happens when the taxi runs into a wall, and how the evaluation works. <br> 
https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/

## Setup

In [0]:
import numpy as np
import random

# gym provides ready-made reinforcement learning environments
import gym

# clear_output redraws the cell output, used to visualize training
from IPython.display import clear_output

# sleep pauses between frames so that each snapshot stays visible
from time import sleep

## Self-driving cab

In [18]:
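# taxi environment; .env unwraps the time-limit wrapper so episodes are not cut off
env = gym.make("Taxi-v2").env

# fix the initial state: taxi row 3, taxi column 4, passenger at location 2 (Y), destination 0 (R)
env.s = env.encode(3, 4, 2, 0)
# rows and columns are indexed from 0, so each takes a value in 0, 1, 2, 3, 4

# render the grid with the taxi, the passenger and the destination
env.render()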

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+



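The highlighted box is the taxi. Here its starting location is fixed with env.encode; after env.reset() it is assigned at random. A yellow box means the taxi is empty. Once a passenger is picked up, the box turns green. The letters R, G, Y and B mark the possible pickup and destination locations. The blue letter shows where the passenger is waiting, and the pink (magenta) letter is where the passenger wants to go. In this example the taxi is at row 3, column 4, the passenger waits at Y and the destination is R. After Q-learning, we will check which action is best in this state. Since the passenger is at Y and the wall to the right of Y forces the taxi to go around through the upper rows, the first step should be north or west.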
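The call env.encode(3, 4, 2, 0) packs four values into a single integer state. Here is a minimal pure-Python sketch of that packing. It assumes a mixed-radix layout in the order taxi row, taxi column, passenger location, destination. The `encode` function below is a stand-in written for illustration, not Gym's own code.

```python
from itertools import product

def encode(taxi_row, taxi_col, passenger, destination):
    # Assumed layout: 5 rows, 5 columns, 5 passenger positions
    # (4 landmarks plus "inside the taxi"), 4 destinations.
    return ((taxi_row * 5 + taxi_col) * 5 + passenger) * 4 + destination

start = encode(3, 4, 2, 0)
print(start)  # 388

# Enumerate every combination to confirm that the states fill 0..499 without gaps.
all_states = {
    encode(r, c, p, d)
    for r, c, p, d in product(range(5), range(5), range(5), range(4))
}
print(len(all_states), min(all_states), max(all_states))  # 500 0 499
```

Under this assumption the starting state is 388, and the 500 states are exactly the integers from 0 to 499.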

In [19]:
# show the size of the action space
print(env.action_space)

# show the size of the observation space, which is 5 * 5 * 5 * 4 = 500
# (taxi row * taxi column * passenger location, i.e. 4 landmarks or inside the taxi * destination)
print(env.observation_space)

# internal logic: the current state and its transition table
print(env.s)
print(env.P[env.s])

Discrete(6)
Discrete(500)
388
{0: [(1.0, 488, -1, False)], 1: [(1.0, 288, -1, False)], 2: [(1.0, 388, -1, False)], 3: [(1.0, 368, -1, False)], 4: [(1.0, 388, -10, False)], 5: [(1.0, 388, -10, False)]}


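env.P[state] is {action: [(probability, next state, reward, done)]}. In state 388, action 2 (east) leads back to state 388 with reward -1: the taxi is against the east wall, so it stays where it is and only pays the normal step cost. Actions 4 (pickup) and 5 (dropoff) are illegal here, so they also leave the state unchanged but cost -10.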
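To make the transition table easier to read, the state numbers can be unpacked back into positions. The sketch below copies the printed table for state 388 and labels each action. `decode` and `describe` are hypothetical helpers, and the state layout is an assumption that reproduces 388 for (3, 4, 2, 0).

```python
ACTIONS = ["south", "north", "east", "west", "pickup", "dropoff"]

# Transition table for state 388, copied from the output above.
P_388 = {
    0: [(1.0, 488, -1, False)],
    1: [(1.0, 288, -1, False)],
    2: [(1.0, 388, -1, False)],
    3: [(1.0, 368, -1, False)],
    4: [(1.0, 388, -10, False)],
    5: [(1.0, 388, -10, False)],
}

def decode(state):
    """Assumed inverse of env.encode: (taxi_row, taxi_col, passenger, destination)."""
    state, destination = divmod(state, 4)
    state, passenger = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger, destination

def describe(state, transitions):
    """Label each action as a move, a blocked move, or an illegal pickup/dropoff."""
    labels = {}
    for action, outcomes in transitions.items():
        _, next_state, reward, _ = outcomes[0]  # deterministic: one outcome per action
        if next_state != state:
            row, col, _, _ = decode(next_state)
            labels[ACTIONS[action]] = f"moves to row {row}, column {col}"
        elif reward == -10:
            labels[ACTIONS[action]] = "illegal here, state unchanged"
        else:
            labels[ACTIONS[action]] = "blocked by a wall, state unchanged"
    return labels

print(decode(388))  # (3, 4, 2, 0)
for name, label in describe(388, P_388).items():
    print(name, "->", label)
```

Read this way, running into a wall is not a special event: the state does not change and the step costs the usual -1.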

## Reinforcement learning - Q-learning algorithm

In [10]:
# initialize the Q-table that will store the learned value of every state-action pair

# one row per state and one column per action
q_table = np.zeros([env.observation_space.n, env.action_space.n])
print(q_table.shape)

# actions = {0: south, 1: north, 2: east, 3: west, 4: pickup, 5: dropoff}

(500, 6)


In [11]:
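%%time
# %%time reports the training time; a cell magic must be the first line of the cell

# hyperparameters for Q-learning
alpha = 0.1    # learning rate
gamma = 0.6    # discount factor
epsilon = 0.1  # exploration rate
# tune these values (for example with a grid search) if you want to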

for i in range(1, 100001):
  
  # start a new episode from a random state
  state = env.reset()
  epochs = 0     # number of steps taken in this episode
  penalties = 0  # number of illegal pickups or dropoffs
  reward = 0
  done = False
  
  # done becomes True once the passenger is dropped off at the destination
  while not done:
    
    # epsilon-greedy: with probability epsilon, take a random action instead of
    # the current best one, so the agent keeps exploring and does not lock onto
    # a poor early estimate
    if random.uniform(0, 1) < epsilon:
      # explore: sample a random action from 0, 1, 2, 3, 4, 5
      action = env.action_space.sample()
      
    else:
      # exploit: take the action with the highest learned Q-value
      action = np.argmax(q_table[state])
      
    # take the action and observe the outcome
    next_state, reward, done, info = env.step(action)
    
    # current Q-value of this state-action pair
    old_value = q_table[state, action]
    
    # the action has been taken, so the next state is known;
    # np.max gives the largest Q-value over all actions in that state
    next_max = np.max(q_table[next_state])
    
    # the Q-learning update rule: blend the old estimate with the new target,
    # which is the reward just received plus the discounted best value of the next state
    new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
    
    # store the updated Q-value for this state-action pair
    q_table[state, action] = new_value
    
    # an illegal pickup or dropoff costs -10; count it as a penalty
    if reward == -10:
      penalties += 1
     
    # move on to the next state
    state = next_state
    epochs += 1
    
  if i % 100 == 0:
    clear_output(wait = True)
    print(f"Episode: {i}")
    
print("Training finished.\n")

Episode: 100000
Training finished.

CPU times: user 44.3 s, sys: 2.81 s, total: 47.2 s
Wall time: 45.9 s
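The two key pieces of the training loop, the epsilon-greedy choice and the Q-value update, can be isolated as small functions and checked by hand. This is a sketch that uses plain Python lists instead of NumPy. `choose_action` and `q_update` are hypothetical names, and the default hyperparameters are the ones used above.

```python
import random

def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy: a random action with probability epsilon, else the best known one."""
    if rng.uniform(0, 1) < epsilon:
        return rng.randrange(len(q_row))  # explore
    return q_row.index(max(q_row))        # exploit (first maximum, like np.argmax)

def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.6):
    """One Q-learning update, the same formula as in the training loop."""
    return (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

rng = random.Random(0)
q_row = [0.0, -0.1, 0.0, 0.0, -1.0, -1.0]

# With epsilon = 0 the choice is purely greedy and picks the first maximum.
print(choose_action(q_row, epsilon=0.0, rng=rng))  # 0

# First visit to a state-action pair in an all-zero table:
print(q_update(0.0, reward=-1, next_max=0.0))   # -0.1 for an ordinary move
print(q_update(0.0, reward=-10, next_max=0.0))  # -1.0 for an illegal pickup or dropoff

# Repeating the same update moves the value a fraction alpha closer
# to the target (here -1) each time.
value = 0.0
for _ in range(100):
    value = q_update(value, reward=-1, next_max=0.0)
print(round(value, 4))
```

With alpha = 0.1, a single experience only moves the estimate by a tenth of the gap to its target, which is one reason the loop above needs many episodes.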


## Check our initial intuition

In [20]:
q_table[388]
# the largest values belong to north (index 1) and west (index 3), which tie

array([-2.4737387 , -2.4510224 , -2.4594717 , -2.4510224 , -9.6392277 ,
       -9.64972901])

The highest values in this row of the Q-table are the second and fourth ones, which correspond to north and west, so our initial intuition was right.
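Two quick checks make this result more concrete. np.argmax returns only the first maximum, so the tie between north and west is easy to miss. The value -2.4510224 can also be reproduced by hand as a discounted return. The sketch assumes that the shortest route from state 388 takes 13 actions (7 moves to reach Y, the pickup, 4 moves to R, the dropoff), each costing -1 except the final dropoff, which earns +20.

```python
ACTIONS = ["south", "north", "east", "west", "pickup", "dropoff"]

# Q-values for state 388, copied from the output above.
q_row = [-2.4737387, -2.4510224, -2.4594717, -2.4510224, -9.6392277, -9.64972901]

# Collect every action that shares the maximum value, not just the first one.
best = max(q_row)
best_actions = [ACTIONS[i] for i, q in enumerate(q_row) if abs(q - best) < 1e-9]
print(best_actions)  # ['north', 'west']

def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma to the power of its time step."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# Assumed optimal episode from state 388: 12 steps at -1, then +20 for the dropoff.
rewards = [-1] * 12 + [20]
print(round(discounted_return(rewards, gamma=0.6), 7))  # -2.4510224
```

The match suggests that the north and west entries have converged. It also shows how strongly gamma = 0.6 discounts the +20 reward when it is 13 steps away.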

## Evaluation

In [14]:
# evaluate the learned policy: always take the greedy action, with no exploration and no Q-table updates
total_epochs = 0
total_penalties = 0
episodes = 100

for _ in range(episodes):
  state = env.reset()
  epochs = 0
  penalties = 0
  reward = 0
  done = False
  
  while not done:
    action = np.argmax(q_table[state])
    state, reward, done, info = env.step(action)
    
    if reward == -10:
      penalties += 1
      
    epochs += 1
    
  total_penalties += penalties
  total_epochs += epochs
  
# number of episodes, average steps per episode, average penalties per episode
print(episodes)
print(total_epochs / episodes)
print(total_penalties / episodes)

100
12.07
0.0
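One caveat with this evaluation loop: the environment was created with .env, which to my understanding removes Gym's step limit, so a greedy policy read from an undertrained Q-table could loop forever. Below is a minimal sketch of a capped evaluation. It uses a hypothetical four-cell corridor (`CorridorEnv`) in place of the taxi environment so that it runs without Gym. The `evaluate` helper and its `max_steps` parameter are my own additions.

```python
class CorridorEnv:
    """Hypothetical stand-in environment: walk right from cell 0 to cell 3."""
    n_states = 4  # actions: 0 = left, 1 = right

    def reset(self):
        self.s = 0
        return self.s

    def step(self, action):
        # Old-style Gym return value, as in the notebook: (state, reward, done, info)
        if action == 1:
            self.s = min(self.s + 1, self.n_states - 1)
        else:
            self.s = max(self.s - 1, 0)
        done = self.s == self.n_states - 1
        reward = 20 if done else -1
        return self.s, reward, done, {}

def evaluate(env, q_table, episodes=100, max_steps=200):
    """Run the greedy policy; return (average steps per episode, episodes that hit the cap)."""
    total_steps = 0
    timeouts = 0
    for _ in range(episodes):
        state = env.reset()
        done = False
        steps = 0
        while not done and steps < max_steps:
            row = q_table[state]
            action = row.index(max(row))  # greedy action
            state, reward, done, _ = env.step(action)
            steps += 1
        total_steps += steps
        if not done:
            timeouts += 1
    return total_steps / episodes, timeouts

good_q = [[0.0, 1.0]] * 4  # always prefers "right"
bad_q = [[1.0, 0.0]] * 4   # always prefers "left", so it never reaches the goal

print(evaluate(CorridorEnv(), good_q))  # (3.0, 0)
print(evaluate(CorridorEnv(), bad_q))   # (200.0, 100)
```

Reporting the number of capped episodes separately keeps a stuck policy from hiding behind an average step count.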


## Appendix

In [5]:
# baseline without reinforcement learning: take random actions until the dropoff succeeds
env.s = 328
epochs = 0
penalties = 0
reward = 0
frames = []
done = False

while not done:
  action = env.action_space.sample()
  state, reward, done, info = env.step(action)
  
  if reward == -10:
    penalties += 1
    
  frames.append({'frame': env.render(mode = 'ansi'),
                 'state': state,
                 'action': action,
                 'reward': reward})
  
  epochs += 1

print(epochs)
print(penalties)

256
58


In [32]:
# replay the recorded frames one by one as an animation
def print_frames(frames):
  for i, frame in enumerate(frames):
    clear_output(wait = True)
    print(frame['frame'].getvalue())
    print(f"Timestep: {i + 1}")
    print(f"State: {frame['state']}")
    print(f"Action: {frame['action']}")
    print(f"Reward: {frame['reward']}")
    # pause briefly with sleep so that each frame is visible before the next one replaces it
    sleep(.1)
  
# when the box turns from yellow to green, the taxi has picked up the passenger
print_frames(frames)

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Timestep: 296
State: 0
Action: 5
Reward: 20


In [24]:
print(env.action_space)
action = env.action_space.sample()
print(action) 

# next_state, reward, done, info = env.step(action)
print(env.step(action))

print(env.reset())

Discrete(6)
3
(468, -1, False, {'prob': 1.0})
326
