# Introduction to RL 

Based on the tutorial at https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/


For this tutorial we are going to use the OpenAI Gym Taxi domain, which is installed below (each code cell is executed by pressing Shift+Enter).


In [None]:
!pip install cmake 'gym[atari]' scipy



Import the Taxi domain and render an initial environment.



In [None]:
import gym

env = gym.make("Taxi-v3").env

env.reset()
env.render()

+---------+
|R: | : :[34;1mG[0m|
| : | : : |
| : :[43m [0m: : |
| | : | : |
|Y| : |[35mB[0m: |
+---------+




The filled square represents the taxi, which is yellow without a passenger and green with a passenger.
The pipes ("|") represent walls which the taxi cannot traverse.
R, G, Y, B are the possible pickup and destination locations. The passenger can also be in the taxi. The blue letter represents the current passenger pick-up location, and the purple letter is the current destination.

The action space is:

0 = south
1 = north
2 = east
3 = west
4 = pickup
5 = dropoff

Each state is encoded as a single integer, as follows (play around with the values):


In [None]:
state = env.encode(1, 0, 2, 1) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)

env.render()
env.s = state
env.render()


Another example:

In [None]:
state = env.encode(2, 3, 0, 1) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)

env.s = state
env.render()

State: 261
+---------+
|[34;1mR[0m: | : :[35mG[0m|
| : | : : |
| : : :[43m [0m: |
| | : | : |
|Y| : |B: |
+---------+
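The printed value 261 can be reproduced by hand. The following is a minimal sketch of a mixed-radix encoding that is consistent with the outputs in this notebook, assuming 5 taxi rows, 5 taxi columns, 5 passenger locations (R, G, Y, B, or in the taxi) and 4 destinations, which gives 5 × 5 × 5 × 4 = 500 states. The names `encode_state` and `decode_state` are hypothetical stand-ins, not the environment's own code.

```python
def encode_state(taxi_row, taxi_col, pass_idx, dest_idx):
    """Pack the four components into one integer (assumed layout)."""
    return ((taxi_row * 5 + taxi_col) * 5 + pass_idx) * 4 + dest_idx


def decode_state(state):
    """Invert encode_state by peeling off the components in reverse order."""
    state, dest_idx = divmod(state, 4)
    state, pass_idx = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, pass_idx, dest_idx


print(encode_state(2, 3, 0, 1))  # 261, matching the cell output above
print(decode_state(261))         # (2, 3, 0, 1)
```

Decoding is handy when inspecting a raw state number, since it recovers the taxi position, passenger index and destination index.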



P is the transition function, implemented as a dictionary. For a given state, env.P[state] has the structure {action: [(probability, nextstate, reward, done)]}.



In [None]:
state = env.encode(0, 0, 0, 1) # (taxi row, taxi column, passenger index, destination index)
env.s = state
env.render()
print('state:',state)


print(env.P[state])
# move the taxi south, south, then east
state, reward, done, info = env.step(0)
state, reward, done, info = env.step(0)
state, reward, done, info = env.step(2)
env.s = state
env.render()

print('state:',state)
print('reward:', reward)
print('done:', done)
print('info:', info)
print('transition func:', env.P[state])





+---------+
|[34;1m[43mR[0m[0m: | : :[35mG[0m|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

state: 1
{0: [(1.0, 101, -1, False)], 1: [(1.0, 1, -1, False)], 2: [(1.0, 21, -1, False)], 3: [(1.0, 1, -1, False)], 4: [(1.0, 17, -1, False)], 5: [(1.0, 1, -10, False)]}
+---------+
|[34;1mR[0m: | : :[35mG[0m|
| : | : : |
| :[43m [0m: : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
state: 221
reward: -1
done: False
info: {'prob': 1.0}
transition func: {0: [(1.0, 321, -1, False)], 1: [(1.0, 121, -1, False)], 2: [(1.0, 241, -1, False)], 3: [(1.0, 201, -1, False)], 4: [(1.0, 221, -10, False)], 5: [(1.0, 221, -10, False)]}
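To make the structure of one `env.P[state]` entry concrete, this sketch copies the dictionary printed for state 1 above and unpacks each tuple. The `ACTION_NAMES` list is an illustrative helper built from the action list earlier in this notebook; it is not part of gym.

```python
# Copied from the printed output of env.P[1] above.
p_state_1 = {
    0: [(1.0, 101, -1, False)],
    1: [(1.0, 1, -1, False)],
    2: [(1.0, 21, -1, False)],
    3: [(1.0, 1, -1, False)],
    4: [(1.0, 17, -1, False)],
    5: [(1.0, 1, -10, False)],
}

# Illustrative helper (not a gym API): action index -> action name.
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

for action, transitions in p_state_1.items():
    for prob, next_state, reward, done in transitions:
        print(f"{ACTION_NAMES[action]:>7}: p={prob}, next={next_state}, "
              f"reward={reward}, done={done}")

# Actions that leave state 1 unchanged (grid edge or illegal drop-off).
blocked = [ACTION_NAMES[a] for a, ts in p_state_1.items() if ts[0][1] == 1]
print(blocked)  # ['north', 'west', 'dropoff']
```

Every list in the printed entry holds a single tuple with probability 1.0, so each action in this state has exactly one possible outcome.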


# Brute Force Approach

Let's see what would happen if we try to brute-force our way to solving the problem without RL.

Since we have our P table for default rewards in each state, we can try to have our taxi navigate just using that.

We'll create a loop that runs until one passenger reaches one destination (one episode), or in other words, until the received reward is 20. The env.action_space.sample() method selects one random action from the set of all possible actions.

In [None]:
env.s = 328  # set environment to illustration's state

epochs = 0
penalties, reward = 0, 0

frames = [] # for animation

done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
    )

    epochs += 1
    
    
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))

Timesteps taken: 1099
Penalties incurred: 347


Play the animation


In [None]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)
        
print_frames(frames)

+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : : : |
| | : | :[43m [0m|
|[34;1mY[0m| : |B: |
+---------+
  (East)

Timestep: 454
State: 388
Action: 2
Reward: -1



# Q-Learning

Essentially, Q-learning lets the agent use the environment's rewards to learn, over time, the best action to take in a given state.

In our Taxi environment, we have the reward table, P, that the agent will learn from. It does this by receiving a reward for taking an action in the current state, then updating a Q-value to remember whether that action was beneficial.

The values stored in the Q-table are called Q-values, and each one maps to a (state, action) combination.

A Q-value for a particular state-action combination is representative of the "quality" of an action taken from that state. Better Q-values imply better chances of getting greater rewards.

For example, if the taxi is faced with a state that includes a passenger at its current location, it is highly likely that the Q-value for pickup is higher when compared to other actions, like dropoff or north.

Q-values are initialized to an arbitrary value, and as the agent exposes itself to the environment and receives different rewards by executing different actions, the Q-values are updated using the equation:


$$Q(\text{state}, \text{action}) \leftarrow (1 - \alpha)\, Q(\text{state}, \text{action}) + \alpha \left( \text{reward} + \gamma \max_{a} Q(\text{next state}, a) \right)$$

Where:

- α (alpha) is the learning rate (0<α≤1) - Just like in supervised learning settings, α is the extent to which our Q-values are being updated in every iteration.

- γ (gamma) is the discount factor (0≤γ≤1) - determines how much importance we want to give to future rewards. A high value for the discount factor (close to 1) captures the long-term effective reward, whereas a discount factor of 0 makes our agent consider only the immediate reward, hence making it greedy.
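
As a worked example of the update rule, the sketch below applies it by hand to a single transition. It uses α = 0.1 and γ = 0.6, the values set in this notebook's training cell; the helper name `q_update` and the sample numbers (a zero-initialized table and a step reward of -1) are illustrative assumptions.

```python
def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.6):
    """One Q-learning update: blend the old estimate with the new target."""
    target = reward + gamma * next_max
    return (1 - alpha) * old_value + alpha * target


# First visit: all Q-values are still 0 and the step costs -1.
first = q_update(old_value=0.0, reward=-1, next_max=0.0)
print(round(first, 4))   # -0.1

# Second visit to the same (state, action); the next state is still unexplored.
second = q_update(old_value=first, reward=-1, next_max=0.0)
print(round(second, 4))  # -0.19
```

With α = 0.1, each update moves the stored estimate one tenth of the way toward the target, so a Q-value settles gradually over repeated visits.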

In [None]:
print(env.observation_space)

Discrete(500)


In [None]:
# Create the Q-table: one row per state, one column per action
import numpy as np
q_table = np.zeros([env.observation_space.n, env.action_space.n])
print(q_table)

[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]


In [None]:
%%time
"""Training the agent"""

import random
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1 # the learning rate
gamma = 0.6 # discount factor
epsilon = 0.1 # exploration rate: probability of taking a random action

# For plotting metrics
all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False
    
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        next_state, reward, done, info = env.step(action) 
        
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1
        
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")

Episode: 100000
Training finished.

CPU times: user 1min 8s, sys: 15.6 s, total: 1min 24s
Wall time: 1min 11s


In [None]:
print(q_table[361])
print(q_table[1])


[ -2.49496543  -2.48942084  -2.49553924  -2.49335055 -11.18795654
 -10.77390738]
[ -2.41837063  -2.3639511   -2.41837062  -2.36395109  -2.27325184
 -11.36395103]
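
To read these rows, recall that the column index is the action number. The sketch below copies the second printed row (state 1, where the taxi and the waiting passenger are both at R) and picks the column with the largest value, which is what `np.argmax(q_table[state])` does for a NumPy row. The `ACTION_NAMES` list is an illustrative helper, not part of gym.

```python
# Values copied from the printed q_table[1] above.
q_row_state_1 = [-2.41837063, -2.3639511, -2.41837062,
                 -2.36395109, -2.27325184, -11.36395103]

# Illustrative helper (not a gym API): action index -> action name.
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Index of the largest Q-value, i.e. the greedy action.
best_action = max(range(len(q_row_state_1)), key=q_row_state_1.__getitem__)
print(best_action, ACTION_NAMES[best_action])  # 4 pickup
```

Pickup has the highest value, as expected for a state in which the passenger is at the taxi's location, while the illegal drop-off carries the lowest value.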


In [None]:
state = env.reset()
env.render()



+---------+
|R: | : :[34;1m[43mG[0m[0m|
| : | : : |
| : : : : |
| | : | : |
|Y| : |[35mB[0m: |
+---------+



# Evaluating the performance of the Q-Learning approach

In [None]:
"""Evaluate agent's performance after Q-learning"""

total_epochs, total_penalties = 0, 0
episodes = 100

frames = []

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    
    done = False
    
    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, info = env.step(action)

        frames.append({
          'frame': env.render(mode='ansi'),
          'state': state,
          'action': action,
          'reward': reward
          }
        )

        if reward == -10:
            penalties += 1

        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

Results after 100 episodes:
Average timesteps per episode: 12.88
Average penalties per episode: 0.0


In [None]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.5)
        
print_frames(frames)

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |[35m[34;1m[43mB[0m[0m[0m: |
+---------+
  (Dropoff)

Timestep: 1288
State: 475
Action: 5
Reward: 20


Q-learning is one of the simplest reinforcement learning algorithms. Its main limitation is that, once the number of states in the environment becomes very large, a Q-table is impractical because the table itself becomes very, very large. State-of-the-art techniques use deep neural networks instead of the Q-table (deep reinforcement learning). The neural network takes state information and actions at its input layer and learns, over time, to output the right action. Deep learning techniques (like convolutional neural networks) are also used to interpret the pixels on the screen and extract information from the game (like scores), which then lets the agent control the game.
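
To give a rough sense of how quickly a Q-table grows, this sketch counts table entries for the Taxi environment and for a hypothetical environment whose state is a small grid of on/off pixels. The pixel example and the helper name `q_table_entries` are illustrative assumptions, not something taken from the tutorial.

```python
def q_table_entries(n_states, n_actions):
    """Number of Q-values a tabular agent must store."""
    return n_states * n_actions


# Taxi: 500 states x 6 actions.
print(q_table_entries(500, 6))  # 3000

# Hypothetical: a 10x10 grid of binary pixels with the same 6 actions.
n_pixel_states = 2 ** (10 * 10)
print(q_table_entries(n_pixel_states, 6) > 10 ** 30)  # True
```

Even this tiny pixel grid needs more entries than could ever be stored or visited, which is why a function approximator replaces the table in such settings.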