![Nuclio logo](https://nuclio.school/wp-content/uploads/2018/12/nucleoDS-newBlack.png)

# Reinforcement Learning with Q-learning - The autonomous taxi

Our first reinforcement learning example that goes beyond the multi-armed bandit.

## 1. Let's install the libraries in Colab to run Gym and show the recorded videos
More details in this sample notebook from the Colab team: https://colab.research.google.com/drive/18LdlDDT87eb8cCTHZsXyS9ksQPzL3i6H#scrollTo=7wY4qZhPXotR

In [1]:
!pip install gym
!apt-get install python-opengl -y
!apt install xvfb -y

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Suggested packages:
  libgle3
The following NEW packages will be installed:
  python-opengl
0 upgraded, 1 newly installed, 0 to remove and 37 not upgraded.
Need to get 496 kB of archives.
After this operation, 5,416 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 python-opengl all 3.1.0+dfsg-1 [496 kB]
Fetched 496 kB in 1s (681 kB/s)
Selecting previously unselected package python-opengl.
(Reading database ... 155229 files and directories currently installed.)
Preparing to unpack .../python-opengl_3.1.0+dfsg-1_all.deb ...
Unpacking python-opengl (3.1.0+dfsg-1) ...
Setting up python-opengl (3.1.0+dfsg-1) ...
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  xvfb
0 upgraded, 1 newly installed, 0 to remove and 37 not upgraded.
Need to get 784 kB of 

In [2]:
!pip install gym[atari]



To render the environment we can use pyvirtualdisplay, so let's install it.

In [3]:
!pip install pyvirtualdisplay
!pip install piglet

Collecting pyvirtualdisplay
  Downloading PyVirtualDisplay-2.2-py3-none-any.whl (15 kB)
Collecting EasyProcess
  Downloading EasyProcess-0.3-py2.py3-none-any.whl (7.9 kB)
Installing collected packages: EasyProcess, pyvirtualdisplay
Successfully installed EasyProcess-0.3 pyvirtualdisplay-2.2
Collecting piglet
  Downloading piglet-1.0.0-py2.py3-none-any.whl (2.2 kB)
Collecting piglet-templates
  Downloading piglet_templates-1.2.0-py3-none-any.whl (67 kB)
Installing collected packages: piglet-templates, piglet
Successfully installed piglet-1.0.0 piglet-templates-1.2.0


In [4]:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

<pyvirtualdisplay.display.Display at 0x7f76bef3ca90>

In [5]:
import os
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY"))==0:
    !bash ../xvfb start
    %env DISPLAY=:1

In [6]:
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) # error only
import tensorflow as tf
import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
from IPython.display import HTML

from IPython import display as ipythondisplay

The following two functions record the gym environments on video and display the result.

To activate recording, just call `env = wrap_env(env)`.

In [7]:
def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

## 2. The Taxi environment

In [8]:
env = gym.make("Taxi-v3").env

env.render()

+---------+
|[35mR[0m: | : :[34;1mG[0m|
| : | : : |
| : : : : |
| | : | : |
|[43mY[0m| : |B: |
+---------+



This looks much like what we have been seeing in the slides. The only difference (which was like this in the v2 version of Taxi) is that the taxi is in position R instead of (1,3), as we had seen before.

The core of the gym interface is the environment object, <code> env </code>. These are the methods we will be using:

<ul>
   <li> <code> env.reset() </code> Resets the environment and returns a random initial state </li>
   <li> <code> env.step(action) </code> Advances the environment by one time step. It returns:
   <ul>
     <li> <b> observation </b>: Observations of the environment </li>
     <li> <b> reward </b>: The reward we collect </li>
     <li> <b> done </b>: Indicates whether we have successfully dropped off the passenger, which ends what we will call an <i> episode </i> </li>
     <li> <b> info </b>: Additional information such as performance or latency for debugging </li>
   </ul>
   </li>
   <li> <code> env.render() </code> Draws a frame of the environment (very useful to get an idea of what is going on) </li>
</ul>

Additional note: we use <code> gym.make("Taxi-v3").<b>env</b> </code> to get the unwrapped environment, so that an episode is not cut off after 200 steps, which is what Gym does by default for this environment.

### Let's remember the problem

We have 4 locations (marked with different letters), and our job is to pick up a passenger at one of them and drop them off at another. We receive 20 points for a successful delivery and lose 1 point for each time step (to encourage short journeys). There is also a 10 point penalty for illegal pick-ups or drop-offs at the wrong locations.

In [9]:
env.reset()
env.render()

+---------+
|R: | : :G|
| : | : : |
| :[43m [0m: : : |
| | : | : |
|[34;1mY[0m| : |[35mB[0m: |
+---------+



Notice that now the taxi is in another location.

More things from the environment:
* The yellow box is the taxi
* The pipes (|) are walls or hedges that the taxi cannot cross
* R, G, Y, B are the pick-up and drop-off points. The color <font color = "blue"> <b> blue </b> </font> indicates the pick-up point, the color <font color = "purple"> <b> purple </b> </font> indicates the drop-off point.

In [11]:
print("Action space {}".format(env.action_space))
print("State space {}".format(env.observation_space))

Action space Discrete(6)
State space Discrete(500)


Notice that the action space has size 6 and the state space has size 500.

The actions are numbered from 0 to 5, with these meanings:

* 0 = south
* 1 = north
* 2 = east
* 3 = west
* 4 = pick up passenger
* 5 = leave passenger

Indices of the pick-up and drop-off points, with their (row, column) coordinates:

* 0 = R (0,0)
* 1 = G (0,4)
* 2 = Y (4,0)
* 3 = B (4,3)

First of all, let's take a look at the position (3,1) that we had seen and inspect the information of that state, with a passenger waiting at location 2 (<font color = "blue"> <b> Y </b> </font>) and destination 0 (<font color = "purple"> <b> R </b> </font>).

In [12]:
state = env.encode(3, 1, 2, 0) # (taxi row, taxi column, passenger position index, destination position index)
print("State:", state)

env.s = state # We can set the current state to whichever one we want
env.render()

State: 328
+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : : : |
| |[43m [0m: | : |
|[34;1mY[0m| : |B: |
+---------+
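The state number 328 is not arbitrary. Below is a minimal sketch of how the four components can be packed into a single integer, assuming a mixed-radix layout with 5 rows, 5 columns, 5 passenger positions (the four letters plus "inside the taxi", assumed here to be index 4) and 4 destinations. The helpers `encode_state` and `decode_state` are hypothetical stand-ins written for illustration; in the notebook the real call is `env.encode`.

```python
# Hypothetical re-implementation for illustration; the notebook itself uses env.encode.
N_ROWS, N_COLS = 5, 5
N_PASSENGER_POSITIONS = 5   # R, G, Y, B plus "inside the taxi" (assumed index 4)
N_DESTINATIONS = 4          # R, G, Y, B


def encode_state(taxi_row, taxi_col, passenger_idx, destination_idx):
    """Pack the four components into one integer, most significant first."""
    state = taxi_row
    state = state * N_COLS + taxi_col
    state = state * N_PASSENGER_POSITIONS + passenger_idx
    state = state * N_DESTINATIONS + destination_idx
    return state


def decode_state(state):
    """Undo encode_state by peeling off the components in reverse order."""
    state, destination_idx = divmod(state, N_DESTINATIONS)
    state, passenger_idx = divmod(state, N_PASSENGER_POSITIONS)
    taxi_row, taxi_col = divmod(state, N_COLS)
    return taxi_row, taxi_col, passenger_idx, destination_idx


print(encode_state(3, 1, 2, 0))   # 328
print(decode_state(328))          # (3, 1, 2, 0)
print(N_ROWS * N_COLS * N_PASSENGER_POSITIONS * N_DESTINATIONS)   # 500
```

Under this assumption the product 5 x 5 x 5 x 4 = 500 accounts for the size of the state space reported above.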



### The reward table

Our environment has 500 states, numbered from 0 to 499. It would be interesting to check whether the rewards we laid out in the presentation are the same as the ones Gym uses.

In [13]:
env.P[328]

{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]}

The keys of the dictionary for state 328 are the actions we can take from this point. The entries follow a very clear structure:

<code> {action: [(probability, next state, reward, done)]} </code>

As you can see, in this environment every action has a <code> probability </code> of 1.0 (100%): transitions are deterministic, so taking an action always leads to the same outcome.

<code> next state </code> tells us which state we would end up in if we took the action given by the key.

The <code> reward </code> is shown in the third position, with values of -1 for an ordinary time step and -10 if the taxi tries to pick up or drop off a passenger illegally. If we looked at the state where the taxi is at the correct destination with the passenger inside, a nice 20 would appear as the <code> reward </code> of action 5 (leave passenger).

The <code> done </code> flag tells us that we have dropped off a passenger at the correct location. Each successful delivery ends an **episode**.
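To make the structure concrete, here is a small sketch that reads the table for state 328 without needing Gym. The dictionary is copied from the `env.P[328]` output above; the `summarise` helper and the `ACTION_NAMES` list are hypothetical names introduced only for this illustration.

```python
# Copy of the env.P[328] output: {action: [(probability, next_state, reward, done)]}
P_328 = {
    0: [(1.0, 428, -1, False)],
    1: [(1.0, 228, -1, False)],
    2: [(1.0, 348, -1, False)],
    3: [(1.0, 328, -1, False)],
    4: [(1.0, 328, -10, False)],
    5: [(1.0, 328, -10, False)],
}
ACTION_NAMES = ["south", "north", "east", "west", "pick up", "leave"]
CURRENT_STATE = 328


def summarise(transitions, current_state):
    """Return (action, expected reward, whether the state changes) for each action."""
    rows = []
    for action, outcomes in transitions.items():
        expected_reward = sum(prob * reward for prob, _, reward, _ in outcomes)
        moved = any(next_state != current_state for _, next_state, _, _ in outcomes)
        rows.append((action, expected_reward, moved))
    return rows


summary = summarise(P_328, CURRENT_STATE)
for action, expected_reward, moved in summary:
    print(f"{action} ({ACTION_NAMES[action]}): "
          f"expected reward {expected_reward}, changes state: {moved}")
```

Going west leaves the state at 328 because of the wall to the left of the taxi, while picking up or leaving a passenger here is illegal and costs 10 points.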

## 3. What if we let the taxi do it alone (no Reinforcement Learning)

Let's try brute force first. The taxi acts without learning anything, and the only feedback it gets is the rewards from the <code> P </code> table.

The idea is to set up a loop that won't stop until the taxi drops a passenger at their destination (a single **episode**), that is, until we receive a reward of 20.

The <code> env.action_space.sample() </code> method picks an action at random from all possible actions.

Let's see what happens ...

In [19]:
env.s = 328

epochs = 0
punishment, reward = 0, 0

total_epochs = 0
total_punishments = 0

frames = [] # for animation !!

terminal_episode = False

while not terminal_episode:
    action = env.action_space.sample()
    state, reward, terminal_episode, info = env.step(action)

    if reward == -10:
      punishment += 1
    
    # We put each rendered frame inside a dictionary for animation
    frames.append ({
        'frame': env.render(mode = 'ansi'),
        'episode': epochs,
        'state': state,
        'action': action,
        'reward': reward
        }
    )
    epochs += 1

total_punishments += punishment
total_epochs += epochs
episodes = 1
    
print("Time steps used: {}".format(epochs))
print("Accumulated penalties: {}".format(punishment))

random_halftime = total_epochs / episodes
average_random_punishment = total_punishments / episodes

print(f"Results after {episodes} episodes:")
print(f"Average time per episode: {random_halftime}")
print(f"Average of punishments per episode: {average_random_punishment}")

Time steps used: 1342
Accumulated penalties: 436
Results after 1 episodes:
Average time per episode: 1342.0
Average of punishments per episode: 436.0
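A quick sanity check on these numbers, using only the figures printed above. Two of the six actions are pick-up and drop-off, which are illegal in almost every state, so a uniformly random policy should be penalized on roughly a third of its steps. The total return below assumes the reward scheme described earlier (-1 per ordinary step, -10 per penalty, +20 for the final delivery); the figures come from this single random run and will differ on every execution.

```python
steps = 1342       # time steps reported above
penalties = 436    # -10 penalties reported above

penalty_share = penalties / steps
random_share = 2 / 6   # pick-up and drop-off out of six equally likely actions

# Every step costs -1 except penalized steps (-10) and the final delivery (+20).
ordinary_steps = steps - penalties - 1
total_return = ordinary_steps * -1 + penalties * -10 + 20

print(f"Penalty share: {penalty_share:.3f} (share of pick-up/drop-off actions: {random_share:.3f})")
print(f"Total return of the episode: {total_return}")
```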


### Let's format the results in a nice animation

In [20]:
from IPython.display import clear_output
from time import sleep

def visualise_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Episode: {frame['episode']}")
        print(f"Time: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)

visualise_frames(frames)

+---------+
|[35m[34;1m[43mR[0m[0m[0m: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Episode: 1341
Time: 1342
State: 0
Action: 5
Reward: 20


It took the taxi an eternity to deliver the passenger to the destination: 1342 time steps and 436 penalties.

## 4. Q-Learning's turn

Remember that we now have to update the values of Q(s, a) in the Q-table, because that table is the guide from which we will get the best actions for our agent, the taxi.

Higher or better values of Q will indicate where we have to go with our taxi. In other words, if the taxi is in a state where a passenger is waiting for it, it is quite likely that the values for <code> pick up passenger </code> are higher than the rest of the available actions.
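For reference, the update applied to the Q-table after each step can be written as follows, where $\alpha$ is the learning rate, $\gamma$ the discount factor, $r$ the reward received, and $s'$ the state reached after taking action $a$ in state $s$. This is the standard Q-learning rule, and it matches the `new_value` line of the training code in section 4.2:

$$
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)
$$

With $\alpha$ close to 0 the old estimate dominates; with $\alpha$ close to 1 the new target $r + \gamma \max_{a'} Q(s', a')$ replaces it almost entirely.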

### 4.1 Initializing the Q-table to a 500x6 matrix filled with zeros

In [21]:
import numpy as np
q_table = np.zeros([env.observation_space.n, env.action_space.n])

print(q_table.shape)

(500, 6)


We time the cell and import what we need.

In [22]:
%%time

from IPython.display import clear_output

CPU times: user 10 µs, sys: 2 µs, total: 12 µs
Wall time: 16.5 µs


### 4.2 Let's build the code for training

We define the update parameters of the Q-table (alpha, gamma) and epsilon... yes, our epsilon-greedy friend!

In [23]:
alpha = 0.1 # Our learning rate
gamma = 0.6 # Our reward discount
epsilon = 0.1 # greedy epsilon
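To get a feel for these values, here is a minimal worked example of a single Q-table update with the same alpha and gamma. The `q_update` helper is a hypothetical name used only here; it applies the same formula as the training loop further down.

```python
alpha = 0.1   # learning rate, as in the cell above
gamma = 0.6   # discount factor, as in the cell above


def q_update(old_value, reward, next_max):
    """One Q-learning update: blend the old estimate with the new target."""
    return (1 - alpha) * old_value + alpha * (reward + gamma * next_max)


# First ordinary step from an all-zero table: only the -1 step cost is learned.
first = q_update(old_value=0.0, reward=-1, next_max=0.0)
# A successful delivery from an all-zero table: a tenth of the +20 is absorbed.
delivery = q_update(old_value=0.0, reward=20, next_max=0.0)
# The same ordinary step again, when the best value of the next state is -0.1.
second = q_update(old_value=first, reward=-1, next_max=-0.1)

print(round(first, 3), round(delivery, 3), round(second, 3))   # -0.1 2.0 -0.196
```

With alpha = 0.1 each update moves the estimate only 10% of the way towards the new target, which is why many episodes are needed before the values settle.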

We define two lists to be able to collect metrics

In [24]:
all_epochs = []
all_penalties = []

We set up the training loop over episodes; inside it goes the loop over the time steps of each episode.

In [25]:
for i in range(1, 100001):
  state = env.reset() # We start the environment in a random position each time

  epochs, punishment, reward = 0, 0, 0

  terminal_episode = False

  while not terminal_episode:
    # Here goes our friend epsilon-greedy, to decide whether we explore or exploit
    if np.random.random() < epsilon:
      action = env.action_space.sample() # We explore the action space
    else:
      action = np.argmax(q_table[state]) # We exploit what we have learned

    next_state, reward, terminal_episode, info = env.step(action)

    # Let's update the Q-table entry for (state, action) after seeing what happened
    # We store the previous value from the Q-table
    old_value = q_table[state, action]

    # For the next state, we keep the highest value among its actions in the Q-table
    next_max = np.max(q_table[next_state])

    # We compute the updated value of Q(s, a)
    new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

    q_table[state, action] = new_value

    if reward == -10:
      punishment += 1

    state = next_state
    epochs += 1

  if i % 100 == 0:
    clear_output(wait=True)
    print(f"Episode: {i}")



print("Training finished.\n")

Episode: 100000
Training finished.


After the training is finished, let's look at the values the taxi has learned for each available action in state 328.

In [26]:
q_table[328]

array([ -2.39618228,  -2.27325184,  -2.40921663,  -2.3577794 ,
       -10.60114447, -10.95515178])

Action 1 (go north) has the highest Q-value, so it is the action the taxi prefers in this state.
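To check the training loop without installing Gym, here is a minimal sketch of the same epsilon-greedy Q-learning loop on a made-up environment: a corridor of five cells where the agent starts on the left and is rewarded for reaching the right end. The corridor, its rewards (borrowed from the taxi's -1 per step and +20 for success) and all function names are assumptions for illustration; only the update rule and the hyperparameters come from the training cell above.

```python
import random

N_STATES, GOAL = 5, 4                     # hypothetical corridor: cells 0..4, goal on the right
LEFT, RIGHT = 0, 1
alpha, gamma, epsilon = 0.1, 0.6, 0.1     # same hyperparameters as the taxi training


def step(state, action):
    """Move one cell (walls clamp the position); return (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == LEFT else min(GOAL, state + 1)
    if next_state == GOAL:
        return next_state, 20, True
    return next_state, -1, False


def greedy_action(q_table, state):
    """Best known action for a state; ties go to the first action (LEFT)."""
    return max((LEFT, RIGHT), key=lambda a: q_table[state][a])


def train(episodes=2000, seed=0):
    rng = random.Random(seed)
    q_table = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice([LEFT, RIGHT])        # explore
            else:
                action = greedy_action(q_table, state)    # exploit
            next_state, reward, done = step(state, action)
            next_max = max(q_table[next_state])
            q_table[state][action] = ((1 - alpha) * q_table[state][action]
                                      + alpha * (reward + gamma * next_max))
            state = next_state
    return q_table


q_table = train()
greedy_policy = [greedy_action(q_table, s) for s in range(GOAL)]
print(greedy_policy)
```

After training, the greedy policy should choose `RIGHT` in every non-terminal cell, just as the taxi's Q-table ends up preferring the action that leads towards the passenger.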

### 4.3 Let's evaluate how our agent does it

Now we don't explore anymore, so we remove the epsilon-greedy part and pick actions directly from the Q-table values.

In [None]:
total_epochs, total_punishments = 0, 0
episodes = 100

frames = [] # for animation

for i in range(episodes):
    state = env.reset()
    epochs, punishment, reward = 0, 0, 0
    
    terminal_episode = False
    
    while not terminal_episode:
        action = np.argmax(q_table[state])
        state, reward, terminal_episode, info = env.step(action)

        if reward == -10:
          punishment += 1
        
        epochs += 1

        # We put each rendered frame inside a dictionary for animation
        frames.append ({
          'frame': env.render(mode = 'ansi'),
          'episode': i,
          'state': state,
          'action': action,
          'reward': reward
          }
        )

    total_punishments += punishment
    total_epochs += epochs

qlearning_time = total_epochs / episodes
average_punishment_qlearning = total_punishments / episodes

print(f"Results after {episodes} episodes:")
print(f"Average time per episode: {qlearning_time}")
print(f"Average punishments per episode: {average_punishment_qlearning}")

In [None]:
visualise_frames(frames)