
<h2 id="Code-is-from-this-video-and-this-article">Code is from <a href="https://youtu.be/q2ZOEFAaaI0">this video</a> and <a href="https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe">this article</a><a class="anchor-link" href="#Code-is-from-this-video-and-this-article">¶</a></h2>


In [1]:

import numpy as np
import gym
import random




<h2 id="The-Taxi-Problem">The Taxi Problem<a class="anchor-link" href="#The-Taxi-Problem">¶</a></h2>There are four designated locations in the grid world, indicated by R(ed), B(lue), G(reen), and Y(ellow). When the episode starts, the taxi is at a random square and the passenger is at a random location. The taxi must drive to the passenger's location, pick up the passenger, drive to the passenger's destination (another one of the four specified locations), and then drop off the passenger. Once the passenger is dropped off, the episode ends.
There are 500 discrete states, since there are 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is in the taxi), and 4 destination locations (25 × 5 × 4 = 500).
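The count of 500 comes from multiplying the sizes of the independent components. As a minimal sketch, the hypothetical helpers `encode` and `decode` below pack the four components into a single index and unpack it again. The ordering of the components is an assumption made for illustration and is not claimed to match the environment's internal numbering.

```python
# Hypothetical helpers: one possible way to map the four state components
# to a single integer in the range [0, 500).
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack (row, col, passenger location, destination) into one index."""
    index = taxi_row                     # 5 rows
    index = index * 5 + taxi_col         # 5 columns
    index = index * 5 + passenger_loc    # 4 locations + "in the taxi"
    index = index * 4 + destination      # 4 destinations
    return index


def decode(index):
    """Invert encode(): recover the four components from the index."""
    destination = index % 4
    index //= 4
    passenger_loc = index % 5
    index //= 5
    taxi_col = index % 5
    taxi_row = index // 5
    return taxi_row, taxi_col, passenger_loc, destination


n_states = 5 * 5 * 5 * 4
print(n_states)                          # 500
print(decode(encode(3, 1, 2, 0)))        # (3, 1, 2, 0)
```

Because every combination gets its own index, a Q-table with one row per state needs exactly 500 rows.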
Actions: 
There are 6 discrete deterministic actions:
<ul>
<li>0: move south</li>
<li>1: move north</li>
<li>2: move east </li>
<li>3: move west </li>
<li>4: pickup passenger</li>
<li>5: dropoff passenger</li>
</ul>
<p></p>
Rewards: 
There is a reward of -1 for each action and an additional reward of +20 for delivering the passenger. There is a reward of -10 for executing the "pickup" and "dropoff" actions illegally.
Rendering:
<ul>
<li>blue: passenger</li>
<li>magenta: destination</li>
<li>yellow: empty taxi</li>
<li>green: full taxi</li>
<li>other letters: locations</li>
</ul>


In [2]:

env = gym.make("Taxi-v2")
env.render()



[2019-11-22 22:55:06,155] Making new env: Taxi-v2


+---------+
|R: | : :[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|Y| : |[34;1mB[0m:[43m [0m|
+---------+




<h2 id="Initialize-needed-variables-and-creating-the-Q-table">Initializing the needed variables and creating the Q-table<a class="anchor-link" href="#Initialize-needed-variables-and-creating-the-Q-table">¶</a></h2>


In [3]:

print("Number of actions: %d" % env.action_space.n)
print("Number of states: %d" % env.observation_space.n)



Number of actions: 6
Number of states: 500


In [4]:

action_size = env.action_space.n
state_size = env.observation_space.n



In [5]:

qtable = np.zeros((state_size, action_size))
print(qtable)



[[ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 ..., 
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.]]


In [6]:

total_episodes = 50000
total_test_episodes = 5
max_steps = 99

learning_rate = 0.7
discount_rate = 0.9               #Also known as gamma

epsilon = 1.0
max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.01




<h2 id="What-is-discount-factor?">What is the discount factor?<a class="anchor-link" href="#What-is-discount-factor?">¶</a></h2>The discount factor γ determines how much weight the value function gives to future rewards. With γ=0, the state/action values represent only the immediate reward, while a higher discount factor such as γ=0.9 makes the values represent the cumulative discounted future reward the agent expects to receive (when behaving under a given policy).
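To make the effect of γ concrete, here is a minimal sketch that computes the discounted return of a short reward sequence for two values of γ. The helper `discounted_return` and the example rewards are hypothetical and chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a list of rewards."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: three ordinary moves, then a step worth +20
rewards = [-1, -1, -1, 20]

print(discounted_return(rewards, 0.0))   # only the first reward counts
print(discounted_return(rewards, 0.9))   # the later +20 is discounted but still dominates
```

With γ=0 the return is just the first reward (-1), whereas with γ=0.9 it is about 11.87, because the +20 received three steps later is scaled by 0.9³ = 0.729.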




<h2 id="Q-Learning-equation">Q-Learning equation<a class="anchor-link" href="#Q-Learning-equation">¶</a></h2>
<p></p>
\begin{equation*}
Q^{\text{new}}(s_t, a_t) = (1 - \alpha) \cdot Q(s_t, a_t) + \alpha \cdot \left( r_t + \gamma \cdot \max_{a} Q(s_{t+1}, a) \right)
\end{equation*}
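As a worked example of this update, the sketch below applies it once to a tiny hand-made table. The table contents and the transition are hypothetical; α=0.7 and γ=0.9 are the values used in this notebook. It also evaluates the algebraically equivalent incremental form, Q + α · (target − Q).

```python
alpha = 0.7   # learning rate
gamma = 0.9   # discount rate

# Hypothetical toy Q-table: 2 states x 2 actions
q = [[0.0, 0.0],
     [2.0, 5.0]]

# Hypothetical transition: in state 0, action 1 gives reward -1 and leads to state 1
state, action, reward, next_state = 0, 1, -1.0, 1

best_next = max(q[next_state])         # max over a of Q(s_{t+1}, a) -> 5.0
target = reward + gamma * best_next    # -1 + 0.9 * 5 = 3.5

# Weighted-average form, as in the equation
weighted = (1 - alpha) * q[state][action] + alpha * target
# Incremental form: move the old value a fraction alpha towards the target
incremental = q[state][action] + alpha * (target - q[state][action])

q[state][action] = weighted
print(weighted, incremental)           # both are approximately 2.45
```

The two forms differ only in how the terms are grouped, so they give the same new value up to floating-point rounding.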



<h2 id="Training-the-Q-table">Training the Q-table<a class="anchor-link" href="#Training-the-Q-table">¶</a></h2>


In [7]:

for episode in range(total_episodes):
    # Reset the environment at the start of every episode
    state = env.reset()
    done = False
    
    for step in range(max_steps):
        # Epsilon-greedy action selection: draw a random number in [0, 1]
        exp_exp_tradeoff = random.uniform(0, 1)
        
        # If the number is greater than epsilon, exploit:
        # take the action with the highest Q-value for this state
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state, :])
        # Otherwise explore: take a random action
        else:
            action = env.action_space.sample()
        
        # Take the action (a) and observe the new state (s') and reward (R)
        new_state, reward, done, info = env.step(action)
        
        # Update the Q-value for (s, a):
        # Q(s,a) = Q(s,a) + lr * [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + discount_rate * np.max(qtable[new_state, :]) - qtable[state, action])
        state = new_state
        
        # Stop once the episode has ended
        if done:
            break
    
    # Decay epsilon so the agent explores less as training progresses
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * (episode + 1))
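
To see how quickly exploration fades under this schedule, the sketch below evaluates the same decay formula at a few points. The helper name `epsilon_at` is hypothetical; its defaults are the hyperparameter values set earlier in this notebook.

```python
import math


def epsilon_at(n_episodes, min_epsilon=0.01, max_epsilon=1.0, decay_rate=0.01):
    """Exploration rate after n_episodes completed training episodes."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * n_episodes)


for n_episodes in (0, 100, 500, 1000):
    print(n_episodes, round(epsilon_at(n_episodes), 4))
```

With decay_rate = 0.01, epsilon falls to roughly 0.37 after 100 episodes and is within 0.0001 of its floor of 0.01 by episode 1000, so the large majority of the 50000 training episodes run almost entirely greedily.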




<h2 id="Testing-the-performance-of-the-Q-table">Testing the performance of the Q-table<a class="anchor-link" href="#Testing-the-performance-of-the-Q-table">¶</a></h2>


In [8]:

rewards = []

for episode in range(total_test_episodes):
    state = env.reset()
    done = False
    total_rewards = 0
    print("******************************************************************")
    print("EPISODE ", episode)
    
    for step in range(max_steps):
        env.render()
        # Act greedily: always take the action with the highest Q-value
        action = np.argmax(qtable[state, :])
        new_state, reward, done, info = env.step(action)
        total_rewards += reward
        
        if done:
            env.render()
            break
        state = new_state
    
    # Record the score even if the episode reached max_steps without finishing
    rewards.append(total_rewards)
    print("Score: ", total_rewards)

env.close()
print("Mean score over time: " + str(sum(rewards) / len(rewards)))



******************************************************************
EPISODE  0
+---------+
|[34;1mR[0m: | : :[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|Y| : |[43mB[0m: |
+---------+

+---------+
|[34;1mR[0m: | : :[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B:[43m [0m|
+---------+
  (East)
+---------+
|[34;1mR[0m: | : :[35mG[0m|
| : : : : |
| : : : : |
| | : | :[43m [0m|
|Y| : |B: |
+---------+
  (North)
+---------+
|[34;1mR[0m: | : :[35mG[0m|
| : : : : |
| : : : : |
| | : |[43m [0m: |
|Y| : |B: |
+---------+
  (West)
+---------+
|[34;1mR[0m: | : :[35mG[0m|
| : : : : |
| : : :[43m [0m: |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|[34;1mR[0m: | : :[35mG[0m|
| : : : : |
| : :[43m [0m: : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|[34;1mR[0m: | : :[35mG[0m|
| : :[43m [0m: : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|[34;1mR[0m: | : :[35mG[0m|
| :[43m [0m: : : |
| : : : 