
# Reinforcement Learning
### env: _taxi-v2_

In [1]:
import gym

## Loading and initializing an environment

In [2]:
env = gym.make('Taxi-v2')
env.reset()

[2017-10-17 14:16:55,944] Making new env: Taxi-v2


228

### Observation states

In [3]:
print('Total number of states = {:,}'.format(env.observation_space.n))

Total number of states = 500


### Visualizing the state
In this environment, the yellow square represents the taxi, each pipe character (“|”) represents a wall, the blue letter marks the pick-up location, and the purple letter marks the drop-off location. The taxi turns green when it has a passenger aboard. While we see colors and shapes, the algorithm does not perceive the environment as we do: it only receives a flattened state, in this case a single integer. In the plain-text output below, the colors survive only as ANSI codes: `[43m` (yellow background) marks the taxi, `[34;1m` (blue) the pick-up location, and `[35m` (magenta) the drop-off location.

In [4]:
env.render()

+---------+
|[35mR[0m: | : :G|
| : : : : |
| :[43m [0m: : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
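To make the "flattened state" idea concrete, here is a minimal sketch, assuming the integer packs (taxi row, taxi column, passenger location, destination) in that order. This is consistent with the 500 states (5 x 5 x 5 x 4) and with state 228 rendered above, but the function names are illustrative and not part of the Gym API.

```python
# Location indices assumed here: R=0, G=1, Y=2, B=3 (a passenger index of 4 means "in the taxi").
LOCATIONS = ["R", "G", "Y", "B"]


def encode_state(taxi_row, taxi_col, passenger, destination):
    """Pack the four components into a single integer in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger) * 4 + destination


def decode_state(state):
    """Unpack an integer state into (taxi_row, taxi_col, passenger, destination)."""
    state, destination = divmod(state, 4)
    state, passenger = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger, destination


print(decode_state(228))  # (2, 1, 2, 0): taxi at row 2, column 1; passenger at Y; destination R
```

Rows and columns are counted from zero starting at the top-left corner, so the decoded values line up with the grid printed by `env.render()`.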



### Action space
The action space reports a total of six available actions. Gym will not always tell you what these actions mean, but in this case the six possible actions are: down (0), up (1), right (2), left (3), pick-up (4), and drop-off (5).

In [5]:
print('Total number of actions the agent can carry out = {:,}'.format(env.action_space.n))

Total number of actions the agent can carry out = 6
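The action numbering can be kept readable with a small lookup table. The sketch below is only an illustration: the names come from the list above, and the hypothetical `move` helper handles only the outer border of the 5 x 5 grid, ignoring the interior walls that the real environment also enforces.

```python
# Action indices as described in the text above.
ACTION_NAMES = {0: "down", 1: "up", 2: "right", 3: "left", 4: "pick-up", 5: "drop-off"}

# Row/column offsets for the four movement actions (row 0 is the top row).
MOVES = {0: (1, 0), 1: (-1, 0), 2: (0, 1), 3: (0, -1)}


def move(row, col, action, size=5):
    """Apply a movement action, clamping at the outer border only.

    Simplification: interior walls are NOT modelled here.
    Non-movement actions (pick-up, drop-off) leave the position unchanged.
    """
    d_row, d_col = MOVES.get(action, (0, 0))
    new_row = min(max(row + d_row, 0), size - 1)
    new_col = min(max(col + d_col, 0), size - 1)
    return new_row, new_col


print(ACTION_NAMES[1], move(1, 0, 1))  # up (0, 0)
print(ACTION_NAMES[3], move(0, 0, 3))  # left (0, 0): blocked by the border
```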


### Overriding and moving the agent state

Calling `env.step(action)` returns four variables: the new state (S_{t+1} = 14 after moving up from state 114), the reward (R_{t+1} = -1), a boolean stating whether the episode is done, and extra info for debugging. Every Gym environment returns these same four variables after an action is taken, as they are the core variables of a reinforcement learning problem.

Take a look at the rendered environment. Once the taxi has moved up into the top-left corner (state 14), what would you expect the environment to return if you moved left? It would, of course, return exactly the same state and reward as before, because the taxi runs into the wall and stays put. The environment gives a -1 reward for every step so that the agent tries to find the quickest solution possible. If you were measuring your total accumulated reward, constantly running into a wall would heavily penalize your final total. The environment also gives a -10 reward every time you incorrectly pick up or drop off a passenger.



In [6]:
env.env.s = 114  # override the current state of the agent (taxi at row 1, column 0)
env.render()

+---------+
|R: | : :G|
|[43m [0m: : : : |
| : : : : |
| | : | : |
|[35mY[0m| : |[34;1mB[0m: |
+---------+



In [7]:
step = env.step(1)  # move up (1)
print(step)
env.render()

(14, -1, False, {'prob': 1.0})
+---------+
|[43mR[0m: | : :G|
| : : : : |
| : : : : |
| | : | : |
|[35mY[0m| : |[34;1mB[0m: |
+---------+
  (North)


In [8]:
step = env.step(0)  # move down (0)
print(step)
env.render()

(114, -1, False, {'prob': 1.0})
+---------+
|R: | : :G|
|[43m [0m: : : : |
| : : : : |
| | : | : |
|[35mY[0m| : |[34;1mB[0m: |
+---------+
  (South)


In [9]:
step = env.step(2)  # move right (2)
print(step)
env.render()

(134, -1, False, {'prob': 1.0})
+---------+
|R: | : :G|
| :[43m [0m: : : |
| : : : : |
| | : | : |
|[35mY[0m| : |[34;1mB[0m: |
+---------+
  (East)


In [10]:
step = env.step(3)  # move left (3)
print(step)
env.render()

(114, -1, False, {'prob': 1.0})
+---------+
|R: | : :G|
|[43m [0m: : : : |
| : : : : |
| | : | : |
|[35mY[0m| : |[34;1mB[0m: |
+---------+
  (West)


In [11]:
step = env.step(1)  # move up (1)
print(step)
env.render()

(14, -1, False, {'prob': 1.0})
+---------+
|[43mR[0m: | : :G|
| : : : : |
| : : : : |
| | : | : |
|[35mY[0m| : |[34;1mB[0m: |
+---------+
  (North)
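The tuples printed above can be unpacked directly into state, reward, done flag, and info. As a small sketch, the snippet below replays the five recorded results (copied from the outputs above, so no environment is needed) and accumulates the reward the way an agent would.

```python
# Results recorded above for the actions up, down, right, left, up.
recorded_steps = [
    (14, -1, False, {'prob': 1.0}),   # up
    (114, -1, False, {'prob': 1.0}),  # down
    (134, -1, False, {'prob': 1.0}),  # right
    (114, -1, False, {'prob': 1.0}),  # left
    (14, -1, False, {'prob': 1.0}),   # up
]

total_reward = 0
for next_state, reward, done, info in recorded_steps:
    total_reward += reward  # every step costs -1
    if done:                # stop once the episode has ended
        break

print(next_state, total_reward, done)  # 14 -5 False
```

Five moves that end up one cell away from the start already cost 5 points, which is why wandering or bumping into walls is penalized so heavily over a long episode.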


## Random Actions

One surprising way you could solve this environment is to choose randomly among the six possible actions. The environment is considered solved when you successfully pick up a passenger and drop them off at their desired location. Upon doing this, you will receive a reward of 20 and done will equal True. The odds are small, but it’s still possible, and given enough random actions you will eventually luck out. A core part of evaluating any agent’s performance is to compare it to a completely random agent. In a Gym environment, you can choose a random action using **`env.action_space.sample()`**. You can create a loop that will do random actions until the environment is solved. We will put a counter in there to see how many steps it takes to solve the environment.

In [12]:
state = env.reset()
total_reward = 0
reward = None
counter = 0

# The environment is solved when the final reward of 20 is received
while reward != 20:
    random_action = env.action_space.sample()
    state, reward, done, info = env.step(random_action)
    total_reward += reward
    counter += 1
    # Let's finish this after 10,000 attempts
    if counter > 10000:
        break
print(counter)

KeyboardInterrupt: 

In [15]:
print('Okay, I give up! After {:,} attempts we still haven\'t found a solution :('.format(counter))

Okay, I give up! After 17,800,285 attempts we still haven't found a solution :(
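One way to make the random baseline easier to reuse is to wrap it in a function with a hard step cap. This is a sketch, not the notebook's original code: `run_random_agent` is a hypothetical helper, it assumes the four-value `env.step` return used throughout this notebook, and it resets the environment whenever an episode ends without the final reward of 20.

```python
def run_random_agent(env, max_steps=10000):
    """Take random actions until the reward of 20 is received or max_steps is reached.

    Returns (steps_taken, total_reward, solved).
    """
    env.reset()
    total_reward = 0
    for steps in range(1, max_steps + 1):
        action = env.action_space.sample()           # pick one of the available actions at random
        state, reward, done, info = env.step(action)
        total_reward += reward
        if reward == 20:                             # successful drop-off
            return steps, total_reward, True
        if done:                                     # episode ended without success: start over
            env.reset()
    return max_steps, total_reward, False
```

With the environment created earlier, usage would look like `steps, total, solved = run_random_agent(env)`. The cap guarantees that the loop terminates even when the agent never gets lucky.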


## Q-action value table
You may luck out and solve the environment fairly quickly, but on average a completely random policy needs roughly 2,000 steps or more. To maximize our reward, the algorithm has to remember its actions and their associated rewards. In this case, the algorithm’s memory is going to be a Q action-value table.

To manage this Q table, we will use a NumPy array. The size of this table will be the number of states (500) by the number of possible actions (6).

In [16]:
import numpy as np

In [17]:
Q = np.zeros(shape=[env.observation_space.n, env.action_space.n])
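As a quick illustration of how such a table is read and written, the sketch below builds a table of the same shape and looks up the best-known action for one state. The value written into the table is made up purely for demonstration.

```python
import numpy as np

n_states, n_actions = 500, 6             # sizes reported by the environment above
Q = np.zeros((n_states, n_actions))      # one row per state, one column per action

state = 114
Q[state, 1] = 0.5                        # hypothetical value for action 1 (up), for illustration only

best_action = int(np.argmax(Q[state]))   # greedy choice: the column with the highest value
print(Q.shape, best_action)              # (500, 6) 1
```

Note that `np.argmax` returns the first index when several actions tie, so on an all-zero row it picks action 0.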

Over multiple episodes of trying to solve the problem, we will be updating our Q values, slowly improving our algorithm’s efficiency and performance. We will also want to track our total accumulated reward for each episode, which we will define as G.

In [18]:
G = 0  # total accumulated reward for each episode

As in most machine learning problems, we also need a learning rate. I will use my personal favorite of 0.618, the reciprocal of the mathematical constant phi, also known as the _golden ratio_ (phi ≈ 1.618, so 1/phi ≈ 0.618).

In [19]:
alpha = 0.618  # learning rate
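As a rough illustration of what a learning rate does, the sketch below checks where 0.618 comes from and then nudges an estimate toward a target a few times. The `blend` helper and the target value are assumptions for demonstration, not the update rule used by the algorithm in the next section.

```python
import math

phi = (1 + math.sqrt(5)) / 2   # golden ratio, about 1.618
alpha = round(1 / phi, 3)      # its reciprocal, rounded to three decimals
print(alpha)                   # 0.618


def blend(old, target, alpha):
    """Move an estimate a fraction alpha of the way toward a target."""
    return old + alpha * (target - old)


estimate = 0.0
for _ in range(3):
    estimate = blend(estimate, -1.0, alpha)  # hypothetical target of -1 (the per-step reward)
```

A larger alpha makes the estimate track new information faster, while a smaller alpha averages over more past experience.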

### Basic Q Learning algorithm