# Reinforcement Learning Introduction
---

[Gym](https://gym.openai.com/) is a toolkit for developing and comparing reinforcement learning algorithms. To install it, run:

`pip install gym`

The base install ships with a few environments, including the *classic control* ones used here. To install other environments:

`pip install gym[atari]`

and replace `atari` with the environment group you need (`atari`, `box2d`, etc.). You can also use:

`pip install gym[all]`

to install all the environments and dependencies. 

In [1]:
#remove " > /dev/null 2>&1" to see what is going on under the hood
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

!apt-get update > /dev/null 2>&1
!apt-get install cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[all] > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


#### Basic imports

In [2]:
import gym
from gym.wrappers import Monitor
from gym.wrappers.monitoring.video_recorder import VideoRecorder
import matplotlib.pyplot as plt

from IPython.display import HTML
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay
from base64 import b64encode

import numpy as np
import random, math, os, glob, io, base64

In [4]:
display = Display(visible=0, size=(1400, 900))
display.start()

<pyvirtualdisplay.display.Display at 0x7f8f72fe5a10>

In [5]:
def render_mp4(videopath: str) -> str:
  """
  Gets a string containing a base64-encoded version of the MP4 video
  at the specified path.
  """
  with open(videopath, 'rb') as video_file:
    mp4 = video_file.read()
  base64_encoded_mp4 = b64encode(mp4).decode()
  return f'<video width=400 autoplay loop controls><source src="data:video/mp4;' \
         f'base64,{base64_encoded_mp4}" type="video/mp4"></video>'

#### Load and run the *CartPole* environment

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. In the `CartPole-v0` implementation, the episode ends when the pole is more than 12 degrees from vertical, when the cart moves more than 2.4 units from the center, or after 200 timesteps.

In [6]:
# Loads the cartpole environment
env = gym.make('CartPole-v0')
before_training = "before_training.mp4"
video_before = VideoRecorder(env, before_training)
env.reset()

try:
    while True:
        env.render()
        video_before.capture_frame()

        # Takes a random action from the action space of the environment
        action = env.action_space.sample()
        
        # A step function returns, based on the taken action: 
        # observation: the state of the environment
        # reward: reward of the previous action
        # done: if the finish conditions were met
        # info: a dict for debugging
        observation, reward, done, info = env.step(action)
        
        if done:
            break
finally:
    video_before.close()
    env.close()

In [7]:
HTML(render_mp4(before_training))

We can query the environment for its possible actions and for the layout of its states. *Discrete(n)* means the agent can choose among *n* actions; for the CartPole game these are push left and push right. *Box* represents an n-dimensional array holding the variables that define the state of the game. For the CartPole game, those are:

- Cart Position [-4.8, 4.8]
- Cart Velocity (-inf, inf)
- Pole Angle [-0.418 rad, 0.418 rad] (about ±24°)
- Pole Angular Velocity (-inf, inf)

For more detail, visit the [CartPole](https://github.com/openai/gym/wiki/CartPole-v0) documentation.

In [None]:
print(env.action_space) 
print(env.observation_space)

Discrete(2)
Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)


## Value-based Reinforcement Learning - Q-learning
---
Q-learning is a model-free reinforcement learning algorithm. Its goal is to learn a policy that tells an agent which action to take in each state. It does not require a model of the environment (hence the term "model-free"), and it can handle problems with stochastic transitions and rewards without requiring adaptations.

We start with a table of states by actions: rows are the states and columns are the actions. Each cell holds the estimated maximum expected future reward for taking that action in that state. As training progresses we update this *Q-table*, and we can then read off the best action for each state (each row of the Q-table) by finding the highest value in that row.
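As a minimal sketch of that row lookup, assume a hypothetical two-state table stored as a dictionary of lists. The state names and the numbers are invented for illustration and do not come from a trained agent.

```python
# Hypothetical Q-table: one row per state, one column per action.
# The values are made up for illustration only.
q_rows = {
    "pole_leaning_left":  [0.8, 0.2],   # columns: [push left, push right]
    "pole_leaning_right": [0.1, 0.7],
}

def best_action(row):
    """Return the column index holding the highest Q-value in a row."""
    return max(range(len(row)), key=lambda a: row[a])

# The greedy policy simply picks the best column of each row
greedy_policy = {state: best_action(row) for state, row in q_rows.items()}
print(greedy_policy)
```

With a NumPy table, as used later in this notebook, the same lookup is `np.argmax(q_table[state])`.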

To update the table and approach the optimal Q-value function, we use the following functions:

- Q-value function: $Q^{\pi}(s,a)$ --> It returns the expected future reward of taking action $a$ in state $s$ and following policy $\pi$ afterwards.
- $Q^*(s,a) \approx Q(s, a)$ --> The optimal Q-value function for a state and an action, which the learned $Q(s,a)$ approximates.

The update rule for $Q(s,a)$ is:

$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \big( r_{t+1} + \gamma \max_a Q(s_{t+1},a) - Q(s_t, a_t)\big)$

where:

- $\alpha$ is the learning rate
- $\gamma$ is the discount factor
- $Q(s_t, a_t)$ is the previous value
- $r_{t+1}$ is the reward
- $\max_a Q(s_{t+1},a)$ is the estimate of the optimal future value
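To make the update concrete, here is a minimal sketch that applies it once to made-up numbers. The helper name `q_update` and the input values are assumptions chosen for illustration.

```python
def q_update(q_sa, reward, max_next_q, alpha, gamma):
    """Apply one Q-learning update and return the new value of Q(s_t, a_t)."""
    td_target = reward + gamma * max_next_q   # r_{t+1} + gamma * max_a Q(s_{t+1}, a)
    td_error = td_target - q_sa               # how far the old estimate is from the target
    return q_sa + alpha * td_error

# Illustrative values: old estimate 2.0, reward 1.0, best next-state value 5.0
new_q = q_update(q_sa=2.0, reward=1.0, max_next_q=5.0, alpha=0.1, gamma=0.9)
print(round(new_q, 4))
```

Here the target is 1.0 + 0.9 * 5.0 = 5.5, so with a learning rate of 0.1 the estimate moves one tenth of the way from 2.0 toward 5.5, to about 2.35.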

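The Q-learning algorithm is the following:

1. Initialize the Q-table
2. Choose an action $a$
3. Perform the action
4. Measure the reward
5. Update $Q(s,a)$

Steps 2 to 5 are repeated at every timestep until the episode ends, and the whole process is repeated over many episodes.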
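The steps can be sketched end to end on a tiny problem. The example below is not CartPole: it assumes a hypothetical five-cell corridor in which the agent starts in cell 0 and receives a reward of 1 on reaching cell 4. The environment, the hyperparameter values, and the step cap of 100 are all illustrative assumptions.

```python
import math
import random

random.seed(0)

N_STATES, N_ACTIONS = 5, 2          # actions: 0 = left, 1 = right
GOAL = N_STATES - 1
alpha, gamma = 0.5, 0.9
min_epsilon, max_epsilon, decay = 0.1, 1.0, 0.01

def step(state, action):
    """Hypothetical toy dynamics: move one cell, reward 1 on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

# 1. Initialize the Q-table with zeros
q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

for episode in range(300):
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay * episode)
    state, done, steps = 0, False, 0
    while not done and steps < 100:
        # 2. Choose an action (epsilon-greedy)
        if random.random() > epsilon:
            action = max(range(N_ACTIONS), key=lambda a: q[state][a])
        else:
            action = random.randrange(N_ACTIONS)
        # 3. and 4. Perform the action and measure the reward
        next_state, reward, done = step(state, action)
        # 5. Update Q(s, a)
        q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
        state, steps = next_state, steps + 1

# Greedy action for every non-terminal cell after training
greedy = [max(range(N_ACTIONS), key=lambda a: q[s][a]) for s in range(GOAL)]
print(greedy)
```

After training, the greedy action in every non-terminal cell should be "right", since that is the only way to reach the reward.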

### Exploration

We initialize the Q-table with zeros, so at first it gives no guidance on which action is best, and the agent has to try actions to obtain useful values. This is the exploration/exploitation trade-off. Here we use a decaying epsilon-greedy strategy:

- We specify an exploration rate, epsilon, which we set to 1 at the beginning. This is the fraction of steps taken at random. It starts at its highest value because we know nothing yet about the values in the Q-table, so we need to explore a lot by choosing actions randomly.
- At each step we generate a random number. If this number is greater than epsilon, we do exploitation (we use what we already know to select the best action). Otherwise, we do exploration.
- Epsilon is large at the beginning of training and is then reduced progressively as the agent becomes more confident in its Q-value estimates.
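A minimal sketch of this strategy, assuming an exponential decay schedule. The helper names `epsilon_at` and `choose_action`, and the default values 0.1, 1.0, and 0.01, are illustrative choices; other schedules (for example linear decay) would serve the same purpose.

```python
import math
import random

def epsilon_at(episode, min_epsilon=0.1, max_epsilon=1.0, decay=0.01):
    """Exponentially decay epsilon from max_epsilon toward min_epsilon."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay * episode)

def choose_action(q_row, epsilon, rng=random):
    """Epsilon-greedy: exploit when the random draw exceeds epsilon, else explore."""
    if rng.random() > epsilon:
        return max(range(len(q_row)), key=lambda a: q_row[a])  # exploitation
    return rng.randrange(len(q_row))                            # exploration

# Print how the exploration rate shrinks over training
for episode in (0, 100, 500):
    print(episode, round(epsilon_at(episode), 3))
```

With these values epsilon falls from 1.0 at episode 0 to roughly 0.43 by episode 100 and roughly 0.11 by episode 500, so late in training about one step in ten is still random.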

### Initialize Q-table

We need a table entry for each possible state and each possible action. However, in the CartPole problem, as in many others, the state space is continuous. To create a viable table, we define discrete states (buckets) that each cover a range of continuous values. Too many buckets give a large table that is slow to fill but distinguishes states finely; too few give a small table that lumps together states that may need different actions.

In [8]:
env = gym.make('CartPole-v0')

after_training = "after_training.mp4"
video_after = VideoRecorder(env, after_training)

env.reset()

array([ 0.00840537,  0.00648291,  0.03607303, -0.01202062])

In [None]:
# Ranges of each of the continuous state variables
print(env.observation_space.high)
print(env.observation_space.low)

[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
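The printed bounds are easier to read after a quick conversion. As a small sketch using only the numbers shown above: the third entry is the pole angle in radians, and the `3.4028235e+38` entries are the largest finite float32 value, which stands in for an unbounded velocity.

```python
import math

# Third entry of observation_space.high, in radians
pole_angle_bound_rad = 0.41887903
print(round(math.degrees(pole_angle_bound_rad), 2))

# Largest finite float32 value: (2 - 2**-23) * 2**127
float32_max = (2 - 2**-23) * 2.0**127
print(float32_max)
```

The angle bound works out to about 24 degrees. Because the velocity bounds are effectively infinite, they cannot be used directly to define buckets and need narrower, hand-picked limits.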


In [9]:
# Proposed number of buckets to hold the continuous values.
# We don't care much about the position and velocity of the cart, but we do care about
# the angle and angular velocity of the pole.
buckets = (1, 1, 6, 5,)
actions = (env.action_space.n,)

q_table = np.zeros((buckets + actions))
print(q_table.shape)

(1, 1, 6, 5, 2)


In [10]:
upper_bounds = [env.observation_space.high[0], 1.0, env.observation_space.high[2], 1.0]
lower_bounds = [env.observation_space.low[0], -1.0, env.observation_space.low[2], -1.0]

def discretize(obs):
    ratios = [(ob - lower_bounds[i]) / (upper_bounds[i] - lower_bounds[i]) for i, ob in enumerate(obs)]
    new_obs = [int(round((buckets[i] - 1) * ratios[i])) for i in range(len(obs))]
    new_obs = [min(buckets[i] - 1, max(0, new_obs[i])) for i in range(len(obs))]
    return tuple(new_obs)

In [None]:
print(discretize([-0.04534635, -0.01749441,  1.01300242, -0.00601015]))

(0, 0, 5, 2)
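To see where the last two indices come from, the per-dimension arithmetic can be traced in isolation. This is a sketch: `bucket_index` is a hypothetical helper that handles a single dimension, using the pole-angle bound printed earlier and the ±1.0 limit chosen for the angular velocity.

```python
def bucket_index(value, lower, upper, n_buckets):
    """Map a continuous value to a bucket index in [0, n_buckets - 1]."""
    ratio = (value - lower) / (upper - lower)          # 0.0 at lower, 1.0 at upper
    index = int(round((n_buckets - 1) * ratio))
    return min(n_buckets - 1, max(0, index))           # clip out-of-range values

# A pole angle of about 1.013 rad lies beyond the upper bound,
# so it is clipped to the last of the 6 buckets
angle_bucket = bucket_index(1.01300242, -0.41887903, 0.41887903, 6)

# An angular velocity near zero falls in the middle of the 5 buckets
velocity_bucket = bucket_index(-0.00601015, -1.0, 1.0, 5)

print(angle_bucket, velocity_bucket)
```

The cart position and cart velocity each have a single bucket, so their index is always 0 regardless of the observed value.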


#### Initialize parameters

In [33]:
episodes = 501

alpha = 0.1
gamma = 0.9

# Exploration
epsilon = 1.0
min_epsilon = 0.1
max_epsilon = 1.0
decay = 0.01

#### Q-learning

In [34]:
try:
    for episode in range(episodes):

        # Reset environment
        current_state = discretize(env.reset())
        
        # Decaying epsilon-greedy
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay * episode)
        
        done = False
        total_reward = 0
        
        while not done:

            exp_tradeoff = random.uniform(0,1)

            # Exploitation vs exploration
            # Exploitation
            if exp_tradeoff > epsilon:
                action = np.argmax(q_table[current_state])
            else: 
                # Exploration
                action = env.action_space.sample()

            observation, reward, done, info = env.step(action)
            new_state = discretize(observation)
            
            total_reward += reward
            
            # Update the q-table
            q_table[current_state][action] += alpha * (reward + gamma * np.max(q_table[new_state]) - q_table[current_state][action])

            current_state = new_state
        
        if not episode % 100:
            print("Episode:", episode, "Score:", total_reward)
            
finally:
  env.close()

Episode: 0 Score: 23.0
Episode: 100 Score: 48.0
Episode: 200 Score: 200.0
Episode: 300 Score: 42.0
Episode: 400 Score: 200.0
Episode: 500 Score: 200.0


In [35]:
# Test the trained q-table. Here we only use it to take the optimal action for
# a given state

try:
    current_state = discretize(env.reset())
    total_reward = 0

    while True:
        env.render()
        video_after.capture_frame()

        action = np.argmax(q_table[current_state])

        observation, reward, done, info = env.step(action)
        new_state = discretize(observation)
        
        total_reward += reward
        current_state = new_state
    
        if done:
            break
            
finally:
  print(f'Total reward: {total_reward}')
  video_after.close()
  env.close()

Total reward: 200.0


In [36]:
HTML(render_mp4(after_training))