# RL example: the CartPole environment

Wow, you've made it this far, congratulations! This notebook will show you how to solve a simple environment with an RL algorithm called _Deep Q-Learning_. We will use Keras to code it.

But first, we need to learn about a couple more things.

## Simulating environments with OpenAI's Gym

`Gym` is a library by OpenAI that provides ready-made environments for us to test our algorithms and allows us to define new environments if needed.

We will use the `CartPole-v1` environment. Its predecessor `CartPole-v0`, which differs only in its episode length limit and "solved" threshold, is described [in this link](https://gym.openai.com/envs/CartPole-v0/).

The basic workflow with any Gym environment is as follows:
1. First, we `reset()` the environment and get an initial _observation_ of the environment in its initial conditions.
  * For the purpose of this exercise, _observation = state_, but other environments may not work like this.
1. Based on the observation, we choose an _action_ and `step()` the environment with the action. The `step()` method returns the _next observation_, a _reward_ and a _`done`_ boolean that tells us whether it's time to reset the environment again.
  * It also returns an additional dictionary object with debug info but we will not be using it.
1. We can now perform any additional operations we want to deal with the info we got from `step()` and go back to step 2 to repeat the process again.
  * If _`done`_ is `True` then we can exit the loop and perform even more additional operations with the obtained info before going back to step 1 again. When this happens, we have completed an ***episode***.
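
The workflow above can be sketched without Gym at all. The `CountdownEnv` class below is a hypothetical stand-in (it is not part of Gym) that mimics the `reset()`/`step()` contract described in the list, with `step()` returning four values; `run_episode` is likewise an illustrative helper.

```python
import random

class CountdownEnv:
    """Hypothetical toy environment, NOT part of Gym.
    The episode simply ends after `max_steps` timesteps."""
    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        # Step 1: put the environment in its initial conditions
        # and return the initial observation
        self.t = 0
        return self.t

    def step(self, action):
        # Step 2: advance one timestep and report what happened
        self.t += 1
        reward = 1.0
        done = self.t >= self.max_steps
        info = {}  # debug info, unused
        return self.t, reward, done, info

def run_episode(env):
    """Runs one full episode with random actions and returns the total reward."""
    ob, done, total_rew = env.reset(), False, 0.0
    while not done:
        ac = random.choice([0, 1])          # choose an action
        ob, rew, done, info = env.step(ac)  # step the environment
        total_rew += rew                    # step 3: use the info we got
    return total_rew

print(run_episode(CountdownEnv(max_steps=5)))  # 5.0
```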

The CartPole environment is a 2D physics-based environment in which a cart has a pendulum-like pole attached on top by a free-rotating, unactuated joint. The cart can move left or right on a rail, and the pole can rotate freely but will most likely fall down due to gravity.

The goal is to move the cart left and right so that the pole stays balanced on top of the cart. Each timestep has a +1 reward, but the _`done`_ condition becomes true when the pole tilts more than 12 degrees from vertical, when the cart moves more than 2.4 units from the center of the screen, or when the episode hits the environment's step limit. We can apply a force of +1 or -1 to the cart to move it left or right on each timestep.

Let's run some code and see how it all works on a random _episode_.

## Auxiliary functions

Please run all of these blocks before continuing.

In [1]:
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

In [2]:
import gym
import random
import os
import numpy as np
from collections import deque
# Import everything from tensorflow.keras to avoid mixing two Keras packages
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

In [3]:
# for video stuff
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #error only

import math
import glob
import io
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay

In [4]:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

<pyvirtualdisplay.display.Display at 0x7fc57161c890>

In [5]:
def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    
def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

## Random execution

We will now simply run the environment and see how it looks. We will choose random actions; we don't expect the pole to stand up for long so the _episode_ will most likely be short.

Run the code below as many times as you want to generate random episodes and see what they look like.

In [None]:
# Let's generate a random trajectory...

# Choose the Cart-Pole environment from the OpenAI Gym
env = wrap_env(gym.make("CartPole-v1"))

# Initialize the variables ob (observation=state), done (breaks loop) 
# and total_rew (reward)
ob, done, total_rew = env.reset(), False, 0

# Execution loop
while not done:
  env.render()
  
  # Sample a random action from the environment
  ac = env.action_space.sample()
  
  # Obtain the new state, reward and whether the episode has terminated
  ob, rew, done, info = env.step(ac)
  
  # Accumulate the reward
  total_rew += rew
  
print('Cumulative reward:', total_rew)
  
# ... and visualize it!
env.close()
show_video()

# RL algorithm: Deep Q-Learning

Now that we've seen what the CartPole environment looks like, we can start developing a way to solve it.

In this example we will see an algorithm called ***Deep Q-Learning***, or DQL for short.

DQL has the following "flavor":
* Model-free: we do not make use of a model to represent the environment.
* Value-based: we will estimate the quality of each possible action and we will _infer a policy_ from these quality values.
* Off-policy: the policy we use to collect experience (which includes random exploration) is not the same as the greedy policy whose quality values we are learning.

## What is Q-learning? And why "Deep" Q-learning?

If you recall from the RL agent components explanation, a _value function_ is a function that returns an _estimated reward_ based on the current state and/or action.
* The _state-value function `V`_ is a function that returns the expected reward based on the state _`s`_.
* The _action-value function `Q`_ is a function that returns the expected reward based on the state _`s`_ and the action _`a`_.
* In a way, _`V`_ could be understood as _`Q`_ averaged over all possible actions in _`s`_, weighted by how likely the policy is to choose each one.

Q-learning uses _`Q`_ to _derive a policy_: we simply choose the action with the highest Q value (_greedy policy_) or we occasionally pick a random action instead, with a given probability (_epsilon-greedy policy_, more on this later).
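
A few made-up numbers can make the relationship between `V`, `Q` and a greedy policy concrete. The sketch below assumes a state with two actions; `state_value` and `greedy_action` are hypothetical helpers written for this illustration only.

```python
def state_value(q_values, action_probs):
    """V(s) as the average of Q(s, a), weighted by the policy's action probabilities."""
    return sum(p * q for p, q in zip(action_probs, q_values))

def greedy_action(q_values):
    """Index of the action with the highest Q value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [1.0, 3.0]  # made-up Q(s, left) and Q(s, right)

# A policy that picks both actions equally often
print(state_value(q_values, [0.5, 0.5]))  # 2.0

# A greedy policy always picks the best action, so V(s) equals the highest Q value
print(greedy_action(q_values))            # 1
print(state_value(q_values, [0.0, 1.0]))  # 3.0
```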

**Deep** Q-learning is when we make use of a deep neural network (DNN) to estimate Q values, along with a few additional tricks. Because DNNs can output multiple values, we can make our DNN output the Q values for all available actions in a single go, which will come in handy later.

## Q-learning algorithm

So, how exactly do we estimate the ***future*** reward if we don't know what will happen in the future?

When we create a randomly initialized neural network and try to predict a result (or a reward in our case) without training it beforehand, the output will be nonsensical. However, by using a ***loss function*** we can calculate how far it is from the result we actually want and change the network accordingly.

Our loss function will have 2 important parts: the ***prediction*** reward and the ***target*** reward, so that we can compare our predictions and change them accordingly. But we don't actually have a "real" target to compare to, so we will use our own predictions to calculate the target as follows:

1. Starting from a state `s`, we pick an action `a` by predicting Q values and choosing the best one. This is the _prediction_ part of the loss function.
1. Observe both the reward and new state `s'`.
1. We now calculate the _target_ part of the loss function by predicting the maximum Q value of the state `s'`, multiplying it by a _decay rate_ (also known as the _discount factor_, `gamma` in our code, so that this "future reward prediction" is less important than our original prediction) and adding it to the reward we got on step 2.
1. We increase the timestep by 1, so that `s'` is now `s`, and go back to step 1.
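
The steps above can be traced with a few made-up numbers. This is only a sketch: `td_target` is a hypothetical helper, and the Q values are invented rather than predicted by a network.

```python
def td_target(reward, next_q_values, gamma, done):
    """Target = reward + gamma * max Q(s', a'), or just the reward if the episode ended."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

# Made-up numbers for illustration
gamma = 0.95            # decay rate
q_s = [0.5, 2.0]        # Q values predicted for state s
action = 1              # step 1: the action with the highest Q value
prediction = q_s[action]
reward = 1.0            # step 2: reward observed after acting
q_next = [1.0, 4.0]     # Q values predicted for the new state s'

# Step 3: reward plus the discounted best Q value of s'
target = td_target(reward, q_next, gamma, done=False)
print(prediction, target)  # 2.0 4.8
```

The gap between the prediction (2.0) and the target (4.8) is what the loss function measures.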

Here's what the loss function will look like for DQL:

![dqn loss](https://miro.medium.com/max/2400/1*rQDqgYwfnsbEu6u6ZdRMpQ.png)

>Source: https://medium.com/@gtnjuvin/my-journey-into-deep-q-learning-with-keras-and-gym-3e779cc12762
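
In case the image does not load, the standard DQL loss that this section describes can be written out as below. The notation is an assumption chosen to match the text: $r$ is the reward, $\gamma$ the decay rate, $s'$ the new state and $\theta$ the weights of the network.

$$
L(\theta) = \Big( \underbrace{r + \gamma \max_{a'} Q(s', a'; \theta)}_{\text{target}} - \underbrace{Q(s, a; \theta)}_{\text{prediction}} \Big)^2
$$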

Using our own predictions as targets is called _bootstrapping_. Don't worry if you're confused; the important bit is that you understand that we make use of our predictions and the actual rewards we get to train our network and improve its predictions.

We square the difference because it has interesting properties: negative errors become positive and large errors are penalized much more heavily than small ones, which helps training. You might notice that this loss is similar to the Mean Squared Error loss that we use for regression, and you'd be right.

The cool thing is that Keras takes care of lots of things, so we only have to worry about defining the target in our code.

>Note: the [DQL algorithm published by Google's DeepMind](https://arxiv.org/abs/1312.5602) was later refined to use 2 separate deep neural networks: the _policy network_ makes the predictions and the _target network_ is used to calculate the targets. The policy network is continuously updated, while the target network is a copy of the policy network whose weights are kept frozen; it is only refreshed every 1000 steps or so by copying the policy network again. This is done in order to keep the targets more stable. DQL is an _off-policy_ algorithm because the experience it learns from is collected by a different policy (epsilon-greedy exploration and older versions of the network) than the greedy policy it is learning. For the sake of simplicity we will only use a single network in our code.
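
The periodic copy described in the note can be sketched with a toy stand-in for the networks. Everything here is hypothetical: `TinyQNet` is just a list of numbers, the "training update" is a fixed nudge, and `SYNC_EVERY` follows the "every 1000 steps or so" figure from the note.

```python
import copy

class TinyQNet:
    """Hypothetical stand-in for a neural network: just a list of weights."""
    def __init__(self, weights):
        self.weights = list(weights)

SYNC_EVERY = 1000  # assumed refresh interval

policy_net = TinyQNet([0.0, 0.0])
target_net = copy.deepcopy(policy_net)  # frozen copy of the policy network

for step in range(1, 2501):
    # Pretend that a training update nudges the policy network on every step
    policy_net.weights = [w + 0.001 for w in policy_net.weights]
    # The target network only changes when we copy the policy network again
    if step % SYNC_EVERY == 0:
        target_net = copy.deepcopy(policy_net)

# The target network lags behind: it was last refreshed at step 2000
print(round(policy_net.weights[0], 3), round(target_net.weights[0], 3))  # 2.5 2.0
```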

## DQL tricks: replay memory

We now know everything we need to implement a solution for the cartpole environment, but the truth is that the solution wouldn't be very effective and our network would take forever to improve. This is because training our network with ***consecutive samples*** is problematic: the samples are _too correlated_, which leads to inefficient learning and ***catastrophic forgetting*** (the network forgets previously learned info as it keeps learning new info and can't predict correctly what it previously could).

In order to solve this, we can create a ***replay memory***: we create a table of transitions (remember Markov Decision Processes?) which we update continuously, and we train our network with ***random minibatches*** of transitions that we sample from the table.

By using this method we break most of the correlation between consecutive samples: the network learns from unrelated timesteps and learns how to improve its predictions based on random situations.

With the replay memory, our final DQL algorithm will look like this:
1. Collect transitions and store them in the replay memory for N episodes.
  * For the sake of simplicity, in our code N will be 1.
1. Sample a random minibatch of transitions from the replay memory.
1. Compute the targets using the minibatch.
1. Optimise the network.
1. Repeat as many times as desired.
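
A minimal sketch of a replay memory, assuming a `deque` with a fixed capacity and transitions stored as plain tuples. The capacity of 5 and the transition values are made up so that the behaviour is easy to follow.

```python
import random
from collections import deque

memory = deque(maxlen=5)  # tiny capacity so that eviction is easy to see

# Store made-up transitions: (state, action, reward, next_state, done)
for t in range(8):
    memory.append((t, t % 2, 1.0, t + 1, False))

print(len(memory))   # 5: the 3 oldest transitions were dropped
print(memory[0][0])  # 3: the oldest transition still in memory

# Sample a random minibatch: the transitions no longer arrive in consecutive order
minibatch = random.sample(list(memory), 3)
for state, action, reward, next_state, done in minibatch:
    print(state, action, reward, next_state, done)
```

Once the memory is full, every new transition pushes out the oldest one, so the network keeps training on a mix of recent and slightly older experience.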

Now you know everything you need to know to understand the code. Let's go!

# DQL implementation

The code below is heavily based on [Gaetan Juvin's](https://medium.com/@gtnjuvin/my-journey-into-deep-q-learning-with-keras-and-gym-3e779cc12762) [DQL implementation for CartPole](https://github.com/GaetanJUVIN/Deep_QLearning_CartPole).

## Agent

There is just one thing we haven't explained yet: _greedy_ vs _epsilon-greedy_ policies.

When we haven't been able to learn much because we're in the early stages of training, it makes sense to _explore multiple actions_ by randomly choosing them and see how well they work. But as we learn more and more, our guesses should be more educated.

A _greedy policy_ would be looking at the Q values and choosing the highest one. This is good for trained networks but not so much when we're just starting out.

We can use instead an _epsilon-greedy policy_: by defining an _exploration rate_ (AKA _epsilon_) we can randomly choose between picking a random action or picking the action with the highest Q value, with the exploration rate defining the chances of going the random or the high Q route. We can also define an _exploration decay_ that makes the exploration rate smaller as we train further until it reaches a minimum value, in order to make it possible to still explore random actions every once in a while.
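
To get a feel for how quickly the exploration rate shrinks, the sketch below simulates the decay on its own. `exploration_schedule` is a hypothetical helper; its default values (1.0, 0.01 and 0.995) are assumed to match the ones used by the agent in this notebook, with one decay per training update.

```python
def exploration_schedule(start=1.0, minimum=0.01, decay=0.995, updates=200):
    """Exploration rate after each update: multiply by `decay` while above `minimum`."""
    rate, history = start, []
    for _ in range(updates):
        if rate > minimum:
            rate *= decay
        history.append(rate)
    return history

history = exploration_schedule(updates=1000)

# Exploration rate after 200 updates
print(round(history[199], 3))  # 0.367

# Number of updates needed to reach the minimum
first_at_min = next(i for i, r in enumerate(history) if r <= 0.01) + 1
print(first_at_min)  # 919
```

Under these assumptions, roughly one in three actions is still random after 200 updates, and it takes more than 900 updates for the exploration rate to reach its minimum.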

In [7]:
class Agent():
  def __init__(self, state_size, action_size):
    self.weight_backup      = "cartpole_weight.h5"
    self.state_size         = state_size
    self.action_size        = action_size
    self.memory             = deque(maxlen=2000)
    self.learning_rate      = 0.001
    self.gamma              = 0.95
    self.exploration_rate   = 1.0
    self.exploration_min    = 0.01
    self.exploration_decay  = 0.995
    self.brain              = self._build_model()

  def _build_model(self):
    """Returns a simple neural network made of dense layers"""
    model = Sequential()
    model.add(Dense(24, input_dim=self.state_size, activation='relu'))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(self.action_size, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(learning_rate=self.learning_rate))

    # Load previously saved weights
    if os.path.isfile(self.weight_backup):
      model.load_weights(self.weight_backup)
      self.exploration_rate = self.exploration_min
      print('Found saved weights, loaded into model')
    else:
      print('Backup weights for model not found')
    return model

  def save_model(self):
    self.brain.save(self.weight_backup)

  def act(self, state):
    """Chooses the next action following an epsilon-greedy policy"""
    # Generate a random number and compare against epsilon. 
    if np.random.rand() <= self.exploration_rate:
      # Exploration time! We choose a random action
      return random.randrange(self.action_size)
    # Greedy time! We predict the Q values for all available actions and
    # choose the action with the highest one
    act_values = self.brain.predict(state)
    return np.argmax(act_values[0])

  def remember(self, state, action, reward, next_state, done):
    """Appends a transition to the replay memory"""
    self.memory.append((state, action, reward, next_state, done))

  def replay(self, sample_batch_size):
    """Samples a random minibatch from the replay memory, calculates targets and trains the DNN with it"""
    if len(self.memory) < sample_batch_size:
      return
    # Sample a random minibatch
    sample_batch = random.sample(self.memory, sample_batch_size)
    # We calculate the target
    for state, action, reward, next_state, done in sample_batch:
      # If we're done, then target = reward because there's no future Q-values
      target = reward
      if not done:
        # full formula goes here
        target = reward + self.gamma * np.amax(self.brain.predict(next_state)[0])
      # We now map the current state to the future discounted reward
      target_f = self.brain.predict(state)
      target_f[0][action] = target
      # and now we optimize our network
      self.brain.fit(state, target_f, epochs=1, verbose=0)
    # We now discount the exploration rate so that future decisions tend to
    # be more informed -> greedy options will be more likely
    if self.exploration_rate > self.exploration_min:
      self.exploration_rate *= self.exploration_decay

## Environment

We'll define our main code here: the `train` and `run` methods.

The `train` method will contain our main DQL loop: we follow the regular RL loop of observing the environment, deciding the action, acting on the environment and receiving the reward, but with the addition of storing every decision in our replay memory. Once an episode is done, we will replay our memories and improve our network.

The `run` method will simply run a single episode with our trained network and display a video of its results.

In [16]:
class CartPole:
  def __init__(self, episodes=200):
    self.sample_batch_size = 32
    self.episodes          = episodes
    self.env               = wrap_env(gym.make("CartPole-v1"))

    self.state_size        = self.env.observation_space.shape[0]
    self.action_size       = self.env.action_space.n
    self.agent             = Agent(self.state_size, self.action_size)


  def train(self):
    self.env = wrap_env(gym.make("CartPole-v1"))
    self.agent.exploration_rate   = 1.0
    try:
      for index_episode in range(self.episodes):
        state = self.env.reset()
        state = np.reshape(state, [1, self.state_size])

        done = False
        index = 0
        # main loop
        while not done:

          action = self.agent.act(state)

          next_state, reward, done, _ = self.env.step(action)
          next_state = np.reshape(next_state, [1, self.state_size])
          # we save our decision in the replay memory
          self.agent.remember(state, action, reward, next_state, done)
          state = next_state
          index += 1
        # index already equals the number of timesteps survived in this episode
        print("Episode {}# Score: {}".format(index_episode, index))
        # after the episode, we replay the memories and learn
        self.agent.replay(self.sample_batch_size)
    finally:
      self.agent.save_model()
    self.env.close()

  def run(self, epsilon=0.0):
    self.env = wrap_env(gym.make("CartPole-v1"))
    self.agent.exploration_rate = epsilon
    state, done, total_reward = self.env.reset(), False, 0
    state = np.reshape(state, [1, self.state_size])

    while not done:
      self.env.render()
      action = self.agent.act(state)
      state, reward, done, _ = self.env.step(action)
      state = np.reshape(state, [1, self.state_size])
      total_reward += reward
    print('Cumulative reward:', total_reward)
    self.env.close()
    show_video()

Now run some episodes; the more the better! In Colab, 200 episodes should take about 14 minutes.

In [None]:
# Run some episodes, the more the better. 200 episodes should take about 14 minutes in Colab.
cartpole = CartPole(200)
cartpole.train()

Did you get good results? You can try rerunning the previous block and see how it behaves. The more episodes you run, the more likely it is that you'll improve the results.

Now run the code below to check how your network behaves. Is it better than the random executions above? Run the block multiple times to get new episodes; you might get lucky!

If your results are consistently below 50 or so, you can try increasing the exploration rate slightly; try 0.1 or 0.2 in the code below.

In [None]:
cartpole.run(0.0)

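If you'd like to try pre-trained weights, run the code below and try again. If a `cartpole_weight.h5` file from your own training already exists, delete or rename it first; otherwise `wget` saves the download as `cartpole_weight.h5.1` and the agent keeps loading your own weights. FYI: `CartPole-v1` episodes are capped at 500 timesteps, and Gym considers the environment "solved" when you get an average reward of at least 475 over 100 consecutive episodes.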

In [22]:
!wget https://github.com/GaetanJUVIN/Deep_QLearning_CartPole/raw/master/cartpole_weight.h5

--2022-01-23 00:17:40--  https://github.com/GaetanJUVIN/Deep_QLearning_CartPole/raw/master/cartpole_weight.h5
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/GaetanJUVIN/Deep_QLearning_CartPole/master/cartpole_weight.h5 [following]
--2022-01-23 00:17:40--  https://raw.githubusercontent.com/GaetanJUVIN/Deep_QLearning_CartPole/master/cartpole_weight.h5
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33168 (32K) [application/octet-stream]
Saving to: ‘cartpole_weight.h5.1’


2022-01-23 00:17:41 (17.5 MB/s) - ‘cartpole_weight.h5.1’ saved [33168/33168]



In [None]:
cartpole = CartPole(200)
cartpole.run(0.0)