# Solving Cart Pole

The goal of this Jupyter notebook is to explain basic concepts of reinforcement learning (RL) by solving the [Cart Pole problem](https://gym.openai.com/envs/CartPole-v1/), the "hello world" of reinforcement learning.

**Disclaimer:** To get the most out of this tutorial, it helps to have a basic understanding of reinforcement learning (you should know the concepts of rewards, observations and environments). If you are completely new to RL, check out this [video](https://www.youtube.com/watch?v=2pWv7GOvuf0) or this [article](https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html).

Let's start with some imports and a helper function to animate the behaviour of our reinforcement learning agent:


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import tensorflow as tf
import matplotlib as mpl
mpl.rc('animation', html='jshtml')

def update_scene(num, frames, patch):
    patch.set_data(frames[num])
    return patch,

# Define some helper functions :-)
def plot_animation(frames, repeat=False, interval=40):
    fig = plt.figure()
    patch = plt.imshow(frames[0])
    plt.axis('off')
    anim = animation.FuncAnimation(
        fig, update_scene, fargs=(frames, patch),
        frames=len(frames), repeat=repeat, interval=interval)
    plt.close()
    return anim

## OpenAI gym environments

To train and use an RL algorithm, we first need an environment.
For example, if you want to teach a reinforcement learning agent to play an Atari game, you first need an environment that simulates this game so that the agent can interact with it. The same holds for any other task: to teach an agent to play table soccer, you first need an environment in which the agent can learn to play.

Luckily, OpenAI Gym provides many simulated environments for testing RL algorithms. One of them is the [cart pole environment](https://gym.openai.com/envs/CartPole-v0/).

**It's highly recommended to read up on OpenAI Gym environments [here](https://gym.openai.com/docs/) to get started!**

In [None]:
import gym

## The Cart Pole environment

The [cart pole environment](https://gym.openai.com/envs/CartPole-v0/) contains a cart that can be moved either left or right. A pole is mounted on top of the cart, and we want to keep it balanced. The goal is to train a reinforcement learning agent that learns to balance the pole.

In [None]:
# load the OpenAI Gym environment
env = gym.make('CartPole-v1')
# Initialize the environment by calling reset
env.reset()
# plot the environment by calling the render method
img = env.render(mode="rgb_array")
plt.imshow(img)

To perform an action in an OpenAI Gym environment, call the `.step()` method; its parameter defines the action to take. The Cart Pole environment has two actions (a discrete action space of size 2): pushing the cart to the left (Action=0) or pushing it to the right (Action=1).

In [None]:
# Let's play one step: we push the cart a bit to the left (Action=0 means left).
observation, reward, done, info = env.step(0)

In response to an action, the `step()` function returns four values describing the new state of the environment:
1. Observation: The new state of the environment. One observation is composed as follows:

Pos | Observation | Min | Max
---|---|---|---
0 | Cart Position | -4.8 | 4.8
1 | Cart Velocity | -Inf | Inf
2 | Pole Angle | ~ -0.418 rad (-24&deg;) | ~ 0.418 rad (24&deg;)
3 | Pole Velocity At Tip | -Inf | Inf

The Min/Max columns are the bounds of `env.observation_space`; an episode already ends when the cart position leaves &plusmn;2.4 or the pole angle exceeds &plusmn;12&deg;.

2. Reward: A scalar reward that can be used to train a reinforcement learning algorithm (remember: the goal of an RL agent is to maximise rewards). In the Cart Pole environment we get a reward of 1.0 for every time step in which the pole has not tipped over. **The ultimate goal is to maximise the sum of rewards (the more reward, the longer we balanced the pole!)** A total reward of 200 means we balanced the pole for the maximum time span in `CartPole-v0`, where an episode automatically terminates after 200 time steps (`CartPole-v1` allows 500).
3. Done: `True` if the episode has ended (here: after the pole tipped too far to be balanced, the cart left the track, or the time limit was reached).
4. Info: A dictionary with further diagnostic information.

In [None]:
print(observation)
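
The interaction pattern described above (call `step()`, read back observation, reward and done flag, repeat until the episode ends) can be sketched without Gym. The `ToyBalanceEnv` class and both policies below are hypothetical stand-ins invented for illustration; only the shape of the `reset()`/`step()` interface mirrors the environment used in this notebook.

```python
class ToyBalanceEnv:
    """Hypothetical 1-D environment: keep an integer position within [-limit, limit]."""

    def __init__(self, limit=2, max_steps=20):
        self.limit = limit
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        # Start a new episode and return the first observation.
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # Action 1 moves right, any other action moves left.
        self.position += 1 if action == 1 else -1
        self.steps += 1
        out_of_bounds = abs(self.position) > self.limit
        done = out_of_bounds or self.steps >= self.max_steps
        reward = 1.0  # one point for every step played
        return self.position, reward, done, {}


def run_episode(env, policy):
    """Play one episode and return the sum of rewards."""
    obs = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        obs, reward, done, info = env.step(policy(obs))
        total_reward += reward
    return total_reward


def always_left(obs):
    return 0


def recenter(obs):
    # Move right when left of the centre, otherwise move left.
    return 1 if obs < 0 else 0


env = ToyBalanceEnv()
print(run_episode(env, always_left))
print(run_episode(env, recenter))
```

The policy that always moves left leaves the allowed range after a few steps, while the recentering policy survives until the step limit, so it collects a larger sum of rewards.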

**Pro Tip:** Every OpenAI Gym environment implements `env.observation_space` and `env.action_space`, which you can inspect to learn how to interact with the environment:

In [None]:
env.observation_space

In [None]:
env.action_space

## Test the environment


As a sanity check, we push the cart to one side several times in a loop. We expect the pole to tip to the opposite side.

In [None]:
# reset again to start a new episode
obs = env.reset()

# collect the rendered frames for the animation
frames = []

# Push the cart to the left for 15 timesteps
for i in range(15):
    # action 0 pushes the cart to the left
    env.step(0)
    frames.append(env.render(mode="rgb_array"))

plot_animation(frames)

... And yes, the pole tips. Now that we understand the environment, we can try to balance the pole :-)

## Hardcoded Policy

Let's first try to solve Cart Pole with a hard-coded policy, without using any RL. We just look at the angle of the pole and choose the action that pushes the cart towards the side the pole is leaning to.

In [None]:
def basic_policy(obs):
    angle = obs[2]
    if angle < 0:
        return 0
    return 1

In [None]:
frames = []

obs = env.reset()
i = 0 
for step in range(200):
    i += 1
    img = env.render(mode="rgb_array")
    frames.append(img)
    action = basic_policy(obs)
    obs, reward, done, info = env.step(action)
    
    if done:
        break

print(i)

In [None]:
plot_animation(frames)

... As you can see, the episode usually ends after around 40 steps (the pole tips over too far to be balanced, so the environment returns `done=True`).

# Q-learning from scratch

Before we jump into the implementation, let's clarify some basics:

**What is actually a Q-Value?**

A Q-Value (Quality value) $Q(s,a)$ denotes the sum of discounted future rewards an agent can expect after it chooses to take action $a$ in state $s$.

**What do we mean by discounted future rewards?**

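We use a discount factor $\gamma$ (a value between 0 and 1) to weight rewards that we expect to get in the future: a reward received $k$ steps from now counts only $\gamma^k$ times its value, so rewards far in the future matter less than immediate ones.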
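
As a small worked example, the sketch below sums a list of rewards with discounting. The function name `discounted_return` and the reward values are made up for illustration, and $\gamma = 0.95$ is just an example value.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, weighting the reward k steps ahead by gamma**k."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.95))  # discounted sum
print(discounted_return(rewards, gamma=1.0))   # plain sum, no discounting
```

With $\gamma = 0.95$ the three rewards are worth $1 + 0.95 + 0.95^2 = 2.8525$ (up to floating-point rounding), compared to 3 without discounting.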

If this explanation is not enough (it probably isn't 🤯), please check out David Silver's introduction to RL video: https://www.youtube.com/watch?v=2pWv7GOvuf0


---------

**Disclaimer:** Almost all the content of this section is derived or copied from https://github.com/ageron/handson-ml2/blob/master/18_reinforcement_learning.ipynb, which is an excellent source!

In [None]:
from collections import deque
from tensorflow import keras

In [None]:
# create the environment
env = gym.make("CartPole-v0")

# reset the Keras session state
keras.backend.clear_session()
initial_observation = env.reset()

The following neural net is used as a Deep Q-Network (DQN). Given a state of the environment, it estimates a Q-value for each possible action in that state.

In [None]:
input_shape = [4] # == env.observation_space.shape
n_outputs = 2 # == env.action_space.n

dqn = keras.models.Sequential([
    keras.layers.Dense(32, activation="elu", input_shape=input_shape),
    keras.layers.Dense(32, activation="elu"),
    keras.layers.Dense(n_outputs)
])

To select an action using this DQN, we just pick the action with the largest predicted Q-value. However, to ensure that the agent explores the environment, we choose a random action with probability epsilon. This is called the epsilon-greedy strategy.

In [None]:
def epsilon_greedy_policy(state, epsilon=0):
    if np.random.rand() < epsilon:
        return np.random.randint(n_outputs)
    else:
        Q_values = dqn.predict(state[np.newaxis])
        return np.argmax(Q_values[0])
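
To get a feel for how often the greedy action is still chosen under exploration, here is a minimal standard-library sketch of the same idea. It assumes fixed, made-up Q-values instead of a neural network; the helper `epsilon_greedy` and its `rng` parameter are illustrative and not part of the notebook's code.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Index of the largest Q-value (the greedy action).
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(42)
q_values = [0.1, 0.9]  # made-up estimates: action 1 looks better
choices = [epsilon_greedy(q_values, epsilon=0.2, rng=rng) for _ in range(10_000)]
greedy_fraction = choices.count(1) / len(choices)
print(round(greedy_fraction, 2))
```

With two actions and `epsilon=0.2`, the greedy action is expected in roughly 90% of the cases: 80% from the greedy branch plus half of the 20% random picks.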

We will also need a replay memory. It will contain the agent's experiences, in the form of tuples: (obs, action, reward, next_obs, done). Experiences in the replay_memory are later used to train the neural network.

In [None]:
replay_memory = deque(maxlen=2000)

# Convenience function to draw random batch from replay_memory
def sample_experiences(batch_size):
    indices = np.random.randint(len(replay_memory), size=batch_size)
    batch = [replay_memory[index] for index in indices]
    states, actions, rewards, next_states, dones = [
        np.array([experience[field_index] for experience in batch])
        for field_index in range(5)]
    return states, actions, rewards, next_states, dones
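
The `maxlen` argument is what keeps the replay memory bounded: once the deque is full, appending a new experience silently drops the oldest one. The sketch below shows this with a tiny capacity and made-up experience tuples, and draws a batch with replacement using the standard library in place of `np.random.randint`.

```python
import random
from collections import deque

# Tiny memory so that the eviction is easy to see.
memory = deque(maxlen=3)
for t in range(5):
    # Made-up experience: (obs, action, reward, next_obs, done)
    memory.append((t, 0, 1.0, t + 1, False))

# Only the three most recent experiences are left.
remaining_obs = [experience[0] for experience in memory]
print(remaining_obs)

# Draw a batch by sampling indices with replacement.
rng = random.Random(0)
batch = [memory[rng.randrange(len(memory))] for _ in range(4)]
print(len(batch))
```

Because indices are drawn with replacement, a batch can be larger than the memory and may contain the same experience more than once.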

Now we can create a function that will use the DQN to play one step, and record its experience in the replay memory:

In [None]:
def play_one_step(env, state, epsilon):
    action = epsilon_greedy_policy(state, epsilon)
    next_state, reward, done, info = env.step(action)
    replay_memory.append((state, action, reward, next_state, done))
    return next_state, reward, done, info

Below we define the `training_step()` function which is used to train the DQN.

According to the Bellman optimality equation, the Q-value of a state-action pair $Q(s,a)$ can be defined as follows:

$Q(s,a) = R + \gamma \cdot \max_{a'} Q(s', a')$

That is, the Q-value of a state-action pair $Q(s,a)$ equals the immediate reward $R$ for taking action $a$ in state $s$, plus the discounted sum of future rewards the agent expects to collect from the next state $s'$ onwards ($\gamma \cdot \max_{a'} Q(s', a')$).

For an experience from our batch, we can therefore calculate the Q-value of the action taken in the current state in two ways:

**1.)** Use the DQN to estimate the Q-values directly from the current state.

**2.)** Use the DQN to estimate the Q-values for the `next_state`, take the maximum, discount it with $\gamma$ and add the reward (Bellman optimality equation).

Both estimates should agree according to the Bellman optimality equation. The goal of training the DQN is to reduce the mean squared error between **1.** and **2.**. As a result, the DQN gets better at estimating Q-values and thus better at solving the task.
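
The target computation can be illustrated with plain numbers. The sketch below is a standard-library illustration with made-up values, not the notebook's TensorFlow code; `bellman_targets` and `mean_squared_error` are hypothetical helper names. For a finished episode (`done=True`) there is no next state to bootstrap from, so the target is just the reward.

```python
def bellman_targets(rewards, dones, next_q_values, gamma=0.95):
    """Target = reward + gamma * max_a' Q(s', a'), or just reward if done."""
    targets = []
    for reward, done, next_q in zip(rewards, dones, next_q_values):
        future = 0.0 if done else gamma * max(next_q)
        targets.append(reward + future)
    return targets


def mean_squared_error(targets, estimates):
    squared = [(t - e) ** 2 for t, e in zip(targets, estimates)]
    return sum(squared) / len(squared)


# Two made-up experiences; the second one ended the episode.
rewards = [1.0, 1.0]
dones = [False, True]
next_q_values = [[2.0, 4.0], [3.0, 1.0]]  # Q(s', a') for both actions
current_estimates = [4.0, 2.0]            # Q(s, a) for the actions taken

targets = bellman_targets(rewards, dones, next_q_values)
loss = mean_squared_error(targets, current_estimates)
print(targets, loss)
```

Here the first target is $1 + 0.95 \cdot 4 = 4.8$ and the second is simply $1$, so the loss is the mean of $0.8^2$ and $1^2$, about $0.82$.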

In [None]:
batch_size = 32
discount_rate = 0.95
optimizer = keras.optimizers.Adam(learning_rate=1e-2)
loss_fn = keras.losses.mean_squared_error

def training_step(batch_size):
    # draw a random batch of experiences
    states, actions, rewards, next_states, dones = sample_experiences(batch_size)
    
    # use DQN to predict Q values for all possible actions from the next state.
    next_Q_values = dqn.predict(next_states)
    
    # only use the maximum Q value as this will be the action which will be selected.
    max_next_Q_values = np.max(next_Q_values, axis=1)
    
    # calculate target_Q_values based on the Bellman optimality equation
    target_Q_values = (rewards + (1 - dones) * discount_rate * max_next_Q_values)
    target_Q_values = target_Q_values.reshape(-1, 1)
    mask = tf.one_hot(actions, n_outputs)
    
    with tf.GradientTape() as tape:
        # Directly estimate Q-values for actions of the current state
        Q_values = dqn(states)
        Q_values = tf.reduce_sum(Q_values * mask, axis=1, keepdims=True)
        # calculate loss based on mse 
        loss = tf.reduce_mean(loss_fn(target_Q_values, Q_values))
    # perform gradient descent step to train the neural net
    grads = tape.gradient(loss, dqn.trainable_variables)
    optimizer.apply_gradients(zip(grads, dqn.trainable_variables))
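
The one-hot mask in `training_step()` keeps only the Q-value of the action that was actually taken in each experience, because the target is only known for that action. A minimal plain-Python sketch of this selection, with made-up Q-values and hypothetical helper names `one_hot` and `select_taken_q`:

```python
def one_hot(index, depth):
    """Return a list of zeros with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def select_taken_q(q_values, actions):
    """For each row, keep the Q-value of the action that was taken."""
    n_actions = len(q_values[0])
    selected = []
    for q_row, action in zip(q_values, actions):
        mask = one_hot(action, n_actions)
        # Multiply by the mask and sum: everything except the taken action is zeroed.
        selected.append(sum(q * m for q, m in zip(q_row, mask)))
    return selected


q_values = [[0.5, 1.5], [2.0, -1.0]]  # made-up network outputs for two states
actions = [1, 0]                      # actions taken in those states
print(select_taken_q(q_values, actions))  # [1.5, 2.0]
```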

Finally, we are ready to train our DQN. Details can be found in the code comments.

In [None]:
# setting seeds helps to reproduce results
env.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

rewards = [] 
best_score = 0

# we are training our model for 600 episodes !
for episode in range(600):
    # at the beginning of each episode we reset the environment to start fresh again :-)
    obs = env.reset()    
        # An episode of CartPole-v0 ends automatically after 200 steps, so it is enough to iterate for at most 200 steps.
    for step in range(200):
        # epsilon denotes the probability to explore rather than to take the action with the highest Q-value. 
        # If this is not clear, make sure to read about epsilon greedy strategies
        epsilon = max(1 - episode / 500, 0.01)
        # Calling the play_one_step method to fill our replay_memory with experiences
        obs, reward, done, info = play_one_step(env, obs, epsilon)
        # If the pole tips too far, we lose and the episode ends :-(
        if done:
            break
    # We keep track of number of steps played of each episode to analyse the training process later
    rewards.append(step)
    
    # Once more than 50 episodes have been played, we run one training step per episode (by then the replay_memory holds enough experiences to sample from)
    if episode > 50:
        training_step(batch_size)
        
    # Making sure to keep track of the model with the best weights. 
    if step > best_score:
        best_weights = dqn.get_weights()
        best_score = step
    print("\rEpisode: {}, Steps: {}, eps: {:.3f}".format(episode, step + 1, epsilon), end="")

dqn.set_weights(best_weights)
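
The exploration schedule used in the training loop, `epsilon = max(1 - episode / 500, 0.01)`, decays linearly from 1 to 0.01 over the first 500 episodes and then stays at 0.01. The sketch below only tabulates that formula; the function name `epsilon_for` and its parameter names are chosen here for illustration.

```python
def epsilon_for(episode, decay_episodes=500, floor=0.01):
    """Linear decay from 1.0 down to a fixed floor."""
    return max(1 - episode / decay_episodes, floor)


for episode in (0, 125, 250, 500, 599):
    print(episode, epsilon_for(episode))
```

Early episodes are therefore almost entirely random, while the last 100 episodes follow the DQN's greedy action about 99% of the time.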

In [None]:
plt.plot(rewards)
plt.title('Rewards per training episode')

Last but not least, let's have a look at our trained agent in action :)

In [None]:
state = env.reset()

frames = []

for step in range(200):
    Q_values = dqn.predict(state[np.newaxis])
    action = np.argmax(Q_values[0])
    state, reward, done, info = env.step(action)
    if done:
        break
    img = env.render(mode="rgb_array")
    frames.append(img)
    
plot_animation(frames)

# ChainerRL 

Let's see how the task of balancing the pole can be solved with the RL framework ChainerRL.

------

**Disclaimer:** Almost all of the content of this section is derived or copied from https://github.com/chainer/chainerrl/blob/master/examples/quickstart/quickstart.ipynb

In [None]:
import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
import gym
import numpy as np

import logging
import sys
logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

In [None]:
env = gym.make('CartPole-v0')
env.reset()

In [None]:
obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

_q_func = chainerrl.q_functions.FCStateQFunctionWithDiscreteAction(
    obs_size, n_actions,
    n_hidden_layers=2, n_hidden_channels=50)

In [None]:
# Use Adam to optimize q_func. eps=1e-2 is for stability.
optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(_q_func)

In [None]:
# Set the discount factor that discounts future rewards.
gamma = 0.95

# Use epsilon-greedy for exploration
explorer = chainerrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.3, random_action_func=env.action_space.sample)

# DQN uses Experience Replay.
# Specify a replay buffer and its capacity.
replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=2000)

# Since observations from CartPole-v0 are numpy.float64 while
# Chainer only accepts numpy.float32 by default, specify
# a converter as a feature extractor function phi.
phi = lambda x: x.astype(np.float32, copy=False)

# Now create an agent that will interact with the environment.
agent = chainerrl.agents.DoubleDQN(
    _q_func, optimizer, replay_buffer, gamma, explorer,
    replay_start_size=500, update_interval=1,
    target_update_interval=100, phi=phi)

In [None]:
rewards = []

for episode in range(600):
    obs = env.reset()
    reward = 0
    for step in range(200):
        # Uncomment to watch the behaviour
        # env.render()
        action = agent.act_and_train(obs, reward)
        obs, reward, done, _ = env.step(action)
        if done:
            break
    # We keep track of number of steps played of each episode to analyse the training process later
    rewards.append(step)
    
    agent.stop_episode_and_train(obs, reward, done)
    print("\rEpisode: {}, Steps: {}".format(episode, step + 1), end="")


In [None]:
obs = env.reset()

frames = []

for step in range(200):
    action = agent.act(obs)
    obs, r, done, _ = env.step(action)
    img = env.render(mode="rgb_array")
    frames.append(img)
    if done:
        break
    
agent.stop_episode()

plot_animation(frames)

Training with ChainerRL can be simplified further:

In [None]:
chainerrl.experiments.train_agent_with_evaluation(
    agent, env,
    steps=10000,                 # Train the agent for 10000 steps
    eval_n_steps=None,           # Evaluate by number of episodes, not by number of steps
    eval_n_episodes=10,          # 10 episodes are sampled for each evaluation
    train_max_episode_len=2000,  # Maximum length of each episode
    eval_interval=3300,          # Evaluate the agent after every 3300 steps
    outdir='result')             # Save everything to 'result' directory

# Sources

- https://github.com/ageron/handson-ml2/blob/master/18_reinforcement_learning.ipynb
- https://www.youtube.com/watch?v=2pWv7GOvuf0
- https://github.com/chainer/chainerrl/blob/master/examples/quickstart/quickstart.ipynb