# OpenAI Universe part 2: deep q-networks
Last part we used a random search algorithm to "solve" the cartpole environment. This time we are going to take things to the next level and implement a deep q-network.

## Background
Q-learning is a reinforcement learning technique that tries to predict the reward of a state-action pair. For the cartpole environment the state consists of four values, and there are two possible actions. For a given state $S$ we can predict the reward we would get by pushing left, $Q(S,left)$, or right, $Q(S,right)$.

In the Atari game environment you get a reward of 1 every time you score a point. This scoring can happen when you hit a block in Breakout, an alien in Space Invaders, or eat a pellet in Pacman. In the cartpole environment you get a reward every frame the pole is still standing on the cart. The trick of q-learning is that it not only considers the direct reward, but also the expected future reward. After applying action $a$ we enter state $S_{t+1}$ and take the following into account:
- The reward $r$ we obtained by performing this action
- The expected maximum future reward $\max_a Q(S_{t+1},a)$; in the cartpole environment this is $\max(Q(S_{t+1},left), Q(S_{t+1},right))$

We combine this into a neat formula, where we say that the predicted value should be $r$ in a terminal state, and $r$ plus the discounted expected future reward otherwise:

\begin{equation*}
Q(S,a) = \left\{
\begin{array}{ll}
r & \text{for terminal } S_{t+1} \\
r + \gamma \max_a Q(S_{t+1},a) & \text{for nonterminal } S_{t+1}
\end{array} \right.
\end{equation*}
Where $\gamma$ is the discount factor. Taking a small $\gamma$ (for example 0.2) means that you don't really care about long-term rewards, a large $\gamma$ (0.95) means that you care a lot about the long-term rewards. In our case we do care a lot about long-term rewards, so we take a large $\gamma$. 
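
To get a feeling for the formula, here is a minimal sketch in plain Python. The function name `q_target` and the Q-values for the next state are made up for illustration; only the formula itself comes from the text above.

```python
def q_target(reward, next_q_values, terminal, gamma=0.95):
    """Target value for Q(S, a) after observing one transition.

    next_q_values holds the predicted Q(S_{t+1}, a) for every action a.
    """
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical predictions for the next state: [Q(left), Q(right)]
next_q = [4.0, 6.0]

print(q_target(1.0, next_q, terminal=False, gamma=0.95))  # 1 + 0.95 * 6
print(q_target(1.0, next_q, terminal=False, gamma=0.2))   # 1 + 0.2 * 6
print(q_target(1.0, next_q, terminal=True))               # only the direct reward
```

With the large $\gamma$ most of the target comes from the expected future reward, with the small $\gamma$ the direct reward makes up a much bigger share.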

Let's apply our knowledge of q-learning on the same environment we tried last time: the CartPole environment. 


In [1]:
%matplotlib notebook
from time import gmtime, strftime
import threading
import time

import numpy as np
import matplotlib.pyplot as plt

from ipywidgets import widgets
from IPython.display import display
import tensorflow as tf
import gym
from gym import wrappers
import random

from matplotlib import animation
from JSAnimation.IPython_display import display_animation

env = gym.make('CartPole-v0')
env = wrappers.Monitor(env, '/tmp/cartpole-experiment-1')
observation = env.reset()


[2017-07-03 12:50:28,161] Making new env: CartPole-v0
[2017-07-03 12:50:28,169] Creating monitor directory /tmp/cartpole-experiment-1
[2017-07-03 12:50:28,170] Starting new video recorder writing to /tmp/cartpole-experiment-1/openaigym.video.0.26236.video000000.mp4


## Value approximation
There are many ways in which you can estimate the Q-value for each (state,action) pair. The latest "cool" thing to do is estimate it using a neural network. This is also what we will be doing!

We will build our network in Tensorflow: an open-source library for machine learning. If you are not familiar with Tensorflow, the most important thing to know is that we will first build our network, then initialise it and use it. The Python variables only describe the network; they get actual values when you run them in a session. You can find more information on the [Tensorflow homepage](https://www.tensorflow.org/get_started/).

I created a very simple network layout with four inputs (the four variables we observe) and two outputs (either push left or right). I added four fully connected layers: 
- From 4 to 16 variables
- From 16 to 32 variables
- From 32 to 8 variables
- From 8 to 2 variables

Every layer is a dense layer with a ReLU nonlinearity, except for the last layer: this one has to predict the expected Q-value, so its output is not clipped.
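
If you want to see the layout without any Tensorflow around it, the sketch below does the same forward pass in plain Python. The helper names (`init_layer`, `dense`, `forward`) and the example state are my own invention, and I assume random normal weights with a standard deviation of 0.35 and zero biases. It is only meant to show the shapes, not to be fast.

```python
import random

LAYER_SIZES = [4, 16, 32, 8, 2]  # input, three hidden layers, output


def init_layer(n_in, n_out, stddev=0.35):
    """Random normal weights and zero biases for one dense layer."""
    weights = [[random.gauss(0.0, stddev) for _ in range(n_out)] for _ in range(n_in)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases, relu):
    """One fully connected layer: inputs * weights + biases, optional ReLU."""
    outputs = []
    for j, bias in enumerate(biases):
        total = bias + sum(x * weights[i][j] for i, x in enumerate(inputs))
        outputs.append(max(0.0, total) if relu else total)
    return outputs


def forward(state, layers):
    """Run a state through all layers; only the last layer skips the ReLU."""
    values = state
    for index, (weights, biases) in enumerate(layers):
        is_last = index == len(layers) - 1
        values = dense(values, weights, biases, relu=not is_last)
    return values


layers = [init_layer(n_in, n_out) for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])]
n_parameters = sum(len(w) * len(w[0]) + len(b) for w, b in layers)

q_values = forward([0.01, -0.02, 0.03, 0.04], layers)  # made-up cartpole state
print(len(q_values), n_parameters)  # prints "2 906"
```

So this small network has 906 trainable numbers and returns one Q-value per action.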

In [2]:
# Network input
networkstate = tf.placeholder(tf.float32, [None, 4], name="input")
networkaction = tf.placeholder(tf.int32, [None], name="actioninput")
networkreward = tf.placeholder(tf.float32,[None], name="groundtruth_reward")
action_onehot = tf.one_hot(networkaction, 2, name="actiononehot")

# The variables in our network: 
w1 = tf.Variable(tf.random_normal([4,16], stddev=0.35), name="W1")
w2 = tf.Variable(tf.random_normal([16,32], stddev=0.35), name="W2")
w3 = tf.Variable(tf.random_normal([32,8], stddev=0.35), name="W3")
w4 = tf.Variable(tf.random_normal([8,2], stddev=0.35), name="W4")
b1 = tf.Variable(tf.zeros([16]), name="B1")
b2 = tf.Variable(tf.zeros([32]), name="B2")
b3 = tf.Variable(tf.zeros([8]), name="B3")
b4 = tf.Variable(tf.zeros(2), name="B4")

# The network layout
layer1 = tf.nn.relu(tf.add(tf.matmul(networkstate,w1), b1), name="Result1")
layer2 = tf.nn.relu(tf.add(tf.matmul(layer1,w2), b2), name="Result2")
layer3 = tf.nn.relu(tf.add(tf.matmul(layer2,w3), b3), name="Result3")
predictedreward = tf.add(tf.matmul(layer3,w4), b4, name="predictedReward")

# Learning 
qreward = tf.reduce_sum(tf.multiply(predictedreward, action_onehot), reduction_indices = 1)
loss = tf.reduce_mean(tf.square(networkreward - qreward))
tf.summary.scalar('loss', loss)
optimizer = tf.train.RMSPropOptimizer(0.0001).minimize(loss)
merged_summary = tf.summary.merge_all()


## Session management and Tensorboard

Now we start the session. I added support for Tensorboard: a nice tool to visualise your learning. At the moment I only added one summary: the loss of the network. 
If you did not install Docker yet, make sure [you do this](https://docs.docker.com/engine/installation/#supported-platforms). To run tensorboard you have to run:

```
docker run -p 6006:6006 -v $(pwd):/mounted rmeertens/tensorboard
```

Then navigate to localhost:6006 to see your tensorboard.


In [3]:
sess = tf.InteractiveSession()
summary_writer = tf.summary.FileWriter('trainsummary',sess.graph)
sess.run(tf.global_variables_initializer())

## Learning Q(S,a)
An interesting paper you can use as a guideline for deep q-networks is ["Playing Atari with Deep Reinforcement Learning"](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf). This paper by DeepMind explains how they were able to teach a neural network to play Atari games.

One of the main contributions of this paper is their use of an "experience replay mechanism". If you train your neural network on the frames in the order in which you see them, the network quickly forgets what it saw before. To fix this we save what we saw in a memory, where each entry has the following variables:

($S$, $action$, $reward$, $is terminal$, $S_{t+1}$)

Now every frame we sample a random minibatch of our memory and train our network on that. We also only keep the newer experiences to keep our memory fresh with good actions. The full algorithm in their paper looks like this: 
![dqn algorithm](dqn alg.png)
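
The memory itself is nothing fancy. Below is a small sketch of the idea, assuming a `collections.deque` with a maximum length as the memory (my own choice for this example). The sizes are tiny and the states are made up, so you can see the oldest experiences being dropped.

```python
import random
from collections import deque

MAX_LEN_REPLAY_MEMORY = 5   # tiny on purpose, to show old experiences being dropped
BATCH_SIZE = 3

# Each entry: (state, action, reward, is_terminal, next_state)
replay_memory = deque(maxlen=MAX_LEN_REPLAY_MEMORY)

for frame in range(8):
    state, next_state = [float(frame)] * 4, [float(frame + 1)] * 4  # made-up states
    replay_memory.append((state, frame % 2, 1.0, False, next_state))

# Only the newest MAX_LEN_REPLAY_MEMORY experiences are kept
oldest_state = replay_memory[0][0]

# Sample a random minibatch (without replacement) to train on
minibatch = random.sample(list(replay_memory), BATCH_SIZE)
print(len(replay_memory), oldest_state[0], len(minibatch))  # prints "5 3.0 3"
```

A deque with `maxlen` throws away the oldest entry by itself when a new one is appended, so you do not have to trim the memory by hand.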


In [4]:
import random
replay_memory = [] # (state, action, reward, terminalstate, state_t+1)
epsilon = 0.1
BATCH_SIZE = 32
GAMMA = 0.9
MAX_LEN_REPLAY_MEMORY = 30000
FRAMES_TO_PLAY = 300001
MIN_FRAMES_FOR_LEARNING = 1000

for i_epoch in range(FRAMES_TO_PLAY):
    
    ### Select an action and perform this
    ### EXERCISE: this is where your network should play and try to come as far as possible!
    ### You have to implement epsilon-annealing yourself
    if random.random() <= epsilon:
        action = env.action_space.sample() 
    else:
        pred_q = sess.run(predictedreward, feed_dict={networkstate:[observation]})
        action = np.argmax(pred_q)
        
    newobservation, reward, terminal, info = env.step(action)

    ### I prefer that my agent gets 0 reward if it dies
    if terminal: 
        reward = 0
        
    ### Add the observation to our replay memory
    replay_memory.append((observation, action, reward, terminal, newobservation))
    
    ### Reset the environment if the agent died
    if terminal: 
        newobservation = env.reset()
    observation = newobservation
    
    ### Learn once we have enough frames to start learning
    if len(replay_memory) > MIN_FRAMES_FOR_LEARNING: 
        experiences = random.sample(replay_memory, BATCH_SIZE)
        totrain = [] # (state, action, delayed_reward)
        
        ### Calculate the predicted reward
        nextstates = [var[4] for var in experiences]
        pred_reward = sess.run(predictedreward, feed_dict={networkstate:nextstates})
        
        ### Set the "ground truth": the value our network has to predict:
        for index in range(BATCH_SIZE):
            state, action, reward, terminalstate, newstate = experiences[index]
            predicted_reward = max(pred_reward[index])
            
            if terminalstate:
                delayedreward = reward
            else:
                delayedreward = reward + GAMMA*predicted_reward
            totrain.append((state, action, delayedreward))
            
        ### Feed the train batch to the algorithm 
        states = [var[0] for var in totrain]
        actions = [var[1] for var in totrain]
        rewards = [var[2] for var in totrain]
        _, l, summary = sess.run([optimizer, loss, merged_summary], feed_dict={networkstate:states, networkaction: actions, networkreward: rewards})
        

        ### If our memory is too big: remove the first element
        if len(replay_memory) > MAX_LEN_REPLAY_MEMORY:
            replay_memory = replay_memory[1:]

        ### Show the progress 
        if i_epoch%100==1:
            summary_writer.add_summary(summary, i_epoch)
        if i_epoch%1000==1:
            print("Epoch %d, loss: %f" % (i_epoch,l))

[2017-07-03 12:50:28,764] Starting new video recorder writing to /tmp/cartpole-experiment-1/openaigym.video.0.26236.video000001.mp4
[2017-07-03 12:50:28,884] Starting new video recorder writing to /tmp/cartpole-experiment-1/openaigym.video.0.26236.video000008.mp4
[2017-07-03 12:50:29,016] Starting new video recorder writing to /tmp/cartpole-experiment-1/openaigym.video.0.26236.video000027.mp4
[2017-07-03 12:50:29,193] Starting new video recorder writing to /tmp/cartpole-experiment-1/openaigym.video.0.26236.video000064.mp4


Epoch 1001, loss: 0.878417


[2017-07-03 12:50:29,586] Starting new video recorder writing to /tmp/cartpole-experiment-1/openaigym.video.0.26236.video000125.mp4
[2017-07-03 12:50:30,450] Starting new video recorder writing to /tmp/cartpole-experiment-1/openaigym.video.0.26236.video000216.mp4


Epoch 2001, loss: 2.612569
Epoch 3001, loss: 2.125304


[2017-07-03 12:50:31,621] Starting new video recorder writing to /tmp/cartpole-experiment-1/openaigym.video.0.26236.video000343.mp4


Epoch 4001, loss: 0.685737
Epoch 5001, loss: 0.848861


[2017-07-03 12:50:33,332] Starting new video recorder writing to /tmp/cartpole-experiment-1/openaigym.video.0.26236.video000512.mp4


Epoch 6001, loss: 4.561125
Epoch 7001, loss: 1.534148
Epoch 8001, loss: 0.372116
Epoch 9001, loss: 0.539587


[2017-07-03 12:50:36,870] Starting new video recorder writing to /tmp/cartpole-experiment-1/openaigym.video.0.26236.video000729.mp4


Epoch 10001, loss: 0.227657
Epoch 11001, loss: 0.071213
Epoch 12001, loss: 0.081353
Epoch 13001, loss: 3.078383
Epoch 14001, loss: 0.184846
Epoch 15001, loss: 0.426586
Epoch 16001, loss: 0.040447
Epoch 17001, loss: 0.969385
Epoch 18001, loss: 0.104043
Epoch 19001, loss: 1.657083
Epoch 20001, loss: 0.097585
Epoch 21001, loss: 0.051327
Epoch 22001, loss: 0.182340
Epoch 23001, loss: 0.028440
Epoch 24001, loss: 0.038255
Epoch 25001, loss: 0.084144
Epoch 26001, loss: 0.156991
Epoch 27001, loss: 0.009564
Epoch 28001, loss: 0.012276
Epoch 29001, loss: 0.022192
Epoch 30001, loss: 0.072836
Epoch 31001, loss: 0.036808
Epoch 32001, loss: 1.623818
Epoch 33001, loss: 0.045964
Epoch 34001, loss: 0.124939
Epoch 35001, loss: 0.027984
Epoch 36001, loss: 0.092817
Epoch 37001, loss: 0.024530
Epoch 38001, loss: 0.014200
Epoch 39001, loss: 0.021270
Epoch 40001, loss: 0.022633
Epoch 41001, loss: 0.021919
Epoch 42001, loss: 1.377276
Epoch 43001, loss: 0.021066
Epoch 44001, loss: 0.023433
Epoch 45001, loss: 0

[2017-07-03 12:51:12,883] Starting new video recorder writing to /tmp/cartpole-experiment-1/openaigym.video.0.26236.video001000.mp4


Epoch 48001, loss: 0.017084
Epoch 49001, loss: 0.032988
Epoch 50001, loss: 0.017523
Epoch 51001, loss: 0.020461
Epoch 52001, loss: 0.033347
Epoch 53001, loss: 0.053939
Epoch 54001, loss: 0.029761
Epoch 55001, loss: 0.038916
Epoch 56001, loss: 0.026225
Epoch 57001, loss: 0.017586
Epoch 58001, loss: 0.014760
Epoch 59001, loss: 0.015527
Epoch 60001, loss: 0.013314
Epoch 61001, loss: 0.012487
Epoch 62001, loss: 0.020720
Epoch 63001, loss: 0.008561
Epoch 64001, loss: 0.009799
Epoch 65001, loss: 0.007222
Epoch 66001, loss: 0.010561
Epoch 67001, loss: 0.007533
Epoch 68001, loss: 0.007155
Epoch 69001, loss: 0.623195
Epoch 70001, loss: 0.517900
Epoch 71001, loss: 0.007171
Epoch 72001, loss: 0.003998
Epoch 73001, loss: 0.003392
Epoch 74001, loss: 0.005929
Epoch 75001, loss: 0.003257
Epoch 76001, loss: 0.006366
Epoch 77001, loss: 0.006322
Epoch 78001, loss: 0.917649
Epoch 79001, loss: 0.013265
Epoch 80001, loss: 0.006011
Epoch 81001, loss: 0.507124
Epoch 82001, loss: 0.408160
Epoch 83001, loss: 0

KeyboardInterrupt: 

## Testing the algorithm
Now we have a trained network that gives us the expected $Q(s,a)$ for a certain state. We can use this to balance the stick (and see how long it lasts) and see what the network predicts at each frame:


In [8]:
def display_frames_as_gif(frames, filename_gif = None):
    """
    Displays a list of frames as a gif, with controls
    """
    plt.figure(figsize=(frames[0].shape[1] / 72.0, frames[0].shape[0] / 72.0), dpi = 72)
    patch = plt.imshow(frames[0])
    plt.axis('off')

    def animate(i):
        patch.set_data(frames[i])

    anim = animation.FuncAnimation(plt.gcf(), animate, frames = len(frames), interval=50)
    if filename_gif: 
        anim.save(filename_gif, writer = 'imagemagick', fps=20)
    display(display_animation(anim, default_mode='loop'))

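### Finish the episode that was still running when training stopped: the Monitor
### wrapper only allows env.reset() once the current episode is over.
### Note: if that episode has already ended, env.step raises ResetNeeded.
terminal = False
while not terminal:
    action = env.action_space.sample()
    newobservation, reward, terminal, info = env.step(action)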

### Play till we are dead
for _ in range(100):
    observation = env.reset()
    term = False
    predicted_q = []
    frames = []
    while not term:
        rgb_observation = env.render(mode = 'rgb_array')
        frames.append(rgb_observation)
        pred_q = sess.run(predictedreward, feed_dict={networkstate:[observation]})
        predicted_q.append(pred_q)
        action = np.argmax(pred_q)
        observation, _, term, _ = env.step(action)
    print(len(frames))   
    
    
### Plot the replay!
#display_frames_as_gif(frames,filename_gif='dqn_run.gif')

ResetNeeded: Trying to step environment which is currently done. While the monitor is active for CartPole-v0, you cannot step beyond the end of an episode. Call 'env.reset()' to start the next episode.

In [11]:

# terminal = False
# while not terminal:

#     action = env.action_space.sample() 
#     newobservation, reward, terminal, info = env.step(action)
env.close()
gym.upload('/tmp/cartpole-experiment-1', api_key='YOUR_API_KEY')  # fill in your own OpenAI Gym API key



[2017-07-03 12:55:51,134] Finished writing results. You can upload them to the scoreboard via gym.upload('/tmp/cartpole-experiment-1')
[2017-07-03 12:55:51,143] [CartPole-v0] Uploading 1853 episodes of training data
[2017-07-03 12:55:53,288] [CartPole-v0] Uploading videos of 11 training episodes (35235 bytes)
[2017-07-03 12:55:53,765] [CartPole-v0] Creating evaluation object from /tmp/cartpole-experiment-1 with learning curve and training video
[2017-07-03 12:55:54,170] 
****************************************************
You successfully uploaded your evaluation on CartPole-v0 to
OpenAI Gym! You can find it at:

    https://gym.openai.com/evaluations/eval_42HU0CptTAWUBmpweA7vbw

****************************************************


In [None]:
plt.plot([var[0] for var in predicted_q])
plt.legend(['left', 'right'])
plt.xlabel("frame")
plt.ylabel('predicted Q(s,a)')

## Handling difficult situations - team up with your robot
You can see in the graph above that our q-function, apart from the final mistake it made, has a good idea of how well it is doing. At moments when the pole is tilting sideways the maximum expected reward drops. This is a good moment to team up with your robot and guide him when he is in trouble.

Collaborating is easy: if your robot does not know what to do, we can ask the user to provide input. The initial state the robot is in gives us a lot of information: $Q(S,a)$ tells us how much reward the robot expects for the next frames of its run. If during execution of the robot's strategy the maximum expected $Q$ drops a bit below this number, we can interpret this as the robot being in a dire situation. We then ask the user to say if the cart should move left or right.

Note that in the graph above the agent died, even though it expected a lot of reward. This method is not foolproof, but does help the agent to survive longer. 
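
The decision itself fits in a few lines. Here is a minimal sketch with a hypothetical helper `needs_human_help`, made-up Q-values, and an assumed margin of 0.2 below the value of the initial state. How big "a bit below" should be is something you have to tune yourself.

```python
def needs_human_help(q_values, threshold):
    """True when even the best action is expected to do worse than the threshold."""
    return max(q_values) <= threshold


# Hypothetical Q-values [left, right] for the initial state and two later frames
initial_q = [9.6, 9.8]
threshold = max(initial_q) - 0.2   # the margin of 0.2 is a tunable assumption

print(needs_human_help([9.7, 9.5], threshold))  # agent is still confident
print(needs_human_help([8.9, 9.1], threshold))  # best action looks risky
```

A smaller margin means the agent asks for help more often, a larger margin means it only asks when things look really bad.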

In [None]:
%matplotlib inline
plt.ion()
observation = env.reset()

### Python 2/3 compatibility: raw_input was renamed to input in Python 3
try:
    input = raw_input
except NameError:
    pass

### We predict the reward for the initial state. If we drop slightly below this ideal reward, let the human take over. 
THRESHOLD = np.max(sess.run(predictedreward, feed_dict={networkstate:[observation]})) - 0.2
TIME_DELAY = 0.5 # Seconds between frames 
terminated = False
while not terminated:
    ### Show the current status
    now = env.render(mode = 'rgb_array')
    plt.imshow(now)
    plt.show()

    ### See if our agent thinks it is safe to move on its own
    pred_reward = sess.run(predictedreward, feed_dict={networkstate:[observation]})
    maxexpected = np.max(pred_reward)
    if maxexpected > THRESHOLD: 
        action = np.argmax(pred_reward)
        print("Max expected: " + str(maxexpected))
        time.sleep(TIME_DELAY)
    else:
        ### Not safe: let the user select an action!
        action = -1
        while action not in (0, 1):
            try:
                action = int(input("Max expected: " + str(maxexpected) + " left (0) or right (1): "))
                print("Performing: " + str(action))
            except ValueError:
                ### Not a number: ask again
                pass
    
    ### Perform the action
    observation, _, terminated, _ = env.step(action)

print("Unfortunately, the agent died...")

### Exercises
Now that you and your neural network can balance a stick there are many things you can do to improve. As everyone's skills are different I wrote down some ideas you can try:
#### Machine learning starter: 
- Improve the neural network. You can toy around with layers (size, type), tune the hyperparameters, or many more. 
- Toy around with the value of gamma, and visualise for several values what kind of behaviour the agent shows. Is the agent more careful with a higher gamma value?
#### Tensorflow starter: 
- If you don't have a lot of experience you can either try to improve the neural network, or you can experiment with the Tensorboard tool. Try to add plots of the average reward during training. If you implemented epsilon-greedy exploration this number should go up during training. 
#### Reinforcement learning starter: 
- If our agent only performed random actions it would die pretty often during training. This means that it would have a good idea what to do in its start configurations, but might have a problem once it has survived for a longer time. Epsilon-greedy exploration prevents this. With this method you roll a die: with probability epsilon you take a random action, otherwise you take the action the agent thinks is best. You can either set epsilon to a specific value (0.25? 0.1?) or start with a high value and gradually lower it, so the agent explores a lot at first and relies more on what it has learned later on.
- Team up with your agent! We already help our agent when he thinks he is in a difficult situation, we could also let it ask for help during training. By letting the agent ask for help with probability epsilon you explore the state space in a way that makes more sense than random exploration, and this will give you a better agent. 
#### Reinforcement learning intermediate: 
- Right now we only visualise the loss, which is not an indication of how good the network is. According to the paper [Playing Atari with Deep Reinforcement Learning](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) the average expected $Q$ should go up during learning (in combination with epsilon-greedy exploration). 
- Arthur Juliani suggests that you can use a [target network](https://medium.com/@awjuliani/simple-reinforcement-learning-with-tensorflow-part-4-deep-q-networks-and-beyond-8438a3e2b8df). During training your network is very "unstable": it "swings" in all directions, so it can take a long time to converge. You can add a second neural network (with exactly the same layout as the first one) that calculates the predicted reward of the next state. During training, every $X$ frames, you set the weights of your target network equal to the weights of your other network. 
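
To get you started on two of these exercises, here is a rough sketch of a linear epsilon schedule and of the "copy the weights every $X$ frames" idea. Everything in it is an assumption for illustration: the function names, the start and end values of epsilon, the number of frames, and the dictionary that stands in for the real network weights. In Tensorflow you would copy the actual variables instead.

```python
import copy


def annealed_epsilon(frame, start=1.0, end=0.1, anneal_frames=10000):
    """Linearly lower epsilon from start to end, then keep it at end."""
    fraction = min(1.0, frame / float(anneal_frames))
    return start + fraction * (end - start)


def sync_target(online_weights, target_weights, frame, sync_every=1000):
    """Copy the online weights into the target network every sync_every frames."""
    if frame % sync_every == 0:
        return copy.deepcopy(online_weights)
    return target_weights


# Stand-in weights: in the real network these would be the W and B variables
online = {"w1": [0.5, -0.2]}
target = copy.deepcopy(online)

for frame in range(1, 2001):
    epsilon = annealed_epsilon(frame)
    online["w1"][0] += 0.001            # pretend a training step changed the weights
    target = sync_target(online, target, frame)

print(annealed_epsilon(0), annealed_epsilon(5000), annealed_epsilon(20000))
```

Between two syncs the target network stays frozen, so the value the network is trained towards does not move with every training step.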


### Conclusion
In part two we implemented a deep q-network in Tensorflow, and used it to control a cartpole. We saw that the network can "know" when it has problems, and then teamed up with our agent to help him out. Hopefully you enjoyed working with neural networks, the OpenAI gym, and working together with your agent. 

Initially I wanted to dive into the Atari game environments and skip the CartPole environment for the deep q-networks. Unfortunately, training takes too long (24 hours) before the agent is capable of exercising really cool moves. As I still think it is a lot of fun to learn how to play Atari games I made a third part with some exercises you can take a look at. 

### Acknowledgments 
This blogpost is the first part of my TRADR summer school workshop on using human input in reinforcement learning algorithms. More information can be found [on their homepage](https://sites.google.com/view/tradr/home).
