<a href="https://colab.research.google.com/github/vijender412/Colab/blob/master/colab_Reinforcement_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Simple reinforcement learning

This Colab solves CartPole from OpenAI Gym with vanilla policy gradients (PG), implemented in TensorFlow (the 1.x graph API).
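
As a rough summary in standard REINFORCE notation (the symbols here are assumptions for illustration and do not appear in the notebook), the update used by the script follows an estimate of the policy gradient of the form

$$
\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{i=1}^{N} \sum_{t=0}^{T_i - 1} \hat{A}_{i,t} \, \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}),
\qquad M = \sum_{i=1}^{N} T_i ,
$$

where $N$ is the number of games played per update, $T_i$ is the length of game $i$, $\pi_\theta(a \mid s)$ is the probability the network assigns to action $a$ in state $s$, and $\hat{A}_{i,t}$ is the discounted return from step $t$ onward, normalized across all steps in the batch. The script minimizes the cross-entropy $-\log \pi_\theta(a \mid s)$, so scaling its gradient by $\hat{A}_{i,t}$ and taking a descent step moves $\theta$ in the direction above.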

## Install necessary packages
Make sure to include the X virtual framebuffer (`xvfb`), a display server that implements the X11 protocol in memory. It lets us render frames without a physical display and write the results out to a GIF.

In [0]:
%%bash
apt-get update
apt-get install -y python-numpy python-dev cmake zlib1g-dev libjpeg-dev libav-tools xorg-dev python-opengl libboost-all-dev libsdl2-dev swig libffi-dev
apt-get install -y xvfb
pip install gym
pip install gym[atari]
pip install moviepy

## Load Python packages


In [0]:
from google.colab import files  # To download gif output.
import imageio  
imageio.plugins.ffmpeg.download()  # ffmpeg for imageio to write gif.

## Write Python to file
Later we will run this file under `xvfb`, which is necessary because Colab runs on a remote server that has no display to render to. Instead, we write the results to a GIF and view that.

In [0]:
%%writefile cartpole.py
'''Solve cartpole using policy gradients.

source: https://github.com/ageron/handson-ml/blob/master/16_reinforcement_learning.ipynb'''

import gym
import numpy as np
import tensorflow as tf

from moviepy.editor import ImageSequenceClip

tf.reset_default_graph()

# Inputs:
# Num	Observation	Min	Max
# 0	Cart Position	-2.4	2.4
# 1	Cart Velocity	-Inf	Inf
# 2	Pole Angle	~ -41.8°	~ 41.8°
# 3	Pole Velocity At Tip	-Inf	Inf
# Outputs: Left/right

n_inputs = 4  
n_hidden = 4
n_outputs = 1  # Left/right.

learning_rate = 0.01

initializer = tf.contrib.layers.variance_scaling_initializer()

X = tf.placeholder(tf.float32, shape=[None, n_inputs])

# Build network.
hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.elu, kernel_initializer=initializer)
logits = tf.layers.dense(hidden, n_outputs)
outputs = tf.nn.sigmoid(logits)  # Probability of action 0 (left).
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)

y = 1. - tf.to_float(action)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
gradients = [grad for grad, variable in grads_and_vars]
gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
training_op = optimizer.apply_gradients(grads_and_vars_feed)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

def discount_rewards(rewards, discount_rate):
    discounted_rewards = np.zeros(len(rewards))
    cumulative_rewards = 0
    for step in reversed(range(len(rewards))):
        cumulative_rewards = rewards[step] + cumulative_rewards * discount_rate
        discounted_rewards[step] = cumulative_rewards
    return discounted_rewards

def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted_rewards = [discount_rewards(rewards, discount_rate) for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean)/reward_std for discounted_rewards in all_discounted_rewards]

  
env = gym.make("CartPole-v0")

n_games_per_update = 10
n_max_steps = 1000
n_iterations = 100
save_iterations = 10
discount_rate = 0.95

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        print("\rIteration: {}".format(iteration), end="")
        all_rewards = []
        all_gradients = []
        for game in range(n_games_per_update):
            current_rewards = []
            current_gradients = []
            obs = env.reset()
            for step in range(n_max_steps):
                action_val, gradients_val = sess.run([action, gradients], feed_dict={X: obs.reshape(1, n_inputs)})
                obs, reward, done, info = env.step(action_val[0][0])
                current_rewards.append(reward)
                current_gradients.append(gradients_val)
                if done:
                    break
            all_rewards.append(current_rewards)
            all_gradients.append(current_gradients)

        all_rewards = discount_and_normalize_rewards(all_rewards, discount_rate=discount_rate)
        feed_dict = {}
        for var_index, gradient_placeholder in enumerate(gradient_placeholders):
            # Modulate the gradient with advantage/reward 
            mean_gradients = np.mean([reward * all_gradients[game_index][step][var_index]
                                      for game_index, rewards in enumerate(all_rewards)
                                          for step, reward in enumerate(rewards)], axis=0)
            feed_dict[gradient_placeholder] = mean_gradients
        sess.run(training_op, feed_dict=feed_dict)
        if iteration % save_iterations == 0:
            saver.save(sess, "./my_policy_net_pg.ckpt")
    # Write out result at the end.
    obs = env.reset()
    steps = []
    done = False
    while not done:
        s = env.render('rgb_array')
        steps.append(s)
        action_val, gradients_val = sess.run([action, gradients],
                                             feed_dict={X: obs.reshape(1, n_inputs)})
        obs, reward, done, info = env.step(action_val[0][0])
      
clip = ImageSequenceClip(steps, fps=30)
clip.write_gif('result.gif', fps=30)
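
To make the reward handling in `cartpole.py` more concrete, the sketch below repeats the two helper functions from the script and runs them on toy data. The reward lists and the discount rate of 0.8 are illustrative assumptions chosen so the arithmetic is easy to follow by hand; they are not the values used in training (the script uses 0.95).

```python
import numpy as np


def discount_rewards(rewards, discount_rate):
    """Return, for each step, the reward plus the discounted sum of later rewards."""
    discounted_rewards = np.zeros(len(rewards))
    cumulative_rewards = 0
    # Walk backwards so each step can reuse the total already computed for the next one.
    for step in reversed(range(len(rewards))):
        cumulative_rewards = rewards[step] + cumulative_rewards * discount_rate
        discounted_rewards[step] = cumulative_rewards
    return discounted_rewards


def discount_and_normalize_rewards(all_rewards, discount_rate):
    """Discount every game, then standardize using the mean/std over all steps of all games."""
    all_discounted_rewards = [discount_rewards(rewards, discount_rate)
                              for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / reward_std
            for discounted_rewards in all_discounted_rewards]


# Toy example (assumed values): one game with three steps.
# Step 2: -50; step 1: 0 + 0.8 * -50 = -40; step 0: 10 + 0.8 * -40 = -22.
single_game = discount_rewards([10, 0, -50], discount_rate=0.8)
print(single_game)  # approximately [-22. -40. -50.]

# Two toy games of different lengths, normalized together.
normalized = discount_and_normalize_rewards([[10, 0, -50], [10, 20]],
                                            discount_rate=0.8)
print(normalized)
```

After normalization, steps with a better-than-average discounted return get a positive weight and the rest get a negative weight. The training loop multiplies each step's gradient by that weight, which is what makes good actions more likely and bad ones less likely.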

## Train model and view results

In [0]:
%%bash
xvfb-run -s "-screen 0 1400x900x24" python cartpole.py
# Copy to colab folder so we can visualize it.
cp result.gif /usr/local/share/jupyter/nbextensions/google.colab/

In [0]:
%%html
<img src='/nbextensions/google.colab/result.gif'/>

In [0]:
# Alternatively, download the GIF to your local machine.
files.download('result.gif')

## Additional reading
* [Deep Reinforcement Learning on GCP](https://cloud.google.com/blog/products/ai-machine-learning/deep-reinforcement-learning-on-gcp-using-hyperparameters-and-cloud-ml-engine-to-best-openai-gym-games)
* [Scale the model to solve Atari Breakout](https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/blogs/rl-on-gcp)

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.