# REINFORCE in TensorFlow

This notebook implements a basic reinforce algorithm a.k.a. policy gradient for CartPole env.

It has been deliberately written to be as simple and human-readable.


The notebook assumes that you have [openai gym](https://github.com/openai/gym) installed.

In case you're running on a server, [use xvfb](https://github.com/openai/gym#rendering-on-a-server)

In [1]:
#XVFB will be launched if you run on a server
import os
if type(os.environ.get("DISPLAY")) is not str or len(os.environ.get("DISPLAY"))==0:
    !bash ../xvfb start
    %env DISPLAY=:1

bash: ../xvfb: No such file or directory
env: DISPLAY=:1


In [0]:
import gym
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make("CartPole-v0")

#gym compatibility: unwrap TimeLimit
if hasattr(env,'env'):
    env=env.env

env.reset()
n_actions = env.action_space.n
state_dim = env.observation_space.shape

#plt.imshow(env.render("rgb_array"))

# Building the policy network

For REINFORCE algorithm, we'll need a model that predicts action probabilities given states.

For numerical stability, please __do not include the softmax layer into your network architecture__. 

We'll use softmax or log-softmax where appropriate.

In [0]:
import tensorflow as tf

tf.reset_default_graph()
#create input variables. We only need <s,a,R> for REINFORCE
states = tf.placeholder('float32',(None,)+state_dim,name="states")
actions = tf.placeholder('int32',name="action_ids")
cumulative_rewards = tf.placeholder('float32', name="cumulative_returns")

In [5]:
tf.get_default_graph().get_operations()

[<tf.Operation 'states' type=Placeholder>,
 <tf.Operation 'action_ids' type=Placeholder>,
 <tf.Operation 'cumulative_returns' type=Placeholder>]

In [6]:

#<define network graph using raw tf or any deep learning library>
model=tf.keras.Sequential([
    tf.keras.layers.Dense(256,activation='relu'),
    tf.keras.layers.Dense(128,activation='relu'),
    tf.keras.layers.Dense(n_actions,activation='linear')
])
logits = model(states)#<linear outputs (symbolic) of your network>

policy = tf.nn.softmax(logits)
log_policy = tf.nn.log_softmax(logits)

Instructions for updating:
Colocations handled automatically by placer.


In [0]:
#tf.get_default_graph().get_operations()

In [0]:
#utility function to pick action in one given state
get_action_proba = lambda s: policy.eval({states:[s]})[0] 

#### Loss function and updates

We now need to define objective and update over policy gradient.

Our objective function is

$$ J \approx  { 1 \over N } \sum  _{s_i,a_i} \pi_\theta (a_i | s_i) \cdot G(s_i,a_i) $$


Following the REINFORCE algorithm, we can define our objective as follows: 

$$ \hat J \approx { 1 \over N } \sum  _{s_i,a_i} log \pi_\theta (a_i | s_i) \cdot G(s_i,a_i) $$

When you compute gradient of that function over network weights $ \theta $, it will become exactly the policy gradient.


In [0]:
x = tf.constant([1, 4,10,1])
y = tf.constant([2, 5,10,1])
z = tf.constant([3, 6,10,1])
tf.stack([x, y, z])  # [[1, 4], [2, 5], [3, 6]] (Pack along first dim.)
tf.stack([x, y, z], axis=-1)  # [[1, 2, 3], [4, 5, 6]]

<tf.Tensor 'stack_1:0' shape=(4, 3) dtype=int32>

In [8]:
#get probabilities for parti
indices = tf.stack([tf.range(tf.shape(log_policy)[0]),actions],axis=-1)
print(indices)
log_policy_for_actions = tf.gather_nd(log_policy,indices)
print(log_policy_for_actions)
indices = tf.stack([tf.range(tf.shape(policy)[0]),actions],axis=-1)
policy_for_actions=tf.gather_nd(policy,indices)

Tensor("stack:0", shape=(?, 2), dtype=int32)
Tensor("GatherNd:0", shape=(?,), dtype=float32)


In [0]:
# policy objective as in the last formula. please use mean, not sum.
# note: you need to use log_policy_for_actions to get log probabilities for actions taken
#cum_reward_for_actions=tf.gather_nd(cumulative_rewards,indices)
J = tf.reduce_mean(log_policy_for_actions*cumulative_rewards)#<YOUR CODE


In [0]:
print(cumulative_rewards.shape)

<unknown>


In [0]:
#regularize with entropy
entropy =0 - tf.reduce_mean(policy_for_actions*log_policy_for_actions)#<compute entropy. Don't forget the sign!>

In [11]:
#all network weights
all_weights = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)#<a list of all trainable weights in your network>

#weight updates. maximizing J is same as minimizing -J. Adding negative entropy.
loss = -J -0.1 * entropy

update = tf.train.AdamOptimizer().minimize(loss,var_list=all_weights)

Instructions for updating:
Use tf.cast instead.


### Computing cumulative rewards

In [0]:
def get_cumulative_rewards(rewards, #rewards at each step
                           gamma = 0.99 #discount for reward
                           ):
    """
    take a list of immediate rewards r(s,a) for the whole session 
    compute cumulative rewards R(s,a) (a.k.a. G(s,a) in Sutton '16)
    R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    
    The simple way to compute cumulative rewards is to iterate from last to first time tick
    and compute R_t = r_t + gamma*R_{t+1} recurrently
    
    You must return an array/list of cumulative rewards with as many elements as in the initial rewards.
    """
    
    #<your code here>
    cum_reward=[]
    for j,r_t in enumerate(rewards):
        reward=0
        for i,r in enumerate(rewards[j:]):
            reward+=(gamma**i)*r
        cum_reward.append(reward)
    
    return cum_reward
    

In [13]:
assert len(get_cumulative_rewards(range(100))) == 100
assert np.allclose(get_cumulative_rewards([0,0,1,0,0,1,0],gamma=0.9),[1.40049, 1.5561, 1.729, 0.81, 0.9, 1.0, 0.0])
assert np.allclose(get_cumulative_rewards([0,0,1,-2,3,-4,0],gamma=0.5), [0.0625, 0.125, 0.25, -1.5, 1.0, -4.0, 0.0])
assert np.allclose(get_cumulative_rewards([0,0,1,2,3,4,0],gamma=0), [0, 0, 1, 2, 3, 4, 0])
print("looks good!")

looks good!


In [0]:
def train_step(_states,_actions,_rewards):
    """given full session, trains agent with policy gradient"""
    _cumulative_rewards = get_cumulative_rewards(_rewards)
    update.run({states:_states,actions:_actions,cumulative_rewards:_cumulative_rewards})

### Playing the game

In [0]:
def generate_session(t_max=1000):
    """play env with REINFORCE agent and train at the session end"""
    
    #arrays to record session
    states,actions,rewards = [],[],[]
    
    s = env.reset()
    
    for t in range(t_max):
        
        #action probabilities array aka pi(a|s)
        action_probas = get_action_proba(s)
        
        a = np.random.choice(range(n_actions),p=action_probas)#<pick random action using action_probas>
        
        new_s,r,done,info = env.step(a)
        
        #record session history to train later
        states.append(s)
        actions.append(a)
        rewards.append(r)
        
        s = new_s
        if done: break
            
    train_step(states,actions,rewards)
            
    return sum(rewards)
        

In [16]:
s = tf.InteractiveSession()
s.run(tf.global_variables_initializer())

for i in range(100):
    
    rewards = [generate_session() for _ in range(100)] #generate new sessions
    
    print ("mean reward:%.3f"%(np.mean(rewards)))

    if np.mean(rewards) > 300:
        print ("You Win!")
        break
        

mean reward:34.530
mean reward:92.420
mean reward:178.660
mean reward:371.000
You Win!


In [0]:
s = tf.InteractiveSession()
s.run(tf.global_variables_initializer())

for i in range(100):
    
    rewards = [generate_session() for _ in range(100)] #generate new sessions
    
    print ("mean reward:%.3f"%(np.mean(rewards)))

    if np.mean(rewards) > 300:
        print ("You Win!")
        break
        


mean reward:27.590
mean reward:70.340
mean reward:129.570
mean reward:188.330
mean reward:211.530
mean reward:240.490
mean reward:235.760
mean reward:218.030
mean reward:258.470
mean reward:184.760
mean reward:298.920
mean reward:507.360
You Win!


### Results & video

In [0]:
#record sessions
import gym.wrappers
env = gym.wrappers.Monitor(gym.make("CartPole-v0"),directory="videos",force=True)
sessions = [generate_session() for _ in range(100)]
env.close()


[2017-04-08 03:29:10,315] Making new env: CartPole-v0
[2017-04-08 03:29:10,329] Clearing 6 monitor files from previous run (because force=True was provided)
[2017-04-08 03:29:10,336] Starting new video recorder writing to /home/jheuristic/Downloads/sonnet/sonnet/examples/videos/openaigym.video.0.14221.video000000.mp4
[2017-04-08 03:29:16,834] Starting new video recorder writing to /home/jheuristic/Downloads/sonnet/sonnet/examples/videos/openaigym.video.0.14221.video000001.mp4
[2017-04-08 03:29:23,689] Starting new video recorder writing to /home/jheuristic/Downloads/sonnet/sonnet/examples/videos/openaigym.video.0.14221.video000008.mp4
[2017-04-08 03:29:33,407] Starting new video recorder writing to /home/jheuristic/Downloads/sonnet/sonnet/examples/videos/openaigym.video.0.14221.video000027.mp4
[2017-04-08 03:29:45,840] Starting new video recorder writing to /home/jheuristic/Downloads/sonnet/sonnet/examples/videos/openaigym.video.0.14221.video000064.mp4
[2017-04-08 03:29:56,812] Finishe

In [0]:
#show video
from IPython.display import HTML
import os

video_names = list(filter(lambda s:s.endswith(".mp4"),os.listdir("./videos/")))

HTML("""
<video width="640" height="480" controls>
  <source src="{}" type="video/mp4">
</video>
""".format("./videos/"+video_names[-1])) #this may or may not be _last_ video. Try other indices

In [0]:
import re
import requests
import json


class Grader(object):
    def __init__(self, assignment_key, all_parts=()):
        """
        Assignment key is the way to tell Coursera which problem is being submitted.
        """
        self.submission_page = \
            'https://www.coursera.org/api/onDemandProgrammingScriptSubmissions.v1'
        self.assignment_key = assignment_key
        self.answers = {part: None for part in all_parts}

    def submit(self, email, token):
        submission = {
                    "assignmentKey": self.assignment_key,
                    "submitterEmail": email,
                    "secret": token,
                    "parts": {}
        }
        for part, output in self.answers.items():
            if output is not None:
                submission["parts"][part] = {"output": output}
            else:
                submission["parts"][part] = dict()
        request = requests.post(self.submission_page, data=json.dumps(submission))
        response = request.json()
        if request.status_code == 201:
            print('Submitted to Coursera platform. See results on assignment page!')
        elif u'details' in response and u'learnerMessage' in response[u'details']:
            print(response[u'details'][u'learnerMessage'])
        else:
            print("Unknown response from Coursera: {}".format(request.status_code))
            print(response)

    def set_answer(self, part, answer):
        """Adds an answer for submission. Answer is expected either as string, number, or
           an iterable of numbers.
           Args:
              part - str, assignment part id
              answer - answer to submit. If non iterable, appends repr(answer). If string,
                is appended as provided. If an iterable and not string, converted to
                space-delimited repr() of members.
        """
        if isinstance(answer, str):
            self.answers[part] = answer
        else:
            try:
                self.answers[part] = " ".join(map(repr, answer))
            except TypeError:
                self.answers[part] = repr(answer)


def array_to_grader(array, epsilon=1e-4):
    """Utility function to help preparing Coursera grading conditions descriptions.
    Args:
       array: iterable of numbers, the correct answers
       epslion: the generated expression will accept the answers with this absolute difference with
         provided values
    Returns:
       String. A Coursera grader expression that checks whether the user submission is in
         (array - epsilon, array + epsilon)"""
    res = []
    for element in array:
        if isinstance(element, int):
            res.append("[{0}, {0}]".format(element))
        else:
            res.append("({0}, {1})".format(element - epsilon, element + epsilon))
    return " ".join(res)


In [19]:
import sys
import numpy as np
sys.path.append("..")
#import grading


def submit_cartpole(generate_session, email, token):
    sessions = [generate_session() for _ in range(100)]
    session_rewards = np.array(sessions)
    grader = Grader("oyT3Bt7yEeeQvhJmhysb5g")
    grader.set_answer("7QKmA", int(np.mean(session_rewards)))
    grader.submit(email, token)


submit_cartpole(generate_session, "ss.ghule@ncl.res.in", "HM9E30DSYM5WoVaU")

Submitted to Coursera platform. See results on assignment page!


In [0]:
# That's all, thank you for your attention!
# Not having enough? There's an actor-critic waiting for you in the honor section.
# But make sure you've seen the videos first.