# CartPole with Policy Gradient Method
In this exercise you will build a policy-gradient agent that solves the classic CartPole task, where the goal is to balance a pole on a cart for as long as possible. You will use OpenAI Gym, which provides a reactive environment that returns an observation (state) and a reward for each action your model takes. If you are not familiar with OpenAI Gym, check out the short [tutorial](https://gym.openai.com/docs), which covers the basics of the different game environments.

In [1]:
from __future__ import division

import numpy as np
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3
import tensorflow as tf
%matplotlib inline
import matplotlib.pyplot as plt
import math

try:
    xrange = xrange  # Python 2
except NameError:
    xrange = range  # Python 3

### Loading the CartPole Environment
If you don't already have OpenAI Gym installed, use `pip install gym` to grab it.

In [2]:
import gym
env = gym.make('CartPole-v0')

[2017-09-24 10:50:15,450] Making new env: CartPole-v0


First, we will run the task using random movements. The cart can either move left or right in order to balance the pole. We will randomly choose which direction to move the cart.

## Inline Question 1:
How well do we do with random actions? (Hint: not so well.)

In [3]:
env.reset()
random_episodes = 0
reward_sum = 0
while random_episodes < 10:
    env.render()
    observation, reward, done, _ = env.step(np.random.randint(0,2))
    reward_sum += reward
    if done:
        random_episodes += 1
        print("Reward for this episode was:",reward_sum)
        reward_sum = 0
        env.reset()

('Reward for this episode was:', 16.0)
('Reward for this episode was:', 21.0)
('Reward for this episode was:', 12.0)
('Reward for this episode was:', 9.0)
('Reward for this episode was:', 12.0)
('Reward for this episode was:', 15.0)
('Reward for this episode was:', 43.0)
('Reward for this episode was:', 27.0)
('Reward for this episode was:', 14.0)
('Reward for this episode was:', 57.0)


The goal of the task is to achieve a reward of 200 per episode. For every step the agent keeps the pole in the air, it receives a +1 reward. By choosing actions at random, we only earn a couple dozen points per episode. Let's make that better with RL!
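
To put a number on "a couple dozen", the ten rewards printed above can be averaged. This is only a quick sketch: the values are copied from that single run, and another run of the random policy would print different numbers.

```python
import numpy as np

# Episode rewards copied from the random-policy output above (one particular run).
random_rewards = np.array([16.0, 21.0, 12.0, 9.0, 12.0, 15.0, 43.0, 27.0, 14.0, 57.0])

mean_reward = random_rewards.mean()
# 200 is the per-episode goal stated in the text.
fraction_of_goal = mean_reward / 200.0

print("Mean random reward: %.1f" % mean_reward)
print("Fraction of the 200-point goal: %.3f" % fraction_of_goal)
```

For this run the mean is 22.6, a little over a tenth of the goal.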

### Setting up our Neural Network agent
This time we will be using a Policy neural network that takes observations, passes them through a single hidden layer, and then produces a probability of choosing a left/right movement. To learn more about this network, see [Andrej Karpathy's blog on Policy Gradient networks](http://karpathy.github.io/2016/05/31/rl/).
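
Before building the network in TensorFlow, it may help to see the same forward pass in plain NumPy. The sketch below assumes a bias-free network with one ReLU hidden layer and a sigmoid output, as described above. The function name `policy_forward`, the random weights, and the sample observation are made up for illustration and are not part of the exercise.

```python
import numpy as np

def policy_forward(observation, W1, W2):
    """Return P(action = 1) for a batch of observations of shape [N, D]."""
    hidden = np.maximum(0.0, np.dot(observation, W1))  # ReLU hidden layer, shape [N, H]
    score = np.dot(hidden, W2)                         # one logit per observation, shape [N, 1]
    return 1.0 / (1.0 + np.exp(-score))                # sigmoid squashes the logit into (0, 1)

D, H = 4, 10                      # input size and hidden size used in this notebook
rng = np.random.RandomState(0)    # fixed seed so the sketch is repeatable
W1 = rng.normal(scale=0.5, size=(D, H))
W2 = rng.normal(scale=0.5, size=(H, 1))

obs = np.array([[0.01, -0.02, 0.03, 0.04]])  # hypothetical CartPole observation
p_action_1 = policy_forward(obs, W1, W2)
print(p_action_1.shape)
```

With all-zero weights the output is exactly 0.5, that is, the untrained policy picks either action with equal probability.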

In [4]:
# hyperparameters
H = 10 # number of hidden layer neurons
batch_size = 5 # every how many episodes to do a param update?
learning_rate = 1e-2 # feel free to play with this to train faster or more stably.
gamma = 0.99 # discount factor for reward

D = 4 # input dimensionality

In [26]:
tf.reset_default_graph()

# This defines the network, which maps an observation of the environment to
# the probability of choosing action 1 (push right) rather than action 0 (push left).
observations = tf.placeholder(tf.float32, [None,D] , name="input_x")
W1 = tf.get_variable("W1", shape=[D, H],
           initializer=tf.contrib.layers.xavier_initializer())
layer1 = tf.nn.relu(tf.matmul(observations,W1))
W2 = tf.get_variable("W2", shape=[H, 1],
           initializer=tf.contrib.layers.xavier_initializer())
score = tf.matmul(layer1,W2)
probability = tf.nn.sigmoid(score)

#From here we define the parts of the network needed for learning a good policy.
tvars = tf.trainable_variables()
input_y = tf.placeholder(tf.float32,[None,1], name="input_y")
advantages = tf.placeholder(tf.float32,name="reward_signal")

################################################################################
# TODO: Implement the loss function.                                           #
# This sends the weights in the direction of making actions that gave good     #
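# advantage (reward over time) more likely, and those that didn't less likely. #
################################################################################
# input_y is 1 when action 0 was taken and 0 when action 1 was taken, so the
# expression inside the log is the probability the policy gave to the chosen action.
loss = -tf.reduce_mean(advantages * tf.log(input_y * (1 - probability) + (1 - input_y) * probability))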
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

newGrads = tf.gradients(loss,tvars)

# Once we have collected a series of gradients from multiple episodes, we apply them.
# We don't apply gradients after every single episode, in order to average out noise in the reward signal.
adam = tf.train.AdamOptimizer(learning_rate=learning_rate) # Our optimizer
W1Grad = tf.placeholder(tf.float32,name="batch_grad1") # Placeholders to send the final gradients through when we update.
W2Grad = tf.placeholder(tf.float32,name="batch_grad2")
batchGrad = [W1Grad,W2Grad]
updateGrads = adam.apply_gradients(zip(batchGrad,tvars))
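
To see what the loss in the cell above computes, here is a NumPy sketch of the same expression on two hand-picked time steps. The helper name `policy_gradient_loss` and the numbers are illustrative assumptions. The "fake label" convention (`input_y` is 1 when action 0 was taken) follows the notebook.

```python
import numpy as np

def policy_gradient_loss(probability, input_y, advantages):
    # probability: network output P(action = 1), shape [N, 1]
    # input_y:     1 where action 0 was taken, 0 where action 1 was taken
    # advantages:  weight for each time step, shape [N, 1]
    chosen_prob = input_y * (1.0 - probability) + (1.0 - input_y) * probability
    return -np.mean(advantages * np.log(chosen_prob))

probability = np.array([[0.8], [0.8]])   # the policy prefers action 1 in both steps
input_y = np.array([[0.0], [1.0]])       # step 0 took action 1, step 1 took action 0
advantages = np.array([[1.0], [1.0]])    # both steps weighted equally here

loss = policy_gradient_loss(probability, input_y, advantages)
print("loss = %.5f" % loss)
```

The probabilities of the chosen actions are 0.8 and 0.2, so the loss is the mean of `-log(0.8)` and `-log(0.2)`. A positive advantage pushes the chosen action's probability up; a negative one pushes it down.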

### Advantage function
This function lets us weight the rewards our agent receives. In the CartPole task, we want actions that kept the pole in the air for a long time to receive a large reward, and actions that contributed to the pole falling to receive a reduced or negative reward. We do this by discounting rewards backwards from the end of the episode. Once the discounted rewards are normalized to zero mean, actions near the end come out negative, since they likely contributed to the pole falling and the episode ending, while early actions come out more positive, since they weren't responsible for the fall.
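
One way to write this weighting down, as a sketch: the symbols $G_t$ (discounted return from step $t$) and $A_t$ (normalized advantage) are notation assumed here, not names used in the notebook, and $\mu$ and $\sigma$ stand for the mean and standard deviation of the $G_t$ within one episode of rewards $r_0, \dots, r_N$.

$$
G_t = \sum_{k=0}^{N-t} \gamma^{k}\, r_{t+k} = r_t + \gamma\, G_{t+1}, \qquad G_{N+1} = 0, \qquad A_t = \frac{G_t - \mu}{\sigma}
$$

Because every CartPole step pays +1, $G_t$ shrinks as $t$ approaches the end of the episode, so after subtracting $\mu$ the last actions get a negative $A_t$ and the earliest ones a positive $A_t$.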

In [27]:
def discount_rewards(r):
    ################################################################################
    # TODO: Implement the discounted rewards function                              #
    # Return discounted rewards weighted by gamma. Each reward will be replaced    #
    # with a weighted reward that involves itself and all the rewards occurring    #
    # after it. The later the reward after it happens, the less effect it has on   #
    # the current reward's discounted reward.                                      #
    # Hint: [r0, r1, r2, ..., r_N] will look something like:                       #
    #       [(r0 + r1*gamma^1 + ... + r_N*gamma^N), (r1 + r2*gamma^1 + ...), ...]  #
    ################################################################################
    # Use a float buffer so that integer rewards are not truncated.
    ret = np.zeros_like(r, dtype=np.float64)
    prev = 0
    # Walk backwards: return at t = reward at t + gamma * return at t+1.
    for t in reversed(range(0, r.size)):
        ret[t] = r[t] + prev * gamma
        prev = ret[t]
    return ret
    ################################################################################
    #                                 END OF YOUR CODE                             #
    ################################################################################
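
As a small worked example of discounting followed by normalization, consider a hypothetical 5-step episode in which every step pays +1. The helper name `discounted_returns` is made up for this sketch, and `gamma` is set to the same 0.99 used in the hyperparameter cell.

```python
import numpy as np

gamma = 0.99  # same discount factor as in the hyperparameter cell

def discounted_returns(rewards, gamma):
    """Backward pass: return[t] = rewards[t] + gamma * return[t + 1]."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = np.ones(5)                      # hypothetical 5-step episode, +1 per step
returns = discounted_returns(rewards, gamma)
# Normalize to zero mean and unit standard deviation, as the training loop does.
advantages = (returns - returns.mean()) / returns.std()

print(returns)      # approximately [4.901, 3.940, 2.970, 1.990, 1.000]
print(advantages)   # strictly decreasing: positive at the start, negative at the end
```

All raw returns are positive. It is the mean subtraction that turns the final steps into negative advantages.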

### Running the Agent and Environment

Here we run the neural network agent, and have it act in the CartPole environment.
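
The training loop turns the network's output into an action by sampling, and records a "fake label" that the loss later uses to pick out the probability of the action actually taken. The sketch below isolates that step. The helper name `sample_action` and the fixed probability 0.8 are assumptions for illustration only.

```python
import numpy as np

def sample_action(prob_action_1, rng):
    """Sample action 1 with probability prob_action_1, otherwise action 0."""
    action = 1 if rng.uniform() < prob_action_1 else 0
    fake_label = 1 if action == 0 else 0   # notebook convention: label is 1 for action 0
    return action, fake_label

rng = np.random.RandomState(0)             # fixed seed so the sketch is repeatable
samples = [sample_action(0.8, rng) for _ in range(10000)]
fraction_action_1 = np.mean([action for action, _ in samples])

print("Fraction of samples with action 1: %.3f" % fraction_action_1)
```

Sampling, rather than always taking the more probable action, is what lets the agent keep exploring while it learns.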

In [29]:
xs,hs,dlogps,drs,ys,tfps = [],[],[],[],[],[]
running_reward = None
reward_sum = 0
episode_number = 1
total_episodes = 10000
init = tf.global_variables_initializer()

# Launch the graph
with tf.Session() as sess:
    rendering = False
    sess.run(init)
    observation = env.reset() # Obtain an initial observation of the environment

    # Reset the gradient placeholder. We will collect gradients in 
    # gradBuffer until we are ready to update our policy network. 
    gradBuffer = sess.run(tvars)
    for ix,grad in enumerate(gradBuffer):
        gradBuffer[ix] = grad * 0
    
    while episode_number <= total_episodes:
        
        # Rendering the environment slows things down, 
        # so let's only look at it once our agent is doing a good job.
        if reward_sum/batch_size > 100 or rendering == True : 
            env.render()
            rendering = True
            
        # Make sure the observation is in a shape the network can handle.
        x = np.reshape(observation,[1,D])
        
        
        ################################################################################
        # TODO: Run the policy network and get an action to take.                      #
        # Output: action                                                               #
        ################################################################################
        prob = sess.run(probability, feed_dict={observations: x})
        action = 0
        if np.random.uniform() < prob :
            action = 1
        ################################################################################
        #                                 END OF YOUR CODE                             #
        ################################################################################
             
        xs.append(x) # observation
        y = 1 if action == 0 else 0 # a "fake label"
        ys.append(y)

        ################################################################################
        # TODO: Step the environment and get new measurements                          #
        # Output: observation, reward, done                                            #
        ################################################################################
        observation, reward, done, _ = env.step(action)
        ################################################################################
        #                                 END OF YOUR CODE                             #
        ################################################################################

        reward_sum += reward

        drs.append(reward) # record reward (has to be done after we call step() to get reward for previous action)

        if done: 
            episode_number += 1
            # stack together all inputs, fake labels, and rewards for this episode
            epx = np.vstack(xs)
            epy = np.vstack(ys)
            epr = np.vstack(drs)
            tfp = tfps
            xs,hs,dlogps,drs,ys,tfps = [],[],[],[],[],[] # reset array memory

            # compute the discounted reward backwards through time
            discounted_epr = discount_rewards(epr)
            # normalize the rewards to zero mean and unit variance (helps control the gradient estimator variance)
            discounted_epr -= np.mean(discounted_epr)
            discounted_epr /= np.std(discounted_epr)
            
            # Get the gradient for this episode, and save it in the gradBuffer
            tGrad = sess.run(newGrads,feed_dict={observations: epx, input_y: epy, advantages: discounted_epr})
            for ix,grad in enumerate(tGrad):
                gradBuffer[ix] += grad
                
            # If we have completed enough episodes
            if episode_number % batch_size == 0: 
                
                ################################################################################
                # TODO: Update the policy network with our gradients and set gradBuffer to 0   #
                ################################################################################
                sess.run(updateGrads, feed_dict={W1Grad:gradBuffer[0] , W2Grad:gradBuffer[1]})
                for ix,grad in enumerate(gradBuffer):
                    gradBuffer[ix] = grad * 0
                ################################################################################
                #                                 END OF YOUR CODE                             #
                ################################################################################
                
                # Give a summary of how well our network is doing for each batch of episodes.
                running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
                print('Average reward for episode %f.  Total average reward %f.' % (reward_sum/batch_size, running_reward/batch_size))
                
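                # Note: CartPole-v0 ends every episode after at most 200 steps, so this
                # average can reach 200 but never exceed it; the loop then runs until
                # total_episodes is reached or the cell is interrupted.
                if reward_sum/batch_size > 200: 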
                    print("Task solved in",episode_number,'episodes!')
                    break
                    
                reward_sum = 0
            
            observation = env.reset()
        
print(episode_number,'Episodes completed.')

Average reward for episode 15.000000.  Total average reward 15.000000.
Average reward for episode 27.000000.  Total average reward 15.120000.
Average reward for episode 25.200000.  Total average reward 15.220800.
Average reward for episode 17.200000.  Total average reward 15.240592.
Average reward for episode 22.800000.  Total average reward 15.316186.
Average reward for episode 22.400000.  Total average reward 15.387024.
Average reward for episode 18.600000.  Total average reward 15.419154.
Average reward for episode 23.800000.  Total average reward 15.502962.
Average reward for episode 25.400000.  Total average reward 15.601933.
Average reward for episode 49.000000.  Total average reward 15.935913.
Average reward for episode 17.400000.  Total average reward 15.950554.
Average reward for episode 14.400000.  Total average reward 15.935049.
Average reward for episode 31.200000.  Total average reward 16.087698.
Average reward for episode 34.400000.  Total average reward 16.270821.
Averag

Average reward for episode 108.000000.  Total average reward 49.350703.
Average reward for episode 111.200000.  Total average reward 49.969196.
Average reward for episode 156.200000.  Total average reward 51.031504.
Average reward for episode 103.600000.  Total average reward 51.557189.
Average reward for episode 137.400000.  Total average reward 52.415617.
Average reward for episode 123.000000.  Total average reward 53.121461.
Average reward for episode 141.200000.  Total average reward 54.002247.
Average reward for episode 184.400000.  Total average reward 55.306224.
Average reward for episode 125.000000.  Total average reward 56.003162.
Average reward for episode 159.600000.  Total average reward 57.039130.
Average reward for episode 134.000000.  Total average reward 57.808739.
Average reward for episode 167.600000.  Total average reward 58.906652.
Average reward for episode 138.800000.  Total average reward 59.705585.
Average reward for episode 175.200000.  Total average reward 60.

Average reward for episode 191.600000.  Total average reward 136.106026.
Average reward for episode 191.600000.  Total average reward 136.660966.
Average reward for episode 186.200000.  Total average reward 137.156356.
Average reward for episode 185.200000.  Total average reward 137.636792.
Average reward for episode 191.200000.  Total average reward 138.172424.
Average reward for episode 188.400000.  Total average reward 138.674700.
Average reward for episode 200.000000.  Total average reward 139.287953.
Average reward for episode 200.000000.  Total average reward 139.895074.
Average reward for episode 198.000000.  Total average reward 140.476123.
Average reward for episode 192.000000.  Total average reward 140.991362.
Average reward for episode 200.000000.  Total average reward 141.581448.
Average reward for episode 186.800000.  Total average reward 142.033634.
Average reward for episode 194.000000.  Total average reward 142.553297.
Average reward for episode 183.400000.  Total avera

Average reward for episode 172.000000.  Total average reward 169.301011.
Average reward for episode 200.000000.  Total average reward 169.608001.
Average reward for episode 194.400000.  Total average reward 169.855921.
Average reward for episode 190.800000.  Total average reward 170.065361.
Average reward for episode 189.800000.  Total average reward 170.262708.
Average reward for episode 200.000000.  Total average reward 170.560081.
Average reward for episode 200.000000.  Total average reward 170.854480.
Average reward for episode 191.600000.  Total average reward 171.061935.
Average reward for episode 171.800000.  Total average reward 171.069316.
Average reward for episode 185.000000.  Total average reward 171.208623.
Average reward for episode 177.600000.  Total average reward 171.272536.
Average reward for episode 191.800000.  Total average reward 171.477811.
Average reward for episode 200.000000.  Total average reward 171.763033.
Average reward for episode 200.000000.  Total avera

Average reward for episode 193.000000.  Total average reward 186.325895.
Average reward for episode 200.000000.  Total average reward 186.462636.
Average reward for episode 177.800000.  Total average reward 186.376009.
Average reward for episode 166.200000.  Total average reward 186.174249.
Average reward for episode 192.200000.  Total average reward 186.234507.
Average reward for episode 190.400000.  Total average reward 186.276162.
Average reward for episode 188.000000.  Total average reward 186.293400.
Average reward for episode 180.400000.  Total average reward 186.234466.
Average reward for episode 178.000000.  Total average reward 186.152121.
Average reward for episode 179.200000.  Total average reward 186.082600.
Average reward for episode 198.400000.  Total average reward 186.205774.
Average reward for episode 200.000000.  Total average reward 186.343716.
Average reward for episode 195.200000.  Total average reward 186.432279.
Average reward for episode 183.800000.  Total avera

Average reward for episode 197.400000.  Total average reward 193.148727.
Average reward for episode 200.000000.  Total average reward 193.217240.
Average reward for episode 200.000000.  Total average reward 193.285068.
Average reward for episode 200.000000.  Total average reward 193.352217.
Average reward for episode 200.000000.  Total average reward 193.418695.
Average reward for episode 196.200000.  Total average reward 193.446508.
Average reward for episode 200.000000.  Total average reward 193.512043.
Average reward for episode 197.800000.  Total average reward 193.554922.
Average reward for episode 200.000000.  Total average reward 193.619373.
Average reward for episode 200.000000.  Total average reward 193.683180.
Average reward for episode 200.000000.  Total average reward 193.746348.
Average reward for episode 200.000000.  Total average reward 193.808884.
Average reward for episode 195.200000.  Total average reward 193.822795.
Average reward for episode 200.000000.  Total avera

Average reward for episode 200.000000.  Total average reward 196.960116.
Average reward for episode 200.000000.  Total average reward 196.990515.
Average reward for episode 200.000000.  Total average reward 197.020610.
Average reward for episode 200.000000.  Total average reward 197.050404.
Average reward for episode 200.000000.  Total average reward 197.079900.
Average reward for episode 200.000000.  Total average reward 197.109101.
Average reward for episode 200.000000.  Total average reward 197.138010.
Average reward for episode 200.000000.  Total average reward 197.166630.
Average reward for episode 200.000000.  Total average reward 197.194964.
Average reward for episode 200.000000.  Total average reward 197.223014.
Average reward for episode 200.000000.  Total average reward 197.250784.
Average reward for episode 200.000000.  Total average reward 197.278276.
Average reward for episode 200.000000.  Total average reward 197.305493.
Average reward for episode 200.000000.  Total avera

Average reward for episode 200.000000.  Total average reward 198.718236.
Average reward for episode 200.000000.  Total average reward 198.731054.
Average reward for episode 200.000000.  Total average reward 198.743743.
Average reward for episode 200.000000.  Total average reward 198.756306.
Average reward for episode 200.000000.  Total average reward 198.768743.
Average reward for episode 200.000000.  Total average reward 198.781055.
Average reward for episode 200.000000.  Total average reward 198.793245.
Average reward for episode 200.000000.  Total average reward 198.805312.
Average reward for episode 200.000000.  Total average reward 198.817259.
Average reward for episode 194.800000.  Total average reward 198.777087.
Average reward for episode 200.000000.  Total average reward 198.789316.
Average reward for episode 200.000000.  Total average reward 198.801423.
Average reward for episode 199.400000.  Total average reward 198.807408.
Average reward for episode 200.000000.  Total avera

Average reward for episode 179.000000.  Total average reward 198.884679.
Average reward for episode 175.200000.  Total average reward 198.647832.
Average reward for episode 200.000000.  Total average reward 198.661354.
Average reward for episode 200.000000.  Total average reward 198.674740.
Average reward for episode 200.000000.  Total average reward 198.687993.
Average reward for episode 200.000000.  Total average reward 198.701113.
Average reward for episode 197.000000.  Total average reward 198.684102.
Average reward for episode 200.000000.  Total average reward 198.697261.
Average reward for episode 200.000000.  Total average reward 198.710288.
Average reward for episode 200.000000.  Total average reward 198.723185.
Average reward for episode 200.000000.  Total average reward 198.735954.
Average reward for episode 200.000000.  Total average reward 198.748594.
Average reward for episode 200.000000.  Total average reward 198.761108.
Average reward for episode 175.400000.  Total avera

Average reward for episode 200.000000.  Total average reward 198.766713.
Average reward for episode 200.000000.  Total average reward 198.779046.
Average reward for episode 200.000000.  Total average reward 198.791256.
Average reward for episode 196.600000.  Total average reward 198.769343.
Average reward for episode 200.000000.  Total average reward 198.781650.
Average reward for episode 200.000000.  Total average reward 198.793833.
Average reward for episode 200.000000.  Total average reward 198.805895.
Average reward for episode 200.000000.  Total average reward 198.817836.
Average reward for episode 200.000000.  Total average reward 198.829658.
Average reward for episode 196.400000.  Total average reward 198.805361.
Average reward for episode 191.800000.  Total average reward 198.735307.
Average reward for episode 197.000000.  Total average reward 198.717954.
Average reward for episode 191.200000.  Total average reward 198.642775.
Average reward for episode 200.000000.  Total avera

Average reward for episode 200.000000.  Total average reward 199.218176.
Average reward for episode 200.000000.  Total average reward 199.225994.
Average reward for episode 200.000000.  Total average reward 199.233734.
Average reward for episode 200.000000.  Total average reward 199.241397.
Average reward for episode 200.000000.  Total average reward 199.248983.
Average reward for episode 200.000000.  Total average reward 199.256493.
Average reward for episode 200.000000.  Total average reward 199.263928.
Average reward for episode 200.000000.  Total average reward 199.271289.
Average reward for episode 200.000000.  Total average reward 199.278576.
Average reward for episode 200.000000.  Total average reward 199.285790.
Average reward for episode 200.000000.  Total average reward 199.292932.
Average reward for episode 200.000000.  Total average reward 199.300003.
Average reward for episode 200.000000.  Total average reward 199.307003.
Average reward for episode 200.000000.  Total avera

KeyboardInterrupt: 
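
The "Total average reward" column in the log above is an exponential moving average of the per-batch reward, which is why it lags far behind the per-batch average early in training. The sketch below replays the update from the training loop on the first three logged batch averages. The helper name `update_running_reward` is made up for illustration.

```python
def update_running_reward(running_reward, reward_sum):
    """Same smoothing rule as the training loop: 99% old value, 1% new batch."""
    if running_reward is None:
        return reward_sum
    return running_reward * 0.99 + reward_sum * 0.01

batch_size = 5
batch_averages = [15.0, 27.0, 25.2]   # first three "Average reward" values from the log

running = None
smoothed = []
for avg in batch_averages:
    reward_sum = avg * batch_size      # the loop tracks the batch total, not the average
    running = update_running_reward(running, reward_sum)
    smoothed.append(running / batch_size)

print(["%.6f" % s for s in smoothed])  # ['15.000000', '15.120000', '15.220800']
```

These match the first three "Total average reward" values in the log.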

As you can see, the network not only does much better than random actions, but achieves the goal of 200 points per episode, thus solving the task!