# Monte Carlo Policy Gradients for Cartpole Gym Environment

In this notebook we'll implement an agent <b>that plays CartPole</b> using Monte Carlo policy gradients.

<img src="http://neuro-educator.com/wp-content/uploads/2017/09/DQN.gif" alt="Cartpole gif"/>

## 1. Import Libraries

In [1]:
import tensorflow as tf    # deep learning framework
import numpy as np         # used for matrix operations
import gym                 # used for getting the cartpole environment

## 2. Create Environment

In [2]:
env = gym.make('CartPole-v0')
env = env.unwrapped        # use the raw environment without the TimeLimit wrapper (see the Gym documentation)
# Policy gradients have high variance, so set a seed for reproducibility
env.seed(1)

[1]

## 3. Hyperparameters

In [3]:
## Environment Hyperparameters
state_size = 4
action_size = env.action_space.n # get the number of possible actions

## Training Hyperparameters
max_episodes = 300 # increase if using gpu
learning_rate = 0.01 # learning rate alpha
gamma = 0.95 # discount rate gamma

## 4. Function to apply discount and return normalized rewards at end of an episode
This is necessary because we are using a Monte Carlo approach: rewards are discounted only after a full episode is complete, and then the policy is updated. With an online approach such as Temporal Difference (TD) learning, the discount and update steps happen after every step, as soon as an action leads to a new state.

The discounted cumulative expected reward weights each reward by how far in the future it arrives, which balances the predictable rewards that follow an action closely against the less predictable long-term rewards. gamma is the relevant hyperparameter:
* if gamma is large (close to 1), the discount is small, so the agent cares more about long-term reward
* if gamma is small (close to 0), the discount is large, so the agent cares more about short-term reward
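
To make the effect of gamma concrete, the sketch below prints the weight `gamma ** k` that a reward `k` steps in the future receives. The helper name `discount_weight` and the sample values are illustrative assumptions and are not part of the notebook's code.

```python
def discount_weight(gamma, k):
    """Weight applied to a reward received k steps in the future."""
    return gamma ** k

# Compare a small and a large gamma at a few horizons (illustrative values).
for g in (0.5, 0.95):
    weights = [round(discount_weight(g, k), 4) for k in (0, 1, 10)]
    print(f"gamma={g}: weights at k=0, 1, 10 -> {weights}")
```

With gamma = 0.5 a reward ten steps ahead is weighted by roughly 0.001, while with gamma = 0.95 it still counts for about 0.6.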

To calculate the discounted cumulative expected reward at time step t, sum all future rewards, each multiplied (discounted) by gamma raised to the number of steps until that reward is received:

$$
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \quad \text{where } \gamma \in [0, 1)
$$

In [4]:
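def discount_and_normalize_rewards(episode_rewards):
    # Use a float array so discounted values are not truncated if rewards are integers
    discounted_episode_rewards = np.zeros_like(episode_rewards, dtype=np.float64)
    cumulative = 0.0
    # Walk backwards so each step gets its own reward plus the discounted future return
    for i in reversed(range(len(episode_rewards))):
        cumulative = cumulative * gamma + episode_rewards[i]
        discounted_episode_rewards[i] = cumulative
    
    # Normalize to zero mean and unit standard deviation
    mean = np.mean(discounted_episode_rewards)
    std = np.std(discounted_episode_rewards)
    # The small epsilon avoids division by zero (e.g. for a one-step episode)
    discounted_episode_rewards = (discounted_episode_rewards - mean) / (std + 1e-8)
    
    return discounted_episode_rewards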
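
As a quick sanity check of the backward recursion used above, here is a dependency-free sketch applied to a three-step episode. `discounted_returns` is a hypothetical helper that mirrors the loop but takes `gamma` as an argument and skips the normalization step.

```python
def discounted_returns(rewards, gamma):
    """Discounted return for every step, computed backwards over the episode."""
    returns = [0.0] * len(rewards)
    cumulative = 0.0
    for i in reversed(range(len(rewards))):
        # Return at step i = reward at step i + gamma * return at step i + 1
        cumulative = cumulative * gamma + rewards[i]
        returns[i] = cumulative
    return returns

# Three steps, each with reward 1 (CartPole gives +1 per step survived).
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.95))
```

The last step keeps its raw reward of 1.0, the middle step gets 1 + 0.95 = 1.95, and the first step gets 1 + 0.95 * 1.95, which is approximately 2.8525.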

## 5. Neural Network Model
NN Input: the current state (an array of 4 values)

NN Architecture: 3 fully connected layers. The first two use the ReLU activation function; the last outputs raw logits, which a softmax turns into a probability distribution over actions.

NN Output: the action distribution produced by applying softmax to the network's logits
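
To illustrate how softmax turns the network's raw outputs into a probability distribution, here is a minimal plain-Python sketch. The `softmax` helper and the logit values are assumptions for illustration; the notebook itself relies on `tf.nn.softmax`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() from overflowing without changing the result.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the two CartPole actions (0 = push left, 1 = push right).
probs = softmax([2.0, 1.0])
print(probs)
```

The larger logit receives the larger probability, and the two values add up to 1, so they can be used directly as sampling probabilities for the two actions.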

In [5]:
with tf.name_scope("inputs"):
    input_ = tf.placeholder(tf.float32, [None, state_size], name="input_")
    actions = tf.placeholder(tf.int32, [None, action_size], name="actions")
    discounted_episode_rewards_ = tf.placeholder(tf.float32, [None,], name="discounted_episode_rewards")
    
    # Add this placeholder for having this variable in tensorboard
    mean_reward_ = tf.placeholder(tf.float32 , name="mean_reward")

    with tf.name_scope("fc1"):
        fc1 = tf.contrib.layers.fully_connected(inputs = input_,
                                                num_outputs = 10,
                                                activation_fn=tf.nn.relu,
                                                weights_initializer=tf.contrib.layers.xavier_initializer())

    with tf.name_scope("fc2"):
        fc2 = tf.contrib.layers.fully_connected(inputs = fc1,
                                                num_outputs = action_size,
                                                activation_fn= tf.nn.relu,
                                                weights_initializer=tf.contrib.layers.xavier_initializer())
    
    with tf.name_scope("fc3"):
        fc3 = tf.contrib.layers.fully_connected(inputs = fc2,
                                                num_outputs = action_size,
                                                activation_fn= None,
                                                weights_initializer=tf.contrib.layers.xavier_initializer())

    with tf.name_scope("softmax"):
        action_distribution = tf.nn.softmax(fc3)

    with tf.name_scope("loss"):
        # With one-hot action labels, the softmax cross entropy equals -log(pi(a|s)),
        # the negative log-probability of the action that was actually taken.
        # tf.nn.sparse_softmax_cross_entropy_with_logits would accept integer action indices instead,
        # so the labels would not need to be converted to a dense one-hot array.
        neg_log_prob = tf.nn.softmax_cross_entropy_with_logits_v2(logits = fc3, labels = actions)
        loss = tf.reduce_mean(neg_log_prob * discounted_episode_rewards_) 
        
    
    with tf.name_scope("train"):
        train_opt = tf.train.AdamOptimizer(learning_rate).minimize(loss)
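
The loss above can be checked by hand on a tiny example. The sketch below is plain Python rather than TensorFlow, and the probabilities, actions and returns are made-up values. It shows that cross entropy against a one-hot label reduces to the negative log-probability of the chosen action, which is then weighted by the normalized return and averaged over the episode.

```python
import math

def neg_log_prob(probs, action):
    """Cross entropy against a one-hot label: -log(probability of the taken action)."""
    return -math.log(probs[action])

# Hypothetical two-step episode: policy outputs, actions taken, normalized returns.
step_probs = [[0.7, 0.3], [0.4, 0.6]]
actions_taken = [0, 1]
returns = [1.0, -1.0]

# Weight each step's negative log-probability by its return, then average.
terms = [neg_log_prob(p, a) * g for p, a, g in zip(step_probs, actions_taken, returns)]
loss = sum(terms) / len(terms)
print(loss)
```

Minimizing this quantity pushes up the probability of actions paired with positive returns and pushes down the probability of actions paired with negative returns.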

## 6. Tensorboard
Two useful metrics to track here are the loss (the negative of the score being maximized) and the mean episode reward.

To launch tensorboard: ```tensorboard --logdir=<path_specified_below>```

In [8]:
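# Setup TensorBoard Writer
import os
# "~" is not expanded automatically, so resolve the home directory explicitly
writer = tf.summary.FileWriter(os.path.expanduser("~/tensorboard/pg/1")) # change to any path as long as user has access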

## Losses
tf.summary.scalar("Loss", loss)

## Reward mean
tf.summary.scalar("Reward_mean", mean_reward_)

write_op = tf.summary.merge_all()

## 7. Train Agent
Pseudocode:
```
Create the network
maxReward = 0 # initialize and keep track of max reward
for episode in range(max_episodes):
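    reset environment
    reset stores (states, actions, rewards)
    
    for each step:
        choose action a by sampling from the policy's action distribution
        perform action a and get reward r and new state s'
        store s, a, r
        if done:
            calculate the sum of rewards for the episode
            calculate the discounted, normalized returns Gt
            optimize: maximize the score (equivalently, minimize the loss, i.e. the negative score)
            break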
```
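
The "choose action" and "store" steps can be sketched without TensorFlow or Gym. The helpers `sample_action` and `one_hot` below are hypothetical names, and the probabilities are made up; the training cell does the equivalent with `np.random.choice` and `np.zeros`.

```python
import random

def sample_action(probs, rng=random):
    """Draw an action index according to the policy's probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def one_hot(action, num_actions):
    """Encode an action index as a one-hot list, e.g. action 1 of 2 -> [0.0, 1.0]."""
    encoded = [0.0] * num_actions
    encoded[action] = 1.0
    return encoded

rng = random.Random(1)                     # seeded so repeated runs match
action = sample_action([0.3, 0.7], rng)    # made-up policy output
print(action, one_hot(action, 2))
```

Sampling instead of always taking the most probable action is what lets the agent explore: a less likely action is still tried some of the time.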

In [9]:
allRewards = []
total_rewards = 0
maximumRewardRecorded = 0
episode = 0
episode_states, episode_actions, episode_rewards = [],[],[]

saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    for episode in range(max_episodes):
        
        episode_rewards_sum = 0

        # Launch the game
        state = env.reset()
        
        env.render()
           
        while True:
            
            # Choose action a. The policy is stochastic: the network outputs action probabilities and we sample from them.
            action_probability_distribution = sess.run(action_distribution, feed_dict={input_: state.reshape([1,4])})
            
            action = np.random.choice(range(action_probability_distribution.shape[1]), p=action_probability_distribution.ravel())  # select action w.r.t the actions prob

            # Perform a
            new_state, reward, done, info = env.step(action)

            # Store s, a, r
            episode_states.append(state)
                        
            # The loss expects one-hot labels, so convert the sampled action index
            # into a vector, e.g. [0., 1.] for action 1 (push right)
            action_ = np.zeros(action_size)
            action_[action] = 1
            
            episode_actions.append(action_)
            
            episode_rewards.append(reward)
            if done:
                # Calculate sum reward
                episode_rewards_sum = np.sum(episode_rewards)
                
                allRewards.append(episode_rewards_sum)
                
                total_rewards = np.sum(allRewards)
                
                # Mean reward
                mean_reward = np.divide(total_rewards, episode+1)
                
                
                maximumRewardRecorded = np.amax(allRewards)
                
                print("==========================================")
                print("Episode: ", episode)
                print("Reward: ", episode_rewards_sum)
                print("Mean Reward", mean_reward)
                print("Max reward so far: ", maximumRewardRecorded)
                
                # Calculate discounted reward
                discounted_episode_rewards = discount_and_normalize_rewards(episode_rewards)
                                
                # Feedforward, gradient and backpropagation
                loss_, _ = sess.run([loss, train_opt], feed_dict={input_: np.vstack(np.array(episode_states)),
                                                                 actions: np.vstack(np.array(episode_actions)),
                                                                 discounted_episode_rewards_: discounted_episode_rewards 
                                                                })
                
 
                                                                 
                # Write TF Summaries
                summary = sess.run(write_op, feed_dict={input_: np.vstack(np.array(episode_states)),
                                                                 actions: np.vstack(np.array(episode_actions)),
                                                                 discounted_episode_rewards_: discounted_episode_rewards,
                                                                    mean_reward_: mean_reward
                                                                })
                
               
                writer.add_summary(summary, episode)
                writer.flush()
                
            
                
                # Reset the transition stores
                episode_states, episode_actions, episode_rewards = [],[],[]
                
                break
            
            state = new_state
        
        # Save Model
        if episode % 100 == 0:
            saver.save(sess, "./models/model.ckpt")
            print("Model saved")

Episode:  0
Reward:  21.0
Mean Reward 21.0
Max reward so far:  21.0
Model saved
Episode:  1
Reward:  14.0
Mean Reward 17.5
Max reward so far:  21.0
Episode:  2
Reward:  18.0
Mean Reward 17.666666666666668
Max reward so far:  21.0
Episode:  3
Reward:  19.0
Mean Reward 18.0
Max reward so far:  21.0
Episode:  4
Reward:  14.0
Mean Reward 17.2
Max reward so far:  21.0
Episode:  5
Reward:  11.0
Mean Reward 16.166666666666668
Max reward so far:  21.0
Episode:  6
Reward:  11.0
Mean Reward 15.428571428571429
Max reward so far:  21.0
Episode:  7
Reward:  67.0
Mean Reward 21.875
Max reward so far:  67.0
Episode:  8
Reward:  12.0
Mean Reward 20.77777777777778
Max reward so far:  67.0
Episode:  9
Reward:  21.0
Mean Reward 20.8
Max reward so far:  67.0
Episode:  10
Reward:  17.0
Mean Reward 20.454545454545453
Max reward so far:  67.0
Episode:  11
Reward:  28.0
Mean Reward 21.083333333333332
Max reward so far:  67.0
Episode:  12
Reward:  12.0
Mean Reward 20.384615384615383
Max reward so far:  67.0
Ep

Episode:  75
Reward:  55.0
Mean Reward 19.25
Max reward so far:  67.0
Episode:  76
Reward:  9.0
Mean Reward 19.116883116883116
Max reward so far:  67.0
Episode:  77
Reward:  29.0
Mean Reward 19.243589743589745
Max reward so far:  67.0
Episode:  78
Reward:  14.0
Mean Reward 19.17721518987342
Max reward so far:  67.0
Episode:  79
Reward:  24.0
Mean Reward 19.2375
Max reward so far:  67.0
Episode:  80
Reward:  30.0
Mean Reward 19.37037037037037
Max reward so far:  67.0
Episode:  81
Reward:  13.0
Mean Reward 19.29268292682927
Max reward so far:  67.0
Episode:  82
Reward:  19.0
Mean Reward 19.289156626506024
Max reward so far:  67.0
Episode:  83
Reward:  15.0
Mean Reward 19.238095238095237
Max reward so far:  67.0
Episode:  84
Reward:  28.0
Mean Reward 19.341176470588234
Max reward so far:  67.0
Episode:  85
Reward:  17.0
Mean Reward 19.313953488372093
Max reward so far:  67.0
Episode:  86
Reward:  12.0
Mean Reward 19.229885057471265
Max reward so far:  67.0
Episode:  87
Reward:  18.0
Mean 

Episode:  150
Reward:  17.0
Mean Reward 20.258278145695364
Max reward so far:  67.0
Episode:  151
Reward:  11.0
Mean Reward 20.19736842105263
Max reward so far:  67.0
Episode:  152
Reward:  10.0
Mean Reward 20.130718954248366
Max reward so far:  67.0
Episode:  153
Reward:  23.0
Mean Reward 20.149350649350648
Max reward so far:  67.0
Episode:  154
Reward:  45.0
Mean Reward 20.309677419354838
Max reward so far:  67.0
Episode:  155
Reward:  11.0
Mean Reward 20.25
Max reward so far:  67.0
Episode:  156
Reward:  14.0
Mean Reward 20.21019108280255
Max reward so far:  67.0
Episode:  157
Reward:  17.0
Mean Reward 20.189873417721518
Max reward so far:  67.0
Episode:  158
Reward:  20.0
Mean Reward 20.18867924528302
Max reward so far:  67.0
Episode:  159
Reward:  26.0
Mean Reward 20.225
Max reward so far:  67.0
Episode:  160
Reward:  9.0
Mean Reward 20.15527950310559
Max reward so far:  67.0
Episode:  161
Reward:  18.0
Mean Reward 20.141975308641975
Max reward so far:  67.0
Episode:  162
Reward: 

Episode:  228
Reward:  19.0
Mean Reward 20.12227074235808
Max reward so far:  67.0
Episode:  229
Reward:  27.0
Mean Reward 20.152173913043477
Max reward so far:  67.0
Episode:  230
Reward:  11.0
Mean Reward 20.11255411255411
Max reward so far:  67.0
Episode:  231
Reward:  14.0
Mean Reward 20.086206896551722
Max reward so far:  67.0
Episode:  232
Reward:  28.0
Mean Reward 20.120171673819744
Max reward so far:  67.0
Episode:  233
Reward:  20.0
Mean Reward 20.11965811965812
Max reward so far:  67.0
Episode:  234
Reward:  17.0
Mean Reward 20.106382978723403
Max reward so far:  67.0
Episode:  235
Reward:  24.0
Mean Reward 20.122881355932204
Max reward so far:  67.0
Episode:  236
Reward:  16.0
Mean Reward 20.10548523206751
Max reward so far:  67.0
Episode:  237
Reward:  16.0
Mean Reward 20.08823529411765
Max reward so far:  67.0
Episode:  238
Reward:  13.0
Mean Reward 20.05857740585774
Max reward so far:  67.0
Episode:  239
Reward:  13.0
Mean Reward 20.029166666666665
Max reward so far:  67.

## 8. Try Out Trained Agent

In [12]:
with tf.Session() as sess:
    env.reset()
    rewards = []
    
    # Load the model
    saver.restore(sess, "./models/model.ckpt")

    for episode in range(10):
        state = env.reset()
        env.render()
        step = 0
        done = False
        total_rewards = 0
        print("****************************************************")
        print("EPISODE ", episode)
        while True:
            

            # Choose action a. The policy is stochastic: the network outputs action probabilities and we sample from them.
            action_probability_distribution = sess.run(action_distribution, feed_dict={input_: state.reshape([1,4])})
            #print(action_probability_distribution)
            action = np.random.choice(range(action_probability_distribution.shape[1]), p=action_probability_distribution.ravel())  # select action w.r.t the actions prob


            new_state, reward, done, info = env.step(action)

            total_rewards += reward

            if done:
                rewards.append(total_rewards)
                print ("Score", total_rewards)
                break
            state = new_state
    env.close()
    print ("Score over time: " +  str(sum(rewards)/10))

INFO:tensorflow:Restoring parameters from ./models/model.ckpt
****************************************************
EPISODE  0
Score 52.0
****************************************************
EPISODE  1
Score 16.0
****************************************************
EPISODE  2
Score 12.0
****************************************************
EPISODE  3
Score 21.0
****************************************************
EPISODE  4
Score 28.0
****************************************************
EPISODE  5
Score 17.0
****************************************************
EPISODE  6
Score 14.0
****************************************************
EPISODE  7
Score 18.0
****************************************************
EPISODE  8
Score 13.0
****************************************************
EPISODE  9
Score 9.0
Score over time: 20.0
