# Cartpole: REINFORCE Monte Carlo Policy Gradients

In this notebook we'll implement an agent <b>that plays Cartpole</b> using REINFORCE, a Monte Carlo policy gradient method.

<img src="http://neuro-educator.com/wp-content/uploads/2017/09/DQN.gif" alt="Cartpole gif"/>


# This is a notebook from [Deep Reinforcement Learning Course with Tensorflow](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)
<img src="https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/docs/assets/img/DRLC%20Environments.png" alt="Deep Reinforcement Course"/>
<br>
<p>  The Deep Reinforcement Learning Course is a free series of articles and video tutorials 🆕 about Deep Reinforcement Learning, where **we'll learn the main algorithms (Q-learning, Deep Q Nets, Dueling Deep Q Nets, Policy Gradients, A2C, Proximal Policy Optimization…) and how to implement them with Tensorflow.**
<br><br>
    
📜The articles explain the architectures from the big picture to the mathematical details behind them.
<br>
📹 The videos explain how to build the agents with Tensorflow.</p>
<br>
This course will give you a **solid foundation for understanding and implementing future state-of-the-art algorithms**. You'll also build a strong professional portfolio by creating **agents that learn to play awesome environments**: Doom© 👹, Space Invaders 👾, Outrun, Sonic the Hedgehog©, and Michael Jackson’s Moonwalker, as well as agents that navigate 3D environments with DeepMind Lab (Quake) and learn to walk with MuJoCo.
<br><br>
</p> 

## 📚 The complete [Syllabus HERE](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)


## Any questions 👨‍💻
<p> If you have any questions, feel free to ask me: </p>
<p> 📧: <a href="mailto:hello@simoninithomas.com">hello@simoninithomas.com</a>  </p>
<p> Github: https://github.com/simoninithomas/Deep_reinforcement_learning_Course </p>
<p> 🌐 : https://simoninithomas.github.io/Deep_reinforcement_learning_Course/ </p>
<p> Twitter: <a href="https://twitter.com/ThomasSimonini">@ThomasSimonini</a> </p>
<p> Don't forget to <b> follow me on <a href="https://twitter.com/ThomasSimonini">twitter</a>, <a href="https://github.com/simoninithomas/Deep_reinforcement_learning_Course">github</a> and <a href="https://medium.com/@thomassimonini">Medium</a> to be alerted of the new articles that I publish </b></p>
    
## How to help  🙌
3 ways:
- **Clap our articles and like our videos a lot**: Clapping on Medium means that you really like our articles, and the more claps we have, the more our articles are shared. Liking our videos helps them become much more visible to the deep learning community.
- **Share and speak about our articles and videos**: By sharing our articles and videos you help us to spread the word. 
- **Improve our notebooks**: If you find a bug or **a better implementation**, you can send a pull request.
<br>

## Important note 🤔
<b>You can run this notebook on your computer, but it's better to run it on a GPU-based service.</b> Personally, I use Microsoft Azure and their Deep Learning Virtual Machine (they offer $170):
https://azuremarketplace.microsoft.com/en-us/marketplace/apps/microsoft-ads.dsvm-deep-learning
<br>
⚠️ I don't have any business relations with them. I just loved their excellent customer service.

If you have trouble using Microsoft Azure, follow the explanations in this excellent article (skip the last part about fast.ai): https://medium.com/@manikantayadunanda/setting-up-deeplearning-machine-and-fast-ai-on-azure-a22eb6bd6429

## Prerequisites 🏗️
Before diving into the notebook, **you need to understand**:
- The foundations of Reinforcement learning (MC, TD, Rewards hypothesis...) [Article](https://medium.freecodecamp.org/an-introduction-to-reinforcement-learning-4339519de419)
- Policy gradients [Article](https://medium.freecodecamp.org/an-introduction-to-policy-gradients-with-cartpole-and-doom-495b5ef2207f)

## Step 1: Import the libraries 📚

In [1]:
import tensorflow as tf
import numpy as np
import gym

## Step 2: Create our environment 🎮
This time we use <a href="https://gym.openai.com/">OpenAI Gym</a>, which has a lot of great environments.

In [2]:
env = gym.make('CartPole-v0')
# Remove the TimeLimit wrapper so episodes are not cut off after 200 steps
env = env.unwrapped
# Policy gradients have high variance, so seed the environment for reproducibility
env.seed(1)

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


[1]

## Step 3: Set up our hyperparameters ⚗️

In [3]:
## ENVIRONMENT Hyperparameters
state_size = 4
action_size = env.action_space.n

## TRAINING Hyperparameters
max_episodes = 300
learning_rate = 0.01
gamma = 0.95 # Discount rate
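
To get a feel for `gamma`, the sketch below prints how much weight a reward received k steps in the future gets under the discount rate set above. The helper name `discount_weight` is made up for this illustration and is not part of the notebook.

```python
gamma = 0.95  # same discount rate as in the hyperparameters above


def discount_weight(k, gamma=gamma):
    """Weight applied to a reward received k steps in the future (hypothetical helper)."""
    return gamma ** k


for k in (0, 1, 10, 20, 50):
    print("k = %2d  weight = %.4f" % (k, discount_weight(k)))

# Sum of the geometric series 1 + gamma + gamma^2 + ...:
# a rough "effective horizon" of the agent, in steps.
effective_horizon = 1.0 / (1.0 - gamma)
print("effective horizon ~ %.0f steps" % effective_horizon)
```

With `gamma = 0.95`, a reward 20 steps ahead still counts for roughly a third of an immediate one, so the agent cares mostly about the next few dozen steps.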

## Step 4 : Define the preprocessing functions ⚙️
This function takes <b>the rewards of an episode, discounts them, and normalizes the result.</b>

In [4]:
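def discount_and_normalize_rewards(episode_rewards):
    # Compute the discounted return G_t = r_t + gamma * G_{t+1} for every step,
    # walking backwards from the end of the episode.
    # Use an explicit float dtype so integer rewards are not truncated.
    discounted_episode_rewards = np.zeros(len(episode_rewards), dtype=np.float64)
    cumulative = 0.0
    for i in reversed(range(len(episode_rewards))):
        cumulative = cumulative * gamma + episode_rewards[i]
        discounted_episode_rewards[i] = cumulative
    
    # Normalize to zero mean and unit standard deviation to reduce gradient variance.
    mean = np.mean(discounted_episode_rewards)
    std = np.std(discounted_episode_rewards)
    # The small constant avoids a division by zero for one-step episodes.
    discounted_episode_rewards = (discounted_episode_rewards - mean) / (std + 1e-8)
    
    return discounted_episode_rewards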
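
As a quick sanity check, here is a small self-contained sketch that applies the same discounting to a three-step episode with a reward of 1 at every step (as in Cartpole). It re-implements only the discounting part under the name `discount_rewards`, which is an illustrative name, and takes `gamma` as an argument instead of reading the global.

```python
import numpy as np


def discount_rewards(episode_rewards, gamma):
    """Discounted return for each step, computed backwards (illustrative helper)."""
    discounted = np.zeros(len(episode_rewards), dtype=np.float64)
    cumulative = 0.0
    for i in reversed(range(len(episode_rewards))):
        cumulative = cumulative * gamma + episode_rewards[i]
        discounted[i] = cumulative
    return discounted


returns = discount_rewards([1.0, 1.0, 1.0], gamma=0.95)
print(returns)  # approximately [2.8525, 1.95, 1.0]

# Normalizing shifts the returns to zero mean and unit standard deviation.
# In this example the first step gets a positive weight and the last a negative one.
normalized = (returns - returns.mean()) / returns.std()
print(normalized)
```

The last step only sees its own reward (1.0), the middle step sees 1 + 0.95 * 1.0 = 1.95, and the first step sees 1 + 0.95 * 1.95 = 2.8525.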

## Step 5: Create our Policy Gradient Neural Network model 🧠

<img src="https://raw.githubusercontent.com/simoninithomas/Deep_reinforcement_learning_Course/master/Policy%20Gradients/Cartpole/assets/catpole.png">

The idea is simple:
- Our state, which is an array of 4 values, is used as the input.
- Our NN has 3 fully connected layers.
- Our output activation function is softmax, which squashes the outputs into a probability distribution (for instance, 4, 2, 6 --> softmax --> 0.11731043, 0.01587624, 0.86681333).
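
The softmax numbers in the list above can be reproduced with a few lines of NumPy. This is a minimal sketch for illustration; the network itself uses `tf.nn.softmax`, and the function name `softmax` below is just a local helper.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)  # subtracting the max does not change the result
    exps = np.exp(shifted)
    return exps / np.sum(exps)


probs = softmax(np.array([4.0, 2.0, 6.0]))
print(probs)        # approximately [0.1173, 0.0159, 0.8668]
print(probs.sum())  # the probabilities sum to 1 (up to rounding)
```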

In [5]:
with tf.name_scope("inputs"):
    input_ = tf.placeholder(tf.float32, [None, state_size], name="input_")
    actions = tf.placeholder(tf.int32, [None, action_size], name="actions")
    discounted_episode_rewards_ = tf.placeholder(tf.float32, [None,], name="discounted_episode_rewards")
    
    # Add this placeholder for having this variable in tensorboard
    mean_reward_ = tf.placeholder(tf.float32 , name="mean_reward")

    with tf.name_scope("fc1"):
        fc1 = tf.contrib.layers.fully_connected(inputs = input_,
                                                num_outputs = 10,
                                                activation_fn=tf.nn.relu,
                                                weights_initializer=tf.contrib.layers.xavier_initializer())

    with tf.name_scope("fc2"):
        fc2 = tf.contrib.layers.fully_connected(inputs = fc1,
                                                num_outputs = action_size,
                                                activation_fn= tf.nn.relu,
                                                weights_initializer=tf.contrib.layers.xavier_initializer())
    
    with tf.name_scope("fc3"):
        fc3 = tf.contrib.layers.fully_connected(inputs = fc2,
                                                num_outputs = action_size,
                                                activation_fn= None,
                                                weights_initializer=tf.contrib.layers.xavier_initializer())

    with tf.name_scope("softmax"):
        action_distribution = tf.nn.softmax(fc3)

    with tf.name_scope("loss"):
        # tf.nn.softmax_cross_entropy_with_logits computes the cross entropy of the result after applying the softmax function
        # If you have single-class labels, where an object can only belong to one class, you might now consider using 
        # tf.nn.sparse_softmax_cross_entropy_with_logits so that you don't have to convert your labels to a dense one-hot array. 
        neg_log_prob = tf.nn.softmax_cross_entropy_with_logits_v2(logits = fc3, labels = actions)
        loss = tf.reduce_mean(neg_log_prob * discounted_episode_rewards_) 
        
    
    with tf.name_scope("train"):
        train_opt = tf.train.AdamOptimizer(learning_rate).minimize(loss)
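
To make the loss concrete, here is a NumPy sketch of what the `loss` scope computes for a small batch: the negative log-probability of each action taken, weighted by its discounted return, then averaged. The logits, actions, and returns are made-up numbers, and the function name `reinforce_loss` is an assumption for illustration; the notebook itself relies on `tf.nn.softmax_cross_entropy_with_logits_v2`.

```python
import numpy as np


def reinforce_loss(logits, one_hot_actions, discounted_rewards):
    """Mean of -log pi(a_t | s_t) * G_t over the batch (illustrative NumPy version)."""
    # Log-softmax, computed stably by subtracting the row-wise max.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Cross entropy with one-hot labels picks out -log p(action taken).
    neg_log_prob = -(one_hot_actions * log_probs).sum(axis=1)
    return np.mean(neg_log_prob * discounted_rewards)


# Made-up batch of 3 steps with 2 actions (left, right).
logits = np.array([[0.2, -0.1], [1.0, 0.5], [-0.3, 0.8]])
one_hot_actions = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
discounted_rewards = np.array([1.2, 0.1, -1.3])

print(reinforce_loss(logits, one_hot_actions, discounted_rewards))
```

Minimizing this quantity raises the probability of actions followed by above-average returns (positive weights) and lowers it for actions followed by below-average returns (negative weights).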

## Step 6: Set up Tensorboard 📊
For more information about tensorboard, please watch this <a href="https://www.youtube.com/embed/eBbEDRsCmv4">excellent 30min tutorial</a> <br><br>
To launch tensorboard : `tensorboard --logdir=/tensorboard/pg/1`

In [6]:
# Setup TensorBoard Writer
writer = tf.summary.FileWriter("/tensorboard/pg/1")

## Losses
tf.summary.scalar("Loss", loss)

## Reward mean
tf.summary.scalar("Reward_mean", mean_reward_)

write_op = tf.summary.merge_all()

## Step 7: Train our Agent 🏃‍♂️
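
The loop in this step samples an action from the network's output probabilities and stores it as a one-hot vector. The sketch below isolates those two operations, with a made-up probability vector standing in for the network output; the seed and the name `rng` are assumptions used only to make the example repeatable.

```python
import numpy as np

rng = np.random.RandomState(0)  # seeded generator, only for a repeatable example

# Stand-in for sess.run(action_distribution, ...): shape (1, action_size).
action_probability_distribution = np.array([[0.3, 0.7]])
action_size = action_probability_distribution.shape[1]

# Sample an action index according to the probabilities (not the argmax).
action = rng.choice(action_size, p=action_probability_distribution.ravel())

# One-hot encode the sampled index, as expected by the `actions` placeholder.
action_ = np.zeros(action_size)
action_[action] = 1

print(action, action_)

# Over many samples, the empirical frequencies approach the probabilities.
samples = rng.choice(action_size, size=10000, p=action_probability_distribution.ravel())
print(np.bincount(samples, minlength=action_size) / 10000.0)
```

Sampling rather than taking the most likely action is what lets the agent keep exploring while the policy is still uncertain.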

In [7]:
allRewards = []
total_rewards = 0
maximumRewardRecorded = 0
episode = 0
episode_states, episode_actions, episode_rewards = [],[],[]

saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    for episode in range(max_episodes):
        
        episode_rewards_sum = 0

        # Launch the game
        state = env.reset()
        
        env.render()
           
        while True:
            
            # Choose action a. The policy is stochastic: the network outputs action probabilities and we sample from them.
            action_probability_distribution = sess.run(action_distribution, feed_dict={input_: state.reshape([1,4])})
            
            action = np.random.choice(range(action_probability_distribution.shape[1]), p=action_probability_distribution.ravel())  # select action w.r.t the actions prob

            # Perform a
            new_state, reward, done, info = env.step(action)

            # Store s, a, r
            episode_states.append(state)
                        
            # The sampled action is only an index, but the loss expects a one-hot vector,
            # e.g. [0., 1.] if we push right (index 1), not just the index
            action_ = np.zeros(action_size)
            action_[action] = 1
            
            episode_actions.append(action_)
            
            episode_rewards.append(reward)
            if done:
                # Calculate sum reward
                episode_rewards_sum = np.sum(episode_rewards)
                
                allRewards.append(episode_rewards_sum)
                
                total_rewards = np.sum(allRewards)
                
                # Mean reward
                mean_reward = np.divide(total_rewards, episode+1)
                
                
                maximumRewardRecorded = np.amax(allRewards)
                
                print("==========================================")
                print("Episode: ", episode)
                print("Reward: ", episode_rewards_sum)
                print("Mean Reward", mean_reward)
                print("Max reward so far: ", maximumRewardRecorded)
                
                # Calculate discounted reward
                discounted_episode_rewards = discount_and_normalize_rewards(episode_rewards)
                                
                # Feedforward, gradient and backpropagation
                loss_, _ = sess.run([loss, train_opt], feed_dict={input_: np.vstack(np.array(episode_states)),
                                                                 actions: np.vstack(np.array(episode_actions)),
                                                                 discounted_episode_rewards_: discounted_episode_rewards 
                                                                })
                
 
                                                                 
                # Write TF Summaries
                summary = sess.run(write_op, feed_dict={input_: np.vstack(np.array(episode_states)),
                                                                 actions: np.vstack(np.array(episode_actions)),
                                                                 discounted_episode_rewards_: discounted_episode_rewards,
                                                                    mean_reward_: mean_reward
                                                                })
                
               
                writer.add_summary(summary, episode)
                writer.flush()
                
            
                
                # Reset the transition stores
                episode_states, episode_actions, episode_rewards = [],[],[]
                
                break
            
            state = new_state
        
        # Save Model
        if episode % 100 == 0:
            saver.save(sess, "./models/model.ckpt")
            print("Model saved")

Episode:  0
Reward:  10.0
Mean Reward 10.0
Max reward so far:  10.0
Model saved
Episode:  1
Reward:  10.0
Mean Reward 10.0
Max reward so far:  10.0
Episode:  2
Reward:  21.0
Mean Reward 13.6666666667
Max reward so far:  21.0
Episode:  3
Reward:  14.0
Mean Reward 13.75
Max reward so far:  21.0
Episode:  4
Reward:  21.0
Mean Reward 15.2
Max reward so far:  21.0
Episode:  5
Reward:  13.0
Mean Reward 14.8333333333
Max reward so far:  21.0
Episode:  6
Reward:  34.0
Mean Reward 17.5714285714
Max reward so far:  34.0
Episode:  7
Reward:  19.0
Mean Reward 17.75
Max reward so far:  34.0
Episode:  8
Reward:  12.0
Mean Reward 17.1111111111
Max reward so far:  34.0
Episode:  9
Reward:  31.0
Mean Reward 18.5
Max reward so far:  34.0
Episode:  10
Reward:  14.0
Mean Reward 18.0909090909
Max reward so far:  34.0
Episode:  11
Reward:  10.0
Mean Reward 17.4166666667
Max reward so far:  34.0
Episode:  12
Reward:  24.0
Mean Reward 17.9230769231
Max reward so far:  34.0
Episode:  13
Reward:  12.0
...

Episode:  73
Reward:  46.0
Mean Reward 21.5945945946
Max reward so far:  59.0
Episode:  74
Reward:  17.0
Mean Reward 21.5333333333
Max reward so far:  59.0
Episode:  75
Reward:  24.0
Mean Reward 21.5657894737
Max reward so far:  59.0
Episode:  76
Reward:  15.0
Mean Reward 21.4805194805
Max reward so far:  59.0
Episode:  77
Reward:  16.0
Mean Reward 21.4102564103
Max reward so far:  59.0
Episode:  78
Reward:  32.0
Mean Reward 21.5443037975
Max reward so far:  59.0
Episode:  79
Reward:  14.0
Mean Reward 21.45
Max reward so far:  59.0
Episode:  80
Reward:  17.0
Mean Reward 21.3950617284
Max reward so far:  59.0
Episode:  81
Reward:  39.0
Mean Reward 21.6097560976
Max reward so far:  59.0
Episode:  82
Reward:  14.0
Mean Reward 21.5180722892
Max reward so far:  59.0
Episode:  83
Reward:  40.0
Mean Reward 21.7380952381
Max reward so far:  59.0
Episode:  84
Reward:  24.0
Mean Reward 21.7647058824
Max reward so far:  59.0
Episode:  85
Reward:  15.0
Mean Reward 21.6860465116
Max reward so far: 

Episode:  144
Reward:  72.0
Mean Reward 26.9034482759
Max reward so far:  90.0
Episode:  145
Reward:  16.0
Mean Reward 26.8287671233
Max reward so far:  90.0
Episode:  146
Reward:  64.0
Mean Reward 27.0816326531
Max reward so far:  90.0
Episode:  147
Reward:  38.0
Mean Reward 27.1554054054
Max reward so far:  90.0
Episode:  148
Reward:  76.0
Mean Reward 27.4832214765
Max reward so far:  90.0
Episode:  149
Reward:  59.0
Mean Reward 27.6933333333
Max reward so far:  90.0
Episode:  150
Reward:  14.0
Mean Reward 27.6026490066
Max reward so far:  90.0
Episode:  151
Reward:  221.0
Mean Reward 28.875
Max reward so far:  221.0
Episode:  152
Reward:  91.0
Mean Reward 29.2810457516
Max reward so far:  221.0
Episode:  153
Reward:  56.0
Mean Reward 29.4545454545
Max reward so far:  221.0
Episode:  154
Reward:  117.0
Mean Reward 30.0193548387
Max reward so far:  221.0
Episode:  155
Reward:  65.0
Mean Reward 30.2435897436
Max reward so far:  221.0
Episode:  156
Reward:  118.0
Mean Reward 30.80254777

Episode:  212
Reward:  228.0
Mean Reward 63.0516431925
Max reward so far:  396.0
Episode:  213
Reward:  106.0
Mean Reward 63.2523364486
Max reward so far:  396.0
Episode:  214
Reward:  147.0
Mean Reward 63.6418604651
Max reward so far:  396.0
Episode:  215
Reward:  147.0
Mean Reward 64.0277777778
Max reward so far:  396.0
Episode:  216
Reward:  175.0
Mean Reward 64.5391705069
Max reward so far:  396.0
Episode:  217
Reward:  105.0
Mean Reward 64.7247706422
Max reward so far:  396.0
Episode:  218
Reward:  97.0
Mean Reward 64.8721461187
Max reward so far:  396.0
Episode:  219
Reward:  92.0
Mean Reward 64.9954545455
Max reward so far:  396.0
Episode:  220
Reward:  103.0
Mean Reward 65.1674208145
Max reward so far:  396.0
Episode:  221
Reward:  90.0
Mean Reward 65.2792792793
Max reward so far:  396.0
Episode:  222
Reward:  165.0
Mean Reward 65.7264573991
Max reward so far:  396.0
Episode:  223
Reward:  182.0
Mean Reward 66.2455357143
Max reward so far:  396.0
Episode:  224
Reward:  138.0
...

Episode:  279
Reward:  195.0
Mean Reward 93.9214285714
Max reward so far:  533.0
Episode:  280
Reward:  181.0
Mean Reward 94.231316726
Max reward so far:  533.0
Episode:  281
Reward:  204.0
Mean Reward 94.6205673759
Max reward so far:  533.0
Episode:  282
Reward:  205.0
Mean Reward 95.0106007067
Max reward so far:  533.0
Episode:  283
Reward:  221.0
Mean Reward 95.4542253521
Max reward so far:  533.0
Episode:  284
Reward:  152.0
Mean Reward 95.6526315789
Max reward so far:  533.0
Episode:  285
Reward:  203.0
Mean Reward 96.027972028
Max reward so far:  533.0
Episode:  286
Reward:  737.0
Mean Reward 98.2613240418
Max reward so far:  737.0
Episode:  287
Reward:  173.0
Mean Reward 98.5208333333
Max reward so far:  737.0
Episode:  288
Reward:  134.0
Mean Reward 98.6435986159
Max reward so far:  737.0
Episode:  289
Reward:  163.0
Mean Reward 98.8655172414
Max reward so far:  737.0
Episode:  290
Reward:  148.0
Mean Reward 99.0343642612
Max reward so far:  737.0
Episode:  291
Reward:  205.0
...
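
Note that "Mean Reward" in the log is the cumulative average over all episodes so far, not a moving window, which is why it lags well behind the most recent rewards. The sketch below recomputes it for the first five episodes of the log above.

```python
import numpy as np

# Rewards of episodes 0-4, copied from the training log.
all_rewards = [10.0, 10.0, 21.0, 14.0, 21.0]

# Cumulative mean: sum of rewards so far divided by the number of episodes so far.
running_mean = np.cumsum(all_rewards) / np.arange(1, len(all_rewards) + 1)
print(running_mean)  # approximately [10.0, 10.0, 13.67, 13.75, 15.2]
```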

In [12]:
with tf.Session() as sess:
    env.reset()
    rewards = []
    
    # Load the model
    saver.restore(sess, "./models/model.ckpt")

    for episode in range(10):
        state = env.reset()
        step = 0
        done = False
        total_rewards = 0
        print("****************************************************")
        print("EPISODE ", episode)

        while True:
            

            # Choose action a. The policy is stochastic: the network outputs action probabilities and we sample from them.
            action_probability_distribution = sess.run(action_distribution, feed_dict={input_: state.reshape([1,4])})
            #print(action_probability_distribution)
            action = np.random.choice(range(action_probability_distribution.shape[1]), p=action_probability_distribution.ravel())  # select action w.r.t the actions prob


            new_state, reward, done, info = env.step(action)

            total_rewards += reward

            if done:
                rewards.append(total_rewards)
                print ("Score", total_rewards)
                break
            state = new_state
    env.close()
    print("Score over time: " + str(sum(rewards) / len(rewards)))

INFO:tensorflow:Restoring parameters from ./models/model.ckpt
****************************************************
EPISODE  0
Score 250.0
****************************************************
EPISODE  1
Score 286.0
****************************************************
EPISODE  2
Score 204.0
****************************************************
EPISODE  3
Score 191.0
****************************************************
EPISODE  4
Score 363.0
****************************************************
EPISODE  5
Score 216.0
****************************************************
EPISODE  6
Score 205.0
****************************************************
EPISODE  7
Score 271.0
****************************************************
EPISODE  8
Score 175.0
****************************************************
EPISODE  9
Score 128.0
Score over time: 228.9
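
The average is only part of the picture: because actions are sampled from the policy, scores vary a lot between episodes. A short sketch, using the ten scores printed above, re-derives the reported mean and adds the spread.

```python
import numpy as np

# Scores of the 10 evaluation episodes, copied from the output above.
scores = np.array([250.0, 286.0, 204.0, 191.0, 363.0,
                   216.0, 205.0, 271.0, 175.0, 128.0])

print("mean:", scores.mean())  # 228.9, matching "Score over time"
print("std: ", scores.std())
print("min/max:", scores.min(), scores.max())
```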
