# Lunar Lander: REINFORCE Monte Carlo Policy Gradients

In this notebook we'll implement an agent <b>that plays Lunar Lander</b> using REINFORCE, a Monte Carlo policy gradient method.

<img src="http://gym.openai.com/v2018-02-21/videos/LunarLander-v2-b5632e53-9dbb-4135-bc4c-bee948450d63/poster.jpg" alt="Lunar Lander"/>

## Lunar Lander
* [https://gym.openai.com/envs/LunarLander-v2/](https://gym.openai.com/envs/LunarLander-v2/)
* [https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py](https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py)
* [https://github.com/openai/gym/wiki/Leaderboard#lunarlander-v2](https://github.com/openai/gym/wiki/Leaderboard#lunarlander-v2)

## This solution is based on:
* [https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Policy%20Gradients/Cartpole/Cartpole%20REINFORCE%20Monte%20Carlo%20Policy%20Gradients.ipynb](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Policy%20Gradients/Cartpole/Cartpole%20REINFORCE%20Monte%20Carlo%20Policy%20Gradients.ipynb)

## How to run

```
cd Homework_Assignment_Week8
jupyter notebook Lunar\ Lander\ REINFORCE\ Monte\ Carlo\ Policy\ Gradients.ipynb
```


## Step 1: Import the libraries 📚

In [1]:
import tensorflow as tf
import numpy as np
import gym

## Step 2: Create our environment 🎮
This time we use <a href="https://gym.openai.com/">OpenAI Gym</a> which has a lot of great environments.

In [2]:
env = gym.make('LunarLander-v2')
env = env.unwrapped
# Policy gradient has high variance, so seed the environment for reproducibility
env.seed(1)

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


[1]
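
The cell above uses the older Gym API: `env.seed(1)` here, and a four-value return from `env.step` later in the notebook. Newer Gym releases (the change landed around version 0.26) and Gymnasium seed through `env.reset(seed=...)` and return five values from `step`. A minimal sketch of a helper that accepts either shape; `unpack_step` is a hypothetical name for this example and is not part of Gym:

```python
def unpack_step(result):
    """Normalize the tuple returned by env.step to (state, reward, done, info).

    Hypothetical helper: accepts the older 4-tuple (state, reward, done, info)
    or the newer 5-tuple (state, reward, terminated, truncated, info).
    """
    if len(result) == 5:
        state, reward, terminated, truncated, info = result
        # An episode is over if it terminated or was cut short by a time limit.
        return state, reward, terminated or truncated, info
    state, reward, done, info = result
    return state, reward, done, info
```

With this in place, `new_state, reward, done, info = unpack_step(env.step(action))` would work under both API versions.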

## Step 3: Set up our hyperparameters ⚗️

In [3]:
## ENVIRONMENT Hyperparameters
state_size = 8
action_size = env.action_space.n

## TRAINING Hyperparameters
max_episodes = 300
learning_rate = 0.01
gamma = 0.95 # Discount rate
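
To get a feel for `gamma = 0.95`: a reward received `k` steps in the future is weighted by `gamma ** k`, so rewards far ahead contribute little to the return. A small illustration (the step counts are arbitrary choices for this example):

```python
gamma = 0.95  # same discount rate as above

# Weight applied to a reward received k steps in the future.
for k in (1, 10, 50, 100):
    print(k, round(gamma ** k, 4))

# The geometric series 1 + gamma + gamma^2 + ... sums to 1 / (1 - gamma),
# often read as the "effective horizon" of the discounted return.
effective_horizon = 1.0 / (1.0 - gamma)
print(round(effective_horizon, 1))
```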

## Step 4 : Define the preprocessing functions ⚙️
This function takes <b>the rewards of one episode, computes the discounted return at each step, and normalizes the result.</b>

In [4]:
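def discount_and_normalize_rewards(episode_rewards):
    """Compute the discounted return for each step, then standardize to zero mean and unit variance."""
    # Force a float array so integer rewards are not truncated
    discounted_episode_rewards = np.zeros_like(episode_rewards, dtype=np.float64)
    cumulative = 0.0
    # Walk backwards so each step accumulates the gamma-discounted future rewards
    for i in reversed(range(len(episode_rewards))):
        cumulative = cumulative * gamma + episode_rewards[i]
        discounted_episode_rewards[i] = cumulative
    
    mean = np.mean(discounted_episode_rewards)
    std = np.std(discounted_episode_rewards)
    # Small epsilon guards against division by zero when all returns are equal
    discounted_episode_rewards = (discounted_episode_rewards - mean) / (std + 1e-8)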
    
    return discounted_episode_rewards
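
A worked example may make the backward pass easier to follow. The sketch below repeats the discounting step as a standalone function (`discounted_returns` is a name chosen for this example, and `gamma` is passed as a parameter instead of read from a global) and applies it to a toy three-step episode:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    cumulative = 0.0
    for i in reversed(range(len(rewards))):
        cumulative = cumulative * gamma + rewards[i]
        returns[i] = cumulative
    return returns

# Toy episode: three steps, reward 1 each, gamma = 0.5 (chosen for easy arithmetic).
returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)  # values: 1.75, 1.5, 1.0

# Standardizing gives zero mean and unit standard deviation.
normalized = (returns - returns.mean()) / returns.std()
print(normalized.mean(), normalized.std())
```

Working backwards: the last step gets 1, the middle step gets 1 + 0.5 * 1 = 1.5, and the first step gets 1 + 0.5 * 1.5 = 1.75.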

## Step 5: Create our Policy Gradient Neural Network model 🧠

In [5]:
with tf.name_scope("inputs"):
    input_ = tf.placeholder(tf.float32, [None, state_size], name="input_")
    actions = tf.placeholder(tf.int32, [None, action_size], name="actions")
    discounted_episode_rewards_ = tf.placeholder(tf.float32, [None,], name="discounted_episode_rewards")
    
    # Placeholder so that the mean reward can be logged to TensorBoard
    mean_reward_ = tf.placeholder(tf.float32 , name="mean_reward")

    with tf.name_scope("fc1"):
        fc1 = tf.contrib.layers.fully_connected(inputs = input_,
                                                num_outputs = 10,
                                                activation_fn=tf.nn.relu,
                                                weights_initializer=tf.contrib.layers.xavier_initializer())

    with tf.name_scope("fc2"):
        fc2 = tf.contrib.layers.fully_connected(inputs = fc1,
                                                num_outputs = action_size,
                                                activation_fn= tf.nn.relu,
                                                weights_initializer=tf.contrib.layers.xavier_initializer())
    
    with tf.name_scope("fc3"):
        fc3 = tf.contrib.layers.fully_connected(inputs = fc2,
                                                num_outputs = action_size,
                                                activation_fn= None,
                                                weights_initializer=tf.contrib.layers.xavier_initializer())

    with tf.name_scope("softmax"):
        action_distribution = tf.nn.softmax(fc3)

    with tf.name_scope("loss"):
        # tf.nn.softmax_cross_entropy_with_logits computes the cross entropy of the result after applying the softmax function
        # If you have single-class labels, where an object can only belong to one class, you might now consider using 
        # tf.nn.sparse_softmax_cross_entropy_with_logits so that you don't have to convert your labels to a dense one-hot array. 
        neg_log_prob = tf.nn.softmax_cross_entropy_with_logits(logits = fc3, labels = actions)
        loss = tf.reduce_mean(neg_log_prob * discounted_episode_rewards_) 
        
    
    with tf.name_scope("train"):
        train_opt = tf.train.AdamOptimizer(learning_rate).minimize(loss)
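
The loss above is the REINFORCE objective: the negative log-probability of each action taken, weighted by the normalized discounted return for that step. Here is a NumPy sketch of the same arithmetic on made-up numbers (the logits, actions, and returns are illustrative, not taken from a training run). It also checks that the one-hot cross entropy equals indexing the log-probabilities directly, which is what the sparse variant mentioned in the comments computes:

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Illustrative batch: 3 time steps, 4 actions (LunarLander-v2 has 4 discrete actions).
logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.0, 0.0, 0.0, 0.0],
                   [-0.5, 1.5, 0.3, 0.2]])
action_indices = np.array([0, 2, 1])
returns = np.array([1.2, -0.3, -0.9])  # stand-in for normalized discounted returns

log_probs = log_softmax(logits)

# One-hot form, matching softmax_cross_entropy_with_logits.
one_hot = np.eye(logits.shape[1])[action_indices]
neg_log_prob_dense = -(one_hot * log_probs).sum(axis=1)

# Index form, matching the sparse variant.
neg_log_prob_sparse = -log_probs[np.arange(len(action_indices)), action_indices]

loss = np.mean(neg_log_prob_dense * returns)
print(loss)
```

Minimizing this quantity raises the probability of actions that were followed by above-average returns and lowers it for actions followed by below-average returns.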

## Step 6: Set up Tensorboard 📊
For more information about TensorBoard, please watch this <a href="https://www.youtube.com/embed/eBbEDRsCmv4">excellent 30min tutorial</a>. <br><br>
To launch TensorBoard: `tensorboard --logdir=/tensorboard/pg/1`

In [None]:
# Setup TensorBoard Writer
writer = tf.summary.FileWriter("/tensorboard/pg/1")

## Losses
tf.summary.scalar("Loss", loss)

## Reward mean
tf.summary.scalar("Reward_mean", mean_reward_)

write_op = tf.summary.merge_all()

## Step 7: Train our Agent 🏃‍♂️

In [6]:
allRewards = []
total_rewards = 0
maximumRewardRecorded = 0
episode = 0
episode_states, episode_actions, episode_rewards = [],[],[]

saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    for episode in range(max_episodes):
        
        episode_rewards_sum = 0

        # Launch the game
        state = env.reset()
        
        env.render()
           
        while True:
            
            # Choose action a. The policy is stochastic: the network outputs action probabilities, not a single action.
            action_probability_distribution = sess.run(action_distribution, feed_dict={input_: state.reshape([1,8])})
            
            # select action w.r.t the actions prob
            action = np.random.choice(range(action_probability_distribution.shape[1]), p=action_probability_distribution.ravel())

            # Perform a
            new_state, reward, done, info = env.step(action)

            # Store s, a, r
            episode_states.append(state)
                        
            # One-hot encode the action: the loss expects a vector of length action_size
            # (e.g. [0., 1., 0., 0.] for action 1), not just the index
            action_ = np.zeros(action_size)
            action_[action] = 1
            
            episode_actions.append(action_)
            
            episode_rewards.append(reward)
            if done:
                # Calculate sum reward
                episode_rewards_sum = np.sum(episode_rewards)
                
                allRewards.append(episode_rewards_sum)
                
                total_rewards = np.sum(allRewards)
                
                # Mean reward
                mean_reward = np.divide(total_rewards, episode+1)
                
                
                maximumRewardRecorded = np.amax(allRewards)
                
                print("==========================================")
                print("Episode: ", episode)
                print("Reward: ", episode_rewards_sum)
                print("Mean Reward", mean_reward)
                print("Max reward so far: ", maximumRewardRecorded)
                
                # Calculate discounted reward
                discounted_episode_rewards = discount_and_normalize_rewards(episode_rewards)
                                
                # Feedforward, gradient and backpropagation
                loss_, _ = sess.run([loss, train_opt], feed_dict={input_: np.vstack(np.array(episode_states)),
                                                                 actions: np.vstack(np.array(episode_actions)),
                                                                 discounted_episode_rewards_: discounted_episode_rewards 
                                                                })
                
 
                                                                 
                # Write TF Summaries
                #summary = sess.run(write_op, feed_dict={input_: np.vstack(np.array(episode_states)),
                #                                                 actions: np.vstack(np.array(episode_actions)),
                #                                                 discounted_episode_rewards_: discounted_episode_rewards,
                #                                                    mean_reward_: mean_reward
                #                                                })
                
               
                #writer.add_summary(summary, episode)
                #writer.flush()
                
            
                
                # Reset the transition stores
                episode_states, episode_actions, episode_rewards = [],[],[]
                
                break
            
            state = new_state
        
        # Save the model every 100 episodes (episodes 0, 100 and 200 when max_episodes = 300)
        if episode % 100 == 0:
            saver.save(sess, "./models/model.ckpt")
            print("Model saved")

Episode:  0
Reward:  -302.762409694
Mean Reward -302.762409694
Max reward so far:  -302.762409694
Model saved
Episode:  1
Reward:  -137.125651132
Mean Reward -219.944030413
Max reward so far:  -137.125651132
Episode:  2
Reward:  -255.416011056
Mean Reward -231.768023961
Max reward so far:  -137.125651132
Episode:  3
Reward:  -146.269564031
Mean Reward -210.393408978
Max reward so far:  -137.125651132
Episode:  4
Reward:  -511.183935194
Mean Reward -270.551514221
Max reward so far:  -137.125651132
Episode:  5
Reward:  -701.296220035
Mean Reward -342.342298524
Max reward so far:  -137.125651132
Episode:  6
Reward:  -209.002647548
Mean Reward -323.293776956
Max reward so far:  -137.125651132
Episode:  7
Reward:  -13.3078446804
Mean Reward -284.545535421
Max reward so far:  -13.3078446804
Episode:  8
Reward:  -344.340835517
Mean Reward -291.189457654
Max reward so far:  -13.3078446804
Episode:  9
Reward:  -496.109205775
Mean Reward -311.681432466
Max reward so far:  -13.3078446804
...

Episode:  60
Reward:  -163.21356072
Mean Reward -282.42047706
Max reward so far:  -13.3078446804
Episode:  61
Reward:  -76.8689003683
Mean Reward -279.105129049
Max reward so far:  -13.3078446804
Episode:  62
Reward:  -402.6325683
Mean Reward -281.065882053
Max reward so far:  -13.3078446804
Episode:  63
Reward:  -159.331692383
Mean Reward -279.163785339
Max reward so far:  -13.3078446804
Episode:  64
Reward:  -47.3474093222
Mean Reward -275.597379554
Max reward so far:  -13.3078446804
Episode:  65
Reward:  -155.960522521
Mean Reward -273.784699902
Max reward so far:  -13.3078446804
Episode:  66
Reward:  -40.4284576522
Mean Reward -270.301770913
Max reward so far:  -13.3078446804
Episode:  67
Reward:  -161.436996753
Mean Reward -268.700818352
Max reward so far:  -13.3078446804
Episode:  68
Reward:  -258.843014666
Mean Reward -268.557951632
Max reward so far:  -13.3078446804
Episode:  69
Reward:  -432.987749365
Mean Reward -270.906948742
Max reward so far:  -13.3078446804
Episode:  70
...

Episode:  120
Reward:  -96.8525955229
Mean Reward -244.894097863
Max reward so far:  -13.3078446804
Episode:  121
Reward:  -205.67722561
Mean Reward -244.57264809
Max reward so far:  -13.3078446804
Episode:  122
Reward:  -266.491399299
Mean Reward -244.750849319
Max reward so far:  -13.3078446804
Episode:  123
Reward:  -434.761184679
Mean Reward -246.283190734
Max reward so far:  -13.3078446804
Episode:  124
Reward:  -274.981977375
Mean Reward -246.512781027
Max reward so far:  -13.3078446804
Episode:  125
Reward:  -238.469845331
Mean Reward -246.448948204
Max reward so far:  -13.3078446804
Episode:  126
Reward:  -171.797113957
Mean Reward -245.861138485
Max reward so far:  -13.3078446804
Episode:  127
Reward:  -268.789647029
Mean Reward -246.040267458
Max reward so far:  -13.3078446804
Episode:  128
Reward:  -245.489364415
Mean Reward -246.035996892
Max reward so far:  -13.3078446804
Episode:  129
Reward:  -173.953523438
Mean Reward -245.481516327
Max reward so far:  -13.3078446804
...

Episode:  178
Reward:  -361.906045639
Mean Reward -241.823630875
Max reward so far:  -13.3078446804
Episode:  179
Reward:  -448.661826026
Mean Reward -242.972731959
Max reward so far:  -13.3078446804
Episode:  180
Reward:  -416.139334898
Mean Reward -243.929453522
Max reward so far:  -13.3078446804
Episode:  181
Reward:  -167.594842579
Mean Reward -243.510032583
Max reward so far:  -13.3078446804
Episode:  182
Reward:  -464.440461916
Mean Reward -244.717302689
Max reward so far:  -13.3078446804
Episode:  183
Reward:  -171.097898927
Mean Reward -244.317197234
Max reward so far:  -13.3078446804
Episode:  184
Reward:  -387.562055758
Mean Reward -245.091493766
Max reward so far:  -13.3078446804
Episode:  185
Reward:  -151.775118186
Mean Reward -244.589792822
Max reward so far:  -13.3078446804
Episode:  186
Reward:  -261.952733281
Mean Reward -244.682642771
Max reward so far:  -13.3078446804
Episode:  187
Reward:  -137.289900011
Mean Reward -244.111404778
Max reward so far:  -13.3078446804


Episode:  238
Reward:  -147.446827311
Mean Reward -250.482536952
Max reward so far:  -13.3078446804
Episode:  239
Reward:  -133.462934127
Mean Reward -249.994955274
Max reward so far:  -13.3078446804
Episode:  240
Reward:  -316.142980271
Mean Reward -250.269428406
Max reward so far:  -13.3078446804
Episode:  241
Reward:  -141.390197586
Mean Reward -249.819514229
Max reward so far:  -13.3078446804
Episode:  242
Reward:  -204.836499167
Mean Reward -249.634398941
Max reward so far:  -13.3078446804
Episode:  243
Reward:  -203.067056534
Mean Reward -249.443549177
Max reward so far:  -13.3078446804
Episode:  244
Reward:  -373.713306705
Mean Reward -249.950772677
Max reward so far:  -13.3078446804
Episode:  245
Reward:  -143.507167536
Mean Reward -249.518075095
Max reward so far:  -13.3078446804
Episode:  246
Reward:  -116.803343538
Mean Reward -248.98076849
Max reward so far:  -13.3078446804
Episode:  247
Reward:  -120.998496705
Mean Reward -248.464710942
Max reward so far:  -13.3078446804
...

Episode:  298
Reward:  -216.345179745
Mean Reward -244.539122811
Max reward so far:  -13.3078446804
Episode:  299
Reward:  -397.185405387
Mean Reward -245.047943753
Max reward so far:  -13.3078446804
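
In the training loop, each action is drawn from the policy's probability vector instead of being chosen greedily, and the chosen index is stored one-hot so it can be fed to the `actions` placeholder. The sketch below shows those two steps in isolation with a hand-written probability vector. It uses a seeded `np.random.default_rng` generator, which is an assumption of this example: the notebook calls the global `np.random.choice`, which `env.seed(1)` does not affect.

```python
import numpy as np

rng = np.random.default_rng(1)  # assumption: seeded generator for repeatable sampling

action_size = 4
# Stand-in for the softmax output of the policy network for one state.
probs = np.array([0.1, 0.6, 0.2, 0.1])

# Sample an action index in proportion to its probability.
action = rng.choice(action_size, p=probs)

# One-hot encode the sampled index.
action_one_hot = np.zeros(action_size)
action_one_hot[action] = 1

# Over many draws, the empirical frequencies approach the probabilities.
samples = rng.choice(action_size, size=10000, p=probs)
frequencies = np.bincount(samples, minlength=action_size) / samples.size
print(action, action_one_hot, frequencies)
```

Sampling keeps the agent exploring: even a low-probability action is occasionally tried, which is how the policy can discover that it leads to better returns.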


In [7]:
with tf.Session() as sess:
    env.reset()
    rewards = []
    
    # Load the model
    saver.restore(sess, "./models/model.ckpt")

    for episode in range(10):
        state = env.reset()
        step = 0
        done = False
        total_rewards = 0
        print("****************************************************")
        print("EPISODE ", episode)

        while True:
            

            # Choose action a. The policy is stochastic: the network outputs action probabilities, not a single action.
            action_probability_distribution = sess.run(action_distribution, feed_dict={input_: state.reshape([1,8])})
            #print(action_probability_distribution)
            action = np.random.choice(range(action_probability_distribution.shape[1]), p=action_probability_distribution.ravel())  # select action w.r.t the actions prob


            new_state, reward, done, info = env.step(action)

            total_rewards += reward

            if done:
                rewards.append(total_rewards)
                print("Score", total_rewards)
                break
            state = new_state
    env.close()
    print("Score over time: " + str(sum(rewards) / len(rewards)))

INFO:tensorflow:Restoring parameters from ./models/model.ckpt
****************************************************
EPISODE  0
Score -172.734114093
****************************************************
EPISODE  1
Score -365.023196326
****************************************************
EPISODE  2
Score -308.516804824
****************************************************
EPISODE  3
Score -536.953320571
****************************************************
EPISODE  4
Score -229.889098342
****************************************************
EPISODE  5
Score -173.094697606
****************************************************
EPISODE  6
Score -235.598483917
****************************************************
EPISODE  7
Score -155.456255468
****************************************************
EPISODE  8
Score -208.079510543
****************************************************
EPISODE  9
Score -111.280167542
Score over time: -249.662564923
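
The "Score over time" figure is the arithmetic mean of the ten evaluation scores. The snippet below recomputes it from the scores printed above and adds the standard deviation and the best and worst episodes, to show how widely individual episodes vary:

```python
import numpy as np

# Scores copied from the evaluation output above.
scores = np.array([
    -172.734114093, -365.023196326, -308.516804824, -536.953320571,
    -229.889098342, -173.094697606, -235.598483917, -155.456255468,
    -208.079510543, -111.280167542,
])

mean_score = scores.mean()
std_score = scores.std()
print("mean:", mean_score)
print("std:", std_score)
print("best:", scores.max(), "worst:", scores.min())
```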
