<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Mountain-Car:-REINFORCE-Monte-Carlo-Policy-Gradients" data-toc-modified-id="Mountain-Car:-REINFORCE-Monte-Carlo-Policy-Gradients-1">Mountain Car: REINFORCE Monte Carlo Policy Gradients</a></span></li><li><span><a href="#This-is-a-notebook-from-Deep-Reinforcement-Learning-Course-with-Tensorflow" data-toc-modified-id="This-is-a-notebook-from-Deep-Reinforcement-Learning-Course-with-Tensorflow-2">This is a notebook from <a href="https://simoninithomas.github.io/Deep_reinforcement_learning_Course/" target="_blank">Deep Reinforcement Learning Course with Tensorflow</a></a></span><ul class="toc-item"><li><span><a href="#Step-1:-Import-the-libraries" data-toc-modified-id="Step-1:-Import-the-libraries-2.1">Step 1: Import the libraries</a></span></li><li><span><a href="#Step-2:-Create-our-environment" data-toc-modified-id="Step-2:-Create-our-environment-2.2">Step 2: Create our environment</a></span></li><li><span><a href="#Step-3:-Set-up-our-hyperparameters" data-toc-modified-id="Step-3:-Set-up-our-hyperparameters-2.3">Step 3: Set up our hyperparameters</a></span></li><li><span><a href="#Step-4-:-Define-the-preprocessing-functions️" data-toc-modified-id="Step-4-:-Define-the-preprocessing-functions️-2.4">Step 4 : Define the preprocessing functions️</a></span></li><li><span><a href="#Step-5:-Create-our-Policy-Gradient-Neural-Network-model" data-toc-modified-id="Step-5:-Create-our-Policy-Gradient-Neural-Network-model-2.5">Step 5: Create our Policy Gradient Neural Network model</a></span></li><li><span><a href="#Step-6:-Set-up-Tensorboard" data-toc-modified-id="Step-6:-Set-up-Tensorboard-2.6">Step 6: Set up Tensorboard</a></span></li><li><span><a href="#Step-7:-Train-our-Agent" data-toc-modified-id="Step-7:-Train-our-Agent-2.7">Step 7: Train our Agent</a></span></li><li><span><a href="#Step-8:-Evaluate-our-trained-model" data-toc-modified-id="Step-8:-Evaluate-our-trained-model-2.8">Step 8: Evaluate our trained 
model</a></span></li><li><span><a href="#Report" data-toc-modified-id="Report-2.9">Report</a></span></li></ul></li></ul></div>

# Mountain Car: REINFORCE Monte Carlo Policy Gradients

In this notebook we'll implement an agent that plays <b>MountainCar-v0</b>.
<video controls src="./assets/mountain_car.mp4" />

# This is a notebook from [Deep Reinforcement Learning Course with Tensorflow](https://simoninithomas.github.io/Deep_reinforcement_learning_Course/)

The original notebook, with a solution for CartPole, is [here](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Policy%20Gradients/Cartpole/Cartpole%20REINFORCE%20Monte%20Carlo%20Policy%20Gradients.ipynb).


## Step 1: Import the libraries

In [1]:
import tensorflow as tf
import numpy as np
import gym

## Step 2: Create our environment
This time we use <a href="https://gym.openai.com/">OpenAI Gym</a>, which has a lot of great environments.

In [2]:
# env_name = 'CartPole-v0'
env_name = 'MountainCar-v0'
# env = gym.make('MountainCar-v0')
env = gym.make(env_name)
env = env.unwrapped
# Policy gradient has high variance, seed for reproducibility
env.seed(1);

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


## Step 3: Set up our hyperparameters 

In [30]:
## ENVIRONMENT Hyperparameters
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
# print(action_size, state_size)
## TRAINING Hyperparameters
max_episodes = 300
learning_rate = 0.01
STEP_MULTIPLE = 3.0
gamma = 0.95 # Discount rate
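
`STEP_MULTIPLE` is used in the training loop (Step 7) to cap each episode at a multiple of the fastest episode seen so far. The rule can be sketched on its own as follows. The helper name `update_step_max` is hypothetical and not part of the notebook, and the step counts are illustrative values chosen so that the resulting caps match the 9876.0 and 7857.0 that appear in the training log.

```python
def update_step_max(step_max, episode_steps, step_multiple=3.0):
    """Tighten the step cap when an episode finishes fast enough.

    Mirrors the rule in the training loop: if the episode took fewer than
    step_max / step_multiple steps, the new cap becomes
    episode_steps * step_multiple. Otherwise the cap is unchanged.
    """
    if episode_steps < step_max / step_multiple:
        return episode_steps * step_multiple
    return step_max


cap = 20000  # initial step_max used in the training cell
for steps in (3292, 7500, 2619):  # illustrative episode lengths
    cap = update_step_max(cap, steps)
    print(f"episode steps: {steps}, step cap: {cap}")
```

Because the cap only ever shrinks, a single unusually fast episode can make later episodes much more likely to be cut off, which is the trade-off discussed in the Report.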

## Step 4 : Define the preprocessing functions️
This function takes <b>the episode rewards, discounts them, and normalizes the result</b> to zero mean and unit standard deviation.

In [4]:
def discount_and_normalize_rewards(episode_rewards):
    discounted_episode_rewards = np.zeros_like(episode_rewards)
    cumulative = 0.0
    for i in reversed(range(len(episode_rewards))):
        cumulative = cumulative * gamma + episode_rewards[i]
        discounted_episode_rewards[i] = cumulative
    
    mean = np.mean(discounted_episode_rewards)
    std = np.std(discounted_episode_rewards)
    discounted_episode_rewards = (discounted_episode_rewards - mean) / (std)
    
    return discounted_episode_rewards
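
To make the discounting step concrete, here is a small standalone sketch on a three-step episode. It follows the same backward recursion as the function above; the names `discounted_returns` and `normalize` and the `eps` guard against a zero standard deviation are assumptions added for illustration, not part of the notebook.

```python
import numpy as np


def discounted_returns(rewards, gamma=0.95):
    """Backward recursion G_t = r_t + gamma * G_{t+1}."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = running * gamma + rewards[i]
        returns[i] = running
    return returns


def normalize(x, eps=1e-8):
    """Zero mean, unit std; eps (an assumption) avoids division by zero."""
    return (x - x.mean()) / (x.std() + eps)


rewards = [-1.0, -1.0, 9.0]
G = discounted_returns(rewards)
print(G)             # returns for steps 0, 1, 2: 6.1725, 7.55, 9.0
print(normalize(G))  # same ordering, rescaled around zero
```

The last step keeps its raw reward, while earlier steps receive the later reward scaled by one factor of `gamma` per step of delay.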

## Step 5: Create our Policy Gradient Neural Network model

<img src="./assets/mountain_car.jpeg">

The idea is simple:
- Our state is an array of 2 values, **position** and **velocity**, which is used as the input.
- Our NN has 3 fully connected layers.
- Our output activation function is **softmax** that squashes the outputs to a probability distribution:
    - for instance: $ softmax(4,\ 2,\ 6) \rightarrow (0.117,\ 0.016,\ 0.867) $
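
The softmax example above can be checked numerically. This is a plain NumPy sketch, independent of the TensorFlow graph below; subtracting the maximum before exponentiating is a common stability trick and does not change the result.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D vector of logits."""
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max()  # shift so the largest exponent is 0
    e = np.exp(z)
    return e / e.sum()


probs = softmax([4.0, 2.0, 6.0])
print(np.round(probs, 3))  # [0.117 0.016 0.867]
```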

In [5]:
with tf.device("/device:GPU:1"):
    with tf.name_scope("inputs"):
        input_ = tf.placeholder(tf.float32, [None, state_size], name="input_")
        actions = tf.placeholder(tf.int32, [None, action_size], name="actions")
        discounted_episode_rewards_ = tf.placeholder(tf.float32, [None,], name="discounted_episode_rewards")

        # Add this placeholder for having this variable in tensorboard
        mean_reward_ = tf.placeholder(tf.float32 , name="mean_reward")

        with tf.name_scope("fc1"):
            fc1 = tf.contrib.layers.fully_connected(inputs = input_,
                                                    num_outputs = 10,
                                                    activation_fn=tf.nn.relu,
                                                    weights_initializer=tf.contrib.layers.xavier_initializer())

        with tf.name_scope("fc2"):
            fc2 = tf.contrib.layers.fully_connected(inputs = fc1,
                                                    num_outputs = action_size,
                                                    activation_fn= tf.nn.relu,
                                                    weights_initializer=tf.contrib.layers.xavier_initializer())

        with tf.name_scope("fc3"):
            fc3 = tf.contrib.layers.fully_connected(inputs = fc2,
                                                    num_outputs = action_size,
                                                    activation_fn= None,
                                                    weights_initializer=tf.contrib.layers.xavier_initializer())

        with tf.name_scope("softmax"):
            action_distribution = tf.nn.softmax(fc3)

        with tf.name_scope("loss"):
            # tf.nn.softmax_cross_entropy_with_logits computes the cross entropy of the result after applying the softmax function
            # If you have single-class labels, where an object can only belong to one class, you might now consider using 
            # tf.nn.sparse_softmax_cross_entropy_with_logits so that you don't have to convert your labels to a dense one-hot array. 
            neg_log_prob = tf.nn.softmax_cross_entropy_with_logits_v2(logits = fc3, labels = actions)
            loss = tf.reduce_mean(neg_log_prob * discounted_episode_rewards_) 


        with tf.name_scope("train"):
            train_opt = tf.train.AdamOptimizer(learning_rate).minimize(loss)
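
The loss defined above is the mean of the negative log-probability of each action taken, weighted by its normalized discounted return. The following NumPy sketch computes the same quantity outside of TensorFlow to show the arithmetic; the function name `reinforce_loss` and the toy inputs are hypothetical.

```python
import numpy as np


def reinforce_loss(logits, actions_one_hot, returns):
    """mean( -log pi(a_t | s_t) * G_t ) for a batch of steps."""
    logits = np.asarray(logits, dtype=np.float64)
    # Log-softmax, computed stably by shifting each row by its max.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Cross entropy with one-hot labels picks out the taken action.
    neg_log_prob = -(np.asarray(actions_one_hot) * log_probs).sum(axis=1)
    return float(np.mean(neg_log_prob * np.asarray(returns)))


logits = [[4.0, 2.0, 6.0], [0.0, 0.0, 0.0]]  # toy network outputs (fc3)
actions = [[0, 0, 1], [1, 0, 0]]             # one-hot actions taken
returns = [1.0, -1.0]                        # normalized discounted returns
loss = reinforce_loss(logits, actions, returns)
print(loss)
```

Minimizing this value raises the probability of actions with positive normalized return and lowers it for actions with negative normalized return.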

## Step 6: Set up Tensorboard
For more information about TensorBoard, please watch this <a href="https://www.youtube.com/embed/eBbEDRsCmv4">excellent 30min tutorial</a> <br><br>
To launch TensorBoard, run this from the notebook's directory: `tensorboard --logdir=./tensorboard/pg/1`

In [8]:
# Setup TensorBoard Writer
# writer = tf.summary.FileWriter("/tensorboard/pg/1")
!rm -Rf ./tensorboard
writer = tf.summary.FileWriter("./tensorboard/pg/1")

## Losses
tf.summary.scalar("Loss", loss)

## Reward mean
tf.summary.scalar("Reward_mean", mean_reward_)

write_op = tf.summary.merge_all()

## Step 7: Train our Agent 

In [25]:
allRewards = []
total_rewards = 0
maximumRewardRecorded = 0
episode = 0
episode_states, episode_actions, episode_rewards = [],[],[]

saver = tf.train.Saver()

import datetime

step_max = 20000
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    old_position, old_velocity = None, None
    for episode in range(max_episodes + 10):
        
        episode_rewards_sum = 0

        # Launch the game
        state = env.reset()
        start_time = datetime.datetime.now()
        print("==========================================")
        print(f"The starting state is : {state}")
        
        #env.render()
        counter = 0
        episode_max_pos, episode_min_pos = float("-2.0"), float("2.0")
        direction_change_counter = 0
        fail = False
        while True:
            # Choose action a. The policy is stochastic: the network outputs action probabilities and we sample from them.

            action_probability_distribution = sess.run(action_distribution, feed_dict={input_: state.reshape([1,state_size])})
            action = np.random.choice(range(action_probability_distribution.shape[1]), p=action_probability_distribution.ravel())  # select action w.r.t the actions prob


            new_state, reward, done, info = env.step(action)

            # Compare against the state we acted from, so the previous episode's
            # last velocity never leaks into the first step of a new episode
            old_position, old_velocity = state
                
            position, velocity = new_state
            velocity_sign = velocity * old_velocity
            
            bonus = 0.0
            if velocity_sign < 0.0:
                new_record = False
                direction_change_counter += 1
                if position > episode_max_pos:
                    episode_max_pos = position
                    new_record = True
                elif position < episode_min_pos:
                    episode_min_pos = position
                    new_record = True

                if new_record:
                    bonus = 10.0  # bonus for gaining potential energy
                else:
                    bonus = -2.0  # penalty for wasting potential energy

            reward += bonus
            
            counter += 1
            #if counter == 10:
            #    break
            # Store s, a, r
            episode_states.append(state)
                        
            # One-hot encode the action: the loss expects a vector of length action_size,
            # e.g. [0., 0., 1.] for action index 2 (push right), not just the index
            action_ = np.zeros(action_size)
            action_[action] = 1
            
            episode_actions.append(action_)
            
            episode_rewards.append(reward)
           
            if counter % 1000 == 0:
                print(f"Step: {counter}")
        
            if counter >= step_max:
                # Bad Ending
                if episode <= max_episodes:
                    done = True
                    fail = True
                else:
                    step_max = 1000000
                    
                
            if done:
               
                if counter < step_max / STEP_MULTIPLE :
                    step_max = counter * STEP_MULTIPLE
                    print(f"The new step_max is : {step_max}")
                    
                # Calculate sum reward
                episode_rewards_sum = np.sum(episode_rewards)
                
                allRewards.append(episode_rewards_sum)
                
                total_rewards = np.sum(allRewards)
                
                # Mean reward
                mean_reward = np.divide(total_rewards, episode+1)
                
                
                maximumRewardRecorded = np.amax(allRewards)
                end_time = datetime.datetime.now()
                print("==========================================")
                print(f"Total Time: {end_time - start_time}")
                print(f"Step max: {step_max}")
                print(f"Fail: {fail}")
                print("Episode: ", episode)
                print("Reward: ", episode_rewards_sum)
                print("Mean Reward", mean_reward)
                print("Max reward so far: ", maximumRewardRecorded)
                
                # Calculate discounted reward
                discounted_episode_rewards = discount_and_normalize_rewards(episode_rewards)
                                
                # Feedforward, gradient and backpropagation
                loss_, _ = sess.run([loss, train_opt], feed_dict={input_: np.vstack(np.array(episode_states)),
                                                                 actions: np.vstack(np.array(episode_actions)),
                                                                 discounted_episode_rewards_: discounted_episode_rewards 
                                                                })
                
 
                                                                 
                # Write TF Summaries
                summary = sess.run(write_op, feed_dict={input_: np.vstack(np.array(episode_states)),
                                                                 actions: np.vstack(np.array(episode_actions)),
                                                                 discounted_episode_rewards_: discounted_episode_rewards,
                                                                    mean_reward_: mean_reward
                                                                })
                
               
                writer.add_summary(summary, episode)
                writer.flush()
                
            
                
                # Reset the transition stores
                episode_states, episode_actions, episode_rewards = [],[],[]
                
                break
            
            state = new_state
        
        # Save Model
        if episode % 100 == 0:
            saver.save(sess, "./models/model.ckpt")
            print("Model saved")

The starting state is : [-0.44399565  0.        ]
Step: 1000
Step: 2000
Step: 3000
The new step_max is : 9876.0
Total Time: 0:00:02.610894
Step max: 9876.0
Fail: False
Episode:  0
Reward:  -3132.0
Mean Reward -3132.0
Max reward so far:  -3132.0
Model saved
The starting state is : [-0.55554355  0.        ]
Step: 1000
Step: 2000
Step: 3000
Step: 4000
Step: 5000
Step: 6000
Step: 7000
Total Time: 0:00:06.439067
Step max: 9876.0
Fail: False
Episode:  1
Reward:  -7990.0
Mean Reward -5561.0
Max reward so far:  -3132.0
The starting state is : [-0.42067916  0.        ]
Step: 1000
Step: 2000
The new step_max is : 7857.0
Total Time: 0:00:02.098376
Step max: 7857.0
Fail: False
Episode:  2
Reward:  -2447.0
Mean Reward -4523.0
Max reward so far:  -2447.0
The starting state is : [-0.4811911  0.       ]
Step: 1000
Step: 2000
Step: 3000
Total Time: 0:00:02.545705
Step max: 7857.0
Fail: False
Episode:  3
Reward:  -2946.0
Mean Reward -4128.75
Max reward so far:  -2447.0
The starting state is : [-0.587829

Total Time: 0:00:02.313672
Step max: 2673.0
Fail: True
Episode:  26
Reward:  -2477.0
Mean Reward -3203.222222222222
Max reward so far:  -707.0
The starting state is : [-0.51836068  0.        ]
Step: 1000
Step: 2000
Total Time: 0:00:02.118040
Step max: 2673.0
Fail: False
Episode:  27
Reward:  -1976.0
Mean Reward -3159.3928571428573
Max reward so far:  -707.0
The starting state is : [-0.46325767  0.        ]
Step: 1000
Step: 2000
Total Time: 0:00:02.049769
Step max: 2673.0
Fail: False
Episode:  28
Reward:  -2033.0
Mean Reward -3120.551724137931
Max reward so far:  -707.0
The starting state is : [-0.54246275  0.        ]
Step: 1000
Step: 2000
Total Time: 0:00:02.338552
Step max: 2673.0
Fail: True
Episode:  29
Reward:  -2583.0
Mean Reward -3102.633333333333
Max reward so far:  -707.0
The starting state is : [-0.55034325  0.        ]
Step: 1000
Step: 2000
Total Time: 0:00:02.386988
Step max: 2673.0
Fail: True
Episode:  30
Reward:  -2555.0
Mean Reward -3084.967741935484
Max reward so far:  -

Step: 1000
Total Time: 0:00:01.339872
Step max: 2538.0
Fail: False
Episode:  54
Reward:  -1458.0
Mean Reward -2525.690909090909
Max reward so far:  -658.0
The starting state is : [-0.45335228  0.        ]
Step: 1000
Total Time: 0:00:01.386228
Step max: 2538.0
Fail: False
Episode:  55
Reward:  -1393.0
Mean Reward -2505.464285714286
Max reward so far:  -658.0
The starting state is : [-0.55988485  0.        ]
The new step_max is : 2133.0
Total Time: 0:00:00.585721
Step max: 2133.0
Fail: False
Episode:  56
Reward:  -573.0
Mean Reward -2471.561403508772
Max reward so far:  -573.0
The starting state is : [-0.43743306  0.        ]
Step: 1000
Step: 2000
Total Time: 0:00:01.744693
Step max: 2133.0
Fail: True
Episode:  57
Reward:  -1981.0
Mean Reward -2463.103448275862
Max reward so far:  -573.0
The starting state is : [-0.49123936  0.        ]
Step: 1000
Step: 2000
Total Time: 0:00:01.696170
Step max: 2133.0
Fail: True
Episode:  58
Reward:  -1787.0
Mean Reward -2451.64406779661
Max reward so fa

Step: 1000
Total Time: 0:00:01.383326
Step max: 1530.0
Fail: False
Episode:  82
Reward:  -1227.0
Mean Reward -2129.9397590361446
Max reward so far:  -390.0
The starting state is : [-0.49380311  0.        ]
Step: 1000
Total Time: 0:00:01.351499
Step max: 1530.0
Fail: True
Episode:  83
Reward:  -1358.0
Mean Reward -2120.75
Max reward so far:  -390.0
The starting state is : [-0.47146982  0.        ]
Total Time: 0:00:00.682680
Step max: 1530.0
Fail: False
Episode:  84
Reward:  -556.0
Mean Reward -2102.3411764705884
Max reward so far:  -390.0
The starting state is : [-0.40139574  0.        ]
Step: 1000
Total Time: 0:00:01.422627
Step max: 1530.0
Fail: True
Episode:  85
Reward:  -1470.0
Mean Reward -2094.9883720930234
Max reward so far:  -390.0
The starting state is : [-0.5397996  0.       ]
Step: 1000
Total Time: 0:00:01.051688
Step max: 1530.0
Fail: False
Episode:  86
Reward:  -907.0
Mean Reward -2081.3333333333335
Max reward so far:  -390.0
The starting state is : [-0.4525331  0.       ]


Total Time: 0:00:00.812724
Step max: 1530.0
Fail: False
Episode:  111
Reward:  -708.0
Mean Reward -1853.0625
Max reward so far:  -390.0
The starting state is : [-0.48352402  0.        ]
Step: 1000
Total Time: 0:00:01.119791
Step max: 1530.0
Fail: False
Episode:  112
Reward:  -1038.0
Mean Reward -1845.8495575221239
Max reward so far:  -390.0
The starting state is : [-0.56499681  0.        ]
Total Time: 0:00:00.797592
Step max: 1530.0
Fail: False
Episode:  113
Reward:  -698.0
Mean Reward -1835.780701754386
Max reward so far:  -390.0
The starting state is : [-0.5253405  0.       ]
Step: 1000
Total Time: 0:00:01.084820
Step max: 1530.0
Fail: False
Episode:  114
Reward:  -1020.0
Mean Reward -1828.6869565217391
Max reward so far:  -390.0
The starting state is : [-0.47645211  0.        ]
Step: 1000
Total Time: 0:00:01.460186
Step max: 1530.0
Fail: True
Episode:  115
Reward:  -1338.0
Mean Reward -1824.4568965517242
Max reward so far:  -390.0
The starting state is : [-0.45838318  0.        ]
St

Step: 1000
Total Time: 0:00:01.346997
Step max: 1530.0
Fail: True
Episode:  140
Reward:  -1472.0
Mean Reward -1725.8794326241134
Max reward so far:  -390.0
The starting state is : [-0.45816373  0.        ]
Step: 1000
Total Time: 0:00:01.359199
Step max: 1530.0
Fail: True
Episode:  141
Reward:  -1292.0
Mean Reward -1722.8239436619717
Max reward so far:  -390.0
The starting state is : [-0.53885492  0.        ]
Step: 1000
Total Time: 0:00:01.254535
Step max: 1530.0
Fail: True
Episode:  142
Reward:  -1428.0
Mean Reward -1720.7622377622379
Max reward so far:  -390.0
The starting state is : [-0.56119708  0.        ]
Step: 1000
Total Time: 0:00:01.357782
Step max: 1530.0
Fail: True
Episode:  143
Reward:  -1338.0
Mean Reward -1718.1041666666667
Max reward so far:  -390.0
The starting state is : [-0.5610319  0.       ]
Step: 1000
Total Time: 0:00:01.345274
Step max: 1530.0
Fail: True
Episode:  144
Reward:  -1290.0
Mean Reward -1715.151724137931
Max reward so far:  -390.0
The starting state is :

The starting state is : [-0.50443495  0.        ]
Step: 1000
Total Time: 0:00:01.159350
Step max: 1530.0
Fail: False
Episode:  169
Reward:  -1169.0
Mean Reward -1608.0
Max reward so far:  -390.0
The starting state is : [-0.48952591  0.        ]
Total Time: 0:00:00.595108
Step max: 1530.0
Fail: False
Episode:  170
Reward:  -488.0
Mean Reward -1601.4502923976609
Max reward so far:  -390.0
The starting state is : [-0.53637612  0.        ]
Step: 1000
Total Time: 0:00:01.100511
Step max: 1530.0
Fail: False
Episode:  171
Reward:  -958.0
Mean Reward -1597.7093023255813
Max reward so far:  -390.0
The starting state is : [-0.41743071  0.        ]
Step: 1000
Total Time: 0:00:00.980730
Step max: 1530.0
Fail: False
Episode:  172
Reward:  -817.0
Mean Reward -1593.1965317919075
Max reward so far:  -390.0
The starting state is : [-0.59521807  0.        ]
Total Time: 0:00:00.732875
Step max: 1530.0
Fail: False
Episode:  173
Reward:  -725.0
Mean Reward -1588.2068965517242
Max reward so far:  -390.0
The

Total Time: 0:00:00.798879
Step max: 1530.0
Fail: False
Episode:  198
Reward:  -870.0
Mean Reward -1481.7788944723618
Max reward so far:  -390.0
The starting state is : [-0.43854186  0.        ]
Total Time: 0:00:00.746090
Step max: 1530.0
Fail: False
Episode:  199
Reward:  -632.0
Mean Reward -1477.53
Max reward so far:  -390.0
The starting state is : [-0.45739853  0.        ]
Total Time: 0:00:00.675663
Step max: 1530.0
Fail: False
Episode:  200
Reward:  -528.0
Mean Reward -1472.8059701492537
Max reward so far:  -390.0
Model saved
The starting state is : [-0.55013102  0.        ]
Total Time: 0:00:00.679203
Step max: 1530.0
Fail: False
Episode:  201
Reward:  -628.0
Mean Reward -1468.6237623762377
Max reward so far:  -390.0
The starting state is : [-0.49169881  0.        ]
Step: 1000
Total Time: 0:00:01.298883
Step max: 1530.0
Fail: True
Episode:  202
Reward:  -1380.0
Mean Reward -1468.1871921182267
Max reward so far:  -390.0
The starting state is : [-0.4141324  0.       ]
Total Time: 0:0

Step: 1000
Total Time: 0:00:01.118460
Step max: 1266.0
Fail: True
Episode:  227
Reward:  -948.0
Mean Reward -1387.5219298245613
Max reward so far:  -314.0
The starting state is : [-0.49022411  0.        ]
Step: 1000
Total Time: 0:00:00.950240
Step max: 1266.0
Fail: False
Episode:  228
Reward:  -878.0
Mean Reward -1385.296943231441
Max reward so far:  -314.0
The starting state is : [-0.48867124  0.        ]
Total Time: 0:00:00.692444
Step max: 1266.0
Fail: False
Episode:  229
Reward:  -653.0
Mean Reward -1382.1130434782608
Max reward so far:  -314.0
The starting state is : [-0.54502041  0.        ]
Total Time: 0:00:00.800890
Step max: 1266.0
Fail: False
Episode:  230
Reward:  -804.0
Mean Reward -1379.6103896103896
Max reward so far:  -314.0
The starting state is : [-0.42032991  0.        ]
Step: 1000
Total Time: 0:00:01.089246
Step max: 1266.0
Fail: False
Episode:  231
Reward:  -1078.0
Mean Reward -1378.3103448275863
Max reward so far:  -314.0
The starting state is : [-0.43280275  0.   

Total Time: 0:00:00.505271
Step max: 1098.0
Fail: False
Episode:  256
Reward:  -449.0
Mean Reward -1310.295719844358
Max reward so far:  -286.0
The starting state is : [-0.42714759  0.        ]
Total Time: 0:00:00.415900
Step max: 1098.0
Fail: False
Episode:  257
Reward:  -329.0
Mean Reward -1306.4922480620155
Max reward so far:  -286.0
The starting state is : [-0.51314677  0.        ]
Total Time: 0:00:00.730418
Step max: 1098.0
Fail: False
Episode:  258
Reward:  -599.0
Mean Reward -1303.7606177606178
Max reward so far:  -286.0
The starting state is : [-0.59609258  0.        ]
Total Time: 0:00:00.735704
Step max: 1098.0
Fail: False
Episode:  259
Reward:  -677.0
Mean Reward -1301.35
Max reward so far:  -286.0
The starting state is : [-0.4300062  0.       ]
Total Time: 0:00:00.451717
Step max: 1098.0
Fail: False
Episode:  260
Reward:  -430.0
Mean Reward -1298.0114942528735
Max reward so far:  -286.0
The starting state is : [-0.53376088  0.        ]
Total Time: 0:00:00.709947
Step max: 10

Total Time: 0:00:00.688857
Step max: 1083.0
Fail: False
Episode:  285
Reward:  -609.0
Mean Reward -1238.472027972028
Max reward so far:  -281.0
The starting state is : [-0.49538876  0.        ]
Step: 1000
Total Time: 0:00:00.924837
Step max: 1083.0
Fail: True
Episode:  286
Reward:  -941.0
Mean Reward -1237.4355400696863
Max reward so far:  -281.0
The starting state is : [-0.45636679  0.        ]
Total Time: 0:00:00.474783
Step max: 1083.0
Fail: False
Episode:  287
Reward:  -397.0
Mean Reward -1234.517361111111
Max reward so far:  -281.0
The starting state is : [-0.57108465  0.        ]
Total Time: 0:00:00.742993
Step max: 1083.0
Fail: False
Episode:  288
Reward:  -775.0
Mean Reward -1232.9273356401384
Max reward so far:  -281.0
The starting state is : [-0.40264934  0.        ]
Total Time: 0:00:00.396076
Step max: 1083.0
Fail: False
Episode:  289
Reward:  -383.0
Mean Reward -1229.996551724138
Max reward so far:  -281.0
The starting state is : [-0.45303761  0.        ]
Step: 1000
Total T

## Step 8: Evaluate our trained model

Load our model and see if it generalizes well by playing 10 random games and averaging the score.

In [28]:
with tf.Session() as sess:
    env.reset()
    rewards = []
    
    # Load the model
    saver.restore(sess, "./models/model.ckpt")

    for episode in range(10):
        state = env.reset()
        step = 0
        done = False
        total_rewards = 0
        print("****************************************************")
        print("EPISODE ", episode)

        while True:
            

            # Choose action a. The policy is stochastic: the network outputs action probabilities and we sample from them.
            action_probability_distribution = sess.run(action_distribution, feed_dict={input_: state.reshape([1, state_size])})
            #print(action_probability_distribution)
            action = np.random.choice(range(action_probability_distribution.shape[1]), p=action_probability_distribution.ravel())  # select action w.r.t the actions prob


            new_state, reward, done, info = env.step(action)

            total_rewards += reward

            if done:
                rewards.append(total_rewards)
                print ("Score", total_rewards)
                break
            state = new_state
    env.close()
    print ("Score over time: " +  str(sum(rewards)/10))

INFO:tensorflow:Restoring parameters from ./models/model.ckpt
****************************************************
EPISODE  0
Score -698.0
****************************************************
EPISODE  1
Score -911.0
****************************************************
EPISODE  2
Score -661.0
****************************************************
EPISODE  3
Score -728.0
****************************************************
EPISODE  4
Score -929.0
****************************************************
EPISODE  5
Score -743.0
****************************************************
EPISODE  6
Score -970.0
****************************************************
EPISODE  7
Score -986.0
****************************************************
EPISODE  8
Score -614.0
****************************************************
EPISODE  9
Score -420.0
Score over time: -766.0
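
Both the training and evaluation loops pick actions by sampling from the network's output distribution with `np.random.choice`. The sketch below shows what that sampling does, reusing the probabilities from the softmax example in Step 5. The fixed seed and the `argmax` line are included only for comparison; the notebook itself always samples.

```python
import numpy as np

rng = np.random.default_rng(0)           # seeded for a repeatable demo
probs = np.array([0.117, 0.016, 0.867])  # example action probabilities

# Draw many actions and compare empirical frequencies with probs.
samples = rng.choice(len(probs), size=10_000, p=probs)
freqs = np.bincount(samples, minlength=len(probs)) / samples.size
print(freqs)

# A greedy policy would always pick the most likely action instead.
greedy = int(np.argmax(probs))
print(greedy)  # 2
```

Sampling means that even a trained policy occasionally takes low-probability actions, which is one source of the spread between the ten evaluation scores above.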


## Report

1.  Base run with the CartPole environment.
2.  Changed the environment to MountainCar.
3.  Changed the neural network input to match the new environment's state-space dimensions.
4.  Fitness function experiments:
    1. The score initially improved but then stayed stuck for the rest of training. Not very promising.
    2. Designed a new metric, **potential energy (PE)**:
        1.  Improving PE at a direction change (reaching a new maximum or minimum position for the episode) earns a bonus of +10 reward.
        2.  Failing to improve PE at a direction change incurs a penalty of -2 reward.
    3. Added a **step limit multiplier** hyperparameter (`STEP_MULTIPLE`) that caps training episode duration at a multiple of the fastest training episode so far. The initial multiple was 1.5.
        1.  Combined with **experiment 2**, this improved the score further during training.
        2.  Post-training evaluation results were not great. Rewards were consistently in the negative thousands (~ -5000).
    4. The last set of training episodes was cut very short by the low step limit multiplier (1.5). To loosen this constraint, we increased the multiplier from 1.5 to 3.0.
        1.  Rewards kept improving during training, as before.
        2.  Post-training evaluation results were much better. Rewards averaged ~ -500, roughly an order of magnitude better than before; the evaluation run shown in Step 8 averaged -766.
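
The potential energy bonus from experiment 2 can be isolated as a small pure function. This sketch follows the logic of the training loop in Step 7 (a direction change is detected when the product of consecutive velocities is negative). The function name `shaping_bonus` and the sample values are hypothetical.

```python
def shaping_bonus(old_velocity, velocity, position, max_pos, min_pos):
    """Return (bonus, max_pos, min_pos) for one environment step.

    A sign flip in velocity marks a direction change. Reaching a new
    episode extreme at that moment earns +10; otherwise the penalty is -2.
    """
    if velocity * old_velocity >= 0.0:
        return 0.0, max_pos, min_pos  # no direction change
    if position > max_pos:
        return 10.0, position, min_pos  # new rightmost turning point
    if position < min_pos:
        return 10.0, max_pos, position  # new leftmost turning point
    return -2.0, max_pos, min_pos  # turned around without progress


# Episode extremes start outside the track, as in the training cell.
max_pos, min_pos = -2.0, 2.0
steps = [(0.010, -0.005, -0.30), (-0.010, 0.004, -0.70), (0.008, -0.002, -0.35)]
for old_v, v, pos in steps:
    bonus, max_pos, min_pos = shaping_bonus(old_v, v, pos, max_pos, min_pos)
    print(pos, bonus)
```

In this toy sequence the first two turning points set new extremes and earn +10 each, while the third turns around at -0.35, inside the recorded range, and is penalized with -2.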
