## Assignment 2: Lunar Lander - Task 2

Student Name- Vidushi Jain <br/>
Student Number- 18200009 

## Part 2. Use the Deep Q-Learning reinforcement learning algorithm to train an agent to play Lunar Lander <br/>
• Select sensible hyper-parameters<br/>
• Perform a suitable evaluation experiment to determine how effective the trained model is <br/>

In [2]:
rl_model_reward_comparisons = dict()
rl_model_time_comparisons = dict()

In [3]:
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import BoltzmannQPolicy, EpsGreedyQPolicy, LinearAnnealedPolicy
from rl.memory import SequentialMemory
import time



# Path environment changed to make things work properly
# export DYLD_FALLBACK_LIBRARY_PATH=$DYLD_FALLBACK_LIBRARY_PATH:/usr/lib


ENV_NAME = 'LunarLander-v2'


# Get the environment and extract the number of actions.
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

Using TensorFlow backend.


## Model 1
<a id="task2_model1"></a>
<b>Model Architecture</b>

1) We create a Sequential model with three hidden layers of 16 neurons each. The input is a 1 x state-space vector, and the output layer has one neuron per possible action, each predicting the Q value of that action at the current step. Taking the argmax of the outputs gives the action with the highest Q value.<br/>
2) We create a SequentialMemory object (class rl.memory.SequentialMemory), which provides a fast and efficient data structure for storing the agent's experience. Its maximum size is set to 50000; once the memory is full, the oldest experiences are forgotten as new ones are added.<br/>
3) We use EpsGreedyQPolicy to balance exploration and exploitation. The policy takes a parameter ϵ: at each step a random number is drawn, and if it is less than ϵ an action is chosen completely at random; otherwise the best action is chosen. This allows some random exploration of the value of various actions in various states, and ϵ can be scaled back over time so that the algorithm concentrates more on exploiting the best strategies it has found. In this model we use the default value of <b>eps</b>, which is <b>0.1</b>.<br/>
4) After the model, memory and policy are defined, we create a deep Q-network agent (DQNAgent) and pass these objects to it.<br/>
5) Finally, the agent is compiled using a mean-squared error loss function with the Adam optimizer. <b>The learning rate of the Adam optimizer is set to 0.0001.</b>
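
To make points 2 and 3 concrete, here is a minimal standard-library sketch of a bounded experience memory and ϵ-greedy action selection. It is only an illustration of the idea, not the keras-rl implementation: the function name `epsilon_greedy`, the tiny memory size and the Q values are all made up for the example.

```python
import random
from collections import deque


def epsilon_greedy(q_values, eps, rng=random):
    """Return a random action index with probability eps, else the argmax."""
    if rng.random() < eps:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest predicted Q value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Bounded memory: once full, appending a new item drops the oldest one.
# (Hypothetical size of 3 for readability; the notebook uses 50000.)
memory = deque(maxlen=3)
for step in range(5):
    memory.append(("state", step))

q = [0.1, 0.7, -0.2, 0.3]  # made-up Q values for four actions
greedy_action = epsilon_greedy(q, eps=0.0)  # eps=0 never explores
oldest_step = memory[0][1]  # steps 0 and 1 have been forgotten
```

With `eps=0.0` the function always returns the argmax, and with `eps=1.0` it always returns a uniformly random action; values in between trade off the two behaviours.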



In [4]:
# Next, we build a very simple model.
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())

# Finally, we configure and compile our agent. You can use every built-in Keras optimizer and
# even the metrics!
memory = SequentialMemory(limit=50000, window_length=1)
policy = EpsGreedyQPolicy()
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=15,
               target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=0.0001), metrics=['mae'])

# Okay, now it's time to learn something! Visualization is disabled (visualize=False) because
# rendering slows down training quite a lot. You can always safely abort the training prematurely
# using Ctrl + C.

start=time.time()

dqn.fit(env, nb_steps=100000, visualize=False, verbose=2)

end = time.time()
timetaken=end - start
rl_model_time_comparisons['Model 1'] = timetaken

# After training is done, we save the final weights.
dqn.save_weights('dqn_{}_weights_model1.h5f'.format(ENV_NAME), overwrite=True)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 8)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                144       
_________________________________________________________________
activation_1 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                272       
_________________________________________________________________
activation_2 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 16)                272       
_________________________________________________________________
activation_3 (Activation)    (None, 16)                0         
__________



    83/100000: episode: 1, duration: 1.594s, episode steps: 83, steps per second: 52, episode reward: -555.523, mean reward: -6.693 [-100.000, 2.843], mean action: 1.458 [1.000, 3.000], mean observation: 0.141 [-3.516, 3.300], loss: 10.181509, mean_absolute_error: 0.996055, mean_q: 0.192844
   157/100000: episode: 2, duration: 0.385s, episode steps: 74, steps per second: 192, episode reward: -425.596, mean reward: -5.751 [-100.000, 1.874], mean action: 1.297 [0.000, 3.000], mean observation: 0.161 [-1.476, 2.642], loss: 59.335529, mean_absolute_error: 1.477775, mean_q: 0.153759
   214/100000: episode: 3, duration: 0.296s, episode steps: 57, steps per second: 193, episode reward: -328.833, mean reward: -5.769 [-100.000, -1.098], mean action: 0.807 [0.000, 3.000], mean observation: 0.065 [-1.662, 1.988], loss: 73.101723, mean_absolute_error: 1.507687, mean_q: 0.088515
   271/100000: episode: 4, duration: 0.284s, episode steps: 57, steps per second: 201, episode reward: -262.763, mean rew

  2076/100000: episode: 29, duration: 0.275s, episode steps: 53, steps per second: 193, episode reward: -206.153, mean reward: -3.890 [-100.000, 59.571], mean action: 1.943 [0.000, 3.000], mean observation: -0.234 [-4.946, 1.386], loss: 64.485588, mean_absolute_error: 1.655640, mean_q: -0.644422
  2150/100000: episode: 30, duration: 0.374s, episode steps: 74, steps per second: 198, episode reward: -227.960, mean reward: -3.081 [-100.000, 70.240], mean action: 1.392 [0.000, 3.000], mean observation: -0.240 [-4.327, 1.413], loss: 57.721172, mean_absolute_error: 1.681816, mean_q: -0.690200
  2215/100000: episode: 31, duration: 0.352s, episode steps: 65, steps per second: 185, episode reward: -562.688, mean reward: -8.657 [-100.000, -2.002], mean action: 2.769 [1.000, 3.000], mean observation: -0.183 [-3.998, 1.405], loss: 82.631088, mean_absolute_error: 1.880725, mean_q: -0.753196
  2275/100000: episode: 32, duration: 0.315s, episode steps: 60, steps per second: 191, episode reward: -261.

  4084/100000: episode: 57, duration: 0.278s, episode steps: 55, steps per second: 198, episode reward: -132.768, mean reward: -2.414 [-100.000, 10.549], mean action: 0.582 [0.000, 3.000], mean observation: -0.086 [-1.779, 4.927], loss: 87.052834, mean_absolute_error: 8.130480, mean_q: -9.051161
  4159/100000: episode: 58, duration: 0.365s, episode steps: 75, steps per second: 205, episode reward: -156.740, mean reward: -2.090 [-100.000, 11.148], mean action: 0.200 [0.000, 3.000], mean observation: 0.113 [-1.705, 5.446], loss: 59.365658, mean_absolute_error: 8.347437, mean_q: -9.608255
  4227/100000: episode: 59, duration: 0.340s, episode steps: 68, steps per second: 200, episode reward: -132.905, mean reward: -1.954 [-100.000, 6.299], mean action: 0.706 [0.000, 3.000], mean observation: -0.105 [-1.720, 6.020], loss: 70.062195, mean_absolute_error: 8.768688, mean_q: -10.132042
  4295/100000: episode: 60, duration: 0.348s, episode steps: 68, steps per second: 196, episode reward: -131.1

  6279/100000: episode: 85, duration: 0.428s, episode steps: 88, steps per second: 206, episode reward: -115.463, mean reward: -1.312 [-100.000, 7.933], mean action: 0.170 [0.000, 3.000], mean observation: -0.005 [-6.500, 1.481], loss: 37.285862, mean_absolute_error: 23.635632, mean_q: -29.551115
  6349/100000: episode: 86, duration: 0.356s, episode steps: 70, steps per second: 196, episode reward: -132.065, mean reward: -1.887 [-100.000, 23.539], mean action: 0.157 [0.000, 3.000], mean observation: -0.024 [-2.450, 1.408], loss: 30.179213, mean_absolute_error: 24.101496, mean_q: -30.216148
  6428/100000: episode: 87, duration: 0.389s, episode steps: 79, steps per second: 203, episode reward: -156.129, mean reward: -1.976 [-100.000, 6.772], mean action: 0.658 [0.000, 3.000], mean observation: -0.108 [-1.853, 5.571], loss: 27.188635, mean_absolute_error: 24.437899, mean_q: -30.755270
  6495/100000: episode: 88, duration: 0.329s, episode steps: 67, steps per second: 204, episode reward: -

  8317/100000: episode: 113, duration: 0.406s, episode steps: 82, steps per second: 202, episode reward: -108.981, mean reward: -1.329 [-100.000, 10.081], mean action: 0.549 [0.000, 3.000], mean observation: 0.010 [-1.786, 1.442], loss: 24.796783, mean_absolute_error: 36.876442, mean_q: -47.626953
  8379/100000: episode: 114, duration: 0.304s, episode steps: 62, steps per second: 204, episode reward: -118.533, mean reward: -1.912 [-100.000, 15.753], mean action: 0.371 [0.000, 3.000], mean observation: -0.022 [-1.844, 1.401], loss: 23.468367, mean_absolute_error: 37.174957, mean_q: -48.049690
  8433/100000: episode: 115, duration: 0.271s, episode steps: 54, steps per second: 199, episode reward: -129.202, mean reward: -2.393 [-100.000, 5.729], mean action: 1.074 [0.000, 3.000], mean observation: -0.155 [-1.748, 5.810], loss: 12.909849, mean_absolute_error: 37.615513, mean_q: -48.696434
  8528/100000: episode: 116, duration: 0.471s, episode steps: 95, steps per second: 202, episode rewar

 10256/100000: episode: 141, duration: 0.371s, episode steps: 73, steps per second: 197, episode reward: -149.652, mean reward: -2.050 [-100.000, 10.174], mean action: 1.370 [0.000, 3.000], mean observation: 0.083 [-6.060, 1.412], loss: 21.176886, mean_absolute_error: 43.995480, mean_q: -57.050247
 10322/100000: episode: 142, duration: 0.339s, episode steps: 66, steps per second: 195, episode reward: -162.953, mean reward: -2.469 [-100.000, 8.682], mean action: 1.667 [0.000, 3.000], mean observation: 0.120 [-5.862, 1.407], loss: 10.454019, mean_absolute_error: 43.931377, mean_q: -56.983013
 10401/100000: episode: 143, duration: 0.398s, episode steps: 79, steps per second: 198, episode reward: -134.475, mean reward: -1.702 [-100.000, 5.797], mean action: 1.418 [0.000, 3.000], mean observation: -0.011 [-1.786, 6.286], loss: 11.614863, mean_absolute_error: 44.189232, mean_q: -57.383869
 10462/100000: episode: 144, duration: 0.319s, episode steps: 61, steps per second: 191, episode reward:

 15333/100000: episode: 169, duration: 0.607s, episode steps: 116, steps per second: 191, episode reward: -120.511, mean reward: -1.039 [-100.000, 9.695], mean action: 1.836 [0.000, 3.000], mean observation: -0.032 [-1.382, 1.402], loss: 13.025053, mean_absolute_error: 42.841099, mean_q: -55.676384
 16201/100000: episode: 170, duration: 6.006s, episode steps: 868, steps per second: 145, episode reward: -239.040, mean reward: -0.275 [-100.000, 15.342], mean action: 1.566 [0.000, 3.000], mean observation: -0.062 [-0.949, 1.406], loss: 12.930632, mean_absolute_error: 41.796783, mean_q: -54.310062
 16265/100000: episode: 171, duration: 0.338s, episode steps: 64, steps per second: 189, episode reward: -47.252, mean reward: -0.738 [-100.000, 19.647], mean action: 1.766 [0.000, 3.000], mean observation: 0.099 [-3.263, 1.385], loss: 9.064602, mean_absolute_error: 41.380905, mean_q: -53.829929
 16384/100000: episode: 172, duration: 0.626s, episode steps: 119, steps per second: 190, episode rewa

 25280/100000: episode: 197, duration: 1.621s, episode steps: 287, steps per second: 177, episode reward: -205.655, mean reward: -0.717 [-100.000, 4.370], mean action: 1.498 [0.000, 3.000], mean observation: 0.048 [-1.002, 1.411], loss: 9.367358, mean_absolute_error: 35.914593, mean_q: -46.629635
 25453/100000: episode: 198, duration: 0.930s, episode steps: 173, steps per second: 186, episode reward: -183.037, mean reward: -1.058 [-100.000, 5.538], mean action: 1.780 [0.000, 3.000], mean observation: 0.006 [-1.238, 1.409], loss: 10.585162, mean_absolute_error: 35.633823, mean_q: -46.197010
 25546/100000: episode: 199, duration: 0.496s, episode steps: 93, steps per second: 188, episode reward: -9.573, mean reward: -0.103 [-100.000, 24.997], mean action: 1.849 [0.000, 3.000], mean observation: 0.058 [-1.595, 1.392], loss: 17.647081, mean_absolute_error: 35.988674, mean_q: -46.648537
 25630/100000: episode: 200, duration: 0.445s, episode steps: 84, steps per second: 189, episode reward: -

 45456/100000: episode: 225, duration: 6.188s, episode steps: 1000, steps per second: 162, episode reward: -141.280, mean reward: -0.141 [-13.727, 16.957], mean action: 1.455 [0.000, 3.000], mean observation: 0.044 [-0.926, 1.654], loss: 8.709590, mean_absolute_error: 19.462120, mean_q: -24.592793
 46456/100000: episode: 226, duration: 6.550s, episode steps: 1000, steps per second: 153, episode reward: -130.848, mean reward: -0.131 [-6.457, 4.875], mean action: 1.445 [0.000, 3.000], mean observation: 0.049 [-0.810, 1.505], loss: 8.814354, mean_absolute_error: 18.642706, mean_q: -23.478601
 47076/100000: episode: 227, duration: 3.714s, episode steps: 620, steps per second: 167, episode reward: -46.196, mean reward: -0.075 [-100.000, 17.633], mean action: 1.555 [0.000, 3.000], mean observation: 0.042 [-1.174, 1.549], loss: 7.738500, mean_absolute_error: 17.990036, mean_q: -22.579498
 48052/100000: episode: 228, duration: 6.397s, episode steps: 976, steps per second: 153, episode reward: 

 59234/100000: episode: 253, duration: 2.649s, episode steps: 453, steps per second: 171, episode reward: -221.397, mean reward: -0.489 [-100.000, 14.416], mean action: 1.541 [0.000, 3.000], mean observation: 0.063 [-2.093, 1.544], loss: 7.602507, mean_absolute_error: 5.688910, mean_q: -6.202501
 59694/100000: episode: 254, duration: 2.805s, episode steps: 460, steps per second: 164, episode reward: -68.060, mean reward: -0.148 [-100.000, 20.896], mean action: 1.750 [0.000, 3.000], mean observation: 0.079 [-2.138, 1.389], loss: 6.933789, mean_absolute_error: 5.431507, mean_q: -5.840031
 60112/100000: episode: 255, duration: 2.514s, episode steps: 418, steps per second: 166, episode reward: -197.660, mean reward: -0.473 [-100.000, 24.124], mean action: 1.773 [0.000, 3.000], mean observation: 0.046 [-1.879, 1.517], loss: 7.085709, mean_absolute_error: 5.081927, mean_q: -5.389617
 60776/100000: episode: 256, duration: 4.082s, episode steps: 664, steps per second: 163, episode reward: -120

 82076/100000: episode: 281, duration: 7.176s, episode steps: 1000, steps per second: 139, episode reward: -83.434, mean reward: -0.083 [-3.132, 4.667], mean action: 1.525 [0.000, 3.000], mean observation: 0.030 [-0.685, 1.448], loss: 3.470903, mean_absolute_error: 8.055977, mean_q: 5.512197
 82971/100000: episode: 282, duration: 5.898s, episode steps: 895, steps per second: 152, episode reward: -157.168, mean reward: -0.176 [-100.000, 12.223], mean action: 1.506 [0.000, 3.000], mean observation: 0.042 [-0.668, 1.442], loss: 3.154774, mean_absolute_error: 8.435651, mean_q: 6.383894
 83971/100000: episode: 283, duration: 6.290s, episode steps: 1000, steps per second: 159, episode reward: -64.783, mean reward: -0.065 [-3.424, 4.202], mean action: 1.414 [0.000, 3.000], mean observation: 0.051 [-0.730, 1.456], loss: 3.583617, mean_absolute_error: 9.000367, mean_q: 6.921027
 84971/100000: episode: 284, duration: 7.028s, episode steps: 1000, steps per second: 142, episode reward: -80.813, me

## Evaluation Result

We test the trained model for 50 episodes and then look at the mean reward.
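
As a complement to the mean, a small helper can also report the spread and how often an episode clears a target score. The sketch below is an assumption-laden illustration: `summarize_rewards` is a hypothetical helper, the reward list is made up, and the default threshold of 200 is the score commonly cited as "solving" Lunar Lander rather than something measured in this notebook.

```python
import math
import statistics


def summarize_rewards(rewards, solved_threshold=200.0):
    """Summarize per-episode test rewards (hypothetical helper)."""
    n = len(rewards)
    mean = statistics.fmean(rewards)
    # Population standard deviation, which matches NumPy's default std().
    std = statistics.pstdev(rewards)
    return {
        "episodes": n,
        "mean": mean,
        "std": std,
        # Standard error of the mean: how uncertain the mean estimate is.
        "stderr": std / math.sqrt(n),
        # Fraction of episodes at or above the assumed threshold.
        "solved_fraction": sum(r >= solved_threshold for r in rewards) / n,
    }


# Made-up rewards, not results from this notebook.
example = summarize_rewards([-50.0, -100.0, 210.0, 220.0])
```

Reporting the standard error alongside the mean makes it easier to judge whether a difference between two models over 50 episodes is larger than the noise.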

In [5]:
history = dqn.test(env, nb_episodes=50, visualize=False)
rewards = np.array(history.history['episode_reward'])
print(("Test rewards (#episodes={}): mean={:>5.2f}, std={:>5.2f}, "
           "min={:>5.2f}, max={:>5.2f}")
                  .format(len(rewards),
                  rewards.mean(),
                  rewards.std(),
                  rewards.min(),
                  rewards.max()))

rl_model_reward_comparisons["Model 1"] = rewards.mean()

Testing for 50 episodes ...
Episode 1: reward: -56.594, steps: 1000
Episode 2: reward: -78.799, steps: 1000
Episode 3: reward: -74.998, steps: 1000
Episode 4: reward: -54.537, steps: 1000
Episode 5: reward: -45.532, steps: 1000
Episode 6: reward: -87.722, steps: 1000
Episode 7: reward: -90.790, steps: 1000
Episode 8: reward: -88.762, steps: 1000
Episode 9: reward: -99.000, steps: 1000
Episode 10: reward: -73.811, steps: 1000
Episode 11: reward: -83.323, steps: 1000
Episode 12: reward: -48.407, steps: 1000
Episode 13: reward: -120.300, steps: 1000
Episode 14: reward: -95.077, steps: 1000
Episode 15: reward: -100.937, steps: 1000
Episode 16: reward: -81.949, steps: 1000
Episode 17: reward: -48.239, steps: 1000
Episode 18: reward: -68.205, steps: 1000
Episode 19: reward: -113.001, steps: 1000
Episode 20: reward: -117.280, steps: 1000
Episode 21: reward: -83.048, steps: 1000
Episode 22: reward: -73.715, steps: 1000
Episode 23: reward: -102.671, steps: 1000
Episode 24: reward: -49.877, step

## Model 2

<b> Model Architecture</b>

We use the same architecture and process as [Model 1](#task2_model1), but with different hyper-parameters. <br/>
<b>In this model, the value of epsilon is changed to 0.2; the learning rate of the Adam optimizer stays at 0.0001.</b>
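
To get a feel for what doubling epsilon means, the sketch below computes how often an ϵ-greedy step ends up taking the greedy action. It assumes the random branch samples uniformly over all four Lunar Lander actions, so it can also land on the greedy action by chance; `greedy_probability` is a hypothetical helper, not part of keras-rl.

```python
def greedy_probability(eps, nb_actions):
    """Chance that one epsilon-greedy step takes the greedy action.

    Assumes the exploratory branch picks uniformly among all actions,
    including the greedy one.
    """
    return (1.0 - eps) + eps / nb_actions


p_eps_01 = greedy_probability(0.1, 4)  # epsilon used by Model 1
p_eps_02 = greedy_probability(0.2, 4)  # epsilon used by Model 2
```

Under that assumption the greedy action is taken about 92.5% of the time with epsilon 0.1 and about 85% of the time with epsilon 0.2, so the second setting spends roughly twice as many training steps on non-greedy actions.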

In [6]:
# Get the environment and extract the number of actions.
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

In [7]:
# Next, we build a very simple model.
model2 = Sequential()
model2.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model2.add(Dense(16))
model2.add(Activation('relu'))
model2.add(Dense(16))
model2.add(Activation('relu'))
model2.add(Dense(16))
model2.add(Activation('relu'))
model2.add(Dense(nb_actions))
model2.add(Activation('linear'))
print(model2.summary())

# Finally, we configure and compile our agent. You can use every built-in Keras optimizer and
# even the metrics!
memory = SequentialMemory(limit=50000, window_length=1)
policy = EpsGreedyQPolicy(eps=.2)
dqn2 = DQNAgent(model=model2, nb_actions=nb_actions, memory=memory, nb_steps_warmup=15,
               target_model_update=1e-2, policy=policy)
dqn2.compile(Adam(lr=0.0001), metrics=['mae'])

# Okay, now it's time to learn something! Visualization is disabled (visualize=False) because
# rendering slows down training quite a lot. You can always safely abort the training prematurely
# using Ctrl + C.

start = time.time()
dqn2.fit(env, nb_steps=100000, visualize=False, verbose=2)

end = time.time()
timetaken=end - start
rl_model_time_comparisons['Model 2'] = timetaken

# After training is done, we save the final weights.
dqn2.save_weights('dqn_{}_weights_model2.h5f'.format(ENV_NAME), overwrite=True)


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_2 (Flatten)          (None, 8)                 0         
_________________________________________________________________
dense_5 (Dense)              (None, 16)                144       
_________________________________________________________________
activation_5 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 16)                272       
_________________________________________________________________
activation_6 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 16)                272       
_________________________________________________________________
activation_7 (Activation)    (None, 16)                0         
__________



    85/100000: episode: 1, duration: 1.970s, episode steps: 85, steps per second: 43, episode reward: -496.174, mean reward: -5.837 [-100.000, 2.760], mean action: 1.506 [0.000, 3.000], mean observation: 0.135 [-1.644, 5.922], loss: 7.509359, mean_absolute_error: 0.862144, mean_q: 0.176180
   152/100000: episode: 2, duration: 0.401s, episode steps: 67, steps per second: 167, episode reward: -514.546, mean reward: -7.680 [-100.000, -0.298], mean action: 1.209 [0.000, 3.000], mean observation: 0.101 [-1.850, 6.149], loss: 45.625061, mean_absolute_error: 1.391521, mean_q: 0.161303
   210/100000: episode: 3, duration: 0.316s, episode steps: 58, steps per second: 183, episode reward: -95.427, mean reward: -1.645 [-100.000, 26.276], mean action: 1.431 [0.000, 3.000], mean observation: 0.036 [-1.762, 1.393], loss: 72.702736, mean_absolute_error: 1.508517, mean_q: 0.076881
   271/100000: episode: 4, duration: 0.356s, episode steps: 61, steps per second: 171, episode reward: -264.963, mean rewa

  2135/100000: episode: 29, duration: 0.407s, episode steps: 69, steps per second: 170, episode reward: -99.508, mean reward: -1.442 [-100.000, 10.773], mean action: 0.986 [0.000, 3.000], mean observation: -0.078 [-1.606, 1.404], loss: 98.012276, mean_absolute_error: 1.941849, mean_q: -0.798363
  2222/100000: episode: 30, duration: 0.534s, episode steps: 87, steps per second: 163, episode reward: -377.733, mean reward: -4.342 [-100.000, 1.458], mean action: 0.874 [0.000, 3.000], mean observation: -0.049 [-2.216, 1.453], loss: 74.833084, mean_absolute_error: 1.905586, mean_q: -0.885092
  2296/100000: episode: 31, duration: 0.442s, episode steps: 74, steps per second: 167, episode reward: -446.168, mean reward: -6.029 [-100.000, -1.463], mean action: 0.946 [0.000, 3.000], mean observation: -0.125 [-3.406, 1.458], loss: 73.009392, mean_absolute_error: 1.998756, mean_q: -0.988584
  2350/100000: episode: 32, duration: 0.327s, episode steps: 54, steps per second: 165, episode reward: -328.99

  4418/100000: episode: 57, duration: 0.603s, episode steps: 106, steps per second: 176, episode reward: -112.964, mean reward: -1.066 [-100.000, 10.512], mean action: 1.840 [0.000, 3.000], mean observation: -0.110 [-0.988, 1.856], loss: 40.884914, mean_absolute_error: 12.997679, mean_q: -16.047676
  4510/100000: episode: 58, duration: 0.499s, episode steps: 92, steps per second: 184, episode reward: -67.199, mean reward: -0.730 [-100.000, 22.486], mean action: 1.674 [0.000, 3.000], mean observation: 0.191 [-1.472, 1.518], loss: 52.290798, mean_absolute_error: 14.032763, mean_q: -17.275284
  4568/100000: episode: 59, duration: 0.313s, episode steps: 58, steps per second: 185, episode reward: -128.928, mean reward: -2.223 [-100.000, 5.244], mean action: 1.931 [0.000, 3.000], mean observation: -0.053 [-1.712, 5.347], loss: 39.986607, mean_absolute_error: 14.633445, mean_q: -18.183823
  4648/100000: episode: 60, duration: 0.432s, episode steps: 80, steps per second: 185, episode reward: -

  9066/100000: episode: 85, duration: 0.724s, episode steps: 133, steps per second: 184, episode reward: -83.944, mean reward: -0.631 [-100.000, 9.403], mean action: 1.466 [0.000, 3.000], mean observation: 0.073 [-1.144, 1.533], loss: 9.914922, mean_absolute_error: 39.745251, mean_q: -51.576157
  9382/100000: episode: 86, duration: 1.881s, episode steps: 316, steps per second: 168, episode reward: -119.606, mean reward: -0.379 [-100.000, 12.088], mean action: 1.541 [0.000, 3.000], mean observation: 0.049 [-0.910, 1.431], loss: 13.881024, mean_absolute_error: 40.305462, mean_q: -52.244370
  9816/100000: episode: 87, duration: 2.659s, episode steps: 434, steps per second: 163, episode reward: -84.966, mean reward: -0.196 [-100.000, 9.674], mean action: 1.525 [0.000, 3.000], mean observation: 0.117 [-1.699, 2.040], loss: 10.655510, mean_absolute_error: 40.643757, mean_q: -52.662010
 10027/100000: episode: 88, duration: 1.186s, episode steps: 211, steps per second: 178, episode reward: -13

 26623/100000: episode: 113, duration: 7.764s, episode steps: 1000, steps per second: 129, episode reward: -37.292, mean reward: -0.037 [-4.928, 5.097], mean action: 1.626 [0.000, 3.000], mean observation: 0.106 [-0.686, 1.450], loss: 6.857171, mean_absolute_error: 19.770977, mean_q: -24.966337
 27623/100000: episode: 114, duration: 7.374s, episode steps: 1000, steps per second: 136, episode reward: -86.517, mean reward: -0.087 [-4.823, 5.022], mean action: 1.622 [0.000, 3.000], mean observation: 0.093 [-0.381, 1.412], loss: 6.031517, mean_absolute_error: 18.495363, mean_q: -23.264389
 27982/100000: episode: 115, duration: 2.127s, episode steps: 359, steps per second: 169, episode reward: -28.499, mean reward: -0.079 [-100.000, 13.435], mean action: 1.571 [0.000, 3.000], mean observation: 0.059 [-1.051, 1.658], loss: 8.058652, mean_absolute_error: 17.620312, mean_q: -22.061445
 28982/100000: episode: 116, duration: 8.551s, episode steps: 1000, steps per second: 117, episode reward: -80

 43133/100000: episode: 141, duration: 6.844s, episode steps: 1000, steps per second: 146, episode reward: -36.799, mean reward: -0.037 [-4.673, 5.719], mean action: 1.509 [0.000, 3.000], mean observation: 0.083 [-0.991, 1.520], loss: 5.838775, mean_absolute_error: 7.635527, mean_q: -8.629654
 43497/100000: episode: 142, duration: 2.261s, episode steps: 364, steps per second: 161, episode reward: -3.802, mean reward: -0.010 [-100.000, 15.114], mean action: 1.736 [0.000, 3.000], mean observation: 0.101 [-0.981, 1.422], loss: 5.559634, mean_absolute_error: 7.207600, mean_q: -8.019509
 43854/100000: episode: 143, duration: 2.121s, episode steps: 357, steps per second: 168, episode reward: 20.042, mean reward: 0.056 [-100.000, 12.346], mean action: 1.697 [0.000, 3.000], mean observation: 0.108 [-1.010, 1.386], loss: 4.241858, mean_absolute_error: 6.817430, mean_q: -7.501275
 44138/100000: episode: 144, duration: 1.655s, episode steps: 284, steps per second: 172, episode reward: -98.783, me

 56842/100000: episode: 169, duration: 0.954s, episode steps: 170, steps per second: 178, episode reward: -180.485, mean reward: -1.062 [-100.000, 7.516], mean action: 1.865 [0.000, 3.000], mean observation: 0.036 [-2.159, 1.421], loss: 4.626225, mean_absolute_error: 4.561907, mean_q: -0.181407
 57098/100000: episode: 170, duration: 1.487s, episode steps: 256, steps per second: 172, episode reward: -105.982, mean reward: -0.414 [-100.000, 16.810], mean action: 1.887 [0.000, 3.000], mean observation: 0.063 [-1.125, 1.837], loss: 4.253767, mean_absolute_error: 4.480081, mean_q: 0.165542
 57294/100000: episode: 171, duration: 1.109s, episode steps: 196, steps per second: 177, episode reward: -237.895, mean reward: -1.214 [-100.000, 23.450], mean action: 1.918 [0.000, 3.000], mean observation: -0.015 [-2.171, 1.420], loss: 3.287149, mean_absolute_error: 4.273017, mean_q: 0.536910
 58294/100000: episode: 172, duration: 7.001s, episode steps: 1000, steps per second: 143, episode reward: -12.

 66213/100000: episode: 197, duration: 1.456s, episode steps: 257, steps per second: 177, episode reward: -184.387, mean reward: -0.717 [-100.000, 11.591], mean action: 1.837 [0.000, 3.000], mean observation: 0.081 [-1.030, 4.015], loss: 4.366523, mean_absolute_error: 6.182321, mean_q: 3.672399
 66328/100000: episode: 198, duration: 0.630s, episode steps: 115, steps per second: 183, episode reward: -194.163, mean reward: -1.688 [-100.000, 4.857], mean action: 1.774 [0.000, 3.000], mean observation: 0.309 [-0.729, 1.388], loss: 3.543612, mean_absolute_error: 6.222280, mean_q: 3.809273
 66683/100000: episode: 199, duration: 2.184s, episode steps: 355, steps per second: 163, episode reward: -108.731, mean reward: -0.306 [-100.000, 17.700], mean action: 1.808 [0.000, 3.000], mean observation: 0.041 [-1.158, 1.607], loss: 3.715802, mean_absolute_error: 6.282385, mean_q: 3.781513
 67084/100000: episode: 200, duration: 2.518s, episode steps: 401, steps per second: 159, episode reward: -85.237

 80782/100000: episode: 225, duration: 2.995s, episode steps: 460, steps per second: 154, episode reward: -98.276, mean reward: -0.214 [-100.000, 14.904], mean action: 1.793 [0.000, 3.000], mean observation: 0.130 [-1.126, 1.400], loss: 6.231629, mean_absolute_error: 10.469499, mean_q: 9.963514
 81782/100000: episode: 226, duration: 7.196s, episode steps: 1000, steps per second: 139, episode reward: 52.606, mean reward: 0.053 [-23.603, 26.224], mean action: 1.332 [0.000, 3.000], mean observation: 0.150 [-0.588, 1.408], loss: 5.428951, mean_absolute_error: 10.683931, mean_q: 10.164370
 82782/100000: episode: 227, duration: 7.438s, episode steps: 1000, steps per second: 134, episode reward: -59.505, mean reward: -0.060 [-3.881, 5.796], mean action: 1.534 [0.000, 3.000], mean observation: 0.143 [-0.457, 1.477], loss: 6.648597, mean_absolute_error: 10.934643, mean_q: 10.592723
 83782/100000: episode: 228, duration: 6.941s, episode steps: 1000, steps per second: 144, episode reward: 74.313,

## Evaluation Result

We test the trained model for 50 episodes and then look at the mean reward.

In [8]:
# Finally, evaluate the agent for 50 episodes.
history = dqn2.test(env, nb_episodes=50, visualize=False)
rewards = np.array(history.history['episode_reward'])
print(("Test rewards (#episodes={}): mean={:>5.2f}, std={:>5.2f}, "
           "min={:>5.2f}, max={:>5.2f}")
                  .format(len(rewards),
                  rewards.mean(),
                  rewards.std(),
                  rewards.min(),
                  rewards.max()))

rl_model_reward_comparisons["Model 2"] = rewards.mean()

Testing for 50 episodes ...
Episode 1: reward: -124.292, steps: 1000
Episode 2: reward: -128.521, steps: 1000
Episode 3: reward: -149.815, steps: 1000
Episode 4: reward: -111.980, steps: 1000
Episode 5: reward: -130.389, steps: 1000
Episode 6: reward: -159.117, steps: 1000
Episode 7: reward: -55.007, steps: 1000
Episode 8: reward: -181.105, steps: 855
Episode 9: reward: -136.369, steps: 1000
Episode 10: reward: -154.247, steps: 1000
Episode 11: reward: -162.554, steps: 1000
Episode 12: reward: -111.477, steps: 1000
Episode 13: reward: -88.013, steps: 1000
Episode 14: reward: -146.472, steps: 1000
Episode 15: reward: -68.793, steps: 1000
Episode 16: reward: -134.974, steps: 1000
Episode 17: reward: -112.135, steps: 1000
Episode 18: reward: -124.836, steps: 1000
Episode 19: reward: -126.504, steps: 1000
Episode 20: reward: -166.332, steps: 1000
Episode 21: reward: -82.429, steps: 1000
Episode 22: reward: -110.298, steps: 1000
Episode 23: reward: -161.058, steps: 741
Episode 24: reward: -

## Model 3

<b> Model Architecture</b>

We use the same architecture and process as [Model 1](#task2_model1), but with different hyper-parameters. <br/>
<b>In this model, the learning rate of the Adam optimizer is changed to 0.001; epsilon stays at 0.1.</b>
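
Since the three models differ only in epsilon and learning rate, it can help to keep the settings in one place and check what each variant changes relative to Model 1. The following is a small bookkeeping sketch; the `configs` dictionary and the `changed_settings` helper are illustrative names, with values taken from the descriptions of the three models.

```python
# Hyper-parameters as stated in the text for the three models.
configs = {
    "Model 1": {"eps": 0.1, "lr": 0.0001},
    "Model 2": {"eps": 0.2, "lr": 0.0001},
    "Model 3": {"eps": 0.1, "lr": 0.001},
}


def changed_settings(baseline, other):
    """Return only the settings in `other` that differ from `baseline`."""
    return {key: value for key, value in other.items() if baseline[key] != value}


diff_model2 = changed_settings(configs["Model 1"], configs["Model 2"])
diff_model3 = changed_settings(configs["Model 1"], configs["Model 3"])
```

Each variant changes exactly one setting relative to Model 1, which keeps the comparison between models easy to interpret.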

In [9]:
# Get the environment and extract the number of actions.
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

In [10]:
# Next, we build a very simple model.
model3 = Sequential()
model3.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model3.add(Dense(16))
model3.add(Activation('relu'))
model3.add(Dense(16))
model3.add(Activation('relu'))
model3.add(Dense(16))
model3.add(Activation('relu'))
model3.add(Dense(nb_actions))
model3.add(Activation('linear'))
print(model3.summary())

# Finally, we configure and compile our agent. You can use every built-in Keras optimizer and
# even the metrics!
memory = SequentialMemory(limit=50000, window_length=1)
policy = EpsGreedyQPolicy()
dqn3 = DQNAgent(model=model3, nb_actions=nb_actions, memory=memory, nb_steps_warmup=15,
               target_model_update=1e-2, policy=policy)
dqn3.compile(Adam(lr=0.001), metrics=['mae'])

# Okay, now it's time to learn something! Rendering is disabled (visualize=False) because
# it slows down training quite a lot. You can always safely abort the training prematurely
# using Ctrl + C.
start=time.time()
dqn3.fit(env, nb_steps=100000, visualize=False, verbose=2)

end = time.time()
timetaken=end - start
rl_model_time_comparisons['Model 3'] = timetaken
# After training is done, we save the final weights.
dqn3.save_weights('dqn_{}_weights_model3.h5f'.format(ENV_NAME), overwrite=True)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_3 (Flatten)          (None, 8)                 0         
_________________________________________________________________
dense_9 (Dense)              (None, 16)                144       
_________________________________________________________________
activation_9 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_10 (Dense)             (None, 16)                272       
_________________________________________________________________
activation_10 (Activation)   (None, 16)                0         
_________________________________________________________________
dense_11 (Dense)             (None, 16)                272       
_________________________________________________________________
activation_11 (Activation)   (None, 16)                0         
__________



   100/100000: episode: 1, duration: 2.089s, episode steps: 100, steps per second: 48, episode reward: -426.740, mean reward: -4.267 [-100.000, 36.449], mean action: 1.320 [0.000, 3.000], mean observation: 0.129 [-1.482, 3.255], loss: 7.285984, mean_absolute_error: 0.768169, mean_q: 0.135859
   192/100000: episode: 2, duration: 0.495s, episode steps: 92, steps per second: 186, episode reward: 15.020, mean reward: 0.163 [-100.000, 130.422], mean action: 0.152 [0.000, 3.000], mean observation: 0.006 [-1.701, 2.188], loss: 56.370140, mean_absolute_error: 1.537257, mean_q: -0.057817
   277/100000: episode: 3, duration: 0.452s, episode steps: 85, steps per second: 188, episode reward: -136.845, mean reward: -1.610 [-100.000, 13.186], mean action: 0.082 [0.000, 3.000], mean observation: 0.109 [-1.729, 5.046], loss: 82.552452, mean_absolute_error: 2.254117, mean_q: -0.313259
   338/100000: episode: 4, duration: 0.333s, episode steps: 61, steps per second: 183, episode reward: -102.065, mean r

  2278/100000: episode: 29, duration: 0.575s, episode steps: 104, steps per second: 181, episode reward: -130.431, mean reward: -1.254 [-100.000, 6.752], mean action: 0.885 [0.000, 3.000], mean observation: 0.003 [-1.829, 1.539], loss: 22.781019, mean_absolute_error: 18.434752, mean_q: -21.812496
  2348/100000: episode: 30, duration: 0.385s, episode steps: 70, steps per second: 182, episode reward: -157.980, mean reward: -2.257 [-100.000, 12.443], mean action: 1.471 [0.000, 3.000], mean observation: -0.088 [-1.692, 4.737], loss: 24.335604, mean_absolute_error: 17.715227, mean_q: -20.583763
  2424/100000: episode: 31, duration: 0.433s, episode steps: 76, steps per second: 176, episode reward: -222.597, mean reward: -2.929 [-100.000, 6.289], mean action: 1.250 [0.000, 3.000], mean observation: 0.161 [-1.839, 4.168], loss: 29.643311, mean_absolute_error: 17.841946, mean_q: -20.829298
  2516/100000: episode: 32, duration: 0.515s, episode steps: 92, steps per second: 179, episode reward: -2

  4394/100000: episode: 57, duration: 0.345s, episode steps: 60, steps per second: 174, episode reward: -294.513, mean reward: -4.909 [-100.000, 7.870], mean action: 1.400 [0.000, 3.000], mean observation: 0.126 [-1.717, 1.980], loss: 24.069763, mean_absolute_error: 28.583464, mean_q: -35.172279
  4453/100000: episode: 58, duration: 0.328s, episode steps: 59, steps per second: 180, episode reward: -132.806, mean reward: -2.251 [-100.000, 13.834], mean action: 1.322 [0.000, 3.000], mean observation: -0.178 [-1.603, 3.880], loss: 36.040363, mean_absolute_error: 29.118935, mean_q: -35.948257
  4583/100000: episode: 59, duration: 0.735s, episode steps: 130, steps per second: 177, episode reward: -122.788, mean reward: -0.945 [-100.000, 8.095], mean action: 1.600 [0.000, 3.000], mean observation: 0.129 [-1.637, 1.501], loss: 19.784435, mean_absolute_error: 29.491381, mean_q: -36.647778
  4786/100000: episode: 60, duration: 1.226s, episode steps: 203, steps per second: 166, episode reward: -

 11872/100000: episode: 85, duration: 1.610s, episode steps: 264, steps per second: 164, episode reward: -199.004, mean reward: -0.754 [-100.000, 22.476], mean action: 1.598 [0.000, 3.000], mean observation: 0.044 [-1.362, 1.632], loss: 13.861224, mean_absolute_error: 22.986057, mean_q: -11.673425
 12124/100000: episode: 86, duration: 1.627s, episode steps: 252, steps per second: 155, episode reward: -181.967, mean reward: -0.722 [-100.000, 25.295], mean action: 1.528 [0.000, 3.000], mean observation: 0.127 [-0.648, 1.862], loss: 13.100877, mean_absolute_error: 24.212549, mean_q: -11.751440
 12315/100000: episode: 87, duration: 1.133s, episode steps: 191, steps per second: 169, episode reward: -102.115, mean reward: -0.535 [-100.000, 6.404], mean action: 1.660 [0.000, 3.000], mean observation: 0.032 [-1.007, 3.572], loss: 11.950295, mean_absolute_error: 24.343264, mean_q: -10.444625
 12423/100000: episode: 88, duration: 0.607s, episode steps: 108, steps per second: 178, episode reward:

 20612/100000: episode: 113, duration: 4.273s, episode steps: 673, steps per second: 157, episode reward: -183.614, mean reward: -0.273 [-100.000, 6.239], mean action: 1.617 [0.000, 3.000], mean observation: 0.096 [-0.709, 1.440], loss: 11.026102, mean_absolute_error: 30.562449, mean_q: 0.677762
 20995/100000: episode: 114, duration: 2.426s, episode steps: 383, steps per second: 158, episode reward: -120.824, mean reward: -0.315 [-100.000, 15.572], mean action: 1.462 [0.000, 3.000], mean observation: 0.039 [-0.923, 1.400], loss: 14.154320, mean_absolute_error: 30.515732, mean_q: 0.969476
 21267/100000: episode: 115, duration: 1.598s, episode steps: 272, steps per second: 170, episode reward: -188.637, mean reward: -0.694 [-100.000, 5.476], mean action: 1.691 [0.000, 3.000], mean observation: 0.044 [-1.298, 1.439], loss: 10.864378, mean_absolute_error: 30.884504, mean_q: 1.720683
 21510/100000: episode: 116, duration: 1.400s, episode steps: 243, steps per second: 174, episode reward: -2

 28272/100000: episode: 141, duration: 1.050s, episode steps: 182, steps per second: 173, episode reward: -57.382, mean reward: -0.315 [-100.000, 10.391], mean action: 1.797 [0.000, 3.000], mean observation: -0.045 [-2.334, 1.397], loss: 9.310972, mean_absolute_error: 31.992407, mean_q: 1.772550
 28373/100000: episode: 142, duration: 0.574s, episode steps: 101, steps per second: 176, episode reward: 2.352, mean reward: 0.023 [-100.000, 14.493], mean action: 1.970 [0.000, 3.000], mean observation: 0.057 [-0.876, 1.392], loss: 14.768788, mean_absolute_error: 32.295799, mean_q: 2.759585
 28667/100000: episode: 143, duration: 1.734s, episode steps: 294, steps per second: 170, episode reward: 4.181, mean reward: 0.014 [-100.000, 11.779], mean action: 1.650 [0.000, 3.000], mean observation: 0.050 [-1.245, 1.407], loss: 8.649396, mean_absolute_error: 32.733009, mean_q: 1.531260
 28785/100000: episode: 144, duration: 0.657s, episode steps: 118, steps per second: 180, episode reward: -116.367, 

 36661/100000: episode: 169, duration: 1.985s, episode steps: 331, steps per second: 167, episode reward: -75.939, mean reward: -0.229 [-100.000, 13.308], mean action: 1.782 [0.000, 3.000], mean observation: 0.050 [-0.625, 1.524], loss: 10.147756, mean_absolute_error: 29.488264, mean_q: 3.037578
 37234/100000: episode: 170, duration: 3.730s, episode steps: 573, steps per second: 154, episode reward: -39.114, mean reward: -0.068 [-100.000, 14.387], mean action: 1.749 [0.000, 3.000], mean observation: 0.067 [-0.653, 1.469], loss: 10.437757, mean_absolute_error: 29.533705, mean_q: 2.767088
 37500/100000: episode: 171, duration: 1.607s, episode steps: 266, steps per second: 165, episode reward: -3.233, mean reward: -0.012 [-100.000, 22.039], mean action: 1.880 [0.000, 3.000], mean observation: -0.000 [-0.610, 1.405], loss: 9.787977, mean_absolute_error: 29.324926, mean_q: 4.105457
 37739/100000: episode: 172, duration: 1.422s, episode steps: 239, steps per second: 168, episode reward: -186

 53762/100000: episode: 197, duration: 8.928s, episode steps: 1000, steps per second: 112, episode reward: -84.948, mean reward: -0.085 [-6.117, 5.254], mean action: 1.680 [0.000, 3.000], mean observation: 0.071 [-0.760, 1.386], loss: 7.939304, mean_absolute_error: 23.309713, mean_q: 19.248926
 54201/100000: episode: 198, duration: 2.704s, episode steps: 439, steps per second: 162, episode reward: -213.396, mean reward: -0.486 [-100.000, 20.823], mean action: 1.733 [0.000, 3.000], mean observation: -0.019 [-0.999, 1.438], loss: 9.802689, mean_absolute_error: 23.355740, mean_q: 20.004827
 55201/100000: episode: 199, duration: 6.990s, episode steps: 1000, steps per second: 143, episode reward: -35.557, mean reward: -0.036 [-5.088, 4.443], mean action: 1.551 [0.000, 3.000], mean observation: 0.039 [-0.642, 1.403], loss: 8.683346, mean_absolute_error: 23.177113, mean_q: 20.321720
 56201/100000: episode: 200, duration: 7.200s, episode steps: 1000, steps per second: 139, episode reward: -23.

 77529/100000: episode: 225, duration: 8.371s, episode steps: 1000, steps per second: 119, episode reward: -33.584, mean reward: -0.034 [-4.662, 4.939], mean action: 1.675 [0.000, 3.000], mean observation: 0.082 [-0.772, 1.399], loss: 6.537270, mean_absolute_error: 25.825907, mean_q: 32.129246
 78529/100000: episode: 226, duration: 7.666s, episode steps: 1000, steps per second: 130, episode reward: 1.670, mean reward: 0.002 [-4.827, 5.090], mean action: 1.514 [0.000, 3.000], mean observation: 0.103 [-0.511, 1.496], loss: 6.468633, mean_absolute_error: 25.670715, mean_q: 31.840431
 79529/100000: episode: 227, duration: 8.292s, episode steps: 1000, steps per second: 121, episode reward: -61.676, mean reward: -0.062 [-4.303, 4.802], mean action: 1.518 [0.000, 3.000], mean observation: 0.076 [-0.333, 1.412], loss: 5.509070, mean_absolute_error: 25.395870, mean_q: 31.828310
 80529/100000: episode: 228, duration: 8.226s, episode steps: 1000, steps per second: 122, episode reward: -50.649, me

## Evaluation Result

We test the trained model for 50 episodes and then look at the mean reward, along with its standard deviation, minimum, and maximum.
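
Because a mean over a finite number of episodes is itself noisy, it can help to attach a rough uncertainty to it when comparing models. The sketch below computes a normal-approximation 95% interval for the mean reward. The helper name `mean_with_interval` is made up for this illustration, and the five rewards are the first five Model 3 test episodes printed in the output further down, used only as sample input; the notebook's own evaluation uses all 50 episodes.

```python
import math
import statistics

def mean_with_interval(rewards, z=1.96):
    """Return (mean, low, high) using a normal approximation for the mean."""
    mean = statistics.mean(rewards)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(rewards) / math.sqrt(len(rewards))
    return mean, mean - z * sem, mean + z * sem

# First five Model 3 test rewards, used as sample input only.
sample = [-13.904, -27.567, 7.155, 3.874, -31.419]
mean, low, high = mean_with_interval(sample)
print("mean={:.2f}, 95% interval=[{:.2f}, {:.2f}]".format(mean, low, high))
```

With only a handful of episodes the normal approximation is crude; it becomes more reasonable with the full 50-episode reward array.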

In [11]:
# Finally, evaluate the trained agent for 50 episodes and keep the per-episode rewards.
history = dqn3.test(env, nb_episodes=50, visualize=False)
rewards = np.array(history.history['episode_reward'])
print(("Test rewards (#episodes={}): mean={:>5.2f}, std={:>5.2f}, "
           "min={:>5.2f}, max={:>5.2f}")
                  .format(len(rewards),
                  rewards.mean(),
                  rewards.std(),
                  rewards.min(),
                  rewards.max()))

rl_model_reward_comparisons["Model 3"] = rewards.mean()

Testing for 50 episodes ...
Episode 1: reward: -13.904, steps: 1000
Episode 2: reward: -27.567, steps: 1000
Episode 3: reward: 7.155, steps: 1000
Episode 4: reward: 3.874, steps: 1000
Episode 5: reward: -31.419, steps: 1000
Episode 6: reward: 5.493, steps: 1000
Episode 7: reward: -53.837, steps: 1000
Episode 8: reward: -58.262, steps: 1000
Episode 9: reward: -26.818, steps: 1000
Episode 10: reward: -22.521, steps: 1000
Episode 11: reward: -28.619, steps: 1000
Episode 12: reward: -2.943, steps: 1000
Episode 13: reward: -13.825, steps: 1000
Episode 14: reward: -25.618, steps: 1000
Episode 15: reward: -33.615, steps: 1000
Episode 16: reward: -52.710, steps: 1000
Episode 17: reward: -27.549, steps: 1000
Episode 18: reward: -17.004, steps: 1000
Episode 19: reward: -22.071, steps: 1000
Episode 20: reward: -48.325, steps: 1000
Episode 21: reward: -22.303, steps: 1000
Episode 22: reward: -122.199, steps: 1000
Episode 23: reward: -14.136, steps: 1000
Episode 24: reward: -1.940, steps: 1000
Epis

## Model 4

<b> Model Architecture</b>

This model uses the same architecture and training process as [Model 1](#task2_model1) but with different hyper-parameters. <br/> 
<b>In this model, we set the learning rate of the Adam optimizer to 0.001 and epsilon to 0.2.</b>
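
To see what doubling epsilon means in practice: with four actions, a random draw still picks the greedy action a quarter of the time, so the agent follows its current best estimate with probability 1 - eps + eps/4. The short calculation below is a sketch; `greedy_probability` is a helper name made up here.

```python
def greedy_probability(eps, nb_actions):
    """Probability that an epsilon-greedy policy selects the greedy action."""
    return (1.0 - eps) + eps / nb_actions

# Compare the epsilon used for Model 3 (0.1) with the one used here (0.2).
for eps in (0.1, 0.2):
    print(eps, greedy_probability(eps, nb_actions=4))
```

Under this rule the greedy action is taken in roughly 92.5% of training steps at eps=0.1 and 85% at eps=0.2.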

In [12]:
# Re-create the environment and reseed it, then extract the number of actions.
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

In [13]:
# Next, we build a very simple model.
model4 = Sequential()
model4.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model4.add(Dense(16))
model4.add(Activation('relu'))
model4.add(Dense(16))
model4.add(Activation('relu'))
model4.add(Dense(16))
model4.add(Activation('relu'))
model4.add(Dense(nb_actions))
model4.add(Activation('linear'))
print(model4.summary())

# Finally, we configure and compile our agent. You can use every built-in Keras optimizer and
# even the metrics!
memory = SequentialMemory(limit=50000, window_length=1)
policy = EpsGreedyQPolicy(eps=.2)
dqn4 = DQNAgent(model=model4, nb_actions=nb_actions, memory=memory, nb_steps_warmup=15,
               target_model_update=1e-2, policy=policy)
dqn4.compile(Adam(lr=0.001), metrics=['mae'])

# Okay, now it's time to learn something! Rendering is disabled (visualize=False) because
# it slows down training quite a lot. You can always safely abort the training prematurely
# using Ctrl + C.
start= time.time()
dqn4.fit(env, nb_steps=100000, visualize=False, verbose=2)
end = time.time()
timetaken=end - start
rl_model_time_comparisons['Model 4'] = timetaken
# After training is done, we save the final weights.
dqn4.save_weights('dqn_{}_weights_model4.h5f'.format(ENV_NAME), overwrite=True)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_4 (Flatten)          (None, 8)                 0         
_________________________________________________________________
dense_13 (Dense)             (None, 16)                144       
_________________________________________________________________
activation_13 (Activation)   (None, 16)                0         
_________________________________________________________________
dense_14 (Dense)             (None, 16)                272       
_________________________________________________________________
activation_14 (Activation)   (None, 16)                0         
_________________________________________________________________
dense_15 (Dense)             (None, 16)                272       
_________________________________________________________________
activation_15 (Activation)   (None, 16)                0         
__________



    91/100000: episode: 1, duration: 2.253s, episode steps: 91, steps per second: 40, episode reward: -164.688, mean reward: -1.810 [-100.000, 11.862], mean action: 2.011 [0.000, 3.000], mean observation: 0.034 [-5.194, 1.399], loss: 3.954690, mean_absolute_error: 0.694446, mean_q: 0.109807
   164/100000: episode: 2, duration: 0.426s, episode steps: 73, steps per second: 171, episode reward: -385.565, mean reward: -5.282 [-100.000, 1.582], mean action: 2.192 [0.000, 3.000], mean observation: -0.270 [-2.727, 1.412], loss: 36.663620, mean_absolute_error: 1.179321, mean_q: 0.310515
   227/100000: episode: 3, duration: 0.353s, episode steps: 63, steps per second: 179, episode reward: -219.585, mean reward: -3.485 [-100.000, 80.575], mean action: 1.413 [0.000, 3.000], mean observation: -0.131 [-3.759, 1.398], loss: 50.023174, mean_absolute_error: 1.532596, mean_q: 0.154763
   291/100000: episode: 4, duration: 0.354s, episode steps: 64, steps per second: 181, episode reward: -258.153, mean r

  8018/100000: episode: 29, duration: 3.144s, episode steps: 477, steps per second: 152, episode reward: -208.359, mean reward: -0.437 [-100.000, 5.043], mean action: 1.591 [0.000, 3.000], mean observation: 0.085 [-1.011, 1.463], loss: 5.740035, mean_absolute_error: 22.946516, mean_q: 10.437573
  8340/100000: episode: 30, duration: 2.049s, episode steps: 322, steps per second: 157, episode reward: -227.019, mean reward: -0.705 [-100.000, 4.470], mean action: 1.668 [0.000, 3.000], mean observation: 0.047 [-1.050, 1.387], loss: 5.685569, mean_absolute_error: 23.315964, mean_q: 10.194123
  8472/100000: episode: 31, duration: 0.759s, episode steps: 132, steps per second: 174, episode reward: -147.931, mean reward: -1.121 [-100.000, 7.297], mean action: 1.250 [0.000, 3.000], mean observation: -0.083 [-4.011, 1.388], loss: 6.207466, mean_absolute_error: 23.896080, mean_q: 9.940652
  8731/100000: episode: 32, duration: 1.637s, episode steps: 259, steps per second: 158, episode reward: -177.70

 17415/100000: episode: 57, duration: 2.057s, episode steps: 327, steps per second: 159, episode reward: -133.524, mean reward: -0.408 [-100.000, 3.019], mean action: 1.761 [0.000, 3.000], mean observation: 0.216 [-0.472, 1.397], loss: 3.271354, mean_absolute_error: 27.429646, mean_q: 3.149706
 17822/100000: episode: 58, duration: 2.634s, episode steps: 407, steps per second: 155, episode reward: -212.250, mean reward: -0.521 [-100.000, 3.500], mean action: 1.757 [0.000, 3.000], mean observation: 0.207 [-0.187, 1.417], loss: 3.450402, mean_absolute_error: 27.261034, mean_q: 2.613321
 18145/100000: episode: 59, duration: 1.973s, episode steps: 323, steps per second: 164, episode reward: -149.739, mean reward: -0.464 [-100.000, 5.114], mean action: 1.712 [0.000, 3.000], mean observation: 0.237 [-0.237, 1.513], loss: 2.659300, mean_absolute_error: 27.573450, mean_q: 2.221550
 18409/100000: episode: 60, duration: 1.605s, episode steps: 264, steps per second: 164, episode reward: -124.099, 

 38278/100000: episode: 85, duration: 8.961s, episode steps: 1000, steps per second: 112, episode reward: -116.381, mean reward: -0.116 [-4.572, 4.433], mean action: 1.738 [0.000, 3.000], mean observation: 0.112 [-0.311, 1.411], loss: 1.765229, mean_absolute_error: 19.085438, mean_q: 7.176369
 39278/100000: episode: 86, duration: 7.736s, episode steps: 1000, steps per second: 129, episode reward: -91.277, mean reward: -0.091 [-4.652, 4.744], mean action: 1.738 [0.000, 3.000], mean observation: 0.115 [-0.327, 1.410], loss: 2.868146, mean_absolute_error: 18.897224, mean_q: 7.043236
 40278/100000: episode: 87, duration: 8.376s, episode steps: 1000, steps per second: 119, episode reward: -56.961, mean reward: -0.057 [-4.560, 4.848], mean action: 1.722 [0.000, 3.000], mean observation: 0.084 [-0.361, 1.414], loss: 2.415416, mean_absolute_error: 18.347578, mean_q: 7.892896
 41278/100000: episode: 88, duration: 8.911s, episode steps: 1000, steps per second: 112, episode reward: -72.980, mean 

 61938/100000: episode: 113, duration: 1.171s, episode steps: 196, steps per second: 167, episode reward: -254.988, mean reward: -1.301 [-100.000, 8.343], mean action: 1.847 [0.000, 3.000], mean observation: 0.154 [-1.160, 2.812], loss: 3.742325, mean_absolute_error: 16.734417, mean_q: 18.952995
 62156/100000: episode: 114, duration: 1.316s, episode steps: 218, steps per second: 166, episode reward: -350.957, mean reward: -1.610 [-100.000, 78.007], mean action: 1.908 [0.000, 3.000], mean observation: 0.107 [-0.995, 3.653], loss: 3.996792, mean_absolute_error: 16.618835, mean_q: 18.454985
 62569/100000: episode: 115, duration: 2.566s, episode steps: 413, steps per second: 161, episode reward: -68.059, mean reward: -0.165 [-100.000, 17.241], mean action: 1.748 [0.000, 3.000], mean observation: 0.185 [-0.792, 1.463], loss: 2.831129, mean_absolute_error: 16.753124, mean_q: 19.250179
 63569/100000: episode: 116, duration: 8.409s, episode steps: 1000, steps per second: 119, episode reward: 4

 81535/100000: episode: 141, duration: 8.509s, episode steps: 1000, steps per second: 118, episode reward: -25.920, mean reward: -0.026 [-4.967, 4.690], mean action: 1.673 [0.000, 3.000], mean observation: 0.082 [-0.370, 1.489], loss: 5.278962, mean_absolute_error: 20.103895, mean_q: 24.901442
 82535/100000: episode: 142, duration: 7.815s, episode steps: 1000, steps per second: 128, episode reward: 82.028, mean reward: 0.082 [-22.636, 22.736], mean action: 1.827 [0.000, 3.000], mean observation: 0.099 [-0.710, 1.402], loss: 3.115170, mean_absolute_error: 19.984743, mean_q: 24.659796
 83535/100000: episode: 143, duration: 7.292s, episode steps: 1000, steps per second: 137, episode reward: 90.050, mean reward: 0.090 [-19.118, 12.145], mean action: 1.554 [0.000, 3.000], mean observation: 0.095 [-0.771, 1.393], loss: 3.625318, mean_absolute_error: 20.117731, mean_q: 24.937138
 84535/100000: episode: 144, duration: 9.335s, episode steps: 1000, steps per second: 107, episode reward: -38.257,

## Evaluation Result

We test the trained model for 50 episodes and then look at the mean reward, along with its standard deviation, minimum, and maximum.
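
The mean alone hides how episodes end. A complementary view is the share of episodes that reach the 200-point score that Gym's Lunar Lander documentation treats as solved, and the share that run into the 1000-step cap seen in the logs (the lander keeps hovering instead of landing). The sketch below is an illustration; `episode_breakdown` is a hypothetical helper, and the eight (reward, steps) pairs are the first eight Model 4 test episodes from the output further down.

```python
def episode_breakdown(episodes, solved_reward=200.0, step_cap=1000):
    """Return the fractions of solved and step-capped episodes."""
    n = len(episodes)
    solved = sum(1 for reward, _ in episodes if reward >= solved_reward)
    capped = sum(1 for _, steps in episodes if steps >= step_cap)
    return solved / n, capped / n

# (reward, steps) for the first eight Model 4 test episodes, as sample input.
sample = [(-210.185, 191), (103.853, 1000), (118.624, 1000), (246.048, 505),
          (135.099, 1000), (117.364, 1000), (144.357, 1000), (114.239, 1000)]
solved, capped = episode_breakdown(sample)
print("solved: {:.3f}, hit step cap: {:.3f}".format(solved, capped))
```

The same function can be fed the full test history by pairing the `episode_reward` values with the step counts of each episode.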

In [14]:
# Finally, evaluate the trained agent for 50 episodes and keep the per-episode rewards.
history = dqn4.test(env, nb_episodes=50, visualize=False)
rewards = np.array(history.history['episode_reward'])
print(("Test rewards (#episodes={}): mean={:>5.2f}, std={:>5.2f}, "
           "min={:>5.2f}, max={:>5.2f}")
                  .format(len(rewards),
                  rewards.mean(),
                  rewards.std(),
                  rewards.min(),
                  rewards.max()))

rl_model_reward_comparisons["Model 4"] = rewards.mean()

Testing for 50 episodes ...
Episode 1: reward: -210.185, steps: 191
Episode 2: reward: 103.853, steps: 1000
Episode 3: reward: 118.624, steps: 1000
Episode 4: reward: 246.048, steps: 505
Episode 5: reward: 135.099, steps: 1000
Episode 6: reward: 117.364, steps: 1000
Episode 7: reward: 144.357, steps: 1000
Episode 8: reward: 114.239, steps: 1000
Episode 9: reward: 111.288, steps: 1000
Episode 10: reward: 119.039, steps: 1000
Episode 11: reward: 114.412, steps: 1000
Episode 12: reward: 124.712, steps: 1000
Episode 13: reward: 86.981, steps: 1000
Episode 14: reward: 110.171, steps: 1000
Episode 15: reward: 229.639, steps: 424
Episode 16: reward: 118.417, steps: 1000
Episode 17: reward: 123.959, steps: 1000
Episode 18: reward: 109.251, steps: 1000
Episode 19: reward: 97.037, steps: 1000
Episode 20: reward: -278.117, steps: 228
Episode 21: reward: 127.802, steps: 1000
Episode 22: reward: 211.703, steps: 679
Episode 23: reward: 108.731, steps: 1000
Episode 24: reward: 135.276, steps: 1000
Ep

## Model 5

<b> Model Architecture</b>

This model uses the same architecture and training process as [Model 1](#task2_model1) but with different hyper-parameters. <br/> <br/>
<b>In this model, we set the learning rate of the Adam optimizer to 0.001 and use a different policy, LinearAnnealedPolicy. With this policy, the value of epsilon decays linearly as the agent steps forward in the world. In the policy below, epsilon starts at 1 and decreases over the first 10000 training steps to a floor of 0.1, where it stays for the rest of training.</b>
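
The annealing schedule described above can be sketched as a small standalone function. This is an illustration under the parameters used in this notebook, not the library code itself; the name `annealed_eps` is made up here. Its output is consistent with the `mean_eps` column in the training log, which has settled at 0.100000 by step 17049.

```python
def annealed_eps(step, training=True, value_max=1.0, value_min=0.1,
                 value_test=0.05, nb_steps=10000):
    """Linearly decay epsilon from value_max to value_min over nb_steps."""
    if not training:
        # Outside training the policy reports a fixed test value.
        return value_test
    slope = -(value_max - value_min) / nb_steps
    # Clamp at value_min once the linear ramp has finished.
    return max(value_min, slope * step + value_max)

for step in (0, 2500, 5000, 10000, 50000):
    print(step, round(annealed_eps(step), 3))
```

Starting with epsilon near 1 makes early behaviour almost entirely random, which fills the replay memory with varied experience before the agent begins to rely on its own estimates.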

In [15]:
# Re-create the environment and reseed it, then extract the number of actions.
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

In [16]:
# Next, we build a very simple model.
model5 = Sequential()
model5.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model5.add(Dense(16))
model5.add(Activation('relu'))
model5.add(Dense(16))
model5.add(Activation('relu'))
model5.add(Dense(16))
model5.add(Activation('relu'))
model5.add(Dense(nb_actions))
model5.add(Activation('linear'))
print(model5.summary())

# Finally, we configure and compile our agent. You can use every built-in Keras optimizer and
# even the metrics!
memory = SequentialMemory(limit=50000, window_length=1)
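# Anneal epsilon linearly from 1.0 to 0.1 over the first 10000 training steps;
# value_test=.05 is the epsilon this policy reports when the agent is not training.
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1., value_min=.1, value_test=.05, nb_steps=10000)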
dqn5 = DQNAgent(model=model5, nb_actions=nb_actions, memory=memory, nb_steps_warmup=15,
               target_model_update=1e-2, policy=policy)
dqn5.compile(Adam(lr=0.001), metrics=['mae'])

# Okay, now it's time to learn something! Rendering is disabled (visualize=False) because
# it slows down training quite a lot. You can always safely abort the training prematurely
# using Ctrl + C.
start = time.time()
dqn5.fit(env, nb_steps=100000, visualize=False, verbose=2)
end = time.time()
timetaken=end - start
rl_model_time_comparisons['Model 5'] = timetaken
# After training is done, we save the final weights.
dqn5.save_weights('dqn_{}_weights_model5.h5f'.format(ENV_NAME), overwrite=True)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_5 (Flatten)          (None, 8)                 0         
_________________________________________________________________
dense_17 (Dense)             (None, 16)                144       
_________________________________________________________________
activation_17 (Activation)   (None, 16)                0         
_________________________________________________________________
dense_18 (Dense)             (None, 16)                272       
_________________________________________________________________
activation_18 (Activation)   (None, 16)                0         
_________________________________________________________________
dense_19 (Dense)             (None, 16)                272       
_________________________________________________________________
activation_19 (Activation)   (None, 16)                0         
__________



    80/100000: episode: 1, duration: 2.389s, episode steps: 80, steps per second: 33, episode reward: -96.202, mean reward: -1.203 [-100.000, 11.577], mean action: 1.512 [0.000, 3.000], mean observation: 0.050 [-1.260, 4.242], loss: 1.533736, mean_absolute_error: 0.493214, mean_q: 0.309838, mean_eps: 0.995725
   140/100000: episode: 2, duration: 0.355s, episode steps: 60, steps per second: 169, episode reward: -90.287, mean reward: -1.505 [-100.000, 12.534], mean action: 1.583 [0.000, 3.000], mean observation: -0.101 [-5.039, 1.394], loss: 45.097277, mean_absolute_error: 1.147136, mean_q: 1.884800, mean_eps: 0.990145
   249/100000: episode: 3, duration: 0.638s, episode steps: 109, steps per second: 171, episode reward: -168.031, mean reward: -1.542 [-100.000, 5.909], mean action: 1.450 [0.000, 3.000], mean observation: -0.063 [-1.580, 5.109], loss: 43.599776, mean_absolute_error: 1.258506, mean_q: 2.334912, mean_eps: 0.982540
   388/100000: episode: 4, duration: 0.834s, episode steps: 

  2848/100000: episode: 28, duration: 0.588s, episode steps: 95, steps per second: 161, episode reward: -85.771, mean reward: -0.903 [-100.000, 7.333], mean action: 1.716 [0.000, 3.000], mean observation: -0.046 [-1.083, 3.987], loss: 23.530632, mean_absolute_error: 11.543287, mean_q: 10.871222, mean_eps: 0.748000
  2939/100000: episode: 29, duration: 0.541s, episode steps: 91, steps per second: 168, episode reward: -115.969, mean reward: -1.274 [-100.000, 9.488], mean action: 1.593 [0.000, 3.000], mean observation: -0.125 [-0.974, 2.919], loss: 23.587906, mean_absolute_error: 11.774356, mean_q: 11.483670, mean_eps: 0.739630
  3052/100000: episode: 30, duration: 0.673s, episode steps: 113, steps per second: 168, episode reward: -65.710, mean reward: -0.582 [-100.000, 10.583], mean action: 1.637 [0.000, 3.000], mean observation: -0.033 [-3.390, 1.395], loss: 21.151316, mean_absolute_error: 12.886764, mean_q: 10.703189, mean_eps: 0.730450
  3120/100000: episode: 31, duration: 0.401s, epi

  6974/100000: episode: 54, duration: 1.329s, episode steps: 218, steps per second: 164, episode reward: -41.443, mean reward: -0.190 [-100.000, 14.752], mean action: 1.661 [0.000, 3.000], mean observation: 0.048 [-0.918, 1.424], loss: 11.550512, mean_absolute_error: 22.619874, mean_q: 8.839685, mean_eps: 0.382195
  7118/100000: episode: 55, duration: 0.861s, episode steps: 144, steps per second: 167, episode reward: 15.647, mean reward: 0.109 [-100.000, 17.964], mean action: 1.604 [0.000, 3.000], mean observation: -0.020 [-0.817, 1.448], loss: 12.503320, mean_absolute_error: 22.691998, mean_q: 9.333096, mean_eps: 0.365905
  7472/100000: episode: 56, duration: 2.206s, episode steps: 354, steps per second: 160, episode reward: -30.024, mean reward: -0.085 [-100.000, 7.107], mean action: 1.653 [0.000, 3.000], mean observation: 0.052 [-0.481, 1.496], loss: 11.381594, mean_absolute_error: 23.303030, mean_q: 10.219546, mean_eps: 0.343495
  7625/100000: episode: 57, duration: 0.914s, episode

 17049/100000: episode: 80, duration: 2.354s, episode steps: 366, steps per second: 155, episode reward: -145.359, mean reward: -0.397 [-100.000, 5.307], mean action: 1.852 [0.000, 3.000], mean observation: 0.106 [-0.803, 1.994], loss: 8.151118, mean_absolute_error: 25.999302, mean_q: 13.691301, mean_eps: 0.100000
 17484/100000: episode: 81, duration: 3.040s, episode steps: 435, steps per second: 143, episode reward: -116.594, mean reward: -0.268 [-100.000, 9.138], mean action: 1.726 [0.000, 3.000], mean observation: 0.118 [-0.726, 1.681], loss: 10.215110, mean_absolute_error: 25.770879, mean_q: 13.136301, mean_eps: 0.100000
 18484/100000: episode: 82, duration: 9.827s, episode steps: 1000, steps per second: 102, episode reward: -73.269, mean reward: -0.073 [-4.647, 4.679], mean action: 1.905 [0.000, 3.000], mean observation: 0.096 [-0.533, 1.416], loss: 9.688678, mean_absolute_error: 25.242522, mean_q: 13.499999, mean_eps: 0.100000
 19063/100000: episode: 83, duration: 3.880s, episode

 36730/100000: episode: 106, duration: 9.093s, episode steps: 1000, steps per second: 110, episode reward: -35.246, mean reward: -0.035 [-4.557, 5.323], mean action: 1.521 [0.000, 3.000], mean observation: 0.102 [-0.459, 1.398], loss: 5.345604, mean_absolute_error: 17.225414, mean_q: 15.899656, mean_eps: 0.100000
 37730/100000: episode: 107, duration: 8.044s, episode steps: 1000, steps per second: 124, episode reward: -27.729, mean reward: -0.028 [-4.103, 5.155], mean action: 1.553 [0.000, 3.000], mean observation: 0.071 [-0.666, 1.397], loss: 5.962276, mean_absolute_error: 16.848623, mean_q: 16.442985, mean_eps: 0.100000
 38730/100000: episode: 108, duration: 7.930s, episode steps: 1000, steps per second: 126, episode reward: -76.106, mean reward: -0.076 [-4.690, 4.451], mean action: 1.605 [0.000, 3.000], mean observation: 0.100 [-0.527, 1.409], loss: 6.505173, mean_absolute_error: 16.805387, mean_q: 16.503151, mean_eps: 0.100000
 39600/100000: episode: 109, duration: 6.175s, episode 

 62278/100000: episode: 132, duration: 7.568s, episode steps: 1000, steps per second: 132, episode reward: -28.362, mean reward: -0.028 [-2.977, 4.614], mean action: 1.536 [0.000, 3.000], mean observation: 0.034 [-0.490, 1.418], loss: 2.690518, mean_absolute_error: 20.919919, mean_q: 27.346715, mean_eps: 0.100000
 63278/100000: episode: 133, duration: 8.121s, episode steps: 1000, steps per second: 123, episode reward: 10.810, mean reward: 0.011 [-21.394, 22.414], mean action: 1.557 [0.000, 3.000], mean observation: 0.049 [-0.614, 1.408], loss: 3.514521, mean_absolute_error: 21.364044, mean_q: 27.891466, mean_eps: 0.100000
 64278/100000: episode: 134, duration: 7.735s, episode steps: 1000, steps per second: 129, episode reward: 2.612, mean reward: 0.003 [-11.951, 16.025], mean action: 1.556 [0.000, 3.000], mean observation: 0.034 [-0.659, 1.448], loss: 3.880178, mean_absolute_error: 21.141313, mean_q: 27.718455, mean_eps: 0.100000
 65278/100000: episode: 135, duration: 8.380s, episode s

 83864/100000: episode: 158, duration: 7.553s, episode steps: 1000, steps per second: 132, episode reward: -46.319, mean reward: -0.046 [-4.968, 5.906], mean action: 1.471 [0.000, 3.000], mean observation: 0.043 [-0.315, 1.455], loss: 2.721545, mean_absolute_error: 15.779670, mean_q: 19.989531, mean_eps: 0.100000
 84864/100000: episode: 159, duration: 8.989s, episode steps: 1000, steps per second: 111, episode reward: -59.620, mean reward: -0.060 [-3.652, 4.543], mean action: 1.436 [0.000, 3.000], mean observation: 0.037 [-0.426, 1.406], loss: 1.699680, mean_absolute_error: 15.236044, mean_q: 19.373895, mean_eps: 0.100000
 85864/100000: episode: 160, duration: 9.020s, episode steps: 1000, steps per second: 111, episode reward: -31.764, mean reward: -0.032 [-3.839, 4.895], mean action: 1.496 [0.000, 3.000], mean observation: 0.031 [-0.723, 1.387], loss: 2.791505, mean_absolute_error: 15.008729, mean_q: 18.594353, mean_eps: 0.100000
 86864/100000: episode: 161, duration: 8.988s, episode 

 99973/100000: episode: 184, duration: 7.994s, episode steps: 1000, steps per second: 125, episode reward: -33.155, mean reward: -0.033 [-3.728, 7.563], mean action: 1.685 [0.000, 3.000], mean observation: -0.011 [-1.055, 1.434], loss: 3.596136, mean_absolute_error: 15.100486, mean_q: 18.315063, mean_eps: 0.100000
done, took 772.261 seconds


## Evaluation Result

We test the model trained above (Model 5) for 50 episodes and report the mean, standard deviation, minimum and maximum of the episode rewards.

In [17]:
# Finally, evaluate the agent on 50 test episodes (rendering disabled)
history = dqn5.test(env, nb_episodes=50, visualize=False)
rewards = np.array(history.history['episode_reward'])
print(("Test rewards (#episodes={}): mean={:>5.2f}, std={:>5.2f}, "
       "min={:>5.2f}, max={:>5.2f}")
      .format(len(rewards),
              rewards.mean(),
              rewards.std(),
              rewards.min(),
              rewards.max()))

# Record the mean test reward for the cross-model comparison
rl_model_reward_comparisons["Model 5"] = rewards.mean()

Testing for 50 episodes ...
Episode 1: reward: -152.385, steps: 208
Episode 2: reward: -61.721, steps: 1000
Episode 3: reward: -21.787, steps: 1000
Episode 4: reward: -119.642, steps: 210
Episode 5: reward: -56.291, steps: 1000
Episode 6: reward: -104.080, steps: 271
Episode 7: reward: -38.385, steps: 1000
Episode 8: reward: -37.637, steps: 1000
Episode 9: reward: -52.232, steps: 1000
Episode 10: reward: -157.687, steps: 166
Episode 11: reward: 184.895, steps: 565
Episode 12: reward: -74.763, steps: 1000
Episode 13: reward: -119.801, steps: 227
Episode 14: reward: -108.668, steps: 212
Episode 15: reward: -37.236, steps: 1000
Episode 16: reward: -62.961, steps: 1000
Episode 17: reward: -161.091, steps: 176
Episode 18: reward: -131.141, steps: 208
Episode 19: reward: -54.391, steps: 1000
Episode 20: reward: -88.476, steps: 1000
Episode 21: reward: -25.260, steps: 1000
Episode 22: reward: -27.738, steps: 1000
Episode 23: reward: -157.261, steps: 186
Episode 24: reward: -69.314, steps: 100
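
The mean reward alone hides how the episodes end. As a rough sketch, the summary below also reports the standard error of the mean, the share of episodes that ran to 1000 steps, and the share that reached a reward of 200. The function name `summarise`, the 1000-step cap and the 200-point "solved" score are assumptions made for illustration (the logs suggest the cap, and 200 is the score commonly quoted for LunarLander-v2); the data are the first ten test episodes logged above.

```python
import numpy as np

# (reward, steps) pairs copied from the first ten test episodes logged above
episodes = [
    (-152.385, 208), (-61.721, 1000), (-21.787, 1000), (-119.642, 210),
    (-56.291, 1000), (-104.080, 271), (-38.385, 1000), (-37.637, 1000),
    (-52.232, 1000), (-157.687, 166),
]


def summarise(episodes, step_cap=1000, solved_reward=200.0):
    """Summarise test episodes; step_cap and solved_reward are assumed values."""
    rewards = np.array([reward for reward, _ in episodes], dtype=float)
    steps = np.array([n_steps for _, n_steps in episodes])
    n = len(rewards)
    return {
        "n": n,
        "mean": float(rewards.mean()),
        # Standard error of the mean, using the sample standard deviation
        "sem": float(rewards.std(ddof=1) / np.sqrt(n)),
        # Episodes that hit the step cap without landing or crashing
        "timeout_rate": float((steps >= step_cap).mean()),
        # Episodes that reached the assumed "solved" score
        "solved_rate": float((rewards >= solved_reward).mean()),
    }


summary = summarise(episodes)
print(summary)
```

In the notebook the same function could be fed `zip(history.history['episode_reward'], history.history['nb_steps'])`, assuming the test history also records the step counts under that key.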

## Task 3

Deploy each of the two trained models to the Lunar Lander game, play 200 episodes, and analyse the reward achieved by the models trained using each approach. <br/>

1) The lunar_lander_ml_images_player.py and lunar_lander_rl_player.py Python scripts contain the code to load a saved model and run iterations of the game using that model.<br/>
2) Write a short document (no more than 350 words) in a Jupyter notebook to describe the results of the experiments.<br/>
3) Reflect on the performance of each model. <br/>
4) Reflect on the amount of computation required to train each model.

In [46]:
task2_model_reward_comparisons = dict()

In [59]:
!python lunar_lander_rl_player_model1.py

Using TensorFlow backend.
2019-04-28 21:04:37.446366: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
results
Testing for 200 episodes ...
Episode 1: reward: -104.231, steps: 1000
Episode 2: reward: -35.669, steps: 1000
Episode 3: reward: -50.068, steps: 1000
Episode 4: reward: -36.481, steps: 1000
Episode 5: reward: -102.598, steps: 1000
Episode 6: reward: -69.979, steps: 1000
Episode 7: reward: -89.020, steps: 1000
Episode 8: reward: -68.549, steps: 1000
Episode 9: reward: -98.523, steps: 1000
Episode 10: reward: -79.001, steps: 1000
Episode 11: reward: -51.673, steps: 1000
Episode 12: reward: -80.564, steps: 1000
Episode 13: reward: -56.508, steps: 1000
Episode 14: reward: -130.665, steps: 1000
Episode 15: reward: -79.176, steps: 1000
Episode 16: reward: -51.271, steps: 1000
Episode 17: reward: -79.792, steps: 1000
Episode 18: reward: -92.625, steps: 1000
Episode 19: reward: -100.128, s

In [47]:
task2_model_reward_comparisons["Model 1"] = '-79.07191'
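
The mean above is typed in by hand as a string. As a hedged alternative, the sketch below computes the mean directly from the player script's text output by parsing the `Episode N: reward: X, steps: Y` lines. The names `EPISODE_RE` and `mean_reward_from_log` are hypothetical, and the sample log is a short excerpt of the output shown above, including one truncated line that is skipped.

```python
import re

# Matches complete lines such as "Episode 3: reward: -50.068, steps: 1000"
EPISODE_RE = re.compile(r"^Episode (\d+): reward: (-?\d+(?:\.\d+)?), steps: (\d+)$")


def mean_reward_from_log(log_text):
    """Return (mean reward, number of episodes parsed) from player output."""
    rewards = []
    for line in log_text.splitlines():
        match = EPISODE_RE.match(line.strip())
        if match:  # incomplete or unrelated lines are ignored
            rewards.append(float(match.group(2)))
    if not rewards:
        raise ValueError("no complete episode lines found")
    return sum(rewards) / len(rewards), len(rewards)


sample_log = """Testing for 200 episodes ...
Episode 1: reward: -104.231, steps: 1000
Episode 2: reward: -35.669, steps: 1000
Episode 3: reward: -50.068, steps: 1000
Episode 19: reward: -100.128, s"""

mean_reward, n_parsed = mean_reward_from_log(sample_log)
print(n_parsed, round(mean_reward, 3))
```

In a notebook, the script output could be captured with `output = !python lunar_lander_rl_player_model1.py` and passed in as `"\n".join(output)`, which would store a float rather than a string in `task2_model_reward_comparisons`.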

In [60]:
!python lunar_lander_rl_player_model2.py

Using TensorFlow backend.
2019-04-28 21:13:13.874704: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
results
Testing for 200 episodes ...
Episode 1: reward: -146.097, steps: 1000
Episode 2: reward: -107.581, steps: 1000
Episode 3: reward: -150.205, steps: 677
Episode 4: reward: -167.455, steps: 1000
Episode 5: reward: -143.771, steps: 1000
Episode 6: reward: -137.223, steps: 1000
Episode 7: reward: -119.939, steps: 1000
Episode 8: reward: -132.464, steps: 1000
Episode 9: reward: -145.361, steps: 1000
Episode 10: reward: -112.932, steps: 1000
Episode 11: reward: -82.749, steps: 1000
Episode 12: reward: -127.225, steps: 1000
Episode 13: reward: -144.793, steps: 1000
Episode 14: reward: -102.089, steps: 1000
Episode 15: reward: -116.437, steps: 1000
Episode 16: reward: -143.992, steps: 655
Episode 17: reward: -173.656, steps: 834
Episode 18: reward: -95.258, steps: 1000
Episode 19: reward: -

In [48]:
task2_model_reward_comparisons["Model 2"] = '-127.74672'

In [61]:
!python lunar_lander_rl_player_model3.py

Using TensorFlow backend.
2019-04-28 21:22:11.902123: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
results
Testing for 200 episodes ...
Episode 1: reward: -31.248, steps: 1000
Episode 2: reward: 5.535, steps: 1000
Episode 3: reward: 0.744, steps: 1000
Episode 4: reward: 13.047, steps: 1000
Episode 5: reward: -21.307, steps: 1000
Episode 6: reward: -5.697, steps: 1000
Episode 7: reward: -18.054, steps: 1000
Episode 8: reward: -8.069, steps: 1000
Episode 9: reward: -24.705, steps: 1000
Episode 10: reward: -15.123, steps: 1000
Episode 11: reward: -1.592, steps: 1000
Episode 12: reward: -12.140, steps: 1000
Episode 13: reward: 8.659, steps: 1000
Episode 14: reward: -66.282, steps: 1000
Episode 15: reward: 4.503, steps: 1000
Episode 16: reward: 19.355, steps: 1000
Episode 17: reward: -12.493, steps: 1000
Episode 18: reward: -21.302, steps: 1000
Episode 19: reward: -31.803, steps: 1000
Episod

In [49]:
task2_model_reward_comparisons["Model 3"] = '-26.935395'

In [62]:
!python lunar_lander_rl_player_model4.py

Using TensorFlow backend.
2019-04-28 21:33:08.561394: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
results
Testing for 200 episodes ...
Episode 1: reward: 93.794, steps: 1000
Episode 2: reward: 125.313, steps: 1000
Episode 3: reward: -215.775, steps: 180
Episode 4: reward: 127.415, steps: 1000
Episode 5: reward: 115.491, steps: 1000
Episode 6: reward: 156.143, steps: 1000
Episode 7: reward: 86.420, steps: 1000
Episode 8: reward: 119.398, steps: 1000
Episode 9: reward: 209.597, steps: 618
Episode 10: reward: 227.445, steps: 450
Episode 11: reward: 123.056, steps: 1000
Episode 12: reward: -251.392, steps: 202
Episode 13: reward: 106.589, steps: 1000
Episode 14: reward: 104.477, steps: 1000
Episode 15: reward: 105.321, steps: 1000
Episode 16: reward: 75.263, steps: 1000
Episode 17: reward: 97.711, steps: 1000
Episode 18: reward: 71.227, steps: 1000
Episode 19: reward: 70.100, steps: 1000
E

In [50]:
task2_model_reward_comparisons["Model 4"] = '91.990245'

In [63]:
!python lunar_lander_rl_player_model5.py

Using TensorFlow backend.
2019-04-28 21:41:12.489404: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
results
Testing for 200 episodes ...
Episode 1: reward: -52.496, steps: 1000
Episode 2: reward: -58.191, steps: 298
Episode 3: reward: -129.416, steps: 202
Episode 4: reward: -39.778, steps: 1000
Episode 5: reward: -96.629, steps: 337
Episode 6: reward: -77.787, steps: 1000
Episode 7: reward: -33.579, steps: 1000
Episode 8: reward: 61.435, steps: 863
Episode 9: reward: -147.273, steps: 199
Episode 10: reward: -58.201, steps: 1000
Episode 11: reward: -131.636, steps: 228
Episode 12: reward: -51.474, steps: 1000
Episode 13: reward: -19.968, steps: 1000
Episode 14: reward: -34.425, steps: 1000
Episode 15: reward: -30.228, steps: 1000
Episode 16: reward: -124.573, steps: 294
Episode 17: reward: -54.870, steps: 1000
Episode 18: reward: -124.986, steps: 204
Episode 19: reward: -31.925, steps: 10

In [51]:
task2_model_reward_comparisons["Model 5"] = '-61.32903500'

In [52]:
rl_model_reward_comparisons

{'Model 1': -76.05800147872863,
 'Model 2': -125.63024379728482,
 'Model 3': -23.049652860423006,
 'Model 4': 87.6691901914162,
 'Model 5': -75.87643170241438}

In [53]:
rl_model_time_comparisons 

{'Model 1': 639.8961799144745,
 'Model 2': 683.990748167038,
 'Model 3': 715.8376429080963,
 'Model 4': 769.9784548282623,
 'Model 5': 772.2631468772888}

In [54]:
task2_model_reward_comparisons

{'Model 1': '-79.07191',
 'Model 2': '-127.74672',
 'Model 3': '-26.935395',
 'Model 4': '91.990245',
 'Model 5': '-61.32903500'}
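
To compare the three result dictionaries side by side, they can be lined up in one table. The sketch below does this with the values printed above, converting the hand-entered strings to floats and adding the gap between the 200-episode deployment reward and the 50-episode test reward. The helper `build_rows` and the column names are hypothetical.

```python
# Values copied from the notebook outputs above
test_50 = {
    "Model 1": -76.05800147872863,
    "Model 2": -125.63024379728482,
    "Model 3": -23.049652860423006,
    "Model 4": 87.6691901914162,
    "Model 5": -75.87643170241438,
}
train_seconds = {
    "Model 1": 639.8961799144745,
    "Model 2": 683.990748167038,
    "Model 3": 715.8376429080963,
    "Model 4": 769.9784548282623,
    "Model 5": 772.2631468772888,
}
deploy_200 = {
    "Model 1": "-79.07191",
    "Model 2": "-127.74672",
    "Model 3": "-26.935395",
    "Model 4": "91.990245",
    "Model 5": "-61.32903500",
}


def build_rows(test_rewards, deploy_rewards, times):
    """Combine the three dictionaries into one row per model."""
    rows = []
    for name in sorted(test_rewards):
        deployed = float(deploy_rewards[name])  # stored as strings above
        rows.append({
            "model": name,
            "test_50": test_rewards[name],
            "deploy_200": deployed,
            "gap": deployed - test_rewards[name],
            "train_s": times[name],
        })
    return rows


rows = build_rows(test_50, deploy_200, train_seconds)
print(f"{'model':<8}{'test_50':>10}{'deploy_200':>12}{'gap':>8}{'train_s':>10}")
for r in rows:
    print(f"{r['model']:<8}{r['test_50']:>10.2f}{r['deploy_200']:>12.2f}"
          f"{r['gap']:>8.2f}{r['train_s']:>10.1f}")

# Model with the highest mean reward over the 200 deployment episodes
best = max(rows, key=lambda r: r["deploy_200"])["model"]
print("best:", best)
```

With these figures, Model 4 is the only model with a positive mean reward in both evaluations, and the models rank in the same order (4, 3, 5, 1, 2) in the 50-episode test and the 200-episode deployment.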

In [55]:
# Order: 50-episode test rewards, training times (seconds), 200-episode deployment rewards
dict_list = [rl_model_reward_comparisons, rl_model_time_comparisons, task2_model_reward_comparisons]

In [57]:
import json

In [58]:
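filename = 'part2_model_results.txt'

# Save the three result dictionaries as a single JSON list
with open(filename, 'w') as fd:
    json.dump(dict_list, fd)

# Read the file back to confirm the results were written correctly
with open(filename, 'r') as fd:
    print(json.load(fd))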

[{'Model 1': -76.05800147872863, 'Model 2': -125.63024379728482, 'Model 3': -23.049652860423006, 'Model 4': 87.6691901914162, 'Model 5': -75.87643170241438}, {'Model 1': 639.8961799144745, 'Model 2': 683.990748167038, 'Model 3': 715.8376429080963, 'Model 4': 769.9784548282623, 'Model 5': 772.2631468772888}, {'Model 1': '-79.07191', 'Model 2': '-127.74672', 'Model 3': '-26.935395', 'Model 4': '91.990245', 'Model 5': '-61.32903500'}]
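
The saved file is a bare list, so the meaning of each dictionary depends on its position. A possible alternative, sketched below under assumed key names and an assumed file name, stores the results in a dictionary keyed by what each entry measures and keeps the deployment reward as a float. Only the Model 4 values from above are used, to keep the example short.

```python
import json
import os
import tempfile

# Key names are hypothetical labels; values are the Model 4 results shown above
results = {
    "test_reward_50_episodes": {"Model 4": 87.6691901914162},
    "train_seconds": {"Model 4": 769.9784548282623},
    "deploy_reward_200_episodes": {"Model 4": float("91.990245")},
}

# Write to a temporary directory so the sketch does not overwrite real results
path = os.path.join(tempfile.mkdtemp(), "part2_model_results.json")
with open(path, "w") as fd:
    json.dump(results, fd, indent=2)

# Read the file back and look a value up by name instead of by list position
with open(path) as fd:
    loaded = json.load(fd)

print(loaded["deploy_reward_200_episodes"]["Model 4"])
```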
