# Keras-RL DQN Model


First, we import all necessary packages:

In [1]:
import time  # to reduce the game speed when playing manually

import gym  # Contains the game we want to play
from pyglet.window import key  # for manual playing

# import necessary blocks from keras to build the Deep Learning backbone of our agent
from tensorflow.keras.models import Sequential  # To compose multiple Layers
from tensorflow.keras.layers import Dense  # Fully-Connected layer
from tensorflow.keras.layers import Activation  # Activation functions
from tensorflow.keras.layers import Flatten  # Flatten function

from tensorflow.keras.optimizers import Adam  # Adam optimizer

# Now the keras-rl2 agent. Don't be confused: the package is imported as rl, not keras-rl

from rl.agents.dqn import DQNAgent  # Use the basic Deep-Q-Network agent

In [2]:
import wandb
from wandb.keras import WandbCallback

wandb.init(config={"hyper": "parameter"})

wandb: Currently logged in as: skyfall-blue (use `wandb login --relogin` to force relogin)


Now we will create the environment:

In [2]:
env_name = ENV_NAME = 'LunarLander-v2'  # https://gym.openai.com/envs/LunarLander-v2/
env = gym.make(env_name)  # create the environment
nb_actions = env.action_space.n  # get the number of possible actions
NUMBER_STEPS = 150000

Let's watch how the game looks when choosing random actions:

In [5]:
env.reset()  # reset the environment to the initial state
for _ in range(200):  # play for max 200 iterations
    env.render(mode="human")  # render the current game state on your screen
    random_action = env.action_space.sample()  # choose a random action
    env.step(random_action)  # execute that action
env.close()  # close the environment

In [6]:
action = 0
def key_press(k, mod):
    '''
    This function gets the key press for gym
    '''
    global action
    if k == key.A:  # fly left
        action = 3
    elif k == key.D:  # fly right
        action = 1
    elif k == key.W:  # fly up
        action = 2
    elif k == key.S:  # do nothing
        action = 0

env.reset()
rewards = 0
for _ in range(1000):
    env.render(mode="human")
    env.viewer.window.on_key_press = key_press  # update the key press
    observation, reward, done, info = env.step(action)
    rewards += 1  # counts the steps survived, not the environment reward
    if done:
        print(f"You got {rewards} points!")
        break
    time.sleep(0.05)  # reduce speed a little bit
env.close()

You got 188 points!


**TASK: Create the Neural Network for your Deep-Q-Agent**
Take a look at the size of the action space and the size of the observation space.
You are free to choose any architecture you want!
Hint: It already works with three layers, each having 64 neurons.

In [7]:
model = Sequential()
# https://keras.io/api/layers/reshaping_layers/flatten/
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))

model.add(Dense(64))
model.add(Activation('relu'))

model.add(Dense(64))
model.add(Activation('relu'))

model.add(Dense(32))
model.add(Activation('relu'))

model.add(Dense(nb_actions))
model.add(Activation('linear'))

print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 8)                 0         
_________________________________________________________________
dense (Dense)                (None, 64)                576       
_________________________________________________________________
activation (Activation)      (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
activation_1 (Activation)    (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 32)                2080      
_________________________________________________________________
activation_2 (Activation)    (None, 32)                0
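
The parameter counts in the summary can be checked by hand: a fully connected layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus `n_units` biases. The following is a minimal sketch of that arithmetic. The helper `dense_params` is a hypothetical name introduced here, and the output width of 4 is an assumption based on LunarLander-v2 having four discrete actions.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units


# Layer widths of the network above: 8 flattened inputs, then 64, 64 and 32 units.
# The final width of 4 is an assumption (one Q-value per LunarLander-v2 action).
layer_sizes = [8, 64, 64, 32, 4]

# Pair each layer's input width with its output width and count the parameters.
param_counts = [dense_params(n_in, n_out)
                for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(param_counts)

print(param_counts)
print(total_params)
```

The first three values correspond to the `dense`, `dense_1` and `dense_2` rows of the summary; the activation layers add no parameters.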

Let's create the DQN agent from keras-rl.
For this setting, the agent takes the following parameters:

1. model = The model
2. nb_actions = The number of actions (4 in this case)
3. memory = The action replay memory. You can choose between *SequentialMemory()* and *EpisodeParameterMemory()*; the latter is only used by one RL agent, called CEM
4. nb_steps_warmup = The number of steps to run before training starts; used to fill the memory
5. target_model_update = How often the target model is updated
6. policy = The action selection policy. You can choose between *LinearAnnealedPolicy()*, *SoftmaxPolicy()*, *EpsGreedyQPolicy()*, *GreedyQPolicy()*, *BoltzmannQPolicy()*, *MaxBoltzmannQPolicy()* and *BoltzmannGumbelQPolicy()*. We use all of them in the next notebooks, but feel free to try them out and see which works best here

There are some more parameters you can pass to the DQN agent. Feel free to explore them; we will also look at them together in the remaining notebooks.

Here we initialize the circular buffer with a limit of 50000 and a window length of 1.
The window length is the number of consecutive observations that are stacked to form one state.
This will be demonstrated in the next lecture, when we start dealing with images.

In [8]:
from rl.memory import SequentialMemory  # Sequential memory for storing observations (optimized circular buffer)

memory = SequentialMemory(limit=50000, window_length=1)
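
To illustrate what `limit` and `window_length` mean, here is a minimal sketch of a circular buffer built on `collections.deque`. The class `TinyReplayBuffer` is a hypothetical stand-in and not the keras-rl implementation: for brevity it stores only observations, whereas a real replay memory also keeps actions, rewards and terminal flags.

```python
import random
from collections import deque


class TinyReplayBuffer:
    """Hypothetical stand-in for a replay memory (not the keras-rl class)."""

    def __init__(self, limit, window_length=1):
        self.window_length = window_length
        # A deque with maxlen silently drops its oldest entry once it is full.
        self.observations = deque(maxlen=limit)

    def append(self, observation):
        self.observations.append(observation)

    def recent_state(self):
        """Return the last `window_length` observations as one state."""
        return list(self.observations)[-self.window_length:]

    def sample(self, batch_size):
        """Draw a uniform random batch of stored observations."""
        return random.sample(list(self.observations), batch_size)


buffer = TinyReplayBuffer(limit=3, window_length=2)
for obs in range(5):  # observations 0 and 1 are overwritten
    buffer.append(obs)

print(list(buffer.observations))
print(buffer.recent_state())
```

With `window_length=1`, as used in this notebook, each state is a single observation.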


Then we define the action selection policy: <br />
We use *LinearAnnealedPolicy* to perform the epsilon-greedy strategy with a decaying epsilon. <br />
*LinearAnnealedPolicy* accepts an action selection policy, the maximal and minimal values of the annealed attribute and a number of steps, and uses them to create a dynamic policy. <br/>
The minimal value epsilon can reach during training is 0.1.<br />
For evaluation (e.g. running the agent) it is fixed to 0.


In [9]:
# LinearAnnealedPolicy decays epsilon linearly for the epsilon-greedy strategy
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy

policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), 
                              attr='eps',
                              value_max=1.,
                              value_min=.1,
                              value_test=0,
                              nb_steps=NUMBER_STEPS) 
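
The decay schedule configured above can be sketched in plain Python. This is an illustration of a linear schedule under the stated settings (start at 1.0, end at 0.1, spread over 150000 steps) and not the library's own code; `annealed_epsilon` is a hypothetical helper name.

```python
def annealed_epsilon(step, value_max=1.0, value_min=0.1, nb_steps=150000):
    """Linearly decay epsilon from value_max to value_min over nb_steps, then hold."""
    slope = -(value_max - value_min) / nb_steps
    # max() keeps epsilon at value_min once the schedule has run out.
    return max(value_min, value_max + slope * step)


for step in (0, 75000, 150000, 200000):
    print(step, round(annealed_epsilon(step), 3))
```

Early in training almost every action is random, and the share of greedy actions grows steadily until epsilon reaches its floor.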


**TASK: Create the DQNAgent** <br />
Feel free to play with the nb_steps_warmup, target_model_update, batch_size and gamma parameters. <br />
Hint:<br />
You can try *nb_steps_warmup*=100, *target_model_update*=1000, *batch_size*=64 and *gamma*=0.992 as a first guess

In [10]:
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=100,
               target_model_update=1000, policy=policy, batch_size=64, gamma=0.992)
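
To make the role of `gamma` and the target model more concrete, the following sketch computes the one-step Q-learning target used in standard DQN: the reward plus the discounted maximum Q-value that the target model predicts for the next state. It is a textbook illustration, not necessarily what the agent computes internally, since options such as double DQN change how the maximum is chosen. `td_target` and the example Q-values are hypothetical.

```python
def td_target(reward, next_q_values, gamma=0.992, terminal=False):
    """One-step Q-learning target: r + gamma * max_a' Q_target(s', a')."""
    if terminal:
        # No future reward can follow once the episode has ended.
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical Q-values of the target model for the four actions in the next state.
next_q = [2.0, 3.0, 0.5, -1.0]

print(td_target(1.0, next_q))
print(td_target(-100.0, next_q, terminal=True))
```

A `gamma` close to 1 makes the agent weigh rewards far in the future almost as heavily as immediate ones.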

Finally we compile our model with the Adam optimizer and a learning rate of 0.001.<br />
We log the Mean Absolute Error

In [11]:
# Older Keras versions also accept lr; newer ones expect learning_rate
dqn.compile(Adam(learning_rate=0.001), metrics=['mae'])

Now we run the training for 150000 steps. You can set visualize=True if you want to watch your model learning.
Keep in mind that this increases the running time.
The training time is around 30 min, so grab your favorite beverage and stay tuned.


In [9]:
dqn.fit(env, nb_steps=NUMBER_STEPS, visualize=False, verbose=2, callbacks=[WandbCallback()])

Training for 150000 steps ...
     93/150000: episode: 1, duration: 0.133s, episode steps:  93, steps per second: 701, episode reward: -115.098, mean reward: -1.238 [-100.000, 12.663], mean action: 1.559 [0.000, 3.000],  loss: --, mae: --, mean_q: --, mean_eps: --




    210/150000: episode: 2, duration: 1.218s, episode steps: 117, steps per second:  96, episode reward: -171.760, mean reward: -1.468 [-100.000, 19.164], mean action: 1.624 [0.000, 3.000],  loss: 24.544618, mae: 1.193189, mean_q: 1.460490, mean_eps: 0.999070
    294/150000: episode: 3, duration: 0.487s, episode steps:  84, steps per second: 172, episode reward: -80.888, mean reward: -0.963 [-100.000,  7.298], mean action: 1.583 [0.000, 3.000],  loss: 31.907632, mae: 1.597973, mean_q: 1.511593, mean_eps: 0.998491
    414/150000: episode: 4, duration: 0.715s, episode steps: 120, steps per second: 168, episode reward: -379.890, mean reward: -3.166 [-100.000, 110.786], mean action: 1.450 [0.000, 3.000],  loss: 23.614650, mae: 2.098975, mean_q: 2.325129, mean_eps: 0.997879
    506/150000: episode: 5, duration: 0.543s, episode steps:  92, steps per second: 169, episode reward: -331.912, mean reward: -3.608 [-100.000,  5.127], mean action: 1.598 [0.000, 3.000],  loss: 25.103136, mae: 2.50062

   3140/150000: episode: 34, duration: 0.662s, episode steps: 101, steps per second: 153, episode reward: -121.345, mean reward: -1.201 [-100.000, 13.871], mean action: 1.475 [0.000, 3.000],  loss: 18.427108, mae: 5.363812, mean_q: 3.543080, mean_eps: 0.981466
   3278/150000: episode: 35, duration: 0.886s, episode steps: 138, steps per second: 156, episode reward: -154.994, mean reward: -1.123 [-100.000,  7.176], mean action: 1.645 [0.000, 3.000],  loss: 16.396252, mae: 5.380090, mean_q: 3.677482, mean_eps: 0.980749
   3345/150000: episode: 36, duration: 0.487s, episode steps:  67, steps per second: 138, episode reward: -95.707, mean reward: -1.428 [-100.000, 16.914], mean action: 1.478 [0.000, 3.000],  loss: 13.288225, mae: 5.221876, mean_q: 3.907543, mean_eps: 0.980134
   3470/150000: episode: 37, duration: 0.866s, episode steps: 125, steps per second: 144, episode reward: -156.282, mean reward: -1.250 [-100.000,  5.550], mean action: 1.552 [0.000, 3.000],  loss: 18.133352, mae: 5.37

   6163/150000: episode: 66, duration: 0.461s, episode steps:  72, steps per second: 156, episode reward: -50.242, mean reward: -0.698 [-100.000, 11.999], mean action: 1.528 [0.000, 3.000],  loss: 13.400370, mae: 7.743819, mean_q: 5.035287, mean_eps: 0.963241
   6269/150000: episode: 67, duration: 0.717s, episode steps: 106, steps per second: 148, episode reward: -285.082, mean reward: -2.689 [-100.000, 108.083], mean action: 1.500 [0.000, 3.000],  loss: 11.429059, mae: 7.966436, mean_q: 4.650674, mean_eps: 0.962707
   6351/150000: episode: 68, duration: 0.640s, episode steps:  82, steps per second: 128, episode reward: -123.706, mean reward: -1.509 [-100.000,  8.191], mean action: 1.476 [0.000, 3.000],  loss: 12.435406, mae: 8.026945, mean_q: 4.363116, mean_eps: 0.962143
   6488/150000: episode: 69, duration: 1.036s, episode steps: 137, steps per second: 132, episode reward: -211.840, mean reward: -1.546 [-100.000, 14.325], mean action: 1.723 [0.000, 3.000],  loss: 15.239830, mae: 8.0

   9148/150000: episode: 98, duration: 0.491s, episode steps:  73, steps per second: 149, episode reward: -124.938, mean reward: -1.711 [-100.000,  9.914], mean action: 1.493 [0.000, 3.000],  loss: 12.929275, mae: 11.127684, mean_q: 5.931338, mean_eps: 0.945334
   9236/150000: episode: 99, duration: 0.569s, episode steps:  88, steps per second: 155, episode reward: -311.146, mean reward: -3.536 [-100.000,  6.630], mean action: 1.602 [0.000, 3.000],  loss: 12.039179, mae: 11.220533, mean_q: 6.191321, mean_eps: 0.944851
   9304/150000: episode: 100, duration: 0.424s, episode steps:  68, steps per second: 160, episode reward: -137.964, mean reward: -2.029 [-100.000, 15.702], mean action: 1.603 [0.000, 3.000],  loss: 8.473913, mae: 11.102570, mean_q: 5.624313, mean_eps: 0.944383
   9369/150000: episode: 101, duration: 0.408s, episode steps:  65, steps per second: 159, episode reward: -102.909, mean reward: -1.583 [-100.000,  6.942], mean action: 1.800 [0.000, 3.000],  loss: 10.137023, mae:

  11946/150000: episode: 130, duration: 0.496s, episode steps:  72, steps per second: 145, episode reward: -138.816, mean reward: -1.928 [-100.000,  7.761], mean action: 1.597 [0.000, 3.000],  loss: 10.594880, mae: 13.542006, mean_q: 6.519893, mean_eps: 0.928543
  12033/150000: episode: 131, duration: 0.570s, episode steps:  87, steps per second: 153, episode reward: -32.795, mean reward: -0.377 [-100.000, 11.218], mean action: 1.552 [0.000, 3.000],  loss: 12.190500, mae: 14.187948, mean_q: 7.037169, mean_eps: 0.928066
  12137/150000: episode: 132, duration: 0.817s, episode steps: 104, steps per second: 127, episode reward: -187.941, mean reward: -1.807 [-100.000,  1.813], mean action: 1.644 [0.000, 3.000],  loss: 13.383901, mae: 15.008656, mean_q: 6.699350, mean_eps: 0.927493
  12252/150000: episode: 133, duration: 0.843s, episode steps: 115, steps per second: 136, episode reward: -107.796, mean reward: -0.937 [-100.000,  5.788], mean action: 1.391 [0.000, 3.000],  loss: 13.526540, ma

  14895/150000: episode: 162, duration: 0.443s, episode steps:  69, steps per second: 156, episode reward: -85.697, mean reward: -1.242 [-100.000,  9.940], mean action: 1.507 [0.000, 3.000],  loss: 8.922827, mae: 16.574114, mean_q: 7.380854, mean_eps: 0.910840
  14991/150000: episode: 163, duration: 0.637s, episode steps:  96, steps per second: 151, episode reward: -268.709, mean reward: -2.799 [-100.000,  4.555], mean action: 1.635 [0.000, 3.000],  loss: 10.439651, mae: 16.272304, mean_q: 7.679170, mean_eps: 0.910345
  15080/150000: episode: 164, duration: 0.561s, episode steps:  89, steps per second: 159, episode reward: -191.669, mean reward: -2.154 [-100.000, 22.924], mean action: 1.449 [0.000, 3.000],  loss: 11.294942, mae: 17.579012, mean_q: 7.130084, mean_eps: 0.909790
  15178/150000: episode: 165, duration: 0.605s, episode steps:  98, steps per second: 162, episode reward: -112.564, mean reward: -1.149 [-100.000, 13.232], mean action: 1.541 [0.000, 3.000],  loss: 10.623696, mae

  18025/150000: episode: 194, duration: 0.536s, episode steps:  86, steps per second: 160, episode reward: -107.255, mean reward: -1.247 [-100.000, 27.514], mean action: 1.488 [0.000, 3.000],  loss: 6.349089, mae: 18.902823, mean_q: 7.534113, mean_eps: 0.892111
  18141/150000: episode: 195, duration: 0.780s, episode steps: 116, steps per second: 149, episode reward: -71.174, mean reward: -0.614 [-100.000, 12.258], mean action: 1.500 [0.000, 3.000],  loss: 11.313719, mae: 20.082786, mean_q: 7.672297, mean_eps: 0.891505
  18229/150000: episode: 196, duration: 0.557s, episode steps:  88, steps per second: 158, episode reward: -53.625, mean reward: -0.609 [-100.000, 15.754], mean action: 1.500 [0.000, 3.000],  loss: 7.659841, mae: 19.732425, mean_q: 7.785956, mean_eps: 0.890893
  18298/150000: episode: 197, duration: 0.473s, episode steps:  69, steps per second: 146, episode reward: -120.373, mean reward: -1.745 [-100.000,  8.433], mean action: 1.681 [0.000, 3.000],  loss: 10.017442, mae: 

  20877/150000: episode: 226, duration: 0.624s, episode steps:  97, steps per second: 155, episode reward: -123.656, mean reward: -1.275 [-100.000,  5.304], mean action: 1.794 [0.000, 3.000],  loss: 6.823948, mae: 22.218126, mean_q: 9.224854, mean_eps: 0.875032
  20938/150000: episode: 227, duration: 0.384s, episode steps:  61, steps per second: 159, episode reward: -42.857, mean reward: -0.703 [-100.000, 16.644], mean action: 1.525 [0.000, 3.000],  loss: 12.613511, mae: 22.113790, mean_q: 9.168183, mean_eps: 0.874558
  21003/150000: episode: 228, duration: 0.402s, episode steps:  65, steps per second: 162, episode reward: -35.461, mean reward: -0.546 [-100.000, 23.445], mean action: 1.538 [0.000, 3.000],  loss: 7.793853, mae: 22.083455, mean_q: 9.742774, mean_eps: 0.874180
  21084/150000: episode: 229, duration: 0.501s, episode steps:  81, steps per second: 162, episode reward: -101.666, mean reward: -1.255 [-100.000,  8.232], mean action: 1.642 [0.000, 3.000],  loss: 9.019715, mae: 2

  23750/150000: episode: 258, duration: 0.412s, episode steps:  66, steps per second: 160, episode reward: -87.172, mean reward: -1.321 [-100.000, 10.715], mean action: 1.667 [0.000, 3.000],  loss: 8.767964, mae: 23.961454, mean_q: 9.965940, mean_eps: 0.857701
  23823/150000: episode: 259, duration: 0.470s, episode steps:  73, steps per second: 155, episode reward: 45.813, mean reward:  0.628 [-100.000, 131.430], mean action: 1.589 [0.000, 3.000],  loss: 7.190174, mae: 23.789901, mean_q: 10.638375, mean_eps: 0.857284
  23958/150000: episode: 260, duration: 0.876s, episode steps: 135, steps per second: 154, episode reward: -81.745, mean reward: -0.606 [-100.000,  9.963], mean action: 1.496 [0.000, 3.000],  loss: 11.059071, mae: 23.926115, mean_q: 10.773417, mean_eps: 0.856660
  24021/150000: episode: 261, duration: 0.393s, episode steps:  63, steps per second: 160, episode reward: -91.382, mean reward: -1.451 [-100.000,  6.259], mean action: 1.857 [0.000, 3.000],  loss: 8.974356, mae: 2

  26722/150000: episode: 290, duration: 0.475s, episode steps:  71, steps per second: 150, episode reward: -79.581, mean reward: -1.121 [-100.000,  9.054], mean action: 1.592 [0.000, 3.000],  loss: 7.008504, mae: 25.848175, mean_q: 12.372808, mean_eps: 0.839884
  26789/150000: episode: 291, duration: 0.437s, episode steps:  67, steps per second: 153, episode reward: -76.360, mean reward: -1.140 [-100.000,  7.244], mean action: 1.716 [0.000, 3.000],  loss: 7.169850, mae: 26.287747, mean_q: 11.961996, mean_eps: 0.839470
  26867/150000: episode: 292, duration: 0.495s, episode steps:  78, steps per second: 158, episode reward: -107.178, mean reward: -1.374 [-100.000, 12.242], mean action: 1.769 [0.000, 3.000],  loss: 5.651642, mae: 25.713452, mean_q: 12.921466, mean_eps: 0.839035
  26960/150000: episode: 293, duration: 0.609s, episode steps:  93, steps per second: 153, episode reward: -171.800, mean reward: -1.847 [-100.000, 16.518], mean action: 1.355 [0.000, 3.000],  loss: 9.364415, mae:

  29546/150000: episode: 322, duration: 0.524s, episode steps:  83, steps per second: 158, episode reward: -75.365, mean reward: -0.908 [-100.000, 12.023], mean action: 1.518 [0.000, 3.000],  loss: 7.352708, mae: 27.843152, mean_q: 13.862409, mean_eps: 0.822976
  29630/150000: episode: 323, duration: 0.529s, episode steps:  84, steps per second: 159, episode reward: -64.072, mean reward: -0.763 [-100.000,  6.640], mean action: 1.548 [0.000, 3.000],  loss: 12.115000, mae: 27.494587, mean_q: 13.881583, mean_eps: 0.822475
  29722/150000: episode: 324, duration: 0.606s, episode steps:  92, steps per second: 152, episode reward: -56.758, mean reward: -0.617 [-100.000, 13.596], mean action: 1.413 [0.000, 3.000],  loss: 9.325412, mae: 28.197578, mean_q: 14.035832, mean_eps: 0.821947
  29813/150000: episode: 325, duration: 0.603s, episode steps:  91, steps per second: 151, episode reward: -47.530, mean reward: -0.522 [-100.000, 27.732], mean action: 1.593 [0.000, 3.000],  loss: 13.961379, mae:

  32548/150000: episode: 354, duration: 0.629s, episode steps:  98, steps per second: 156, episode reward: -44.067, mean reward: -0.450 [-100.000, 12.967], mean action: 1.745 [0.000, 3.000],  loss: 7.460371, mae: 29.792957, mean_q: 15.851097, mean_eps: 0.805009
  32637/150000: episode: 355, duration: 0.568s, episode steps:  89, steps per second: 157, episode reward: -130.689, mean reward: -1.468 [-100.000,  5.566], mean action: 1.393 [0.000, 3.000],  loss: 8.809092, mae: 29.558242, mean_q: 16.008864, mean_eps: 0.804448
  32738/150000: episode: 356, duration: 0.696s, episode steps: 101, steps per second: 145, episode reward: -75.536, mean reward: -0.748 [-100.000, 14.352], mean action: 1.653 [0.000, 3.000],  loss: 10.884951, mae: 29.421460, mean_q: 16.351891, mean_eps: 0.803878
  32810/150000: episode: 357, duration: 0.467s, episode steps:  72, steps per second: 154, episode reward:  3.460, mean reward:  0.048 [-100.000, 21.374], mean action: 1.625 [0.000, 3.000],  loss: 6.653661, mae: 

  35687/150000: episode: 386, duration: 0.518s, episode steps:  77, steps per second: 149, episode reward: -77.275, mean reward: -1.004 [-100.000, 17.048], mean action: 1.766 [0.000, 3.000],  loss: 6.901076, mae: 30.767627, mean_q: 18.802491, mean_eps: 0.786112
  35795/150000: episode: 387, duration: 0.737s, episode steps: 108, steps per second: 146, episode reward: -112.353, mean reward: -1.040 [-100.000,  7.872], mean action: 1.509 [0.000, 3.000],  loss: 8.161109, mae: 31.306365, mean_q: 17.303164, mean_eps: 0.785557
  35879/150000: episode: 388, duration: 0.575s, episode steps:  84, steps per second: 146, episode reward: -113.543, mean reward: -1.352 [-100.000, 12.096], mean action: 1.667 [0.000, 3.000],  loss: 7.983832, mae: 30.819564, mean_q: 18.267524, mean_eps: 0.784981
  35941/150000: episode: 389, duration: 0.418s, episode steps:  62, steps per second: 148, episode reward: -48.951, mean reward: -0.790 [-100.000, 16.065], mean action: 1.597 [0.000, 3.000],  loss: 5.921978, mae:

  38634/150000: episode: 418, duration: 0.842s, episode steps: 129, steps per second: 153, episode reward: -393.334, mean reward: -3.049 [-100.000, 45.490], mean action: 1.612 [0.000, 3.000],  loss: 7.206176, mae: 32.583365, mean_q: 19.111198, mean_eps: 0.768586
  38716/150000: episode: 419, duration: 0.538s, episode steps:  82, steps per second: 152, episode reward: -100.596, mean reward: -1.227 [-100.000, 35.003], mean action: 1.366 [0.000, 3.000],  loss: 5.882298, mae: 32.444584, mean_q: 19.951000, mean_eps: 0.767953
  38836/150000: episode: 420, duration: 0.832s, episode steps: 120, steps per second: 144, episode reward: -125.941, mean reward: -1.050 [-100.000, 26.796], mean action: 1.667 [0.000, 3.000],  loss: 8.523036, mae: 32.584182, mean_q: 19.579815, mean_eps: 0.767347
  38932/150000: episode: 421, duration: 0.624s, episode steps:  96, steps per second: 154, episode reward: -112.816, mean reward: -1.175 [-100.000,  5.067], mean action: 1.740 [0.000, 3.000],  loss: 7.256391, ma

  41621/150000: episode: 450, duration: 0.573s, episode steps:  82, steps per second: 143, episode reward: -122.527, mean reward: -1.494 [-100.000, 21.356], mean action: 1.134 [0.000, 3.000],  loss: 4.903860, mae: 32.459277, mean_q: 20.114383, mean_eps: 0.750523
  41714/150000: episode: 451, duration: 0.639s, episode steps:  93, steps per second: 145, episode reward: -115.683, mean reward: -1.244 [-100.000, 17.172], mean action: 1.495 [0.000, 3.000],  loss: 8.299766, mae: 32.113703, mean_q: 19.624049, mean_eps: 0.749998
  41843/150000: episode: 452, duration: 0.845s, episode steps: 129, steps per second: 153, episode reward: -42.656, mean reward: -0.331 [-100.000,  7.154], mean action: 1.504 [0.000, 3.000],  loss: 9.779988, mae: 32.354453, mean_q: 20.160950, mean_eps: 0.749332
  41911/150000: episode: 453, duration: 0.457s, episode steps:  68, steps per second: 149, episode reward: -148.433, mean reward: -2.183 [-100.000,  7.564], mean action: 1.412 [0.000, 3.000],  loss: 6.597249, mae

  44514/150000: episode: 482, duration: 0.554s, episode steps:  80, steps per second: 144, episode reward: -27.953, mean reward: -0.349 [-100.000, 15.580], mean action: 1.325 [0.000, 3.000],  loss: 6.861582, mae: 33.649924, mean_q: 19.742610, mean_eps: 0.733159
  44603/150000: episode: 483, duration: 0.585s, episode steps:  89, steps per second: 152, episode reward: -69.632, mean reward: -0.782 [-100.000, 10.537], mean action: 1.753 [0.000, 3.000],  loss: 5.937384, mae: 32.831088, mean_q: 19.363054, mean_eps: 0.732652
  44709/150000: episode: 484, duration: 0.717s, episode steps: 106, steps per second: 148, episode reward: -66.825, mean reward: -0.630 [-100.000, 10.648], mean action: 1.519 [0.000, 3.000],  loss: 8.214511, mae: 33.492190, mean_q: 19.998093, mean_eps: 0.732067
  44819/150000: episode: 485, duration: 0.755s, episode steps: 110, steps per second: 146, episode reward: -31.774, mean reward: -0.289 [-100.000, 12.404], mean action: 1.618 [0.000, 3.000],  loss: 5.992909, mae: 3

  47546/150000: episode: 514, duration: 0.696s, episode steps: 102, steps per second: 147, episode reward: -98.972, mean reward: -0.970 [-100.000, 10.900], mean action: 1.461 [0.000, 3.000],  loss: 7.559752, mae: 33.156245, mean_q: 20.851578, mean_eps: 0.715033
  47634/150000: episode: 515, duration: 0.613s, episode steps:  88, steps per second: 144, episode reward: -115.189, mean reward: -1.309 [-100.000,  6.511], mean action: 1.455 [0.000, 3.000],  loss: 6.945227, mae: 33.188907, mean_q: 20.111664, mean_eps: 0.714463
  47703/150000: episode: 516, duration: 0.469s, episode steps:  69, steps per second: 147, episode reward: -50.480, mean reward: -0.732 [-100.000, 14.107], mean action: 1.652 [0.000, 3.000],  loss: 4.655087, mae: 33.363126, mean_q: 18.590262, mean_eps: 0.713992
  47811/150000: episode: 517, duration: 0.723s, episode steps: 108, steps per second: 149, episode reward: -89.929, mean reward: -0.833 [-100.000, 13.225], mean action: 1.676 [0.000, 3.000],  loss: 10.156859, mae:

  50655/150000: episode: 546, duration: 0.643s, episode steps:  95, steps per second: 148, episode reward: -86.695, mean reward: -0.913 [-100.000,  6.764], mean action: 1.684 [0.000, 3.000],  loss: 5.015145, mae: 33.279986, mean_q: 20.521668, mean_eps: 0.696358
  50751/150000: episode: 547, duration: 0.681s, episode steps:  96, steps per second: 141, episode reward: -106.419, mean reward: -1.109 [-100.000, 15.261], mean action: 1.375 [0.000, 3.000],  loss: 5.898320, mae: 33.909001, mean_q: 20.465735, mean_eps: 0.695785
  50858/150000: episode: 548, duration: 0.726s, episode steps: 107, steps per second: 147, episode reward: -67.228, mean reward: -0.628 [-100.000, 11.775], mean action: 1.505 [0.000, 3.000],  loss: 9.287596, mae: 33.220127, mean_q: 20.833539, mean_eps: 0.695176
  50959/150000: episode: 549, duration: 0.687s, episode steps: 101, steps per second: 147, episode reward: -89.596, mean reward: -0.887 [-100.000, 12.506], mean action: 1.713 [0.000, 3.000],  loss: 6.221144, mae: 

  53816/150000: episode: 578, duration: 0.590s, episode steps:  84, steps per second: 142, episode reward: -21.752, mean reward: -0.259 [-100.000, 17.104], mean action: 1.643 [0.000, 3.000],  loss: 8.515072, mae: 33.722496, mean_q: 19.382689, mean_eps: 0.677359
  53920/150000: episode: 579, duration: 0.701s, episode steps: 104, steps per second: 148, episode reward: -45.806, mean reward: -0.440 [-100.000, 13.315], mean action: 1.663 [0.000, 3.000],  loss: 5.334314, mae: 33.871264, mean_q: 21.374811, mean_eps: 0.676795
  54008/150000: episode: 580, duration: 0.633s, episode steps:  88, steps per second: 139, episode reward: -76.860, mean reward: -0.873 [-100.000,  8.202], mean action: 1.500 [0.000, 3.000],  loss: 7.578843, mae: 33.789845, mean_q: 19.966911, mean_eps: 0.676219
  54133/150000: episode: 581, duration: 0.846s, episode steps: 125, steps per second: 148, episode reward: -57.337, mean reward: -0.459 [-100.000,  7.230], mean action: 1.416 [0.000, 3.000],  loss: 7.051161, mae: 3

  57141/150000: episode: 610, duration: 0.694s, episode steps: 100, steps per second: 144, episode reward: -67.699, mean reward: -0.677 [-100.000, 22.509], mean action: 1.650 [0.000, 3.000],  loss: 6.706931, mae: 34.437203, mean_q: 18.912294, mean_eps: 0.657457
  57291/150000: episode: 611, duration: 1.002s, episode steps: 150, steps per second: 150, episode reward: -240.447, mean reward: -1.603 [-100.000, 25.560], mean action: 1.447 [0.000, 3.000],  loss: 5.803976, mae: 33.859607, mean_q: 19.865143, mean_eps: 0.656707
  57428/150000: episode: 612, duration: 0.982s, episode steps: 137, steps per second: 140, episode reward: -91.274, mean reward: -0.666 [-100.000, 10.310], mean action: 1.708 [0.000, 3.000],  loss: 7.221682, mae: 34.414675, mean_q: 19.608450, mean_eps: 0.655846
  57527/150000: episode: 613, duration: 0.670s, episode steps:  99, steps per second: 148, episode reward: -101.215, mean reward: -1.022 [-100.000,  9.722], mean action: 1.576 [0.000, 3.000],  loss: 7.453058, mae:

  60340/150000: episode: 642, duration: 0.803s, episode steps: 119, steps per second: 148, episode reward: -150.849, mean reward: -1.268 [-100.000, 11.264], mean action: 1.840 [0.000, 3.000],  loss: 10.729680, mae: 34.247815, mean_q: 19.123897, mean_eps: 0.638320
  60469/150000: episode: 643, duration: 0.873s, episode steps: 129, steps per second: 148, episode reward: -79.156, mean reward: -0.614 [-100.000,  6.668], mean action: 1.620 [0.000, 3.000],  loss: 8.032526, mae: 34.591223, mean_q: 18.909427, mean_eps: 0.637576
  60579/150000: episode: 644, duration: 0.790s, episode steps: 110, steps per second: 139, episode reward: -130.743, mean reward: -1.189 [-100.000,  3.967], mean action: 1.527 [0.000, 3.000],  loss: 4.421055, mae: 34.144685, mean_q: 18.508574, mean_eps: 0.636859
  60700/150000: episode: 645, duration: 0.814s, episode steps: 121, steps per second: 149, episode reward: 30.014, mean reward:  0.248 [-100.000, 59.733], mean action: 1.612 [0.000, 3.000],  loss: 9.360707, mae:

  63550/150000: episode: 674, duration: 0.868s, episode steps: 123, steps per second: 142, episode reward: -11.828, mean reward: -0.096 [-100.000, 16.708], mean action: 1.659 [0.000, 3.000],  loss: 4.127840, mae: 34.523863, mean_q: 18.345703, mean_eps: 0.619072
  63639/150000: episode: 675, duration: 0.609s, episode steps:  89, steps per second: 146, episode reward: -41.734, mean reward: -0.469 [-100.000, 13.773], mean action: 1.438 [0.000, 3.000],  loss: 5.959537, mae: 34.586038, mean_q: 20.143349, mean_eps: 0.618436
  63730/150000: episode: 676, duration: 0.660s, episode steps:  91, steps per second: 138, episode reward: -67.095, mean reward: -0.737 [-100.000, 10.808], mean action: 1.505 [0.000, 3.000],  loss: 7.065000, mae: 34.901721, mean_q: 19.475058, mean_eps: 0.617896
  63809/150000: episode: 677, duration: 0.583s, episode steps:  79, steps per second: 136, episode reward: -94.313, mean reward: -1.194 [-100.000, 13.706], mean action: 1.481 [0.000, 3.000],  loss: 6.720678, mae: 3

  66915/150000: episode: 706, duration: 0.997s, episode steps: 146, steps per second: 147, episode reward: -25.405, mean reward: -0.174 [-100.000, 22.052], mean action: 1.644 [0.000, 3.000],  loss: 3.563983, mae: 35.237519, mean_q: 19.668299, mean_eps: 0.598951
  67050/150000: episode: 707, duration: 0.926s, episode steps: 135, steps per second: 146, episode reward: -37.293, mean reward: -0.276 [-100.000, 21.379], mean action: 1.681 [0.000, 3.000],  loss: 8.543249, mae: 35.557248, mean_q: 20.605136, mean_eps: 0.598108
  67153/150000: episode: 708, duration: 0.709s, episode steps: 103, steps per second: 145, episode reward: -3.719, mean reward: -0.036 [-100.000, 18.073], mean action: 1.757 [0.000, 3.000],  loss: 8.119633, mae: 35.761494, mean_q: 21.876140, mean_eps: 0.597394
  67267/150000: episode: 709, duration: 0.821s, episode steps: 114, steps per second: 139, episode reward: -40.720, mean reward: -0.357 [-100.000, 13.240], mean action: 1.632 [0.000, 3.000],  loss: 6.414598, mae: 35

  70352/150000: episode: 738, duration: 1.108s, episode steps: 153, steps per second: 138, episode reward: -49.272, mean reward: -0.322 [-100.000,  9.479], mean action: 1.569 [0.000, 3.000],  loss: 6.111862, mae: 36.745453, mean_q: 22.146757, mean_eps: 0.578350
  70424/150000: episode: 739, duration: 0.490s, episode steps:  72, steps per second: 147, episode reward: -45.653, mean reward: -0.634 [-100.000, 16.357], mean action: 1.792 [0.000, 3.000],  loss: 3.565128, mae: 36.613608, mean_q: 22.023906, mean_eps: 0.577675
  70573/150000: episode: 740, duration: 0.994s, episode steps: 149, steps per second: 150, episode reward: -13.732, mean reward: -0.092 [-100.000, 14.822], mean action: 1.799 [0.000, 3.000],  loss: 4.018649, mae: 36.838660, mean_q: 21.461408, mean_eps: 0.577012
  70703/150000: episode: 741, duration: 0.912s, episode steps: 130, steps per second: 143, episode reward: -19.721, mean reward: -0.152 [-100.000, 17.865], mean action: 1.577 [0.000, 3.000],  loss: 5.425392, mae: 3

  74100/150000: episode: 770, duration: 0.618s, episode steps:  85, steps per second: 137, episode reward: -63.968, mean reward: -0.753 [-100.000, 12.973], mean action: 1.553 [0.000, 3.000],  loss: 6.437460, mae: 36.742774, mean_q: 21.593810, mean_eps: 0.555658
  74222/150000: episode: 771, duration: 0.869s, episode steps: 122, steps per second: 140, episode reward: -71.738, mean reward: -0.588 [-100.000,  7.041], mean action: 1.697 [0.000, 3.000],  loss: 5.166299, mae: 36.440906, mean_q: 21.334916, mean_eps: 0.555037
  74322/150000: episode: 772, duration: 0.674s, episode steps: 100, steps per second: 148, episode reward: -57.358, mean reward: -0.574 [-100.000, 15.490], mean action: 1.680 [0.000, 3.000],  loss: 4.642952, mae: 36.725521, mean_q: 21.432260, mean_eps: 0.554371
  74437/150000: episode: 773, duration: 0.776s, episode steps: 115, steps per second: 148, episode reward: -56.778, mean reward: -0.494 [-100.000, 18.160], mean action: 1.539 [0.000, 3.000],  loss: 5.619163, mae: 3

  79641/150000: episode: 802, duration: 1.065s, episode steps: 153, steps per second: 144, episode reward: -186.349, mean reward: -1.218 [-100.000,  8.672], mean action: 1.673 [0.000, 3.000],  loss: 4.303985, mae: 35.575775, mean_q: 20.611393, mean_eps: 0.522616
  79775/150000: episode: 803, duration: 0.912s, episode steps: 134, steps per second: 147, episode reward: -10.044, mean reward: -0.075 [-100.000, 15.200], mean action: 1.552 [0.000, 3.000],  loss: 6.506110, mae: 35.953140, mean_q: 20.787103, mean_eps: 0.521755
  79933/150000: episode: 804, duration: 1.131s, episode steps: 158, steps per second: 140, episode reward: -151.013, mean reward: -0.956 [-100.000, 41.239], mean action: 1.797 [0.000, 3.000],  loss: 4.691068, mae: 35.455677, mean_q: 21.312353, mean_eps: 0.520879
  80036/150000: episode: 805, duration: 0.691s, episode steps: 103, steps per second: 149, episode reward: -39.704, mean reward: -0.385 [-100.000, 15.872], mean action: 1.689 [0.000, 3.000],  loss: 5.614282, mae:

  85845/150000: episode: 834, duration: 0.799s, episode steps: 113, steps per second: 141, episode reward: -23.486, mean reward: -0.208 [-100.000, 21.832], mean action: 1.717 [0.000, 3.000],  loss: 8.117824, mae: 35.277742, mean_q: 21.396441, mean_eps: 0.485272
  85926/150000: episode: 835, duration: 0.552s, episode steps:  81, steps per second: 147, episode reward: -36.282, mean reward: -0.448 [-100.000, 10.965], mean action: 1.840 [0.000, 3.000],  loss: 9.446624, mae: 35.244554, mean_q: 21.342148, mean_eps: 0.484690
  86163/150000: episode: 836, duration: 1.778s, episode steps: 237, steps per second: 133, episode reward: -73.835, mean reward: -0.312 [-100.000, 15.453], mean action: 1.759 [0.000, 3.000],  loss: 7.407952, mae: 34.959829, mean_q: 21.621873, mean_eps: 0.483736
  86262/150000: episode: 837, duration: 0.657s, episode steps:  99, steps per second: 151, episode reward: -163.675, mean reward: -1.653 [-100.000,  1.490], mean action: 1.586 [0.000, 3.000],  loss: 8.212032, mae: 

  99009/150000: episode: 866, duration: 1.464s, episode steps: 209, steps per second: 143, episode reward: -44.859, mean reward: -0.215 [-100.000, 52.220], mean action: 1.579 [0.000, 3.000],  loss: 7.151001, mae: 30.051914, mean_q: 23.205500, mean_eps: 0.406576
  99307/150000: episode: 867, duration: 2.098s, episode steps: 298, steps per second: 142, episode reward: -166.036, mean reward: -0.557 [-100.000, 18.946], mean action: 1.762 [0.000, 3.000],  loss: 7.398812, mae: 30.582721, mean_q: 24.413634, mean_eps: 0.405055
  99463/150000: episode: 868, duration: 1.127s, episode steps: 156, steps per second: 138, episode reward: -55.357, mean reward: -0.355 [-100.000,  9.588], mean action: 1.641 [0.000, 3.000],  loss: 7.815472, mae: 30.394103, mean_q: 23.955935, mean_eps: 0.403693
  99688/150000: episode: 869, duration: 1.579s, episode steps: 225, steps per second: 142, episode reward: -81.323, mean reward: -0.361 [-100.000, 19.505], mean action: 1.711 [0.000, 3.000],  loss: 8.012784, mae: 

 119360/150000: episode: 898, duration: 7.605s, episode steps: 1000, steps per second: 131, episode reward: 153.488, mean reward:  0.153 [-24.100, 22.827], mean action: 1.089 [0.000, 3.000],  loss: 9.017034, mae: 23.155876, mean_q: 25.286232, mean_eps: 0.286843
 119901/150000: episode: 899, duration: 4.095s, episode steps: 541, steps per second: 132, episode reward: -204.692, mean reward: -0.378 [-100.000, 16.156], mean action: 1.750 [0.000, 3.000],  loss: 9.133801, mae: 22.828125, mean_q: 25.539530, mean_eps: 0.282220
 120901/150000: episode: 900, duration: 7.460s, episode steps: 1000, steps per second: 134, episode reward: 58.004, mean reward:  0.058 [-21.500, 23.964], mean action: 1.314 [0.000, 3.000],  loss: 8.508068, mae: 22.530595, mean_q: 25.537361, mean_eps: 0.277597
 121901/150000: episode: 901, duration: 7.886s, episode steps: 1000, steps per second: 127, episode reward: -11.128, mean reward: -0.011 [-21.814, 14.587], mean action: 1.711 [0.000, 3.000],  loss: 8.809993, mae: 2

 143771/150000: episode: 930, duration: 2.478s, episode steps: 345, steps per second: 139, episode reward: 236.485, mean reward:  0.685 [-14.019, 100.000], mean action: 1.522 [0.000, 3.000],  loss: 6.165199, mae: 20.828962, mean_q: 27.483780, mean_eps: 0.138412
 144713/150000: episode: 931, duration: 7.255s, episode steps: 942, steps per second: 130, episode reward: 140.280, mean reward:  0.149 [-20.089, 100.000], mean action: 1.304 [0.000, 3.000],  loss: 4.109369, mae: 20.768226, mean_q: 27.566711, mean_eps: 0.134551
 145456/150000: episode: 932, duration: 5.700s, episode steps: 743, steps per second: 130, episode reward: 174.815, mean reward:  0.235 [-20.721, 100.000], mean action: 1.857 [0.000, 3.000],  loss: 4.049153, mae: 20.871992, mean_q: 27.780278, mean_eps: 0.129496
 145878/150000: episode: 933, duration: 3.054s, episode steps: 422, steps per second: 138, episode reward: 254.777, mean reward:  0.604 [-10.768, 100.000], mean action: 1.438 [0.000, 3.000],  loss: 4.593402, mae: 2

<tensorflow.python.keras.callbacks.History at 0x1ab98598130>
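
The `mean_eps` column in the log above falls steadily as training progresses. The sketch below reproduces those values under the assumption that epsilon is annealed linearly from 1.0 to 0.1 over the 150000 training steps. These three numbers are inferred from the log, not read from the policy definition, and the helper names `linear_eps` and `episode_mean_eps` are made up for this illustration.

```python
def linear_eps(step, eps_max=1.0, eps_min=0.1, nb_steps=150_000):
    """Epsilon at a given global step, assuming linear annealing (inferred values)."""
    slope = -(eps_max - eps_min) / nb_steps
    return max(eps_min, slope * step + eps_max)


def episode_mean_eps(end_step, episode_steps, **kwargs):
    """Average epsilon over the steps of one episode that ended at `end_step`."""
    start = end_step - episode_steps
    total = sum(linear_eps(s, **kwargs) for s in range(start, end_step))
    return total / episode_steps


# Episode 708 ended at step 67153 after 103 steps; the log reports mean_eps: 0.597394
print(round(episode_mean_eps(67153, 103), 6))
# Episode 930 ended at step 143771 after 345 steps; the log reports mean_eps: 0.138412
print(round(episode_mean_eps(143771, 345), 6))
```

If the assumption holds, the agent still picks a random action in roughly 14% of its steps near the end of training, so the logged training rewards understate what the greedy policy achieves in testing.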

Wow! After only a few minutes of training, we achieve great results: the episode rewards near the end of the log climb to roughly 140 to 255.
The reason is that keras-rl implements many optimization strategies (e.g. an optimized replay buffer), which lead to much faster convergence than our hand-written DQN.
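
To make the replay-buffer idea concrete, here is a minimal sketch of a fixed-size buffer with uniform sampling. It is not keras-rl's implementation, whose memory classes differ in detail. The names `Transition` and `ReplayBuffer` are hypothetical and only illustrate the principle: keep the most recent transitions and train on random mini-batches drawn from them.

```python
import random
from collections import deque, namedtuple

# One stored experience; the field names are chosen for this sketch only.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size buffer that drops the oldest transition once it is full."""

    def __init__(self, capacity, seed=None):
        self._storage = deque(maxlen=capacity)  # deque evicts the oldest entry itself
        self._rng = random.Random(seed)

    def append(self, state, action, reward, next_state, done):
        self._storage.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the correlation
        # between consecutive steps of the same episode.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.append(state=step, action=0, reward=0.0, next_state=step + 1, done=False)

batch = buffer.sample(batch_size=2)  # two distinct transitions from the last three
```

Copying the deque into a list on every `sample` call keeps the sketch short but costs time linear in the buffer size; an optimized buffer would avoid that copy, for example by indexing into a preallocated array.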

In [20]:
# After training is done, we save the final weights.
dqn.save_weights('OKv1_LunarLander_2x150000steps_y0,992_a0,001.h5f', overwrite=True)

In [None]:
# Evaluate our algorithm for 200 episodes without rendering and log the results to wandb.
dqn.test(env, nb_episodes=200, visualize=False, callbacks=[WandbCallback()])
env.close()

In [12]:
dqn.load_weights('OKv1_LunarLander_2x150000steps_y0,992_a0,001.h5f')

In [14]:
# Finally, evaluate the loaded weights for 5 episodes with rendering enabled.
dqn.test(env, nb_episodes=5, visualize=True)
env.close()

Testing for 5 episodes ...
Episode 1: reward: 255.760, steps: 199
Episode 2: reward: 228.585, steps: 346
Episode 3: reward: 246.992, steps: 190
Episode 4: reward: 270.068, steps: 192
Episode 5: reward: 254.709, steps: 303
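
As a quick sanity check, the five test rewards above can be averaged and compared with the score of 200 that is commonly cited as "solved" for LunarLander. The snippet below is a small sketch that parses the printed output. The variable names and the `SOLVED_THRESHOLD` constant are introduced only for this illustration.

```python
import re
from statistics import mean

# Test output copied from the cell above.
test_output = """
Episode 1: reward: 255.760, steps: 199
Episode 2: reward: 228.585, steps: 346
Episode 3: reward: 246.992, steps: 190
Episode 4: reward: 270.068, steps: 192
Episode 5: reward: 254.709, steps: 303
"""

SOLVED_THRESHOLD = 200.0  # commonly cited target score for LunarLander (assumption)

# Pull every "reward: <number>" value out of the text.
rewards = [float(value) for value in re.findall(r"reward: (-?\d+\.\d+)", test_output)]
mean_reward = mean(rewards)

print(f"mean reward over {len(rewards)} episodes: {mean_reward:.3f}")
print("above threshold:", mean_reward >= SOLVED_THRESHOLD)
```

Five episodes are a small sample, so the 200-episode test run above gives a more reliable estimate of the agent's average performance.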
