# Keras-RL DQN Model


First, we import all necessary packages:

In [2]:
import time  # to reduce the game speed when playing manually

import gym  # Contains the game we want to play
from pyglet.window import key  # for manual playing

# import necessary blocks from keras to build the Deep Learning backbone of our agent
from tensorflow.keras.models import Sequential  # To compose multiple Layers
from tensorflow.keras.layers import Dense  # Fully-Connected layer
from tensorflow.keras.layers import Activation  # Activation functions
from tensorflow.keras.layers import Flatten  # Flatten function

from tensorflow.keras.optimizers import Adam  # Adam optimizer

# Now the keras-rl2 agent. Don't be confused: the package is imported as rl, not keras-rl

from rl.agents.dqn import DQNAgent  # Use the basic Deep-Q-Network agent

In [3]:
import wandb
from wandb.keras import WandbCallback

wandb.init(config={"hyper": "parameter"})

[34m[1mwandb[0m: Currently logged in as: [33mskyfall-blue[0m (use `wandb login --relogin` to force relogin)


Now we will create the environment:

In [4]:
env_name = ENV_NAME = 'LunarLander-v2'  # https://gym.openai.com/envs/LunarLander-v2/
env = gym.make(env_name)  # create the environment
nb_actions = env.action_space.n  # get the number of possible actions
NUMBER_STEPS = 150000

Let's watch how the game looks when choosing random actions:

In [None]:
env.reset()  # reset the environment to the initial state
for _ in range(200):  # play for max 200 iterations
    env.render(mode="human")  # render the current game state on your screen
    random_action = env.action_space.sample()  # choose a random action
    _, _, done, _ = env.step(random_action)  # execute that action
    if done:  # start a new episode once the current one has ended
        env.reset()
env.close()  # close the environment

In [None]:
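action = 0  # 0 = do nothing
def key_press(k, mod):
    '''
    Key press handler for gym: maps the arrow keys to the four
    LunarLander-v2 actions and stores the selected one.
    '''
    global action
    if k == key.DOWN:
        action = 0  # do nothing
    if k == key.LEFT:
        action = 1  # fire the left orientation engine
    if k == key.UP:
        action = 2  # fire the main engine
    if k == key.RIGHT:
        action = 3  # fire the right orientation engine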

env.reset()
rewards = 0
for _ in range(1000):
    env.render(mode="human")
    env.viewer.window.on_key_press = key_press  # update the key press
    observation, reward, done, info = env.step(action)
    rewards += reward  # accumulate the reward returned by the environment
    if done:
        print(f"You got {rewards} points!")
        break
    time.sleep(0.1)  # reduce speed a little bit
env.close()

**TASK: Create the Neural Network for your Deep-Q-Agent**
Take a look at the size of the action space and the size of the observation space.
You are free to choose any architecture you want!
Hint: It already works with three layers, each having 64 neurons.

In [5]:
model = Sequential()
# https://keras.io/api/layers/reshaping_layers/flatten/
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))

model.add(Dense(64))
model.add(Activation('relu'))

model.add(Dense(64))
model.add(Activation('relu'))

model.add(Dense(32))
model.add(Activation('relu'))

model.add(Dense(nb_actions))
model.add(Activation('linear'))

model.summary()  # summary() prints the table itself and returns None

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 8)                 0         
_________________________________________________________________
dense (Dense)                (None, 64)                576       
_________________________________________________________________
activation (Activation)      (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
activation_1 (Activation)    (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 32)                2080      
_________________________________________________________________
activation_2 (Activation)    (None, 32)                0
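
The parameter counts in the summary can be checked by hand: a Dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The following is a small plain-Python sketch of that arithmetic, not a keras function. The helper name `dense_params` is hypothetical, the input size 8 is taken from the Flatten output shape above, and the output size 4 assumes the four discrete actions (0 to 3) that appear in the training log.

```python
def dense_params(n_in, n_out):
    """Number of trainable parameters of a fully-connected layer:
    one weight per input/output pair plus one bias per output unit."""
    return n_in * n_out + n_out


# Layer widths: observation size, three hidden layers, one output per action
# (the final 4 is an assumption based on the action range in the training log).
layer_sizes = [8, 64, 64, 32, 4]

counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts)  # [576, 4160, 2080, 132]
print(total)   # 6948
```

The first three values match the `Param #` column of the summary for `dense`, `dense_1` and `dense_2`.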

Let's create the DQN agent from keras-rl.
For this setting, the agent takes the following parameters:

1. model = The model
2. nb_actions = The number of actions (4 for LunarLander-v2)
3. memory = The experience replay memory. You can choose between *SequentialMemory()* and *EpisodeParameterMemory()*, which is only used for one RL agent called CEM
4. nb_steps_warmup = How many iterations to run without training. These steps are used to fill the memory
5. target_model_update = When do we update the target model?
6. policy = The action selection policy. You can choose between *LinearAnnealedPolicy()*, *SoftmaxPolicy()*, *EpsGreedyQPolicy()*, *GreedyQPolicy()*, *MaxBoltzmannQPolicy()* and *BoltzmannGumbelQPolicy()*. We use all of them in the following notebooks, but feel free to try them out and see which works best here

There are some more parameters you can pass to the DQN agent. Feel free to explore them; we will also take a look at them together in the remaining notebooks.
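
To make the *target_model_update* parameter more concrete, here is a minimal sketch of the two common update schemes. It assumes the convention that a value of 1 or more means "copy the online weights every N steps" (hard update) and a value below 1 is used as the mixing factor of a soft update. The function `update_target` and the use of plain lists as stand-ins for network weights are illustrative assumptions, not keras-rl code.

```python
def update_target(online, target, step, target_model_update):
    """Return the new target weights after one training step.

    online, target: lists of floats standing in for network weights.
    """
    if target_model_update >= 1:
        # Hard update: copy the online weights every N steps, otherwise keep the target.
        if step % int(target_model_update) == 0:
            return list(online)
        return list(target)
    # Soft update: move the target a small fraction tau towards the online weights.
    tau = target_model_update
    return [tau * w + (1.0 - tau) * t for w, t in zip(online, target)]


online = [1.0, 2.0]
target = [0.0, 0.0]

hard_skip = update_target(online, target, step=150, target_model_update=100)
hard_copy = update_target(online, target, step=200, target_model_update=100)
soft = update_target(online, target, step=1, target_model_update=0.5)

print(hard_skip)  # [0.0, 0.0]
print(hard_copy)  # [1.0, 2.0]
print(soft)       # [0.5, 1.0]
```

A rarely updated target gives more stable regression targets, while frequent updates let the target follow the online network more closely.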

Here we initialize the circular buffer with a limit of 50000 and a window length of 1.
The window length is the number of consecutive observations that are stacked to form one state.
This will be demonstrated in the next lecture, when we start dealing with images.

In [6]:
from rl.memory import SequentialMemory  # Sequential memory for storing observations (optimized circular buffer)

memory = SequentialMemory(limit=50000, window_length=1)
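
The idea behind a circular replay buffer can be sketched in a few lines of plain Python. This is a simplified stand-in, not the implementation of *SequentialMemory*: the class name `RingReplayBuffer` and its methods are hypothetical, and it ignores the window length entirely.

```python
import random
from collections import deque


class RingReplayBuffer:
    """Keeps only the most recent `limit` transitions."""

    def __init__(self, limit):
        # A deque with maxlen silently drops the oldest entry when it is full.
        self.buffer = deque(maxlen=limit)

    def append(self, observation, action, reward, done):
        self.buffer.append((observation, action, reward, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = RingReplayBuffer(limit=5)
for step in range(10):
    buffer.append(observation=step, action=0, reward=0.0, done=False)

stored_observations = [transition[0] for transition in buffer.buffer]
batch = buffer.sample(batch_size=3)

print(len(buffer))          # 5
print(stored_observations)  # [5, 6, 7, 8, 9]
```

With a limit of 50000 and 150000 training steps, the oldest experiences are overwritten several times during training.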


Then we define the action selection policy: <br />
We use *LinearAnnealedPolicy* to perform the epsilon-greedy strategy with a decaying epsilon. <br />
*LinearAnnealedPolicy* accepts an action selection policy, the maximal and minimal values of the annealed attribute, and a number of steps in order to create a dynamic policy. <br/>
The minimal value epsilon can reach during training is 0.1.<br />
For evaluation (i.e. running the agent) it is fixed to 0.


In [7]:
# LinearAnnealedPolicy lets us decay epsilon for the epsilon-greedy strategy
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy

policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), 
                              attr='eps',
                              value_max=1.,
                              value_min=.1,
                              value_test=0,
                              nb_steps=NUMBER_STEPS) 
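
To see what the annealed epsilon-greedy strategy does, here is a minimal plain-Python sketch. It assumes epsilon falls linearly from `value_max` to `value_min` over `nb_steps` and then stays at `value_min`. The function names `linear_annealed_eps` and `eps_greedy_action` are hypothetical and not part of keras-rl.

```python
import random


def linear_annealed_eps(step, value_max=1.0, value_min=0.1, nb_steps=150000):
    """Epsilon after `step` training steps: linear decay, clipped at value_min."""
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, value_max + slope * step)


def eps_greedy_action(q_values, eps, rng=random):
    """With probability eps pick a random action, otherwise the best one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Epsilon at the start, halfway, at the end and beyond the annealing phase.
for step in (0, 75000, 150000, 300000):
    print(step, round(linear_annealed_eps(step), 3))

# With eps=0 (the evaluation setting) the choice is purely greedy.
greedy = eps_greedy_action([0.1, 0.7, 0.2, 0.0], eps=0.0)
print(greedy)  # 1
```

Under this schedule epsilon is about 0.88 after 20000 steps, which agrees with the `mean_eps` values reported in the training log further down.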


**TASK: Create the DQNAgent** <br />
Feel free to play with the nb_steps_warmup, target_model_update, batch_size and gamma parameters. <br />
Hint:<br />
You can try *nb_steps_warmup*=1000, *target_model_update*=1000, *batch_size*=32 and *gamma*=0.99 as a first guess

In [8]:
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
               target_model_update=100, policy=policy, batch_size=64, gamma=0.991)
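
The role of *gamma* is easiest to see in the regression target used by Q-learning. The sketch below shows the plain (vanilla) DQN target under the assumption that the next-state Q-values come from the target model. The function name `td_target` is hypothetical, and agent options such as double DQN would change how the next action is chosen.

```python
def td_target(reward, next_q_values, done, gamma=0.991):
    """One-step Q-learning target for a single transition.

    next_q_values: Q-values of the next state (assumed to come from the target model).
    """
    if done:
        # No future reward after a terminal state.
        return reward
    return reward + gamma * max(next_q_values)


# A non-terminal transition: the best next action is worth 2.0.
running = td_target(reward=1.0, next_q_values=[0.5, 2.0, -1.0, 0.0], done=False)

# A terminal transition (for example a crash with reward -100).
crashed = td_target(reward=-100.0, next_q_values=[0.5, 2.0, -1.0, 0.0], done=True)

print(running)
print(crashed)  # -100.0
```

A gamma close to 1 makes the agent value rewards far in the future almost as much as immediate ones, which matters here because the landing bonus only arrives at the end of an episode.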


Finally, we compile our model with the Adam optimizer and a learning rate of 0.001.<br />
We log the mean squared error (MSE).

In [9]:
# learning_rate is the current argument name; older code used the deprecated lr
dqn.compile(Adam(learning_rate=0.001), metrics=['mse'])

Now we run the training for 150000 steps. You can set visualize=True if you want to watch your model learning.
Keep in mind that this increases the running time.
The training time is around 5 min, so grab your favorite beverage and stay tuned.


In [13]:
dqn.fit(env, nb_steps=NUMBER_STEPS, visualize=False, verbose=2, callbacks=[WandbCallback()])

Training for 150000 steps ...
    101/150000: episode: 1, duration: 0.680s, episode steps: 101, steps per second: 148, episode reward: -80.514, mean reward: -0.797 [-100.000,  6.883], mean action: 1.554 [0.000, 3.000],  loss: 12.809403, mse: 5478.669884, mean_q: 74.996231, mean_eps: 0.999667
    192/150000: episode: 2, duration: 0.730s, episode steps:  91, steps per second: 125, episode reward: -69.796, mean reward: -0.767 [-100.000, 16.831], mean action: 1.593 [0.000, 3.000],  loss: 8.018393, mse: 5453.873549, mean_q: 74.532044, mean_eps: 0.999124
    262/150000: episode: 3, duration: 0.496s, episode steps:  70, steps per second: 141, episode reward: -78.401, mean reward: -1.120 [-100.000,  7.790], mean action: 1.486 [0.000, 3.000],  loss: 18.678165, mse: 5459.173535, mean_q: 74.923272, mean_eps: 0.998641
    340/150000: episode: 4, duration: 0.649s, episode steps:  78, steps per second: 120, episode reward: -99.475, mean reward: -1.275 [-100.000, 15.132], mean action: 1.590 [0.000, 3

   2933/150000: episode: 32, duration: 0.969s, episode steps: 102, steps per second: 105, episode reward: -321.955, mean reward: -3.156 [-100.000,  7.311], mean action: 1.382 [0.000, 3.000],  loss: 14.968071, mse: 5121.941550, mean_q: 71.737554, mean_eps: 0.982711
   3037/150000: episode: 33, duration: 1.008s, episode steps: 104, steps per second: 103, episode reward: -271.764, mean reward: -2.613 [-100.000,  5.886], mean action: 1.365 [0.000, 3.000],  loss: 14.510038, mse: 5327.710977, mean_q: 73.188263, mean_eps: 0.982093
   3132/150000: episode: 34, duration: 0.866s, episode steps:  95, steps per second: 110, episode reward: -222.916, mean reward: -2.346 [-100.000,  5.440], mean action: 1.568 [0.000, 3.000],  loss: 11.553653, mse: 5323.510537, mean_q: 73.596199, mean_eps: 0.981496
   3223/150000: episode: 35, duration: 0.740s, episode steps:  91, steps per second: 123, episode reward: -455.181, mean reward: -5.002 [-100.000,  0.088], mean action: 1.736 [0.000, 3.000],  loss: 5.78367

   5866/150000: episode: 63, duration: 0.811s, episode steps: 113, steps per second: 139, episode reward: -337.795, mean reward: -2.989 [-100.000,  1.207], mean action: 1.407 [0.000, 3.000],  loss: 11.163018, mse: 4855.290795, mean_q: 68.276288, mean_eps: 0.965146
   5928/150000: episode: 64, duration: 0.440s, episode steps:  62, steps per second: 141, episode reward: -85.872, mean reward: -1.385 [-100.000, 11.943], mean action: 1.661 [0.000, 3.000],  loss: 6.871792, mse: 4865.794564, mean_q: 69.683473, mean_eps: 0.964621
   6036/150000: episode: 65, duration: 0.774s, episode steps: 108, steps per second: 139, episode reward: -118.802, mean reward: -1.100 [-100.000,  6.268], mean action: 1.694 [0.000, 3.000],  loss: 13.300338, mse: 4664.253863, mean_q: 66.745900, mean_eps: 0.964111
   6139/150000: episode: 66, duration: 0.714s, episode steps: 103, steps per second: 144, episode reward: -433.573, mean reward: -4.209 [-100.000,  0.642], mean action: 1.437 [0.000, 3.000],  loss: 9.567291,

   8705/150000: episode: 94, duration: 0.627s, episode steps:  80, steps per second: 128, episode reward: -54.952, mean reward: -0.687 [-100.000, 105.737], mean action: 1.387 [0.000, 3.000],  loss: 16.828836, mse: 5084.430768, mean_q: 68.814553, mean_eps: 0.948013
   8827/150000: episode: 95, duration: 0.897s, episode steps: 122, steps per second: 136, episode reward: -161.783, mean reward: -1.326 [-100.000,  4.932], mean action: 1.566 [0.000, 3.000],  loss: 11.224929, mse: 4800.568750, mean_q: 66.303100, mean_eps: 0.947407
   8937/150000: episode: 96, duration: 0.733s, episode steps: 110, steps per second: 150, episode reward: -518.726, mean reward: -4.716 [-100.000, 50.564], mean action: 1.636 [0.000, 3.000],  loss: 19.265038, mse: 4708.990039, mean_q: 65.103234, mean_eps: 0.946711
   9049/150000: episode: 97, duration: 0.772s, episode steps: 112, steps per second: 145, episode reward: -53.647, mean reward: -0.479 [-100.000, 10.867], mean action: 1.500 [0.000, 3.000],  loss: 14.42554

  11522/150000: episode: 125, duration: 0.853s, episode steps: 115, steps per second: 135, episode reward: -32.501, mean reward: -0.283 [-100.000, 87.846], mean action: 1.348 [0.000, 3.000],  loss: 17.560802, mse: 4696.684258, mean_q: 64.767067, mean_eps: 0.931216
  11601/150000: episode: 126, duration: 0.609s, episode steps:  79, steps per second: 130, episode reward: -101.281, mean reward: -1.282 [-100.000, 11.960], mean action: 1.544 [0.000, 3.000],  loss: 16.239678, mse: 4667.732558, mean_q: 63.910011, mean_eps: 0.930634
  11693/150000: episode: 127, duration: 0.722s, episode steps:  92, steps per second: 127, episode reward: -212.637, mean reward: -2.311 [-100.000, 21.576], mean action: 1.533 [0.000, 3.000],  loss: 14.052078, mse: 4767.283187, mean_q: 64.869002, mean_eps: 0.930121
  11804/150000: episode: 128, duration: 1.220s, episode steps: 111, steps per second:  91, episode reward: -41.814, mean reward: -0.377 [-100.000, 89.212], mean action: 1.532 [0.000, 3.000],  loss: 19.84

  14308/150000: episode: 156, duration: 0.802s, episode steps: 118, steps per second: 147, episode reward: -132.616, mean reward: -1.124 [-100.000, 11.839], mean action: 1.585 [0.000, 3.000],  loss: 17.023880, mse: 4854.405809, mean_q: 64.635580, mean_eps: 0.914509
  14407/150000: episode: 157, duration: 0.693s, episode steps:  99, steps per second: 143, episode reward: -174.440, mean reward: -1.762 [-100.000,  6.804], mean action: 1.495 [0.000, 3.000],  loss: 14.008723, mse: 4906.623932, mean_q: 64.153304, mean_eps: 0.913858
  14522/150000: episode: 158, duration: 0.783s, episode steps: 115, steps per second: 147, episode reward: -158.356, mean reward: -1.377 [-100.000,  6.682], mean action: 1.409 [0.000, 3.000],  loss: 16.786418, mse: 4920.526832, mean_q: 65.068036, mean_eps: 0.913216
  14583/150000: episode: 159, duration: 0.520s, episode steps:  61, steps per second: 117, episode reward: -134.282, mean reward: -2.201 [-100.000,  6.591], mean action: 1.377 [0.000, 3.000],  loss: 24.

  17130/150000: episode: 187, duration: 0.519s, episode steps:  74, steps per second: 143, episode reward: -60.493, mean reward: -0.817 [-100.000,  6.441], mean action: 1.297 [0.000, 3.000],  loss: 15.236877, mse: 6264.007153, mean_q: 70.913472, mean_eps: 0.897445
  17192/150000: episode: 188, duration: 0.425s, episode steps:  62, steps per second: 146, episode reward: -59.676, mean reward: -0.963 [-100.000, 20.113], mean action: 1.597 [0.000, 3.000],  loss: 14.622349, mse: 6357.106115, mean_q: 71.481323, mean_eps: 0.897037
  17307/150000: episode: 189, duration: 0.805s, episode steps: 115, steps per second: 143, episode reward: -112.454, mean reward: -0.978 [-100.000, 10.416], mean action: 1.626 [0.000, 3.000],  loss: 21.937715, mse: 6306.069090, mean_q: 69.874488, mean_eps: 0.896506
  17402/150000: episode: 190, duration: 0.695s, episode steps:  95, steps per second: 137, episode reward: -82.564, mean reward: -0.869 [-100.000,  6.068], mean action: 1.495 [0.000, 3.000],  loss: 13.052

  20004/150000: episode: 218, duration: 0.520s, episode steps:  68, steps per second: 131, episode reward: -79.888, mean reward: -1.175 [-100.000,  8.369], mean action: 1.618 [0.000, 3.000],  loss: 13.361195, mse: 6930.403378, mean_q: 73.452529, mean_eps: 0.880183
  20118/150000: episode: 219, duration: 0.965s, episode steps: 114, steps per second: 118, episode reward: -144.600, mean reward: -1.268 [-100.000,  5.132], mean action: 1.333 [0.000, 3.000],  loss: 13.117667, mse: 6760.313753, mean_q: 71.649211, mean_eps: 0.879637
  21118/150000: episode: 220, duration: 8.119s, episode steps: 1000, steps per second: 123, episode reward: 104.432, mean reward:  0.104 [-22.460, 93.317], mean action: 1.582 [0.000, 3.000],  loss: 16.183108, mse: 6794.290860, mean_q: 71.739078, mean_eps: 0.876295
  21209/150000: episode: 221, duration: 0.714s, episode steps:  91, steps per second: 128, episode reward: -284.812, mean reward: -3.130 [-100.000,  0.612], mean action: 1.451 [0.000, 3.000],  loss: 16.17

  23891/150000: episode: 249, duration: 0.718s, episode steps: 103, steps per second: 143, episode reward: -226.403, mean reward: -2.198 [-100.000, 32.505], mean action: 1.757 [0.000, 3.000],  loss: 19.658629, mse: 7504.145385, mean_q: 74.937418, mean_eps: 0.856966
  23980/150000: episode: 250, duration: 0.645s, episode steps:  89, steps per second: 138, episode reward: -43.744, mean reward: -0.492 [-100.000, 17.208], mean action: 1.663 [0.000, 3.000],  loss: 16.756922, mse: 7678.414874, mean_q: 77.070551, mean_eps: 0.856390
  24061/150000: episode: 251, duration: 0.563s, episode steps:  81, steps per second: 144, episode reward: -103.054, mean reward: -1.272 [-100.000,  7.270], mean action: 1.580 [0.000, 3.000],  loss: 17.454685, mse: 7540.448996, mean_q: 75.200239, mean_eps: 0.855880
  24146/150000: episode: 252, duration: 0.571s, episode steps:  85, steps per second: 149, episode reward: -103.244, mean reward: -1.215 [-100.000,  8.363], mean action: 1.741 [0.000, 3.000],  loss: 20.2

  26890/150000: episode: 280, duration: 1.266s, episode steps: 146, steps per second: 115, episode reward: -18.361, mean reward: -0.126 [-100.000, 44.662], mean action: 1.596 [0.000, 3.000],  loss: 17.466916, mse: 8516.848736, mean_q: 80.245891, mean_eps: 0.839101
  27020/150000: episode: 281, duration: 1.009s, episode steps: 130, steps per second: 129, episode reward: -68.066, mean reward: -0.524 [-100.000, 10.773], mean action: 1.523 [0.000, 3.000],  loss: 18.288526, mse: 8461.724504, mean_q: 79.449923, mean_eps: 0.838273
  27124/150000: episode: 282, duration: 0.870s, episode steps: 104, steps per second: 120, episode reward: -142.234, mean reward: -1.368 [-100.000,  4.722], mean action: 1.644 [0.000, 3.000],  loss: 26.058714, mse: 8515.719750, mean_q: 80.350485, mean_eps: 0.837571
  27230/150000: episode: 283, duration: 0.831s, episode steps: 106, steps per second: 128, episode reward: -362.594, mean reward: -3.421 [-100.000, 77.254], mean action: 1.425 [0.000, 3.000],  loss: 22.52

  30037/150000: episode: 311, duration: 1.032s, episode steps: 136, steps per second: 132, episode reward: -31.041, mean reward: -0.228 [-100.000, 20.980], mean action: 1.618 [0.000, 3.000],  loss: 18.386232, mse: 9553.999400, mean_q: 91.664874, mean_eps: 0.820189
  30137/150000: episode: 312, duration: 0.796s, episode steps: 100, steps per second: 126, episode reward: -173.896, mean reward: -1.739 [-100.000,  9.510], mean action: 1.630 [0.000, 3.000],  loss: 17.265993, mse: 9594.725742, mean_q: 91.261644, mean_eps: 0.819481
  30226/150000: episode: 313, duration: 0.632s, episode steps:  89, steps per second: 141, episode reward: -120.598, mean reward: -1.355 [-100.000,  8.241], mean action: 1.697 [0.000, 3.000],  loss: 13.623955, mse: 9651.579047, mean_q: 91.944236, mean_eps: 0.818914
  30321/150000: episode: 314, duration: 0.674s, episode steps:  95, steps per second: 141, episode reward: -116.319, mean reward: -1.224 [-100.000,  9.346], mean action: 1.589 [0.000, 3.000],  loss: 17.8

  33105/150000: episode: 342, duration: 0.594s, episode steps:  73, steps per second: 123, episode reward: -33.579, mean reward: -0.460 [-100.000, 12.174], mean action: 1.671 [0.000, 3.000],  loss: 19.029919, mse: 12954.237505, mean_q: 105.857851, mean_eps: 0.801592
  33166/150000: episode: 343, duration: 0.427s, episode steps:  61, steps per second: 143, episode reward: -79.862, mean reward: -1.309 [-100.000, 12.533], mean action: 1.672 [0.000, 3.000],  loss: 20.122824, mse: 12848.514360, mean_q: 106.128196, mean_eps: 0.801190
  33251/150000: episode: 344, duration: 0.580s, episode steps:  85, steps per second: 146, episode reward: -104.955, mean reward: -1.235 [-100.000,  8.987], mean action: 1.529 [0.000, 3.000],  loss: 20.012200, mse: 12631.475414, mean_q: 105.050029, mean_eps: 0.800752
  33318/150000: episode: 345, duration: 0.469s, episode steps:  67, steps per second: 143, episode reward: -120.126, mean reward: -1.793 [-100.000, 13.115], mean action: 1.507 [0.000, 3.000],  loss:

  36027/150000: episode: 373, duration: 1.065s, episode steps: 141, steps per second: 132, episode reward: -76.432, mean reward: -0.542 [-100.000,  6.589], mean action: 1.582 [0.000, 3.000],  loss: 27.488929, mse: 12385.818941, mean_q: 103.245436, mean_eps: 0.784264
  36154/150000: episode: 374, duration: 0.899s, episode steps: 127, steps per second: 141, episode reward: -375.556, mean reward: -2.957 [-100.000, 52.960], mean action: 1.740 [0.000, 3.000],  loss: 18.277474, mse: 12001.219542, mean_q: 101.856844, mean_eps: 0.783460
  36249/150000: episode: 375, duration: 0.643s, episode steps:  95, steps per second: 148, episode reward: -59.992, mean reward: -0.631 [-100.000, 26.357], mean action: 1.589 [0.000, 3.000],  loss: 19.445421, mse: 12121.378187, mean_q: 103.461799, mean_eps: 0.782794
  36338/150000: episode: 376, duration: 0.623s, episode steps:  89, steps per second: 143, episode reward: -100.327, mean reward: -1.127 [-100.000,  9.820], mean action: 1.685 [0.000, 3.000],  loss:

  39193/150000: episode: 404, duration: 0.961s, episode steps: 133, steps per second: 138, episode reward: 12.643, mean reward:  0.095 [-100.000, 66.260], mean action: 1.564 [0.000, 3.000],  loss: 30.419445, mse: 13335.589484, mean_q: 107.425470, mean_eps: 0.765244
  39275/150000: episode: 405, duration: 0.603s, episode steps:  82, steps per second: 136, episode reward: -59.739, mean reward: -0.729 [-100.000,  7.852], mean action: 1.549 [0.000, 3.000],  loss: 20.492110, mse: 13586.615675, mean_q: 108.965543, mean_eps: 0.764599
  39405/150000: episode: 406, duration: 0.935s, episode steps: 130, steps per second: 139, episode reward: -80.755, mean reward: -0.621 [-100.000, 15.578], mean action: 1.685 [0.000, 3.000],  loss: 21.558152, mse: 13940.305416, mean_q: 111.106866, mean_eps: 0.763963
  39508/150000: episode: 407, duration: 0.742s, episode steps: 103, steps per second: 139, episode reward: -60.495, mean reward: -0.587 [-100.000, 27.524], mean action: 1.544 [0.000, 3.000],  loss: 30

  42374/150000: episode: 435, duration: 0.609s, episode steps:  89, steps per second: 146, episode reward: -41.919, mean reward: -0.471 [-100.000, 18.457], mean action: 1.618 [0.000, 3.000],  loss: 23.436101, mse: 14768.033313, mean_q: 111.122228, mean_eps: 0.746026
  42489/150000: episode: 436, duration: 0.790s, episode steps: 115, steps per second: 146, episode reward: -88.983, mean reward: -0.774 [-100.000, 19.957], mean action: 1.791 [0.000, 3.000],  loss: 39.112976, mse: 14883.506055, mean_q: 111.258277, mean_eps: 0.745414
  42561/150000: episode: 437, duration: 0.551s, episode steps:  72, steps per second: 131, episode reward: -53.673, mean reward: -0.745 [-100.000,  8.764], mean action: 1.708 [0.000, 3.000],  loss: 19.888703, mse: 15326.424805, mean_q: 114.610153, mean_eps: 0.744853
  42690/150000: episode: 438, duration: 0.911s, episode steps: 129, steps per second: 142, episode reward: -36.700, mean reward: -0.284 [-100.000,  8.719], mean action: 1.581 [0.000, 3.000],  loss: 2

  45459/150000: episode: 466, duration: 0.953s, episode steps: 129, steps per second: 135, episode reward: -103.487, mean reward: -0.802 [-100.000,  7.198], mean action: 1.636 [0.000, 3.000],  loss: 25.745714, mse: 14431.602925, mean_q: 110.070761, mean_eps: 0.727636
  45578/150000: episode: 467, duration: 0.882s, episode steps: 119, steps per second: 135, episode reward: -255.448, mean reward: -2.147 [-100.000, 58.154], mean action: 1.639 [0.000, 3.000],  loss: 17.813390, mse: 14223.667115, mean_q: 109.662108, mean_eps: 0.726892
  45692/150000: episode: 468, duration: 0.804s, episode steps: 114, steps per second: 142, episode reward: -159.318, mean reward: -1.398 [-100.000,  9.639], mean action: 1.640 [0.000, 3.000],  loss: 38.351150, mse: 13868.559579, mean_q: 107.865330, mean_eps: 0.726193
  45858/150000: episode: 469, duration: 1.162s, episode steps: 166, steps per second: 143, episode reward: -93.660, mean reward: -0.564 [-100.000, 28.025], mean action: 1.717 [0.000, 3.000],  loss

  48861/150000: episode: 497, duration: 0.733s, episode steps:  96, steps per second: 131, episode reward: -95.107, mean reward: -0.991 [-100.000, 14.714], mean action: 1.625 [0.000, 3.000],  loss: 34.930545, mse: 13457.693929, mean_q: 103.181865, mean_eps: 0.707125
  48957/150000: episode: 498, duration: 0.861s, episode steps:  96, steps per second: 112, episode reward: -82.428, mean reward: -0.859 [-100.000,  7.446], mean action: 1.646 [0.000, 3.000],  loss: 14.481301, mse: 13312.937022, mean_q: 102.946099, mean_eps: 0.706549
  49058/150000: episode: 499, duration: 0.818s, episode steps: 101, steps per second: 123, episode reward: -202.902, mean reward: -2.009 [-100.000, 33.665], mean action: 1.644 [0.000, 3.000],  loss: 39.259142, mse: 13415.614645, mean_q: 103.253193, mean_eps: 0.705958
  49145/150000: episode: 500, duration: 0.657s, episode steps:  87, steps per second: 132, episode reward: -88.551, mean reward: -1.018 [-100.000, 11.333], mean action: 1.655 [0.000, 3.000],  loss: 

  51996/150000: episode: 528, duration: 1.187s, episode steps:  78, steps per second:  66, episode reward: -86.716, mean reward: -1.112 [-100.000, 10.488], mean action: 1.718 [0.000, 3.000],  loss: 32.308990, mse: 12790.183857, mean_q: 97.868854, mean_eps: 0.688261
  52109/150000: episode: 529, duration: 1.282s, episode steps: 113, steps per second:  88, episode reward: -19.532, mean reward: -0.173 [-100.000, 14.972], mean action: 1.602 [0.000, 3.000],  loss: 28.623715, mse: 12951.058948, mean_q: 100.428285, mean_eps: 0.687688
  52193/150000: episode: 530, duration: 0.909s, episode steps:  84, steps per second:  92, episode reward: -84.032, mean reward: -1.000 [-100.000, 18.773], mean action: 1.726 [0.000, 3.000],  loss: 19.408998, mse: 12963.188174, mean_q: 100.064731, mean_eps: 0.687097
  52281/150000: episode: 531, duration: 1.791s, episode steps:  88, steps per second:  49, episode reward: -97.285, mean reward: -1.106 [-100.000, 11.424], mean action: 1.659 [0.000, 3.000],  loss: 15

  55097/150000: episode: 559, duration: 0.809s, episode steps:  87, steps per second: 107, episode reward: -47.342, mean reward: -0.544 [-100.000,  6.370], mean action: 1.586 [0.000, 3.000],  loss: 23.437929, mse: 12333.594771, mean_q: 99.037330, mean_eps: 0.669682
  55233/150000: episode: 560, duration: 1.125s, episode steps: 136, steps per second: 121, episode reward: -124.262, mean reward: -0.914 [-100.000, 16.490], mean action: 1.684 [0.000, 3.000],  loss: 25.580324, mse: 12816.094856, mean_q: 101.283865, mean_eps: 0.669013
  55363/150000: episode: 561, duration: 0.990s, episode steps: 130, steps per second: 131, episode reward: -31.636, mean reward: -0.243 [-100.000, 10.222], mean action: 1.777 [0.000, 3.000],  loss: 33.208362, mse: 12732.650233, mean_q: 100.414480, mean_eps: 0.668215
  55457/150000: episode: 562, duration: 0.685s, episode steps:  94, steps per second: 137, episode reward: -189.132, mean reward: -2.012 [-100.000, 10.203], mean action: 1.840 [0.000, 3.000],  loss: 

  58482/150000: episode: 590, duration: 17.061s, episode steps: 103, steps per second:   6, episode reward: -50.217, mean reward: -0.488 [-100.000, 14.722], mean action: 1.718 [0.000, 3.000],  loss: 27.008141, mse: 11253.430313, mean_q: 95.393453, mean_eps: 0.649420
  58584/150000: episode: 591, duration: 2.269s, episode steps: 102, steps per second:  45, episode reward: -174.551, mean reward: -1.711 [-100.000,  4.226], mean action: 1.402 [0.000, 3.000],  loss: 20.178954, mse: 11373.615455, mean_q: 95.984192, mean_eps: 0.648805
  58711/150000: episode: 592, duration: 2.378s, episode steps: 127, steps per second:  53, episode reward: -77.717, mean reward: -0.612 [-100.000, 16.324], mean action: 1.701 [0.000, 3.000],  loss: 26.400579, mse: 11148.646830, mean_q: 93.234171, mean_eps: 0.648118
  58839/150000: episode: 593, duration: 1.275s, episode steps: 128, steps per second: 100, episode reward: -42.673, mean reward: -0.333 [-100.000, 17.751], mean action: 1.812 [0.000, 3.000],  loss: 18

  61842/150000: episode: 621, duration: 0.838s, episode steps:  86, steps per second: 103, episode reward: -86.724, mean reward: -1.008 [-100.000,  5.839], mean action: 1.535 [0.000, 3.000],  loss: 21.888174, mse: 10228.082565, mean_q: 93.383093, mean_eps: 0.629209
  61974/150000: episode: 622, duration: 1.238s, episode steps: 132, steps per second: 107, episode reward: -288.899, mean reward: -2.189 [-100.000,  6.442], mean action: 1.629 [0.000, 3.000],  loss: 20.774224, mse: 10526.496549, mean_q: 95.195541, mean_eps: 0.628555
  62072/150000: episode: 623, duration: 0.736s, episode steps:  98, steps per second: 133, episode reward: -66.930, mean reward: -0.683 [-100.000, 19.020], mean action: 1.806 [0.000, 3.000],  loss: 24.530085, mse: 10469.090741, mean_q: 94.699212, mean_eps: 0.627865
  62206/150000: episode: 624, duration: 0.952s, episode steps: 134, steps per second: 141, episode reward: -62.952, mean reward: -0.470 [-100.000,  7.272], mean action: 1.724 [0.000, 3.000],  loss: 20.

  65596/150000: episode: 652, duration: 0.887s, episode steps: 114, steps per second: 129, episode reward: -57.567, mean reward: -0.505 [-100.000,  6.711], mean action: 1.614 [0.000, 3.000],  loss: 28.848712, mse: 9986.473646, mean_q: 92.612760, mean_eps: 0.606769
  65735/150000: episode: 653, duration: 0.945s, episode steps: 139, steps per second: 147, episode reward: -10.296, mean reward: -0.074 [-100.000, 11.912], mean action: 1.583 [0.000, 3.000],  loss: 26.664059, mse: 9899.334195, mean_q: 91.469913, mean_eps: 0.606010
  65848/150000: episode: 654, duration: 0.784s, episode steps: 113, steps per second: 144, episode reward: -95.592, mean reward: -0.846 [-100.000,  5.646], mean action: 1.469 [0.000, 3.000],  loss: 26.416228, mse: 9977.970344, mean_q: 91.655609, mean_eps: 0.605254
  65971/150000: episode: 655, duration: 0.848s, episode steps: 123, steps per second: 145, episode reward: -100.000, mean reward: -0.813 [-100.000,  7.985], mean action: 1.724 [0.000, 3.000],  loss: 23.069

  69178/150000: episode: 683, duration: 0.963s, episode steps: 144, steps per second: 149, episode reward: -72.339, mean reward: -0.502 [-100.000, 11.768], mean action: 1.535 [0.000, 3.000],  loss: 24.339845, mse: 10674.686751, mean_q: 96.252198, mean_eps: 0.585367
  70178/150000: episode: 684, duration: 8.326s, episode steps: 1000, steps per second: 120, episode reward:  7.888, mean reward:  0.008 [-24.946, 23.761], mean action: 1.688 [0.000, 3.000],  loss: 23.351726, mse: 11060.363520, mean_q: 97.871591, mean_eps: 0.581935
  70273/150000: episode: 685, duration: 0.715s, episode steps:  95, steps per second: 133, episode reward: -66.943, mean reward: -0.705 [-100.000,  9.255], mean action: 1.516 [0.000, 3.000],  loss: 17.717300, mse: 10742.610578, mean_q: 95.994851, mean_eps: 0.578650
  70338/150000: episode: 686, duration: 0.571s, episode steps:  65, steps per second: 114, episode reward: -104.300, mean reward: -1.605 [-100.000, 14.823], mean action: 1.754 [0.000, 3.000],  loss: 23.1

  73414/150000: episode: 714, duration: 0.963s, episode steps: 109, steps per second: 113, episode reward: -104.969, mean reward: -0.963 [-100.000,  8.993], mean action: 1.679 [0.000, 3.000],  loss: 23.823383, mse: 10711.271368, mean_q: 97.609291, mean_eps: 0.559846
  73534/150000: episode: 715, duration: 1.001s, episode steps: 120, steps per second: 120, episode reward: -63.274, mean reward: -0.527 [-100.000, 12.646], mean action: 1.792 [0.000, 3.000],  loss: 19.036612, mse: 10776.467692, mean_q: 96.153859, mean_eps: 0.559159
  73656/150000: episode: 716, duration: 0.953s, episode steps: 122, steps per second: 128, episode reward: -55.284, mean reward: -0.453 [-100.000, 21.657], mean action: 1.648 [0.000, 3.000],  loss: 24.000689, mse: 10652.534860, mean_q: 95.357657, mean_eps: 0.558433
  73732/150000: episode: 717, duration: 0.579s, episode steps:  76, steps per second: 131, episode reward: -45.145, mean reward: -0.594 [-100.000,  8.041], mean action: 1.684 [0.000, 3.000],  loss: 25.

  78653/150000: episode: 745, duration: 0.631s, episode steps:  83, steps per second: 131, episode reward:  4.333, mean reward:  0.052 [-100.000, 21.838], mean action: 1.759 [0.000, 3.000],  loss: 29.681484, mse: 12240.759883, mean_q: 107.019298, mean_eps: 0.528334
  78785/150000: episode: 746, duration: 1.018s, episode steps: 132, steps per second: 130, episode reward: -61.995, mean reward: -0.470 [-100.000,  8.319], mean action: 1.833 [0.000, 3.000],  loss: 32.397399, mse: 12023.583947, mean_q: 105.637215, mean_eps: 0.527689
  78900/150000: episode: 747, duration: 0.790s, episode steps: 115, steps per second: 146, episode reward: -86.289, mean reward: -0.750 [-100.000,  8.594], mean action: 1.539 [0.000, 3.000],  loss: 41.022715, mse: 11882.268359, mean_q: 105.256261, mean_eps: 0.526948
  79017/150000: episode: 748, duration: 0.839s, episode steps: 117, steps per second: 139, episode reward: -48.320, mean reward: -0.413 [-100.000, 18.733], mean action: 1.427 [0.000, 3.000],  loss: 24

  84283/150000: episode: 776, duration: 0.758s, episode steps:  95, steps per second: 125, episode reward: -24.666, mean reward: -0.260 [-100.000, 12.226], mean action: 1.632 [0.000, 3.000],  loss: 34.863354, mse: 15466.232442, mean_q: 122.789881, mean_eps: 0.494590
  84400/150000: episode: 777, duration: 0.814s, episode steps: 117, steps per second: 144, episode reward: -81.268, mean reward: -0.695 [-100.000,  8.186], mean action: 1.684 [0.000, 3.000],  loss: 31.823599, mse: 15377.735660, mean_q: 120.732760, mean_eps: 0.493954
  84519/150000: episode: 778, duration: 1.146s, episode steps: 119, steps per second: 104, episode reward: -44.593, mean reward: -0.375 [-100.000,  9.895], mean action: 1.529 [0.000, 3.000],  loss: 28.136812, mse: 15382.619830, mean_q: 121.403171, mean_eps: 0.493246
  84599/150000: episode: 779, duration: 0.743s, episode steps:  80, steps per second: 108, episode reward: -49.614, mean reward: -0.620 [-100.000, 12.333], mean action: 1.613 [0.000, 3.000],  loss: 3

  89864/150000: episode: 807, duration: 0.808s, episode steps:  92, steps per second: 114, episode reward: -24.657, mean reward: -0.268 [-100.000, 18.309], mean action: 1.793 [0.000, 3.000],  loss: 32.668890, mse: 21833.844217, mean_q: 150.651198, mean_eps: 0.461095
  89955/150000: episode: 808, duration: 0.774s, episode steps:  91, steps per second: 118, episode reward: -20.635, mean reward: -0.227 [-100.000,  8.100], mean action: 1.626 [0.000, 3.000],  loss: 36.801244, mse: 21730.215895, mean_q: 149.519274, mean_eps: 0.460546
  90036/150000: episode: 809, duration: 0.712s, episode steps:  81, steps per second: 114, episode reward: -29.337, mean reward: -0.362 [-100.000, 14.983], mean action: 1.741 [0.000, 3.000],  loss: 44.596872, mse: 21926.122782, mean_q: 149.277949, mean_eps: 0.460030
  90125/150000: episode: 810, duration: 0.828s, episode steps:  89, steps per second: 107, episode reward:  0.550, mean reward:  0.006 [-100.000, 20.599], mean action: 1.573 [0.000, 3.000],  loss: 31

  98366/150000: episode: 838, duration: 8.314s, episode steps: 1000, steps per second: 120, episode reward: 50.083, mean reward:  0.050 [-23.751, 21.896], mean action: 1.576 [0.000, 3.000],  loss: 42.329394, mse: 25468.189191, mean_q: 162.577027, mean_eps: 0.412807
  98476/150000: episode: 839, duration: 0.874s, episode steps: 110, steps per second: 126, episode reward: -19.273, mean reward: -0.175 [-100.000, 23.133], mean action: 1.664 [0.000, 3.000],  loss: 34.415024, mse: 24989.572496, mean_q: 160.581412, mean_eps: 0.409477
  99476/150000: episode: 840, duration: 7.996s, episode steps: 1000, steps per second: 125, episode reward: -27.758, mean reward: -0.028 [-22.987, 22.546], mean action: 1.503 [0.000, 3.000],  loss: 39.856778, mse: 24313.945555, mean_q: 157.457801, mean_eps: 0.406147
  99554/150000: episode: 841, duration: 0.544s, episode steps:  78, steps per second: 143, episode reward: 18.464, mean reward:  0.237 [-100.000, 17.191], mean action: 1.756 [0.000, 3.000],  loss: 40.

 107338/150000: episode: 869, duration: 0.785s, episode steps:  90, steps per second: 115, episode reward: -31.293, mean reward: -0.348 [-100.000, 11.018], mean action: 1.833 [0.000, 3.000],  loss: 38.369446, mse: 16821.617459, mean_q: 130.001409, mean_eps: 0.356245
 107525/150000: episode: 870, duration: 1.347s, episode steps: 187, steps per second: 139, episode reward: -323.132, mean reward: -1.728 [-100.000,  5.225], mean action: 1.663 [0.000, 3.000],  loss: 35.240715, mse: 16826.054672, mean_q: 129.878383, mean_eps: 0.355414
 107667/150000: episode: 871, duration: 0.982s, episode steps: 142, steps per second: 145, episode reward: 17.470, mean reward:  0.123 [-100.000, 17.696], mean action: 1.641 [0.000, 3.000],  loss: 29.149170, mse: 17246.226363, mean_q: 132.406731, mean_eps: 0.354427
 107772/150000: episode: 872, duration: 0.735s, episode steps: 105, steps per second: 143, episode reward: -8.210, mean reward: -0.078 [-100.000, 13.594], mean action: 1.410 [0.000, 3.000],  loss: 27

 115379/150000: episode: 900, duration: 1.531s, episode steps: 187, steps per second: 122, episode reward: -38.069, mean reward: -0.204 [-100.000, 14.830], mean action: 1.759 [0.000, 3.000],  loss: 29.214248, mse: 12479.490600, mean_q: 111.339062, mean_eps: 0.308290
 115508/150000: episode: 901, duration: 1.066s, episode steps: 129, steps per second: 121, episode reward: -6.838, mean reward: -0.053 [-100.000, 16.441], mean action: 1.574 [0.000, 3.000],  loss: 21.236280, mse: 12362.058791, mean_q: 110.638418, mean_eps: 0.307342
 116030/150000: episode: 902, duration: 4.125s, episode steps: 522, steps per second: 127, episode reward: -246.482, mean reward: -0.472 [-100.000, 23.248], mean action: 1.402 [0.000, 3.000],  loss: 25.533646, mse: 11798.188718, mean_q: 107.390052, mean_eps: 0.305389
 116166/150000: episode: 903, duration: 1.004s, episode steps: 136, steps per second: 135, episode reward: -86.716, mean reward: -0.638 [-100.000,  4.113], mean action: 1.559 [0.000, 3.000],  loss: 2

 129932/150000: episode: 931, duration: 2.622s, episode steps: 369, steps per second: 141, episode reward: 271.219, mean reward:  0.735 [-18.292, 100.000], mean action: 1.545 [0.000, 3.000],  loss: 16.001201, mse: 6165.214602, mean_q: 77.624931, mean_eps: 0.221518
 130932/150000: episode: 932, duration: 7.584s, episode steps: 1000, steps per second: 132, episode reward: 90.525, mean reward:  0.091 [-23.842, 17.712], mean action: 1.196 [0.000, 3.000],  loss: 16.799092, mse: 5900.092398, mean_q: 74.894335, mean_eps: 0.217411
 131932/150000: episode: 933, duration: 8.037s, episode steps: 1000, steps per second: 124, episode reward: 52.146, mean reward:  0.052 [-17.764, 19.270], mean action: 1.361 [0.000, 3.000],  loss: 16.380088, mse: 5558.705601, mean_q: 73.146607, mean_eps: 0.211411
 132024/150000: episode: 934, duration: 0.652s, episode steps:  92, steps per second: 141, episode reward: 30.281, mean reward:  0.329 [-100.000, 12.378], mean action: 1.837 [0.000, 3.000],  loss: 15.097853,

 143297/150000: episode: 962, duration: 7.407s, episode steps: 898, steps per second: 121, episode reward: 152.072, mean reward:  0.169 [-22.902, 100.000], mean action: 1.199 [0.000, 3.000],  loss: 15.533905, mse: 3977.129760, mean_q: 62.673222, mean_eps: 0.142915
 143714/150000: episode: 963, duration: 3.113s, episode steps: 417, steps per second: 134, episode reward: 156.957, mean reward:  0.376 [-18.231, 100.000], mean action: 2.149 [0.000, 3.000],  loss: 14.468026, mse: 3941.503255, mean_q: 62.665976, mean_eps: 0.138970
 144284/150000: episode: 964, duration: 4.540s, episode steps: 570, steps per second: 126, episode reward: 187.628, mean reward:  0.329 [-21.158, 100.000], mean action: 1.454 [0.000, 3.000],  loss: 15.433039, mse: 4042.094885, mean_q: 63.502053, mean_eps: 0.136009
 145103/150000: episode: 965, duration: 6.069s, episode steps: 819, steps per second: 135, episode reward: 250.648, mean reward:  0.306 [-18.309, 100.000], mean action: 1.004 [0.000, 3.000],  loss: 16.8608

<tensorflow.python.keras.callbacks.History at 0x1d7ebf968b0>

Wow! After only a few minutes of training, we achieve great results!
The reason is that keras-rl implements several optimizations (e.g. an optimized replay buffer), which lead to much faster convergence than our hand-written DQN.
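
To make the replay-buffer idea concrete, here is a minimal sketch of a fixed-size replay memory with uniform sampling. It is not keras-rl's implementation (keras-rl ships its own memory classes, such as `SequentialMemory`, which are more elaborate). The class name `ReplayBuffer` and its methods are hypothetical and chosen only for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-size replay memory (illustration only, not keras-rl code)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)

    def append(self, state, action, reward, next_state, done):
        # Store one transition as a plain tuple.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement, so a batch mixes old and new
        # transitions instead of only consecutive, highly correlated steps.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Tiny usage example with dummy integer "states".
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.append(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

batch = buffer.sample(batch_size=2)
```

Because the capacity is 3, only the transitions from steps 2, 3 and 4 remain after five appends. Training on random batches drawn from such a memory, rather than on the latest step alone, is the core idea that the library's buffer builds on.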

In [None]:
# After training is done, we save the final weights.
dqn.save_weights(f'dqn_{env_name}_weights_0,995_a0,001.h5f', overwrite=True)

In [None]:
# Evaluate our algorithm for 200 episodes without rendering and log the results to Weights & Biases.
dqn.test(env, nb_episodes=200, visualize=False, callbacks=[WandbCallback()])
env.close()

In [None]:
# Finally, watch the trained agent play 5 rendered episodes.
dqn.test(env, nb_episodes=5, visualize=True)
env.close()

Testing for 5 episodes ...
Episode 1: reward: 250.716, steps: 321
Episode 2: reward: 257.224, steps: 242
