# Deep Q Networks

Code can be found at: https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/

## 1. Setup

In [1]:
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

Using TensorFlow backend.


In [2]:
# Get the environment and extract the number of actions available
env = gym.make('CartPole-v0')
np.random.seed(0)
env.seed(0)
nb_actions = env.action_space.n

## 2. Create Neural Network

In [3]:
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten_1 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                80        
_________________________________________________________________
activation_1 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 34        
_________________________________________________________________
activation_2 (Activation)    (None, 2)                 0         
=================================================================
Total params: 114
Trainable params: 114
Non-trainable params: 0
_________________________________________________________________
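The parameter counts in the summary can be checked by hand: a Dense layer with n inputs and m units holds n*m weights plus m biases. The sketch below reproduces the numbers; `dense_params` is a hypothetical helper written for this check, not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

obs_dim = 4     # Flatten output shape (None, 4) in the summary
hidden = 16     # units in the first Dense layer
n_actions = 2   # units in the output Dense layer

layer_counts = [dense_params(obs_dim, hidden), dense_params(hidden, n_actions)]
total = sum(layer_counts)
print(layer_counts, total)
```

The Flatten and Activation layers contribute no parameters, so the total is just the sum of the two Dense layers.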


## 3. Train and Test a simple DQN

In [4]:
# Configure the exploration policy, the replay memory, and the agent
policy = EpsGreedyQPolicy()
memory = SequentialMemory(limit=50000, window_length=1)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
               target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])
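To make the agent settings above more concrete, here is a minimal pure-Python sketch of two of the ideas involved: epsilon-greedy action selection and a soft target-network update. It is an illustration, not keras-rl's implementation. The function names are hypothetical, `eps=0.1` is assumed to match the `EpsGreedyQPolicy()` default, and reading `target_model_update=1e-2` as a soft-update rate follows keras-rl's convention for values below 1 as I understand it.

```python
import random

def eps_greedy(q_values, eps=0.1, rng=random):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def soft_update(target, online, tau=1e-2):
    """Move each target weight a fraction tau toward the online weight."""
    return [(1.0 - tau) * t + tau * w for t, w in zip(target, online)]

# With eps=0 the choice is purely greedy: the index of the largest Q-value.
greedy_action = eps_greedy([0.2, 0.7], eps=0.0)

# One soft-update step on toy weights (assumed tau of 1e-2).
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=1e-2)
print(greedy_action, new_target)
```

A small update rate keeps the target network changing slowly, which tends to stabilize the Q-value targets during training.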

### Train for 5,000 Steps

In [5]:
# Training with verbose and visualization
dqn.fit(env, nb_steps=5000, visualize=True, verbose=2)

Training for 5000 steps ...
    9/5000: episode: 1, duration: 0.328s, episode steps: 9, steps per second: 27, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.000 [0.000, 0.000], mean observation: 0.143 [-1.715, 2.789], loss: --, mean_absolute_error: --, mean_q: --
   22/5000: episode: 2, duration: 0.567s, episode steps: 13, steps per second: 23, episode reward: 13.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.154 [0.000, 1.000], mean observation: 0.115 [-1.745, 2.863], loss: 0.623680, mean_absolute_error: 0.738524, mean_q: 0.574850




   34/5000: episode: 3, duration: 0.202s, episode steps: 12, steps per second: 59, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.083 [0.000, 1.000], mean observation: 0.112 [-1.950, 3.064], loss: 0.565361, mean_absolute_error: 0.744800, mean_q: 0.734636
   44/5000: episode: 4, duration: 0.164s, episode steps: 10, steps per second: 61, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.100 [0.000, 1.000], mean observation: 0.141 [-1.557, 2.618], loss: 0.531535, mean_absolute_error: 0.739168, mean_q: 0.845040
   54/5000: episode: 5, duration: 0.166s, episode steps: 10, steps per second: 60, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.100 [0.000, 1.000], mean observation: 0.142 [-1.716, 2.698], loss: 0.488719, mean_absolute_error: 0.715716, mean_q: 0.939681
   63/5000: episode: 6, duration: 0.150s, episode steps: 9, steps per second: 60, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0

  323/5000: episode: 32, duration: 0.194s, episode steps: 11, steps per second: 57, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.091 [0.000, 1.000], mean observation: 0.140 [-1.712, 2.778], loss: 0.441408, mean_absolute_error: 0.972610, mean_q: 2.965413
  335/5000: episode: 33, duration: 0.186s, episode steps: 12, steps per second: 65, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.083 [0.000, 1.000], mean observation: 0.079 [-2.003, 2.989], loss: 0.451643, mean_absolute_error: 0.977772, mean_q: 2.975425
  349/5000: episode: 34, duration: 0.233s, episode steps: 14, steps per second: 60, episode reward: 14.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.143 [0.000, 1.000], mean observation: 0.078 [-2.001, 3.014], loss: 0.468996, mean_absolute_error: 1.040299, mean_q: 3.035325
  359/5000: episode: 35, duration: 0.168s, episode steps: 10, steps per second: 59, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean act

  630/5000: episode: 63, duration: 0.173s, episode steps: 10, steps per second: 58, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.000 [0.000, 0.000], mean observation: 0.139 [-1.955, 3.035], loss: 0.348101, mean_absolute_error: 1.885573, mean_q: 4.439894
  640/5000: episode: 64, duration: 0.158s, episode steps: 10, steps per second: 63, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.100 [0.000, 1.000], mean observation: 0.150 [-1.735, 2.755], loss: 0.306011, mean_absolute_error: 1.884556, mean_q: 4.390641
  650/5000: episode: 65, duration: 0.167s, episode steps: 10, steps per second: 60, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.000 [0.000, 0.000], mean observation: 0.123 [-1.938, 2.947], loss: 0.290853, mean_absolute_error: 1.893733, mean_q: 4.443768
  659/5000: episode: 66, duration: 0.150s, episode steps: 9, steps per second: 60, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean actio

  925/5000: episode: 92, duration: 0.183s, episode steps: 11, steps per second: 60, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.273 [0.000, 1.000], mean observation: 0.137 [-1.173, 1.929], loss: 0.252245, mean_absolute_error: 2.588031, mean_q: 5.470732
  934/5000: episode: 93, duration: 0.156s, episode steps: 9, steps per second: 58, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.222 [0.000, 1.000], mean observation: 0.144 [-1.353, 2.133], loss: 0.189928, mean_absolute_error: 2.608366, mean_q: 5.637427
  945/5000: episode: 94, duration: 0.179s, episode steps: 11, steps per second: 62, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.273 [0.000, 1.000], mean observation: 0.105 [-1.373, 2.123], loss: 0.357427, mean_absolute_error: 2.609134, mean_q: 5.482111
  955/5000: episode: 95, duration: 0.166s, episode steps: 10, steps per second: 60, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean actio

 1220/5000: episode: 121, duration: 0.208s, episode steps: 12, steps per second: 58, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.333 [0.000, 1.000], mean observation: 0.111 [-1.207, 1.855], loss: 0.327599, mean_absolute_error: 3.023829, mean_q: 6.018233
 1233/5000: episode: 122, duration: 0.228s, episode steps: 13, steps per second: 57, episode reward: 13.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.308 [0.000, 1.000], mean observation: 0.091 [-1.184, 1.854], loss: 0.281513, mean_absolute_error: 3.037763, mean_q: 6.079416
 1245/5000: episode: 123, duration: 0.190s, episode steps: 12, steps per second: 63, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.333 [0.000, 1.000], mean observation: 0.117 [-1.181, 1.878], loss: 0.209909, mean_absolute_error: 3.060978, mean_q: 6.189703
 1254/5000: episode: 124, duration: 0.148s, episode steps: 9, steps per second: 61, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean a

 1570/5000: episode: 150, duration: 0.162s, episode steps: 10, steps per second: 62, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.132 [-2.986, 1.952], loss: 1.627133, mean_absolute_error: 4.098371, mean_q: 7.525575
 1583/5000: episode: 151, duration: 0.220s, episode steps: 13, steps per second: 59, episode reward: 13.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.923 [0.000, 1.000], mean observation: -0.085 [-3.256, 2.187], loss: 0.739658, mean_absolute_error: 3.937388, mean_q: 7.252093
 1598/5000: episode: 152, duration: 0.245s, episode steps: 15, steps per second: 61, episode reward: 15.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.400 [0.000, 1.000], mean observation: 0.085 [-0.818, 1.462], loss: 0.828060, mean_absolute_error: 3.952991, mean_q: 7.427989
 1612/5000: episode: 153, duration: 0.237s, episode steps: 14, steps per second: 59, episode reward: 14.000, mean reward: 1.000 [1.000, 1.000], me

 2152/5000: episode: 180, duration: 0.132s, episode steps: 8, steps per second: 61, episode reward: 8.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.875 [0.000, 1.000], mean observation: -0.123 [-2.180, 1.399], loss: 2.619481, mean_absolute_error: 5.083424, mean_q: 9.345673
 2163/5000: episode: 181, duration: 0.183s, episode steps: 11, steps per second: 60, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.727 [0.000, 1.000], mean observation: -0.106 [-1.927, 1.211], loss: 1.302474, mean_absolute_error: 4.941520, mean_q: 9.228623
 2175/5000: episode: 182, duration: 0.196s, episode steps: 12, steps per second: 61, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.750 [0.000, 1.000], mean observation: -0.116 [-2.163, 1.363], loss: 1.888739, mean_absolute_error: 5.087807, mean_q: 9.409200
 2185/5000: episode: 183, duration: 0.168s, episode steps: 10, steps per second: 60, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mea

 2456/5000: episode: 209, duration: 0.182s, episode steps: 11, steps per second: 60, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.818 [0.000, 1.000], mean observation: -0.108 [-2.402, 1.607], loss: 2.420222, mean_absolute_error: 5.579640, mean_q: 10.234366
 2465/5000: episode: 210, duration: 0.149s, episode steps: 9, steps per second: 60, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.889 [0.000, 1.000], mean observation: -0.161 [-2.304, 1.357], loss: 2.065156, mean_absolute_error: 5.564383, mean_q: 10.305596
 2475/5000: episode: 211, duration: 0.162s, episode steps: 10, steps per second: 62, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.900 [0.000, 1.000], mean observation: -0.136 [-2.758, 1.781], loss: 1.763359, mean_absolute_error: 5.551276, mean_q: 10.389808
 2485/5000: episode: 212, duration: 0.169s, episode steps: 10, steps per second: 59, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], 

 2746/5000: episode: 239, duration: 0.218s, episode steps: 13, steps per second: 60, episode reward: 13.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.538 [0.000, 1.000], mean observation: -0.114 [-1.166, 0.616], loss: 2.106350, mean_absolute_error: 5.723833, mean_q: 10.638388
 2756/5000: episode: 240, duration: 0.167s, episode steps: 10, steps per second: 60, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.700 [0.000, 1.000], mean observation: -0.115 [-1.444, 0.775], loss: 2.603588, mean_absolute_error: 5.773190, mean_q: 10.624571
 2766/5000: episode: 241, duration: 0.165s, episode steps: 10, steps per second: 60, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.800 [0.000, 1.000], mean observation: -0.100 [-1.945, 1.222], loss: 1.717933, mean_absolute_error: 5.588039, mean_q: 10.463107
 2775/5000: episode: 242, duration: 0.145s, episode steps: 9, steps per second: 62, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], 

 3631/5000: episode: 268, duration: 0.849s, episode steps: 51, steps per second: 60, episode reward: 51.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.490 [0.000, 1.000], mean observation: 0.007 [-0.578, 0.970], loss: 1.691353, mean_absolute_error: 6.109680, mean_q: 11.413475
 3691/5000: episode: 269, duration: 1.000s, episode steps: 60, steps per second: 60, episode reward: 60.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.633 [0.000, 1.000], mean observation: 0.173 [-2.984, 3.210], loss: 1.754368, mean_absolute_error: 6.182782, mean_q: 11.554001
 3743/5000: episode: 270, duration: 0.864s, episode steps: 52, steps per second: 60, episode reward: 52.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.519 [0.000, 1.000], mean observation: 0.048 [-0.411, 0.857], loss: 1.563946, mean_absolute_error: 6.227271, mean_q: 11.672683
 3801/5000: episode: 271, duration: 0.967s, episode steps: 58, steps per second: 60, episode reward: 58.000, mean reward: 1.000 [1.000, 1.000], m

<keras.callbacks.History at 0x1ebf820fb70>

### Test

In [6]:
# Test the model
dqn.test(env, nb_episodes=20, visualize=True)

Testing for 20 episodes ...
Episode 1: reward: 42.000, steps: 42
Episode 2: reward: 62.000, steps: 62
Episode 3: reward: 31.000, steps: 31
Episode 4: reward: 51.000, steps: 51
Episode 5: reward: 46.000, steps: 46
Episode 6: reward: 36.000, steps: 36
Episode 7: reward: 35.000, steps: 35
Episode 8: reward: 63.000, steps: 63
Episode 9: reward: 40.000, steps: 40
Episode 10: reward: 44.000, steps: 44
Episode 11: reward: 66.000, steps: 66
Episode 12: reward: 44.000, steps: 44
Episode 13: reward: 32.000, steps: 32
Episode 14: reward: 44.000, steps: 44
Episode 15: reward: 36.000, steps: 36
Episode 16: reward: 30.000, steps: 30
Episode 17: reward: 43.000, steps: 43
Episode 18: reward: 47.000, steps: 47
Episode 19: reward: 35.000, steps: 35
Episode 20: reward: 79.000, steps: 79


<keras.callbacks.History at 0x1eb8a2774e0>
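The 20 test rewards above are easier to compare with later runs once summarized. A short sketch using only the standard library, with the rewards copied from the output above (the variable names are illustrative):

```python
from statistics import mean

# Episode rewards from the 20 test episodes after 5,000 training steps
rewards_5k = [42, 62, 31, 51, 46, 36, 35, 63, 40, 44,
              66, 44, 32, 44, 36, 30, 43, 47, 35, 79]

summary = {
    "min": min(rewards_5k),
    "max": max(rewards_5k),
    "mean": mean(rewards_5k),
}
print(summary)
```

The rewards range from 30 to 79 with a mean of 45.3, well short of the 200-step cap per episode.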

## 4. Train and Test a DQN (reduced verbosity)

In [7]:
# Continue training the same agent without visualization; the default verbosity logs one summary per 10,000-step interval
dqn.fit(env, nb_steps=100000)

Training for 100000 steps ...
Interval 1 (0 steps performed)
105 episodes - episode_reward: 94.790 [29.000, 200.000] - loss: 5.346 - mean_absolute_error: 14.495 - mean_q: 28.363

Interval 2 (10000 steps performed)
51 episodes - episode_reward: 194.922 [88.000, 200.000] - loss: 12.340 - mean_absolute_error: 27.364 - mean_q: 55.054

Interval 3 (20000 steps performed)
50 episodes - episode_reward: 200.000 [200.000, 200.000] - loss: 18.884 - mean_absolute_error: 37.948 - mean_q: 76.908

Interval 4 (30000 steps performed)
50 episodes - episode_reward: 200.000 [200.000, 200.000] - loss: 22.428 - mean_absolute_error: 43.282 - mean_q: 87.853

Interval 5 (40000 steps performed)
50 episodes - episode_reward: 200.000 [200.000, 200.000] - loss: 23.853 - mean_absolute_error: 44.914 - mean_q: 91.326

Interval 6 (50000 steps performed)
50 episodes - episode_reward: 200.000 [200.000, 200.000] - loss: 19.451 - mean_absolute_error: 44.271 - mean_q: 89.980

Interval 7 (60000 steps performed)
50 episodes 

<keras.callbacks.History at 0x1eb8a291630>
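The `mean_q` values in the later intervals (roughly 88 to 91) can be sanity-checked against the discounted return of an episode that earns a reward of 1 at every step. The sketch below assumes a discount factor of 0.99, which I believe is keras-rl's default; the notebook does not set it explicitly, so treat the number as an assumption.

```python
def discounted_return(n_steps, gamma, reward=1.0):
    """Sum of reward * gamma**t for t = 0 .. n_steps - 1."""
    return sum(reward * gamma ** t for t in range(n_steps))

gamma = 0.99  # assumed discount factor; not set explicitly in the notebook

full_episode = discounted_return(200, gamma)  # return from the start of a 200-step episode
infinite_limit = 1.0 / (1.0 - gamma)          # limit if the episode never ended
print(round(full_episode, 1), round(infinite_limit, 1))
```

Under that assumption a full 200-step episode is worth about 86.6 and an endless one 100, so the logged `mean_q` values are in a plausible range. This is only a rough check, since `mean_q` is averaged over sampled states.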

In [8]:
# Test the model
dqn.test(env, nb_episodes=20, visualize=True)

Testing for 20 episodes ...
Episode 1: reward: 200.000, steps: 200
Episode 2: reward: 200.000, steps: 200
Episode 3: reward: 200.000, steps: 200
Episode 4: reward: 200.000, steps: 200
Episode 5: reward: 200.000, steps: 200
Episode 6: reward: 200.000, steps: 200
Episode 7: reward: 200.000, steps: 200
Episode 8: reward: 200.000, steps: 200
Episode 9: reward: 200.000, steps: 200
Episode 10: reward: 200.000, steps: 200
Episode 11: reward: 200.000, steps: 200
Episode 12: reward: 200.000, steps: 200
Episode 13: reward: 200.000, steps: 200
Episode 14: reward: 200.000, steps: 200
Episode 15: reward: 200.000, steps: 200
Episode 16: reward: 200.000, steps: 200
Episode 17: reward: 200.000, steps: 200
Episode 18: reward: 200.000, steps: 200
Episode 19: reward: 200.000, steps: 200
Episode 20: reward: 200.000, steps: 200


<keras.callbacks.History at 0x1ebff27af60>

## 5. Conclusion

* A DQN trained for only 5,000 steps performs fairly well for so little training: its test-episode rewards range from 30 to 79 (mean about 45) out of a possible 200.
* After a further 100,000 training steps on the same agent, the DQN performs exceedingly well: all 20 test episodes reach the maximum of 200 steps per episode.