# Deep Q Networks

Code can be found at: https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/

## 1. Setup

In [1]:
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

Using TensorFlow backend.


In [2]:
# Get the environment and extract the number of actions available
env = gym.make('CartPole-v0')
np.random.seed(0)
env.seed(0)
nb_actions = env.action_space.n

## 2. Testing with no AI

In [3]:
env.reset()
for _ in range(150):
    env.render()
    _, _, done, _ = env.step(env.action_space.sample())  # take a random action
    if done:
        env.reset()  # start a new episode once the current one has ended
env.close()



## 3. Create Neural Network

In [4]:
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
model.summary()  # summary() prints the table itself and returns None

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                80        
_________________________________________________________________
activation_1 (Activation)    (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 34        
_________________________________________________________________
activation_2 (Activation)    (None, 2)                 0         
Total params: 114
Trainable params: 114
Non-trainable params: 0
_________________________________________________________________
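The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A minimal sketch, where `dense_params` is a hypothetical helper written for this check and not part of Keras:

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters in a fully connected layer."""
    return n_inputs * n_units + n_units  # weights + biases


# CartPole observations have 4 components; the network above uses
# 16 hidden units and one output per action (2 actions).
hidden = dense_params(4, 16)   # dense_1
output = dense_params(16, 2)   # dense_2
total = hidden + output

print(hidden, output, total)   # 80 34 114
```

The Flatten and Activation layers contribute no parameters, so the two Dense layers account for the whole total.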


## 4. Train and Test a simple DQN (5,000 steps)

In [5]:
# Configure the exploration policy, the replay memory, and the agent
policy = EpsGreedyQPolicy()
memory = SequentialMemory(limit=50000, window_length=1)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
               target_model_update=1e-2, policy=policy)
# Compile with the Adam optimizer and track mean absolute error during training
dqn.compile(Adam(lr=1e-3), metrics=['mae'])
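The two less obvious settings in this cell are the exploration policy and `target_model_update`. The sketch below illustrates both ideas in plain Python. It is not keras-rl's implementation: `eps_greedy_action` and `soft_update` are hypothetical names, and it assumes keras-rl's documented behavior that `EpsGreedyQPolicy()` defaults to `eps=0.1` and that a `target_model_update` below 1 is treated as a soft-update rate.

```python
import random


def eps_greedy_action(q_values, eps=0.1, rng=random):
    """With probability eps pick a uniformly random action, else the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def soft_update(target_weights, online_weights, tau=1e-2):
    """Move each target weight a fraction tau toward the online weight."""
    return [(1.0 - tau) * t + tau * w
            for t, w in zip(target_weights, online_weights)]


rng = random.Random(0)
q_values = [0.2, 0.7]  # made-up Q-values for the two CartPole actions
actions = [eps_greedy_action(q_values, eps=0.1, rng=rng) for _ in range(1000)]
greedy_share = actions.count(1) / len(actions)  # close to 1 - eps/2 = 0.95

new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=1e-2)
print(greedy_share, new_target)
```

A small `tau` means the target network trails the online network slowly, which is commonly used to keep the Q-learning targets stable between updates.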

### Train for 5,000 Steps

In [6]:
# Training with verbose and visualization
dqn.fit(env, nb_steps=5000, visualize=True, verbose=2)

Training for 5000 steps ...
   10/5000: episode: 1, duration: 0.260s, episode steps: 10, steps per second: 38, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.000 [0.000, 0.000], mean observation: 0.149 [-1.938, 3.116], loss: --, mean_absolute_error: --, mean_q: --
   24/5000: episode: 2, duration: 0.654s, episode steps: 14, steps per second: 21, episode reward: 14.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.143 [0.000, 1.000], mean observation: 0.089 [-1.948, 3.035], loss: 0.589399, mean_absolute_error: 0.758852, mean_q: 0.680146
   34/5000: episode: 3, duration: 0.165s, episode steps: 10, steps per second: 60, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.100 [0.000, 1.000], mean observation: 0.132 [-1.556, 2.591], loss: 0.527845, mean_absolute_error: 0.736724, mean_q: 0.767827




   46/5000: episode: 4, duration: 0.200s, episode steps: 12, steps per second: 60, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.083 [0.000, 1.000], mean observation: 0.122 [-1.911, 3.024], loss: 0.474312, mean_absolute_error: 0.708733, mean_q: 0.845246
   56/5000: episode: 5, duration: 0.166s, episode steps: 10, steps per second: 60, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.100 [0.000, 1.000], mean observation: 0.136 [-1.615, 2.548], loss: 0.444827, mean_absolute_error: 0.703594, mean_q: 0.985417
   65/5000: episode: 6, duration: 0.147s, episode steps: 9, steps per second: 61, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.000 [0.000, 0.000], mean observation: 0.160 [-1.733, 2.800], loss: 0.544864, mean_absolute_error: 0.740440, mean_q: 1.102551
   76/5000: episode: 7, duration: 0.183s, episode steps: 11, steps per second: 60, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0

  349/5000: episode: 34, duration: 0.206s, episode steps: 12, steps per second: 58, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.083 [0.000, 1.000], mean observation: 0.094 [-1.986, 3.041], loss: 0.425101, mean_absolute_error: 1.285824, mean_q: 3.054163
  358/5000: episode: 35, duration: 0.148s, episode steps: 9, steps per second: 61, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.000 [0.000, 0.000], mean observation: 0.146 [-1.745, 2.772], loss: 0.345916, mean_absolute_error: 1.302348, mean_q: 3.080801
  367/5000: episode: 36, duration: 0.148s, episode steps: 9, steps per second: 61, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.000 [0.000, 0.000], mean observation: 0.121 [-1.805, 2.800], loss: 0.446412, mean_absolute_error: 1.367441, mean_q: 3.183436
  376/5000: episode: 37, duration: 0.149s, episode steps: 9, steps per second: 61, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0

  645/5000: episode: 64, duration: 0.162s, episode steps: 10, steps per second: 62, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.100 [0.000, 1.000], mean observation: 0.119 [-1.742, 2.611], loss: 0.303750, mean_absolute_error: 2.057117, mean_q: 4.493590
  654/5000: episode: 65, duration: 0.148s, episode steps: 9, steps per second: 61, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.111 [0.000, 1.000], mean observation: 0.140 [-1.609, 2.505], loss: 0.360788, mean_absolute_error: 2.108777, mean_q: 4.543900
  663/5000: episode: 66, duration: 0.149s, episode steps: 9, steps per second: 60, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.111 [0.000, 1.000], mean observation: 0.151 [-1.348, 2.294], loss: 0.332103, mean_absolute_error: 2.104213, mean_q: 4.419550
  671/5000: episode: 67, duration: 0.131s, episode steps: 8, steps per second: 61, episode reward: 8.000, mean reward: 1.000 [1.000, 1.000], mean action: 0

  952/5000: episode: 94, duration: 0.149s, episode steps: 9, steps per second: 60, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.000 [0.000, 0.000], mean observation: 0.146 [-1.737, 2.747], loss: 0.182172, mean_absolute_error: 2.595509, mean_q: 5.508574
  960/5000: episode: 95, duration: 0.137s, episode steps: 8, steps per second: 58, episode reward: 8.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.125 [0.000, 1.000], mean observation: 0.134 [-1.213, 2.026], loss: 0.204866, mean_absolute_error: 2.575796, mean_q: 5.419012
  971/5000: episode: 96, duration: 0.181s, episode steps: 11, steps per second: 61, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.182 [0.000, 1.000], mean observation: 0.145 [-1.512, 2.438], loss: 0.186981, mean_absolute_error: 2.637967, mean_q: 5.546784
  981/5000: episode: 97, duration: 0.164s, episode steps: 10, steps per second: 61, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action:

 1276/5000: episode: 124, duration: 0.166s, episode steps: 10, steps per second: 60, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.156 [-3.111, 1.985], loss: 0.615982, mean_absolute_error: 3.262220, mean_q: 6.140171
 1288/5000: episode: 125, duration: 0.198s, episode steps: 12, steps per second: 60, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.917 [0.000, 1.000], mean observation: -0.115 [-3.030, 1.963], loss: 1.082268, mean_absolute_error: 3.377720, mean_q: 6.359297
 1298/5000: episode: 126, duration: 0.169s, episode steps: 10, steps per second: 59, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 1.000 [1.000, 1.000], mean observation: -0.147 [-3.037, 1.924], loss: 0.632299, mean_absolute_error: 3.527035, mean_q: 6.675466
 1322/5000: episode: 127, duration: 0.393s, episode steps: 24, steps per second: 61, episode reward: 24.000, mean reward: 1.000 [1.000, 1.000], m

 2424/5000: episode: 153, duration: 0.198s, episode steps: 12, steps per second: 61, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.750 [0.000, 1.000], mean observation: -0.136 [-2.450, 1.518], loss: 2.141752, mean_absolute_error: 6.354073, mean_q: 12.150212
 2434/5000: episode: 154, duration: 0.163s, episode steps: 10, steps per second: 61, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.900 [0.000, 1.000], mean observation: -0.140 [-2.532, 1.538], loss: 3.473845, mean_absolute_error: 6.361268, mean_q: 11.948492
 2445/5000: episode: 155, duration: 0.181s, episode steps: 11, steps per second: 61, episode reward: 11.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.727 [0.000, 1.000], mean observation: -0.104 [-2.175, 1.414], loss: 2.899632, mean_absolute_error: 6.164294, mean_q: 11.625362
 2454/5000: episode: 156, duration: 0.148s, episode steps: 9, steps per second: 61, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], 

 2756/5000: episode: 182, duration: 0.165s, episode steps: 10, steps per second: 61, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.800 [0.000, 1.000], mean observation: -0.143 [-2.189, 1.339], loss: 3.851773, mean_absolute_error: 7.215198, mean_q: 13.681656
 2766/5000: episode: 183, duration: 0.168s, episode steps: 10, steps per second: 60, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.800 [0.000, 1.000], mean observation: -0.158 [-2.237, 1.356], loss: 5.431403, mean_absolute_error: 7.404189, mean_q: 13.753653
 2774/5000: episode: 184, duration: 0.131s, episode steps: 8, steps per second: 61, episode reward: 8.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.875 [0.000, 1.000], mean observation: -0.159 [-2.214, 1.326], loss: 3.912998, mean_absolute_error: 7.150518, mean_q: 13.398586
 2783/5000: episode: 185, duration: 0.149s, episode steps: 9, steps per second: 61, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], me

 3047/5000: episode: 211, duration: 0.166s, episode steps: 10, steps per second: 60, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.800 [0.000, 1.000], mean observation: -0.127 [-2.383, 1.549], loss: 3.057979, mean_absolute_error: 7.657626, mean_q: 14.509886
 3056/5000: episode: 212, duration: 0.148s, episode steps: 9, steps per second: 61, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.889 [0.000, 1.000], mean observation: -0.134 [-2.413, 1.571], loss: 5.918040, mean_absolute_error: 7.990569, mean_q: 14.761852
 3066/5000: episode: 213, duration: 0.165s, episode steps: 10, steps per second: 61, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.800 [0.000, 1.000], mean observation: -0.134 [-2.036, 1.172], loss: 2.467758, mean_absolute_error: 7.718955, mean_q: 14.715654
 3076/5000: episode: 214, duration: 0.165s, episode steps: 10, steps per second: 61, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], 

 3329/5000: episode: 241, duration: 0.149s, episode steps: 9, steps per second: 60, episode reward: 9.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.667 [0.000, 1.000], mean observation: -0.128 [-1.845, 1.158], loss: 4.914502, mean_absolute_error: 7.939067, mean_q: 14.761989
 3339/5000: episode: 242, duration: 0.168s, episode steps: 10, steps per second: 60, episode reward: 10.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.700 [0.000, 1.000], mean observation: -0.139 [-1.833, 1.173], loss: 3.954776, mean_absolute_error: 7.430270, mean_q: 13.726990
 3351/5000: episode: 243, duration: 0.196s, episode steps: 12, steps per second: 61, episode reward: 12.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.583 [0.000, 1.000], mean observation: -0.106 [-1.741, 1.165], loss: 2.447071, mean_absolute_error: 7.762321, mean_q: 14.575443
 3364/5000: episode: 244, duration: 0.218s, episode steps: 13, steps per second: 60, episode reward: 13.000, mean reward: 1.000 [1.000, 1.000], 

 4317/5000: episode: 270, duration: 0.628s, episode steps: 38, steps per second: 60, episode reward: 38.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.474 [0.000, 1.000], mean observation: -0.101 [-0.926, 0.371], loss: 3.433804, mean_absolute_error: 8.549006, mean_q: 16.095638
 4395/5000: episode: 271, duration: 1.298s, episode steps: 78, steps per second: 60, episode reward: 78.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.513 [0.000, 1.000], mean observation: 0.103 [-0.275, 0.768], loss: 2.514771, mean_absolute_error: 8.573539, mean_q: 16.261259
 4474/5000: episode: 272, duration: 1.317s, episode steps: 79, steps per second: 60, episode reward: 79.000, mean reward: 1.000 [1.000, 1.000], mean action: 0.481 [0.000, 1.000], mean observation: -0.035 [-0.689, 0.570], loss: 3.321003, mean_absolute_error: 8.714490, mean_q: 16.410271
 4534/5000: episode: 273, duration: 0.996s, episode steps: 60, steps per second: 60, episode reward: 60.000, mean reward: 1.000 [1.000, 1.000],

<keras.callbacks.History at 0x2344c50d518>

### Test

In [7]:
# Test the model
dqn.test(env, nb_episodes=20, visualize=True)

Testing for 20 episodes ...
Episode 1: reward: 62.000, steps: 62
Episode 2: reward: 64.000, steps: 64
Episode 3: reward: 66.000, steps: 66
Episode 4: reward: 71.000, steps: 71
Episode 5: reward: 50.000, steps: 50
Episode 6: reward: 59.000, steps: 59
Episode 7: reward: 48.000, steps: 48
Episode 8: reward: 36.000, steps: 36
Episode 9: reward: 189.000, steps: 189
Episode 10: reward: 140.000, steps: 140
Episode 11: reward: 72.000, steps: 72
Episode 12: reward: 48.000, steps: 48
Episode 13: reward: 200.000, steps: 200
Episode 14: reward: 75.000, steps: 75
Episode 15: reward: 55.000, steps: 55
Episode 16: reward: 50.000, steps: 50
Episode 17: reward: 73.000, steps: 73
Episode 18: reward: 106.000, steps: 106
Episode 19: reward: 46.000, steps: 46
Episode 20: reward: 79.000, steps: 79


<keras.callbacks.History at 0x23456d4b630>

## 5. Train and Test a DQN (100,000 steps)

In [8]:
# Continue training the same agent for 100,000 more steps (default logging, no rendering)
dqn.fit(env, nb_steps=100000)

Training for 100000 steps ...
Interval 1 (0 steps performed)
73 episodes - episode_reward: 136.918 [30.000, 200.000] - loss: 5.918 - mean_absolute_error: 15.723 - mean_q: 30.942

Interval 2 (10000 steps performed)
50 episodes - episode_reward: 197.160 [157.000, 200.000] - loss: 10.537 - mean_absolute_error: 27.565 - mean_q: 55.684

Interval 3 (20000 steps performed)
53 episodes - episode_reward: 190.226 [130.000, 200.000] - loss: 12.308 - mean_absolute_error: 34.244 - mean_q: 69.635

Interval 4 (30000 steps performed)
54 episodes - episode_reward: 185.444 [128.000, 200.000] - loss: 12.951 - mean_absolute_error: 37.171 - mean_q: 75.856

Interval 5 (40000 steps performed)
58 episodes - episode_reward: 172.190 [135.000, 200.000] - loss: 11.832 - mean_absolute_error: 39.074 - mean_q: 80.075

Interval 6 (50000 steps performed)
53 episodes - episode_reward: 187.226 [11.000, 200.000] - loss: 5.755 - mean_absolute_error: 39.351 - mean_q: 79.483

Interval 7 (60000 steps performed)
50 episodes -

<keras.callbacks.History at 0x23456d665c0>

In [9]:
# Test the model
dqn.test(env, nb_episodes=20, visualize=True)

Testing for 20 episodes ...
Episode 1: reward: 200.000, steps: 200
Episode 2: reward: 200.000, steps: 200
Episode 3: reward: 200.000, steps: 200
Episode 4: reward: 200.000, steps: 200
Episode 5: reward: 200.000, steps: 200
Episode 6: reward: 200.000, steps: 200
Episode 7: reward: 200.000, steps: 200
Episode 8: reward: 200.000, steps: 200
Episode 9: reward: 200.000, steps: 200
Episode 10: reward: 200.000, steps: 200
Episode 11: reward: 200.000, steps: 200
Episode 12: reward: 200.000, steps: 200
Episode 13: reward: 200.000, steps: 200
Episode 14: reward: 200.000, steps: 200
Episode 15: reward: 200.000, steps: 200
Episode 16: reward: 200.000, steps: 200
Episode 17: reward: 200.000, steps: 200
Episode 18: reward: 200.000, steps: 200
Episode 19: reward: 200.000, steps: 200
Episode 20: reward: 200.000, steps: 200


<keras.callbacks.History at 0x23456d66828>

## 6. Conclusion

* A DQN trained for only 5,000 steps performs fairly well: it earned the maximum reward of 200, a perfect score, in one test episode. However, the agent exceeded 100 time steps in only 4 of the 20 test episodes.
* After a further 100,000 training steps, on the other hand, the DQN performed exceedingly well: the agent reached the maximum of 200 steps in every one of the 20 test episodes.
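The figures for the 5,000-step agent can be recomputed directly from the test output in section 4. A short sketch, with the rewards copied by hand from that output (the variable names are illustrative only):

```python
from statistics import mean

# Rewards of the 20 test episodes after 5,000 training steps,
# copied from the test output in section 4.
rewards_5k = [62, 64, 66, 71, 50, 59, 48, 36, 189, 140,
              72, 48, 200, 75, 55, 50, 73, 106, 46, 79]

mean_reward = mean(rewards_5k)
over_100 = sum(r > 100 for r in rewards_5k)   # episodes longer than 100 steps
perfect = sum(r == 200 for r in rewards_5k)   # episodes hitting the 200-step cap

print(f"mean={mean_reward:.2f}, over 100: {over_100}/20, perfect: {perfect}/20")
```

The same summary for the 100,000-step agent is trivial, since all 20 episodes reached 200.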