## Learn to play Breakout

### Requirements

- [Install `keras-rl`](https://github.com/keras-rl/keras-rl#installation):

      pip install keras-rl

- Install the `gym_breakout_pygame` package:

      pip install gym_breakout_pygame


In [1]:
import numpy as np
import gym
from gym_breakout_pygame.wrappers.observation_space import BreakoutN

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory


pygame 1.9.6
Hello from the pygame community. https://www.pygame.org/contribute.html


Using TensorFlow backend.


In [2]:
env = BreakoutN(encode=False)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

# Next, we build a very simple model.
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
model.summary()  # summary() prints the table itself and returns None


W0629 23:48:49.540298 140024841986176 deprecation_wrapper.py:119] From /home/marcofavorito/.virtualenvs/gym-breakout-pygame-7UQzWS9l/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0629 23:48:49.551870 140024841986176 deprecation_wrapper.py:119] From /home/marcofavorito/.virtualenvs/gym-breakout-pygame-7UQzWS9l/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0629 23:48:49.561582 140024841986176 deprecation_wrapper.py:119] From /home/marcofavorito/.virtualenvs/gym-breakout-pygame-7UQzWS9l/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
flatten_1 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                320       
_________________________________________________________________
activation_1 (Activation)    (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 64)                4160      
_________________________________________________________________
activation_2 (Activation)    (None, 64)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 195       
_________________________________________________________________
activation_3 (Activation)    (None, 3)                 0         
=================================================================
Total para
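
The per-layer parameter counts in the summary can be checked by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A minimal sketch of that arithmetic follows; the helper name `dense_params` is made up for illustration and is not part of Keras.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

# Layer widths read off the summary above: 4 inputs, 64, 64, 3 actions.
widths = [4, 64, 64, 3]
counts = [dense_params(a, b) for a, b in zip(widths, widths[1:])]
print(counts)       # [320, 4160, 195]
print(sum(counts))  # 4675
```

The `Flatten` and `Activation` layers contribute no trainable parameters, which is why they show `0` in the table.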

In [None]:
# Configure and compile the RL agent
memory = SequentialMemory(limit=50000, window_length=1)
policy = BoltzmannQPolicy()
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
               target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

# Train the agent for 30,000 environment steps
dqn.fit(env, nb_steps=30000, visualize=False, verbose=2)

# Save the learned weights
dqn.save_weights('dqn_{}_weights.h5f'.format("breakout-n"), overwrite=True)

# Finally, evaluate our algorithm for 5 episodes.
dqn.test(env, nb_episodes=5, visualize=True)

Training for 30000 steps ...




   419/30000: episode: 1, duration: 3.063s, episode steps: 419, steps per second: 137, episode reward: 15.000, mean reward: 0.036 [0.000, 5.000], mean action: 1.057 [0.000, 2.000], mean observation: 11.356 [0.000, 47.000], loss: 0.023504, mean_absolute_error: 5.422011, mean_q: 8.154562
   708/30000: episode: 2, duration: 1.514s, episode steps: 289, steps per second: 191, episode reward: 10.000, mean reward: 0.035 [0.000, 5.000], mean action: 0.931 [0.000, 2.000], mean observation: 11.562 [0.000, 47.000], loss: 0.033845, mean_absolute_error: 5.324571, mean_q: 7.997057
   989/30000: episode: 3, duration: 1.480s, episode steps: 281, steps per second: 190, episode reward: 10.000, mean reward: 0.036 [0.000, 5.000], mean action: 0.947 [0.000, 2.000], mean observation: 12.093 [0.000, 47.000], loss: 0.045569, mean_absolute_error: 5.272690, mean_q: 7.907522
  1270/30000: episode: 4, duration: 1.508s, episode steps: 281, steps per second: 186, episode reward: 10.000, mean reward: 0.036 [0.000, 5

  9265/30000: episode: 30, duration: 3.310s, episode steps: 557, steps per second: 168, episode reward: 20.000, mean reward: 0.036 [0.000, 5.000], mean action: 1.038 [0.000, 2.000], mean observation: 11.119 [0.000, 47.000], loss: 0.073086, mean_absolute_error: 5.475297, mean_q: 8.213145
  9676/30000: episode: 31, duration: 2.529s, episode steps: 411, steps per second: 162, episode reward: 15.000, mean reward: 0.036 [0.000, 5.000], mean action: 0.990 [0.000, 2.000], mean observation: 11.747 [0.000, 47.000], loss: 0.092612, mean_absolute_error: 5.488288, mean_q: 8.224232
  9965/30000: episode: 32, duration: 1.624s, episode steps: 289, steps per second: 178, episode reward: 10.000, mean reward: 0.035 [0.000, 5.000], mean action: 1.031 [0.000, 2.000], mean observation: 11.562 [0.000, 47.000], loss: 0.087115, mean_absolute_error: 5.484327, mean_q: 8.218536
 10246/30000: episode: 33, duration: 1.864s, episode steps: 281, steps per second: 151, episode reward: 10.000, mean reward: 0.036 [0.00

 19818/30000: episode: 59, duration: 1.007s, episode steps: 151, steps per second: 150, episode reward: 5.000, mean reward: 0.033 [0.000, 5.000], mean action: 1.046 [0.000, 2.000], mean observation: 12.611 [2.000, 47.000], loss: 0.048499, mean_absolute_error: 5.274741, mean_q: 7.892972
 20099/30000: episode: 60, duration: 1.833s, episode steps: 281, steps per second: 153, episode reward: 10.000, mean reward: 0.036 [0.000, 5.000], mean action: 0.989 [0.000, 2.000], mean observation: 11.076 [0.000, 47.000], loss: 0.083863, mean_absolute_error: 5.255237, mean_q: 7.854093
 20510/30000: episode: 61, duration: 2.638s, episode steps: 411, steps per second: 156, episode reward: 15.000, mean reward: 0.036 [0.000, 5.000], mean action: 0.971 [0.000, 2.000], mean observation: 11.533 [0.000, 47.000], loss: 0.081554, mean_absolute_error: 5.245347, mean_q: 7.848889
 20661/30000: episode: 62, duration: 0.980s, episode steps: 151, steps per second: 154, episode reward: 5.000, mean reward: 0.033 [0.000,

 29751/30000: episode: 88, duration: 0.921s, episode steps: 151, steps per second: 164, episode reward: 5.000, mean reward: 0.033 [0.000, 5.000], mean action: 0.980 [0.000, 2.000], mean observation: 12.175 [0.000, 47.000], loss: 0.060603, mean_absolute_error: 4.842140, mean_q: 7.244157
done, took 257.579 seconds
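
To track progress across episodes, the per-episode lines of the training log can be parsed into numbers. The sketch below assumes the log format shown above; the regular expression and the helper name `parse_episode_line` are illustrative and not part of `keras-rl`.

```python
import re

# Pattern for the per-episode lines in the training log above.
EPISODE_RE = re.compile(
    r"^\s*(\d+)/(\d+): episode: (\d+), .*?episode steps: (\d+), "
    r".*?episode reward: ([-\d.]+)"
)

def parse_episode_line(line):
    """Return (step, episode, episode_steps, reward), or None if no match."""
    m = EPISODE_RE.match(line)
    if m is None:
        return None
    step, _total, episode, ep_steps, reward = m.groups()
    return int(step), int(episode), int(ep_steps), float(reward)

sample = (
    "   419/30000: episode: 1, duration: 3.063s, episode steps: 419, "
    "steps per second: 137, episode reward: 15.000, mean reward: 0.036"
)
print(parse_episode_line(sample))  # (419, 1, 419, 15.0)
```

Applied to every line of a captured log, this gives a list of episode rewards that can be averaged or plotted.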
Testing for 5 episodes ...
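
The agent above is configured with `BoltzmannQPolicy`, which chooses actions by sampling from a softmax over the predicted Q-values rather than always taking the best one. The following is a standard-library sketch of that idea, not the `keras-rl` source: the function names and the temperature parameter `tau` are assumptions for illustration, and the library's implementation may differ in details such as value clipping.

```python
import math
import random

def boltzmann_probs(q_values, tau=1.0):
    """Softmax over Q-values with temperature tau."""
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    m = max(q_values)
    exps = [math.exp((q - m) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann_action(q_values, tau=1.0, rng=random):
    """Sample an action index with probability proportional to exp(Q / tau)."""
    probs = boltzmann_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

q = [8.0, 8.2, 7.9]  # made-up Q-values for the three actions
print(boltzmann_probs(q))
print(boltzmann_action(q))
```

A higher `tau` flattens the distribution and encourages exploration; a lower `tau` makes the choice closer to greedy.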
