# Keras Reinforcement Learning on Images with Breakout

In this notebook we train a Deep Q-Network (DQN) to play the Atari game Breakout! <br />
We start with the necessary imports and create the environment.

In [None]:
from PIL import Image  # To transform the image in the Processor
import numpy as np
import gym

# Convolutional Backbone Network
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten, Convolution2D, Permute
#from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.legacy import Adam

# Keras-RL
from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory
from rl.core import Processor
from rl.callbacks import FileLogger, ModelIntervalCheckpoint

In [None]:
from gym.wrappers import Monitor
env = gym.make("BreakoutDeterministic-v4")
#env = wrap_env(gym.make('BreakoutDeterministic-v4'))
nb_actions = env.action_space.n

We will use an input shape of $(84 \times 84)$ and a window length of 4, so each timestep will consist of 4 consecutive frames.

In [None]:
IMG_SHAPE = (84, 84)
WINDOW_LENGTH = 4
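
To make the window length more concrete, here is a minimal sketch of how 4 consecutive frames can be stacked into one state. Keras-RL does this stacking for us inside the memory, so the `frames` list and the `deque` below are only hypothetical stand-ins for illustration.

```python
from collections import deque

import numpy as np

IMG_SHAPE = (84, 84)
WINDOW_LENGTH = 4

# Hypothetical stand-ins for six consecutive preprocessed frames:
# frame t is filled with the value t so the stacking order is visible.
frames = [np.full(IMG_SHAPE, t, dtype=np.uint8) for t in range(6)]

# A deque with maxlen drops the oldest frame automatically.
window = deque(maxlen=WINDOW_LENGTH)
states = []
for frame in frames:
    window.append(frame)
    if len(window) == WINDOW_LENGTH:
        # One network input: the 4 most recent frames stacked on axis 0.
        states.append(np.stack(list(window), axis=0))

print(len(states))           # 3
print(states[0].shape)       # (4, 84, 84)
print(states[-1][:, 0, 0])   # [2 3 4 5]
```

Each state overlaps with the previous one in 3 of its 4 frames, which is what lets the network see motion such as the direction of the ball.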

Based on those settings we create our processor. It is the same processor as in the last notebook, with the addition that it scales the pixel values into the interval [0, 1], which often decreases the necessary training time. <br />
We perform this scaling in the process_state_batch function, which is only executed on the current batch and not on the complete replay memory. The replay memory therefore keeps the frames as uint8 (1 byte per pixel) instead of float32 (4 bytes per pixel), which decreases RAM usage by a factor of 4. <br />
Additionally we clip the reward to the interval [-1, 1], which might speed up the training.

In [None]:
class ImageProcessor(Processor):
    def process_observation(self, observation):
        # First convert the numpy array to a PIL Image
        img = Image.fromarray(observation)
        # Then resize the image
        img = img.resize(IMG_SHAPE)
        # And convert it to grayscale (the "L" stands for luminance)
        img = img.convert("L")
        # Convert the image back to a numpy array and finally return the image
        img = np.array(img)
        return img.astype('uint8')  # saves storage in experience memory
    
    def process_state_batch(self, batch):
        # Divide the observations by 255 to scale them into the interval [0, 1].
        # This supports the training of the network.
        # We perform this operation here to save memory.
        processed_batch = batch.astype('float32') / 255.
        return processed_batch

    def process_reward(self, reward):
        return np.clip(reward, -1., 1.)
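
The factor of 4 can be checked with plain NumPy. The sketch below uses a hypothetical random batch of 32 states (the batch size is an assumption made only for this illustration) and applies the same scaling and clipping as the processor.

```python
import numpy as np

# Hypothetical batch: 32 states, each made of 4 grayscale 84x84 frames.
batch_uint8 = np.random.randint(0, 256, size=(32, 4, 84, 84), dtype=np.uint8)

# Same scaling as process_state_batch: only this batch becomes float32.
batch_float = batch_uint8.astype("float32") / 255.0

print(batch_uint8.nbytes)                         # 903168
print(batch_float.nbytes)                         # 3612672
print(batch_float.nbytes // batch_uint8.nbytes)   # 4

# Reward clipping as in process_reward: values outside [-1, 1] are cut off.
clipped = np.clip(np.array([0.0, 1.0, 4.0, 7.0]), -1.0, 1.0)
print(clipped.tolist())                           # [0.0, 1.0, 1.0, 1.0]
```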

As our input consists of 4 consecutive frames, each having the shape $(84 \times 84)$, the memory delivers each state to the network with the shape $(4 \times 84 \times 84)$. <br />
But as the convolutional layers expect the channels in the last dimension, i.e. an input of shape $(84 \times 84 \times 4)$, we add a Permute layer at the beginning to move the frame axis to the end.


In [None]:
input_shape = (WINDOW_LENGTH, IMG_SHAPE[0], IMG_SHAPE[1])
input_shape
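
The effect of the Permute layer used in the model can be reproduced with NumPy. This is only a sketch on a hypothetical batch of zeros: Keras' Permute counts axes from 1 and leaves the batch axis out, so `Permute((2, 3, 1))` corresponds to the axis order `(0, 2, 3, 1)` on the full batch.

```python
import numpy as np

# Hypothetical batch of 2 states as delivered by the memory: (batch, 4, 84, 84).
batch = np.zeros((2, 4, 84, 84), dtype=np.float32)

# NumPy equivalent of Permute((2, 3, 1)): keep the batch axis in front
# and move the frame axis to the end.
channels_last = np.transpose(batch, (0, 2, 3, 1))

print(channels_last.shape)   # (2, 84, 84, 4)
```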

Now it is time to define the network!
We use the He normal weight initialization for the convolutional layers.

In [None]:
model = Sequential()
model.add(Permute((2, 3, 1), input_shape=input_shape))

model.add(Convolution2D(32, (8, 8), strides=(4, 4), kernel_initializer='he_normal'))
model.add(Activation('relu'))
model.add(Convolution2D(64, (4, 4), strides=(2, 2), kernel_initializer='he_normal'))
model.add(Activation('relu'))
model.add(Convolution2D(64, (3, 3), strides=(1, 1), kernel_initializer='he_normal'))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
model.summary()  # summary() prints the table itself and returns None
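
The output shapes listed in the summary can be verified by hand. The following sketch assumes Keras' default 'valid' padding (no padding argument is passed to the layers above); `conv_output_size` is a helper introduced only for this illustration.

```python
def conv_output_size(size, kernel, stride):
    """Output length along one axis for a convolution with 'valid' padding."""
    return (size - kernel) // stride + 1


size = 84
sizes = []
# (kernel, stride) of the three convolutional layers defined above
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)

print(sizes)       # [20, 9, 7]

# The last convolutional layer has 64 filters, so Flatten produces:
flattened = size * size * 64
print(flattened)   # 3136
```

The Dense layer with 512 units therefore receives a vector of 3,136 values.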

Now we define the memory. We again use the SequentialMemory, but this time with a window_length of 4!

In [None]:
memory = SequentialMemory(limit=1000000, window_length=WINDOW_LENGTH)

Then we define the processor

In [None]:
processor = ImageProcessor()

We again use a LinearAnnealedPolicy to implement the epsilon-greedy action selection with decaying epsilon.
As we need to train for at least a million steps, we anneal epsilon over 1,000,000 steps.

In [None]:
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1., value_min=.1, value_test=.05,
                              nb_steps=1000000)
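
To get a feeling for the schedule, here is a minimal sketch of a linear annealing rule with the values from above. `annealed_epsilon` is a hypothetical helper and not part of Keras-RL; we assume the policy decays epsilon linearly and then holds it at `value_min`.

```python
def annealed_epsilon(step, value_max=1.0, value_min=0.1, nb_steps=1_000_000):
    """Linear decay from value_max to value_min over nb_steps, then constant."""
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, value_max + slope * step)


for step in (0, 500_000, 1_000_000, 1_500_000):
    print(step, round(annealed_epsilon(step), 3))
# 0 1.0
# 500000 0.55
# 1000000 0.1
# 1500000 0.1
```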

Finally we define the agent and compile it. The agent is defined in the same way as in the previous lectures with an additional train_interval of 4 (we only train on every 4th step). <br />
Besides that we clip delta (the error) to 1.<br />
Both clipping and train_interval often improve the result.

In [None]:
dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy, memory=memory,
               processor=processor, nb_steps_warmup=50000, gamma=.99, target_model_update=10000,
               train_interval=4, delta_clip=1)

In [None]:
dqn.compile(Adam(learning_rate=.00025), metrics=['mae'])
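
To our understanding, Keras-RL applies delta_clip through a Huber-style loss: errors up to delta_clip are penalized quadratically, larger errors only linearly. The NumPy sketch below is an illustration under that assumption; `huber_loss` here is our own helper, not the library function.

```python
import numpy as np


def huber_loss(td_error, delta_clip=1.0):
    """Quadratic for |error| <= delta_clip, linear beyond that."""
    abs_error = np.abs(td_error)
    quadratic = 0.5 * np.square(td_error)
    linear = delta_clip * (abs_error - 0.5 * delta_clip)
    return np.where(abs_error <= delta_clip, quadratic, linear)


errors = np.array([0.5, 1.0, 3.0, 10.0])
clipped_loss = huber_loss(errors)
squared_loss = 0.5 * np.square(errors)

print(clipped_loss.tolist())   # [0.125, 0.5, 2.5, 9.5]
print(squared_loss.tolist())   # [0.125, 0.5, 4.5, 50.0]
```

A single large error of 10 contributes 9.5 instead of 50, so occasional outliers have much less influence on the weight update.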

As the training might take several hours, we store our current model every 100,000 steps. <br />
We can use *ModelIntervalCheckpoint(checkpoint_name, interval)* to do so. We store it in a callback variable, which we pass to the fit method as a callback.

In [None]:
weights_filename = 'dqn_breakout_weights.h5f'
checkpoint_weights_filename = 'dqn_' + "BreakoutDeterministic-v4" + '_weights_{step}.h5f'
checkpoint_callback = ModelIntervalCheckpoint(checkpoint_weights_filename, interval=100000)
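
The `{step}` placeholder in the filename is replaced by the current step count when a checkpoint is written. As a small illustration, assuming the callback fills it with ordinary string formatting, we can fill the placeholder by hand:

```python
checkpoint_weights_filename = 'dqn_' + "BreakoutDeterministic-v4" + '_weights_{step}.h5f'

# Illustration only: fill the {step} placeholder by hand for a few step counts.
names = [checkpoint_weights_filename.format(step=step)
         for step in (100000, 200000, 300000)]
for name in names:
    print(name)
# dqn_BreakoutDeterministic-v4_weights_100000.h5f
# dqn_BreakoutDeterministic-v4_weights_200000.h5f
# dqn_BreakoutDeterministic-v4_weights_300000.h5f
```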

If you want a head start on the training, or need to cancel the training and want to continue from the latest checkpoint, you can use the **load_weights()** function provided by TensorFlow. <br />
As we are not training from scratch, we also decrease the starting value for epsilon.<br />
Which value to use depends on your weight file. When you use checkpoint 900,000, set epsilon to 0.2.<br/>
If you use your own checkpoint, make sure to adjust epsilon!
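
As a rough check of the value for checkpoint 900,000, assume the checkpoint was produced with the schedule defined earlier, i.e. epsilon falls linearly from $\epsilon_{max} = 1.0$ to $\epsilon_{min} = 0.1$ over $N = 1{,}000{,}000$ steps. The value reached at step $t$ is then

$$\epsilon(t) = \max\left(\epsilon_{min},\ \epsilon_{max} - \frac{\epsilon_{max} - \epsilon_{min}}{N}\, t\right)$$

$$\epsilon(900{,}000) = 1.0 - \frac{1.0 - 0.1}{1{,}000{,}000} \cdot 900{,}000 = 0.19 \approx 0.2$$

For a checkpoint from a different schedule, insert its own values for $\epsilon_{max}$, $\epsilon_{min}$ and $N$.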

In [None]:
from tensorflow.keras.optimizers.legacy import Adam

# Load the weights
model.load_weights("weights/dqn_BreakoutDeterministic-v4_weights_1200000.h5f")

# Update the policy to start with a smaller epsilon
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=0.3, value_min=.1, value_test=.05,
                              nb_steps=100000)

# Initialize the DQNAgent with the new model and updated policy and compile it
dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy, memory=memory,
               processor=processor, nb_steps_warmup=50000, gamma=.99, target_model_update=10000)
dqn.compile(Adam(learning_rate=.00025), metrics=['mae'])

# And train the model
dqn.fit(env, nb_steps=500000, callbacks=[checkpoint_callback], log_interval=10000, visualize=False)

In [None]:
#dqn.test(env, nb_episodes=5, visualize=True)
dqn.save_weights(weights_filename, overwrite=True)

Alternatively, we train the model for 1.5 million steps. <br />
Be aware that this might take some time, so feel free to start the next lectures :)

In [None]:
dqn.fit(env, nb_steps=1500000, callbacks=[checkpoint_callback], log_interval=10000, visualize=False)

# After training is done, we save the final weights one more time.
dqn.save_weights(weights_filename, overwrite=True)

In [None]:
#dqn.test(env, nb_episodes=5, visualize=True)

If you only want to load your model for evaluation, you can use the exact same code from above without calling **fit()**. <br />
You can also leave out the warmup steps, gamma and the target model update arguments when defining the DQNAgent, as they are only needed for training.

In [205]:
env.close()
env.reset();

# Testing with different saved weights

In [206]:
# Load the weights
#model.load_weights("weights/dqn_BreakoutDeterministic-v4_weights_1200000.h5f")
#model.load_weights("weights/dqn_BreakoutDeterministic-v4_weights_1100000.h5f")
model.load_weights("dqn_BreakoutDeterministic-v4_weights_2500000.h5f") # Good
#model.load_weights("dqn_BreakoutDeterministic-v4_weights_1500000.h5f")
#model.load_weights("dqn_breakout_weights.h5f")

# You can choose an arbitrary policy for evaluation; here it is fixed to epsilon-greedy with epsilon = 0.1.
policy = EpsGreedyQPolicy(0.1)

# Initialize the DQNAgent with the new model and updated policy and compile it
dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy, memory=memory, processor=processor)
dqn.compile(Adam(learning_rate=.00025), metrics=['mae'])
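
For reference, epsilon-greedy action selection with a fixed epsilon of 0.1 can be sketched as follows. `eps_greedy_action` and the Q-values are hypothetical and only illustrate the idea; the agent itself uses the Keras-RL policy defined above.

```python
import numpy as np


def eps_greedy_action(q_values, eps, rng):
    """With probability eps pick a random action, otherwise the best one."""
    if rng.uniform() < eps:
        return int(rng.integers(0, len(q_values)))
    return int(np.argmax(q_values))


rng = np.random.default_rng(0)
# Hypothetical Q-values for the 4 Breakout actions NOOP, FIRE, RIGHT, LEFT.
q_values = np.array([0.1, 0.4, 1.2, 0.3])

actions = [eps_greedy_action(q_values, eps=0.1, rng=rng) for _ in range(10_000)]
greedy_share = float(np.mean(np.array(actions) == 2))
print(greedy_share)   # roughly 0.925: 0.9 greedy picks plus 0.1 * 1/4 random hits
```

Even during evaluation the agent therefore takes a random action in about every tenth step.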


In [207]:
import numpy as np
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #error only

In [208]:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

<pyvirtualdisplay.display.Display at 0x7f2c28689600>

In [209]:
import glob
import io
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay
def show_video(num=0):
    # Collect the recorded episodes from the video folder
    mp4list = glob.glob('video/*.mp4')
    if len(mp4list) > num:
        mp4 = mp4list[num]
        print(mp4)
        # Read the file in binary read-only mode and embed it as base64 in an HTML video tag
        with io.open(mp4, 'rb') as f:
            encoded = base64.b64encode(f.read())
        ipythondisplay.display(HTML(data='''<video alt="{1}" autoplay 
                controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'), mp4)))
    else:
        print("Could not find video")


def wrap_env(env):
    # Record videos into ./video; force=True clears earlier recordings
    return Monitor(env, './video', force=True)

In [210]:
env=gym.make('BreakoutDeterministic-v4')
env._max_episode_steps = 2000
print(env.unwrapped.get_action_meanings())
env = wrap_env(env)
dqn.test(env, nb_episodes=1, visualize=True)
env.close()

['NOOP', 'FIRE', 'RIGHT', 'LEFT']
Testing for 1 episodes ...



Episode 1: reward: 72.000, steps: 1696


In [211]:
show_video(0)

video/openaigym.video.30.90.video000000.mp4
