# Convolutional DQN

**Work in progress!** Please forgive lack of clarity, bugs, and typos.

This notebook follows on from [Deep Q-learner starter code](https://www.kaggle.com/garethjns/deep-q-learner-starter-code), which includes code for running a DQN in Python using Keras with a simple fully connected model. Here we add a convolutional model instead, which is slightly less straightforward than the typical Pong-solving networks. We'll also develop the wrappers required to use the observations from the SMM version of the GFootball environment.

To keep this notebook short and focused, the example training the model uses my [reinforcement-learning-keras](https://github.com/garethjns/reinforcement-learning-keras) package. Alternatively, the code in [Deep Q-learner starter code](https://www.kaggle.com/garethjns/deep-q-learner-starter-code) can be used instead, without requiring any additional packages. In either case, the model needs to be saved as a Keras model to be used in a submission. This conversion process is shown below, and can be run locally by following the (again, WIP) structure and examples here: https://github.com/garethjns/kaggle-football

# Set up

In [None]:
# GFootball environment.
!pip install kaggle_environments
!apt-get update -y
!apt-get install -y libsdl2-gfx-dev libsdl2-ttf-dev
!git clone -b v2.3 https://github.com/google-research/football.git
!mkdir -p football/third_party/gfootball_engine/lib
!wget https://storage.googleapis.com/gfootball/prebuilt_gameplayfootball_v2.3.so -O football/third_party/gfootball_engine/lib/prebuilt_gameplayfootball.so
!cd football && GFOOTBALL_USE_PREBUILT_SO=1 pip3 install .

# Some helper code
!git clone https://github.com/garethjns/kaggle-football.git
!pip install reinforcement_learning_keras==0.6.0

In [None]:
import collections
from typing import Union, Callable, List, Tuple, Iterable, Any, Dict
from dataclasses import dataclass
from tqdm import tqdm
import matplotlib.pyplot as plt
import numpy as np
from tensorflow import keras
import tensorflow as tf
import seaborn as sns
import gym
import gfootball
import glob 
import imageio
import pathlib
import zlib
import pickle
import tempfile
import os
import sys
from IPython.display import Image, display

sns.set()

# In TF > 2, training keras models in a loop with eager execution on causes memory leaks and terrible performance.
tf.compat.v1.disable_eager_execution()

sys.path.append("/kaggle/working/kaggle-football/")

# Environment

## SMM observation space

The "GFootball-11_vs_11_kaggle-SMM-v0" environment in GFootball uses the [SMMWrapper](https://github.com/google-research/football/blob/master/gfootball/doc/observation.md) observation wrapper to return frames showing team positions, ball position, and current active player position.

In [None]:
from kaggle_football.viz import generate_gif, plot_smm_obs


smm_env = gym.make("GFootball-11_vs_11_kaggle-SMM-v0")
print(smm_env.reset().shape)

generate_gif(smm_env, n_steps=500, expected_min=0, expected_max=255)
Image(filename='smm_env_replay.gif', format='png')

These frames can be used by a convolutional neural network, but require pre-processing first.

 - Reshaping
   - 2D convolutional layers typically expect input with the shape (None, x, y, c), where c is the colour channel and is 1 or 3. We need to split the (x, y, 4) observation array into 4 (None, 72, 96, 1) arrays.
 - Scaling 
   - The data range is 0 -> 255. This needs to be scaled to 0 -> 1.
 - Adding time
   - Static frames don't convey direction or speed, only position. We can add temporal information to each frame in a couple of ways:
       - Remember the past n frames, and also give those to the network in the shape (None, x, y, n_buffer). This is very effective for solving Pong, but it increases the dimensionality of the network input.
       - Remember the previous frame and calculate the difference between it and the latest frame. We'll do that here.
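
As a rough illustration of the scaling and differencing steps, here is a minimal sketch on dummy data. The frames are made up rather than taken from the environment, and the "ball" pixel positions are arbitrary; only the shapes and value ranges follow the description above.

```python
import collections

import numpy as np

# Dummy SMM-style frames: shape (72, 96, 4), values 0 or 255.
prev_frame = np.zeros((72, 96, 4), dtype=np.uint8)
next_frame = np.zeros((72, 96, 4), dtype=np.uint8)
prev_frame[10, 20, 2] = 255  # "ball" at row 10, col 20
next_frame[10, 21, 2] = 255  # "ball" has moved one pixel to the right

# Scaling: 0 -> 255 becomes 0 -> 1.
buffer = collections.deque(maxlen=2)
buffer.append(prev_frame / 255.0)
buffer.append(next_frame / 255.0)

# Adding time: newest frame minus previous frame, range -1 -> 1.
diff = buffer[1] - buffer[0]

print(diff.shape)       # (72, 96, 4)
print(diff[10, 20, 2])  # -1.0, the ball left this pixel
print(diff[10, 21, 2])  # 1.0, the ball arrived at this pixel
```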
       
These 3 steps can be accomplished using another wrapper class that follows the expected Gym API:

In [None]:
class SMMFrameProcessWrapper(gym.Wrapper):
    """
    Wrapper for processing frames from SMM observation wrapper from football env.

    Input is (72, 96, 4), where last dim is (team 1 pos, team 2 pos, ball pos, 
    active player pos). Range 0 -> 255.
    Output is (72, 96, 4) as difference to last frame for all. Range -1 -> 1
    """

    def __init__(self, env: Union[None, gym.Env] = None,
                 obs_shape: Tuple[int, int, int] = (72, 96, 4)) -> None:
        """
        :param env: Gym env, or None. Allowing None here is unusual,
                    but we'll reuse the buffer functionality later in
                    the submission, when we won't be using the gym API.
        :param obs_shape: Expected shape of single observation.
        """
        if env is not None:
            super().__init__(env)
        self._buffer_length = 2
        self._obs_shape = obs_shape
        self._prepare_obs_buffer()

    @staticmethod
    def _normalise_frame(frame: np.ndarray):
        return frame / 255.0

    def _prepare_obs_buffer(self) -> None:
        """Create buffer and preallocate with empty arrays of expected shape."""

        self._obs_buffer = collections.deque(maxlen=self._buffer_length)

        for _ in range(self._buffer_length):
            self._obs_buffer.append(np.zeros(shape=self._obs_shape))

    def build_buffered_obs(self) -> np.ndarray:
        """
        Iterate over the last dimension, and take the difference between this obs
        and the last obs for each.
        """
        agg_buff = np.empty(self._obs_shape)
        for f in range(self._obs_shape[-1]):
            agg_buff[..., f] = self._obs_buffer[1][..., f] - self._obs_buffer[0][..., f]

        return agg_buff

    def step(self, action: int) -> Tuple[np.ndarray, float, bool, Dict[Any, Any]]:
        """Step env, add new obs to buffer, return buffer."""
        obs, reward, done, info = self.env.step(action)

        obs = self._normalise_frame(obs)
        self._obs_buffer.append(obs)

        return self.build_buffered_obs(), reward, done, info

    def reset(self) -> np.ndarray:
        """Add initial obs to end of pre-allocated buffer.

        :return: Buffered observation
        """
        self._prepare_obs_buffer()
        obs = self.env.reset()

        # Normalise as in step(), so the first frame is also in range 0 -> 1.
        obs = self._normalise_frame(obs)
        self._obs_buffer.append(obs)

        return self.build_buffered_obs()

The environment is wrapped by running:

In [None]:
smm_env = gym.make("GFootball-11_vs_11_kaggle-SMM-v0")
wrapped_smm_env = SMMFrameProcessWrapper(smm_env)
print(wrapped_smm_env.reset().shape)

And the output now looks a bit different....

In [None]:
generate_gif(wrapped_smm_env, n_steps=500, suffix="wrapped_smm_env_", expected_min=-1, expected_max=1)
Image(filename='wrapped_smm_env_replay.gif', format='png')

One potential downside here is that non-moving players are invisible!
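
A small sketch of why this happens, using made-up single-channel frames rather than real observations: a player that occupies the same pixel in both frames cancels out in the difference.

```python
import numpy as np

frame_t0 = np.zeros((72, 96), dtype=np.float32)
frame_t1 = np.zeros((72, 96), dtype=np.float32)

# A stationary player occupies the same pixel in both frames.
frame_t0[30, 40] = 1.0
frame_t1[30, 40] = 1.0
# A moving player shifts by one pixel.
frame_t0[50, 10] = 1.0
frame_t1[50, 11] = 1.0

diff = frame_t1 - frame_t0

print(diff[30, 40])            # 0.0: the stationary player vanishes
print(np.count_nonzero(diff))  # 2: only the moving player leaves a trace
```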

# Model

The class below builds a convolutional network, but there's an additional complication compared to building networks to solve games like Pong. The input here is 4 independent frames, rather than a single frame. We could perhaps combine them in the wrapper; another solution is to build a network with multiple inputs.

Here the first layer in the network is a custom layer that splits the input array (72, 96, 4) on the last dimension.
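
To show what the split produces, here is a NumPy-only sketch on a dummy batch. It mirrors the `tf.expand_dims(inputs[..., i], 3)` call used in the layer, but it is an illustration of the shapes rather than the layer itself.

```python
import numpy as np

# A dummy batch containing one observation, shape (batch, x, y, channels).
batch = np.random.rand(1, 72, 96, 4)

# Take each channel in turn and restore the trailing channel axis.
frames = [np.expand_dims(batch[..., i], 3) for i in range(batch.shape[3])]

print(len(frames))      # 4
print(frames[0].shape)  # (1, 72, 96, 1)
```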

In [None]:
class SplitLayer(keras.layers.Layer):
    def __init__(self, split_dim: int = 3) -> None:
        super().__init__()
        self.split_dim = split_dim

    def call(self, inputs) -> List[tf.Tensor]:
        """Split a given dim into separate tensors."""
        return [tf.expand_dims(inputs[..., i], self.split_dim)
                for i in range(inputs.shape[self.split_dim])]

After going through the split layer, each array (now (72, 96, 1)) goes through a separate convolutional branch of the model, and the branch outputs are concatenated and fed into a Dense layer.
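
The size of that concatenated input can be worked out by hand. The sketch below assumes the filter counts and strides used in the branch definition in the next cell, and relies on the fact that a convolution with `padding='same'` produces an output of length ceil(size / stride). The helper name is only for illustration.

```python
import math


def same_padding_out(size: int, stride: int) -> int:
    """Output length of a conv with padding='same': ceil(size / stride)."""
    return math.ceil(size / stride)


height, width = 72, 96
# (filters, stride) for conv1, conv2, conv3 in each branch.
layers = [(16, 4), (24, 2), (32, 1)]

for filters, stride in layers:
    height = same_padding_out(height, stride)
    width = same_padding_out(width, stride)
    print((height, width, filters))
# (18, 24, 16)
# (9, 12, 24)
# (9, 12, 32)

flat_per_branch = height * width * filters
concat_size = 4 * flat_per_branch
print(flat_per_branch, concat_size)  # 3456 13824
```

With a 512-unit first dense layer, that concatenated vector accounts for roughly 7 million weights, so most of the model's parameters sit in `fc1`.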

In [None]:
class SplitterConvNN:

    def __init__(self, observation_shape: List[int], n_actions: int, 
                 output_activation: Union[None, str] = None,
                 unit_scale: int = 1, learning_rate: float = 0.0001, 
                 opt: str = 'Adam') -> None:
        """
        :param observation_shape: Tuple specifying input shape.
        :param n_actions: Int specifying number of outputs
        :param output_activation: Activation function for output. Eg. 
                                  None for value estimation (off-policy methods).
        :param unit_scale: Multiplier for all units in FC layers in network 
                           (not used here at the moment).
        :param opt: Keras optimiser to use. Should be string. 
                    This is to avoid storing TF/Keras objects here.
        :param learning_rate: Learning rate for optimiser.

        """
        self.observation_shape = observation_shape
        self.n_actions = n_actions
        self.unit_scale = unit_scale
        self.output_activation = output_activation
        self.learning_rate = learning_rate
        self.opt = opt

    @staticmethod
    def _build_conv_branch(frame: keras.layers.Layer, name: str) -> keras.layers.Layer:
        conv1 = keras.layers.Conv2D(16, kernel_size=(8, 8), strides=(4, 4),
                                    name=f'conv1_frame_{name}', padding='same', 
                                    activation='relu')(frame)
        conv2 = keras.layers.Conv2D(24, kernel_size=(4, 4), strides=(2, 2),
                                    name=f'conv2_frame_{name}', padding='same', 
                                    activation='relu')(conv1)
        conv3 = keras.layers.Conv2D(32, kernel_size=(3, 3), strides=(1, 1),
                                    name=f'conv3_frame_{name}', padding='same', 
                                    activation='relu')(conv2)

        flatten = keras.layers.Flatten(name=f'flatten_{name}')(conv3)

        return flatten

    def _model_architecture(self) -> Tuple[keras.layers.Layer, keras.layers.Layer]:
        n_units = 512 * self.unit_scale

        frames_input = keras.layers.Input(name='input', shape=self.observation_shape)
        frames_split = SplitLayer(split_dim=3)(frames_input)
        conv_branches = []
        for f, frame in enumerate(frames_split):
            conv_branches.append(self._build_conv_branch(frame, name=str(f)))

        concat = keras.layers.concatenate(conv_branches)
        fc1 = keras.layers.Dense(units=int(n_units), name='fc1', 
                                 activation='relu')(concat)
        fc2 = keras.layers.Dense(units=int(n_units / 2), name='fc2', 
                                 activation='relu')(fc1)
        action_output = keras.layers.Dense(units=self.n_actions, name='output',
                                           activation=self.output_activation)(fc2)

        return frames_input, action_output

    def compile(self, model_name: str = 'model', 
                loss: Union[str, Callable] = 'mse') -> keras.Model:
        """
        Compile a copy of the model using the provided loss.

        :param model_name: Name of model
        :param loss: Model loss. Default 'mse'. Can be custom callable.
        """
        # Get optimiser
        if self.opt.lower() == 'adam':
            opt = keras.optimizers.Adam
        elif self.opt.lower() == 'rmsprop':
            opt = keras.optimizers.RMSprop
        else:
            raise ValueError(f"Invalid optimiser {self.opt}")

        state_input, action_output = self._model_architecture()
        model = keras.Model(inputs=[state_input], outputs=[action_output], 
                            name=model_name)
        model.compile(optimizer=opt(learning_rate=self.learning_rate), 
                      loss=loss)

        return model

    def plot(self, model_name: str = 'model') -> None:
        keras.utils.plot_model(self.compile(model_name), 
                               to_file=f"{model_name}.png", show_shapes=True)
        plt.show()


mod = SplitterConvNN(observation_shape=wrapped_smm_env.observation_space.shape, 
                     n_actions=wrapped_smm_env.action_space.n)
mod.compile()
mod.plot()
Image(filename='model.png') 

This model works as a drop-in replacement for the dense model shown in [Deep Q-learner starter code](https://www.kaggle.com/garethjns/deep-q-learner-starter-code), and no additional changes to the replay buffer or agent are required. It's possible to reuse the code there with this model, although to avoid copying and pasting that code, I'm going to import from [reinforcement-learning-keras](https://github.com/garethjns/reinforcement-learning-keras) instead.

One difference to bear in mind is that the agent here handles building and wrapping the env, so we only need to specify the env name and the wrapper to add. To use the environment and wrapper in a training loop as shown in [Deep Q-learner starter code](https://www.kaggle.com/garethjns/deep-q-learner-starter-code), this code should be used:

```python
smm_env = gym.make("GFootball-11_vs_11_kaggle-SMM-v0")
wrapped_smm_env = SMMFrameProcessWrapper(smm_env)
```

In [None]:
from reinforcement_learning_keras.agents.components.history.training_history import TrainingHistory
from reinforcement_learning_keras.agents.components.replay_buffers.continuous_buffer import ContinuousBuffer
from reinforcement_learning_keras.agents.q_learning.deep_q_agent import DeepQAgent
from reinforcement_learning_keras.agents.q_learning.exploration.epsilon_greedy import EpsilonGreedy

agent = DeepQAgent(
    name='deep_q',
    model_architecture=SplitterConvNN(observation_shape=(72, 96, 4), 
                                      n_actions=19),
    replay_buffer=ContinuousBuffer(buffer_size=300),
    env_spec="GFootball-11_vs_11_kaggle-SMM-v0",
    env_wrappers=[SMMFrameProcessWrapper],
    eps=EpsilonGreedy(eps_initial=0.5, 
                      decay=0.001, 
                      eps_min=0.01, 
                      decay_schedule='linear'),
    training_history=TrainingHistory(agent_name='deep_q', 
                                     plotting_on=True, 
                                     plot_every=5, 
                                     rolling_average=5)
)

agent.train(verbose=True, render=False,
            n_episodes=2, max_episode_steps=100, 
            update_every=10, checkpoint_every=10)

Note how training becomes extremely slow after the replay buffer is filled and the model training begins. This should be much faster on GPU.
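
For reference, the `EpsilonGreedy` settings passed to the agent above describe an exploration rate that starts at 0.5 and decays linearly to 0.01. The sketch below is one plausible reading of those parameters, not the package's implementation: `linear_epsilon` and `select_action` are hypothetical names, and decaying once per step (rather than per episode) is an assumption.

```python
import numpy as np


def linear_epsilon(step: int, eps_initial: float = 0.5,
                   decay: float = 0.001, eps_min: float = 0.01) -> float:
    """Assumed schedule: subtract `decay` per step, never below eps_min."""
    return max(eps_min, eps_initial - decay * step)


def select_action(q_values: np.ndarray, eps: float,
                  rng: np.random.Generator) -> int:
    """With probability eps pick a random action, otherwise the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


rng = np.random.default_rng(0)
q_values = np.zeros(19)
q_values[7] = 1.0

print(linear_epsilon(0))     # 0.5
print(linear_epsilon(1000))  # 0.01
print(select_action(q_values, eps=0.0, rng=rng))  # 7
```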

# Creating submission

This approach saves everything required to run the agent into one python file, including the model weights.

## Saving model weights

The submitted agent only needs the action model from the agent trained above. This requires the model architecture and custom layer defined above, and the trained weights. The trained model, including its weights, can be saved with:

In [None]:
agent._action_model.save("saved_model/")
!ls

## Constructing main.py

We need:
  - The agent(obs) function defined
  - The action model redefined
    - Architecture (including custom layer)
    - The weights serialised above
  - The environment wrapper redefined - we need to buffer the observations as in training.

We **don't** need:
  - Components of the agent only used in training
    - The replay buffer
    - EpsilonGreedy action selection
  
The agent function is new and needs to handle:
 - Taking the raw observation and processing to the same shape that the model was trained on - ie. the SMM space with the buffering added by SMMFrameProcessWrapper.
 - Using the naked keras model for prediction
 - Applying the "policy" of the Q learner, which is just argmax over the action values
 - Return the action predicted to be most valuable
    
It'll look something like this:
```python
def agent(obs):

    # Use the existing model and obs buffer on each call to agent
    global tf_mod
    global obs_buffer

    # Get the raw observations returned by the environment
    obs = obs['players_raw'][0]
    # Convert these to the same output as the SMMWrapper we used in training.
    # generate_smm returns a batch, so take the single frame from it.
    obs = observation_preprocessing.generate_smm([obs])[0]

    # Use the SMMFrameProcessWrapper to do the scaling and buffering, but not
    # environment stepping or anything related to the Gym API.
    obs_buffer._obs_buffer.append(obs_buffer._normalise_frame(obs))
    buffered_obs = obs_buffer.build_buffered_obs()

    # Predict action values from the keras model (it expects a batch dimension)
    actions = tf_mod.predict(np.expand_dims(buffered_obs, axis=0))
    action = int(np.argmax(actions))

    return [action]
```
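
The argmax "policy" step can be tried in isolation. In this sketch the action values are made up and stand in for the output of `tf_mod.predict`, which has one row per observation and one column per action. Converting the result to a plain `int` is a precaution in case the returned action is serialised as JSON, since NumPy integers are not JSON serialisable by default.

```python
import json

import numpy as np

# Stand-in for the model output: one row of 19 action values.
action_values = np.array([[0.1, -0.3, 0.9] + [0.0] * 16])

# np.argmax returns a NumPy integer; int() makes it a plain Python int.
action = int(np.argmax(action_values[0]))

print(action_values.shape)   # (1, 19)
print(json.dumps([action]))  # [2]
```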

In [None]:
%%writefile main.py


import os
import collections
import pickle
import zlib
from typing import Tuple, Dict, Any, Union, Callable, List

import gym
import numpy as np
import tensorflow as tf
from gfootball.env import observation_preprocessing
from tensorflow import keras


class SMMFrameProcessWrapper(gym.Wrapper):
    """
    Wrapper for processing frames from SMM observation wrapper from football env.

    Input is (72, 96, 4), where last dim is (team 1 pos, team 2 pos, ball pos,
    active player pos). Range 0 -> 255.
    Output is (72, 96, 4) as difference to last frame for all. Range -1 -> 1
    """

    def __init__(self, env: Union[None, gym.Env] = None,
                 obs_shape: Tuple[int, int, int] = (72, 96, 4)) -> None:
        """
        :param env: Gym env.
        :param obs_shape: Expected shape of single observation.
        """
        if env is not None:
            super().__init__(env)
        self._buffer_length = 2
        self._obs_shape = obs_shape
        self._prepare_obs_buffer()

    @staticmethod
    def _normalise_frame(frame: np.ndarray):
        return frame / 255.0

    def _prepare_obs_buffer(self) -> None:
        """Create buffer and preallocate with empty arrays of expected shape."""

        self._obs_buffer = collections.deque(maxlen=self._buffer_length)

        for _ in range(self._buffer_length):
            self._obs_buffer.append(np.zeros(shape=self._obs_shape))

    def build_buffered_obs(self) -> np.ndarray:
        """
        Iterate over the last dimension, and take the difference between this obs
        and the last obs for each.
        """
        agg_buff = np.empty(self._obs_shape)
        for f in range(self._obs_shape[-1]):
            agg_buff[..., f] = self._obs_buffer[1][..., f] - self._obs_buffer[0][..., f]

        return agg_buff

    def step(self, action: int) -> Tuple[np.ndarray, float, bool, Dict[Any, Any]]:
        """Step env, add new obs to buffer, return buffer."""
        obs, reward, done, info = self.env.step(action)

        obs = self._normalise_frame(obs)
        self._obs_buffer.append(obs)

        return self.build_buffered_obs(), reward, done, info

    def reset(self) -> np.ndarray:
        """Add initial obs to end of pre-allocated buffer.

        :return: Buffered observation
        """
        self._prepare_obs_buffer()
        obs = self.env.reset()

        # Normalise as in step(), so the first frame is also in range 0 -> 1.
        obs = self._normalise_frame(obs)
        self._obs_buffer.append(obs)

        return self.build_buffered_obs()

    
    
class SplitLayer(keras.layers.Layer):
    def __init__(self, split_dim: int = 3) -> None:
        super().__init__()
        self.split_dim = split_dim

    def call(self, inputs) -> List[tf.Tensor]:
        """Split a given dim into separate tensors."""
        return [tf.expand_dims(inputs[..., i], self.split_dim)
                for i in range(inputs.shape[self.split_dim])]

    
class SplitterConvNN:

    def __init__(self, observation_shape: List[int], n_actions: int, 
                 output_activation: Union[None, str] = None,
                 unit_scale: int = 1, learning_rate: float = 0.0001, 
                 opt: str = 'Adam') -> None:
        """
        :param observation_shape: Tuple specifying input shape.
        :param n_actions: Int specifying number of outputs
        :param output_activation: Activation function for output. Eg. 
                                  None for value estimation (off-policy methods).
        :param unit_scale: Multiplier for all units in FC layers in network 
                           (not used here at the moment).
        :param opt: Keras optimiser to use. Should be string. 
                    This is to avoid storing TF/Keras objects here.
        :param learning_rate: Learning rate for optimiser.

        """
        self.observation_shape = observation_shape
        self.n_actions = n_actions
        self.unit_scale = unit_scale
        self.output_activation = output_activation
        self.learning_rate = learning_rate
        self.opt = opt

    @staticmethod
    def _build_conv_branch(frame: keras.layers.Layer, name: str) -> keras.layers.Layer:
        conv1 = keras.layers.Conv2D(16, kernel_size=(8, 8), strides=(4, 4),
                                    name=f'conv1_frame_{name}', padding='same', 
                                    activation='relu')(frame)
        conv2 = keras.layers.Conv2D(24, kernel_size=(4, 4), strides=(2, 2),
                                    name=f'conv2_frame_{name}', padding='same', 
                                    activation='relu')(conv1)
        conv3 = keras.layers.Conv2D(32, kernel_size=(3, 3), strides=(1, 1),
                                    name=f'conv3_frame_{name}', padding='same', 
                                    activation='relu')(conv2)

        flatten = keras.layers.Flatten(name=f'flatten_{name}')(conv3)

        return flatten

    def _model_architecture(self) -> Tuple[keras.layers.Layer, keras.layers.Layer]:
        n_units = 512 * self.unit_scale

        frames_input = keras.layers.Input(name='input', shape=self.observation_shape)
        frames_split = SplitLayer(split_dim=3)(frames_input)
        conv_branches = []
        for f, frame in enumerate(frames_split):
            conv_branches.append(self._build_conv_branch(frame, name=str(f)))

        concat = keras.layers.concatenate(conv_branches)
        fc1 = keras.layers.Dense(units=int(n_units), name='fc1', 
                                 activation='relu')(concat)
        fc2 = keras.layers.Dense(units=int(n_units / 2), name='fc2', 
                                 activation='relu')(fc1)
        action_output = keras.layers.Dense(units=self.n_actions, name='output',
                                           activation=self.output_activation)(fc2)

        return frames_input, action_output

    def compile(self, model_name: str = 'model', 
                loss: Union[str, Callable] = 'mse') -> keras.Model:
        """
        Compile a copy of the model using the provided loss.

        :param model_name: Name of model
        :param loss: Model loss. Default 'mse'. Can be custom callable.
        """
        # Get optimiser
        if self.opt.lower() == 'adam':
            opt = keras.optimizers.Adam
        elif self.opt.lower() == 'rmsprop':
            opt = keras.optimizers.RMSprop
        else:
            raise ValueError(f"Invalid optimiser {self.opt}")

        state_input, action_output = self._model_architecture()
        model = keras.Model(inputs=[state_input], outputs=[action_output], 
                            name=model_name)
        model.compile(optimizer=opt(learning_rate=self.learning_rate), 
                      loss=loss)

        return model

    def plot(self, model_name: str = 'model') -> None:
        keras.utils.plot_model(self.compile(model_name), 
                               to_file=f"{model_name}.png", show_shapes=True)
        plt.show()
    
FN = "saved_model"
KAGGLE_PATH = f"/kaggle_simulations/agent/{FN}"
if os.path.exists(KAGGLE_PATH):
    # On kaggle
    path = KAGGLE_PATH
else:
    path = FN

tf_mod = keras.models.load_model(path)
obs_buffer = SMMFrameProcessWrapper()


def agent(obs):

    # Use the existing model and obs buffer on each call to agent
    global tf_mod
    global obs_buffer

    # Get the raw observations returned by the environment
    obs = obs['players_raw'][0]
    # Convert these to the same output as the SMMWrapper we used in training.
    # generate_smm returns a batch, so take the single frame from it.
    obs = observation_preprocessing.generate_smm([obs])[0]

    # Use the SMMFrameProcessWrapper to do the scaling and buffering, but not
    # environment stepping or anything related to the Gym API.
    obs_buffer._obs_buffer.append(obs_buffer._normalise_frame(obs))
    buffered_obs = obs_buffer.build_buffered_obs()

    # Predict action values from the keras model (it expects a batch dimension)
    actions = tf_mod.predict(np.expand_dims(buffered_obs, axis=0))
    action = int(np.argmax(actions))

    return [action]

In [None]:
!tar -czvf submission.tar.gz main.py saved_model

# Test submission

The written submission file can be tested with the following block. 

See https://github.com/garethjns/kaggle-football/blob/main/scripts/debug_agent.py for a version that can be run locally while remaining debuggable.

In [None]:
from typing import Tuple, Dict, List, Any

from kaggle_environments import make

env = make("football", debug=True, configuration={"save_video": True,
                                                  "scenario_name": "11_vs_11_kaggle"})

# Define players
left_player = "main.py"  # A custom agent, eg. random_agent.py or example_agent.py
right_player = "run_right"  # eg. A built in 'AI' agent or the agent again


output: List[Tuple[Dict[str, Any], Dict[str, Any]]] = env.run([left_player, right_player])

print(f"Final score: {sum([r['reward'] for r in output[0]])} : {sum([r['reward'] for r in output[1]])}")
env.render(mode="human", width=800, height=600)

# Conclusions and next steps

This code should be enough to get started with, but like the dense model in the previous notebook, I haven't properly tested this model yet. It should be able to learn something in this environment but will probably require a lot of tweaking and significant training time. 

One potential optimisation would be to reduce the inputs for the ball and active player. Treating those as whole frames is expensive, and they could just as well be represented by an array of shape (4,), i.e. [x_last, y_last, x_now, y_now].
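
A minimal sketch of how such a (4,) array might be built from two single-channel frames. This is untested against real observations: it assumes the ball (or active player) shows up as a single brightest pixel in its frame, and `frame_to_xy` and `position_features` are hypothetical helper names. An empty frame would map to (0, 0) here, which a real implementation would need to handle.

```python
from typing import Tuple

import numpy as np


def frame_to_xy(frame: np.ndarray) -> Tuple[int, int]:
    """Return (x, y) of the brightest pixel in a single (72, 96) frame."""
    row, col = np.unravel_index(np.argmax(frame), frame.shape)
    return int(col), int(row)


def position_features(last_frame: np.ndarray,
                      now_frame: np.ndarray) -> np.ndarray:
    """Build [x_last, y_last, x_now, y_now] from two frames."""
    x_last, y_last = frame_to_xy(last_frame)
    x_now, y_now = frame_to_xy(now_frame)
    return np.array([x_last, y_last, x_now, y_now], dtype=np.float32)


# Dummy ball frames: the ball moves from (x=20, y=10) to (x=22, y=11).
last_ball = np.zeros((72, 96))
last_ball[10, 20] = 255
now_ball = np.zeros((72, 96))
now_ball[11, 22] = 255

print(position_features(last_ball, now_ball))  # [20. 10. 22. 11.]
```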

The submission still needs work too. It appears to time out intermittently when running in the notebook environment, which forfeits the game. Similarly, when submitted, some games end in an [Err] rather than a [Win], [Loss] or [Tie]. The log output isn't available, but it's likely this is also caused by intermittent timeouts.
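
One way to investigate would be to record how long each call to the agent takes when running locally. The sketch below wraps an agent function with a timer; `timed` and `dummy_agent` are hypothetical names, and the dummy agent simply sleeps in place of the real model prediction. No particular time limit is assumed here.

```python
import time
from typing import Any, Callable, List


def timed(agent_fn: Callable[[Any], List[int]],
          durations: List[float]) -> Callable[[Any], List[int]]:
    """Wrap an agent function so the duration of every call is recorded."""
    def wrapper(obs: Any) -> List[int]:
        start = time.perf_counter()
        action = agent_fn(obs)
        durations.append(time.perf_counter() - start)
        return action
    return wrapper


# Hypothetical stand-in for the real agent(obs) function.
def dummy_agent(obs: Any) -> List[int]:
    time.sleep(0.01)
    return [0]


durations: List[float] = []
timed_agent = timed(dummy_agent, durations)
for _ in range(5):
    timed_agent({})

print(f"max call time: {max(durations):.3f}s over {len(durations)} calls")
```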