## About

**I published [this notebook](https://www.kaggle.com/kuto0633/gfootball-rainbow-dqn-pfrl) before, but I can no longer update it (details [here](https://www.kaggle.com/product-feedback/191763)). I have therefore modified it and am publishing it again as this notebook. I apologize for splitting the work across multiple notebooks.**

In this notebook, I introduce [Rainbow](https://arxiv.org/abs/1710.02298) and [PFRL](https://github.com/pfnet/pfrl).  

[21/10/2020] Update:
1. Modified the reward settings used when creating the environment  
https://www.kaggle.com/c/google-football/discussion/191594   
2. Modified the submission process (now submittable)  
https://www.kaggle.com/c/google-football/discussion/190967  
https://www.kaggle.com/c/google-football/discussion/187381#1054470


### Rainbow
Rainbow is a reinforcement learning (RL) algorithm that extends DQN. It performs well on Atari games (a standard RL benchmark) while using **only a single GPU**. Modern high-performance RL algorithms (Ape-X, R2D2, etc.) are mainly distributed methods that run many environments in parallel, and I suspect distributed RL is an effective approach in this competition. However, these methods are difficult to run well in Kaggle Notebooks or Google Colab because they need massive distributed computing resources. So I use Rainbow DQN in this notebook.  

Rainbow consists of the following seven elements.
- DQN
- Double Q-learning
- Prioritized replay
- Dueling networks
- Multi-step learning
- Distributional RL
- Noisy nets

Please see the [paper](https://arxiv.org/abs/1710.02298) for details.
I build these components with PFRL.
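
As a rough illustration of the multi-step learning component, the sketch below computes an n-step discounted return in plain Python. The function name `n_step_return` and the sample rewards are hypothetical; `gamma=0.99` and the three-step horizon match the settings used later in this notebook. PFRL handles this inside its replay buffer, so this is only a sketch of the idea, not the library's code.

```python
from typing import Sequence


def n_step_return(rewards: Sequence[float], bootstrap_value: float, gamma: float = 0.99) -> float:
    """Discounted sum of n rewards plus a discounted bootstrap value.

    G = r_0 + gamma * r_1 + ... + gamma^(n-1) * r_(n-1) + gamma^n * bootstrap_value
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    # The value estimate of the state reached after n steps closes the sum.
    g += (gamma ** len(rewards)) * bootstrap_value
    return g


# Hypothetical 3-step transition with a small reward at the second step.
print(n_step_return([0.0, 0.1, 0.0], bootstrap_value=1.0))
```

With a single reward and a zero bootstrap value, the function reduces to the ordinary one-step case.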

### PFRL
<div align="center"><img src="https://raw.githubusercontent.com/pfnet/pfrl/master/assets/PFRL.png" width=30%/></div>
PFRL is a deep reinforcement learning library that implements various state-of-the-art deep reinforcement learning algorithms in Python using PyTorch.  

Most of the published notebooks are written in Keras (TensorFlow), but many people would like to use PyTorch, so I suggest PFRL. This notebook contains little PyTorch-specific code because I use existing PFRL modules, but you can rewrite any part in plain PyTorch if you want finer control. 
  
Please check [here](https://github.com/pfnet/pfrl) for details. 



In [None]:
# Update kaggle-environments to the newest version.
!pip3 install kaggle-environments -U

# GFootball environment.
!apt-get update -y
!apt-get install -y libsdl2-gfx-dev libsdl2-ttf-dev

# Make sure that the Branch in git clone and in wget call matches !!
!git clone -b v2.7 https://github.com/google-research/football.git
!mkdir -p football/third_party/gfootball_engine/lib

!wget https://storage.googleapis.com/gfootball/prebuilt_gameplayfootball_v2.7.so -O football/third_party/gfootball_engine/lib/prebuilt_gameplayfootball.so
!cd football && GFOOTBALL_USE_PREBUILT_SO=1 pip3 install .

## Install

In [None]:
!pip install pfrl==0.1.0

In [None]:
import os
import cv2
import sys
import glob 
import random
import imageio
import pathlib
import collections
from collections import deque
import numpy as np
import argparse
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline

from gym import spaces
from tqdm import tqdm
from logging import getLogger, StreamHandler, FileHandler, DEBUG, INFO
from typing import Union, Callable, List, Tuple, Iterable, Any, Dict
from dataclasses import dataclass
from IPython.display import Image, display


# PFRL
import pfrl
from pfrl.agents import CategoricalDoubleDQN
from pfrl import experiments
from pfrl import explorers
from pfrl import nn as pnn
from pfrl import utils
from pfrl import replay_buffers
from pfrl.wrappers import atari_wrappers
from pfrl.q_functions import DistributionalDuelingDQN

import torch
from torch import nn

# Env
import gym
import gfootball
import gfootball.env as football_env
from gfootball.env import observation_preprocessing

## Config

In [None]:
# Check we can use GPU
print(torch.cuda.is_available())

# set gpu id
if torch.cuda.is_available(): 
    # NOTE: this is the GPU device id (starting from 0), not the number of GPUs
    gpu = 0
else:
    # -1 selects the CPU
    gpu = -1

In [None]:
# set logger
def logger_config():
    logger = getLogger(__name__)
    handler = StreamHandler()
    handler.setLevel("DEBUG")
    logger.setLevel("DEBUG")
    logger.addHandler(handler)
    logger.propagate = False

    filepath = './result.log'
    file_handler = FileHandler(filepath)
    logger.addHandler(file_handler)
    return logger

logger = logger_config()

In [None]:
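# Fix the random seeds.
# NOTE: this is NOT enough to make the rewards fully reproducible.
# Possible causes (unverified) include non-deterministic GPU kernels and the
# stochastic scenario. Please tell me if you know the reason.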
def seed_everything(seed=1234):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    utils.set_random_seed(seed)  # for PFRL
    
# Set a random seed used in PFRL.
seed = 5046
seed_everything(seed)

# Set different random seeds for train and test envs.
train_seed = seed
test_seed = 2 ** 31 - 1 - seed

## Environment

In [None]:
# Wrapper for env (resize and reorder channels)
class TransEnv(gym.ObservationWrapper):
    def __init__(self, env, channel_order="hwc"):

        gym.ObservationWrapper.__init__(self, env)
        self.height = 84
        self.width = 84
        self.ch = env.observation_space.shape[2]
        self.channel_order = channel_order
        shape = {
            "hwc": (self.height, self.width, self.ch),
            "chw": (self.ch, self.height, self.width),
        }
        self.observation_space = spaces.Box(
            low=0, high=255, shape=shape[channel_order], dtype=np.uint8
        )

    def observation(self, frame):
        # cv2.resize takes (width, height) and keeps the HWC layout.
        frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
        if self.channel_order == "chw":
            # Permute axes HWC -> CHW; a plain reshape would mix the channels.
            frame = np.transpose(frame, (2, 0, 1))
        return frame.reshape(self.observation_space.low.shape)
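
When an image has more than one channel, converting from height-width-channel (HWC) to channel-height-width (CHW) layout is an axis permutation, and a plain reshape of the same buffer gives a different result. The sketch below shows the difference on a toy frame using nested lists only; the helper names `hwc_to_chw` and `reshape_to_chw` are hypothetical and exist only for this illustration.

```python
from typing import List

Frame = List[List[List[int]]]  # nested lists indexed as [h][w][c] or [c][h][w]


def hwc_to_chw(frame: Frame) -> Frame:
    """Permute axes so the result is indexed as [c][h][w]."""
    height, width, channels = len(frame), len(frame[0]), len(frame[0][0])
    return [
        [[frame[h][w][c] for w in range(width)] for h in range(height)]
        for c in range(channels)
    ]


def reshape_to_chw(frame: Frame) -> Frame:
    """Reinterpret the flat row-major buffer as (c, h, w) without moving values."""
    height, width, channels = len(frame), len(frame[0]), len(frame[0][0])
    flat = [v for row in frame for pixel in row for v in pixel]
    return [
        [flat[(c * height + h) * width:(c * height + h + 1) * width] for h in range(height)]
        for c in range(channels)
    ]


# Toy 2x2 frame with 2 channels: channel 0 holds 0..3, channel 1 holds 10..13.
frame = [[[0, 10], [1, 11]], [[2, 12], [3, 13]]]
print(hwc_to_chw(frame))      # each output plane holds exactly one channel
print(reshape_to_chw(frame))  # each output plane mixes values from both channels
```

For a single-channel image the two operations agree, which is why a reshape is sufficient for grayscale Atari frames but not for the four-channel SMM representation.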

In [None]:
def make_env(test):
    # Use different random seeds for train and test envs
    env_seed = test_seed if test else train_seed
    
    # env = gym.make('GFootball-11_vs_11_kaggle-SMM-v0')
    env = football_env.create_environment(
      env_name='11_vs_11_easy_stochastic',  # easy mode
      stacked=False,
      representation='extracted',  # SMM
      rewards='scoring,checkpoints',
      write_goal_dumps=False,
      write_full_episode_dumps=False,
      render=False,
      write_video=False,
      dump_frequency=1,
      logdir='./',
      extra_players=None,
      number_of_left_players_agent_controls=1,
      number_of_right_players_agent_controls=0
    )
    env = TransEnv(env, channel_order="chw")

    env.seed(int(env_seed))
    if test:
        # RandomizeAction can add epsilon-greedy-like noise in evaluation;
        # random_fraction=0.0 leaves the agent's actions unchanged
        env = pfrl.wrappers.RandomizeAction(env, random_fraction=0.0)
    return env

env = make_env(test=False)
eval_env = make_env(test=True)

In [None]:
print('observation space:', env.observation_space.low.shape)
print('action space:', env.action_space)

In [None]:
env.reset()
action = env.action_space.sample()
obs, r, done, info = env.step(action)
print('next observation:', obs.shape)
print('reward:', r)
print('done:', done)
print('info:', info)

## Model

In [None]:
obs_n_channels = env.observation_space.low.shape[0]
n_actions = env.action_space.n
print("obs_n_channels: ", obs_n_channels)
print("n_actions: ", n_actions)

# params based on the original paper
n_atoms = 51
v_max = 10
v_min = -10
q_func = DistributionalDuelingDQN(n_actions, n_atoms, v_min, v_max, obs_n_channels)
print(q_func)
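
To make the role of `n_atoms`, `v_min`, and `v_max` more concrete, here is a minimal sketch of how a categorical (distributional) Q-function turns per-action probability vectors into scalar Q-values. The helper names `make_atoms` and `expected_q` and the two example distributions are hypothetical; the sketch illustrates the idea from the Rainbow paper and is not PFRL's implementation.

```python
from typing import List


def make_atoms(v_min: float, v_max: float, n_atoms: int) -> List[float]:
    """Evenly spaced support z_0..z_(n-1) between v_min and v_max."""
    delta = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * delta for i in range(n_atoms)]


def expected_q(probs: List[float], atoms: List[float]) -> float:
    """Q(s, a) = sum_i p_i * z_i for one action's return distribution."""
    return sum(p * z for p, z in zip(probs, atoms))


atoms = make_atoms(v_min=-10, v_max=10, n_atoms=51)

# Hypothetical distributions for two actions: all mass on a single atom each.
probs_a = [0.0] * 51
probs_a[25] = 1.0  # middle atom, close to 0
probs_b = [0.0] * 51
probs_b[30] = 1.0  # an atom above the middle

q_values = [expected_q(probs_a, atoms), expected_q(probs_b, atoms)]
# The greedy action is the one with the largest expected return.
greedy_action = max(range(len(q_values)), key=lambda a: q_values[a])
print(q_values, greedy_action)
```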

In [None]:
# Noisy nets
pnn.to_factorized_noisy(q_func, sigma_scale=0.5)

# Noisy nets provide the exploration, so use a greedy explorer
explorer = explorers.Greedy()

# Use the same hyper parameters as https://arxiv.org/abs/1710.02298
opt = torch.optim.Adam(q_func.parameters(), 6.25e-5, eps=1.5 * 10 ** -4)

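# Prioritized Replay
# Anneal beta from beta0 to 1 over betasteps updates.
# NOTE: betasteps assumes a 5 * 10 ** 7 step run, so beta stays close to beta0
# in a short run like this notebook's.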
update_interval = 4
betasteps = 5 * 10 ** 7 / update_interval
rbuf = replay_buffers.PrioritizedReplayBuffer(
        10 ** 5,  # Default value is 10 ** 6 but it is too large in this notebook. I chose 10 ** 5.
        alpha=0.5,
        beta0=0.4,
        betasteps=betasteps,
        num_steps=3,
        normalize_by_max="memory",
    )


def phi(x):
    # Feature extractor
    return np.asarray(x, dtype=np.float32) / 255
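
The effect of `betasteps` on a short run can be checked with a few lines of arithmetic. The sketch below assumes that beta is annealed linearly from `beta0` to 1 over `betasteps` sampling calls, which is my understanding of the prioritized buffer's behaviour rather than something verified here; the function name `annealed_beta` is hypothetical.

```python
def annealed_beta(n_samples: int, beta0: float = 0.4, betasteps: float = 5 * 10 ** 7 / 4) -> float:
    """Linearly anneal beta from beta0 towards 1.0 over `betasteps` sampling calls."""
    beta_add = (1.0 - beta0) / betasteps
    return min(1.0, beta0 + beta_add * n_samples)


# 100K environment steps with update_interval=4 gives at most 25,000 updates.
print(annealed_beta(25_000))
# Shrinking betasteps to match a short run lets beta approach 1.0 by the end.
print(annealed_beta(25_000, betasteps=25_000))
```

Under this assumption, setting `betasteps` to roughly the planned number of updates is one way to keep the schedule meaningful for short experiments.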

In [None]:
agent = CategoricalDoubleDQN(
        q_func,
        opt,
        rbuf,
        gpu=gpu,  
        gamma=0.99,
        explorer=explorer,
        minibatch_size=32,
        replay_start_size=2 * 10 ** 4,
        target_update_interval=32000,
        update_interval=update_interval,
        batch_accumulator="mean",
        phi=phi,
    )

In [None]:
# If you have a pretrained model, the agent can load its weights.
use_pretrained = False
pretrained_path = None
if use_pretrained:
    agent.load(pretrained_path)

## Train

In this notebook, I train for only 100K steps to keep the run time short (about 1 hour).    
To improve the agent, we will need to try the following:
- use more training steps and a larger replay buffer
- add some new ideas
- use other RL agents

In [None]:
num_steps = 100000

In [None]:
%%time
experiments.train_agent_with_evaluation(
    agent=agent,
    env=env,
    steps=num_steps,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=3000,
    outdir="./",
    checkpoint_freq=100000,
    save_best_so_far_agent=True,
    eval_env=eval_env,
    logger=logger
)

In [None]:
import csv

def text_csv_converter(datas):
    # Convert a whitespace-separated text file into a CSV file next to it.
    file_csv = os.path.splitext(datas)[0] + ".csv"
    with open(datas) as rf, open(file_csv, "w", newline="") as wf:
        # Create the writer once and emit one CSV row per input line.
        writer = csv.writer(wf, delimiter=',')
        for read_text in rf:
            writer.writerow(read_text.split())

filename = "scores.txt"
text_csv_converter(filename)

In [None]:
import pandas as pd
scores = pd.read_csv("scores.csv")
scores.tail()

In [None]:
# Visualize the median reward and the average loss against the episode count
fig = plt.figure(figsize=(15, 5))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
ax1.set_title("median reward")
ax2.set_title("average loss")
sns.lineplot(x="episodes", y="median", data=scores, ax=ax1)
sns.lineplot(x="episodes", y="average_loss", data=scores,ax=ax2)
plt.show()

## Submission

To include the pfrl library in the submission file, I prepare a copy of the pfrl repository.

In [None]:
# Clone the pfrl repo and move the package into the directory that will hold main.py
!git clone https://github.com/pfnet/pfrl.git
!mkdir sub
!mv ./pfrl/pfrl sub
!rm -r ./pfrl

To bundle the pfrl module and the model weights into the submission file, I use the `stickytape` module.  
  
I referred to [this topic](https://www.kaggle.com/c/halite/discussion/164005) from the Halite competition. Thanks, [@higepon](https://www.kaggle.com/higepon).

First, I encode the model weights into a .py script using base64. 
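
The round trip can be tried on a small byte string before applying it to a real weight file. The sketch below uses only the standard library; `raw_bytes` is a hypothetical stand-in for the contents of the saved model, and the variable name `model_string` follows the module written in this notebook.

```python
import base64
import io

# Stand-in for the bytes of a saved model file (hypothetical content).
raw_bytes = b"\x00\x01binary-model-weights\xff"

# Encode: base64 output contains only ASCII characters, so it can sit in a .py file.
encoded = base64.b64encode(raw_bytes)
module_source = f"model_string={encoded}"

# Decode: executing the generated source recovers the bytes literal.
namespace = {}
exec(module_source, namespace)
decoded = base64.b64decode(namespace["model_string"])

# A BytesIO buffer can stand in for a file on disk when loading.
buffer = io.BytesIO(decoded)
print(buffer.read() == raw_bytes)
```

Because `torch.load` accepts file-like objects, a buffer like this could probably replace the temporary file in the submission script, although this notebook writes the bytes to disk instead.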

In [None]:
import base64

with open(f'./{num_steps}_finish/model.pt', 'rb') as f:
    encoded_string = base64.b64encode(f.read())

with open('./sub/model_weights.py', 'w') as f:
    f.write(f'model_string={encoded_string}')

In [None]:
%cd sub
!ls -la 

Then, create main.py.  
  
I was plagued by validation episode errors for a long time.  
However, they were solved thanks to advice from [@Tom Van de Wiele](https://www.kaggle.com/tvdwiele).

In [None]:
%%writefile main.py
import os
import sys
import cv2
import collections
import numpy as np
from gfootball.env import observation_preprocessing

# PFRL
import torch
import pfrl

import base64
from model_weights import model_string

def make_model():
    # Q_function
    model = pfrl.q_functions.DistributionalDuelingDQN(n_actions=19, n_atoms=51, v_min=-10, v_max=10, n_input_channels=4)
    
    # Noisy nets
    pfrl.nn.to_factorized_noisy(model, sigma_scale=0.5)
    
    # load weights
    with open("model.dat", "wb") as f:
        f.write(base64.b64decode(model_string))
    weights = torch.load("model.dat", map_location=torch.device('cpu'))
    model.load_state_dict(weights)
    return model


model = make_model()

def agent(obs):
    global model
    
    # Get observations for the first (and only one) player we control.
    obs = obs['players_raw'][0]
    # Agent we trained uses Super Mini Map (SMM) representation.
    # See https://github.com/google-research/seed_rl/blob/master/football/env.py for details.
    obs = observation_preprocessing.generate_smm([obs])[0]
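    # Preprocess obs: resize, transpose to CHW, and scale to [0, 1] as phi does in training
    obs = cv2.resize(obs, (84,84), interpolation=cv2.INTER_AREA)  # resize
    obs = np.transpose(obs, [2,0,1])  # transpose to chw
    obs = torch.tensor(obs).float() / 255  # to tensor, scaled like phi
    obs = torch.unsqueeze(obs,0)      # add batch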

    actions = model(obs)
    action = int(actions.greedy_actions.numpy()[0])  # modified
    return [action]

Finally, I concatenate the script files using stickytape.  
main.py + pfrl library + model_weights.py => submission.py

In [None]:
!pip install stickytape

In [None]:
!stickytape main.py --add-python-path pfrl --add-python-path . > /kaggle/working/submission.py
!rm -r pfrl
%cd /kaggle/working
!ls -la

### Check submission agent

In [None]:
from kaggle_environments import make
from submission import agent
env = make("football", configuration={"save_video": True, "scenario_name": "11_vs_11_kaggle", "running_in_notebook": True})
obs = env.state[0]["observation"]
action = agent(obs)
print(action)

In [None]:
from kaggle_environments import make
env = make("football", configuration={"save_video": True, "scenario_name": "11_vs_11_kaggle", "running_in_notebook": True}, debug=True)
agent = "submission.py"
output = env.run([agent, agent])[-1]
print('Left player: action = %s, reward = %s, status = %s, info = %s' % (output[0]["action"], output[0]['reward'], output[0]['status'], output[0]['info']))
print('Right player: action = %s, reward = %s, status = %s, info = %s' % (output[1]["action"], output[1]['reward'], output[1]['status'], output[1]['info']))
env.render(mode="human", width=800, height=600)

In [None]:
# to clean output folder
!rm -r /kaggle/working/football
!rm -r /kaggle/working/kaggle-environments

### Submit to Competition
1. "Save & Run All" (commit) this Notebook
2. Go to the notebook viewer
3. Go to "Output" section and find submission.py file.
4. Click "Submit to Competition"

Go to My Submissions to view your score and episodes being played.  
Thank you.