## About
In this notebook, I introduce [Rainbow](https://arxiv.org/abs/1710.02298) and [PFRL](https://github.com/pfnet/pfrl).  
### Rainbow
Rainbow is a reinforcement learning (RL) algorithm that extends DQN. It performs well on Atari games (a standard RL benchmark) **using only a single GPU**. Modern high-performance RL algorithms (Ape-X, R2D2, etc.) are mainly distributed methods that run many environments in parallel, and I guess distributed RL would be an effective approach in this competition. However, they are somewhat difficult to run well in Kaggle Notebooks or Google Colab because they need massively distributed computing resources. So I use Rainbow-DQN in this notebook.  

Rainbow consists of the following seven elements.
- DQN
- Double Q-learning
- Prioritized replay
- Dueling networks
- Multi-step learning
- Distributional RL
- Noisy nets

Please see [here](https://arxiv.org/abs/1710.02298) for details.
I will implement these components with PFRL.
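
To make one of the components concrete, here is a minimal sketch of the multi-step (n-step) return that Rainbow uses as its learning target. The helper name `n_step_return` is hypothetical and written only for illustration; it is not part of PFRL, which handles this internally.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of `rewards` plus a discounted bootstrap value.

    G = r_0 + gamma * r_1 + ... + gamma^(n-1) * r_(n-1) + gamma^n * bootstrap_value
    """
    g = bootstrap_value
    # Work backwards so that each step applies one more factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three-step example: only the last step is rewarded, nothing to bootstrap from.
print(n_step_return([0.0, 0.0, 1.0], bootstrap_value=0.0, gamma=0.5))
```

With `n = 1` this reduces to the usual one-step DQN target; a larger `n` propagates sparse rewards such as goals back to earlier actions faster.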

### PFRL
<div align="center"><img src="https://raw.githubusercontent.com/pfnet/pfrl/master/assets/PFRL.png" width=30%/></div>
PFRL is a deep reinforcement learning library that implements various state-of-the-art deep reinforcement learning algorithms in Python using PyTorch.  

Most of the published notebooks are written in Keras (TensorFlow), but many people would like to use PyTorch, so I propose PFRL. This notebook contains little PyTorch-specific code because I use existing PFRL modules, but you can rewrite any part in plain PyTorch if you want. 
  
Please check [here](https://github.com/pfnet/pfrl) for details. 



In [None]:
# Install:
# Kaggle environments.
!git clone https://github.com/Kaggle/kaggle-environments.git
!cd kaggle-environments && pip install .

# GFootball environment.
!apt-get update -y
!apt-get install -y libsdl2-gfx-dev libsdl2-ttf-dev

# Make sure that the Branch in git clone and in wget call matches !!
!git clone -b v2.3 https://github.com/google-research/football.git
!mkdir -p football/third_party/gfootball_engine/lib

!wget https://storage.googleapis.com/gfootball/prebuilt_gameplayfootball_v2.3.so -O football/third_party/gfootball_engine/lib/prebuilt_gameplayfootball.so
!cd football && GFOOTBALL_USE_PREBUILT_SO=1 pip3 install .

## Install

In [None]:
!pip install pfrl==0.1.0

In [None]:
import os
import cv2
import sys
import glob 
import random
import imageio
import pathlib
import collections
from collections import deque
import numpy as np
import argparse
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline

from gym import spaces
from tqdm import tqdm
from logging import getLogger, StreamHandler, FileHandler, DEBUG, INFO
from typing import Union, Callable, List, Tuple, Iterable, Any, Dict
from dataclasses import dataclass
from IPython.display import Image, display


# PFRL (built on PyTorch)
import pfrl
from pfrl.agents import CategoricalDoubleDQN
from pfrl import experiments
from pfrl import explorers
from pfrl import nn as pnn
from pfrl import utils
from pfrl import replay_buffers
from pfrl.wrappers import atari_wrappers
from pfrl.q_functions import DistributionalDuelingDQN

import torch
from torch import nn

# Env
import gym
import gfootball
import gfootball.env as football_env
from gfootball.env import observation_preprocessing

## Config

In [None]:
# Check we can use GPU
print(torch.cuda.is_available())

# Set the GPU id
if torch.cuda.is_available(): 
    # NOTE: this is the GPU device id (starting from 0), not the number of GPUs
    gpu = 0
else:
    # -1 selects the CPU
    gpu = -1

In [None]:
# set logger
def logger_config():
    logger = getLogger(__name__)
    handler = StreamHandler()
    handler.setLevel("DEBUG")
    logger.setLevel("DEBUG")
    logger.addHandler(handler)
    logger.propagate = False

    filepath = './result.log'
    file_handler = FileHandler(filepath)
    logger.addHandler(file_handler)
    return logger

logger = logger_config()

In [None]:
# Fix the random seeds.
# NOTE: this alone is NOT enough to make the rewards reproducible. If you know the reason, please tell me.
def seed_everything(seed=1234):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    utils.set_random_seed(seed)  # for PFRL
    
# Set a random seed used in PFRL.
seed = 5046
seed_everything(seed)

# Set different random seeds for train and test envs.
train_seed = seed
test_seed = 2 ** 31 - 1 - seed

## Environment

In [None]:
# Wrapper for env (resize frames and optionally transpose HWC -> CHW)
class TransEnv(gym.ObservationWrapper):
    def __init__(self, env, channel_order="hwc"):

        gym.ObservationWrapper.__init__(self, env)
        self.height = 84
        self.width = 84
        self.ch = env.observation_space.shape[2]
        self.channel_order = channel_order
        shape = {
            "hwc": (self.height, self.width, self.ch),
            "chw": (self.ch, self.height, self.width),
        }
        self.observation_space = spaces.Box(
            low=0, high=255, shape=shape[channel_order], dtype=np.uint8
        )
        

    def observation(self, frame):
        frame = cv2.resize(frame, (self.width, self.height), interpolation=cv2.INTER_AREA)
        if self.channel_order == "chw":
            # reshape() alone would mix pixels across channels for multi-channel frames,
            # so move the channel axis explicitly (this matches the transpose in main.py).
            frame = np.transpose(frame, (2, 0, 1))
        return frame.reshape(self.observation_space.low.shape)
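
The `chw` option of `TransEnv` needs the channel axis moved to the front. The following is a small NumPy sketch (toy shapes, not the real 84x84 SMM frames) of how `np.transpose` and `reshape` differ when an HWC array is turned into a CHW-shaped one:

```python
import numpy as np

# Tiny HWC "frame": height=2, width=2, channels=3.
hwc = np.arange(2 * 2 * 3).reshape(2, 2, 3)

chw_transposed = np.transpose(hwc, (2, 0, 1))  # moves the channel axis to the front
chw_reshaped = hwc.reshape(3, 2, 2)            # only reinterprets the flat buffer

print(hwc[:, :, 0])       # channel 0 of the original frame
print(chw_transposed[0])  # same values as channel 0
print(chw_reshaped[0])    # different values: pixels from several channels are mixed
```

Both results have shape `(3, 2, 2)`, so a shape check alone does not reveal the difference. For a single-channel frame the two operations give the same result, which is why the distinction only matters for multi-channel observations such as SMM.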

In [None]:
def make_env(test):
    # Use different random seeds for train and test envs
    env_seed = test_seed if test else train_seed
    
    # env = gym.make('GFootball-11_vs_11_kaggle-SMM-v0')
    env = football_env.create_environment(
      env_name='11_vs_11_easy_stochastic',  # easy mode
      stacked=False,
      representation='extracted',  # SMM
      rewards='scoring, checkpoints',
      write_goal_dumps=False,
      write_full_episode_dumps=False,
      render=False,
      write_video=False,
      dump_frequency=1,
      logdir='./',
      extra_players=None,
      number_of_left_players_agent_controls=1,
      number_of_right_players_agent_controls=0
    )
    env = TransEnv(env, channel_order="chw")

    env.seed(int(env_seed))
    if test:
        # RandomizeAction can inject random actions (like epsilon-greedy) during evaluation;
        # random_fraction=0.0 leaves it disabled here.
        env = pfrl.wrappers.RandomizeAction(env, random_fraction=0.0)
    return env

env = make_env(test=False)
eval_env = make_env(test=True)

In [None]:
print('observation space:', env.observation_space.low.shape)
print('action space:', env.action_space)

In [None]:
env.reset()
action = env.action_space.sample()
obs, r, done, info = env.step(action)
print('next observation:', obs.shape)
print('reward:', r)
print('done:', done)
print('info:', info)

## Model

In [None]:
obs_n_channels = env.observation_space.low.shape[0]
n_actions = env.action_space.n
print("obs_n_channels: ", obs_n_channels)
print("n_actions: ", n_actions)

# Parameters based on the original paper
n_atoms = 51
v_max = 10
v_min = -10
q_func = DistributionalDuelingDQN(n_actions, n_atoms, v_min, v_max, obs_n_channels)
print(q_func)
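
The parameters `n_atoms`, `v_min`, and `v_max` define the fixed support of the categorical return distribution used in distributional RL. The sketch below shows, in plain Python, how such a support can be built and how a Q-value is obtained as an expectation over it. The helpers `make_support` and `expected_q` are hypothetical names for illustration and are not PFRL functions.

```python
def make_support(v_min, v_max, n_atoms):
    """Evenly spaced atom values z_0 .. z_{n_atoms-1} between v_min and v_max."""
    delta = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * delta for i in range(n_atoms)]


def expected_q(probs, support):
    """Q-value as the expectation of the categorical return distribution."""
    return sum(p * z for p, z in zip(probs, support))


support = make_support(-10, 10, 51)
uniform = [1.0 / 51] * 51  # a network that knows nothing yet might output this

print(len(support), support[0], support[-1])
print(expected_q(uniform, support))
```

The network outputs one probability vector over these atoms per action, and the greedy action is the one with the largest expected value.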

In [None]:
# Noisy nets
pnn.to_factorized_noisy(q_func, sigma_scale=0.5)

# Turn off explorer
explorer = explorers.Greedy()

# Use the same hyper parameters as https://arxiv.org/abs/1710.02298
opt = torch.optim.Adam(q_func.parameters(), 6.25e-5, eps=1.5 * 10 ** -4)

# Prioritized Replay
# Anneal beta from beta0 to 1 over betasteps updates. betasteps is sized for a
# 5 * 10 ** 7 step run, so beta stays close to beta0 in a short run.
update_interval = 4
betasteps = 5 * 10 ** 7 / update_interval
rbuf = replay_buffers.PrioritizedReplayBuffer(
        10 ** 5,  # Default value is 10 ** 6 but it is too large in this notebook. I chose 10 ** 5.
        alpha=0.5,
        beta0=0.4,
        betasteps=betasteps,
        num_steps=3,
        normalize_by_max="memory",
    )


def phi(x):
    # Feature extractor
    return np.asarray(x, dtype=np.float32) / 255
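
To get a feeling for the `beta0` and `betasteps` settings of the prioritized replay buffer, here is a rough sketch of a linear annealing schedule. It assumes that beta grows linearly from `beta0` to 1 over `betasteps` sampling steps; the function `annealed_beta` is hypothetical and is not PFRL's implementation.

```python
def annealed_beta(n_samples, beta0=0.4, betasteps=5 * 10 ** 7 / 4):
    """Linear schedule from beta0 to 1.0, clipped at 1.0 after betasteps."""
    fraction = min(1.0, n_samples / betasteps)
    return beta0 + (1.0 - beta0) * fraction


# 100K environment steps with update_interval=4 give at most 25,000 updates.
print(annealed_beta(0))
print(annealed_beta(25_000))
print(annealed_beta(5 * 10 ** 7 / 4))
```

Under this assumption, beta hardly moves away from `beta0` during a 100K-step run, so a smaller `betasteps` may be worth trying for short experiments.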

In [None]:
agent = CategoricalDoubleDQN(
        q_func,
        opt,
        rbuf,
        gpu=gpu,  
        gamma=0.99,
        explorer=explorer,
        minibatch_size=32,
        replay_start_size=2 * 10 ** 4,
        target_update_interval=32000,
        update_interval=update_interval,
        batch_accumulator="mean",
        phi=phi,
    )

In [None]:
# If you have a pretrained model, the agent can load its weights.
use_pretrained = False
pretrained_path = None
if use_pretrained:
    agent.load(pretrained_path)

## Train

In this notebook, I train for 100K steps (about 1 hour).  
This is not enough to train a good agent. To improve it, we would need many more training steps, a larger replay buffer, and possibly distributed RL to reduce computation time.

In [None]:
%%time
experiments.train_agent_with_evaluation(
    agent=agent,
    env=env,
    steps=100000,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=3000,
    outdir="./kaggle_simulations/agent",
    save_best_so_far_agent=True,
    eval_env=eval_env,
    logger=logger
)

In [None]:
import csv

def text_csv_converter(datas):
    # Convert the whitespace-separated scores.txt written by PFRL into a CSV file.
    file_csv = datas.replace("txt", "csv")
    with open(datas) as rf, open(file_csv, "w", newline="") as wf:
        writer = csv.writer(wf, delimiter=',')
        for read_text in rf:
            writer.writerow(read_text.split())

filename = "./kaggle_simulations/agent/scores.txt"
text_csv_converter(filename)

In [None]:
!ls -la ./kaggle_simulations/agent

In [None]:
import pandas as pd
scores = pd.read_csv("./kaggle_simulations/agent/scores.csv")
scores.head()

In [None]:
# Visualize median reward and average loss over episodes
fig = plt.figure(figsize=(15, 5))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
ax1.set_title("median reward")
ax2.set_title("average loss")
sns.lineplot(x="episodes", y="median", data=scores, ax=ax1)
sns.lineplot(x="episodes", y="average_loss", data=scores,ax=ax2)
plt.show()

## Submission

In [None]:
%%writefile ./kaggle_simulations/agent/main.py
import cv2
import collections
import gym
import numpy as np
import os
import sys
import torch

from gfootball.env import observation_preprocessing
from gfootball.env import wrappers

# PFRL
import pfrl
from pfrl.agents import CategoricalDoubleDQN
from pfrl import experiments
from pfrl import explorers
from pfrl import nn as pnn
from pfrl import utils
from pfrl import replay_buffers
from pfrl.q_functions import DistributionalDuelingDQN


def phi(x):
    # Feature extractor
    return np.asarray(x, dtype=np.float32) / 255

def make_model():
    global device
    # Q_function
    n_atoms = 51
    v_max = 10
    v_min = -10
    obs_n_channels = 4
    n_actions = 19
    q_func = DistributionalDuelingDQN(n_actions, n_atoms, v_min, v_max, obs_n_channels)

    # Noisy nets
    pnn.to_factorized_noisy(q_func, sigma_scale=0.5)

    # Turn off explorer
    explorer = explorers.Greedy()

    # Use the same hyper parameters as https://arxiv.org/abs/1710.02298
    opt = torch.optim.Adam(q_func.parameters(), 6.25e-5, eps=1.5 * 10 ** -4)

    # Prioritized Replay
    # Anneal beta from beta0 to 1 throughout training
    update_interval = 4
    betasteps = 5 * 10 ** 7 / update_interval
    rbuf = replay_buffers.PrioritizedReplayBuffer(
            10 ** 6,
            alpha=0.5,
            beta0=0.4,
            betasteps=betasteps,
            num_steps=3,
            normalize_by_max="memory",
        )


    # prepare agent
    model = CategoricalDoubleDQN(
            q_func,
            opt,
            rbuf,
            gpu=-1,  
            gamma=0.99,
            explorer=explorer,
            minibatch_size=32,
            replay_start_size=2 * 10 ** 4,
            target_update_interval=32000,
            update_interval=update_interval,
            batch_accumulator="mean",
            phi=phi,
        )
    
    model.load("./kaggle_simulations/agent/100000_finish")
    # model.load("./kaggle_simulations/agent/best")
    return model.model.to(device)


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = make_model()

def agent(obs):
    global device
    global model
    
    # Get observations for the first (and only one) player we control.
    obs = obs['players_raw'][0]
    # Agent we trained uses Super Mini Map (SMM) representation.
    # See https://github.com/google-research/seed_rl/blob/master/football/env.py for details.
    obs = observation_preprocessing.generate_smm([obs])[0]
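    # Preprocess for the model: resize, transpose to CHW, and scale to [0, 1] with phi as in training.
    obs = cv2.resize(obs, (84, 84), interpolation=cv2.INTER_AREA)  # resize
    obs = np.transpose(obs, [2, 0, 1])                             # transpose to chw
    obs = torch.from_numpy(phi(obs))                               # float32 tensor in [0, 1]
    obs = torch.unsqueeze(obs, 0).to(device)                       # add batch

    with torch.no_grad():
        action = model(obs)
    action = action.greedy_actions.cpu().numpy()
    # Return plain Python ints so that the action list is JSON-serializable.
    return [int(a) for a in action]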

### Check submission agent

In [None]:
from kaggle_environments import make
from kaggle_simulations.agent import main
env = make("football", configuration={"save_video": True, "scenario_name": "11_vs_11_kaggle", "running_in_notebook": True})
obs = env.state[0]["observation"]
action = main.agent(obs)
print(action)

In [None]:
from kaggle_environments import make
env = make("football", configuration={"save_video": True, "scenario_name": "11_vs_11_kaggle", "running_in_notebook": True}, debug=True)
output = env.run(["./kaggle_simulations/agent/main.py", "run_right"])[-1]
print('Left player: action = %s, reward = %s, status = %s, info = %s' % (output[0]["action"], output[0]['reward'], output[0]['status'], output[0]['info']))
print('Right player: action = %s, reward = %s, status = %s, info = %s' % (output[1]["action"], output[1]['reward'], output[1]['status'], output[1]['info']))
env.render(mode="human", width=800, height=600)

In [None]:
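# Prepare a submission package containing the trained model and the main execution logic.
# main.py loads the 100000_finish directory, so it is packed too (best is kept for the commented-out alternative).
!cd ./kaggle_simulations/agent && tar -czvf /kaggle/working/submit.tar.gz main.py best 100000_finish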
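
Since `main.py` loads weights from a saved-agent directory, it may help to confirm that the directory it loads is actually inside the archive before submitting. The following is a minimal sketch using only the standard library; `list_archive` is a hypothetical helper, and the demo builds a throwaway archive with a placeholder `main.py` rather than reading the real `submit.tar.gz`.

```python
import os
import tarfile
import tempfile


def list_archive(path):
    """Return the sorted top-level entries of a tar.gz archive."""
    with tarfile.open(path, "r:gz") as tar:
        return sorted({name.split("/")[0] for name in tar.getnames()})


# Self-contained demo on a temporary archive (file contents are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "main.py")
    with open(src, "w") as f:
        f.write("def agent(obs):\n    return [0]\n")
    archive = os.path.join(tmp, "submit.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="main.py")
    entries = list_archive(archive)
    print(entries)
```

On the real package, calling `list_archive("/kaggle/working/submit.tar.gz")` should list `main.py` together with every model directory that `main.py` tries to load.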

The submission package builds successfully, but submitting it fails with an error.  
Any ideas or suggestions are welcome. Thank you!  