Alongside the Google Research Football competition, I'm implementing key reinforcement learning algorithms in [https://github.com/yamatokataoka/reinforcement-learning-replications](https://github.com/yamatokataoka/reinforcement-learning-replications).

I've implemented VPG, TRPO and PPO so far.

In this notebook, you'll train a TRPO agent with Reinforcement Learning Replications (rl-replicas).

If you're not familiar with TRPO or RL, I highly recommend reading through [OpenAI Spinning Up](https://spinningup.openai.com/en/latest/user/introduction.html) first.

Install the tools gfootball requires

In [None]:
# Install:
# Kaggle environments.
!git clone -q https://github.com/Kaggle/kaggle-environments.git
!cd kaggle-environments && pip install .

# GFootball environment.
!apt-get update -y -qq > /dev/null
!apt-get install -y -qq libsdl2-gfx-dev libsdl2-ttf-dev

# Make sure the branch in the git clone call matches the version in the wget call.
!git clone -q -b v2.8 https://github.com/google-research/football.git
!mkdir -p football/third_party/gfootball_engine/lib

!wget -q https://storage.googleapis.com/gfootball/prebuilt_gameplayfootball_v2.8.so -O football/third_party/gfootball_engine/lib/prebuilt_gameplayfootball.so
!cd football && GFOOTBALL_USE_PREBUILT_SO=1 pip install .

Install rl-replicas

In [None]:
!pip install rl-replicas

### Train with rl-replicas

In rl-replicas, you set up the TRPO algorithm with a `Policy` and a `ValueFunction`. Each of these two core components has its own neural network and optimizer.
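To get a feel for how large these two networks are, the sketch below works out the layer shapes and parameter counts for the sizes used in this notebook (115 observations, hidden layers of 64 and 64, 19 actions for the policy and 1 output for the value function). It assumes plain fully connected layers with biases; `layer_shapes` and `count_parameters` are hypothetical helpers written for this illustration, not part of rl-replicas.

```python
from typing import List, Tuple


def layer_shapes(sizes: List[int]) -> List[Tuple[int, int]]:
    """Return (inputs, outputs) for each fully connected layer."""
    return list(zip(sizes[:-1], sizes[1:]))


def count_parameters(sizes: List[int]) -> int:
    """Count weights plus biases, assuming dense layers with a bias term."""
    return sum(n_in * n_out + n_out for n_in, n_out in layer_shapes(sizes))


num_observation = 115  # length of the simple115v2 observation vector
num_action = 19        # number of discrete actions
hidden = [64, 64]

policy_sizes = [num_observation] + hidden + [num_action]
value_function_sizes = [num_observation] + hidden + [1]

print('policy layers:         ', layer_shapes(policy_sizes))
print('policy parameters:     ', count_parameters(policy_sizes))
print('value function layers: ', layer_shapes(value_function_sizes))
print('value function params: ', count_parameters(value_function_sizes))
```

Under these assumptions, both networks are small (roughly 12 thousand parameters each).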

The trained model is saved to `./trpo/model.pt`.
You'll use it for the test run.

##### Note
TRPO uses `ConjugateGradientOptimizer` as the policy optimizer, while the value function uses a regular optimizer such as `torch.optim.Adam`.
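For background, TRPO uses the conjugate gradient method to approximately solve a linear system `A x = b` using only matrix-vector products, so the matrix never has to be built or inverted explicitly. The sketch below is a generic, pure-Python version of that method for a small symmetric positive-definite system. It is not the code of `ConjugateGradientOptimizer`; the function names, the iteration count, and the tolerance are assumptions made for this illustration.

```python
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(
    matvec: Callable[[Vector], Vector],
    b: Vector,
    iterations: int = 10,
    tolerance: float = 1e-10,
) -> Vector:
    """Approximately solve A x = b, where matvec(v) returns A v.

    A is assumed to be symmetric and positive definite.
    """
    x = [0.0] * len(b)
    residual = list(b)          # b - A x, with x starting at zero
    direction = list(residual)  # first search direction
    rs_old = dot(residual, residual)

    for _ in range(iterations):
        if rs_old < tolerance:
            break
        a_direction = matvec(direction)
        step = rs_old / dot(direction, a_direction)
        x = [xi + step * di for xi, di in zip(x, direction)]
        residual = [ri - step * adi for ri, adi in zip(residual, a_direction)]
        rs_new = dot(residual, residual)
        # Next direction is conjugate to the previous ones with respect to A.
        direction = [ri + (rs_new / rs_old) * di for ri, di in zip(residual, direction)]
        rs_old = rs_new

    return x


# Toy system: A = [[4, 1], [1, 3]], b = [1, 2].
A = [[4.0, 1.0], [1.0, 3.0]]


def matvec(v: Vector) -> Vector:
    return [dot(row, v) for row in A]


solution = conjugate_gradient(matvec, [1.0, 2.0])
print(solution)  # close to [1/11, 7/11]
```

In TRPO the matrix-vector product is typically a Hessian-vector product of the KL divergence and `b` is the policy gradient, which is why only `matvec` is needed rather than the full matrix.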

In [None]:
import os

import gfootball  # imported for its side effect: registers the GFootball environments with gym
import gym
import torch
import torch.nn as nn

from rl_replicas.algorithms import TRPO
from rl_replicas.common.policy import Policy
from rl_replicas.common.value_function import ValueFunction
from rl_replicas.common.optimizers import ConjugateGradientOptimizer
from rl_replicas.common.torch_net import mlp

algorithm_name = 'trpo'
environment_name = 'GFootball-11_vs_11_kaggle-simple115v2-v0'
epochs = 5
steps_per_epoch = 4000
policy_network_architecture = [64, 64]
value_function_network_architecture = [64, 64]
value_function_learning_rate = 1e-3
output_dir = './trpo'

env: gym.Env = gym.make(environment_name)

policy_network = mlp(
    sizes = [env.observation_space.shape[0]]+policy_network_architecture+[env.action_space.n]
)

policy: Policy = Policy(
    network = policy_network,
    optimizer = ConjugateGradientOptimizer(params=policy_network.parameters())
)

value_function_network = mlp(
    sizes = [env.observation_space.shape[0]]+value_function_network_architecture+[1]
)
value_function: ValueFunction = ValueFunction(
    network = value_function_network,
    optimizer = torch.optim.Adam(value_function_network.parameters(), lr=value_function_learning_rate)
)

model: TRPO = TRPO(policy, value_function, env, seed=0)

print('experiment output directory: {}'.format(output_dir))

print('algorithm:           {}'.format(algorithm_name))
print('epochs:              {}'.format(epochs))
print('steps_per_epoch:     {}'.format(steps_per_epoch))
print('environment:         {}'.format(environment_name))

print('value_function_learning_rate: {}'.format(value_function_learning_rate))
print('policy network:')
print(policy.network)

Start learning

In [None]:
model.learn(
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
    output_dir=output_dir,
    model_saving=True
)

### Run inference with rl-replicas

To run inference, you'll initialize `Policy` with the trained network. The optimizer is unnecessary here.
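As a rough picture of what happens at inference time, a discrete policy network outputs one score (logit) per action, and an action index is picked from those scores. The sketch below shows one common way to do that with a softmax, either greedily or by sampling. It is an assumption for illustration only: `softmax` and `select_action` are hypothetical helpers, and this is not necessarily how `Policy.predict` chooses actions.

```python
import math
import random
from typing import List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def select_action(logits: List[float], rng: Optional[random.Random] = None) -> int:
    """Pick an action index from logits.

    Without rng, return the most probable action (greedy).
    With rng, sample an action according to the softmax probabilities.
    """
    probabilities = softmax(logits)
    indices = range(len(probabilities))
    if rng is None:
        return max(indices, key=probabilities.__getitem__)
    return rng.choices(indices, weights=probabilities, k=1)[0]


example_logits = [0.1, 2.0, -1.0]
greedy_action = select_action(example_logits)
sampled_action = select_action(example_logits, rng=random.Random(0))
print('greedy action: ', greedy_action)
print('sampled action:', sampled_action)
```

In this notebook the network has 19 outputs, so the selected index would be one of the 19 football actions.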

In [None]:
%%writefile ./agent.py
import time

import torch
import gfootball
import gym
from gfootball.env.wrappers import Simple115StateWrapper

from rl_replicas.common.policy import Policy
from rl_replicas.common.torch_net import mlp

start_setup_time: float = time.time()

num_observation = 115
num_action = 19
policy_network_architecture = [64, 64]
model_location = './trpo/model.pt'
model = torch.load(model_location)

policy_network = mlp(
    sizes = [num_observation]+policy_network_architecture+[num_action]
)

policy_network.load_state_dict(model['policy_state_dict'])

policy: Policy = Policy(
    network = policy_network,
    optimizer = None
)

current_step: int = 0

print('Set up Time: {:<8.3g}'.format(time.time() - start_setup_time))

def agent(observation):
    global policy
    global current_step

    start_time: float = time.time()
    current_step += 1

    raw_observation = observation['players_raw']
    simple_115_observation = Simple115StateWrapper.convert_observation(raw_observation, fixed_positions=False)
    observation_tensor: torch.Tensor = torch.from_numpy(simple_115_observation).float()

    action = policy.predict(observation_tensor)
    
    if (current_step%100) == 0:
        print('Current Step: {}'.format(current_step))

    one_step_time = time.time() - start_time
    if one_step_time >= 0.2:
        print('One Step Time exceeded 0.2 seconds: {:<8.3g}'.format(one_step_time))

    return [action.item()]


Test run

In [None]:
from kaggle_environments import make

env = make("football", 
           configuration={
             "save_video": True, 
             "scenario_name": "11_vs_11_kaggle",
             "running_in_notebook": True,
           })

output = env.run(["./agent.py", "do_nothing"])[-1]

print('Left player: reward = {}, status = {}, info = {}'.format(output[0]['reward'], output[0]['status'], output[0]['info']))
print('Right player: reward = {}, status = {}, info = {}'.format(output[1]['reward'], output[1]['status'], output[1]['info']))

env.render(mode="human", width=800, height=600)