You can check [my first post about training a TRPO agent with Reinforcement Learning Replications (rl-replicas)](https://www.kaggle.com/yamatokataoka/train-agent-with-rl-replicas).

rl-replicas is an ongoing project for implementing key RL algorithms: VPG, TRPO, and PPO. You can find more details at [https://github.com/yamatokataoka/reinforcement-learning-replications](https://github.com/yamatokataoka/reinforcement-learning-replications).

In this notebook, you'll learn how to continue training from a pre-trained model with rl-replicas.

Install the tools required by GFootball.

In [None]:
# Install:
# Kaggle environments.
!git clone -q https://github.com/Kaggle/kaggle-environments.git
!cd kaggle-environments && pip install .

# GFootball environment.
!apt-get update -y -qq > /dev/null
!apt-get install -y -qq libsdl2-gfx-dev libsdl2-ttf-dev

# Make sure the branch in the git clone call matches the version in the wget call.
!git clone -q -b v2.8 https://github.com/google-research/football.git
!mkdir -p football/third_party/gfootball_engine/lib

!wget -q https://storage.googleapis.com/gfootball/prebuilt_gameplayfootball_v2.8.so -O football/third_party/gfootball_engine/lib/prebuilt_gameplayfootball.so
!cd football && GFOOTBALL_USE_PREBUILT_SO=1 pip install .

Install rl-replicas.

In [None]:
!pip install rl-replicas

### Train with rl-replicas

You'll set up the TRPO algorithm with `ConjugateGradientOptimizer`.

The trained model is saved in `./trpo/model.pt`.

You'll use this to continue your training later.

##### Note
This is the same code as in [my first post](https://www.kaggle.com/yamatokataoka/train-agent-with-rl-replicas#Train-with-rl-replicas).
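
The `sizes` lists passed to `mlp` in the next cell are built as input size, hidden layers, output size. As a rough sketch of what that produces, the snippet below assumes the `simple115v2` observation has 115 features and the default action set has 19 actions, and that `mlp` builds plain fully connected layers with biases. The helper names `layer_sizes` and `count_parameters` are hypothetical and not part of rl-replicas.

```python
from typing import List


def layer_sizes(input_size: int, hidden_sizes: List[int], output_size: int) -> List[int]:
    """Concatenate input, hidden, and output sizes the same way the notebook does."""
    return [input_size] + hidden_sizes + [output_size]


def count_parameters(sizes: List[int]) -> int:
    """Count weights and biases, assuming fully connected layers with biases."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


# Assumed values; in the notebook they come from
# env.observation_space.shape[0] and env.action_space.n.
observation_size = 115
action_count = 19

policy_sizes = layer_sizes(observation_size, [64, 64], action_count)
value_sizes = layer_sizes(observation_size, [64, 64], 1)

print(policy_sizes, count_parameters(policy_sizes))
print(value_sizes, count_parameters(value_sizes))
```

The policy network ends in one output per action, while the value function ends in a single scalar output.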

In [None]:
import os

import gfootball
import gym
import torch
import torch.nn as nn

from rl_replicas.algorithms import TRPO
from rl_replicas.common.policy import Policy
from rl_replicas.common.value_function import ValueFunction
from rl_replicas.common.optimizers import ConjugateGradientOptimizer
from rl_replicas.common.torch_net import mlp

algorithm_name = 'trpo'
environment_name = 'GFootball-11_vs_11_kaggle-simple115v2-v0'
epochs = 5
steps_per_epoch = 4000
policy_network_architecture = [64, 64]
value_function_network_architecture = [64, 64]
value_function_learning_rate = 1e-3
output_dir = './trpo'

env: gym.Env = gym.make(environment_name)

policy_network = mlp(
    sizes = [env.observation_space.shape[0]]+policy_network_architecture+[env.action_space.n]
)

policy: Policy = Policy(
    network = policy_network,
    optimizer = ConjugateGradientOptimizer(params=policy_network.parameters())
)

value_function_network = mlp(
    sizes = [env.observation_space.shape[0]]+value_function_network_architecture+[1]
)
value_function: ValueFunction = ValueFunction(
    network = value_function_network,
    optimizer = torch.optim.Adam(value_function_network.parameters(), lr=value_function_learning_rate)
)

model: TRPO = TRPO(policy, value_function, env, seed=0)

print('an experiment to: {}'.format(output_dir))

print('algorithm:           {}'.format(algorithm_name))
print('epochs:              {}'.format(epochs))
print('steps_per_epoch:     {}'.format(steps_per_epoch))
print('environment:         {}'.format(environment_name))

print('value_function_learning_rate: {}'.format(value_function_learning_rate))
print('policy network:')
print(policy.network)

Start learning

In [None]:
model.learn(
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
    output_dir=output_dir,
    model_saving=True
)

### Continue your training with rl-replicas



You'll initialize all the necessary components for training again.

Then load the pre-trained model, `./trpo/model.pt`.

To continue your work, you need to load the `state_dict` of the policy and the value function, along with the `state_dict` of each optimizer.
```
previous_model = torch.load(previous_model_location)

policy.network.load_state_dict(previous_model['policy_state_dict'])
policy.optimizer.load_state_dict(previous_model['policy_optimizer_state_dict'])
value_function.network.load_state_dict(previous_model['value_fn_state_dict'])
value_function.optimizer.load_state_dict(previous_model['value_fn_optimizer_state_dict'])
```

##### Note
The implementation for saving `model.pt` is here:

https://github.com/yamatokataoka/reinforcement-learning-replications/blob/master/rl_replicas/common/base_algorithms/on_policy_algorithm.py#L106-L114
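
To see the save-and-restore pattern in isolation, here is a minimal sketch in plain PyTorch. It is not the rl-replicas implementation: the networks are tiny stand-ins, Adam is used for both optimizers instead of `ConjugateGradientOptimizer`, and the `epoch` value is made up. Only the dictionary keys are taken from the snippet above.

```python
import os
import tempfile

import torch
import torch.nn as nn


def make_components():
    """Build stand-in policy and value networks with their optimizers (assumed shapes)."""
    policy_net = nn.Linear(4, 2)
    value_net = nn.Linear(4, 1)
    policy_opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
    value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)
    return policy_net, policy_opt, value_net, value_opt


policy_net, policy_opt, value_net, value_opt = make_components()

# Take one optimizer step so the optimizers hold some internal state.
inputs = torch.ones(1, 4)
loss = policy_net(inputs).sum() + value_net(inputs).sum()
loss.backward()
policy_opt.step()
value_opt.step()

# Save a checkpoint with the same keys the notebook reads back.
checkpoint_path = os.path.join(tempfile.mkdtemp(), 'model.pt')
torch.save({
    'epoch': 5,  # made-up value for illustration
    'policy_state_dict': policy_net.state_dict(),
    'policy_optimizer_state_dict': policy_opt.state_dict(),
    'value_fn_state_dict': value_net.state_dict(),
    'value_fn_optimizer_state_dict': value_opt.state_dict(),
}, checkpoint_path)

# Rebuild fresh components, then restore them from the checkpoint.
new_policy_net, new_policy_opt, new_value_net, new_value_opt = make_components()
checkpoint = torch.load(checkpoint_path)
new_policy_net.load_state_dict(checkpoint['policy_state_dict'])
new_policy_opt.load_state_dict(checkpoint['policy_optimizer_state_dict'])
new_value_net.load_state_dict(checkpoint['value_fn_state_dict'])
new_value_opt.load_state_dict(checkpoint['value_fn_optimizer_state_dict'])

# The restored weights should match the saved ones exactly.
weights_match = all(
    torch.equal(saved, restored)
    for saved, restored in zip(
        policy_net.state_dict().values(), new_policy_net.state_dict().values()
    )
)
print('epoch:', checkpoint['epoch'], 'weights match:', weights_match)
```

The fresh components must have the same architecture as the saved ones, which is why the notebook rebuilds the networks with identical sizes before loading.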

In [None]:
import os

import gfootball
import gym
import torch
import torch.nn as nn

from rl_replicas.algorithms import TRPO
from rl_replicas.common.policy import Policy
from rl_replicas.common.value_function import ValueFunction
from rl_replicas.common.optimizers import ConjugateGradientOptimizer
from rl_replicas.common.torch_net import mlp

algorithm_name = 'trpo'
environment_name = 'GFootball-11_vs_11_kaggle-simple115v2-v0'
epochs = 5
steps_per_epoch = 4000
policy_network_architecture = [64, 64]
value_function_network_architecture = [64, 64]
output_dir = './trpo'

previous_model_location = os.path.join(output_dir, 'model.pt')

env: gym.Env = gym.make(environment_name)

policy_network = mlp(
    sizes = [env.observation_space.shape[0]]+policy_network_architecture+[env.action_space.n]
)

policy: Policy = Policy(
    network = policy_network,
    optimizer = ConjugateGradientOptimizer(params=policy_network.parameters())
)

value_function_network = mlp(
    sizes = [env.observation_space.shape[0]]+value_function_network_architecture+[1]
)
value_function: ValueFunction = ValueFunction(
    network = value_function_network,
    optimizer = torch.optim.Adam(value_function_network.parameters())
)
    
if os.path.isfile(previous_model_location):
    print('Load previous model: {}'.format(previous_model_location))
    previous_model = torch.load(previous_model_location)

    policy.network.load_state_dict(previous_model['policy_state_dict'])
    policy.optimizer.load_state_dict(previous_model['policy_optimizer_state_dict'])
    value_function.network.load_state_dict(previous_model['value_fn_state_dict'])
    value_function.optimizer.load_state_dict(previous_model['value_fn_optimizer_state_dict'])

model: TRPO = TRPO(policy, value_function, env, seed=0)

print('an experiment to: {}'.format(output_dir))

print('algorithm:           {}'.format(algorithm_name))
print('epochs:              {}'.format(epochs))
print('steps_per_epoch:     {}'.format(steps_per_epoch))
print('environment:         {}'.format(environment_name))

print('policy network:')
print(policy.network)

Continue your training

In [None]:
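# previous_model is only defined if a checkpoint was found in the previous cell.
if os.path.isfile(previous_model_location):
    print('Restarting training from epoch {}'.format(previous_model['epoch']))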
model.learn(
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
    output_dir=output_dir,
    model_saving=True
)