# Trainer
## Overview
Trainer is the highest-level encapsulation in Tianshou. It controls the training loop and the evaluation method. It also controls the interaction between the Collector and the Policy, with the Replay Buffer serving as the media.

## Usages
There are three types of Trainer, designed to be used in on-policy training, off-policy training and offline training respectively.

### Training without trainer

In [1]:
import gymnasium as gym
import numpy as np
import torch

from tianshou.data import Collector, VectorReplayBuffer
from tianshou.env import DummyVectorEnv
from tianshou.policy import PGPolicy
from tianshou.utils.net.common import Net
from tianshou.utils.net.discrete import Actor

import warnings
warnings.filterwarnings('ignore')

train_env_num = 4
buffer_size = 2000  # since REINFORCE is an on-policy algoritm, we don't need a very large buffer size

# create the environments, used for training and evaluation
env = gym.make("CartPole-v0")
test_envs = DummyVectorEnv([lambda: gym.make("CartPole-v0") for _ in range(2)])
train_envs = DummyVectorEnv([lambda: gym.make("CartPole-v0") for _ in range(train_env_num)])

# create the policy instance
net = Net(env.observation_space.shape, hidden_sizes=[16,])
actor = Actor(net, env.action_space.shape)
optim = torch.optim.Adam(actor.parameters(), lr=0.001)
policy = PGPolicy(actor, optim, dist_fn=torch.distributions.Categorical)

# create the replay buffer and the collector
replaybuffer = VectorReplayBuffer(buffer_size, train_env_num)
test_collector = Collector(policy, test_envs)
train_collector = Collector(policy, train_envs, replaybuffer)

Now we can try training our policy network. The logic is simple. We collect some data into the buffer and then we use the data to train our policy.

In [2]:
train_collector.reset()
train_envs.reset()
test_collector.reset()
test_envs.reset()
replaybuffer.reset()
for i in range(10):
    evaluation_result = test_collector.collect(n_episode=10)
    print("Evalaution reward is {}". format(evaluation_result["rew"]))
    train_collector.collect(n_step=2000)
    # 0 means taking all data stored in train_collector.buffer
    policy.update(0, train_collector.buffer, batch_size=512, repeat=1)
    train_collector.reset_buffer(keep_statistics=True)

Evalaution reward is 9.8
Evalaution reward is 9.3
Evalaution reward is 9.2
Evalaution reward is 9.8
Evalaution reward is 9.3
Evalaution reward is 9.6
Evalaution reward is 8.8
Evalaution reward is 9.4
Evalaution reward is 9.3
Evalaution reward is 9.0


The evaluation reward doesn't seem to improve. That is simply because we haven't trained it for enough time. Plus, the network size is too small and REINFORCE algorithm is actually not very stable.

### Training with trainer

In [3]:
from tianshou.trainer import onpolicy_trainer

train_collector.reset()
train_envs.reset()
test_collector.reset()
test_envs.reset()
replaybuffer.reset()

result = onpolicy_trainer(
    policy,
    train_collector,
    test_collector,
    max_epoch=10,
    step_per_epoch=1,
    repeat_per_collect=1,
    episode_per_test=10,
    step_per_collect=2000,
    batch_size=512
)
print(result)

Epoch #1: 2000it [00:00, 3811.17it/s, env_step=2000, len=9, loss=0.000, n/ep=213, n/st=2000, rew=9.37]

Epoch #1: test_reward: 9.400000 ± 0.489898, best_reward: 9.500000 ± 0.500000 in #0



Epoch #2: 2000it [00:00, 4509.43it/s, env_step=4000, len=9, loss=0.000, n/ep=211, n/st=2000, rew=9.39]

Epoch #2: test_reward: 9.000000 ± 0.632456, best_reward: 9.500000 ± 0.500000 in #0



Epoch #3: 2000it [00:00, 5348.88it/s, env_step=6000, len=9, loss=0.000, n/ep=213, n/st=2000, rew=9.39]

Epoch #3: test_reward: 9.400000 ± 0.663325, best_reward: 9.500000 ± 0.500000 in #0



Epoch #4: 2000it [00:00, 5575.69it/s, env_step=8000, len=9, loss=0.000, n/ep=214, n/st=2000, rew=9.42]

Epoch #4: test_reward: 9.700000 ± 0.458258, best_reward: 9.700000 ± 0.458258 in #4



Epoch #5: 2000it [00:00, 5656.60it/s, env_step=10000, len=9, loss=0.000, n/ep=212, n/st=2000, rew=9.39]

Epoch #5: test_reward: 8.900000 ± 0.830662, best_reward: 9.700000 ± 0.458258 in #4



Epoch #6: 2000it [00:00, 5658.24it/s, env_step=12000, len=9, loss=0.000, n/ep=215, n/st=2000, rew=9.30]

Epoch #6: test_reward: 9.500000 ± 0.500000, best_reward: 9.700000 ± 0.458258 in #4



Epoch #7: 2000it [00:00, 5352.18it/s, env_step=14000, len=9, loss=0.000, n/ep=212, n/st=2000, rew=9.45]

Epoch #7: test_reward: 9.300000 ± 0.640312, best_reward: 9.700000 ± 0.458258 in #4



Epoch #8: 2000it [00:00, 5411.20it/s, env_step=16000, len=9, loss=0.000, n/ep=215, n/st=2000, rew=9.31]

Epoch #8: test_reward: 9.400000 ± 0.800000, best_reward: 9.700000 ± 0.458258 in #4



Epoch #9: 2000it [00:00, 4629.43it/s, env_step=18000, len=9, loss=0.000, n/ep=214, n/st=2000, rew=9.38]

Epoch #9: test_reward: 9.200000 ± 0.748331, best_reward: 9.700000 ± 0.458258 in #4



Epoch #10: 2000it [00:00, 5454.67it/s, env_step=20000, len=9, loss=0.000, n/ep=212, n/st=2000, rew=9.36]

Epoch #10: test_reward: 9.100000 ± 0.538516, best_reward: 9.700000 ± 0.458258 in #4
{'duration': '4.44s', 'train_time/model': '0.19s', 'test_step': 1024, 'test_episode': 110, 'test_time': '0.42s', 'test_speed': '2465.46 step/s', 'best_reward': 9.7, 'best_result': '9.70 ± 0.46', 'train_step': 20000, 'train_episode': 2131, 'train_time/collector': '3.84s', 'train_speed': '4963.75 step/s'}



