This example trains a PPO (Proximal Policy Optimization) agent on MuJoCo benchmarks from OpenAI Gym.
We follow the training and evaluation settings of Deep Reinforcement Learning that Matters, which provides thorough, highly tuned benchmark results.
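As background, the core of PPO is its clipped surrogate objective. Below is a minimal NumPy sketch of that objective (the function name and arguments are illustrative, not ChainerRL's API):

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Clipped surrogate objective from the PPO paper (to be maximized).

    log_prob_new / log_prob_old: log pi(a|s) under the current / behavior policy.
    advantages: estimated advantages A(s, a).
    eps: the clipping parameter epsilon.
    """
    ratio = np.exp(log_prob_new - log_prob_old)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Elementwise minimum of the two surrogates, averaged over the batch.
    return np.mean(np.minimum(unclipped, clipped))
```

When the new and old policies coincide the ratio is 1 and the objective reduces to the mean advantage; large policy updates are capped by the clip term.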
- MuJoCo Pro 1.5
- `mujoco_py>=1.50, <2.1`
```
python train_ppo.py [options]
```
- `--gpu`. Specifies the GPU. If you do not have a GPU on your machine, run the example with the option `--gpu -1`. E.g. `python train_ppo.py --gpu -1`.
- `--env`. Specifies the environment. E.g. `python train_ppo.py --env HalfCheetah-v2`.
- `--render`. Add this option to render the states in a GUI window.
- `--seed`. Specifies the random seed used.
- `--outdir`. Specifies the output directory to which the results are written.

To view the full list of options, either view the code or run the example with the `--help` option.
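For illustration, options like the ones above could be declared with Python's `argparse` along these lines (the defaults and the demo argument list are assumptions, not the script's actual values):

```python
import argparse

parser = argparse.ArgumentParser(
    description='Train PPO on a MuJoCo Gym environment.')
parser.add_argument('--gpu', type=int, default=0,
                    help='GPU device ID; pass -1 to run on CPU.')
parser.add_argument('--env', type=str, default='Hopper-v2',
                    help='OpenAI Gym environment ID.')
parser.add_argument('--render', action='store_true',
                    help='Render states in a GUI window.')
parser.add_argument('--seed', type=int, default=0,
                    help='Random seed.')
parser.add_argument('--outdir', type=str, default='results',
                    help='Directory to which results are written.')

# Parse a demo argument list; a real script would call parser.parse_args().
args = parser.parse_args(['--gpu', '-1', '--env', 'HalfCheetah-v2'])
```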
- While the original paper initialized weights from a normal distribution (https://github.com/Breakend/baselines/blob/50ffe01d254221db75cdb5c2ba0ab51a6da06b0a/baselines/ppo1/mlp_policy.py#L28), we use orthogonal initialization, as the latest openai/baselines does (https://github.com/openai/baselines/blob/9b68103b737ac46bc201dfb3121cfa5df2127e53/baselines/a2c/utils.py#L61).
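Orthogonal initialization can be sketched in NumPy as follows (a QR-based construction in the spirit of the baselines utility linked above; the function name and signature are illustrative):

```python
import numpy as np

def orthogonal_init(shape, scale=1.0, rng=None):
    """Orthogonal weight initialization.

    Draws a Gaussian matrix, takes its QR decomposition, and returns the
    orthogonal factor scaled by `scale`.
    """
    rng = np.random.default_rng(rng)
    flat = (shape[0], int(np.prod(shape[1:])))
    a = rng.standard_normal(flat)
    # QR of a (or its transpose, for wide matrices) gives an orthonormal basis.
    q, r = np.linalg.qr(a if flat[0] >= flat[1] else a.T)
    # Sign correction so the result is uniformly distributed.
    q *= np.sign(np.diag(r))
    q = q if flat[0] >= flat[1] else q.T
    return scale * q.reshape(shape)
```

The rows (or columns, whichever is smaller) of the result are orthonormal, which helps keep activation magnitudes stable through deep networks.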
These scores are the average return +/- standard error over 100 evaluation episodes after 2M training steps.
Reported scores are taken from Table 1 of Deep Reinforcement Learning that Matters.
ChainerRL scores are based on 20 trials using different random seeds, run with the following command:

```
python train_ppo.py --gpu -1 --seed [0-19] --env [env]
```
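The "average return +/- standard error" figures can be computed from per-episode returns along these lines (the returns below are made up for illustration):

```python
import numpy as np

def mean_and_standard_error(returns):
    """Mean and standard error of the mean for a set of evaluation returns."""
    returns = np.asarray(returns, dtype=float)
    mean = returns.mean()
    # Standard error = sample standard deviation / sqrt(number of samples).
    stderr = returns.std(ddof=1) / np.sqrt(len(returns))
    return mean, stderr

# Illustrative data: one return per evaluation episode.
returns = [2300.0, 2500.0, 2400.0, 2420.0]
m, se = mean_and_standard_error(returns)
print('%.0f +/- %.0f' % (m, se))  # prints: 2405 +/- 41
```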
| Environment | ChainerRL Score | Reported Score |
| --- | --- | --- |
| HalfCheetah-v2 | 2404+/-185 | 2201+/-323 |
| Hopper-v2 | 2719+/-67 | 2790+/-62 |
| Walker2d-v2 | 2994+/-113 | N/A |
| Swimmer-v2 | 111+/-4 | N/A |
These training times were obtained by running `train_ppo.py` on a single CPU with no GPU.
| Environment | ChainerRL Time |
| --- | --- |
| HalfCheetah | 2.054 hours |
| Hopper | 2.057 hours |
| Swimmer | 2.051 hours |
| Walker2d | 2.065 hours |
| Statistic | Domain | Time (hours) |
| --- | --- | --- |
| Mean across all domains | | 2.057 |
| Fastest | Swimmer | 2.051 |
| Slowest | Walker2d | 2.065 |
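As a sanity check, the summary statistics follow directly from the per-domain times (a small sketch; the values are copied from the table above):

```python
# Per-domain training times in hours, copied from the table above.
times = {'HalfCheetah': 2.054, 'Hopper': 2.057, 'Swimmer': 2.051, 'Walker2d': 2.065}

mean_time = sum(times.values()) / len(times)  # 2.05675, reported as 2.057
fastest = min(times, key=times.get)           # the domain with the smallest time
slowest = max(times, key=times.get)           # the domain with the largest time
print('mean %.3f, fastest %s, slowest %s' % (mean_time, fastest, slowest))
```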
The shaded region represents one standard deviation of the average evaluation over 20 trials.