
Plans for releasing a MuJoCo benchmark of on-policy algorithms (VPG, A2C, PPO) #307

Closed
ChenDRAG opened this issue Mar 8, 2021 · 3 comments · Fixed by #308, #313, #318, #319 or #320
Labels: discussion (Discussion of a typical issue)

Comments

ChenDRAG (Collaborator) commented Mar 8, 2021

Purpose

The purpose of this issue (discussion) is to introduce a series of PRs in the near future aimed at releasing Tianshou's benchmark for the MuJoCo Gym task suite, using the on-policy algorithms Tianshou already supports (VPG, A2C, PPO).

Introduction

This issue is closely related to #274, which mainly focuses on benchmarking MuJoCo environments with off-policy algorithms and enhancing Tianshou along the way. MuJoCo is a widely used task suite in the literature, yet most DRL libraries have not released a satisfying benchmark on these important tasks. (While open-source implementations are available to practitioners, the lack of graphs, source data, comparisons, and specific details still makes it hard for newcomers to compare performance between algorithms.) We therefore decided to build a complete benchmark for the MuJoCo Gym task suite. The first step was #274; this issue is the second.

The work most closely related to ours is probably Spinning Up (PyTorch), which benchmarked 3 on-policy algorithms (VPG, PPO, TRPO) on 5 MuJoCo environments, while our benchmark will try to support more algorithms on 9 out of 13 environments. (Pusher, Thrower, Striker, and HumanoidStandup are not planned because they rarely appear in the literature.) Spinning Up is intended for beginners and therefore omits standard DRL tricks (such as observation normalization and normalized value regression targets); we intend to implement them.
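To illustrate what "normalized value regression targets" means here, below is a minimal sketch of one common form of the trick (scaling returns by a running standard deviation before the critic loss). The function name and the `running_ret_std` argument are illustrative assumptions, not the exact variant that will land in Tianshou:

```python
import torch
import torch.nn.functional as F

def critic_loss_with_normalized_targets(
    values: torch.Tensor,     # critic predictions, shape (batch,)
    returns: torch.Tensor,    # empirical returns, shape (batch,)
    running_ret_std: float,   # running std of returns, maintained by the trainer
    eps: float = 1e-8,
) -> torch.Tensor:
    """Regress the critic onto returns scaled by a running std estimate so that
    the value-loss magnitude stays comparable across environments and over the
    course of training."""
    targets = returns / (running_ret_std + eps)
    return F.mse_loss(values, targets)
```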

Beyond that, as with the off-policy benchmark, for each supported algorithm we will try to provide:

  • Default hyperparameters used for the benchmark and scripts to reproduce it.
  • A comparison of performance (and code-level details) with other open-source implementations and classic papers.
  • Graphs and raw data that can be used for research purposes.
  • Log details obtained during training.
  • Pretrained agents.
  • Some hints on how to tune the algorithm.

I have made a plan and hope to finish the tasks above in the coming few weeks. Some features of Tianshou will be enhanced along the way. I have already run some experiments on my fork of Tianshou, which produced the PPO benchmark below.

| Environment | Tianshou | ikostrikov/pytorch-a2c-ppo-acktr-gail | PPO paper | baselines | Spinning Up (PyTorch) |
| --- | --- | --- | --- | --- | --- |
| Ant | 3253.5±894.3 | N | N | N | ~650 |
| HalfCheetah | 5460.5±1319.3 | ~3120 | ~1800 | ~1700 | ~1670 |
| Hopper | 2139.7±776.6 | ~2300 | ~2330 | ~2400 | ~1850 |
| Walker2d | 3407.3±1011.3 | ~4000 | ~3460 | ~3510 | ~1230 |
| Swimmer | 99.8±33.9 | N | ~108 | ~111 | ~120 |
| Humanoid | 597.4±29.7 | N | N | N | N |
| Reacher | -3.9±0.4 | ~-5 | ~-7 | ~-6 | N |
| InvertedPendulum | N | N | ~1000 | ~940 | N |
| InvertedDoublePendulum | 8407.6±1136.0 | N | ~8000 | ~7350 | N |

(These results are outdated by now; check examples/mujoco for a better SOTA benchmark.)

* Reward metric: the table value is the max average return over 10 trials (different seeds) ± a single standard deviation over trials. Each trial is averaged over another 10 test seeds. Only the first 1M steps of data are considered. The shaded region on the graph also represents a single standard deviation. (Note that in the TD3 paper the shaded region represents only half of that.)

** ~ means the number is approximated from the graph because an accurate number is not provided in the paper. N means no result (graph) is provided.

*** We used the latest version of all MuJoCo environments in gym (0.17.3), which is often not the case in other papers; please check the original papers for details. (Outcomes are usually similar across versions, though.)
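To make the metric concrete, here is a small sketch of how a table entry could be computed from per-trial evaluation curves. This is one plausible reading of the reduction order; the array layout and function name are assumptions for illustration:

```python
import numpy as np

def table_entry(curves: np.ndarray) -> str:
    """curves: shape (n_trials, n_checkpoints). Each entry is the test return at
    one evaluation checkpoint, already averaged over the 10 test seeds and
    restricted to the first 1M environment steps."""
    mean_curve = curves.mean(axis=0)   # average over the 10 trials (seeds)
    best = int(mean_curve.argmax())    # checkpoint with the best average return
    return f"{mean_curve[best]:.1f} ± {curves[:, best].std():.1f}"  # mean ± one std over trials
```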

Example graph: (PPO training-curve figure omitted)

Plans

Here I briefly list the 7 planned PRs:

  1. Add an observation normalization wrapper for vecenv, because observation normalization strongly affects on-policy algorithms (see the sketch after this list).

  2. Refactor the VPG algorithm:
     • support value normalization and observation normalization;
     • try different versions of the VPG algorithm and benchmark them on MuJoCo.

  3. Probably support natural policy gradient and make a benchmark.

  4. Possibly support learning rate decay.

  5. Refactor the A2C algorithm and make a benchmark.

  6. Refactor and benchmark the PPO algorithm.

  7. Other enhancements:
     • provide drawing tools to reproduce the benchmark, fixing Provide curve-drawing examples #161;
     • more loggers that can be used right away;
     • maybe a fine-tuned version of PPO using tricks guided by this paper.
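As a rough illustration of plan item 1, here is a minimal sketch of an observation-normalization wrapper around a vectorized environment. The class name and the `reset()`/`step()` interface are assumptions for this sketch, not Tianshou's actual vecenv API:

```python
import numpy as np

class ObsNormVectorEnv:
    """Hypothetical wrapper around a vectorized env: it maintains running
    observation statistics and returns normalized (and clipped) observations."""

    def __init__(self, venv, clip: float = 10.0, eps: float = 1e-8):
        self.venv, self.clip, self.eps = venv, clip, eps
        self.mean, self.var, self.count = None, None, eps

    def _update(self, obs: np.ndarray) -> None:
        # Parallel (Welford-style) update of running mean/variance with a batch.
        if self.mean is None:
            self.mean = np.zeros(obs.shape[1:])
            self.var = np.ones(obs.shape[1:])
        batch_mean, batch_var, n = obs.mean(axis=0), obs.var(axis=0), obs.shape[0]
        delta = batch_mean - self.mean
        total = self.count + n
        new_mean = self.mean + delta * n / total
        m2 = self.var * self.count + batch_var * n + delta ** 2 * self.count * n / total
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def _normalize(self, obs: np.ndarray) -> np.ndarray:
        return np.clip((obs - self.mean) / np.sqrt(self.var + self.eps),
                       -self.clip, self.clip)

    def reset(self):
        obs = self.venv.reset()
        self._update(obs)
        return self._normalize(obs)

    def step(self, actions):
        obs, rew, done, info = self.venv.step(actions)
        self._update(obs)
        return self._normalize(obs), rew, done, info
```

One design point worth noting: at test time the saved mean/var should be frozen and reused rather than updated, otherwise evaluation returns are not comparable across checkpoints.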

Future work

1. Add support (benchmarked in the same way) for other algorithms (TRPO, ACER, etc.).

ChenDRAG (Collaborator, Author) commented:

VPG's (REINFORCE) benchmark is ready to be released now. The source code is on my fork of Tianshou.

| Environment | Tianshou |
| --- | --- |
| Ant | 1061.9±306.3 |
| HalfCheetah | 1284.8±441.3 |
| Hopper | 449.5±92.1 |
| Walker2d | 383.0±107.9 |
| Swimmer | 35.3±4.2 |
| Humanoid | 428.1±48.9 |
| Reacher | -5.5±0.3 |
| InvertedPendulum | 1000.0±0.0 |
| InvertedDoublePendulum | 8250.8±598.0 |

The reward metric is the same as PPO's, but considers the first 10M steps.

Example graph: (VPG training-curve figure omitted)

For now I cannot find any public benchmark of the REINFORCE algorithm on MuJoCo environments. Spinning Up's VPG algorithm is closer to A2C, because it uses GAE and a neural-network critic, so it cannot really be called REINFORCE. This is debatable, because VPG itself is not a very formal algorithm in the literature (it first appears in Spinning Up's docs, I think) and is loosely defined. As a result, I suggest we avoid the term VPG and use REINFORCE instead.
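To make the distinction concrete, here is a minimal sketch (single episode, no termination bookkeeping) contrasting the plain Monte Carlo returns REINFORCE uses with GAE advantages, which require a learned critic. The function names are illustrative only:

```python
import numpy as np

def reinforce_returns(rewards: np.ndarray, gamma: float = 0.99) -> np.ndarray:
    """Plain Monte Carlo returns: the weights REINFORCE puts on the log-probs.
    No critic and no GAE are involved."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def gae_advantages(rewards: np.ndarray, values: np.ndarray,
                   gamma: float = 0.99, lam: float = 0.95) -> np.ndarray:
    """GAE(lambda) advantages, which need a learned critic's value estimates
    (`values` holds one extra bootstrap entry at the end). Relying on this
    estimator is what makes Spinning Up's VPG closer to A2C than to REINFORCE."""
    adv = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv
```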
I will compare our REINFORCE benchmark's performance with Spinning Up's VPG (strictly speaking, A2C) anyway. You can see that we are roughly at parity even though we do not use a critic or GAE. Performance is clipped at 3M steps.

| Environment | Tianshou | Spinning Up (PyTorch) |
| --- | --- | --- |
| Ant | 524.0±81.3 | ~5 |
| HalfCheetah | 897.8±188.2 | ~600 |
| Hopper | 421.8±61.8 | ~800 |
| Walker2d | 361.6±77.3 | ~460 |
| Swimmer | 34.2±2.3 | ~51 |
| Humanoid | 381.9±61.6 | N |
| Reacher | -10.3±0.7 | N |
| InvertedPendulum | 983.8±26.5 | N |
| InvertedDoublePendulum | 1914.1±720.3 | N |

ChenDRAG (Collaborator, Author) commented:

The A2C benchmark is ready! The source code can be seen in #325.
The reward metric is the same as PPO's, but considers the first 3M / 1M steps for comparison (3M against Spinning Up, 1M against the PPO paper).

| Environment | Tianshou (3M steps) | Spinning Up (PyTorch) |
| --- | --- | --- |
| Ant | 5236.8±236.7 | ~5 |
| HalfCheetah | 2377.3±1363.7 | ~600 |
| Hopper | 1608.6±529.5 | ~800 |
| Walker2d | 1805.4±1055.9 | ~460 |
| Swimmer | 40.2±1.8 | ~51 |
| Humanoid | 5316.6±554.8 | N |
| Reacher | -5.2±0.5 | N |
| InvertedPendulum | 1000.0±0.0 | N |
| InvertedDoublePendulum | 9351.3±12.8 | N |

| Environment | Tianshou | PPO paper A2C | PPO paper A2C + Trust Region |
| --- | --- | --- | --- |
| Ant | 3485.4±433.1 | N | N |
| HalfCheetah | 1829.9±1068.3 | ~1000 | ~930 |
| Hopper | 1253.2±458.0 | ~900 | ~1220 |
| Walker2d | 1091.6±709.2 | ~850 | ~700 |
| Swimmer | 36.6±2.1 | ~31 | ~36 |
| Humanoid | 1726.0±1070.1 | N | N |
| Reacher | -6.7±2.3 | ~-24 | ~-27 |
| InvertedPendulum | 1000.0±0.0 | ~1000 | ~1000 |
| InvertedDoublePendulum | 9257.7±277.4 | ~7100 | ~8100 |

Note that we compare Tianshou's A2C implementation with Spinning Up's VPG, which we consider fair because both use a critic and GAE.

ChenDRAG (Collaborator, Author) commented:

Work finished.
