
Plans for releasing a MuJoCo benchmark of on-policy algorithms (VPG, A2C, PPO) #307

Closed
ChenDRAG opened this issue Mar 8, 2021 · 3 comments · Fixed by #308, #313, #318, #319 or #320
Labels: discussion (Discussion of a typical issue)

Comments

ChenDRAG (Collaborator) commented Mar 8, 2021

Purpose

The purpose of this issue (discussion) is to introduce a series of PRs in the near future aimed at releasing Tianshou's benchmark for the MuJoCo Gym task suite, using the on-policy algorithms Tianshou already supports (VPG, A2C, PPO).

Introduction

This issue is closely related to #274, which mainly focuses on benchmarking MuJoCo environments with off-policy algorithms and enhancing Tianshou along the way. MuJoCo is a widely used task suite in the literature, yet most DRL libraries have not released a satisfying benchmark on these important tasks. (While open-source implementations are available to practitioners, the lack of graphs, source data, comparisons, and specific details still makes it hard for newcomers to compare performance between algorithms.) We therefore decided to build a complete benchmark for the MuJoCo Gym task suite. The first step was #274; this issue is the second.

The work most closely related to ours is probably Spinning Up (PyTorch), which benchmarked 3 on-policy algorithms (VPG, PPO, TRPO) on 5 MuJoCo environments, while our benchmark will try to support more algorithms on 9 out of 13 environments. (Pusher, Thrower, Striker, and HumanoidStandup are not planned because they rarely appear in the literature.) Spinning Up is intended for beginners and therefore omits standard DRL tricks (such as observation normalization and normalized value regression targets); we intend to implement them.
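To illustrate what "normalized value regression targets" means here, below is a minimal sketch of one common form of the trick (scaling returns by a running standard deviation before the critic loss). The function name and the `running_ret_std` argument are illustrative assumptions, not the exact variant that will land in Tianshou:

```python
import torch
import torch.nn.functional as F

def critic_loss_with_normalized_targets(
    values: torch.Tensor,     # critic predictions, shape (batch,)
    returns: torch.Tensor,    # empirical returns, shape (batch,)
    running_ret_std: float,   # running std of returns, maintained by the trainer
    eps: float = 1e-8,
) -> torch.Tensor:
    """Regress the critic onto returns scaled by a running std estimate so that
    the value-loss magnitude stays comparable across environments and over the
    course of training."""
    targets = returns / (running_ret_std + eps)
    return F.mse_loss(values, targets)
```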

Beyond that, as with the off-policy benchmark, for each supported algorithm we will try to provide:

  • Default hyperparameters used for the benchmark and scripts to reproduce it.
  • A comparison of performance (and code-level details) with other open-source implementations and classic papers.
  • Graphs and raw data that can be used for research purposes.
  • Log details obtained during training.
  • Pretrained agents.
  • Some hints on how to tune the algorithm.

I have made a plan and hope to finish the tasks above in the coming few weeks. Some features of Tianshou will be enhanced along the way. I have already run some experiments on my fork of Tianshou, which produced the PPO benchmark below.

| Environment | Tianshou | ikostrikov/pytorch-a2c-ppo-acktr-gail | PPO paper | baselines | Spinning Up (PyTorch) |
| --- | --- | --- | --- | --- | --- |
| Ant | 3253.5±894.3 | N | N | N | ~650 |
| HalfCheetah | 5460.5±1319.3 | ~3120 | ~1800 | ~1700 | ~1670 |
| Hopper | 2139.7±776.6 | ~2300 | ~2330 | ~2400 | ~1850 |
| Walker2d | 3407.3±1011.3 | ~4000 | ~3460 | ~3510 | ~1230 |
| Swimmer | 99.8±33.9 | N | ~108 | ~111 | ~120 |
| Humanoid | 597.4±29.7 | N | N | N | N |
| Reacher | -3.9±0.4 | ~-5 | ~-7 | ~-6 | N |
| InvertedPendulum | N | N | ~1000 | ~940 | N |
| InvertedDoublePendulum | 8407.6±1136.0 | N | ~8000 | ~7350 | N |

(These results are outdated by now; check examples/mujoco for a better SOTA benchmark.)

* Reward metric: the table value is the max average return over 10 trials (different seeds) ± a single standard deviation over trials. Each trial is averaged over another 10 test seeds. Only the first 1M steps of data are considered. The shaded region on the graph also represents a single standard deviation. (Note that in the TD3 paper the shaded region represents only half of that.)

** ~ means the number is approximated from the graph because an accurate number is not provided in the paper. N means no result (graph) is provided.

*** We used the latest version of all MuJoCo environments in gym (0.17.3), which is often not the case in other papers; please check the original papers for details. (Outcomes are usually similar across versions, though.)
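To make the metric concrete, here is a small sketch of how a table entry could be computed from per-trial evaluation curves. This is one plausible reading of the reduction order; the array layout and function name are assumptions for illustration:

```python
import numpy as np

def table_entry(curves: np.ndarray) -> str:
    """curves: shape (n_trials, n_checkpoints). Each entry is the test return at
    one evaluation checkpoint, already averaged over the 10 test seeds and
    restricted to the first 1M environment steps."""
    mean_curve = curves.mean(axis=0)   # average over the 10 trials (seeds)
    best = int(mean_curve.argmax())    # checkpoint with the best average return
    return f"{mean_curve[best]:.1f} ± {curves[:, best].std():.1f}"  # mean ± one std over trials
```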

Example graph: (PPO training-curve figure omitted)

Plans

Here I briefly list the 7 planned PRs:

  1. Add an observation normalization wrapper for vecenv, because observation normalization strongly affects on-policy algorithms (see the sketch after this list).

  2. Refactor the VPG algorithm:
     • support value normalization and observation normalization;
     • try different versions of the VPG algorithm and benchmark them on MuJoCo.

  3. Probably support natural policy gradient and make a benchmark.

  4. Possibly support learning rate decay.

  5. Refactor the A2C algorithm and make a benchmark.

  6. Refactor and benchmark the PPO algorithm.

  7. Other enhancements:
     • provide drawing tools to reproduce the benchmark, fixing Provide curve-drawing examples #161;
     • more loggers that can be used right away;
     • maybe a fine-tuned version of PPO using tricks guided by this paper.
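As a rough illustration of plan item 1, here is a minimal sketch of an observation-normalization wrapper around a vectorized environment. The class name and the `reset()`/`step()` interface are assumptions for this sketch, not Tianshou's actual vecenv API:

```python
import numpy as np

class ObsNormVectorEnv:
    """Hypothetical wrapper around a vectorized env: it maintains running
    observation statistics and returns normalized (and clipped) observations."""

    def __init__(self, venv, clip: float = 10.0, eps: float = 1e-8):
        self.venv, self.clip, self.eps = venv, clip, eps
        self.mean, self.var, self.count = None, None, eps

    def _update(self, obs: np.ndarray) -> None:
        # Parallel (Welford-style) update of running mean/variance with a batch.
        if self.mean is None:
            self.mean = np.zeros(obs.shape[1:])
            self.var = np.ones(obs.shape[1:])
        batch_mean, batch_var, n = obs.mean(axis=0), obs.var(axis=0), obs.shape[0]
        delta = batch_mean - self.mean
        total = self.count + n
        new_mean = self.mean + delta * n / total
        m2 = self.var * self.count + batch_var * n + delta ** 2 * self.count * n / total
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def _normalize(self, obs: np.ndarray) -> np.ndarray:
        return np.clip((obs - self.mean) / np.sqrt(self.var + self.eps),
                       -self.clip, self.clip)

    def reset(self):
        obs = self.venv.reset()
        self._update(obs)
        return self._normalize(obs)

    def step(self, actions):
        obs, rew, done, info = self.venv.step(actions)
        self._update(obs)
        return self._normalize(obs), rew, done, info
```

One design point worth noting: at test time the saved mean/var should be frozen and reused rather than updated, otherwise evaluation returns are not comparable across checkpoints.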

Future work

1. Add support (benchmarked in the same way) for other algorithms (TRPO, ACER, etc.).

ChenDRAG (Collaborator, Author) commented:

VPG's (REINFORCE) benchmark is ready to be released now. The source code is on my fork of Tianshou.

| Environment | Tianshou |
| --- | --- |
| Ant | 1061.9±306.3 |
| HalfCheetah | 1284.8±441.3 |
| Hopper | 449.5±92.1 |
| Walker2d | 383.0±107.9 |
| Swimmer | 35.3±4.2 |
| Humanoid | 428.1±48.9 |
| Reacher | -5.5±0.3 |
| InvertedPendulum | 1000.0±0.0 |
| InvertedDoublePendulum | 8250.8±598.0 |

The reward metric is the same as PPO's, but considers the first 10M steps.

Example graph: (VPG training-curve figure omitted)

For now I cannot find any public benchmark of the REINFORCE algorithm on MuJoCo environments. Spinning Up's VPG algorithm is closer to A2C, because it uses GAE and a neural-network critic, so it cannot really be called REINFORCE. This is debatable, because VPG itself is not a very formal algorithm in the literature (it first appears in Spinning Up's docs, I think) and is loosely defined. As a result, I suggest we avoid the term VPG and use REINFORCE instead.
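To make the distinction concrete, here is a minimal sketch (single episode, no termination bookkeeping) contrasting the plain Monte Carlo returns REINFORCE uses with GAE advantages, which require a learned critic. The function names are illustrative only:

```python
import numpy as np

def reinforce_returns(rewards: np.ndarray, gamma: float = 0.99) -> np.ndarray:
    """Plain Monte Carlo returns: the weights REINFORCE puts on the log-probs.
    No critic and no GAE are involved."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def gae_advantages(rewards: np.ndarray, values: np.ndarray,
                   gamma: float = 0.99, lam: float = 0.95) -> np.ndarray:
    """GAE(lambda) advantages, which need a learned critic's value estimates
    (`values` holds one extra bootstrap entry at the end). Relying on this
    estimator is what makes Spinning Up's VPG closer to A2C than to REINFORCE."""
    adv = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv
```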
I will compare our REINFORCE benchmark's performance with Spinning Up's VPG (strictly speaking, A2C) anyway. You can see that we are roughly at parity even though we do not use a critic or GAE. Performance is clipped at 3M steps.

| Environment | Tianshou | Spinning Up (PyTorch) |
| --- | --- | --- |
| Ant | 524.0±81.3 | ~5 |
| HalfCheetah | 897.8±188.2 | ~600 |
| Hopper | 421.8±61.8 | ~800 |
| Walker2d | 361.6±77.3 | ~460 |
| Swimmer | 34.2±2.3 | ~51 |
| Humanoid | 381.9±61.6 | N |
| Reacher | -10.3±0.7 | N |
| InvertedPendulum | 983.8±26.5 | N |
| InvertedDoublePendulum | 1914.1±720.3 | N |

ChenDRAG (Collaborator, Author) commented:

The A2C benchmark is ready! The source code can be seen in #325.
The reward metric is the same as PPO's, but considers the first 3M / 1M steps for comparison (3M against Spinning Up, 1M against the PPO paper).

| Environment | Tianshou (3M steps) | Spinning Up (PyTorch) |
| --- | --- | --- |
| Ant | 5236.8±236.7 | ~5 |
| HalfCheetah | 2377.3±1363.7 | ~600 |
| Hopper | 1608.6±529.5 | ~800 |
| Walker2d | 1805.4±1055.9 | ~460 |
| Swimmer | 40.2±1.8 | ~51 |
| Humanoid | 5316.6±554.8 | N |
| Reacher | -5.2±0.5 | N |
| InvertedPendulum | 1000.0±0.0 | N |
| InvertedDoublePendulum | 9351.3±12.8 | N |

| Environment | Tianshou | PPO paper A2C | PPO paper A2C + Trust Region |
| --- | --- | --- | --- |
| Ant | 3485.4±433.1 | N | N |
| HalfCheetah | 1829.9±1068.3 | ~1000 | ~930 |
| Hopper | 1253.2±458.0 | ~900 | ~1220 |
| Walker2d | 1091.6±709.2 | ~850 | ~700 |
| Swimmer | 36.6±2.1 | ~31 | ~36 |
| Humanoid | 1726.0±1070.1 | N | N |
| Reacher | -6.7±2.3 | ~-24 | ~-27 |
| InvertedPendulum | 1000.0±0.0 | ~1000 | ~1000 |
| InvertedDoublePendulum | 9257.7±277.4 | ~7100 | ~8100 |

Note that we compare Tianshou's A2C implementation with Spinning Up's VPG, which we consider fair because both use a critic and GAE.

ChenDRAG (Collaborator, Author) commented:

Work finished.
