Plans for releasing a MuJoCo benchmark of on-policy algorithms (VPG, A2C, PPO) #307
Comments
The VPG (REINFORCE) benchmark is ready to be released now. Source code is on my fork of Tianshou.
The reward metric is the same as PPO's, but considers the first 10M steps. For now I cannot find any public benchmark of the REINFORCE algorithm on MuJoCo environments. Spinning Up's VPG algorithm is closer to A2C, because it uses both GAE and a neural-network critic, so it cannot really be called REINFORCE. This is debatable, because VPG itself is not a very formal algorithm in the literature (it first appears in Spinning Up's docs, I think) and is loosely defined. As a result, I suggest we do not use the term VPG, but use REINFORCE instead.
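To make the distinction concrete, here is a minimal sketch of the REINFORCE loss: no critic and no GAE, just Monte-Carlo returns weighting the log-probabilities (function and variable names are illustrative, not Tianshou's API):

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """REINFORCE: the discounted return-to-go R_t is the only weight on
    log pi(a_t | s_t); there is no value baseline and no GAE."""
    # Minimize the negation to do gradient ascent on E[log pi * R].
    return -(log_probs * returns).mean()
```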
The A2C benchmark is ready! Source code can be seen in #325.
Note that we compare Tianshou's A2C implementation with Spinning Up's VPG, which we consider fair because both use a critic and GAE.
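For reference, here is the GAE(λ) estimator that both implementations share, sketched for a single trajectory (illustrative code, not Tianshou's exact implementation):

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE(lambda) over one trajectory: rewards r_0..r_{T-1}, critic
    estimates V(s_0)..V(s_{T-1}), and last_value = V(s_T) used as the
    bootstrap (0.0 if the episode terminated)."""
    values = np.append(values, last_value)
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```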
Work finished.
Purpose
The purpose of this issue (discussion) is to introduce a series of PRs in the near future, targeted at releasing Tianshou's benchmark for the MuJoCo Gym task suite using the on-policy algorithms Tianshou already supports (VPG, A2C, PPO).
Introduction
This issue is closely related to #274, which mainly focuses on benchmarking MuJoCo environments using off-policy algorithms and enhancing Tianshou along the way. MuJoCo is a widely used task suite in the literature, yet most DRL libraries haven't released satisfying benchmarks on these important tasks. (While there are open-source implementations available to practitioners, the lack of graphs, source data, comparisons, and specific details still makes it hard for newcomers to compare performance between algorithms.) We decided to try building a complete benchmark for the MuJoCo Gym task suite. The first step is #274, and the second is this one.
The work most closely related to ours is probably Spinning Up (PyTorch), which benchmarked 3 on-policy algorithms (VPG, PPO, TRPO) on 5 MuJoCo environments, while our benchmark will try to support more algorithms on 9 out of 13 environments. (Pusher, Thrower, Striker, and HumanoidStandup are not considered for support because they are not commonly seen in the literature.) While Spinning Up is intended for beginners and thus does not implement standard DRL tricks (such as observation normalization and normalized value regression targets), we intend to do so.
Beyond that, as with the off-policy benchmark, for each supported algorithm we will try to provide the kinds of artifacts listed above: graphs, source data, comparisons, and specific training details.
I have made a plan and hope to finish the tasks above in the coming few weeks. Some features of Tianshou will be enhanced along the way. I have done some experiments on my fork of Tianshou, which produced the PPO benchmark below.
(This figure is outdated by now; check examples/mujoco for a better SOTA benchmark.)
* Reward metric: the table value is the max average return over 10 trials (different seeds) ± a single standard deviation over trials. Each trial is additionally averaged over 10 test seeds. Only the first 1M steps of data are considered. The shaded region on the graph also represents a single standard deviation. (Note that in the TD3 paper the shaded region represents only half of that.)
** ~ means the number is approximated from the graph because accurate numbers are not provided in the paper. N means graphs are not provided.
*** We used the latest version of all MuJoCo environments in Gym (0.17.3), but that is often not the case in other papers. Please check the details in the original papers yourself. (Outcomes across different versions are usually similar, though.)
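To spell the metric out, here is one plausible reading of how a table entry would be computed (hypothetical array layout and names; the exact aggregation may differ):

```python
import numpy as np

# returns[i, j] = test return of trial i (one training seed, already
# averaged over 10 test seeds) at evaluation checkpoint j, with only
# checkpoints inside the first 1M steps included.
def table_entry(returns: np.ndarray) -> str:
    mean_curve = returns.mean(axis=0)        # average over the 10 trials
    best = mean_curve.argmax()               # checkpoint with max average return
    mean, std = mean_curve[best], returns[:, best].std()
    return f"{mean:.1f} ± {std:.1f}"         # mean ± one standard deviation
```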
Example graph:
Plans
Here I briefly list 7 to-do PRs:
Add an observation normalization wrapper for the vectorized environment, because observation normalization strongly affects on-policy algorithms (see the first sketch after this list).
Refactor the VPG algorithm.
Probably support natural policy gradient and make a benchmark.
Possibly support learning rate decay (see the second sketch after this list).
Refactor the A2C algorithm and make a benchmark.
Refactor and benchmark the PPO algorithm (see the third sketch after this list).
Other enhancements.
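For the observation-normalization item above, here is a minimal single-env sketch of what the wrapper could look like; the planned version would wrap Tianshou's vectorized env and share the running statistics across workers (class and attribute names are my own, not the final API):

```python
import gym
import numpy as np

class ObsNormWrapper(gym.Wrapper):
    """Normalize observations with running mean/std statistics."""

    def __init__(self, env, eps: float = 1e-8, clip: float = 10.0):
        super().__init__(env)
        shape = env.observation_space.shape
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 0
        self.eps, self.clip = eps, clip

    def _update(self, obs):
        # Online (Welford-style) update of running mean and variance.
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.var += (delta * (obs - self.mean) - self.var) / self.count

    def _norm(self, obs):
        return np.clip((obs - self.mean) / np.sqrt(self.var + self.eps),
                       -self.clip, self.clip)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self._update(obs)
        return self._norm(obs)

    def step(self, action):
        obs, rew, done, info = self.env.step(action)
        self._update(obs)
        return self._norm(obs), rew, done, info
```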
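For the learning-rate decay item, the usual on-policy recipe is a linear anneal to zero over the training budget; a sketch using PyTorch's built-in scheduler (the network and update budget are placeholders):

```python
import torch
from torch import nn

# Placeholder network; in practice this would be the actor-critic model.
model = nn.Linear(8, 2)
optim = torch.optim.Adam(model.parameters(), lr=3e-4)

max_update_num = 1000  # assumed total number of gradient updates
# Linearly decay the learning rate from 3e-4 towards 0 over training.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optim, lr_lambda=lambda n: 1 - n / max_update_num)

for update in range(max_update_num):
    # ... collect rollouts, compute loss, loss.backward(), optim.step() ...
    scheduler.step()
```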
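And for the PPO refactor, the clipped surrogate objective at its core, in the standard form from the PPO paper (not Tianshou's exact code):

```python
import torch

def ppo_clip_loss(log_prob, old_log_prob, advantage, clip_ratio=0.2):
    """Clipped surrogate loss L^CLIP from Schulman et al. (2017)."""
    ratio = (log_prob - old_log_prob).exp()  # pi_new / pi_old
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
    # Pessimistic (min) bound, negated for gradient descent.
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```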
Future work
1. Add support (benchmarked in the same way) for other algorithms (TRPO, ACER, etc.).