
Fix Atari PPO example #780

Merged
9 commits merged into thu-ml:master from fix_atari_ppo on Dec 4, 2022

Conversation

nuance1979
Collaborator

- [x] I have marked all applicable categories:
  - [ ] exception-raising fix
  - [x] algorithm implementation fix
  - [ ] documentation modification
  - [ ] new feature
- [x] I have reformatted the code using `make format` (**required**)
- [x] I have checked the code using `make commit-checks` (**required**)
- [x] If applicable, I have mentioned the relevant/related issue(s)
- [x] If applicable, I have listed every item in this Pull Request below

While trying to debug Atari PPO+LSTM, I found a significant gap between our Atari PPO example and [CleanRL's Atari PPO w/ EnvPool](https://docs.cleanrl.dev/rl-algorithms/ppo/#ppo_atari_envpoolpy). I tried to align our implementation with CleanRL's version, mostly in hyperparameter choices, and got significant gains in Breakout, Qbert, and SpaceInvaders while staying on par in other games. After this fix, I would suggest updating the PPO experiments in our [Atari Benchmark](https://tianshou.readthedocs.io/en/master/tutorials/benchmark.html).
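For context, here is a rough sketch of the kind of hyperparameter configuration CleanRL documents for `ppo_atari_envpool.py`. The values below follow CleanRL's published defaults and are illustrative of the alignment only, not necessarily the exact numbers adopted in this PR:

```python
# Illustrative PPO-on-Atari hyperparameters in the spirit of CleanRL's
# ppo_atari_envpool.py defaults (assumed values; the numbers merged in
# this PR may differ slightly).
ppo_atari_hparams = dict(
    learning_rate=2.5e-4,   # typically with linear decay over training
    num_envs=8,             # parallel EnvPool environments
    num_steps=128,          # rollout length per env before each update
    gamma=0.99,
    gae_lambda=0.95,
    num_minibatches=4,
    update_epochs=4,
    clip_coef=0.1,          # PPO clipping epsilon
    ent_coef=0.01,
    vf_coef=0.5,
    max_grad_norm=0.5,
)
```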

A few interesting findings:

  • Layer initialization helps stabilize training and enables the use of larger learning rates; without it, larger learning rates trigger NaN gradients very quickly (see the initialization sketch after this list);
  • ppo.py#L97-L101: this change improves training stability for reasons I do not understand; it also increases GPU utilization.
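A minimal sketch of the CleanRL-style orthogonal layer initialization referred to above; the `layer_init` helper, the gains, and the network shapes are illustrative, not the literal code merged in this PR:

```python
import numpy as np
import torch
import torch.nn as nn


def layer_init(layer: nn.Module, std: float = np.sqrt(2), bias_const: float = 0.0) -> nn.Module:
    """Orthogonal weight init with constant bias, as popularized by CleanRL."""
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer


# Example: Nature-CNN feature extractor for 84x84x4 Atari frames,
# with every layer passed through layer_init.
feature_net = nn.Sequential(
    layer_init(nn.Conv2d(4, 32, kernel_size=8, stride=4)),
    nn.ReLU(),
    layer_init(nn.Conv2d(32, 64, kernel_size=4, stride=2)),
    nn.ReLU(),
    layer_init(nn.Conv2d(64, 64, kernel_size=3, stride=1)),
    nn.ReLU(),
    nn.Flatten(),
    layer_init(nn.Linear(64 * 7 * 7, 512)),
    nn.ReLU(),
)
# Policy/value heads usually get smaller gains (0.01 and 1.0) so the initial
# policy is near-uniform and the initial value estimate is near zero.
actor_head = layer_init(nn.Linear(512, 6), std=0.01)   # 6 = example action count
critic_head = layer_init(nn.Linear(512, 1), std=1.0)
```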

Shoutout to [CleanRL](https://github.com/vwxyzjn/cleanrl) for a well-tuned Atari PPO reference implementation!

@codecov-commenter

codecov-commenter commented Nov 30, 2022

Codecov Report

Merging #780 (d2f422e) into master (929508b) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #780      +/-   ##
==========================================
- Coverage   91.08%   91.07%   -0.01%     
==========================================
  Files          71       71              
  Lines        5082     5077       -5     
==========================================
- Hits         4629     4624       -5     
  Misses        453      453              
| Flag | Coverage | Δ |
| --- | --- | --- |
| unittests | 91.07% <100.00%> | -0.01% ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage | Δ |
| --- | --- | --- |
| tianshou/trainer/base.py | 96.90% <ø> | ø |
| tianshou/policy/modelfree/ppo.py | 92.06% <100.00%> | -0.59% ⬇️ |


@Trinkle23897 Trinkle23897 merged commit 662af52 into thu-ml:master Dec 4, 2022
@nuance1979 nuance1979 deleted the fix_atari_ppo branch December 5, 2022 21:35
BFAnas pushed a commit to BFAnas/tianshou that referenced this pull request May 5, 2024