Implement Proximal Policy Optimization #655
Conversation
Some minor comments!
Functionally, this looks great: it reliably solves CartPole on my machines here, and it looks good on my end to pull in.
There's some replicated code across our various RL examples, but I have ideas for how we can consolidate those once this is in.
The only other thing I'd add would be a small entry in the shared Readme for the Gym targets, but someone's already working on issue #657 to add further DQN documentation and they could add more about this target in that same update.
Awesome! I can review the relevant documentation (for both DQN and PPO) if needed.
Like DQN (PR #617), Proximal Policy Optimization (PPO) is another widely used reinforcement learning algorithm. Proposed by Schulman et al. in 2017, PPO is an on-policy policy gradient algorithm that serves as a standard baseline for environments with both discrete and continuous action spaces.
There are two versions of PPO: PPO-Clip and PPO-Penalty. This code implements PPO-Clip, the more popular version.
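For reference, PPO-Clip maximizes the clipped surrogate objective from Schulman et al. (2017), where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$ is the probability ratio, $\hat{A}_t$ the advantage estimate, and $\epsilon$ the clipping parameter:

$$
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right]
$$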
TODO
- `actorNet` and `criticNet`
- `loss1` using the minimum of the surrogate losses `surr1` and `surr2` (see the sketch below)
- If performance is subpar, will implement GAE
- `Categorical` distribution from swift-rl
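As a rough illustration of that surrogate-loss TODO item, here is a minimal sketch of how the minimum over `surr1` and `surr2` might look in Swift for TensorFlow. The function and parameter names (`clippedSurrogateLoss`, `logProbs`, `oldLogProbs`, `advantages`, `clipEpsilon`) are hypothetical and not this PR's actual identifiers, and it assumes the `TensorFlow` module's element-wise `min(_:_:)` and `clipped(min:max:)`:

```swift
import TensorFlow

/// Illustrative only: a PPO-Clip style surrogate loss over hypothetical inputs.
func clippedSurrogateLoss(
    logProbs: Tensor<Float>,     // log π_θ(a_t | s_t) under the current policy
    oldLogProbs: Tensor<Float>,  // log π_θ_old(a_t | s_t) recorded during the rollout
    advantages: Tensor<Float>,   // advantage estimates Â_t
    clipEpsilon: Float = 0.2
) -> Tensor<Float> {
    // Probability ratio r_t(θ) = exp(log π_θ - log π_θ_old).
    let ratios = exp(logProbs - oldLogProbs)
    // Unclipped surrogate: r_t(θ) * Â_t.
    let surr1 = ratios * advantages
    // Clipped surrogate: clip(r_t(θ), 1 - ε, 1 + ε) * Â_t.
    let surr2 = ratios.clipped(min: 1 - clipEpsilon, max: 1 + clipEpsilon) * advantages
    // PPO-Clip maximizes the element-wise minimum; negate to obtain a loss to minimize.
    return -min(surr1, surr2).mean()
}
```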