
Implement Proximal Policy Optimization #655

Merged
merged 32 commits into from
Aug 26, 2020

Conversation

seungjaeryanlee
Contributor

@seungjaeryanlee seungjaeryanlee commented Aug 8, 2020

Like DQN (PR #617), Proximal Policy Optimization (PPO) is another widely used reinforcement learning algorithm. Proposed by Schulman et al. in 2017, PPO is an on-policy policy gradient algorithm that serves as a standard baseline for environments with both discrete and continuous action spaces.

There are two versions of PPO: PPO-Clip and PPO-Penalty. This code implements PPO-Clip, the more popular version.
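For context, PPO-Clip optimizes a clipped surrogate objective: the minimum of the unclipped and clipped probability-ratio terms (the `surr1`/`surr2` mentioned in the TODO below). The sketch that follows only illustrates that math on plain Swift arrays; it is not the code in this PR, and the function and parameter names (`clippedSurrogateLoss`, `clipEpsilon`) are hypothetical.

```swift
import Foundation

/// Illustrative, framework-agnostic version of the PPO-Clip loss
/// (Schulman et al., 2017). The PR's implementation computes the same
/// quantities on tensors inside its own agent update step.
func clippedSurrogateLoss(
    logProbs: [Float],      // log pi_theta(a|s) under the current policy
    oldLogProbs: [Float],   // log pi_theta_old(a|s) recorded during the rollout
    advantages: [Float],    // advantage estimates for each (state, action) pair
    clipEpsilon: Float = 0.2
) -> Float {
    var total: Float = 0
    for i in 0..<logProbs.count {
        let ratio = exp(logProbs[i] - oldLogProbs[i])   // pi_theta / pi_theta_old
        let surr1 = ratio * advantages[i]
        let surr2 = max(min(ratio, 1 + clipEpsilon), 1 - clipEpsilon) * advantages[i]
        // PPO-Clip maximizes min(surr1, surr2); the loss is its negation.
        total += min(surr1, surr2)
    }
    return -total / Float(logProbs.count)
}
```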

TODO

  • Fix existing issues
    • Optimize both actorNet and criticNet
    • Compute loss1 using the minimum of surrogate losses surr1 and surr2
    • Fix gradients not being computed correctly and set to zero
  • Find hyperparameters with consistent performance on CartPole
    • If performance is subpar, will implement GAE (see the sketch after this list)
  • Refactor and document code
    • Refactor the Categorical distribution from swift-rl
    • Add documentation comments for hyperparameters
    • Add documentation comments for structs and classes
      • PPOMemory
      • ActorNetwork
      • CriticNetwork
      • ActorCritic
      • PPOAgent
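If GAE does end up being needed (per the TODO above), the standard recursion from Schulman et al. (2016) is sketched below on plain Swift arrays. This is not code from the PR; the function name and signature are hypothetical, and `values` is assumed to contain one extra bootstrap value at the end.

```swift
/// Generalized Advantage Estimation: A_t = delta_t + gamma * lambda * A_{t+1},
/// where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) is the one-step TD error.
func generalizedAdvantageEstimate(
    rewards: [Float],
    values: [Float],        // V(s_0) ... V(s_T), i.e. rewards.count + 1 entries
    isDones: [Bool],        // true where the episode terminated at step t
    gamma: Float = 0.99,
    lambda: Float = 0.95
) -> [Float] {
    var advantages = [Float](repeating: 0, count: rewards.count)
    var gae: Float = 0
    for t in stride(from: rewards.count - 1, through: 0, by: -1) {
        let mask: Float = isDones[t] ? 0 : 1   // zero out bootstrapping at episode ends
        let delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        gae = delta + gamma * lambda * gae * mask
        advantages[t] = gae
    }
    return advantages
}
```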

@BradLarson BradLarson added the gsoc Google Summer of Code label Aug 12, 2020
@seungjaeryanlee seungjaeryanlee marked this pull request as ready for review August 25, 2020 18:26
Member

@dan-zheng dan-zheng left a comment


Some minor comments!

Contributor

@BradLarson BradLarson left a comment


Functionally, this looks great. It reliably solves CartPole on my machines here. It looks good on my end to pull in.

There's some replicated code across our various RL examples, but I have ideas for how we can consolidate those once this is in.

The only other thing I'd add would be a small entry in the shared Readme for the Gym targets, but someone's already working on issue #657 to add further DQN documentation, and they could add more about this target in that same update.

@seungjaeryanlee
Contributor Author

Awesome! I can review the relevant documentation (for both DQN and PPO) if needed.

@dan-zheng dan-zheng merged commit a1a61d9 into tensorflow:master Aug 26, 2020
@seungjaeryanlee seungjaeryanlee deleted the ppo branch August 26, 2020 02:36