
Implement Proximal Policy Optimization #655

Merged
merged 32 commits into tensorflow:master from the ppo branch on Aug 26, 2020

Conversation

@seungjaeryanlee (Contributor) commented on Aug 8, 2020

Like DQN (PR #617), Proximal Policy Optimization (PPO) is another widely used reinforcement learning algorithm. Proposed by Schulman et al. in 2017, PPO is an on-policy policy gradient algorithm that serves as a standard baseline for environments with both discrete and continuous action spaces.

There are two versions of PPO: PPO-Clip and PPO-Penalty. This code implements PPO-Clip, the more popular version.
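For reference, a minimal sketch of the PPO-Clip surrogate loss in Swift for TensorFlow is shown below. The function and argument names (`clippedSurrogateLoss`, `oldLogProbs`, `newLogProbs`, `advantages`, `clipEpsilon`) are illustrative assumptions, not the API introduced by this PR.

```swift
import TensorFlow

/// A minimal sketch of the PPO-Clip surrogate loss (illustrative, not this PR's API).
/// `oldLogProbs` come from the policy that gathered the rollout, `newLogProbs` are
/// recomputed under the current policy, and `advantages` are the advantage estimates.
func clippedSurrogateLoss(
    oldLogProbs: Tensor<Float>,
    newLogProbs: Tensor<Float>,
    advantages: Tensor<Float>,
    clipEpsilon: Float = 0.2
) -> Tensor<Float> {
    // Probability ratio r_t = exp(log pi_new(a|s) - log pi_old(a|s)).
    let ratios = exp(newLogProbs - oldLogProbs)
    // Unclipped and clipped surrogate terms.
    let surr1 = ratios * advantages
    let surr2 = ratios.clipped(min: 1 - clipEpsilon, max: 1 + clipEpsilon) * advantages
    // PPO-Clip maximizes the elementwise minimum of the two terms;
    // negate it to obtain a loss suitable for gradient descent.
    return -(min(surr1, surr2).mean())
}
```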

TODO

  • Fix existing issues
    • Optimize both actorNet and criticNet
    • Compute loss1 using the minimum of the surrogate losses surr1 and surr2
    • Fix gradients not being computed correctly and set to zero
  • Find hyperparameters with consistent performance on CartPole
    • If performance is subpar, will implement Generalized Advantage Estimation (GAE); see the sketch after this list
  • Refactor and document code
    • Refactor the Categorical distribution from swift-rl
    • Add documentation comments for hyperparameters
    • Add documentation comments for structs and classes
      • PPOMemory
      • ActorNetwork
      • CriticNetwork
      • ActorCritic
      • PPOAgent
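
As referenced in the GAE item above, here is a hypothetical sketch of generalized advantage estimation over a finished rollout, using plain Swift arrays for clarity. The helper and its parameters (`rewards`, `values`, `isDones`, `gamma`, `lambda`) are illustrative and not part of this PR.

```swift
/// Illustrative sketch of Generalized Advantage Estimation (GAE); not this PR's code.
/// Expects `values` to hold one extra bootstrap entry, i.e. values.count == rewards.count + 1.
func generalizedAdvantages(
    rewards: [Float],
    values: [Float],
    isDones: [Bool],
    discount gamma: Float = 0.99,
    gaeLambda lambda: Float = 0.95
) -> [Float] {
    var advantages = [Float](repeating: 0, count: rewards.count)
    var runningAdvantage: Float = 0
    for t in stride(from: rewards.count - 1, through: 0, by: -1) {
        // Zero out the bootstrap term at episode boundaries.
        let mask: Float = isDones[t] ? 0 : 1
        // TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
        let delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        // GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}.
        runningAdvantage = delta + gamma * lambda * runningAdvantage * mask
        advantages[t] = runningAdvantage
    }
    return advantages
}
```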

@BradLarson added the gsoc (Google Summer of Code) label on Aug 12, 2020
Outdated review threads (resolved): Gym/PPO/ActorCritic.swift (×2), Gym/PPO/Categorical.swift, Gym/PPO/Memory.swift, Gym/PPO/main.swift
@seungjaeryanlee seungjaeryanlee marked this pull request as ready for review Aug 25, 2020
@dan-zheng (Member) left a comment

Some minor comments!

Outdated review threads (resolved): Gym/PPO/ActorCritic.swift (×2), Gym/PPO/Agent.swift (×3), Gym/PPO/main.swift (×2), Gym/PPO/Gathering.swift
@BradLarson (Contributor) left a comment

Functionally, this looks great. It reliably solves CartPole on my machines, and it looks good on my end to pull in.

There's some replicated code across our various RL examples, but I have ideas for how we can consolidate those once this is in.

The only other thing I'd add would be a small entry in the shared Readme for the Gym targets, but someone is already working on issue #657 to add further DQN documentation, and they could cover this target in that same update.

@seungjaeryanlee (Contributor, Author) commented on Aug 26, 2020

Awesome! I can review the relevant documentation (for both DQN and PPO) if needed.

@dan-zheng merged commit a1a61d9 into tensorflow:master on Aug 26, 2020
2 checks passed
@seungjaeryanlee deleted the ppo branch on Aug 26, 2020
Labels: gsoc (Google Summer of Code)
Participants: 4