Proximal Policy Optimization

PyTorch implementation of Proximal Policy Optimization (PPO), introduced by Schulman et al., 2017.

Figures: Learning curves for the following Gymnasium (formerly OpenAI Gym) Box2D control tasks:

LunarLander-v3 (proprioceptive states, discrete action space)
BipedalWalker-v3 (proprioceptive states, continuous action space)

The shaded region shows the standard deviation of the average evaluation return over 10 trials (one trial per seed, 10 different seeds). Curves are smoothed with a moving-average filter.
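The smoothing and the shaded band described above can be sketched as follows. This is a minimal illustration, not the plotting code used for the figures: the window size, the use of the population standard deviation, the function names `moving_average` and `mean_and_std`, and the toy data are all assumptions.

```python
from statistics import mean, pstdev


def moving_average(values, window=3):
    """Smooth a curve with a trailing moving-average filter.

    Early points average over as many samples as are available,
    so the output has the same length as the input.
    """
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start : i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def mean_and_std(curves):
    """Per-step mean and (population) standard deviation across seeds.

    `curves` is a list of equally long return curves, one per seed.
    """
    steps = list(zip(*curves))
    return [mean(s) for s in steps], [pstdev(s) for s in steps]


# Toy evaluation returns for three hypothetical seeds (not real results).
curves = [
    [0.0, 10.0, 20.0, 30.0],
    [2.0, 12.0, 22.0, 32.0],
    [4.0, 14.0, 24.0, 34.0],
]
avg, std = mean_and_std(curves)
smoothed_avg = moving_average(avg, window=2)

# The shaded band would span avg[i] - std[i] to avg[i] + std[i].
print(smoothed_avg)  # [2.0, 7.0, 17.0, 27.0]
```

A trailing window keeps the curve aligned with the most recent evaluation; a centered window would be an equally reasonable choice.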

Note: This repository implements PPO for both discrete and continuous action spaces. It also supports collecting experience from multiple environments in parallel (see n_envs under Usage).

Algorithm

Quick Facts

PPO is a model-free algorithm
PPO is an on-policy algorithm
PPO can be used in discrete and continuous action spaces



PPO algorithm. Taken from Schulman et al., 2017.
PPO-Clip algorithm. Taken from OpenAI Spinning Up.



PPO diagram. Taken from Tan et al., 2023.
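As a rough illustration of the PPO-Clip idea referenced above, the clipped surrogate objective from Schulman et al., 2017 can be written in a few lines of plain Python. This is a sketch, not this repository's implementation (which uses PyTorch); the function name `clipped_surrogate` and its arguments are hypothetical.

```python
import math


def clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_range=0.2):
    """Mean PPO-Clip objective over a batch (a quantity to maximize).

    For each sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r = pi_new(a|s) / pi_old(a|s) and A is the advantage.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio computed from log-probabilities.
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
        # Taking the minimum removes the incentive to move the ratio
        # outside [1 - eps, 1 + eps] in the direction that helps.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


# Sample 1: ratio 1.5 with positive advantage is capped at 1.2 * A.
# Sample 2: ratio 0.5 with negative advantage is capped at 0.8 * A.
objective = clipped_surrogate(
    new_log_probs=[math.log(1.5), math.log(0.5)],
    old_log_probs=[0.0, 0.0],
    advantages=[1.0, -1.0],
    clip_range=0.2,
)
print(round(objective, 6))  # 0.2
```

In a gradient-based implementation the negative of this value is typically minimized, combined with a value-function loss (weighted by vf_coef) and an entropy bonus (weighted by entropy_coef).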

Usage

To train on an environment of your choice (e.g., LunarLander-v3), run:

python3 main.py --env_id=LunarLander-v3 --horizon=2048 --n_timesteps=1000000 --learning_rate=0.0003 --batch_size=64 --entropy_coef=0.0

Or, from Python:

import gymnasium as gym
from ppo import PPO, ActorCriticDiscreteMLP

# Create the environment to read its observation and action dimensions.
env = gym.make("LunarLander-v3")

actor_critic = ActorCriticDiscreteMLP(
    state_dim=env.observation_space.shape[0],
    action_dim=env.action_space.n,
    h1_dim=256,
    h2_dim=256,
)

ppo = PPO(
    actor_critic=actor_critic,
    n_envs=32,
    learning_rate=0.0003,
    time_steps=1_000_000,
    horizon=2048,
    batch_size=8 * 32,
    n_epochs=3,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.1,
    clip_grad_norm=0.5,
    entropy_coef=0.0001,
    vf_coef=0.5,
    device="cpu", # or "cuda"
)
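To make the relationship between n_envs, horizon, batch_size and n_epochs in the configuration above concrete, here is a small back-of-the-envelope calculation. It rests on assumptions about how these settings interact that the README does not state: each rollout collects horizon steps from every environment, batch_size is the minibatch size, and time_steps counts environment steps summed over all environments.

```python
import math

# Values copied from the PPO(...) example above.
n_envs = 32
horizon = 2048
batch_size = 8 * 32
n_epochs = 3
time_steps = 1_000_000

# Assumption: one rollout gathers `horizon` steps from each environment.
rollout_size = n_envs * horizon

# Assumption: every epoch sweeps the whole rollout in minibatches.
minibatches_per_epoch = rollout_size // batch_size
updates_per_rollout = minibatches_per_epoch * n_epochs

# Assumption: `time_steps` is the total step budget across all environments.
n_rollouts = math.ceil(time_steps / rollout_size)

print(rollout_size)           # 65536
print(minibatches_per_epoch)  # 256
print(updates_per_rollout)    # 768
print(n_rollouts)             # 16
```

If time_steps is instead counted per environment, the number of rollouts grows by a factor of n_envs; check the source before relying on these figures.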

Experimental setup

OS: Fedora Linux 42 (Workstation Edition) x86_64
CPU: AMD Ryzen 5 2600X (12) @ 3.60 GHz
GPU: NVIDIA GeForce RTX 3060 Ti (8 GB VRAM)
RAM: 32 GB DDR4 3200 MHz

The LunarLander-v3 and BipedalWalker-v3 experiments used the following hyperparameters. Most values are shared; clip_range, clip_grad_norm and reward clipping differ between the two:

Hyperparameter/Setting	LunarLander-v3	BipedalWalker-v3
n_envs	1	1
learning_rate	0.0003	0.0003
time_steps	1e6	1e6
horizon	2048	2048
n_epochs	10	10
gamma	0.99	0.99
gae_lambda	0.95	0.95
clip_range	0.2	0.1
clip_grad_norm	0.5	0.1
entropy_coef	0.0	0.0
vf_coef	0.5	0.5
Clip reward	None	+/-10.0
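The gamma and gae_lambda settings in the table control Generalized Advantage Estimation (GAE), and the last row refers to reward clipping. The sketch below shows one common way to compute both in plain Python. It is an illustration under assumptions, not the repository's code: the names `compute_gae` and `clip_reward` are hypothetical, `dones[t]` is assumed to mean that the episode ended at step t, and reward clipping is assumed to be a simple clamp to [-10, 10].

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one rollout of a single environment.

    `values[t]` is the critic's estimate for the state at step t and
    `last_value` is the estimate for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk backwards so each step can reuse the accumulated advantage.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping stops at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value targets are the advantages plus the baseline values.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def clip_reward(reward, limit=10.0):
    """Clamp a reward to [-limit, limit] (assumed meaning of 'Clip reward')."""
    return max(-limit, min(reward, limit))


# Tiny hand-checkable example: the episode ends at the second step,
# so `last_value` is ignored.
advantages, returns = compute_gae(
    rewards=[1.0, 1.0],
    values=[0.0, 0.0],
    dones=[False, True],
    last_value=5.0,
    gamma=0.5,
    gae_lambda=1.0,
)
print(advantages)  # [1.5, 1.0]
```

With gae_lambda=1.0 the estimator reduces to the discounted return minus the value baseline, which is why the first advantage equals 1 + 0.5 * 1.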

Environment	Average Return
LunarLander-v3	276.70 ± 3.42
BipedalWalker-v3	271.77 ± 16.85

Citations

@misc{schulman2017proximalpolicyoptimizationalgorithms,
      title={Proximal Policy Optimization Algorithms}, 
      author={John Schulman and Filip Wolski and Prafulla Dhariwal and Alec Radford and Oleg Klimov},
      year={2017},
      eprint={1707.06347},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/1707.06347}, 
}
