Proximal Policy Optimization

PyTorch implementation of Proximal Policy Optimization (PPO), introduced by Schulman et al., 2017.

[GIFs: trained agents on LunarLander-v3 and BipedalWalker-v3; learning-curve plots for both tasks]

Figures: Learning curves for two Gymnasium (Box2D) control tasks:

  • LunarLander-v3 (proprioceptive states, discrete action space)

  • BipedalWalker-v3 (proprioceptive states, continuous action space)

The shaded region shows the standard deviation of the average evaluation return over 10 trials (across 10 different seeds). Curves are smoothed with a moving-average filter.
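The aggregation behind such plots (mean line plus standard-deviation band across seeds, then a moving-average filter) can be sketched in a few lines of NumPy. The function names here are illustrative, not taken from this repository:

```python
import numpy as np

def smooth(curve, window=10):
    """Moving-average filter over a 1-D learning curve."""
    kernel = np.ones(window) / window
    return np.convolve(curve, kernel, mode="valid")

def mean_and_std(curves):
    """Aggregate per-seed evaluation curves into a mean line and std band.

    curves: array of shape (n_seeds, n_evals).
    """
    curves = np.asarray(curves, dtype=float)
    return curves.mean(axis=0), curves.std(axis=0)
```

The mean is what gets plotted as the solid curve; the std array gives the shaded band around it.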

Note: This repository implements PPO for discrete and continuous action spaces. It also supports parallelization.

Algorithm

Quick Facts

  • PPO is a model-free algorithm
  • PPO is an on-policy algorithm
  • PPO can be used in discrete and continuous action spaces
[Pseudocode: the PPO algorithm, taken from Schulman et al., 2017, and the PPO-Clip algorithm, taken from OpenAI Spinning Up]
[PPO diagram, taken from Tan et al., 2023]
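At the heart of PPO-Clip is the clipped surrogate objective: the probability ratio between the new and old policies is clipped to [1 - clip_range, 1 + clip_range], and the pessimistic minimum of the clipped and unclipped terms is maximized. A minimal stand-alone NumPy sketch (illustrative, not code from this repository):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_range=0.2):
    """Negative PPO-Clip surrogate objective (a loss, to be minimized).

    logp_new, logp_old: per-action log-probabilities under the current
    and rollout policies; advantages: per-step advantage estimates.
    """
    ratio = np.exp(logp_new - logp_old)              # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Pessimistic bound: take the minimum, then negate for gradient descent.
    return -np.mean(np.minimum(unclipped, clipped))
```

When the ratio drifts outside the clip interval in the direction that would increase the objective, the gradient through that sample vanishes, which is what keeps each policy update small.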

Usage

To train on an environment of your choice (e.g. LunarLander-v3), just run:

python3 main.py --env_id=LunarLander-v3 --horizon=2048 --n_timesteps=1000000 --learning_rate=0.0003 --batch_size=64 --entropy_coef=0.0

Or:

import gymnasium as gym
from ppo import PPO, ActorCriticDiscreteMLP

env = gym.make("LunarLander-v3")

actor_critic = ActorCriticDiscreteMLP(
    state_dim=env.observation_space.shape[0],
    action_dim=env.action_space.n,
    h1_dim=256,
    h2_dim=256,
)

ppo = PPO(
    actor_critic=actor_critic,
    n_envs=32,
    learning_rate=0.0003,
    time_steps=1_000_000,
    horizon=2048,
    batch_size=8 * 32,
    n_epochs=3,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.1,
    clip_grad_norm=0.5,
    entropy_coef=0.0001,
    vf_coef=0.5,
    device="cpu", # or "cuda"
)

Experimental setup

  • OS: Fedora Linux 42 (Workstation Edition) x86_64
  • CPU: AMD Ryzen 5 2600X (12) @ 3.60 GHz
  • GPU: NVIDIA GeForce RTX 3060 Ti (8 GB VRAM)
  • RAM: 32 GB DDR4 3200 MHz

The experiments (LunarLander-v3 and BipedalWalker-v3) used the following hyperparameters:

Hyperparameter/Setting   LunarLander-v3   BipedalWalker-v3
n_envs                   1                1
learning_rate            0.0003           0.0003
time_steps               1e6              1e6
horizon                  2048             2048
n_epochs                 10               10
gamma                    0.99             0.99
GAE lambda               0.95             0.95
clip_range               0.2              0.1
clip_grad_norm           0.5              0.1
entropy_coef             0.0              0.0
vf_coef                  0.5              0.5
Clip reward              None             ±10.0
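The gamma and GAE lambda settings above parameterize Generalized Advantage Estimation, which PPO uses to compute advantages from a rollout. A minimal stand-alone sketch of that backward recursion (illustrative, not the repository's implementation):

```python
import numpy as np

def gae_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """GAE over a rollout of length T.

    rewards, values, dones: per-step arrays; last_value is the value
    estimate of the state following the final step (bootstrap target).
    Returns (advantages, value_targets).
    """
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]                 # zero out bootstrap at episode ends
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        adv[t] = gae
        next_value = values[t]
    return adv, adv + values                          # targets for the value head
```

With lam=0 this degenerates to one-step TD errors; with lam=1 it recovers Monte Carlo returns minus the baseline, so lam trades bias against variance.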
Environment        Average Return
LunarLander-v3     276.70 ± 3.42
BipedalWalker-v3   271.77 ± 16.85

Citations

@misc{schulman2017proximalpolicyoptimizationalgorithms,
      title={Proximal Policy Optimization Algorithms}, 
      author={John Schulman and Filip Wolski and Prafulla Dhariwal and Alec Radford and Oleg Klimov},
      year={2017},
      eprint={1707.06347},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/1707.06347}, 
}

About

PyTorch reimplementation of the Proximal Policy Optimization algorithm, with support for parallelization, discrete and continuous action spaces, and both MLP and CNN policies.
