ppo.cpp

Proximal Policy Optimization (PPO) written in C++.

Collect simulated experience with multi-threaded batch environments implemented in C++ and train policy + value function with PyTorch C++ (LibTorch).

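As a rough illustration of the collection pattern described above, here is a minimal sketch assuming a hypothetical collect_rollout worker function; the repository's actual threading and buffer code will differ.

#include <thread>
#include <vector>

// Hypothetical worker: steps its own batch of MuJoCo environments and writes
// transitions into a preallocated slice of a shared rollout buffer.
void collect_rollout(int worker_id, int num_steps) {
  for (int t = 0; t < num_steps; ++t) {
    // 1. sample an action from the current policy
    // 2. step the simulation (MuJoCo, C)
    // 3. store (observation, action, reward, done) for the PPO update
  }
}

int main() {
  const int num_threads = 20;   // matches the --num_threads default below
  const int num_steps = 128;    // hypothetical value for illustration
  std::vector<std::thread> workers;
  for (int i = 0; i < num_threads; ++i) {
    workers.emplace_back(collect_rollout, i, num_steps);
  }
  for (std::thread& w : workers) {
    w.join();
  }
  // ...the main thread then runs PPO updates on the gathered batch with LibTorch...
  return 0;
}
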
Train humanoid

Train a humanoid locomotion behavior in ~10 minutes (reward $\approx$ 5000) on Ubuntu 22.04.4 LTS, using 20 threads on an Intel Core i9-14900K CPU to collect simulated experience from MuJoCo (C) and an Nvidia RTX 4090 to train a neural network policy and value function with PyTorch (LibTorch).


From the build/ directory, run:

./run --env humanoid --train --visualize --checkpoint humanoid --device {cpu|cuda|mps}

The saved policy can be visualized:

./run --env humanoid --load humanoid_{x}_{y} --visualize --device {cpu|cuda|mps}

Visualize the pretrained policy (requires an Apple ARM CPU):

./run --env humanoid --load pretrained/humanoid_apple_arm --visualize --device cpu

Installation

ppo.cpp should work with Ubuntu and macOS.

Dependencies: abseil, libtorch, mujoco

Prerequisites

Operating system specific dependencies:

macOS

Install Xcode.

Install ninja:

brew install ninja

Ubuntu

sudo apt-get update && sudo apt-get install cmake libgl1-mesa-dev libxinerama-dev libxcursor-dev libxrandr-dev libxi-dev ninja-build clang-12

Clone ppo.cpp

git clone https://github.com/thowell/ppo.cpp

LibTorch

LibTorch (i.e., PyTorch C++) should be installed automatically by CMake. Manual installation is also possible (first perform steps 1 and 2 of Build and Run below to create a build/ directory):

macOS

Download LibTorch 2.3.1 and extract it to ppo.cpp/build.

If you encounter a malicious-software warning for the Torch libraries: System Settings -> Security & Privacy -> Allow.

You might also need to:

brew install libomp
install_name_tool -add_rpath /opt/homebrew/opt/libomp/lib PATH_TO/ppo.cpp/libtorch/lib/libtorch_cpu.dylib

Ubuntu

Download LibTorch 2.3.1 (CUDA 12.1) and extract it to ppo.cpp/build.

Build and Run

  1. Change directory:
cd ppo.cpp
  2. Create and change to the build directory:
mkdir build
cd build
  3. Configure:

macOS

cmake .. -DCMAKE_BUILD_TYPE:STRING=Release -G Ninja

Ubuntu

cmake .. -DCMAKE_BUILD_TYPE:STRING=Release -G Ninja -DCMAKE_C_COMPILER:STRING=clang-12 -DCMAKE_CXX_COMPILER:STRING=clang++-12
  4. Build:
cmake --build . --config=Release

Build and Run ppo.cpp using VSCode

VSCode and two of its extensions (CMake Tools and C/C++) can simplify the build process.

  1. Open the cloned directory ppo.cpp.
  2. Configure the project with CMake (a pop-up should appear in VSCode).
  3. Set compiler to clang-12.
  4. Build and run the ppo target in "release" mode (VSCode defaults to "debug").

Command-line interface

Setup:

  • --env: humanoid
  • --train: train policy and value function with PPO
  • --checkpoint: filename in the checkpoint/ directory used to save the policy
  • --load: filename in the checkpoint/ directory to load a policy from
  • --visualize: visualize policy

Hardware settings (see the sketch after this list):

  • --num_threads: number of threads/workers for collecting simulation experience [default: 20]
  • --device: learning device [cpu|cuda|mps, default: cpu]
  • --device_sim: simulation device [cpu|cuda|mps, default: cpu]
  • --device_type: data type for device [default: float]
  • --device_sim_type: data type for device_sim [default: double]

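A minimal sketch of how these flags could map onto LibTorch device and dtype objects; the tensor shape and variable names are illustrative assumptions, not the repository's parsing code.

#include <torch/torch.h>

int main() {
  torch::Device learn_device("cpu");            // --device: cpu | cuda | mps
  torch::Device sim_device("cpu");              // --device_sim
  torch::Dtype learn_dtype = torch::kFloat32;   // --device_type float
  torch::Dtype sim_dtype = torch::kFloat64;     // --device_sim_type double

  // Data gathered in the simulation dtype on the simulation device can then be
  // cast and moved to the learning device for the PPO update.
  torch::Tensor obs = torch::zeros({20, 64}, torch::dtype(sim_dtype).device(sim_device));
  torch::Tensor obs_learn = obs.to(learn_device, learn_dtype);
  return 0;
}
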
PPO settings (see the loss sketch after this list):

  • --num_envs: number of parallel learning environments for collecting simulation experience
  • --num_steps: number of environment steps for each environment used for learning
  • --minibatch_size: size of minibatch
  • --learning_rate: initial learning rate for policy and value function optimizer
  • --max_env_steps: total number of environment steps to collect
  • --anneal_lr: flag to anneal learning rate
  • --kl_threshold: maximum KL divergence between old and new policies
  • --gamma: discount factor for rewards
  • --gae_lambda: factor for Generalized Advantage Estimation
  • --update_epochs: number of epochs the complete batch of experience is used to improve the policy and value function
  • --norm_adv: flag for normalizing advantages
  • --clip_coef: value for PPO clip parameter
  • --clip_vloss: flag for clipping value function loss
  • --ent_coef: weight for entropy loss
  • --vf_coef: weight for value function loss
  • --max_grad_norm: maximum value for global L2-norm of parameter gradients
  • --optimizer_eps: epsilon value for Adam optimizer
  • --normalize_observation: normalize observations with running statistics
  • --normalize_reward: normalize rewards with running statistics

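As a rough guide to how several of these settings combine, here is a minimal sketch of a cleanrl-style clipped PPO loss written with LibTorch; the tensor names are hypothetical placeholders, and this is not the repository's actual update code (advantage computation with --gamma and --gae_lambda and value-loss clipping via --clip_vloss are omitted).

#include <torch/torch.h>

// Inputs: new/old log-probabilities of the sampled actions, GAE advantages and
// returns, predicted values, and the entropy of the current policy.
torch::Tensor ppo_loss(const torch::Tensor& logprob_new,
                       const torch::Tensor& logprob_old,
                       const torch::Tensor& advantages,
                       const torch::Tensor& returns,
                       const torch::Tensor& values,
                       const torch::Tensor& entropy,
                       double clip_coef, double ent_coef, double vf_coef,
                       bool norm_adv) {
  // Optionally normalize advantages within the minibatch (--norm_adv).
  torch::Tensor adv = advantages;
  if (norm_adv) {
    adv = (adv - adv.mean()) / (adv.std() + 1e-8);
  }

  // Probability ratio between the new and old policies.
  torch::Tensor ratio = (logprob_new - logprob_old).exp();

  // Clipped surrogate objective (--clip_coef).
  torch::Tensor pg_loss1 = -adv * ratio;
  torch::Tensor pg_loss2 = -adv * torch::clamp(ratio, 1.0 - clip_coef, 1.0 + clip_coef);
  torch::Tensor pg_loss = torch::max(pg_loss1, pg_loss2).mean();

  // Unclipped value-function loss (--vf_coef).
  torch::Tensor v_loss = 0.5 * (values - returns).pow(2).mean();

  // Entropy bonus (--ent_coef) encourages exploration.
  torch::Tensor entropy_loss = entropy.mean();

  return pg_loss - ent_coef * entropy_loss + vf_coef * v_loss;
}

In the full algorithm, a loss like this is minimized with Adam (--learning_rate, --optimizer_eps), gradients are clipped to --max_grad_norm, and --kl_threshold bounds the KL divergence between the old and new policies.
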
Evaluation settings:

  • --num_eval_envs: number of environments for evaluating policy performance
  • --max_eval_steps: number of simulation steps (per environment) for evaluating policy performance
  • --num_iter_per_eval: number of iterations per policy evaluation

Notes

This repository was developed to:

  • understand the Proximal Policy Optimization algorithm
  • understand the details of Gym environments, including autoresets, normalization, batch environments, etc.
  • understand the normal distribution neural network policy formulation for continuous control environments
  • gain experience with the PyTorch C++ API
  • experiment with code-generation tools that are useful for improving development time, including ChatGPT and Claude
  • gain a better understanding of where performance bottlenecks exist for PPO
  • gain a better understanding of how MuJoCo models can be modified to improve simulation throughput (steps/time)
  • gain more experience using CMake

The MuJoCo models use resources from IsaacGym environments, MuJoCo Menagerie, the MJX Tutorial, and dm_control.

The PPO implementation is based on cleanrl's ppo_continuous_action.py.
