Proximal Policy Optimization (PPO) written in C++.
Collect simulated experience with multi-threaded batch environments implemented in C++ and train policy + value function with PyTorch C++ (LibTorch).
Train humanoid locomotion behavior in ~10 minutes (reward ...).
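For orientation, below is a minimal sketch of the PPO clipped surrogate objective written against the LibTorch C++ API. The function name, tensor names, and default clip coefficient are illustrative assumptions, not code taken from this repository.

```cpp
#include <torch/torch.h>

// Sketch: PPO clipped surrogate objective (illustrative, not ppo.cpp's code).
// log_probs, old_log_probs, advantages: 1-D tensors over a minibatch.
torch::Tensor ppo_policy_loss(const torch::Tensor& log_probs,
                              const torch::Tensor& old_log_probs,
                              const torch::Tensor& advantages,
                              double clip_coef = 0.2) {
  // Probability ratio between the new and old policies.
  torch::Tensor ratio = torch::exp(log_probs - old_log_probs);
  // Unclipped and clipped surrogate objectives.
  torch::Tensor surr1 = ratio * advantages;
  torch::Tensor surr2 =
      torch::clamp(ratio, 1.0 - clip_coef, 1.0 + clip_coef) * advantages;
  // PPO maximizes the minimum of the two; negate the mean to form a loss.
  return -torch::min(surr1, surr2).mean();
}
```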
From the build/ directory, run:
./run --env humanoid --train --visualize --checkpoint humanoid --device {cpu|cuda|mps}
The saved policy can be visualized:
./run --env humanoid --load humanoid_{x}_{y} --visualize --device {cpu|cuda|mps}
Visualize the pretrained policy (requires an Apple ARM CPU):
./run --env humanoid --load pretrained/humanoid_apple_arm --visualize --device cpu
ppo.cpp should work on Ubuntu and macOS.
Dependencies: abseil, libtorch, mujoco
Operating system specific dependencies:
macOS:
Install Xcode.
Install ninja:
brew install ninja
Ubuntu:
sudo apt-get update && sudo apt-get install cmake libgl1-mesa-dev libxinerama-dev libxcursor-dev libxrandr-dev libxi-dev ninja-build clang-12
git clone https://github.com/thowell/ppo.cpp
LibTorch (i.e., PyTorch C++) should be installed automatically by CMake. Manual installation is also possible (first perform steps 1 and 2 below to create a build/ directory):
Install LibTorch 2.3.1: download and extract to ppo.cpp/build.
If you encounter malicious software warnings for Torch on macOS: System Settings -> Security & Privacy -> Allow.
You might also need to:
brew install libomp
install_name_tool -add_rpath /opt/homebrew/opt/libomp/lib PATH_TO/ppo.cpp/libtorch/lib/libtorch_cpu.dylib
Install LibTorch 2.3.1 with CUDA 12.1: download and extract to ppo.cpp/build.
- Change directory:
cd ppo.cpp
- Create and change to build directory:
mkdir build
cd build
- Configure:
macOS:
cmake .. -DCMAKE_BUILD_TYPE:STRING=Release -G Ninja
Ubuntu:
cmake .. -DCMAKE_BUILD_TYPE:STRING=Release -G Ninja -DCMAKE_C_COMPILER:STRING=clang-12 -DCMAKE_CXX_COMPILER:STRING=clang++-12
- Build:
cmake --build . --config=Release
VSCode and two of its extensions (CMake Tools and C/C++) can simplify the build process.
- Open the cloned ppo.cpp directory.
- Configure the project with CMake (a pop-up should appear in VSCode).
- Set the compiler to clang-12.
- Build and run the ppo target in "release" mode (VSCode defaults to "debug").
Setup:
- --env: humanoid
- --train: train policy and value function with PPO
- --checkpoint: filename in checkpoint/ to save policy
- --load: provide string in checkpoint/ directory to load policy from checkpoint
- --visualize: visualize policy
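Since abseil is a dependency, the command-line options are plausibly defined with Abseil Flags. The sketch below mirrors the setup flags above, but the exact declarations in ppo.cpp may differ.

```cpp
#include <string>

#include "absl/flags/flag.h"
#include "absl/flags/parse.h"

// Illustrative Abseil flag definitions mirroring the options above
// (not necessarily how ppo.cpp declares them).
ABSL_FLAG(std::string, env, "humanoid", "environment name");
ABSL_FLAG(bool, train, false, "train policy and value function with PPO");
ABSL_FLAG(std::string, checkpoint, "", "filename in checkpoint/ to save policy");
ABSL_FLAG(std::string, load, "", "checkpoint in checkpoint/ to load policy from");
ABSL_FLAG(bool, visualize, false, "visualize policy");

int main(int argc, char** argv) {
  absl::ParseCommandLine(argc, argv);
  const std::string env_name = absl::GetFlag(FLAGS_env);
  // ... dispatch to training / visualization based on the parsed flags ...
  return 0;
}
```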
Hardware settings:
- --num_threads: number of threads/workers for collecting simulation experience [default: 20]
- --device: learning device [default: cpu; options: cpu, cuda, mps]
- --device_sim: simulation device [default: cpu; options: cpu, cuda, mps]
- --device_type: data type for device [default: float]
- --device_sim_type: data type for device_sim [default: double]
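As an illustration of how the --device and --device_type strings might map onto LibTorch types, here is a hedged sketch; the helper names are hypothetical and not taken from ppo.cpp.

```cpp
#include <stdexcept>
#include <string>

#include <torch/torch.h>

// Hypothetical helpers mapping flag strings to LibTorch types
// (illustrative only, not ppo.cpp's actual flag handling).
torch::Device parse_device(const std::string& name) {
  if (name == "cpu" || name == "cuda" || name == "mps") {
    return torch::Device(name);  // torch::Device accepts a device string
  }
  throw std::invalid_argument("unknown device: " + name);
}

torch::Dtype parse_dtype(const std::string& name) {
  if (name == "float") return torch::kFloat32;
  if (name == "double") return torch::kFloat64;
  throw std::invalid_argument("unknown dtype: " + name);
}
```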
PPO settings:
- --num_envs: number of parallel learning environments for collecting simulation experience
- --num_steps: number of environment steps for each environment used for learning
- --minibatch_size: size of minibatch
- --learning_rate: initial learning rate for policy and value function optimizer
- --max_env_steps: total number of environment steps to collect
- --anneal_lr: flag to anneal learning rate
- --kl_threshold: maximum KL divergence between old and new policies
- --gamma: discount factor for rewards
- --gae_lambda: factor for Generalized Advantage Estimation
- --update_epochs: number of iterations the complete batch of experience is used to improve the policy and value function
- --norm_adv: flag for normalizing advantages
- --clip_coef: value for PPO clip parameter
- --clip_vloss: flag for clipping value function loss
- --ent_coef: weight for entropy loss
- --vf_coef: weight for value function loss
- --max_grad_norm: maximum value for global L2-norm of parameter gradients
- --optimizer_eps: epsilon value for Adam optimizer
- --normalize_observation: normalize observations with running statistics
- --normalize_reward: normalize rewards with running statistics
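For context on --gamma and --gae_lambda, the following is a sketch of Generalized Advantage Estimation over a single environment's trajectory, written against the LibTorch API; the function signature and tensor layout are illustrative assumptions rather than the repository's implementation.

```cpp
#include <torch/torch.h>

// Sketch: Generalized Advantage Estimation for one environment.
// rewards, values, dones have shape [num_steps]; next_value is V(s_T).
torch::Tensor compute_gae(const torch::Tensor& rewards,
                          const torch::Tensor& values,
                          const torch::Tensor& dones,
                          torch::Tensor next_value,
                          double gamma, double gae_lambda) {
  int64_t num_steps = rewards.size(0);
  torch::Tensor advantages = torch::zeros_like(rewards);
  torch::Tensor last_gae = torch::zeros({}, rewards.options());
  for (int64_t t = num_steps - 1; t >= 0; --t) {
    // Mask out bootstrapping across episode boundaries.
    torch::Tensor not_done = 1.0 - dones[t];
    // One-step TD error.
    torch::Tensor delta =
        rewards[t] + gamma * next_value * not_done - values[t];
    // Exponentially weighted sum of TD errors (lambda-return advantage).
    last_gae = delta + gamma * gae_lambda * not_done * last_gae;
    advantages[t] = last_gae;
    next_value = values[t];
  }
  return advantages;
}
```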
Evaluation settings:
- --num_eval_envs: number of environments for evaluating policy performance
- --max_eval_steps: number of simulation steps (per environment) for evaluating policy performance
- --num_iter_per_eval: number of iterations per policy evaluation
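To illustrate what these evaluation settings control, here is a rough sketch of an evaluation rollout that accumulates reward with a deterministic policy; Env and Policy are placeholder interfaces, not classes from ppo.cpp.

```cpp
#include <torch/torch.h>

// Sketch: evaluate a policy by summing reward over a fixed number of steps.
// Env and Policy are placeholder interfaces, not types from ppo.cpp.
template <typename Env, typename Policy>
double evaluate(Env& env, Policy& policy, int max_eval_steps) {
  torch::NoGradGuard no_grad;  // no gradients needed during evaluation
  double total_reward = 0.0;
  torch::Tensor observation = env.reset();
  for (int step = 0; step < max_eval_steps; ++step) {
    // Act with the policy mean (deterministic) rather than sampling.
    torch::Tensor action = policy.mean_action(observation);
    auto [next_observation, reward, done] = env.step(action);
    total_reward += reward;
    // Reset at episode boundaries so evaluation can span multiple episodes.
    observation = done ? env.reset() : next_observation;
  }
  return total_reward;
}
```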
This repository was developed to:
- understand the Proximal Policy Optimization algorithm
- understand the details of Gym environments, including autoresets, normalization, batch environments, etc.
- understand the normal distribution neural network policy formulation for continuous control environments
- gain experience with PyTorch C++ API
- experiment with code-generation tools that are useful for improving development time, including ChatGPT and Claude
- gain a better understanding of where performance bottlenecks exist for PPO
- gain a better understanding of how MuJoCo models can be modified to improve steps/time
- gain more experience using CMake
MuJoCo models use resources from IsaacGym environments, MuJoCo Menagerie, the MJX tutorial, and dm_control.
The PPO implementation is based on cleanrl's ppo_continuous_action.py.