
The PPO balancer is a feedforward neural network policy trained by reinforcement learning with a sim-to-real pipeline. Like the MPC balancer and PID balancer, it balances Upkie with straight legs. Training uses the `UpkieGroundVelocity` gym environment and the PPO implementation from Stable Baselines3.
An overview of the training pipeline is given in the video: Sim-to-real RL pipeline for Upkie wheeled bipeds.
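
For a rough idea of what the training entry point wraps, here is a minimal sketch of training a PPO policy with Stable Baselines3 on the Upkie environment. The environment ID, the `upkie.envs.register()` call and the hyperparameters below are assumptions to check against your installed version, and a simulation spine must be running for the environment to connect to:

```python
import gymnasium as gym
from stable_baselines3 import PPO

import upkie.envs

upkie.envs.register()  # make Upkie environments available to gym.make

# Environment ID is an assumption: check upkie.envs for the version you have.
# The environment connects to a running spine, e.g. one started by
# ./start_simulation.sh.
env = gym.make("UpkieGroundVelocity-v3")

policy = PPO("MlpPolicy", env, verbose=1)  # feedforward MLP policy
policy.learn(total_timesteps=100_000)      # alternate rollouts and PPO updates
policy.save("final")                       # writes final.zip
```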
Install pixi.
The PPO balancer uses pixi-pack to pack a standalone Python environment to run policies on your Upkie. First, create `environment.tar` on your machine and upload it:
make pack_pixi_env
make upload
Then, unpack the remote environment:
$ ssh user@your-upkie
user@your-upkie:~$ cd ppo_balancer
user@your-upkie:ppo_balancer$ make unpack_pixi_env
To run the deployed policy on your Upkie:
make run_agent
Before deploying, you can test the policy on your machine:
pixi run agent
Here we assume the spine is already up and running, for instance by running `./start_simulation.sh` on your machine, or by starting a pi3hat spine on the robot.
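
To make sure the environment can reach a spine, a quick check in Python looks like the following sketch; the environment ID is an assumption to match against your install:

```python
import gymnasium as gym

import upkie.envs

upkie.envs.register()

# Environment ID is an assumption; reset will only succeed if a spine
# (simulation or pi3hat) is already running and reachable.
with gym.make("UpkieGroundVelocity-v3") as env:
    observation, info = env.reset()
    print("Spine is up, initial observation:", observation)
```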
First, check that training progresses one rollout at a time:
pixi run show_training
Once this works, train for real with more environments and no GUI:
pixi run train <nb_envs>
Adjust the number `nb_envs` of parallel environments based on the `time/fps` series, which is reported to the command line (or to TensorBoard if you configure `UPKIE_TRAINING_PATH` as detailed below). Increase or decrease the number of environments until you find the sweet spot that maximizes FPS on your machine.
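
As a sketch of what the number of parallel environments corresponds to, Stable Baselines3 can vectorize the environment over subprocesses, and its logger's `time/fps` entry reports overall throughput. The environment ID and the use of `make_vec_env` below are assumptions for illustration; the actual training script may set things up differently:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

import upkie.envs

upkie.envs.register()

nb_envs = 4  # tune until time/fps stops improving on your machine

# Each subprocess runs its own environment instance (each needing a spine
# to connect to, depending on how training is set up).
vec_env = make_vec_env(
    "UpkieGroundVelocity-v3",
    n_envs=nb_envs,
    vec_env_cls=SubprocVecEnv,
)

policy = PPO("MlpPolicy", vec_env, verbose=1)  # verbose=1 prints time/fps
policy.learn(total_timesteps=50_000)
```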
The repository comes with a training directory that will store logs each time a new policy is learned. Set the `UPKIE_TRAINING_PATH` environment variable to enable this:
export UPKIE_TRAINING_PATH="${HOME}/src/ppo_balancer/training"
Trainings are grouped automatically by day. You can start TensorBoard for today's trainings by:
pixi run tensorboard
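
As an illustration of how this variable can be consumed, a training script might derive a per-day log directory from `UPKIE_TRAINING_PATH` and pass it to Stable Baselines3. The layout below is a hypothetical sketch, not necessarily the one used by the ppo_balancer scripts:

```python
import datetime
import os

from stable_baselines3 import PPO

import upkie.envs

upkie.envs.register()

# Hypothetical layout: group runs by day under UPKIE_TRAINING_PATH, matching
# paths like training/2023-11-15/final.zip.
training_path = os.environ.get("UPKIE_TRAINING_PATH", "/tmp/ppo_balancer")
today = datetime.date.today().strftime("%Y-%m-%d")
log_dir = os.path.join(training_path, today)

policy = PPO("MlpPolicy", "UpkieGroundVelocity-v3", tensorboard_log=log_dir)
policy.learn(total_timesteps=10_000)
policy.save(os.path.join(log_dir, "final"))  # writes <log_dir>/final.zip
```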
To run a policy saved to a custom path, use for instance:
python ppo_balancer/run.py --policy ppo_balancer/training/2023-11-15/final.zip
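
Under the hood, loading a saved policy with Stable Baselines3 and running it deterministically looks roughly as follows; the environment ID and loop details are assumptions, and run.py itself may handle more (such as command-line parsing and spine configuration):

```python
import gymnasium as gym
from stable_baselines3 import PPO

import upkie.envs

upkie.envs.register()

# Path and environment ID are placeholders; a spine must already be running.
policy = PPO.load("ppo_balancer/training/2023-11-15/final.zip")

with gym.make("UpkieGroundVelocity-v3") as env:
    observation, _ = env.reset()
    while True:
        action, _ = policy.predict(observation, deterministic=True)
        observation, reward, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            observation, _ = env.reset()
```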