A flexible reinforcement and imitation learning framework supporting multiple algorithms (SAC, IQL, BC), reward models (Robometer, RoboReward), and distributed training. Detailed guides for DSRL (diffusion-steering RL) with Pi0/0.5 on LIBERO and real-world tasks will be added soon.
Warning
This repository is under active development, so some modules and features may change over time. You may encounter issues when using features that are not yet documented in this README. Please feel free to open an issue — we will do our best to help.
- Setup
- Training
- Real-World Online RL with DSRL + Remote Reward Labeling (coming soon)
- Project Structure

Prerequisites:
- Git
- Python 3.10+
- NVIDIA drivers (for GPU support)
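
Before installing, it can help to confirm that the tools listed above are available. The following is a minimal sketch, not part of the repository: `check_cmd` and `python_ok` are hypothetical helper names, and `nvidia-smi` is assumed to be the command-line tool installed alongside the NVIDIA drivers.

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}

# Hypothetical helper: succeed only if the given interpreter is Python 3.10 or newer.
python_ok() {
  "$1" -c 'import sys; sys.exit(0 if sys.version_info >= (3, 10) else 1)' 2>/dev/null
}

check_cmd git
check_cmd nvidia-smi   # only relevant for GPU support

if python_ok python3; then
  echo "python3 is 3.10 or newer"
else
  echo "python3 is missing or older than 3.10"
fi
```

The script only reports what it finds; it does not install or change anything.
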
- Install uv (if not already installed):
# On macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
- Clone and set up submodules (required for DSRL/LIBERO):
git submodule init
git submodule update --recursive
- Create and sync the virtual environment:
# Install dependencies from pyproject.toml
uv sync
# Optional: Install with development dependencies
uv sync --extra dev
- Activate the environment:
source .venv/bin/activate

Note that this repo assumes robometer is installed as a git submodule and located at ./robometer. If you have made changes to robometer or have your own robometer fork, replace the submodule.
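
Because the code expects robometer at ./robometer, a quick check that the submodule directory was actually populated can catch a missing checkout early. This is a minimal sketch, assuming it is run from the repository root; `submodule_ready` is a hypothetical helper, not something the repository provides.

```shell
# Hypothetical helper: succeed if the directory exists and contains at least one entry.
submodule_ready() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

if submodule_ready ./robometer; then
  echo "robometer submodule is present"
else
  echo "robometer is missing or empty; try: git submodule update --init --recursive"
fi
```

An empty ./robometer directory usually means the submodule was registered but never fetched, which the `--init --recursive` form of the update command addresses in one step.
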
Train with ground-truth rewards (default, no reward model; online only):
uv run python scripts/train.py \
--config-path=../robometer_policy_learning/configs \
--config-name=config \
algorithm@online_algorithm=sac \
alg.online_alg_name=sac \
env.use_gt_rewards=true

Offline pretraining with online fine-tuning:
uv run python scripts/train.py \
--config-path=../robometer_policy_learning/configs \
--config-name=config \
algorithm@offline_algorithm=iql \
algorithm@online_algorithm=sac \
alg.offline_alg_name=iql \
alg.online_alg_name=sac \
env.use_gt_rewards=true

Train with the Robometer reward model (online only):
uv run python scripts/train.py \
--config-path=../robometer_policy_learning/configs \
--config-name=config \
reward_model=robometer \
algorithm@online_algorithm=sac \
alg.online_alg_name=sac \
env.use_gt_rewards=false \
reward_model.model_path=robometer/Robometer-4B

Train with the Robometer reward model (offline-to-online):
uv run python scripts/train.py \
--config-path=../robometer_policy_learning/configs \
--config-name=config \
reward_model=robometer \
algorithm@online_algorithm=sac \
alg.online_alg_name=sac \
algorithm@offline_algorithm=iql \
alg.offline_alg_name=iql \
env.use_gt_rewards=false \
reward_model.model_path=robometer/Robometer-4B

Train a SAC policy in LIBERO using ground-truth rewards:
uv run python scripts/train.py \
--config-path=../robometer_policy_learning/configs \
--config-name=libero_online_rl \
env.env_name=libero_90 \
env.task_id=28 \
env.use_gt_rewards=true \
algorithm@online_algorithm=sac \
alg.online_alg_name=sac \
training.num_rollouts=100000 \
training.seed=100 \
eval.eval_freq=5000 \
eval.eval_num_episodes=20 \
online_algorithm.num_critic_updates_per_actor_update=1 \
online_algorithm.learning_starts=5000 \
online_algorithm.critic_optimizer_lr=1e-5 \
online_algorithm.actor_optimizer_lr=1e-5 \
logging.wandb_name=libero_online_rl_gt_rewards \
logging.wandb_entity=YOUR_WANDB_ENTITY

Train a SAC policy in LIBERO using Robometer rewards:
uv run python scripts/train.py \
--config-path=../robometer_policy_learning/configs \
--config-name=libero_online_rl \
reward_model=robometer \
reward_model.model_path=robometer/Robometer-4B \
reward_model.add_estimated_reward=true \
reward_model.use_success_detection=false \
env.env_name=libero_90 \
env.task_id=28 \
env.use_gt_rewards=false \
algorithm@online_algorithm=sac \
alg.online_alg_name=sac \
training.num_rollouts=100000 \
training.seed=100 \
eval.eval_freq=5000 \
eval.eval_num_episodes=20 \
online_algorithm.num_critic_updates_per_actor_update=1 \
online_algorithm.learning_starts=5000 \
online_algorithm.critic_optimizer_lr=1e-5 \
online_algorithm.actor_optimizer_lr=1e-5 \
logging.wandb_name=libero_online_rl_robometer_rewards \
logging.wandb_entity=YOUR_WANDB_ENTITY

You should see evaluation curves similar to the example below:

Real-World Online RL with DSRL + Remote Reward Labeling
Coming soon. The files are already in this repo but still need to be cleaned up, which should be done by mid-June.
Also coming soon: a DSRL+Pi0 sanity-check command to run before real-world online RL.
- gRPC service definitions: robometer_policy_learning/distributed/protos/
- Configuration files: robometer_policy_learning/configs/
- Algorithm-specific configs: robometer_policy_learning/configs/algorithm/
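
In Hydra, each option of a config group is normally a YAML file in that group's directory, so an override such as algorithm@online_algorithm=sac would be expected to resolve to a file like sac.yaml under the algorithm config directory. The sketch below lists the available option names under that assumption; `list_config_options` is a hypothetical helper, and the exact file layout has not been checked against the repository.

```shell
# Hypothetical helper: print the option names (YAML basenames) found in a
# Hydra config group directory, one per line.
list_config_options() {
  for f in "$1"/*.yaml; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
    basename "$f" .yaml
  done
}

# Run from the repository root.
list_config_options robometer_policy_learning/configs/algorithm
```

Each printed name is a candidate value for the algorithm@online_algorithm= and algorithm@offline_algorithm= overrides used in the training commands.
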
.
├── docs/ # Documentation assets and figures
├── robometer_policy_learning/ # Main policy learning package
│ ├── algorithms/ # BC, IQL, SAC, and DSRL algorithm code
│ ├── buffers/ # Replay and offline data buffers
│ ├── configs/ # Hydra configs for algorithms, envs, and reward models
│ ├── distributed/ # Distributed training and reward relabeling services
│ ├── envs/ # Environment wrappers and task interfaces
│ ├── loggers/ # Logging integrations
│ ├── modules/ # Policy, critic, and value network modules
│ ├── robots/ # Real-robot interfaces
│ ├── rollouts/ # Rollout collection utilities
│ ├── runners/ # Training and evaluation runners
│ └── utils/ # Shared helpers
├── scripts/ # Training, evaluation, and server entrypoints
├── tests/ # Test suite
