Robometer Policy Learning

A flexible reinforcement \ imitation learning framework supporting multiple algorithms (SAC, IQL, BC), reward models (Robometer, RoboReward), and distributed training. We will soon include detailed guides for DSRL (Diffusion-steering RL) with Pi0/0.5 on LIBERO and Real World tasks.

Warning

This repository is under active development, so some modules and features may change over time. You may encounter issues when using features that are not yet documented in this README. Please feel free to open an issue — we will do our best to help.

Setup

Prerequisites

Git
Python 3.10+
NVIDIA Drivers (for GPU support)

Installation

Install uv (if not already installed):

 # On macOS and Linux
 curl -LsSf https://astral.sh/uv/install.sh | sh

Clone and setup submodules (required for DSRL/LIBERO):

 git submodule init
 git submodule update --recursive

Create and sync the virtual environment:

 # Install dependencies from pyproject.toml
 uv sync

 # Optional: Install with development dependencies
 uv sync --extra dev

Activate the environment:

 source .venv/bin/activate

Note that this repo assumes robometer is installed as a git submodule and located at ./robometer. If you made any changes to robometer/have your own robometer fork, replace the submodule

Training

Basic Training (Ground Truth Rewards)

Train with ground truth rewards (default, no reward model - Online only):

uv run python scripts/train.py \
  --config-path=../robometer_policy_learning/configs \
  --config-name=config \
  algorithm@online_algorithm=sac \
  alg.online_alg_name=sac \
  env.use_gt_rewards=true

Offline pretraining with online fine-tuning:

uv run python scripts/train.py \
  --config-path=../robometer_policy_learning/configs \
  --config-name=config \
  algorithm@offline_algorithm=iql \
  algorithm@online_algorithm=sac \
  alg.offline_alg_name=iql \
  alg.online_alg_name=sac \
  env.use_gt_rewards=true

Training with Robometer Reward Model

Train with Robometer reward model (Online only):

uv run python scripts/train.py \
  --config-path=../robometer_policy_learning/configs \
  --config-name=config \
  reward_model=robometer \
  algorithm@online_algorithm=sac \
  alg.online_alg_name=sac \
  env.use_gt_rewards=false \
  reward_model.model_path=robometer/Robometer-4B

Train with Robometer reward model (Offline-to-online):

uv run python scripts/train.py   \
  --config-path=../robometer_policy_learning/configs   \
  --config-name=config   \
  reward_model=robometer   \
  algorithm@online_algorithm=sac   \
  alg.online_alg_name=sac   \
  algorithm@offline_algorithm=iql   \
  alg.offline_alg_name=iql   \
  env.use_gt_rewards=false   \
  reward_model.model_path=robometer/Robometer-4B

Example: Online RL in LIBERO

Train a SAC policy in LIBERO using ground-truth rewards:

uv run python scripts/train.py   \
  --config-path=../robometer_policy_learning/configs  \
  --config-name=libero_online_rl  \
  env.env_name=libero_90  \
  env.task_id=28  \
  env.use_gt_rewards=true  \
  algorithm@online_algorithm=sac   \
  alg.online_alg_name=sac   \
  training.num_rollouts=100000  \
  training.seed=100  \
  eval.eval_freq=5000  \
  eval.eval_num_episodes=20  \
  online_algorithm.num_critic_updates_per_actor_update=1  \
  online_algorithm.learning_starts=5000  \
  online_algorithm.critic_optimizer_lr=1e-5  \
  online_algorithm.actor_optimizer_lr=1e-5  \
  logging.wandb_name=libero_online_rl_gt_rewards  \
  logging.wandb_entity=YOUR_WANDB_ENTITY

Train a SAC policy in LIBERO using Robometer rewards:

uv run python scripts/train.py   \
  --config-path=../robometer_policy_learning/configs  \
  --config-name=libero_online_rl  \
  reward_model=robometer  \
  reward_model.model_path=robometer/Robometer-4B  \
  reward_model.add_estimated_reward=true  \
  reward_model.use_success_detection=false  \
  env.env_name=libero_90  \
  env.task_id=28  \
  env.use_gt_rewards=false  \
  algorithm@online_algorithm=sac   \
  alg.online_alg_name=sac   \
  training.num_rollouts=100000  \
  training.seed=100  \
  eval.eval_freq=5000  \
  eval.eval_num_episodes=20  \
  online_algorithm.num_critic_updates_per_actor_update=1  \
  online_algorithm.learning_starts=5000  \
  online_algorithm.critic_optimizer_lr=1e-5  \
  online_algorithm.actor_optimizer_lr=1e-5  \
  logging.wandb_name=libero_online_rl_robometer_rewards  \
  logging.wandb_entity=YOUR_WANDB_ENTITY

You should see evaluation curves similar to the example below:

Real-World Online RL with DSRL + Remote Reward Labeling

Coming soon... Files are in this repo but need to be cleaned up, should be done by mid June.

Also coming soon: DSRL+Pi0 sanity check command before running real world online RL.

Additional Resources

gRPC service definitions: robometer_policy_learning/distributed/protos/
Configuration files: robometer_policy_learning/configs/
Algorithm-specific configs: robometer_policy_learning/configs/algorithm/

Project Structure

.
├── docs/                         # Documentation assets and figures
├── robometer_policy_learning/    # Main policy learning package
│   ├── algorithms/               # BC, IQL, SAC, and DSRL algorithm code
│   ├── buffers/                  # Replay and offline data buffers
│   ├── configs/                  # Hydra configs for algorithms, envs, and reward models
│   ├── distributed/              # Distributed training and reward relabeling services
│   ├── envs/                     # Environment wrappers and task interfaces
│   ├── loggers/                  # Logging integrations
│   ├── modules/                  # Policy, critic, and value network modules
│   ├── robots/                   # Real-robot interfaces
│   ├── rollouts/                 # Rollout collection utilities
│   ├── runners/                  # Training and evaluation runners
│   └── utils/                    # Shared helpers
├── scripts/                      # Training, evaluation, and server entrypoints
├── tests/                        # Test suite

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
LIBERO @ 87edbd1		LIBERO @ 87edbd1
docs		docs
dsrl_openpi @ b0e3cfb		dsrl_openpi @ b0e3cfb
robometer @ a669dff		robometer @ a669dff
robometer_policy_learning		robometer_policy_learning
scripts		scripts
tests		tests
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Robometer Policy Learning

Table of Contents

Setup

Prerequisites

Installation

Training

Basic Training (Ground Truth Rewards)

Training with Robometer Reward Model

Example: Online RL in LIBERO

Real-World Online RL with DSRL + Remote Reward Labeling

Additional Resources

Project Structure

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Robometer Policy Learning

Table of Contents

Setup

Prerequisites

Installation

Training

Basic Training (Ground Truth Rewards)

Training with Robometer Reward Model

Example: Online RL in LIBERO

Real-World Online RL with DSRL + Remote Reward Labeling

Additional Resources

Project Structure

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages