¹KAIST ²Seoul National University ³RLWRLD
Summary: Trust Region Q Adjoint Matching (TRQAM) is a stable off-policy RL algorithm for fine-tuning pretrained flow policies under a path-space KL trust region against the pretrained policy, enforced via dual descent. On 50 OGBench tasks, TRQAM reaches 68% aggregate offline success, compared to 46% for the strongest baseline.
- Create and activate a conda environment:
conda create -n trqam python=3.11 -y
conda activate trqam
- Install robomimic from source:
git clone https://github.com/ARISE-Initiative/robomimic.git
cd robomimic
pip install -e .
cd ..
- Install robosuite from source:
git clone https://github.com/ARISE-Initiative/robosuite.git
cd robosuite
pip install -r requirements.txt
cd ..
- Patch robomimic for JAX compatibility (makes an unused diffusers import non-fatal). The path below assumes robomimic was cloned into $HOME:
python -c "
path = '$HOME/robomimic/robomimic/algo/__init__.py'
with open(path) as f:
    text = f.read()
old = 'from robomimic.algo.diffusion_policy import DiffusionPolicyUNet'
new = '''try:
    from robomimic.algo.diffusion_policy import DiffusionPolicyUNet
except (ImportError, AttributeError):
    pass'''
with open(path, 'w') as f:
    f.write(text.replace(old, new))
print('Done')
"
- Install TRQAM dependencies:
pip install -r requirements.txt
The paper's pipeline is two-stage: (1) pretrain a flow policy with behavior cloning (BC) for 300K steps, then (2) fine-tune it with TRQAM (or a baseline) for 1M offline + 500K online steps, loading the BC checkpoint as the initialization.
The key TRQAM hyperparameter is --agent.kl_budget (ε_KL in the paper); see Table 4 of the paper for per-domain recommended values.
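To make the role of ε_KL concrete, the following is a minimal sketch of a generic Lagrange-multiplier update for a constraint of the form KL ≤ ε_KL. It is not taken from this repository: the function name dual_step, the step size, the projection onto non-negative values, and the KL estimates in the demo loop are all illustrative assumptions, and the paper's actual dual update may differ.

```python
def dual_step(lam: float, kl_estimate: float, kl_budget: float, lr: float = 1e-2) -> float:
    """One projected gradient step on the multiplier of the constraint KL <= kl_budget.

    Hypothetical helper for illustration only; not the repository's implementation.
    """
    # The multiplier grows while the estimated KL exceeds the budget and
    # shrinks (never below zero) once the policy is back inside the trust region.
    return max(0.0, lam + lr * (kl_estimate - kl_budget))


if __name__ == "__main__":
    lam = 1.0
    for kl in [0.9, 0.7, 0.5, 0.3]:  # made-up KL estimates, not measured values
        lam = dual_step(lam, kl, kl_budget=0.5, lr=0.1)
        print(f"kl={kl:.1f} -> lambda={lam:.3f}")
```

Under this sketch, a larger budget lets the fine-tuned policy drift further from the BC policy before the penalty weight starts to grow.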
Network size (per-domain widths and layer norm)
- 1M data domains (e.g. antmaze-large, humanoidmaze-*, cube-double, scene, puzzle-3x3, Robomimic): width 512, actor_layer_norm=False.
- 10M / 100M data domains (cube-triple-10M, antmaze-giant-10M, puzzle-4x4-10M, cube-quadruple-100M): width 1024, actor_layer_norm=True.
The default agent configs ship with the 1M setting. For 10M/100M data domains, append the following flags to both the BC pretrain and fine-tuning commands so the saved checkpoint matches the fine-tuning model:
--agent.actor_hidden_dims='(1024,1024,1024,1024)' \
--agent.value_hidden_dims='(1024,1024,1024,1024)' \
--agent.actor_layer_norm=True
Discount and pessimism (fine-tuning only)
- Default (manipulation, antmaze, Robomimic): --agent.discount=0.995 --agent.rho=0.5. Matches the shipped agent configs, so no override is needed.
- humanoidmaze-* (longer horizons): --agent.discount=0.999 --agent.rho=0.0. Override both flags on humanoidmaze runs.
These only affect the critic update, so they matter for fine-tuning (Step 2) but not for BC pretraining (Step 1, where the critic and adjoint matching are skipped). The only Step-1 hyperparameter that must match Step 2 is the network size.
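The per-domain rules above can be collected in one place. The sketch below is a hypothetical helper, not part of the codebase: the function name override_flags and the stage labels "bc" and "finetune" are assumptions, and the domain labels are the shorthand used in the lists above rather than --env_name strings. The flag names and values are the ones given in this README.

```python
# Domains listed in this README as using 10M / 100M datasets.
LARGE_DATA_DOMAINS = {
    "cube-triple-10M",
    "antmaze-giant-10M",
    "puzzle-4x4-10M",
    "cube-quadruple-100M",
}


def override_flags(domain: str, stage: str) -> list[str]:
    """Return extra CLI flags for a domain label and a stage ("bc" or "finetune").

    Hypothetical helper: an empty list means the shipped defaults apply.
    """
    if stage not in ("bc", "finetune"):
        raise ValueError(f"unknown stage: {stage!r}")
    flags = []
    if domain in LARGE_DATA_DOMAINS:
        # Network size must be identical in both stages so the checkpoint loads.
        flags += [
            "--agent.actor_hidden_dims=(1024,1024,1024,1024)",
            "--agent.value_hidden_dims=(1024,1024,1024,1024)",
            "--agent.actor_layer_norm=True",
        ]
    if stage == "finetune" and domain.startswith("humanoidmaze"):
        # Discount and pessimism only affect the critic, so they apply to Step 2 only.
        flags += ["--agent.discount=0.999", "--agent.rho=0.0"]
    return flags


if __name__ == "__main__":
    print(override_flags("cube-triple-10M", "bc"))
    print(override_flags("humanoidmaze-medium", "finetune"))
```

The returned strings are argv entries, so they carry no shell quoting; when pasting them into a shell command, quote the tuple values as in the examples in this README.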
Train a BC-only flow policy with the TRQAM agent (--agent=agents/trqam.py with --bc_only=True). The resulting checkpoint at exp/trqam/bc_pretrain/<env_name>/<exp_name>/params_300000.pkl can be reused to initialize TRQAM, QAM, QAM-E, FQL, DSRL, CGQL, and IFQL.
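When the <exp_name> component of that path is not known in advance, one option is to search for the checkpoint file. The sketch below assumes only the directory layout stated above; the helper name find_bc_checkpoints is hypothetical and not part of the codebase.

```python
from pathlib import Path


def find_bc_checkpoints(env_name, root="exp/trqam/bc_pretrain", step=300000):
    """List BC checkpoints for env_name across all experiment directories.

    Hypothetical helper; assumes <root>/<env_name>/<exp_name>/params_<step>.pkl.
    """
    pattern = f"*/params_{step}.pkl"
    # glob yields nothing if the env directory does not exist, so the result is [].
    return sorted((Path(root) / env_name).glob(pattern))


if __name__ == "__main__":
    for ckpt in find_bc_checkpoints("cube-triple-play-singletask-task2-v0"):
        print(ckpt)
```

Any one of the printed paths can then be passed to --pretrained_actor_path.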
Example command (cube-triple-task2)
MUJOCO_GL=egl python main.py --run_group=bc_pretrain --agent=agents/trqam.py --tags=BC --seed=10001 \
--env_name=cube-triple-play-singletask-task2-v0 --sparse=False --horizon_length=5 \
--ogbench_dataset_dir=~/.ogbench/data/cube-triple-play-10m-v0 \
--agent.action_chunking=True --bc_only=True --offline_steps=300000 --online_steps=0 \
--agent.actor_hidden_dims='(1024,1024,1024,1024)' \
--agent.value_hidden_dims='(1024,1024,1024,1024)' \
--agent.actor_layer_norm=True
Load the BC checkpoint via --pretrained_actor_path. Network-size flags must match Step 1.
Example commands (cube-triple-task2; TRQAM / QAM / QAM-E)
# Path to the BC checkpoint from Step 1
BC_CKPT=exp/trqam/bc_pretrain/cube-triple-play-singletask-task2-v0/<exp_name>/params_300000.pkl
# Common flags
COMMON="--env_name=cube-triple-play-singletask-task2-v0 --sparse=False --horizon_length=5 \
--ogbench_dataset_dir=~/.ogbench/data/cube-triple-play-10m-v0 \
--agent.action_chunking=True --pretrained_actor_path=$BC_CKPT \
--agent.actor_hidden_dims='(1024,1024,1024,1024)' \
--agent.value_hidden_dims='(1024,1024,1024,1024)' --agent.actor_layer_norm=True"
# TRQAM (ours)
MUJOCO_GL=egl python main.py --run_group=reproduce --agent=agents/trqam.py --tags=TRQAM --seed=10001 \
$COMMON --agent.kl_budget=0.5
# QAM
MUJOCO_GL=egl python main.py --run_group=reproduce --agent=agents/qam.py --tags=QAM --seed=10001 \
$COMMON --agent.inv_temp=3.0 --agent.fql_alpha=0.0 --agent.edit_scale=0.0
# QAM-E (edit variant)
MUJOCO_GL=egl python main.py --run_group=reproduce --agent=agents/qam.py --tags=QAM_EDIT --seed=10001 \
$COMMON --agent.inv_temp=3.0 --agent.fql_alpha=0.0 --agent.edit_scale=0.1
AntMaze-Giant-Navigate 10M Dataset
Download from Hugging Face:
conda activate trqam
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
import shutil, os, glob

# Download dataset
repo_path = snapshot_download(
    repo_id='yonghoon96/antmaze-giant-navigate-10m-v0',
    repo_type='dataset'
)

# Save to ~/.ogbench/data/antmaze-giant-navigate-10m-v0/
target_dir = os.path.expanduser('~/.ogbench/data/antmaze-giant-navigate-10m-v0')
os.makedirs(target_dir, exist_ok=True)
for file in glob.glob(os.path.join(repo_path, '*.npz')):
    shutil.copy(file, target_dir)
print(f'Dataset saved to: {target_dir}')
"
Reproduction: Generated using OGBench v1.2.1 with the following commands:
cd ogbench/data_gen_scripts
wget https://rail.eecs.berkeley.edu/datasets/ogbench/experts.tar.gz
tar xf experts.tar.gz && rm experts.tar.gz
for i in {0..9}; do
  PYTHONPATH="../impls:${PYTHONPATH}" python generate_locomaze.py \
    --env_name=antmaze-giant-v0 \
    --save_path=data/antmaze-giant-navigate-10m-v0/antmaze-giant-navigate-v0-00${i}.npz \
    --dataset_type=navigate \
    --num_episodes=500 \
    --max_episode_steps=2001 \
    --restore_path=experts/ant \
    --restore_epoch=400000 \
    --seed=${i}
done
Cube-Triple 10M / Puzzle-4x4 10M Datasets
10M subset of the official 100M release:
- Download cube-triple-play-100m-v0 and/or puzzle-4x4-play-100m-v0 from the horizon-reduction repo.
- Copy *-000.npz through *-009.npz into ~/.ogbench/data/cube-triple-play-10m-v0/ (or puzzle-4x4-play-10m-v0/), then pass that path via --ogbench_dataset_dir.
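The copy step can also be scripted. The sketch below is a hypothetical helper, not part of the codebase: the name copy_first_shards is an assumption, and it relies on shard files ending in a three-digit index followed by .npz, as the *-000.npz pattern above suggests. The source directory in the usage example is a placeholder for wherever the 100M release was downloaded.

```python
import re
import shutil
from pathlib import Path


def copy_first_shards(src_dir, dst_dir, num_shards=10):
    """Copy shards numbered 000 .. num_shards-1 from src_dir into dst_dir.

    Hypothetical helper; returns the names of the copied files in sorted order.
    """
    src, dst = Path(src_dir).expanduser(), Path(dst_dir).expanduser()
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.glob("*.npz")):
        # Keep only files whose name ends in -NNN.npz with NNN below num_shards.
        match = re.search(r"-(\d{3})\.npz$", path.name)
        if match and int(match.group(1)) < num_shards:
            shutil.copy(path, dst / path.name)
            copied.append(path.name)
    return copied


if __name__ == "__main__":
    # Placeholder source path; adjust to the actual download location.
    names = copy_first_shards(
        "~/downloads/cube-triple-play-100m-v0",
        "~/.ogbench/data/cube-triple-play-10m-v0",
    )
    print(f"copied {len(names)} shards")
```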
Cube-Quadruple 100M Dataset
For cube-quadruple-100M-*, please follow the instructions here to obtain the full official 100M release.
Robomimic Datasets (lift / can / square, multi-human low-dim)
python ~/robomimic/robomimic/scripts/download_datasets.py \
--download_dir ~/.robomimic/ \
--tasks lift can square \
--dataset_types mh \
--hdf5_types low_dim
This codebase is built on top of QC and QAM.
@misc{dong2026trqam,
title = {Trust Region Q Adjoint Matching},
author = {Yonghoon Dong and Kyungmin Lee and Changyeon Kim and Jaehyuk Kim and Jinwoo Shin},
url = {https://arxiv.org/abs/2605.27079},
year = {2026}
}
