Trust Region Q Adjoint Matching

Yonghoon Dong¹ · Kyungmin Lee¹ · Changyeon Kim¹ · Jaehyuk Kim² · Jinwoo Shin¹,³
¹KAIST ²Seoul National University ³RLWRLD

Paper Blog

Summary: Trust Region Q Adjoint Matching (TRQAM) is a stable off-policy RL algorithm for fine-tuning pretrained flow policies under a path-space KL trust region against the pretrained policy, enforced via dual descent. On 50 OGBench tasks, TRQAM reaches 68% aggregate offline success, compared to 46% for the strongest baseline.

Installation

Create and activate a conda environment:

conda create -n trqam python=3.11 -y
conda activate trqam

Install robomimic from source:

git clone https://github.com/ARISE-Initiative/robomimic.git
cd robomimic
pip install -e .
cd ..

Install robosuite from source:

git clone https://github.com/ARISE-Initiative/robosuite.git
cd robosuite
pip install -r requirements.txt
cd ..

Patch robomimic for JAX compatibility (makes an unused diffusers import non-fatal). The path below assumes robomimic was cloned into $HOME; adjust it if you cloned elsewhere:

python -c "
path = '$HOME/robomimic/robomimic/algo/__init__.py'
with open(path) as f:
    text = f.read()
old = 'from robomimic.algo.diffusion_policy import DiffusionPolicyUNet'
new = '''try:
    from robomimic.algo.diffusion_policy import DiffusionPolicyUNet
except (ImportError, AttributeError):
    pass'''
# Skip the rewrite if the file is already patched, so a second run does not nest the try block
if new not in text:
    with open(path, 'w') as f:
        f.write(text.replace(old, new))
print('Done')
"

Install TRQAM dependencies:
```
pip install -r requirements.txt
```

Reproducing paper results

The paper's pipeline is two-stage: (1) pretrain a flow policy with behavior cloning (BC) for 300K steps, then (2) fine-tune it with TRQAM (or a baseline) for 1M offline + 500K online steps, loading the BC checkpoint as the initialization.

The key TRQAM hyperparameter is --agent.kl_budget (ε_KL in the paper); see Table 4 of the paper for per-domain recommended values.
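
For intuition on how a KL budget can be enforced by dual descent, here is a minimal Python sketch of a generic Lagrange-multiplier update. It is an illustration under stated assumptions, not the repository's implementation: the function name `dual_step`, the log-space parameterization, and the learning rate are all hypothetical, and `kl_estimate` stands in for whatever path-space KL estimate the agent computes.

```python
import math


def dual_step(log_lam, kl_estimate, kl_budget, lr=1e-2):
    """One gradient-descent step on a (hypothetical) dual variable.

    Dual loss: L(lam) = lam * (kl_budget - kl_estimate), with lam = exp(log_lam)
    so that the multiplier stays positive.
    """
    lam = math.exp(log_lam)
    grad = lam * (kl_budget - kl_estimate)  # dL / d(log_lam)
    return log_lam - lr * grad


log_lam = 0.0
# KL above the budget: the multiplier grows, tightening the trust region.
tighter = dual_step(log_lam, kl_estimate=1.0, kl_budget=0.5)
# KL below the budget: the multiplier shrinks, loosening it.
looser = dual_step(log_lam, kl_estimate=0.1, kl_budget=0.5)
print(tighter > log_lam, looser < log_lam)  # True True
```

Under this reading, a larger --agent.kl_budget lets the fine-tuned policy move further from the BC policy before the multiplier starts to push back.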

Network size (per-domain widths and layer norm)

1M data domains (e.g. antmaze-large, humanoidmaze-*, cube-double, scene, puzzle-3x3, Robomimic): width 512, actor_layer_norm=False.
10M / 100M data domains (cube-triple-10M, antmaze-giant-10M, puzzle-4x4-10M, cube-quadruple-100M): width 1024, actor_layer_norm=True.

The default agent configs ship with the 1M setting. For 10M/100M data domains, append the following flags to both the BC pretrain and fine-tuning commands so the saved checkpoint matches the fine-tuning model:

--agent.actor_hidden_dims='(1024,1024,1024,1024)' \
--agent.value_hidden_dims='(1024,1024,1024,1024)' \
--agent.actor_layer_norm=True
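
When scripting many runs, the rule above can be captured in a small helper. The sketch below is a convenience wrapper, not part of the repository; the function name `network_flags` and the `data_scale` labels are hypothetical, while the flag strings are the ones listed above.

```python
def network_flags(data_scale):
    """Return the network-size overrides for a data scale ('1M', '10M', '100M')."""
    if data_scale == "1M":
        # Shipped defaults already use width 512 and actor_layer_norm=False.
        return []
    if data_scale in ("10M", "100M"):
        dims = "(1024,1024,1024,1024)"
        return [
            f"--agent.actor_hidden_dims={dims}",
            f"--agent.value_hidden_dims={dims}",
            "--agent.actor_layer_norm=True",
        ]
    raise ValueError(f"unknown data scale: {data_scale!r}")


print(network_flags("10M"))
```

Using the same helper for both the BC pretraining and the fine-tuning command is one way to keep the two network sizes in sync.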

Discount and pessimism (fine-tuning only)

Default (manipulation, antmaze, Robomimic): --agent.discount=0.995 --agent.rho=0.5. These values match the shipped agent configs, so no override is needed.
humanoidmaze-* (longer horizons): --agent.discount=0.999 --agent.rho=0.0. Pass both flags explicitly on humanoidmaze runs.

These only affect the critic update, so they matter for fine-tuning (Step 2) but not for BC pretraining (Step 1, where the critic and adjoint matching are skipped). The only Step-1 hyperparameter that must match Step 2 is the network size.
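
To see why the longer-horizon humanoidmaze tasks use a larger discount, the common rule of thumb 1 / (1 - discount) gives a rough effective horizon. The Python sketch below computes it and also selects the overrides by environment name; the helper names `effective_horizon` and `critic_flags` are hypothetical, and matching on the `humanoidmaze` prefix is an assumption about how the environments are named.

```python
def effective_horizon(discount):
    """Rule-of-thumb horizon (in steps) implied by a discount factor."""
    return 1.0 / (1.0 - discount)


def critic_flags(env_name):
    """Return the fine-tuning overrides for discount and pessimism."""
    if env_name.startswith("humanoidmaze"):
        return ["--agent.discount=0.999", "--agent.rho=0.0"]
    return []  # defaults (discount=0.995, rho=0.5) ship with the agent configs


print(round(effective_horizon(0.995)))  # 200
print(round(effective_horizon(0.999)))  # 1000
```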

Step 1: BC pretraining (300K steps)

Train a BC-only flow policy with the TRQAM agent (--agent=agents/trqam.py --bc_only=True). The resulting checkpoint, saved at exp/trqam/bc_pretrain/<env_name>/<exp_name>/params_300000.pkl, can be reused as the initialization for TRQAM, QAM, QAM-E, FQL, DSRL, CGQL, and IFQL.

Example command (cube-triple-task2)

MUJOCO_GL=egl python main.py --run_group=bc_pretrain --agent=agents/trqam.py --tags=BC --seed=10001 \
  --env_name=cube-triple-play-singletask-task2-v0 --sparse=False --horizon_length=5 \
  --ogbench_dataset_dir=~/.ogbench/data/cube-triple-play-10m-v0 \
  --agent.action_chunking=True --bc_only=True --offline_steps=300000 --online_steps=0 \
  --agent.actor_hidden_dims='(1024,1024,1024,1024)' \
  --agent.value_hidden_dims='(1024,1024,1024,1024)' \
  --agent.actor_layer_norm=True
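
Before launching fine-tuning, it can help to confirm that the BC checkpoint landed where expected. This Python sketch rebuilds the path from the pattern given above; the helper name `bc_checkpoint_path` is hypothetical, and `my_exp_name` is only a placeholder for the run's actual <exp_name> directory.

```python
from pathlib import Path


def bc_checkpoint_path(env_name, exp_name, step=300000, root="exp/trqam/bc_pretrain"):
    """Build the checkpoint path exp/trqam/bc_pretrain/<env_name>/<exp_name>/params_<step>.pkl."""
    return Path(root) / env_name / exp_name / f"params_{step}.pkl"


# "my_exp_name" is a placeholder; substitute the directory created by your run.
ckpt = bc_checkpoint_path("cube-triple-play-singletask-task2-v0", "my_exp_name")
if ckpt.is_file():
    print(f"found checkpoint: {ckpt}")
else:
    print(f"missing checkpoint: {ckpt}")
```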

Step 2: Off-policy fine-tuning

Load the BC checkpoint via --pretrained_actor_path. Network-size flags must match Step 1.

Example commands (cube-triple-task2; TRQAM / QAM / QAM-E)

# Path to the BC checkpoint from Step 1 (replace <exp_name> with your run's directory)
BC_CKPT="exp/trqam/bc_pretrain/cube-triple-play-singletask-task2-v0/<exp_name>/params_300000.pkl"

# Common flags, kept in a bash array so the quoted tuple values reach main.py without literal quote characters
COMMON=(
  --env_name=cube-triple-play-singletask-task2-v0 --sparse=False --horizon_length=5
  --ogbench_dataset_dir=~/.ogbench/data/cube-triple-play-10m-v0
  --agent.action_chunking=True --pretrained_actor_path="$BC_CKPT"
  --agent.actor_hidden_dims='(1024,1024,1024,1024)'
  --agent.value_hidden_dims='(1024,1024,1024,1024)' --agent.actor_layer_norm=True
)

# TRQAM (ours)
MUJOCO_GL=egl python main.py --run_group=reproduce --agent=agents/trqam.py --tags=TRQAM --seed=10001 \
  "${COMMON[@]}" --agent.kl_budget=0.5

# QAM
MUJOCO_GL=egl python main.py --run_group=reproduce --agent=agents/qam.py --tags=QAM --seed=10001 \
  "${COMMON[@]}" --agent.inv_temp=3.0 --agent.fql_alpha=0.0 --agent.edit_scale=0.0

# QAM-E (edit variant)
MUJOCO_GL=egl python main.py --run_group=reproduce --agent=agents/qam.py --tags=QAM_EDIT --seed=10001 \
  "${COMMON[@]}" --agent.inv_temp=3.0 --agent.fql_alpha=0.0 --agent.edit_scale=0.1
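
The three fine-tuning commands share most of their flags, so they can also be assembled from Python as an argument list, which sidesteps shell quoting of the tuple-valued flags. The sketch below only builds and prints the command; the helper name `finetune_cmd` and the checkpoint path passed to it are hypothetical, and all flags are copied from the commands above.

```python
import os
import shlex

ENV_NAME = "cube-triple-play-singletask-task2-v0"
DATA_DIR = os.path.expanduser("~/.ogbench/data/cube-triple-play-10m-v0")


def finetune_cmd(agent, tag, bc_ckpt, extra, seed=10001):
    """Assemble the argument list for one fine-tuning run."""
    common = [
        f"--env_name={ENV_NAME}", "--sparse=False", "--horizon_length=5",
        f"--ogbench_dataset_dir={DATA_DIR}",
        "--agent.action_chunking=True", f"--pretrained_actor_path={bc_ckpt}",
        "--agent.actor_hidden_dims=(1024,1024,1024,1024)",
        "--agent.value_hidden_dims=(1024,1024,1024,1024)",
        "--agent.actor_layer_norm=True",
    ]
    head = ["python", "main.py", "--run_group=reproduce",
            f"--agent={agent}", f"--tags={tag}", f"--seed={seed}"]
    return head + common + list(extra)


# Placeholder checkpoint path; point it at the params_300000.pkl from Step 1.
cmd = finetune_cmd("agents/trqam.py", "TRQAM", "path/to/params_300000.pkl",
                   ["--agent.kl_budget=0.5"])
print(shlex.join(cmd))
# To launch: subprocess.run(cmd, env={**os.environ, "MUJOCO_GL": "egl"}, check=True)
```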

Datasets

AntMaze-Giant-Navigate 10M Dataset

Download from Hugging Face:

conda activate trqam
pip install huggingface_hub

python -c "
from huggingface_hub import snapshot_download
import shutil, os, glob

# Download dataset
repo_path = snapshot_download(
    repo_id='yonghoon96/antmaze-giant-navigate-10m-v0',
    repo_type='dataset'
)

# Save to ~/.ogbench/data/antmaze-giant-navigate-10m-v0/
target_dir = os.path.expanduser('~/.ogbench/data/antmaze-giant-navigate-10m-v0')
os.makedirs(target_dir, exist_ok=True)

for file in glob.glob(os.path.join(repo_path, '*.npz')):
    shutil.copy(file, target_dir)

print(f'Dataset saved to: {target_dir}')
"

Reproduction: the dataset was generated with OGBench v1.2.1 using the following commands:

cd ogbench/data_gen_scripts
wget https://rail.eecs.berkeley.edu/datasets/ogbench/experts.tar.gz
tar xf experts.tar.gz && rm experts.tar.gz

for i in {0..9}; do
  PYTHONPATH="../impls:${PYTHONPATH}" python generate_locomaze.py \
    --env_name=antmaze-giant-v0 \
    --save_path=data/antmaze-giant-navigate-10m-v0/antmaze-giant-navigate-v0-00${i}.npz \
    --dataset_type=navigate \
    --num_episodes=500 \
    --max_episode_steps=2001 \
    --restore_path=experts/ant \
    --restore_epoch=400000 \
    --seed=${i}
done
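
As a sanity check on the "10M" label, the loop's settings can be multiplied out. This short Python calculation assumes every episode runs for the full --max_episode_steps and that each step is stored as one dataset row, which is an assumption about the generation script rather than something stated here.

```python
num_files = 10            # loop over i in 0..9
episodes_per_file = 500   # --num_episodes
steps_per_episode = 2001  # --max_episode_steps

per_file = episodes_per_file * steps_per_episode
total = num_files * per_file
print(per_file, total)  # 1000500 10005000
```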

Cube-Triple 10M / Puzzle-4x4 10M Datasets

10M subset of the official 100M release:

Download cube-triple-play-100m-v0 and/or puzzle-4x4-play-100m-v0 from the horizon-reduction repo.
Copy *-000.npz through *-009.npz into ~/.ogbench/data/cube-triple-play-10m-v0/ (or puzzle-4x4-play-10m-v0/), then pass that path via --ogbench_dataset_dir.
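
The copy step can be scripted. The Python sketch below copies the first ten shards by their numeric suffix; the helper name `copy_first_shards` is hypothetical, and the example source directory in the trailing comment is a placeholder for wherever the 100M release was downloaded.

```python
import shutil
from pathlib import Path


def copy_first_shards(src_dir, dst_dir, num_shards=10):
    """Copy shards *-000.npz .. *-(num_shards-1).npz from src_dir into dst_dir."""
    src, dst = Path(src_dir).expanduser(), Path(dst_dir).expanduser()
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for i in range(num_shards):
        matches = sorted(src.glob(f"*-{i:03d}.npz"))
        if not matches:
            raise FileNotFoundError(f"no shard with suffix -{i:03d}.npz in {src}")
        for path in matches:
            shutil.copy(path, dst / path.name)
            copied.append(path.name)
    return copied


# Example (the source directory is a placeholder):
# copy_first_shards("~/Downloads/cube-triple-play-100m-v0",
#                   "~/.ogbench/data/cube-triple-play-10m-v0")
```

The helper raises if any of the ten shards is missing, so a partial download is caught before training starts.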

Cube-Quadruple 100M Dataset

For cube-quadruple-100M-*, please follow the instructions here to obtain the full official 100M release.

Robomimic Datasets (lift / can / square, multi-human low-dim)

python ~/robomimic/robomimic/scripts/download_datasets.py \
  --download_dir ~/.robomimic/ \
  --tasks lift can square \
  --dataset_types mh \
  --hdf5_types low_dim

Acknowledgments

This codebase is built on top of QC and QAM.

BibTeX

@misc{dong2026trqam,
    title  = {Trust Region Q Adjoint Matching},
    author = {Yonghoon Dong and Kyungmin Lee and Changyeon Kim and Jaehyuk Kim and Jinwoo Shin},
    url    = {https://arxiv.org/abs/2605.27079},
    year   = {2026}
}
