Sizhe Lester Li*, Evan Kim*, Xingjian Bai*, Tong Zhao, Tao Pang, Max Simchowitz, Vincent Sitzmann
*equal contribution
[Paper] · [Project Page] · [Models]
VERA (Video-to-Embodied Robot Action model) is a two-stage, closed-loop video-to-action policy. It keeps a video generative model unchanged and uses it as an action-free world model that "dreams" the future, then trains an embodiment-specific inverse-dynamics model (IDM), built on the robot's Jacobian, to translate that dream into actions:
- Video planner (`vera.video_model` / `vera.idm.dfot`): an action-free diffusion model that generates future frames from the current observation (+ optional text). Embodiment-agnostic.
- Jacobian IDM (`vera.idm` + `vera.policy`): a faithful, data-efficient translator from dreamed future to robot actions. Embodiment-specific, swappable without retraining the planner.
The thesis: decoupled video planning + faithful video-to-action translation is a viable route to zero-shot, cross-embodiment robot control. One video planner, many IDMs.
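For intuition about how a Jacobian can turn dreamed motion into an action, here is a minimal sketch. It is not VERA's implementation: the actual IDM is a learned model, and everything named here is an assumption made for illustration. The sketch assumes a 2-DoF action (as for a planar pusher), a hypothetical 2x2 Jacobian per tracked point that maps a small action to image-space motion, and desired motions read off the dreamed frames. It then solves a damped least-squares problem for the action. The function name `solve_action_2dof` and the `damping` parameter are illustrative.

```python
from typing import Sequence, Tuple

# Hypothetical per-point Jacobian: [[dux/da0, dux/da1], [duy/da0, duy/da1]]
Matrix2 = Sequence[Sequence[float]]
Vec2 = Tuple[float, float]


def solve_action_2dof(jacobians: Sequence[Matrix2],
                      motions: Sequence[Vec2],
                      damping: float = 1e-6) -> Vec2:
    """Minimise sum_i ||J_i a - u_i||^2 + damping * ||a||^2 over a 2-DoF action a."""
    # Accumulate the normal equations A a = b, with
    # A = sum_i J_i^T J_i + damping * I and b = sum_i J_i^T u_i.
    a00 = a11 = damping
    a01 = 0.0
    b0 = b1 = 0.0
    for J, (ux, uy) in zip(jacobians, motions):
        a00 += J[0][0] ** 2 + J[1][0] ** 2
        a01 += J[0][0] * J[0][1] + J[1][0] * J[1][1]
        a11 += J[0][1] ** 2 + J[1][1] ** 2
        b0 += J[0][0] * ux + J[1][0] * uy
        b1 += J[0][1] * ux + J[1][1] * uy
    # Solve the symmetric 2x2 system with Cramer's rule.
    det = a00 * a11 - a01 * a01
    return ((a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det)


if __name__ == "__main__":
    # Two made-up points with made-up Jacobians and a known action.
    Js = [[[2.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 1.0]]]
    a_true = (0.3, -0.2)
    us = [(J[0][0] * a_true[0] + J[0][1] * a_true[1],
           J[1][0] * a_true[0] + J[1][1] * a_true[1]) for J in Js]
    print(solve_action_2dof(Js, us))
```

The damping term keeps the 2x2 system invertible when the tracked points carry little motion information; a real system would also need to handle higher-dimensional actions and noisy tracks.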
| Wave | Embodiments | Code | Checkpoints | Status |
|---|---|---|---|---|
| Wave 1 — now | MimicGen (Panda, 2-block stacking) · PushT (planar pusher) | ✅ | ✅ | available |
| Wave 2 — later this week | Allegro-Sim · Allegro-Real · IIWA-Sim · DROID (FR3 real) | ✅ in-tree | 🔜 | code present; checkpoints + docs coming |
This repo already contains the unified code for all embodiments, but Wave 1 documents and ships checkpoints only for MimicGen + PushT. The cross-embodiment OMNI WAN planner and the DROID/Allegro IDMs land with Wave 2.
VERA targets Python 3.11 + PyTorch 2.6 (CUDA 12.4). Self-contained — no sibling repos on sys.path.
```bash
git clone git@github.com:sizhe-li/VERA.git && cd VERA
pip install -e ".[idm,video]"   # the two stages (IDM + video planner)
```

Simulators (needed to reproduce the results; install the `eval` extra):

```bash
pip install -e ".[eval]"   # gymnasium, gym-pusht, robomimic, robosuite, mimicgen, mujoco
```

- PushT runs on `gym-pusht` (pulls `pymunk`), but the notebook seeds rollouts from the original PushT replay buffer `pusht_cchi_v7_replay.zarr` (the known-success initial states it indexes into). Grab it from the Diffusion Policy release:

  ```bash
  wget https://diffusion-policy.cs.columbia.edu/data/training/pusht.zip
  unzip pusht.zip   # -> pusht/pusht_cchi_v7_replay.zarr
  ```

  Then point the notebook's `ZARR_PATH` at `.../pusht/pusht_cchi_v7_replay.zarr`.
- MimicGen runs on `robosuite` + `robomimic` + `mimicgen` (all pinned in the `eval` extra) and needs MuJoCo (pulled automatically). It also needs the task dataset HDF5 (the initial states), e.g. `stack_d0.hdf5`: download the standard MimicGen datasets from 🤗 `amandlek/mimicgen_datasets` (or follow the MimicGen instructions) and point the notebook at the file.
- flash-attn (WAN attention) is optional; the WAN path falls back to SDPA if absent.
- VGGT (the IDM visual backbone, required by both the MimicGen and PushT IDMs) installs automatically with the `idm` extra as a git dependency (`facebookresearch/vggt`). If your environment blocks git installs, clone and install it manually instead:

  ```bash
  pip install "git+https://github.com/facebookresearch/vggt.git"
  # or: git clone https://github.com/facebookresearch/vggt && pip install -e vggt
  ```

  The VGGT-1B weights are then pulled from `facebook/VGGT-1B` on first use.
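Before opening a notebook, it can help to confirm that the datasets mentioned above are where you expect them. The sketch below is a hypothetical helper, not part of the `vera` package. It assumes the unzipped PushT zarr store is a directory and the MimicGen dataset is a single HDF5 file; the example paths are placeholders, and the check only tests existence, not content.

```python
from pathlib import Path
from typing import Dict


def check_dataset_paths(zarr_path: str, hdf5_path: str) -> Dict[str, bool]:
    """Report whether the PushT zarr store (directory) and MimicGen HDF5 (file) exist."""
    return {
        "pusht_zarr": Path(zarr_path).expanduser().is_dir(),
        "mimicgen_hdf5": Path(hdf5_path).expanduser().is_file(),
    }


if __name__ == "__main__":
    # Placeholder locations; substitute the paths you actually downloaded to.
    status = check_dataset_paths("pusht/pusht_cchi_v7_replay.zarr", "stack_d0.hdf5")
    for name, ok in status.items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```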
Verify:

```bash
python -c "import vera, vera.policy, vera.idm, vera.server; print('vera ok')"
```

Every embodiment runs the same two steps: start a policy server in one terminal, then run its client notebook in another. The notebook drives the sim, prints the success rate, and inlines the rollout videos.
Terminal 1 — server Jupyter — client notebook
┌─────────────────────────────┐ websocket ┌──────────────────────────────┐
│ python -m vera.server │ ───────────▶ │ open the notebook → Run All │
│ .start_vera_server ... │ :8800/:8820 │ → success rate + videos │
└─────────────────────────────┘ └──────────────────────────────┘
| Task | Server flag | Client notebook (run this) |
|---|---|---|
| PushT: planar push-to-goal | `--embodiment pusht` | `examples/pusht_dfot_stack.ipynb` |
| MimicGen: 2-block stacking | `--embodiment mimicgen` | `examples/mimicgen_stack.ipynb` |
1. Start the server (Terminal 1):

   ```bash
   python -m vera.server.start_vera_server --embodiment pusht --port 8820 --vis-port 8821
   ```

2. Run the client: open `examples/pusht_dfot_stack.ipynb` → Run All.
   - it connects to the server, rolls out a known-success initial state, prints the success rate, and inlines the rollout + the composite policy-vis;
   - checkpoint paths come from the `VERA_PUSHT_*` env vars (see `vera/server/start_server_pusht.py`).
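If the notebook cannot connect, a quick way to tell whether the server is up is to probe its port. The sketch below is a hypothetical helper, not part of the `vera` package. It only checks that something accepts TCP connections on the port (8820 for PushT in the command above); it does not perform a websocket handshake or confirm that the listener is VERA.

```python
import socket


def server_is_listening(host: str = "localhost", port: int = 8820,
                        timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    print("server reachable:", server_is_listening())
```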
1. Point at the downloaded checkpoints, then start the server (Terminal 1):

   ```bash
   export VERA_WAN_CKPT_ROOT=/path/to/Wan2.1-T2V-1.3B              # frozen Wan2.1 base (text-enc + VAE)
   export VERA_MIMICGEN_CKPT_DIR=./vera-ckpts/mimicgen-wan-1.3b    # specialist DiT + flow decoder
   python -m vera.server.start_vera_server --embodiment mimicgen --port 8800 --vis-port 8801 \
     --algo-config $VERA_MIMICGEN_CKPT_DIR/algo_config.yaml \
     --text "A robot arm stacks one block on top of another block"
   ```

   Set both env vars before launching: the hosted `algo_config.yaml` reads the DiT + flow decoder from `VERA_MIMICGEN_CKPT_DIR` and the Wan2.1 base from `VERA_WAN_CKPT_ROOT`.

2. Run the client: open `examples/mimicgen_stack.ipynb` → Run All.
   - swap pieces live via env vars on the server: `VERA_DYNAMICS_RUN_ID` (IDM checkpoint), `VERA_TRACKER_BACKEND`, `VERA_MOTION_PLAN_SCALE`, `VERA_N_ACTION_STEPS`.
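Since the MimicGen server depends on two environment variables being set, a small pre-launch check can catch a missing one early. The sketch below is a hypothetical helper, not part of the `vera` package. It assumes only what the steps above state: `VERA_WAN_CKPT_ROOT` and `VERA_MIMICGEN_CKPT_DIR` must be set, and `algo_config.yaml` lives in the latter directory.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional

REQUIRED_VARS = ("VERA_WAN_CKPT_ROOT", "VERA_MIMICGEN_CKPT_DIR")


def missing_mimicgen_settings(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return human-readable problems; an empty list means the basics look fine."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    ckpt_dir = env.get("VERA_MIMICGEN_CKPT_DIR")
    # The server command passes $VERA_MIMICGEN_CKPT_DIR/algo_config.yaml as --algo-config.
    if ckpt_dir and not (Path(ckpt_dir) / "algo_config.yaml").is_file():
        problems.append(f"algo_config.yaml not found in {ckpt_dir}")
    return problems


if __name__ == "__main__":
    issues = missing_mimicgen_settings()
    print("ok" if not issues else "\n".join(issues))
```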
Pass --vis-port to any server and open http://localhost:<vis-port>/ for a built-in dashboard that
streams VERA's entire two-stage pipeline live, in one strip, as the rollout runs. The policy is
interpretable by construction, not a black box.
Each row is one camera view, read left → right:
| Panel | What it shows |
|---|---|
| Current | the robot's live observation |
| Dream + tracks | the video model's predicted future, with motion tracks overlaid |
| Dream | the decoded future frames |
| Jacobian field | the map that turns the dream into the next action |
The per-chunk player below scrubs each generated dream chunk frame-by-frame, so the planner's imagination
and the IDM's response sit side-by-side. The notebooks inline this same composite via show_policy_vis();
snapshot it any time with python -m vera.server.save_vis_video --output dream.mp4.
Hosted on HuggingFace — huggingface.co/sizhe-lester-li/VERA. VERA hosts only the trained artifacts;
frozen upstream pieces are pulled from their original homes.
| Group | dir | what |
|---|---|---|
| MimicGen | `mimicgen-wan-1.3b/` | specialist WAN planner (DiT-only bf16, ~2.8 GB) + `flow_decoder.ckpt` + `algo_config.yaml` |
| | `idm-mimicgen-37oa162u/` | MimicGen Jacobian IDM (+ config sidecar) |
| PushT | `pusht-dfot/` | DFoT flow planner (~39 MB) + `run_config.yaml` |
| | `pusht-idm/` | PushT Jacobian IDM (~232 MB) + `config.yaml` |
| Upstream | `Wan-AI/Wan2.1-T2V-1.3B`, `facebook/VGGT-1B` | WAN base + IDM backbone (not re-hosted) |
Download (with the HuggingFace CLI: `pip install huggingface_hub`):

```bash
# (1) MimicGen + PushT only: IDM + video planner for the Wave-1 notebooks (~3.8 GB)
hf download sizhe-lester-li/VERA --local-dir ./vera-ckpts \
  --include "mimicgen-wan-1.3b/*" "idm-mimicgen-37oa162u/*" "pusht-dfot/*" "pusht-idm/*"
# (2) everything: also pulls the 33 GB OMNI planner + DROID IDM (Wave 2) (~42 GB)
hf download sizhe-lester-li/VERA --local-dir ./vera-ckpts
```

The Wave-1 download is ~3.8 GB; the full repo is ~42 GB (the 33 GB OMNI WAN planner dominates).
Then point the server/notebook at the downloaded paths (--algo-config, VERA_PUSHT_* / VERA_WAN_CKPT_ROOT).
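If you prefer to script the Wave-1 download from Python rather than the CLI, the same selection can be expressed with `huggingface_hub.snapshot_download` and its `allow_patterns` argument. This is a sketch under the assumption that the repo layout matches the directory names listed above; the helper names `wave1_patterns` and `download_wave1` are illustrative and not part of the `vera` package.

```python
from typing import List

REPO_ID = "sizhe-lester-li/VERA"
# Directory names taken from the checkpoint listing in this README.
WAVE1_DIRS = ("mimicgen-wan-1.3b", "idm-mimicgen-37oa162u", "pusht-dfot", "pusht-idm")


def wave1_patterns() -> List[str]:
    """Glob patterns selecting only the Wave-1 (MimicGen + PushT) checkpoints."""
    return [f"{d}/*" for d in WAVE1_DIRS]


def download_wave1(local_dir: str = "./vera-ckpts") -> str:
    """Download the Wave-1 checkpoints and return the local folder path."""
    # Imported lazily so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        local_dir=local_dir,
        allow_patterns=wave1_patterns(),
    )


if __name__ == "__main__":
    # Printing only; call download_wave1() explicitly to fetch the files.
    print(wave1_patterns())
```

Calling `download_wave1()` requires network access and roughly the same disk space as option (1) above.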
OMNI training data (Wave 2): the cross-embodiment OMNI WAN planner is trained on a weighted mixture of Allegro-Sim + Allegro-Real + MimicGen + DROID (each kept at native fps/aspect, black-padded to a 576-wide multiview canvas). PushT is not yet in the OMNI mixture — for now it uses its own DFoT flow planner, and we will release a new OMNI checkpoint that includes PushT soon.
To be added.
This work was supported by the National Science Foundation under Grant No. 2211259, by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) under 140D0423C0075, by the Amazon Science Hub, by the MIT-Google Program for Computing Innovation, by Advanced Micro Devices, Inc. under the AMD University Program's support of the MIT Hardware Consortium, and by a 2025 MIT Office of Research Computing and Data Seed Grant.
Released under the MIT License (see LICENSE); depended-upon code retains its own license (see
NOTICE). VERA builds on Wan2.1 (Apache-2.0), VGGT (Meta), CLIP/open_clip (MIT), and
cotracker/AllTracker; the DFoT/DiT backbones are adapted from facebookresearch/DiT and NVlabs/edm2.
```bibtex
@article{li2026turningvideomodelsgeneralist,
  title={Turning Video Models into Generalist Robot Policies},
  author={Sizhe Lester Li and Evan Kim and Xingjian Bai and Tong Zhao and Tao Pang and Max Simchowitz and Vincent Sitzmann},
  year={2026},
  eprint={2605.27817},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2605.27817},
}
```