A full-stack framework and tutorial for newcomers, rather than a specific model.
minWM is our contribution to the world-model community: a full-stack open-source framework that walks you end-to-end through turning a bidirectional T2V foundation model into an action-conditioned video world model. It ships with example data, runnable scripts, Claude skills capturing our hands-on experience, and onboarding knowledge for newcomers. We hope more researchers and developers join us in growing the community together.
Demo video: v2_web.mp4
- 2026-05-29: We release the technical report.
- 2026-05-17: We release minWM, the first full-stack open-source world model framework.
- Demo
- News
- Why minWM?
- Installation
- Model Checkpoints
- Quick Start
- Data & Training & Reproduction
- Citation
- Contact
- Acknowledgements
The complete data → training → inference pipeline is open-sourced; every stage exposes input/output checkpoints so you can stop, swap, or fork anywhere.
1.1 Data. We walk you through how to construct training-ready datasets paired with camera poses, and the full data processing pipeline that turns them into latents.
1.2 Training. We cover FSDP with sequence parallelism, single- and multi-node training, and the full distillation pipeline from a bidirectional diffusion model to a 4-step AR student:
- Phase 1: Bidirectional SFT
- Phase 2: Distillation to Causal Few-Step
  - Stage 1: Teacher Forcing AR Diffusion
  - Stage 2a: Causal ODE (proposed in [Causal Forcing](https://arxiv.org/abs/2602.02214))
  - Stage 2b: Causal CD (proposed in [Causal Forcing++](https://arxiv.org/abs/2605.15141))
  - Stage 3: Asymmetric DMD with Self Rollout
- Result: a 4-step real-time model
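The stage ordering can also be written down as a small dependency table. The sketch below is illustrative only: the stage keys reuse the checkpoint directory names from the Model Checkpoints table, while `STAGES`, `path_to`, and the assumptions that Stage 2a and Stage 2b are alternatives and that both start from the Stage 1 checkpoint are ours, not part of minWM's API.

```python
from typing import Dict, List, Tuple

# Stage key -> (human-readable label, stages it may be initialized from).
# Keys mirror the checkpoint directory names; the edges are an assumption
# inferred from the "Use case" column of the checkpoint table.
STAGES: Dict[str, Tuple[str, List[str]]] = {
    "bidirectional": ("Phase 1: Bidirectional SFT", []),
    "ar_diffusion_tf": ("Phase 2 Stage 1: Teacher Forcing AR diffusion", ["bidirectional"]),
    "causal_ode": ("Phase 2 Stage 2a: Causal ODE", ["ar_diffusion_tf"]),
    "causal_cd": ("Phase 2 Stage 2b: Causal CD", ["ar_diffusion_tf"]),
    "dmd": ("Phase 2 Stage 3: Asymmetric DMD with self rollout", ["causal_ode", "causal_cd"]),
}


def path_to(stage: str, stage2: str = "causal_cd") -> List[str]:
    """Return the stages to run, in order, to reach `stage`.

    `stage2` selects which Stage 2 variant is used to initialize DMD.
    """
    if stage not in STAGES:
        raise KeyError(f"unknown stage: {stage!r}")
    if stage2 not in ("causal_ode", "causal_cd"):
        raise ValueError(f"stage2 must be 'causal_ode' or 'causal_cd', got {stage2!r}")
    chain = [stage]
    # Walk backwards through the initialization edges until Phase 1.
    while STAGES[chain[-1]][1]:
        parents = STAGES[chain[-1]][1]
        chain.append(stage2 if stage2 in parents else parents[0])
    return chain[::-1]


if __name__ == "__main__":
    for name in path_to("dmd", stage2="causal_ode"):
        print(f"{name:16s} {STAGES[name][0]}")
```

Because every stage reads and writes a checkpoint, a table like this is enough to decide which released checkpoint to download when you want to resume from the middle of the pipeline.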
1.3 Inference.
- ✅ 4-step DMD inference for HY Action2V / HY TI2V / Wan Action2V, multi-GPU sequence parallelism, camera-trajectory control via pose strings (`"a*4,w*8,s*7"`) or JSON files
- 🚧 Inference acceleration [TBD]
minWM supports two paths to an interactive world model.
The HunyuanVideo 1.5 and Wan 2.1 lines walk through the full 4-stage pipeline, starting from a bidirectional T2V foundation model and ending at a 4-step autoregressive world model.
| Backbone | Architecture | Params | Training | Inference |
|---|---|---|---|---|
| Wan 2.1 | Cross-attention + DiT | 1.3B | ✅ all 4 stages | ✅ 4-step DMD |
| HunyuanVideo 1.5 | MMDiT | 8B | ✅ all 4 stages | ✅ 4-step DMD |
Both lines share the same trainer / loss / dataset abstractions, so adding a third backbone is structurally a wrapper-and-config exercise.
The forthcoming `worldplay-finetune` entry will let you start from an already-trained video world model and adapt it to new conditions, scenes, or resolutions, without rerunning the 4-stage pipeline from scratch.
We aim to support both multiple condition types and multiple injection methods, mixable along either axis.
Condition types:
- ✅ Camera pose
- 🚧 Human pose [TBD]

Injection methods:
- ✅ ProPE
- 🚧 Latent concat [TBD]
- 🚧 Cross-attention [TBD]
We are packaging our project experience across the Causal Forcing (CF) / Causal Forcing++ (CF++) pipeline as Claude skills, so that an LLM assistant can help users debug failures and integrate new models without reverse-engineering the whole repo.
- `debug-world-model`: collected failure modes from the training pipeline (loss NaN, frame-to-frame jitter, camera drift, memory attenuation, distillation collapse, ...). Claude diagnoses likely root causes from your symptoms instead of guessing.
- `integrate-new-backbone`: step-by-step recipe for plugging a new video DiT into minWM, grounded in the HunyuanVideo and Wan reference integrations, e.g. "look at how HY does teacher forcing here, do the same for your model there".
onboarding-world-model
A third Claude skill aimed at researchers entering the world-model space for the first time. Two parts:
- Foundations: the minimal background to follow the pipeline, namely Teacher Forcing for AR diffusion training and Causal Forcing & Causal Forcing++ for AR diffusion distillation.
- Pitfalls: the non-obvious mistakes we hit while building minWM, distilled so you don't repeat them.
Intended audience: graduate students, independent researchers, and junior labs that want to enter the world-model space without spending three months reverse-engineering existing repos.
conda create -n minwm python=3.10 -y
conda activate minwm
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
export PYTHONPATH="$PWD/HY15:$PWD/Wan21:$PWD/shared:$PYTHONPATH"
Model Checkpoints
All weights live under ./ckpts/ after download.
| Checkpoint | Backbone | Stage | Use case | Download |
|---|---|---|---|---|
| `HunyuanVideo-1.5` (base) | HY 1.5 | - | Required by both HY pipelines | HF |
| `Wan2.1-T2V-1.3B` (base) | Wan 2.1 | - | Required by Wan pipeline | HF |
| `HY15/Action2V/bidirectional` | HY 1.5 | Phase 1 SFT | Starting point for HY Action2V Phase 2 | HF |
| `HY15/Action2V/ar_diffusion_tf` | HY 1.5 | Phase 2 Stage 1 | Teacher Forcing AR diffusion | HF |
| `HY15/Action2V/causal_ode` | HY 1.5 | Phase 2 Stage 2a (proposed in Causal Forcing) | DMD initialization | HF |
| `HY15/Action2V/causal_cd` | HY 1.5 | Phase 2 Stage 2b (proposed in Causal Forcing++) | DMD initialization | HF |
| `HY15/Action2V/dmd` | HY 1.5 | Phase 2 Stage 3 | 4-step real-time inference | HF |
| `HY15/TI2V/{bidirectional,ar_diffusion_tf,causal_ode,causal_cd,dmd}` | HY 1.5 | Same 4 stages, TI2V variant | TI2V pipeline | HF |
| `Wan21/Action2V/{bidirectional,ar_diffusion_tf,causal_ode,causal_cd,dmd}` | Wan 2.1 | Same 4 stages | Wan pipeline | HF |
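Two rows in the table use shell-style brace shorthand for five sibling directories. As a reading aid, the sketch below expands that shorthand into concrete paths under `./ckpts/`. `expand_braces` and `checkpoint_dirs` are hypothetical helpers written for this README, handle a single non-nested `{a,b,c}` group only, and are not part of minWM.

```python
import re
from typing import List


def expand_braces(pattern: str) -> List[str]:
    """Expand one non-nested {a,b,c} group, like shell brace expansion."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        # No brace group: the pattern already names a single directory.
        return [pattern]
    head, tail = pattern[: match.start()], pattern[match.end():]
    return [head + option + tail for option in match.group(1).split(",")]


def checkpoint_dirs(pattern: str, root: str = "./ckpts") -> List[str]:
    """Prefix each expanded entry with the local checkpoint root."""
    return [f"{root}/{relative}" for relative in expand_braces(pattern)]


if __name__ == "__main__":
    row = "Wan21/Action2V/{bidirectional,ar_diffusion_tf,causal_ode,causal_cd,dmd}"
    for directory in checkpoint_dirs(row):
        print(directory)
```

The same expansion applies to the `HY15/TI2V/{...}` row; each resulting directory corresponds to one stage checkpoint.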
The fastest path: install → download three DMD checkpoints → run three demo commands. Full reproduction (all 4 training stages × 3 model lines) is in § Data & Training & Reproduction.
# Wan base (T2V-1.3B)
hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./ckpts/Wan2.1-T2V-1.3B
# Code hardcodes the load path; create a symlink.
mkdir -p Wan21/wan_models
ln -s "$(realpath ./ckpts/Wan2.1-T2V-1.3B)" Wan21/wan_models/Wan2.1-T2V-1.3B
# HY base + text/vision encoders (required by HY pipelines)
hf download tencent/HunyuanVideo-1.5 --local-dir ./ckpts/HunyuanVideo-1.5 \
--include "vae/*" "scheduler/*" "transformer/480p_i2v/*"
hf download Qwen/Qwen2.5-VL-7B-Instruct --local-dir ./ckpts/HunyuanVideo-1.5/text_encoder/llm
hf download google/byt5-small --local-dir ./ckpts/HunyuanVideo-1.5/text_encoder/byt5-small
modelscope download --model AI-ModelScope/Glyph-SDXL-v2 \
--local_dir ./ckpts/HunyuanVideo-1.5/text_encoder/Glyph-SDXL-v2
hf download black-forest-labs/FLUX.1-Redux-dev \
--local-dir ./ckpts/HunyuanVideo-1.5/vision_encoder/siglip --token <your_hf_token>
# 4-step DMD checkpoints
## Wan Action2V (DMD, 4-step)
hf download MIN-Lab/minWM --local-dir ./ckpts \
--include "Wan21/Action2V/dmd/*"
## HY Action2V (DMD, 4-step, worldplay teacher)
hf download MIN-Lab/minWM --local-dir ./ckpts \
--include "HY15/Action2V/dmd/*"
# HY Action2V (DMD, 4-step, our bidirectional teacher)
# hf download MIN-Lab/minWM --local-dir ./ckpts \
# --include "HY15/Action2V/dmd_ourbi/*"
## HY TI2V (DMD, 4-step)
hf download MIN-Lab/minWM --local-dir ./ckpts \
--include "HY15/TI2V/dmd/*"
# 2.1 Wan Action2V (4-step DMD, camera control)
OUTPUT_FOLDER=./outputs/quickstart_wan_action2v \
TRAJECTORY_PATH="Wan21/prompts/trajectories.txt" \
bash Wan21/scripts/inference/run_infer_causal_camera.sh
# 2.2 HY Action2V (4-step DMD, camera control)
TRANSFORMER_DIR=./ckpts/HY15/Action2V/dmd \
OUTPUT_DIR=./outputs/quickstart_hy_action2v \
bash HY15/scripts/inference/run_infer_causal_camera.sh
# 2.3 HY TI2V (4-step DMD)
TRANSFORMER_DIR=./ckpts/HY15/TI2V/dmd \
OUTPUT_DIR=./outputs/quickstart_hy_ti2v \
bash HY15/scripts/inference/run_infer_causal.sh
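Before launching the demo commands above, it can help to confirm that the downloads landed where the scripts expect them. The following is a minimal preflight sketch, assuming the directory layout implied by the download commands in this section. `REQUIRED` and `missing_dirs` are illustrative names, not part of minWM, and the real scripts may look for more (or different) files.

```python
from pathlib import Path
from typing import List, Sequence

# Directories implied by the Quick Start download commands (an assumption:
# `hf download --local-dir` keeps the repository's relative paths).
REQUIRED: List[str] = [
    "ckpts/Wan2.1-T2V-1.3B",
    "Wan21/wan_models/Wan2.1-T2V-1.3B",  # symlink created with `ln -s`
    "ckpts/HunyuanVideo-1.5/vae",
    "ckpts/HunyuanVideo-1.5/scheduler",
    "ckpts/HunyuanVideo-1.5/transformer/480p_i2v",
    "ckpts/HunyuanVideo-1.5/text_encoder/llm",
    "ckpts/HunyuanVideo-1.5/text_encoder/byt5-small",
    "ckpts/HunyuanVideo-1.5/text_encoder/Glyph-SDXL-v2",
    "ckpts/HunyuanVideo-1.5/vision_encoder/siglip",
    "ckpts/Wan21/Action2V/dmd",
    "ckpts/HY15/Action2V/dmd",
    "ckpts/HY15/TI2V/dmd",
]


def missing_dirs(repo_root: str = ".", required: Sequence[str] = REQUIRED) -> List[str]:
    """Return the required directories that do not exist under repo_root.

    Path.is_dir() follows symlinks, so the Wan symlink counts as present
    only if its target directory exists.
    """
    root = Path(repo_root)
    return [relative for relative in required if not (root / relative).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:")
        for relative in missing:
            print(f"  {relative}")
    else:
        print("All Quick Start directories are present.")
```

Run it from the repository root; an empty result means the three demo commands should at least find their checkpoints.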
Camera control. For HY Action2V, trajectories are read per-sample from `assets/example.json` under the `"trajectory"` field. Format: `w`/`s`/`a`/`d` keys with `*N` repeats, in comma-separated segments, e.g. `"a*4,w*8,s*7"`.
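The pose-string format can be made concrete with a small parser. This is a sketch based only on the description above, not minWM's own parser: `parse_trajectory` and `expand_trajectory` are hypothetical names, a bare key without `*N` is assumed to mean one repeat, and the mapping from each key to an actual camera motion is left out because this README does not specify it.

```python
import re
from typing import List, Tuple

# One segment: a single w/a/s/d key, optionally followed by "*N".
_SEGMENT = re.compile(r"^([wasd])(?:\*(\d+))?$")


def parse_trajectory(pose: str) -> List[Tuple[str, int]]:
    """Parse "a*4,w*8,s*7" into [("a", 4), ("w", 8), ("s", 7)]."""
    segments: List[Tuple[str, int]] = []
    for raw in pose.split(","):
        token = raw.strip()
        match = _SEGMENT.match(token)
        if match is None:
            raise ValueError(f"invalid trajectory segment: {token!r}")
        key = match.group(1)
        # Assumption: a key with no "*N" suffix counts as one repeat.
        count = int(match.group(2)) if match.group(2) else 1
        if count < 1:
            raise ValueError(f"repeat count must be positive: {token!r}")
        segments.append((key, count))
    return segments


def expand_trajectory(pose: str) -> List[str]:
    """Flatten the segments into one key per repeat."""
    return [key for key, count in parse_trajectory(pose) for _ in range(count)]


if __name__ == "__main__":
    print(parse_trajectory("a*4,w*8,s*7"))
    print(len(expand_trajectory("a*4,w*8,s*7")))
```

For the example string, the segments sum to 4 + 8 + 7 = 19 key repeats. Validating the string up front like this surfaces typos before a multi-GPU inference job is launched.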
Three model lines × two phases × four stages, each documented as (1) Model download → (2) Data preparation → (3) Training script → (4) Validation. Full reproduction guides are split by backbone:
- `training_wan.md`
  - Wan Action2V (Wan 2.1 backbone)
- `training_hunyuan.md`
  - HY Action2V (HY1.5-8B backbone)
  - HY TI2V (HY1.5-8B backbone)
If minWM helps your research, please cite:
# ICML 2026
@article{zhu2026causal,
title={Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation},
author={Zhu, Hongzhou and Zhao, Min and He, Guande and Su, Hang and Li, Chongxuan and Zhu, Jun},
journal={arXiv preprint arXiv:2602.02214},
year={2026}
}
# Technical Report
@article{zhao2026causal,
title={Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation},
author={Zhao, Min and Zhu, Hongzhou and Zheng, Kaiwen and Zhou, Zihan and Yan, Bokai and Li, Xinyuan and Yang, Xiao and Li, Chongxuan and Zhu, Jun},
journal={arXiv preprint arXiv:2605.15141},
year={2026}
}
# Technical Report
@article{zhao2026minwm,
title={minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models},
author={Zhao, Min and Zhu, Hongzhou and Yan, Bokai and Zhou, Zihan and Chen, Yimin and Sun, Wenqiang and Zheng, Kaiwen and He, Guande and Yang, Xiao and Li, Chongxuan and others},
journal={arXiv preprint arXiv:2605.30263},
year={2026}
}
For questions, suggestions, or collaboration, please open a GitHub issue or contact: gracezhao1997@gmail.com.
minWM stands on the shoulders of giants. We thank the authors and maintainers of HunyuanVideo 1.5, HY-WorldPlay, Wan 2.1, Causal-Forcing, and FastVideo for their open-source contributions, which made this framework possible.