OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanations

Tech Report · Project Page · Model Weights · License


Overview

OneVL is a Vision-Language-Action (VLA) framework for autonomous driving that achieves state-of-the-art trajectory prediction accuracy with inference latency matching answer-only autoregressive (AR) models. It overcomes the fundamental limitations of prior latent Chain-of-Thought (CoT) methods by introducing dual-modal auxiliary decoders that supervise compact latent tokens to encode both linguistic reasoning and future scene dynamics.

Three CoT Paradigms

Comparison of three CoT paradigms

(a) Explicit CoT generates a full reasoning chain before the answer — interpretable but slow. (b) Implicit CoT compresses reasoning into opaque latent vectors — fast but not interpretable. (c) OneVL (ours) uses visual latent tokens v and language latent tokens l; during training, dual auxiliary decoders decode these into future frames and CoT text respectively. At inference, decoders are discarded and latents are prefilled into the prompt — matching the speed of (b) while recovering the interpretability of (a) in both vision and language.

Architecture

OneVL architecture

During training, hidden states at visual latent positions are routed to the Visual Aux. Decoder (predicts future-frame visual tokens at t+0.5s and t+1.0s) and at language latent positions to the Language Aux. Decoder (reconstructs CoT text). Both decoders are discarded at inference; all latent tokens are prefilled into the prompt, matching answer-only AR prediction latency.

OneVL augments Qwen3-VL-4B-Instruct with:

  • Latent Token Interface — 4 visual latent tokens + 2 language latent tokens placed in the assistant response before the answer, using existing vocabulary tokens (no new special tokens).
  • Visual Auxiliary Decoder — Predicts future-frame visual tokens at t+0.5s and t+1.0s from visual latent hidden states (Emu3.5 IBQ, 131k codebook), acting as a world model supervision signal.
  • Language Auxiliary Decoder — Reconstructs explicit CoT reasoning text from language latent hidden states, conditioned on ViT visual features.
  • Prefill Inference — Both decoders are discarded at inference; all latent tokens are processed in one parallel pass, with only the trajectory generated autoregressively (a conceptual sketch of this layout follows the list).
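
The sketch below only illustrates the assistant-side layout described in the list above. The placeholder strings <v> and <l> are illustrative, not the tokens actually used (the released model reuses existing vocabulary tokens), and the answer prefix mirrors the --answer_prefix flag used in the inference examples later in this README:

NUM_LATENT_VIS = 4  # visual latent tokens
NUM_LATENT = 2      # language latent tokens

def build_assistant_prefix(vis_tok="<v>", lang_tok="<l>", answer_prefix="["):
    """Assistant-side prefix that is prefilled in a single parallel pass.

    During training, hidden states at the visual-latent positions supervise the
    visual auxiliary decoder and those at the language-latent positions supervise
    the language auxiliary decoder; at inference both decoders are dropped and
    only the trajectory after `answer_prefix` is generated autoregressively.
    """
    return vis_tok * NUM_LATENT_VIS + lang_tok * NUM_LATENT + answer_prefix

print(build_assistant_prefix())  # <v><v><v><v><l><l>[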

Key Innovations

  • Dual-Modal Auxiliary Decoders: A language auxiliary decoder reconstructs human-readable CoT reasoning from language latent tokens; a visual auxiliary decoder predicts future scene frames from visual latent tokens, acting as a world model that grounds the latents in physical scene dynamics.
  • Prefill Inference: All latent tokens are prefilled into the prompt context in a single parallel pass — 1.5× faster than explicit CoT on NAVSIM, 2.3× faster on ROADWork — with latency essentially identical to answer-only AR prediction.
  • Compression Drives Generalization: OneVL is the only latent CoT method that outperforms explicit autoregressive CoT across all four benchmarks.

Open-Source Status

| Component | Status |
|---|---|
| 📄 Technical Report | ✅ Released |
| ⚖️ Model Weights | ✅ Released |
| 🔍 Inference Code | ✅ Released |
| 🏋️ Training Code | 🔜 Coming Soon |

Results

Accuracy–Efficiency Pareto (NAVSIM & ROADWork)

Teaser: Accuracy-Efficiency Pareto across benchmarks

OneVL lands in the green-shaded optimal corner (lowest latency, best metric) on both benchmarks. All prior latent CoT methods (COCONUT, CODI, SIM-CoT) underperform even the AR Answer baseline on driving tasks — a critical failure that OneVL overcomes.

NAVSIM — Full Comparison

| Method | Model Size | PDM-score ↑ | Latency (s) ↓ | Interpretability |
|---|---|---|---|---|
| AdaThinkDrive | 8B | 86.20 | – | Language |
| LaST-VLA | 8B | 87.30 | – | – |
| AR Answer | 4B | 87.47 | 4.49 | – |
| AR CoT+Answer | 4B | 88.29 | 6.58 | Language |
| COCONUT | 4B | 84.84 | 5.93 | – |
| CODI | 4B | 83.92 | 8.62 | – |
| SIM-CoT | 4B | 84.21 | 10.86 | Language |
| OneVL | 4B | 88.84 | 4.46 | Vision + Language |

ROADWork — Full Comparison

| Method | ADE (px) ↓ | FDE (px) ↓ | Latency (s) ↓ | Interpretability |
|---|---|---|---|---|
| YNet | 22.68 | 80.78 | – | – |
| AR Answer | 15.98 | 40.29 | 4.74 | – |
| AR CoT+Answer | 13.18 | 29.98 | 10.74 | Language |
| COCONUT | 15.44 | 38.60 | 6.06 | – |
| CODI | 16.45 | 44.28 | 6.73 | – |
| SIM-CoT | 16.49 | 44.32 | 6.19 | Language |
| OneVL | 12.49 | 28.80 | 4.71 | Vision + Language |

Impromptu — Full Comparison

| Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ | Interpretability |
|---|---|---|---|---|
| Impromptu VLA | 1.60 | 4.28 | 6.10 | – |
| AR Answer | 1.46 | 4.03 | 4.24 | – |
| AR CoT+Answer | 1.42 | 3.96 | 6.84 | Language |
| COCONUT | 1.49 | 4.07 | 5.27 | – |
| CODI | 1.86 | 5.18 | 5.24 | – |
| SIM-CoT | 2.43 | 6.10 | 5.09 | Language |
| OneVL | 1.34 | 3.70 | 4.02 | Vision + Language |

APR1 — Full Comparison

| Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ | Interpretability |
|---|---|---|---|---|
| Cosmos-Reason | 2.86 | 7.42 | – | Language |
| AR Answer | 3.27 | 9.59 | 3.06 | – |
| AR CoT+Answer | 2.99 | 8.54 | 3.51 | Language |
| COCONUT | 3.29 | 9.48 | 3.76 | – |
| CODI | 3.22 | 9.25 | 3.85 | – |
| SIM-CoT | 3.40 | 9.85 | 3.78 | Language |
| OneVL | 2.62 | 7.53 | 3.26 | Vision + Language |

Text CoT Quality (NAVSIM)

| Method | Meta Action Acc. ↑ | STS Score ↑ | LLM Judge ↑ | Avg. ↑ | Latency (s) ↓ |
|---|---|---|---|---|---|
| AR CoT+Answer | 73.20 | 79.75 | 81.86 | 78.27 | 6.58 |
| SIM-CoT | 67.20 | 76.25 | 78.73 | 74.06 | 10.86 |
| OneVL (lang. aux.) | 71.00 | 78.26 | 79.13 | 76.13 | 4.46 |

OneVL's language auxiliary decoder recovers about 97% of explicit CoT quality (average 76.13 vs. 78.27 for AR CoT+Answer) while running at answer-only speed.

Ablation Study (NAVSIM PDM-score)

| Model Variant | Lang. Aux. Dec. | Vis. Aux. Dec. | Staged Train | PDM-score ↑ |
|---|---|---|---|---|
| OneVL w/o vis. dec. | ✓ | ✗ | ✓ | 87.97 |
| OneVL w/o lang. dec. | ✗ | ✓ | ✓ | 88.53 |
| OneVL w/o staged train | ✓ | ✓ | ✗ | 67.13 |
| OneVL (full) | ✓ | ✓ | ✓ | 88.84 |

Both auxiliary decoders contribute measurably; staged training is essential (without it, performance collapses to 67.13).


Qualitative Examples

NAVSIM

NAVSIM qualitative example

Each plot overlays ground-truth (green) and predicted (red) trajectories on the front camera view, along with predicted future frames at t+0.5s and t+1.0s decoded from the visual auxiliary decoder, and the language CoT from the language auxiliary decoder.

ROADWork (Construction Zone Navigation)

ROADWork qualitative example

Environment Setup

Requirements: Python 3.10+, CUDA GPU (≥16 GB VRAM recommended for inference with aux decoders).

# 1. Create and activate virtual environment
uv venv venv/onevl --python 3.12
source venv/onevl/bin/activate

# 2. Install dependencies
uv pip install -r requirements.txt

Core packages (requirements.txt):

torch==2.10.0
torchvision==0.25.0
transformers==4.57.0
safetensors==0.7.0
Pillow>=10.0.0
omegaconf>=2.3.0
einops>=0.7.0
numpy>=1.24.0

Note: transformers ≥ 4.57.0 is required for Qwen3VLForConditionalGeneration support.
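
As a quick sanity check that the installed transformers version can load the Qwen3-VL backbone, a minimal loading snippet is shown below. The checkpoint path is a placeholder, and whether the OneVL checkpoint (with its auxiliary modules) loads directly through from_pretrained depends on how the repo packages it; infer_onevl.py remains the supported entry point.

import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_path = "/path/to/OneVL-checkpoint"  # placeholder
processor = AutoProcessor.from_pretrained(model_path)
model = Qwen3VLForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.bfloat16)
model.to("cuda:0").eval()
print(f"Loaded {type(model).__name__} with {sum(p.numel() for p in model.parameters())/1e9:.1f}B parameters")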


Inference

Quick Start (Single GPU)

source venv/onevl/bin/activate

# Trajectory prediction only (fastest, prefill inference)
python infer_onevl.py \
    --model_path /path/to/OneVL-checkpoint \
    --test_set_path test_data/navsim_test.json \
    --image_base_path "" \
    --output_path output/navsim/results.json \
    --device cuda:0 \
    --num_latent 2 --num_latent_vis 4 \
    --max_new_tokens 1024 --answer_prefix "[" --prefix_k 0

# With language explanation (text CoT from language aux decoder)
python infer_onevl.py \
    --model_path /path/to/OneVL-checkpoint \
    --test_set_path test_data/navsim_test.json \
    --image_base_path "" \
    --output_path output/navsim/results_explain.json \
    --device cuda:0 \
    --num_latent 2 --num_latent_vis 4 \
    --max_new_tokens 1024 --answer_prefix "[" --prefix_k 0 \
    --decoder_explain --aux_visual_condition \
    --c_thought 2 --max_explain_tokens 1024

# With both language + visual explanation (text CoT + future frame tokens)
python infer_onevl.py \
    --model_path /path/to/OneVL-checkpoint \
    --test_set_path test_data/navsim_test.json \
    --image_base_path "" \
    --output_path output/navsim/results_explain.json \
    --device cuda:0 \
    --num_latent 2 --num_latent_vis 4 \
    --max_new_tokens 1024 --answer_prefix "[" --prefix_k 0 \
    --decoder_explain --aux_visual_condition \
    --c_thought 2 --max_explain_tokens 1024 \
    --visual_decoder_explain --visual_aux_visual_condition \
    --c_thought_visual 4 --max_visual_tokens 2560

Multi-GPU Inference (recommended for full test sets)

export MODEL_PATH=/path/to/OneVL-checkpoint
export TEST_SET_PATH=test_data/navsim_test.json
export OUTPUT_PATH=output/navsim/navsim_results.json

bash run_infer.sh

The launcher auto-detects available GPUs, shards the test set, runs inference in parallel across all GPUs, and merges results.
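
The exact logic lives in run_infer.sh; the sketch below only illustrates the shard-and-merge pattern described above, under the assumptions that the test set is a JSON array and each per-shard result file is also a JSON array (flag names are taken from the Quick Start example):

import json, os, subprocess, torch

os.makedirs("output/navsim", exist_ok=True)
test_set = json.load(open("test_data/navsim_test.json"))
n_gpus = max(torch.cuda.device_count(), 1)

procs = []
for rank in range(n_gpus):
    shard_path = f"output/navsim/shard_{rank}.json"
    json.dump(test_set[rank::n_gpus], open(shard_path, "w"))  # round-robin shard
    procs.append(subprocess.Popen([
        "python", "infer_onevl.py",
        "--model_path", "/path/to/OneVL-checkpoint",
        "--test_set_path", shard_path,
        "--output_path", f"output/navsim/results_{rank}.json",
        "--device", f"cuda:{rank}",
        "--num_latent", "2", "--num_latent_vis", "4",
    ]))
for p in procs:
    p.wait()

merged = []
for rank in range(n_gpus):
    merged.extend(json.load(open(f"output/navsim/results_{rank}.json")))
json.dump(merged, open("output/navsim/navsim_results.json", "w"))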

Per-Benchmark Scripts

bash scripts/infer_navsim.sh       # NAVSIM
bash scripts/infer_ar1.sh          # APR1 (trajectory only)
bash scripts/infer_roadwork.sh     # ROADWork
bash scripts/infer_impromptu.sh    # Impromptu

For visual CoT / text CoT explanations:

bash scripts/infer_ar1_explain.sh  # language + visual explanations (APR1 shown as an example)

Evaluation

APR1, Impromptu, and ROADWork can be evaluated directly with the bundled evaluation script:

# APR1
python eval_results.py ar1 \
    --results_json output/ar1/ar1_results.json \
    --test_jsonl test_data/ar1_test.jsonl

# Impromptu
python eval_results.py impromptu \
    --results_json output/impromptu/impromptu_results.json \
    --test_jsonl test_data/impromptu_test.jsonl

# ROADWork
python eval_results.py roadwork \
    --json_path output/roadwork/roadwork_results.json
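
For reference, ADE and FDE as reported above are the mean and final-step L2 distances between predicted and ground-truth waypoints. The snippet below is a generic illustration of the two metrics, not the bundled eval_results.py implementation, which also handles benchmark-specific parsing and units (meters for APR1 and Impromptu, pixels for ROADWork):

import numpy as np

def ade_fde(pred, gt):
    """pred, gt: (T, 2) waypoint arrays in the same units (m or px)."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    dists = np.linalg.norm(pred - gt, axis=-1)  # per-waypoint L2 error
    return dists.mean(), dists[-1]              # ADE, FDE

ade, fde = ade_fde([[1.0, 0.0], [2.4, 0.2]], [[1.0, 0.0], [2.5, 0.1]])
print(f"ADE={ade:.3f}  FDE={fde:.3f}")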

NAVSIM uses the official NAVSIM evaluation pipeline. First convert OneVL inference results to the NAVSIM test format, then evaluate the converted file with the NAVSIM codebase:

python output/navsim/convert_to_eval.py \
    --input_path output/navsim/navsim_results.json \
    --ref_path output/navsim/navsim_results_eval.json \
    --output_path output/navsim/navsim_results_for_eval.json

Visualizing Future-Frame Predictions

After running inference with --visual_decoder_explain, the output JSON contains visual_decoder_explain fields encoding predicted future-frame visual tokens. Use the visualization script to decode them back to images:

source venv/onevl/bin/activate

python scripts/visualize_predict_image_tokens.py \
    --predict_json output/ar1_explain/ar1_results_explain.json \
    --out_dir output/ar1_explain_visualize \
    --model_root /path/to/emu35_model_root \
    -n 20 \
    --device cuda:0

Output layout per sample:

output/ar1_explain_visualize/
└── sample_0000/
    ├── input_00.jpg                  # original camera frame(s)
    ├── input_01.jpg
    ├── ...
    ├── decoded_from_tokens_00.png    # predicted future frame at t+0.5s
    ├── decoded_from_tokens_01.png    # predicted future frame at t+1.0s
    └── meta.json                     # CoT text + metadata

The script uses the self-contained vq_decoder/ module (bundled Emu3.5 IBQ VQ-VAE) — no external Emu3.5 repo dependency required.

--model_root must contain Emu3.5-VisionTokenizer/config.yaml and Emu3.5-VisionTokenizer/model.ckpt. Download from BAAI/Emu3.5-VisionTokenizer.
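
If you want a quick side-by-side view of a sample, the helper below stitches the input frames and decoded future frames into one strip using Pillow (already in requirements.txt); it assumes the file names shown in the output layout above.

from pathlib import Path
from PIL import Image

def make_strip(sample_dir, out_name="strip.jpg", height=360):
    """Concatenate input frames and decoded future frames horizontally."""
    d = Path(sample_dir)
    frames = sorted(d.glob("input_*.jpg")) + sorted(d.glob("decoded_from_tokens_*.png"))
    imgs = [Image.open(f).convert("RGB") for f in frames]
    imgs = [im.resize((round(im.width * height / im.height), height)) for im in imgs]
    strip = Image.new("RGB", (sum(im.width for im in imgs), height))
    x = 0
    for im in imgs:
        strip.paste(im, (x, 0))
        x += im.width
    strip.save(d / out_name)

make_strip("output/ar1_explain_visualize/sample_0000")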


Test Data Format

JSON array (NAVSIM, ROADWork)

[
  {
    "messages": [{"role": "user", "content": "<image>Based on the current image, predict ..."}],
    "images": ["path/to/frame.jpg"],
    "GT": "[[1.0, 0.0], [2.5, 0.1], ...]"
  }
]

JSONL (APR1, Impromptu)

One JSON object per line, same schema as above.
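
A minimal loader covering both layouts, assuming GT is stored as a list-literal string as in the example above (field names follow the schema shown; adjust if your test files differ):

import ast, json
from pathlib import Path

def load_test_set(path):
    """Load a JSON-array or JSONL test set and parse GT waypoints."""
    text = Path(path).read_text()
    if str(path).endswith(".jsonl"):
        samples = [json.loads(line) for line in text.splitlines() if line.strip()]
    else:
        samples = json.loads(text)
    for s in samples:
        s["gt_waypoints"] = ast.literal_eval(s["GT"])  # e.g. [[1.0, 0.0], [2.5, 0.1]]
    return samples

samples = load_test_set("test_data/navsim_test.json")
print(len(samples), samples[0]["images"], samples[0]["gt_waypoints"][:2])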


Environment variables accepted by all scripts:

| Variable | Default | Description |
|---|---|---|
| MODEL_PATH | (required) | Path to the OneVL checkpoint |
| TEST_SET_PATH | (required) | Test JSON / JSONL file |
| OUTPUT_PATH | <MODEL_PATH>/infer_results/onevl_merged.json | Where to write merged results |
| IMAGE_BASE_PATH | "" | Prepended to relative image paths |
| NUM_LATENT | 2 | Number of language latent tokens |
| NUM_LATENT_VIS | 4 | Number of visual latent tokens |
| MAX_NEW_TOKENS | 1024 | Max answer tokens to generate |
| ANSWER_PREFIX | "" | Prefix after <answer> (e.g. [ for NAVSIM, [[ for APR1) |
| PREFIX_K | 0 | Prefill first K GT waypoints after <answer>; only used on ROADWork |
| DECODER_EXPLAIN | false | Enable language auxiliary decoder |
| AUX_VISUAL_CONDITION | true | (if DECODER_EXPLAIN=true) Condition language aux decoder on ViT features (--aux_visual_condition) |
| C_THOUGHT | 2 | (if DECODER_EXPLAIN=true) Number of latent tokens read by language aux decoder |
| MAX_EXPLAIN_TOKENS | 1024 | (if DECODER_EXPLAIN=true) Max tokens generated by language aux decoder |
| VISUAL_DECODER_EXPLAIN | false | Enable visual auxiliary decoder |
| VISUAL_AUX_VISUAL_CONDITION | true | (if VISUAL_DECODER_EXPLAIN=true) Condition visual aux decoder on ViT features (--visual_aux_visual_condition) |
| C_THOUGHT_VISUAL | 4 | (if VISUAL_DECODER_EXPLAIN=true) Number of latent tokens read by visual aux decoder |
| MAX_VISUAL_TOKENS | 2560 | (if VISUAL_DECODER_EXPLAIN=true) Max visual tokens generated by visual aux decoder |

Citation

If you find this work useful, please cite:

@article{lu2026onevl,
  title={OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation},
  author={Lu, Jinghui and Guan, Jiayi and Huang, Zhijian and Li, Jinlong and Li, Guang and Kong, Lingdong and Li, Yingyan and Wang, Han and Xu, Shaoqing and Luo, Yuechen and others},
  journal={arXiv preprint arXiv:2604.18486},
  year={2026},
  url={https://arxiv.org/abs/2604.18486}
}

License

This project is released under the Apache 2.0 License.

Model weights are built on Qwen3-VL-4B-Instruct and the visual tokenizer is from Emu3.5-VisionTokenizer; please refer to their respective licenses as well.


Acknowledgements
