Seil Kang¹ · Woojung Han¹ · Junhyeok Kim¹ · Jinyeong Kim¹ · Youngeun Kim² · Seong Jae Hwang¹
¹Yonsei University ²Amazon
Existing attribution methods force a choice between faithfulness and efficiency. In that tradeoff, vSTREAM sits in the top-right region: faithful and real-time.
▶ Live animated demo on the project page
vSTREAM is a real-time visual attribution method for multimodal reasoning models. It explains which image regions ground each step of a thinking trace — while the model is still generating.
- Faithful without extra passes. A linear estimator predicts counterfactual region-ablation effects from attention features that are already computed during generation. No extra backward passes, no repeated inference.
- Streams as the model thinks. Attribution runs asynchronously in a background worker, so users can watch attributions appear span-by-span rather than waiting until generation finishes.
- One estimator, five tasks, four models. Trained once in ~4.5 hours on a single GPU with 2,000 examples, the estimator reaches faithfulness comparable to gradient- and perturbation-based baselines across five task families and four thinking VLMs.
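The streaming behavior described above can be sketched with a standard producer-consumer queue. This is a minimal illustration, not the reference implementation: the worker function, span strings, and placeholder scoring are all hypothetical stand-ins for the real attention-pooling and estimator logic.

```python
import queue
import threading

def attribution_worker(spans, results):
    """Consume finished thinking spans; emit one attribution per span."""
    while True:
        span = spans.get()
        if span is None:              # sentinel: generation is done
            results.put(None)
            break
        # Placeholder scoring; the real worker would pool cached cross-attention
        # for this span and apply the trained linear estimator.
        results.put((span, f"attribution for {span!r}"))

spans, results = queue.Queue(), queue.Queue()
threading.Thread(target=attribution_worker, args=(spans, results), daemon=True).start()

for s in ["span-1", "span-2"]:        # producer: the generation loop
    spans.put(s)
spans.put(None)

streamed = []
while (item := results.get()) is not None:
    streamed.append(item)
```

Because the worker runs in its own thread, the generation loop never blocks on attribution; finished spans are scored as soon as they are enqueued.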
vSTREAM decomposes attribution into three stages:
- Semantic Region Unitization. DINOv3 features partition the image into K ∈ [16, 128] semantically coherent regions via agglomerative clustering with Ward's linkage. No external segmentation masks required.
- Attention Feature Extraction. For each thinking span S and region Rₖ, mean-pool cross-attention across all layers and heads to form a feature vector f ∈ ℝ^(L·H). Since attention is already computed during generation, extraction cost is negligible.
- Amortized Estimator & Streaming. A linear estimator with L·H parameters maps attention features to counterfactual ablation effects. Trained once on 2,000 examples with a Pearson-correlation loss. At inference, attribution runs asynchronously via a producer-consumer queue, adding near-zero latency.
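The region-unitization stage can be sketched with off-the-shelf Ward-linkage clustering. This is an assumption-laden toy: the random features stand in for real DINOv3 patch embeddings, the function name is hypothetical, and the actual implementation may add constraints (e.g. spatial connectivity) not shown here.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def unitize_regions(patch_feats: np.ndarray, n_regions: int) -> np.ndarray:
    """Group patch embeddings into semantically coherent regions.

    patch_feats: (num_patches, dim) array, e.g. DINOv3 patch features.
    Returns one integer region label per patch.
    """
    clustering = AgglomerativeClustering(n_clusters=n_regions, linkage="ward")
    return clustering.fit_predict(patch_feats)

# Toy stand-in for DINOv3 features: a 16x16 patch grid with 64-dim embeddings.
feats = np.random.default_rng(0).normal(size=(256, 64))
labels = unitize_regions(feats, n_regions=16)
```

Reshaping `labels` back to the 16×16 patch grid recovers a region map over the image; `n_regions` plays the role of K ∈ [16, 128].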
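The attention-feature stage reduces to a single pooled tensor slice. A minimal sketch, assuming the cross-attention weights have already been cached during generation with shape (layers, heads, text tokens, image tokens); the function name and toy shapes are illustrative.

```python
import numpy as np

def span_region_feature(attn, span_tokens, region_tokens):
    """Mean-pool cached cross-attention into one feature per (layer, head).

    attn: (L, H, T_text, T_img) cross-attention weights from generation.
    span_tokens: text-token indices of thinking span S.
    region_tokens: image-token indices of region R_k.
    Returns f with shape (L * H,).
    """
    sub = attn[:, :, span_tokens][:, :, :, region_tokens]  # (L, H, |S|, |R_k|)
    return sub.mean(axis=(2, 3)).reshape(-1)

# Toy check: 4 layers, 8 heads, 10 text tokens, 64 image tokens.
attn = np.random.default_rng(1).random((4, 8, 10, 64))
f = span_region_feature(attn, span_tokens=[2, 3, 4], region_tokens=list(range(16)))
```

Since the attention weights are a by-product of generation, this pooling is the only extra work per (span, region) pair.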
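The estimator itself is a single weight vector over the L·H pooled features, trained with a Pearson-correlation loss. The class and loss below are a sketch under those stated assumptions, not the released code; the training loop over the 2,000 examples is omitted.

```python
import numpy as np

def pearson_loss(pred, target, eps=1e-8):
    """1 - Pearson correlation between predicted and measured ablation effects."""
    p, t = pred - pred.mean(), target - target.mean()
    r = (p * t).sum() / (np.sqrt((p * p).sum() * (t * t).sum()) + eps)
    return 1.0 - r

class LinearEstimator:
    """Maps pooled attention features f in R^(L*H) to a scalar ablation effect."""
    def __init__(self, dim, seed=0):
        self.w = np.random.default_rng(seed).normal(scale=0.01, size=dim)

    def __call__(self, feats):        # feats: (N, L*H)
        return feats @ self.w

est = LinearEstimator(dim=32)
feats = np.random.default_rng(2).random((8, 32))
scores = est(feats)
# Pearson correlation is invariant to positive affine rescaling, so a
# rescaled copy of the predictions incurs (near-)zero loss.
rescaled = 3.0 * scores + 1.0
```

A correlation loss is a natural fit here: ranking regions by ablation effect matters more than matching the effects' absolute scale.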
The reference implementation is being packaged for public release and will land here shortly. In the meantime, the paper and the live project page cover the method and results in full. Expected first drop:
- `vstream/` – core estimator, attention hooks, region clustering
- `scripts/train.py` – train the linear estimator (~4.5 h on a single GPU)
- `scripts/stream.py` – run streaming attribution against a thinking VLM
- `checkpoints/` – pretrained estimators for supported backbones
- `examples/` – notebooks reproducing figures from the paper
If vSTREAM is useful in your research, please cite:
@misc{kang2026vstream,
title = {Real-Time Visual Attribution Streaming in Thinking Models},
author = {Kang, Seil and Han, Woojung and Kim, Junhyeok and Kim, Jinyeong and Kim, Youngeun and Hwang, Seong Jae},
year = {2026},
eprint = {2604.16587},
archivePrefix = {arXiv},
doi = {10.48550/arXiv.2604.16587}
}

Released under the MIT License.
