TL;DR: We identify critical gaps in video token pruning and advance the two predominant paradigms: saliency-based token selection (🔑 signal: attention weight) and diversity-oriented token merging (🔑 signal: cosine similarity).
Main contributions: We analyze the characteristics of pivotal visual signals and rethink how to utilize them more effectively (More details and findings in our paper).
(Left) Motivation of our method. (Right) Method overview.
- Attention Weight: The distribution is multi-modal and long-tailed, which a vanilla Top-k strategy fails to capture accurately.
  ➡️ Our approach: expand the candidate set (to cover the tail) and perform intra-cluster selection (to cover the diverse modes).
- Cosine Similarity: Direct similarity-based clustering often creates fragmented clusters, leading to noisy representations after average pooling.
  ➡️ Our approach: inject a spatio-temporal locality prior (for smoothness) using our proposed ST-RoPE.
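To make the first idea concrete, here is a small illustrative sketch (ours, not the paper's implementation — `diverse_topk`, the expansion factor, and the use of plain k-means are all assumptions): instead of a global Top-k over attention weights, widen the candidate pool first, then pick the highest-scoring tokens within each feature cluster.

```python
import numpy as np

def diverse_topk(scores, feats, k, expand=2, n_clusters=4, iters=10, seed=0):
    """Illustrative sketch: widen the Top-k candidate pool (cover the
    long tail), then select per cluster (cover multiple modes)."""
    n = len(scores)
    # 1) Expanded candidate set: top (expand * k) tokens by attention weight.
    cand = np.argsort(scores)[::-1][: min(expand * k, n)]
    # 2) Crude k-means over candidate features (stand-in for any clustering).
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(cand, size=n_clusters, replace=False)]
    for _ in range(iters):
        dist = ((feats[cand, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for c in range(n_clusters):
            members = cand[assign == c]
            if len(members):
                centers[c] = feats[members].mean(0)
    # 3) Round-robin pick of the highest-scoring tokens across clusters,
    #    so every mode contributes before any single mode dominates.
    per_cluster = [sorted(cand[assign == c], key=lambda i: -scores[i])
                   for c in range(n_clusters)]
    keep = []
    while len(keep) < k:
        for q in per_cluster:
            if q and len(keep) < k:
                keep.append(q.pop(0))
    return np.array(sorted(keep))
```

A plain `np.argsort(scores)[-k:]` would concentrate on the dominant mode; the round-robin step trades a little raw saliency for coverage of the distribution's other modes.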
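For the second idea, ST-RoPE is the paper's proposed component; the sketch below only illustrates the general mechanism it builds on as we understand it (a hypothetical 3-D rotary embedding — the channel split and function names are our assumptions, not the paper's code). Rotating token features by their (t, h, w) coordinates makes cosine similarity decay with spatio-temporal distance, so similarity-based clustering prefers local, smooth groups.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1-D rotary embedding over an even number of channels."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # (d/2,)
    ang = pos[:, None] * freqs[None, :]         # (N, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def st_rope(x, t, h, w):
    """Hypothetical spatio-temporal RoPE: split channels into three even
    groups and rotate each by one axis of the token's (t, h, w) position."""
    d = (x.shape[-1] // 3) & ~1                 # even-sized chunk per axis
    parts = [rope_1d(x[:, i * d:(i + 1) * d], pos)
             for i, pos in enumerate((t, h, w))]
    return np.concatenate(parts + [x[:, 3 * d:]], axis=1)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Demo: identical features at co-located vs. distant (t, h, w) positions.
feat = np.ones((2, 12))
zeros = np.zeros(2)
near = st_rope(feat, t=zeros, h=zeros, w=zeros)
far = st_rope(feat, t=np.array([0.0, 8.0]), h=np.array([0.0, 8.0]), w=zeros)
# cosine(near[0], near[1]) == 1.0, while cosine(far[0], far[1]) < 1:
# same content, distant location -> lower similarity, smoother clusters.
```

Because rotation preserves norms, the prior only reweights similarity by relative position; it does not distort the features' magnitudes.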
- Create a conda virtual environment and install the required packages.

```shell
conda create -n Tango python=3.10
conda activate Tango
pip install -r requirements.txt
```

- Install Flash Attention 2.

```shell
pip install -U flash-attn --no-build-isolation
```

- Install the evaluation frameworks.

```shell
# For main performance evaluation
pip install -e ./VLMEvalKit
# For efficiency analysis
pip install -e ./lmms-eval
```

We adopt the VLMEvalKit framework for performance evaluation, with retention ratios in {0.1, 0.15, 0.2}.
We currently support the LLaVA-OneVision-7B, LLaVA-Video-7B, and Qwen2.5-VL-7B models.

```shell
cd VLMEvalKit/
# Evaluate on LLaVA-OneVision-7B
bash run_eval_ov.sh
# Evaluate on LLaVA-Video-7B (w/ intra-LLM pruning)
bash run_eval_video.sh
# Evaluate on Qwen2.5-VL-7B
bash run_eval_qwen.sh
```

We adopt the lmms-eval framework for efficiency profiling.
Here is a sample script for evaluation under a retention ratio of 0.1.

```shell
WRAPPER=tango accelerate launch --num_processes=8 \
    -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,attn_implementation=flash_attention_2 \
    --tasks videomme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_onevision \
    --output_path ./logs/
```

For reference, our results with 8 NVIDIA A800 GPUs are:
| Metric | Value |
|---|---|
| Total_runtime (s) | 1315.02 |
| Total_GPU_runtime (s) | 198.31 |
| Peak_mem (GB) | 18.61 |
| Avg_ViT_Time (ms) | 335.57 |
| Avg_Other_Time (ms) | 146.59 |
| Avg_LLM_Prefill_Time (ms) | 80.64 |
| Avg_Total_TTFT (ms) | 562.80 |
| Avg_Decoding_Throughput (token/s) | 83.66 |
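The per-stage averages are internally consistent: within rounding, `Avg_Total_TTFT` is the sum of the ViT, Other, and LLM-prefill stage averages, which can be checked directly (values copied from the table above, in ms):

```python
# Sanity check: Avg_Total_TTFT decomposes into the three profiled stages.
avg_vit, avg_other, avg_prefill = 335.57, 146.59, 80.64
avg_ttft = avg_vit + avg_other + avg_prefill
assert abs(avg_ttft - 562.80) < 1e-6  # matches the reported Avg_Total_TTFT
```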
- Sparrow: An efficient training scheme for video LLMs.
- Awesome-MLLM: A project keeping track of new papers and the latest developments in the field of MLLMs.
- Inspiring works with open-sourced implementation: VisionZip, FastVID, HoliTom.
- Our efficiency profiling implementation builds upon VidCom2.
If you find our project useful, please consider citing our paper:
```bibtex
@article{yin2026tango,
  title={Tango: Taming Visual Signals for Efficient Video Large Language Models},
  author={Yin, Shukang and Zhao, Sirui and Wang, Hanchao and Jia, Baozhi and Wang, Xianquan and Fu, Chaoyou and Chen, Enhong},
  journal={arXiv preprint arXiv:2604.09547},
  year={2026}
}
```