Hesong Wang1,2,3,🌟, Xin Jin2,🌟, Lu Lu3,✉️, Chenhaowen Li3, Jian Chen3, Qiang Liu3, Huan Wang2,✉️
1Zhejiang University, 2Westlake University, 3Alibaba Cloud Computing
🌟 Equal Contribution ✉️ Corresponding Author
[2026/5/17] We release the EarlyTom code.
EarlyTom is a training-free token compression method for video large language models (Video-LLMs). It reduces the number of visual tokens by leveraging early-layer attention signals to identify and prune redundant tokens before they propagate through the full model, significantly reducing computation while preserving performance.
EarlyTom supports two complementary compression strategies:
- Outer compression: Prunes redundant visual tokens at the vision encoder output, guided by attention weights from early transformer layers.
- Inner compression (optional): Further merges tokens inside the LLM backbone at specified layers using a DPC-KNN clustering approach.
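The two strategies above can be illustrated with a minimal, dependency-free sketch. This is not the EarlyTom implementation: the function names `prune_by_attention` and `dpc_knn_merge` are hypothetical, tokens are plain Python lists rather than tensors, and the scores are assumed to be one precomputed attention weight per visual token. The first function keeps the top-scoring fraction of tokens; the second follows the generic DPC-KNN recipe (k-nearest-neighbour density, distance to a denser token, then nearest-centre assignment) and averages each cluster.

```python
import math


def prune_by_attention(tokens, scores, retain_ratio):
    """Keep the highest-scoring tokens and preserve their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < retain_ratio <= 1.0:
        raise ValueError("retain_ratio must be in (0, 1]")
    keep = max(1, round(len(tokens) * retain_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(ranked[:keep])  # restore original token order
    return [tokens[i] for i in kept_idx], kept_idx


def dpc_knn_merge(tokens, num_clusters, k):
    """Cluster token vectors with DPC-KNN and average each cluster."""
    n = len(tokens)
    if n < 2 or not 1 <= num_clusters <= n or k < 1:
        raise ValueError("need n >= 2, 1 <= num_clusters <= n, and k >= 1")
    dist = [[math.dist(a, b) for b in tokens] for a in tokens]

    # Local density: tokens whose k nearest neighbours are close get high rho.
    rho = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k]
        rho.append(math.exp(-sum(d * d for d in nearest) / len(nearest)))

    # delta: distance to the closest token of strictly higher density
    # (the densest tokens get their largest distance instead).
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))

    # Cluster centres are dense tokens that lie far from denser ones.
    centers = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)
    centers = sorted(centers[:num_clusters])

    # Assign every token to its nearest centre; a centre always joins itself.
    groups = {c: [] for c in centers}
    for i in range(n):
        owner = i if i in groups else min(centers, key=lambda c: dist[i][c])
        groups[owner].append(i)

    dim = len(tokens[0])
    return [
        [sum(tokens[m][d] for m in groups[c]) / len(groups[c]) for d in range(dim)]
        for c in centers
    ]
```

For example, `prune_by_attention(frame_tokens, attn_scores, 0.2)` would keep roughly one token in five, and `dpc_knn_merge(vectors, num_clusters=2, k=2)` would reduce a list of vectors to two averaged representatives. A real implementation would operate on batched tensors on the GPU.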
git clone https://github.com/viridisGreen/EarlyTom
cd EarlyTom

EarlyTom is built on top of LLaVA-NeXT. Set up the environment following LLaVA-NeXT's instructions first:
conda create -n earlytom python=3.10 -y
conda activate earlytom
pip install --upgrade pip
cd LLaVA-NeXT
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
cd ..
pip install -e .
pip install -r requirements.txt

EarlyTom wraps any LLaVA-OneVision model with a single function call:
import os
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from earlytom import earlytom
pretrained = "path/to/llava-onevision-qwen2-7b-ov"
model_name = "llava_qwen"
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, model_name, device_map="auto", attn_implementation="sdpa",
    multimodal=True
)
# Apply EarlyTom compression
model = earlytom(model)
model.eval()

We use lmms-eval for evaluation. Scripts are provided under scripts/.
bash scripts/ov/eval_ov-7b_earlytom.sh

Example configuration for VideoMME at retain ratio 0.2:
export WRAPPER=earlytom
export RETAIN_RATIO=0.20
export T=0.5
export M=6
export INNER_k=18
export INNER_r=0.5
export PRUNE_LAYERS="8,21,23"
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
accelerate launch --num_processes=8 --main_process_port=25000 \
-m lmms_eval \
--model llava_onevision \
--model_args pretrained=path/to/model,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32 \
--tasks videomme \
--batch_size 1 \
--output_path ./logs/ov-7b-earlytom/videomme/0.20

Supported benchmarks: mvbench, videomme, egoschema, longvideobench_val_v.
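To get a feel for what the example configuration above implies, the following sketch parses two of the exported variables and estimates the visual token budget. It is an illustration only: how EarlyTom itself reads `RETAIN_RATIO` and `PRUNE_LAYERS` is not documented here, the helper names and the fallback defaults are hypothetical, and the figure of 196 visual tokens per frame is an assumption about LLaVA-OneVision's video setting rather than a value taken from this repository.

```python
import os


def read_compression_config(env=os.environ):
    """Parse the retain ratio and layer list from environment-style settings."""
    # Defaults (no compression, no layers) are hypothetical fallbacks.
    retain_ratio = float(env.get("RETAIN_RATIO", "1.0"))
    raw_layers = env.get("PRUNE_LAYERS", "")
    prune_layers = [int(x) for x in raw_layers.split(",") if x.strip()]
    return retain_ratio, prune_layers


def visual_token_budget(num_frames, tokens_per_frame, retain_ratio):
    """Return (total visual tokens, tokens kept after compression)."""
    total = num_frames * tokens_per_frame
    return total, max(1, round(total * retain_ratio))


if __name__ == "__main__":
    ratio, layers = read_compression_config(
        {"RETAIN_RATIO": "0.20", "PRUNE_LAYERS": "8,21,23"}
    )
    # 32 frames as in max_frames_num=32; 196 tokens per frame is an assumption.
    total, kept = visual_token_budget(32, 196, ratio)
    print(f"layers={layers} total={total} kept={kept}")
```

Under these assumptions, 32 frames amount to 6272 visual tokens, of which about 1254 would remain at a retain ratio of 0.20.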
This work is built upon LLaVA-NeXT and HoliTom. We thank the authors for their excellent open-source contributions.
If EarlyTom is useful for your research, please consider citing:
@inproceedings{
wang2025earlytom,
title = {EarlyTom: Early Token Compression Completes Fast Video Understanding},
author = {Wang, Hesong and Jin, Xin and Lu, Lu and Li, Chenhaowen and Chen, Jian and Liu, Qiang and Wang, Huan},
year = {2026},
booktitle={CVPR},
}