viridisGreen/EarlyTom

Hesong Wang1,2,3,🌟, Xin Jin2,🌟, Lu Lu3,✉️, Chenhaowen Li3, Jian Chen3, Qiang Liu3, Huan Wang2,✉️

1Zhejiang University, 2Westlake University, 3Alibaba Cloud Computing

🌟 Equal Contribution ✉️ Corresponding Author

📆 News

[2026/5/17] We release the EarlyTom code.

🔍 Introduction

EarlyTom is a training-free token compression method for video large language models (Video-LLMs). It uses early-layer attention signals to identify redundant visual tokens and prunes them before they propagate through the full model, which reduces computation while preserving performance.

EarlyTom supports two complementary compression strategies:

  • Outer compression: Prunes redundant visual tokens at the vision encoder output, guided by attention weights from early transformer layers.
  • Inner compression (optional): Further merges tokens inside the LLM backbone at specified layers using a DPC-KNN clustering approach.
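To make the two strategies concrete, here is a minimal NumPy sketch, not the EarlyTom implementation. It assumes a per-token importance score is already available (how EarlyTom derives it from early-layer attention is not shown here), and it uses a textbook form of DPC-KNN. The function names `prune_by_attention` and `dpc_knn_merge` are hypothetical.

```python
import numpy as np

def prune_by_attention(tokens, attn_scores, retain_ratio):
    """Outer-style pruning: keep the top-scoring fraction of tokens, in original order."""
    num_keep = max(1, int(round(tokens.shape[0] * retain_ratio)))
    # Pick the highest scores, then sort the indices so token order is preserved.
    keep_idx = np.sort(np.argsort(-attn_scores, kind="stable")[:num_keep])
    return tokens[keep_idx], keep_idx

def dpc_knn_merge(tokens, num_clusters, k):
    """Inner-style merging: cluster tokens with DPC-KNN, then average each cluster."""
    dist = np.linalg.norm(tokens[:, None, :] - tokens[None, :, :], axis=-1)
    # Local density: high when the k nearest neighbours are close (column 0 is self).
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    density = np.exp(-(knn ** 2).mean(axis=1))
    # Delta: distance to the nearest token with strictly higher density.
    higher = density[None, :] > density[:, None]
    delta = np.where(higher, dist, np.inf).min(axis=1)
    delta[np.isinf(delta)] = dist.max()  # densest token(s) have no higher neighbour
    # Cluster centers are tokens that are both dense and far from denser tokens.
    centers = np.argsort(-(density * delta))[:num_clusters]
    assign = dist[:, centers].argmin(axis=1)
    assign[centers] = np.arange(num_clusters)  # each center anchors its own cluster
    merged = np.stack([tokens[assign == c].mean(axis=0) for c in range(num_clusters)])
    return merged, assign

# Toy usage: 5 tokens with made-up scores, keep 40% of them.
toy_tokens = np.arange(10, dtype=float).reshape(5, 2)
toy_scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2])
kept, keep_idx = prune_by_attention(toy_tokens, toy_scores, retain_ratio=0.4)
# keep_idx is [1, 3]

# Toy usage: two well-separated groups of 3 tokens merge into 2 tokens.
groups = np.array([[0, 0], [0.1, 0], [0, 0.1], [10, 10], [10.1, 10], [10, 10.1]])
merged, assign = dpc_knn_merge(groups, num_clusters=2, k=2)
```

Pruning discards tokens outright, while merging averages similar tokens so that some of their information survives in the cluster mean.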

🛠 Installation

1️⃣ Clone the repository

git clone https://github.com/viridisGreen/EarlyTom
cd EarlyTom

2️⃣ Install dependencies

EarlyTom is built on top of LLaVA-NeXT. Set up the environment following LLaVA-NeXT's instructions first:

conda create -n earlytom python=3.10 -y
conda activate earlytom
pip install --upgrade pip
cd LLaVA-NeXT
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
cd ..

3️⃣ Install EarlyTom

pip install -e .

Install the dependencies listed in requirements.txt:

pip install -r requirements.txt

🚀 Quick Start

EarlyTom wraps any LLaVA-OneVision model with a single function call:

from llava.model.builder import load_pretrained_model
from earlytom import earlytom

pretrained = "path/to/llava-onevision-qwen2-7b-ov"
model_name = "llava_qwen"
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, model_name, device_map="auto", attn_implementation="sdpa",
    multimodal=True
)

# Apply EarlyTom compression
model = earlytom(model)
model.eval()

📊 Evaluation

We use lmms-eval for evaluation. Scripts are provided under scripts/.

LLaVA-OneVision-7B

bash scripts/ov/eval_ov-7b_earlytom.sh

Example configuration for VideoMME at retain ratio 0.2:

export WRAPPER=earlytom
export RETAIN_RATIO=0.20
export T=0.5
export M=6
export INNER_k=18
export INNER_r=0.5
export PRUNE_LAYERS="8,21,23"

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
accelerate launch --num_processes=8 --main_process_port=25000 \
-m lmms_eval \
--model llava_onevision \
--model_args pretrained=path/to/model,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32 \
--tasks videomme \
--batch_size 1 \
--output_path ./logs/ov-7b-earlytom/videomme/0.20
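As a rough illustration of what a retain ratio of 0.20 means for the visual token count, the arithmetic can be sketched as below. The figure of 196 visual tokens per frame is an assumption for LLaVA-OneVision video inputs, and truncating to an integer is also an assumption. The helper `visual_token_budget` is hypothetical and not part of EarlyTom.

```python
def visual_token_budget(num_frames, tokens_per_frame, retain_ratio):
    """Return (total visual tokens, tokens kept) for a given retain ratio."""
    total = num_frames * tokens_per_frame
    kept = int(total * retain_ratio)  # assumption: truncate to an integer
    return total, kept

# max_frames_num=32 as in the command above; 196 tokens per frame is assumed.
total, kept = visual_token_budget(num_frames=32, tokens_per_frame=196, retain_ratio=0.20)
print(total, kept)
```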

Supported benchmarks: mvbench, videomme, egoschema, longvideobench_val_v.
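To run the same configuration across several benchmarks, the command can be assembled programmatically. The sketch below only reuses the environment variables and flags from the example configuration. The helper name `build_eval_command` is hypothetical, and the other settings (T, M, INNER_k, INNER_r, PRUNE_LAYERS) are copied from the 0.20 example, so they may not match what the provided scripts use for other ratios or tasks.

```python
import shlex

def build_eval_command(task, retain_ratio, model_path="path/to/model", num_processes=8):
    """Build the environment and argv for one lmms-eval run (values from the example)."""
    env = {
        "WRAPPER": "earlytom",
        "RETAIN_RATIO": f"{retain_ratio:.2f}",
        "T": "0.5",
        "M": "6",
        "INNER_k": "18",
        "INNER_r": "0.5",
        "PRUNE_LAYERS": "8,21,23",
    }
    model_args = (
        f"pretrained={model_path},conv_template=qwen_1_5,"
        "model_name=llava_qwen,max_frames_num=32"
    )
    argv = [
        "accelerate", "launch", f"--num_processes={num_processes}",
        "--main_process_port=25000", "-m", "lmms_eval",
        "--model", "llava_onevision", "--model_args", model_args,
        "--tasks", task, "--batch_size", "1",
        "--output_path", f"./logs/ov-7b-earlytom/{task}/{retain_ratio:.2f}",
    ]
    return env, argv

# Print one shell-ready command line per supported benchmark.
for task in ("mvbench", "videomme", "egoschema", "longvideobench_val_v"):
    env, argv = build_eval_command(task, 0.20)
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    print(prefix, shlex.join(argv))
```

Printing the commands rather than executing them makes it easy to review each line before launching a multi-GPU job.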

❤ Acknowledgement

This work is built upon LLaVA-NeXT and HoliTom. We thank their authors for their excellent open-source contributions.


🤗 Citation

If EarlyTom is useful for your research, please consider citing:

@inproceedings{
    wang2025earlytom,
    title = {EarlyTom: Early Token Compression Completes Fast Video Understanding},
    author = {Wang, Hesong and Jin, Xin and Lu, Lu and Li, Chenhaowen and Chen, Jian and Liu, Qiang and Wang, Huan},
    year = {2026},
    booktitle={CVPR},
}

About

[CVPR 2026] EarlyTom: Early Token Compression Completes Fast Video Understanding
