[CVPR 2026] EarlyTom: Early Token Compression Completes Fast Video Understanding

Hesong Wang¹,²,³ 🌟, Xin Jin² 🌟, Lu Lu³ ✉️, Chenhaowen Li³, Jian Chen³, Qiang Liu³, Huan Wang² ✉️

¹Zhejiang University, ²Westlake University, ³Alibaba Cloud Computing

🌟 Equal Contribution, ✉️ Corresponding Author

📆 News

[2026/5/17] We release the EarlyTom code.

🔍 Introduction

EarlyTom is a training-free token compression method for video large language models (Video-LLMs). It reduces the number of visual tokens by leveraging early-layer attention signals to identify and prune redundant tokens before they propagate through the full model, significantly reducing computation while preserving performance.
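To make the pruning idea concrete, here is a minimal, dependency-free sketch of keeping only the highest-scoring fraction of visual tokens. It is an illustration under stated assumptions, not the repository's implementation: the function name `prune_by_attention`, the flat list of per-token scores, and the rounding rule are all hypothetical.

```python
import math

def prune_by_attention(tokens, scores, retain_ratio):
    """Keep the highest-scoring fraction of tokens, preserving original order.

    tokens: list of token embeddings (any objects).
    scores: one importance score per token, e.g. derived from
            early-layer attention (assumed to be given).
    retain_ratio: fraction of tokens to keep, in (0, 1].
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < retain_ratio <= 1.0:
        raise ValueError("retain_ratio must be in (0, 1]")
    # Keep at least one token; flooring is an assumption of this sketch.
    keep = max(1, math.floor(len(tokens) * retain_ratio))
    # Rank indices by score (descending), then restore the original order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(ranked[:keep])
    return [tokens[i] for i in kept_idx], kept_idx

tokens = ["t0", "t1", "t2", "t3", "t4"]
scores = [0.05, 0.40, 0.10, 0.30, 0.15]
kept, idx = prune_by_attention(tokens, scores, retain_ratio=0.4)
print(kept, idx)
```

Restoring the original order after ranking keeps the surviving tokens in their temporal and spatial sequence, which a position-aware backbone would typically expect.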

EarlyTom supports two complementary compression strategies:

Outer compression: Prunes redundant visual tokens at the vision encoder output, guided by attention weights from early transformer layers.
Inner compression (optional): Further merges tokens inside the LLM backbone at specified layers using a DPC-KNN clustering approach.
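For readers unfamiliar with DPC-KNN, the sketch below shows the generic algorithm followed by mean-merging of each cluster. It is written in plain Python for clarity and is not taken from the EarlyTom code: the function name `dpc_knn_merge`, the Euclidean distance, and the averaging rule are assumptions made for this example.

```python
import math

def dpc_knn_merge(tokens, num_clusters, k):
    """Cluster token vectors with DPC-KNN and average each cluster."""
    n = len(tokens)
    if n < 2 or not 1 <= num_clusters <= n or k < 1:
        raise ValueError("need n >= 2, 1 <= num_clusters <= n, and k >= 1")
    dist = [[math.dist(a, b) for b in tokens] for a in tokens]

    # Local density: exp of the negative mean squared distance
    # to the k nearest neighbours.
    rho = []
    for i in range(n):
        nearest = sorted(d for j, d in enumerate(dist[i]) if j != i)[:k]
        rho.append(math.exp(-sum(d * d for d in nearest) / len(nearest)))

    # Delta: distance to the closest token with strictly higher density.
    # Tokens with the highest density get their largest distance instead.
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))

    # Cluster centres are the tokens with the largest rho * delta.
    centres = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)
    centres = centres[:num_clusters]

    # Assign every token to its nearest centre.
    clusters = {c: [] for c in centres}
    for i in range(n):
        nearest_centre = min(centres, key=lambda c: dist[i][c])
        clusters[nearest_centre].append(tokens[i])

    # Merge each cluster into one token by averaging its members.
    merged = []
    for c in sorted(centres):
        members = clusters[c]
        merged.append([sum(col) / len(members) for col in zip(*members)])
    return merged

tokens = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
merged = dpc_knn_merge(tokens, num_clusters=2, k=2)
print(merged)
```

In this toy input the two well-separated groups each collapse into a single averaged token, so six tokens become two.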

🛠 Installation

1️⃣ Clone the repository

git clone https://github.com/viridisGreen/EarlyTom
cd EarlyTom

2️⃣ Install dependencies

EarlyTom is built on top of LLaVA-NeXT. Set up the environment following LLaVA-NeXT's instructions first:

conda create -n earlytom python=3.10 -y
conda activate earlytom
pip install --upgrade pip
cd LLaVA-NeXT
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
cd ..

3️⃣ Install EarlyTom

pip install -e .

Install from requirements

pip install -r requirements.txt

🚀 Quick Start

EarlyTom wraps any LLaVA-OneVision model with a single function call:

from llava.model.builder import load_pretrained_model
from earlytom import earlytom

# Load a LLaVA-OneVision checkpoint through LLaVA-NeXT's loader.
pretrained = "path/to/llava-onevision-qwen2-7b-ov"
model_name = "llava_qwen"
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, model_name, device_map="auto", attn_implementation="sdpa",
    multimodal=True
)

# Apply EarlyTom compression
model = earlytom(model)
model.eval()

📊 Evaluation

We use lmms-eval for evaluation. Scripts are provided under scripts/.

LLaVA-OneVision-7B

bash scripts/ov/eval_ov-7b_earlytom.sh

Example configuration for VideoMME at a retain ratio of 0.2:

export WRAPPER=earlytom
export RETAIN_RATIO=0.20
export T=0.5
export M=6
export INNER_k=18
export INNER_r=0.5
export PRUNE_LAYERS="8,21,23"

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
accelerate launch --num_processes=8 --main_process_port=25000 \
-m lmms_eval \
--model llava_onevision \
--model_args pretrained=path/to/model,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=32 \
--tasks videomme \
--batch_size 1 \
--output_path ./logs/ov-7b-earlytom/videomme/0.20
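If you drive these runs from Python, a small helper can turn the exported variables into typed values. The sketch below is hypothetical and not part of the repository: `read_earlytom_env` is an invented name, the types are inferred only from the example values above, and no meaning is assumed for `T`, `M`, `INNER_k`, or `INNER_r`. The token budget at the end further assumes 196 visual tokens per frame and that the retain ratio applies to all visual tokens of the video.

```python
import os

def read_earlytom_env(env=os.environ):
    """Parse the exported settings into typed Python values.

    Types are inferred from the README's example values; what each
    variable controls inside EarlyTom is not assumed here.
    """
    return {
        "wrapper": env.get("WRAPPER", ""),
        "retain_ratio": float(env["RETAIN_RATIO"]),
        "t": float(env["T"]),
        "m": int(env["M"]),
        "inner_k": int(env["INNER_k"]),
        "inner_r": float(env["INNER_r"]),
        "prune_layers": [int(x) for x in env["PRUNE_LAYERS"].split(",") if x.strip()],
    }

# Values copied from the example configuration above.
example = {
    "WRAPPER": "earlytom",
    "RETAIN_RATIO": "0.20",
    "T": "0.5",
    "M": "6",
    "INNER_k": "18",
    "INNER_r": "0.5",
    "PRUNE_LAYERS": "8,21,23",
}
cfg = read_earlytom_env(example)

# Rough visual-token budget: frames * tokens_per_frame * retain_ratio.
# 196 tokens per frame is an assumption made for illustration.
max_frames_num = 32
tokens_per_frame = 196
budget = int(max_frames_num * tokens_per_frame * cfg["retain_ratio"])
print(cfg["prune_layers"], budget)
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test without touching the real environment.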

Supported benchmarks: mvbench, videomme, egoschema, longvideobench_val_v.

❤ Acknowledgement

This work builds on LLaVA-NeXT and HoliTom. We thank their authors for their excellent open-source contributions.

🤗 Citation

If EarlyTom is useful for your research, please consider citing:

@inproceedings{
    wang2025earlytom,
    title = {EarlyTom: Early Token Compression Completes Fast Video Understanding},
    author = {Wang, Hesong and Jin, Xin and Lu, Lu and Li, Chenhaowen and Chen, Jian and Liu, Qiang and Wang, Huan},
    year = {2026},
    booktitle={CVPR},
}

Folders and files (2 commits)

LLaVA-NeXT
docs
earlytom
fastvid
holitom
lmms_eval
scripts
tome
visionzip
.gitignore
LICENSE
README.md
example.py
flops.py
performance.py
requirements.txt
setup.py
