Official implementation of AttentionPredictor, the first approach that formulates attention-score prediction as a time-series problem for KV cache compression and learnable attention acceleration. It learns a lightweight convolutional model to capture spatiotemporal patterns in attention and predict the next token's attention scores. An appealing feature is that it predicts these scores accurately while consuming negligible memory. By retaining most of the attention information, AttentionPredictor achieves 16× KV cache compression with comparable LLM performance, significantly outperforming the state of the art.
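For intuition, the sketch below illustrates the core idea; it is not the repository's actual implementation, and all layer shapes, names, and hyperparameters are illustrative assumptions. A lightweight convolutional network slides over a short history of attention-score maps, predicts the next step's scores, and only the top-scoring cache positions are retained:

```python
import torch
import torch.nn as nn

class AttentionScorePredictor(nn.Module):
    """Illustrative sketch: predict the next-token attention scores from a
    short history of past scores with a lightweight conv net. The layer
    choices here are assumptions, not the paper's exact model."""

    def __init__(self, history: int = 8, hidden: int = 16):
        super().__init__()
        # Treat the (history, seq_len) score window as a 1-channel image and
        # convolve over the temporal and positional axes jointly.
        self.net = nn.Sequential(
            nn.Conv2d(1, hidden, kernel_size=(history, 5), padding=(0, 2)),
            nn.ReLU(),
            nn.Conv2d(hidden, 1, kernel_size=(1, 3), padding=(0, 1)),
        )

    def forward(self, score_history: torch.Tensor) -> torch.Tensor:
        # score_history: (batch, history, seq_len) attention scores from the
        # last `history` decoding steps; returns (batch, seq_len) predictions.
        return self.net(score_history.unsqueeze(1)).squeeze(1).squeeze(1)

def select_kv_indices(pred_scores: torch.Tensor, keep_ratio: float = 1 / 16):
    # Keep only the top-k cache positions by predicted score; keep_ratio=1/16
    # corresponds to the 16x compression figure quoted above.
    k = max(1, int(pred_scores.shape[-1] * keep_ratio))
    return pred_scores.topk(k, dim=-1).indices
```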
Our paper is available at https://arxiv.org/abs/2502.04077
Requirements:

- PyTorch
- FlashAttention-2
- Transformers >= 4.44.0
Supported models:

- LongChat: lmsys/longchat-7b-v1.5-32k
- LLaMA-3.1: meta-llama/Meta-Llama-3.1-8B-Instruct
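Either backbone can be loaded through Hugging Face Transformers with FlashAttention-2 enabled. A generic loading sketch, independent of this repo's patched attention, might look like:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # or "lmsys/longchat-7b-v1.5-32k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
    device_map="auto",
)
```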
The implementation of AttentionPredictor is in src/llama_attention/attnpred_llama_attention.py.
The pretrained predictor models are in model/.
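To inspect a pretrained predictor directly, something like the following should work, assuming the checkpoints are ordinary PyTorch files; the file name below is hypothetical, and the attention implementation above handles the actual loading:

```python
import torch

# Hypothetical file name; the checkpoint is assumed to be a standard
# PyTorch state dict saved with torch.save().
state_dict = torch.load("model/predictor.pt", map_location="cpu")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
```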
To evaluate on the LongBench dataset:
```bash
cd src/evaluation/LongBench
bash eval.sh
bash metrics.sh
```
If you find AttentionPredictor useful for your project or research, please cite our paper:
```bibtex
@article{yang2025attentionpredictor,
  title={AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference},
  author={Yang, Qingyue and Wang, Jie and Li, Xing and Wang, Zhihai and Chen, Chen and Chen, Lei and Yu, Xianzhi and Liu, Wulong and Hao, Jianye and Yuan, Mingxuan and others},
  journal={arXiv preprint arXiv:2502.04077},
  year={2025}
}
```