This repo is the official implementation of the KV cache eviction algorithm **SustainableKV**. We guarantee that all experiment results are reproducible.
- Set up the environment and install SustainableKV:

  ```shell
  conda create --name sustainablekv python=3.11
  conda activate sustainablekv
  git clone https://github.com/YUECHE77/SustainableKV.git
  cd SustainableKV
  pip install -e .
  pip install -r requirements.txt
  ```
- Install PyTorch (CUDA 11.8):

  ```shell
  pip uninstall torch torchvision torchaudio -y
  pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
  ```
- Install FlashAttention:

  FlashAttention is sensitive to version mismatches. You can find all official wheels here. A configuration that is guaranteed to work: CUDA 11.8, Python 3.11, torch==2.3.0, flash_attn==2.5.8.

  Check your ABI setting:

  ```shell
  python -c "import torch; print(torch._C._GLIBCXX_USE_CXX11_ABI)"  # True -> pick an abiTRUE wheel, False -> abiFALSE
  ```

  Then install the matching wheel:

  ```shell
  pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu118torch2.3cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
  ```
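  The wheel filename encodes every one of these choices. As a small sketch (assuming the release naming scheme stays stable across versions), you can assemble the name that matches your setup before searching the release page:

  ```python
  # Sketch: build a FlashAttention release-wheel filename from your setup.
  # Assumption: the naming scheme matches the v2.5.8 release wheels.
  def flash_attn_wheel(version, cuda, torch_ver, cxx11abi, py_tag):
      abi = "TRUE" if cxx11abi else "FALSE"
      return (f"flash_attn-{version}+cu{cuda}torch{torch_ver}"
              f"cxx11abi{abi}-{py_tag}-{py_tag}-linux_x86_64.whl")

  # The configuration recommended above (CUDA 11.8, Python 3.11, torch 2.3):
  print(flash_attn_wheel("2.5.8", "118", "2.3", False, "cp311"))
  # flash_attn-2.5.8+cu118torch2.3cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
  ```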
- Download the models from HuggingFace (please refer to model). We currently support the Mistral / Mixtral / LLaMA families. Then replace your model path here.
- We provide a demo. You can modify `method`, `model_to_use`, and `model2path` to test our methods. Please also modify the path to SnapKV's paper on line 54, which is the input document.
The detailed algorithm of SustainableKV is in the file `sustainablekv_utils.py`.

You can easily integrate SustainableKV with other models: just follow the same pattern as the existing ones. We currently support the Llama family, Mistral, and Mixtral.
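The integration pattern for methods of this kind is usually a monkey patch: save the model's original attention `forward`, then swap in a wrapper that compresses the KV cache before delegating. The sketch below shows only that wiring with stand-in names (`DummyAttention`, `sustainable_forward`, `patch_attention` are illustrative, not the repo's actual API):

```python
# Hypothetical sketch of the monkey-patching pattern; not the repo's real API.
class DummyAttention:
    """Stand-in for a transformer attention module."""
    def forward(self, hidden_states):
        return hidden_states  # placeholder for the real attention computation

def sustainable_forward(self, hidden_states):
    # In a real integration, the KV cache would be scored and pruned here
    # before calling the original attention forward.
    return self._orig_forward(hidden_states)

def patch_attention(attn_cls):
    # Keep a handle to the original forward, then replace it with the wrapper.
    attn_cls._orig_forward = attn_cls.forward
    attn_cls.forward = sustainable_forward

patch_attention(DummyAttention)
print(DummyAttention().forward("tokens"))  # the wrapper delegates to the original
```

Supporting a new model family then amounts to repeating this patch for that family's attention class.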
- LongBench:

  ```shell
  cd experiments/LongBench
  bash longbench.sh
  ```
- Needle In A Haystack:

  ```shell
  cd experiments/NeedleInHaystack
  python pred_sustainable.py \
      --model-name lwm-text-chat-1m \
      -s 1000 \
      -e 30000 \
      --num-intervals 15 \
      --compress \
      --save-folder /the/folder/path
  ```
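  One plausible reading of `-s` / `-e` / `--num-intervals` is 15 evenly spaced context lengths from 1,000 to 30,000 tokens; check `pred_sustainable.py` for the exact scheme the script uses:

  ```python
  # Assumed interpretation of -s / -e / --num-intervals: evenly spaced
  # context lengths. Verify against pred_sustainable.py before relying on it.
  start, end, num = 1000, 30000, 15
  lengths = [round(start + i * (end - start) / (num - 1)) for i in range(num)]
  print(lengths[0], lengths[-1], len(lengths))  # 1000 30000 15
  ```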
Many thanks to SnapKV for their great work!