`llm-compressor` is an easy-to-use library for optimizing models for deployment with `vllm`, including:
- Comprehensive set of quantization algorithms including weight-only and activation quantization
- Seamless integration with Hugging Face models and repositories
- `safetensors`-based file format compatible with `vllm`
- Supported formats:
  - Mixed Precision: W4A16, W8A16
  - Activation Quantization: W8A8 (int8 and fp8)
  - 2:4 Semi-structured Sparsity
  - Unstructured Sparsity
- Supported algorithms (see the recipe sketch after this list):
  - PTQ (Post Training Quantization)
  - GPTQ
  - SmoothQuant
  - SparseGPT
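The formats and algorithms above are exposed as modifiers that are composed into recipes. As a minimal sketch, a 2:4 semi-structured sparsity recipe based on SparseGPT might look like the following; the module path and parameter names are assumptions based on the library's examples and may differ between versions.

```python
# Hypothetical sketch of a sparsity recipe; import path and parameters follow
# llm-compressor's examples and may vary between versions.
from llmcompressor.modifiers.obcq import SparseGPTModifier

# Prune weights to 50% sparsity using a 2:4 semi-structured mask
recipe = SparseGPTModifier(sparsity=0.5, mask_structure="2:4")
```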
`llm-compressor` can be installed from the source code via a git clone and local pip install.
```bash
git clone https://github.com/vllm-project/llm-compressor.git
pip install -e llm-compressor
```
The following snippet is a minimal example of 4-bit weight-only quantization of `TinyLlama/TinyLlama-1.1B-Chat-v1.0` via GPTQ, followed by inference. Note that the model can be swapped for a local or remote HF-compatible checkpoint, and the recipe may be changed to target different quantization algorithms or formats.

Compression is easily applied by selecting an algorithm (GPTQ) and calling the `oneshot` API.
```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization.gptq import GPTQModifier

# Set parameters for the GPTQ algorithm - target Linear layer weights at 4 bits
recipe = GPTQModifier(scheme="W4A16", targets="Linear", ignore=["lm_head"])

# Apply the GPTQ algorithm, using the open_platypus dataset for calibration
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    save_compressed=True,
    output_dir="llama-compressed-quickstart",
    overwrite_output_dir=True,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```
The checkpoint is ready to run with vLLM (after installing it via `pip install vllm`).
```python
from vllm import LLM

model = LLM("llama-compressed-quickstart")
output = model.generate("I love 4 bit models because")
```
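`generate` returns vLLM request outputs rather than plain strings. A slightly fuller sketch, assuming the output directory produced by the quickstart above, sets basic sampling parameters and prints the generated text:

```python
from vllm import LLM, SamplingParams

model = LLM("llama-compressed-quickstart")
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# generate() accepts a list of prompts and returns one RequestOutput per prompt
outputs = model.generate(["I love 4 bit models because"], sampling_params)
for request_output in outputs:
    print(request_output.outputs[0].text)
```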
The `llm-compressor` library provides a rich feature-set for model compression. Below are examples and documentation of a few key flows:
- `Meta-Llama-3-8B-Instruct` W4A16 With GPTQ
- `Meta-Llama-3-8B-Instruct` W8A8-Int8 With GPTQ and SmoothQuant (sketched below)
- `Meta-Llama-3-8B-Instruct` W8A8-Fp8 With PTQ
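As an illustration of the W8A8-Int8 flow, SmoothQuant and GPTQ modifiers can be chained in a single recipe passed to `oneshot`. The sketch below mirrors the quickstart; the `SmoothQuantModifier` import path, the `smoothing_strength` value, and the output directory name are assumptions based on the library's examples rather than a prescribed configuration.

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization.gptq import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# SmoothQuant migrates activation outliers into the weights, then GPTQ quantizes
# weights and activations to 8 bits (int8)
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    dataset="open_platypus",
    recipe=recipe,
    save_compressed=True,
    output_dir="Meta-Llama-3-8B-Instruct-W8A8-Int8",  # illustrative output path
    max_seq_length=2048,
    num_calibration_samples=512,
)
```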
If you have any questions or requests, open an issue and we will add an example or documentation.
We appreciate contributions to the code, examples, integrations, and documentation as well as bug reports and feature requests! Learn how here.