
LLM Compressor

llm-compressor is an easy-to-use library for optimizing models for deployment with vLLM, including:

  • Comprehensive set of quantization algorithms, including weight-only and activation quantization
  • Seamless integration with Hugging Face models and repositories
  • safetensors-based file format compatible with vLLM

(Diagram: LLM Compressor Flow)

Supported Formats

  • Mixed Precision: W4A16, W8A16
  • Activation Quantization: W8A8 (int8 and fp8)
  • 2:4 Semi-structured Sparsity
  • Unstructured Sparsity
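
As a rough sketch of how these formats are selected, the snippet below reuses the GPTQModifier recipe pattern from the Quick Tour; the scheme strings ("W4A16", "W8A16", "W8A8") follow the format names above, but treat them as assumptions to verify against the library documentation.

from llmcompressor.modifiers.quantization.gptq import GPTQModifier

# Each recipe targets Linear layer weights; the scheme string selects the format.
w4a16 = GPTQModifier(scheme="W4A16", targets="Linear", ignore=["lm_head"])  # 4-bit weights, 16-bit activations
w8a16 = GPTQModifier(scheme="W8A16", targets="Linear", ignore=["lm_head"])  # 8-bit weights, 16-bit activations
w8a8 = GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"])    # 8-bit weights and activations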

Supported Algorithms

  • PTQ (Post Training Quantization)
  • GPTQ
  • SmoothQuant
  • SparseGPT
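
Modifiers for these algorithms can be composed into a single recipe. The sketch below chains SmoothQuant (to smooth activation outliers) with GPTQ; the SmoothQuantModifier import path and the smoothing_strength value are assumptions and may differ between releases.

from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization.gptq import GPTQModifier

# SmoothQuant rescales activations before GPTQ quantizes weights and activations.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]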

Installation

llm-compressor can be installed from source via a git clone and a local pip install.

git clone https://github.com/vllm-project/llm-compressor.git
pip install -e llm-compressor
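
As a quick sanity check (not part of the official instructions), the editable install can be verified by importing the package:

python -c "import llmcompressor"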

Quick Tour

The following snippets give a minimal example of 4-bit weight-only quantization via GPTQ and inference of TinyLlama/TinyLlama-1.1B-Chat-v1.0. Note that the model can be swapped for a local or remote HF-compatible checkpoint, and the recipe may be changed to target different quantization algorithms or formats.

Compression

Compression is easily applied by selecting an algorithm (GPTQ) and calling the oneshot API.

from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization.gptq import GPTQModifier

# Set parameters for the GPTQ algorithm: quantize Linear layer weights to 4 bits
recipe = GPTQModifier(scheme="W4A16", targets="Linear", ignore=["lm_head"])

# Apply GPTQ algorithm using open_platypus dataset for calibration.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    save_compressed=True,
    output_dir="llama-compressed-quickstart",
    overwrite_output_dir=True,
    max_seq_length=2048,
    num_calibration_samples=512,
)

Inference with vLLM

The checkpoint is ready to run with vLLM (after installing it with pip install vllm).

from vllm import LLM

# Load the compressed checkpoint saved by the oneshot run above.
model = LLM("llama-compressed-quickstart")
output = model.generate("I love 4 bit models because")
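
model.generate returns a list of request outputs; a short follow-up sketch for inspecting the generated text (the SamplingParams values here are illustrative, not library defaults):

from vllm import SamplingParams

# Reuse the model loaded above; sampling settings are illustrative.
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = model.generate(["I love 4 bit models because"], params)
print(outputs[0].outputs[0].text)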

End-to-End Examples

The llm-compressor library provides a rich feature set for model compression. Examples and documentation covering key flows are available in the repository.

If you have any questions or requests, open an issue and we will add an example or documentation.

Contribute

We appreciate contributions to the code, examples, integrations, and documentation as well as bug reports and feature requests! Learn how here.
