v0.12.0

@dsikka dsikka released this 15 Jun 13:02
6d2a090

LLM Compressor v0.12.0 Release Notes

This release upgrades to Transformers v5 with refactored MoE linearization, simplifies the dataset split API by removing legacy multi-stage logic, adds multi-GPU distribution for model-free PTQ workflows, and expands model coverage with Nemotron Ultra FP8 examples.

This release contains changes to example scripts that affect backwards compatibility with previous examples and scripts. Please read the Transformers v5 section for more information.

Key Highlights ✨

  • Transformers v5 Upgrade (#2647): Full integration with Transformers v5, including refactored MoE linearization with load_context for efficient loading, updated model structure handling, and improved tied embeddings support. Maintains LM eval performance across the transition. Note: LLM Compressor no longer supports installation with transformers<5.0.0.
  • Simplified Dataset Interface (#2551): Removed legacy multi-split logic, replacing splits={"calibration": "train[:100]"} with the simpler split="train[:100]" API. The legacy argument is deprecated and will be removed in a future release.
  • Multi-GPU Model-Free PTQ (#2773): Added support for distributing model-free PTQ jobs across multiple GPUs, parallelizing and speeding up quantization workflows.
  • Nemotron Ultra Support (#2803): Added FP8 quantization example for Nemotron Ultra models in the model-free PTQ examples.

Transformers v5

Examples and Model Loading

  • Example regexes and recipes have been updated to reflect new model structures introduced by Transformers v5

  • Examples which utilize disk offloading or mixture-of-experts (MoE) calibration now load models with load_context provided by llmcompressor.utils. This context is a catch-all context and should be used in all scripts for efficient model loading.

- from compressed_tensors.offload import load_offloaded_model
- from llmcompressor.modeling.moe.linearize import load_quantizable_moe
- 
- with load_offloaded_model(), load_quantizable_moe():
-     model = AutoModelForCausalLM.from_pretrained(model_id)

+ from llmcompressor.utils import load_context
+ 
+ with load_context():
+     model = AutoModelForCausalLM.from_pretrained(model_id)
  • dtype now defaults to "auto", so this explicit argument has been removed to reduce verbosity
- model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto")
+ model = AutoModelForCausalLM.from_pretrained(model_id)
  • from_pretrained no longer supports use_auth_token. This argument has been removed from oneshot
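The two argument changes above can be applied mechanically to older scripts. The sketch below is a hypothetical helper (migrate_from_pretrained_kwargs is not part of LLM Compressor or Transformers) that rewrites a keyword-argument dictionary. It assumes that token is the replacement keyword for use_auth_token, which these notes do not state explicitly.

```python
import warnings


def migrate_from_pretrained_kwargs(kwargs):
    """Return a copy of `kwargs` adjusted for the changes described above.

    Hypothetical helper for illustration only; not a library API.
    """
    migrated = dict(kwargs)

    # `use_auth_token` is no longer accepted. Assumption: `token` is the
    # replacement keyword, so the value is carried over under that name.
    if "use_auth_token" in migrated:
        warnings.warn(
            "use_auth_token is no longer supported; passing it as token",
            DeprecationWarning,
            stacklevel=2,
        )
        legacy_value = migrated.pop("use_auth_token")
        migrated.setdefault("token", legacy_value)

    # dtype now defaults to "auto", so an explicit "auto" is redundant.
    if migrated.get("dtype") == "auto":
        del migrated["dtype"]

    return migrated
```

A non-default value such as dtype="bfloat16" is left untouched, since only the redundant "auto" is dropped.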

Expanded and Refactored MoE Support

Applying quantization to Mixture-of-Experts (MoE) models requires explicit linearization and class overriding in order to calibrate experts efficiently. LLM Compressor implements this logic through two pathways:

  • llmcompressor.modeling.moe.linearize::linearize_moe replaces experts modules with linearized, calibration-friendly classes AFTER weights have been loaded.
  • llmcompressor.modeling.moe.linearize::load_quantizable_moe replaces experts modules with linearized, calibration-friendly classes BEFORE weights have been loaded. This context is more efficient and reduces runtime during model loading.

Both of these pathways are called as needed by llmcompressor.utils::load_context. These implementations automatically handle >90% of all model definitions provided by transformers. For unconventional or custom model definitions, see Adding MoE Calibration Support for a New Model.
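As a rough illustration of what linearization could mean here, the toy sketch below replaces one stacked expert weight with a separate linear layer per expert and checks that the outputs match. It assumes that "linearized" refers to this per-expert split, which these notes do not spell out. All class names are hypothetical and plain Python lists stand in for tensors; this is not the llmcompressor implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class FusedExperts:
    """Toy experts module: weights stacked as weight[expert][out][in]."""

    def __init__(self, weight):
        self.weight = weight

    def forward(self, expert_index, x):
        return matvec(self.weight[expert_index], x)


class Linear:
    """Toy linear layer holding a single 2D weight."""

    def __init__(self, weight):
        self.weight = weight

    def forward(self, x):
        return matvec(self.weight, x)


class LinearizedExperts:
    """Same computation, but each expert is its own Linear module,
    so per-layer tooling can address the experts individually."""

    def __init__(self, fused):
        self.experts = [Linear(expert_weight) for expert_weight in fused.weight]

    def forward(self, expert_index, x):
        return self.experts[expert_index].forward(x)


fused = FusedExperts([[[1, 0], [0, 1]], [[2, 0], [0, 3]]])
linearized = LinearizedExperts(fused)
```

In this toy, the conversion happens after the weights exist, which loosely mirrors the linearize_moe pathway; swapping the class before loading, as load_quantizable_moe is described as doing, would avoid building the stacked form first.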

Multi-GPU Model-Free PTQ

Model-free PTQ now supports distributing quantization jobs across multiple GPUs when available. This feature automatically detects available GPUs and parallelizes the quantization workflow, significantly reducing processing time for large models.
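These notes do not describe how work is divided between GPUs. Purely as an illustration of one possible scheme, the sketch below assigns checkpoint shard files to devices in round-robin order. The function name, the shard file names, and the "cuda:N" device labels are assumptions made for this example and do not reflect the actual implementation.

```python
def assign_shards(shards, num_devices):
    """Distribute shard names across devices in round-robin order.

    Hypothetical helper for illustration only.
    """
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")

    # One work list per device, keyed by a conventional device label.
    assignment = {f"cuda:{i}": [] for i in range(num_devices)}
    for index, shard in enumerate(shards):
        assignment[f"cuda:{index % num_devices}"].append(shard)
    return assignment


plan = assign_shards(
    ["shard-1.safetensors", "shard-2.safetensors", "shard-3.safetensors"],
    num_devices=2,
)
```

Round-robin keeps the shard counts per device within one of each other; a size-aware scheme would balance better when shards differ a lot in size.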

Simplified Dataset Interface

The dataset split interface has been refactored to remove legacy multi-stage logic that previously supported separate datasets for training, oneshot, and eval in a single command. Since training and eval tasks are no longer supported in the same command, the API has been simplified.

Old interface:

oneshot(
  model,
  dataset="ultrachat",
  splits={"calibration": "train_sft[:100]"}
)

New interface:

oneshot(
  model,
  dataset="ultrachat",
  split="train_sft[:100]"
)

The new API is backwards compatible and will issue warnings when using the old dictionary-based splits argument.
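A compatibility shim of this kind might look like the sketch below, which accepts either argument and warns on the legacy form. The function resolve_split and its error handling are hypothetical and are not taken from the LLM Compressor source.

```python
import warnings


def resolve_split(split=None, splits=None):
    """Return the single split string, accepting the legacy `splits` form.

    Hypothetical helper for illustration only.
    """
    if splits is None:
        return split

    warnings.warn(
        "`splits` is deprecated; pass split=\"...\" instead",
        DeprecationWarning,
        stacklevel=2,
    )
    if split is not None:
        raise ValueError("pass either `split` or `splits`, not both")
    if isinstance(splits, str):
        return splits
    if len(splits) != 1:
        # Only one stage per command is supported, so several entries
        # cannot be mapped onto a single split string.
        raise ValueError("`splits` must contain exactly one entry")
    return next(iter(splits.values()))
```

With this shim, resolve_split(splits={"calibration": "train_sft[:100]"}) and resolve_split(split="train_sft[:100]") yield the same string, and only the first emits a warning.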

Nemotron 3 Ultra Examples

This release adds model-free PTQ examples for NVIDIA's Nemotron-3-Ultra-550B model.
Pre-quantized FP8 checkpoints are available on HuggingFace:

Breaking Changes

  • The minimum transformers version has been bumped up to v5.9
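Scripts that need to fail early on an older environment can check the version string up front. The sketch below is a hypothetical helper that compares only the major and minor components against the v5.9 minimum stated above; it is not part of LLM Compressor.

```python
import re

MINIMUM_TRANSFORMERS = (5, 9)


def meets_minimum(version, minimum=MINIMUM_TRANSFORMERS):
    """Return True if a version string like "5.9.0" is at least `minimum`.

    Hypothetical helper; only the major and minor numbers are compared.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    # Tuple comparison avoids the string pitfall where "5.10" < "5.9".
    return major_minor >= minimum
```

The installed version string can be obtained with importlib.metadata.version("transformers") and passed to this function.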

New Contributors

Full Changelog: 0.11.0...0.12.0