LLM Compressor v0.12.0 Release Notes
This release upgrades to Transformers v5 with improved MoE support, streamlines the dataset interface, and adds multi-GPU acceleration for model-free PTQ. Major highlights include comprehensive Transformers v5 integration with refactored MoE linearization, a simplified dataset split API that removes legacy multi-stage logic, multi-GPU distribution for model-free PTQ workflows, and expanded model coverage with Nemotron Ultra FP8 examples.
This release changes the example scripts while remaining backwards compatible with previous examples and scripts. See the Transformers v5 section for more information.
Key Highlights ✨
- Transformers v5 Upgrade (#2647): Full integration with Transformers v5, including refactored MoE linearization with load_context for efficient loading, updated model structure handling, and improved tied embeddings support. Maintains LM eval performance across the transition. Note: LLM Compressor no longer supports installation with transformers<5.0.0.
- Simplified Dataset Interface (#2551): Removed legacy multi-split logic, replacing splits={"calibration": "train[:100]"} with the cleaner split="train[:100]" API. Legacy argument usage is deprecated and will be removed in a future release.
- Multi-GPU Model-Free PTQ (#2773): Added support for distributing model-free PTQ jobs across multiple GPUs, parallelizing and speeding up quantization workflows.
- Nemotron Ultra Support (#2803): Added FP8 quantization example for Nemotron Ultra models in the model-free PTQ examples.
Transformers v5
Examples and Model Loading
- Example regexes and recipes have been updated to reflect new model structures introduced by Transformers v5.
- Examples which utilize disk offloading or mixture-of-experts (MoE) calibration now load models with load_context, provided by llmcompressor.utils. This context is a catch-all context and should be used in all scripts for efficient model loading.
    - from compressed_tensors.offload import load_offloaded_model
    - from llmcompressor.modeling.moe.linearize import load_quantizable_moe
    -
    - with load_offloaded_model(), load_quantizable_moe():
    -     model = AutoModelForCausalLM.from_pretrained(model_id)
    + from llmcompressor.utils import load_context
    +
    + with load_context():
    +     model = AutoModelForCausalLM.from_pretrained(model_id)
- dtype now defaults to "auto", so this explicit argument has been removed to reduce verbosity.
    - model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto")
    + model = AutoModelForCausalLM.from_pretrained(model_id)
- from_pretrained no longer supports use_auth_token. This argument has been removed from oneshot.
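To picture what a catch-all loading context means, here is a minimal sketch of one context manager that enters several narrower contexts on the caller's behalf. It is not the implementation of load_context: offload_context, moe_context and catch_all_context are hypothetical stand-ins invented for illustration, and the sketch uses only the Python standard library.

```python
from contextlib import ExitStack, contextmanager

events = []  # records the order in which contexts are entered and exited


@contextmanager
def offload_context():
    # Hypothetical stand-in for a disk-offloading context
    events.append("enter offload")
    try:
        yield
    finally:
        events.append("exit offload")


@contextmanager
def moe_context():
    # Hypothetical stand-in for an MoE-linearization context
    events.append("enter moe")
    try:
        yield
    finally:
        events.append("exit moe")


@contextmanager
def catch_all_context():
    # Enter every sub-context in order; ExitStack unwinds them in reverse
    with ExitStack() as stack:
        stack.enter_context(offload_context())
        stack.enter_context(moe_context())
        yield


with catch_all_context():
    # In a real script, the from_pretrained call would go here
    events.append("load model")
```

With this pattern the calling script needs a single with statement, and sub-contexts can be added or dropped in one place without touching every script.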
Expanded and Refactored MoE Support
Applying quantization to Mixture-of-Experts (MoE) models requires explicit linearization and class overriding in order to efficiently calibrate experts. This logic has been implemented by LLM Compressor through two pathways:
- llmcompressor.modeling.moe.linearize::linearize_moe, which replaces experts modules with linearized and calibration-friendly classes AFTER weights have already been loaded.
- llmcompressor.modeling.moe.linearize::load_quantizable_moe, which replaces experts modules with linearized and calibration-friendly classes BEFORE weights have been loaded. This context is more efficient and reduces runtime during model loading.
Both of these pathways are called as needed by llmcompressor.utils::load_context. These implementations can automatically handle >90% of all model definitions provided by transformers. For unconventional or custom model definitions, see Adding MoE Calibration Support for a New Model.
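As a rough illustration of the idea behind linearization, the sketch below splits a fused expert weight (one stacked array indexed by expert) into one small linear layer per expert, so that each expert can be observed on its own during calibration. This is a toy model in plain Python under the assumption that experts are stored as a stacked weight. The class names FusedExperts, LinearizedExperts and Linear are hypothetical and do not mirror LLM Compressor's classes.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


def matvec(weight: Matrix, x: List[float]) -> List[float]:
    # Plain matrix-vector product: one output value per weight row
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


@dataclass
class Linear:
    # Hypothetical per-expert linear layer (no bias)
    weight: Matrix

    def __call__(self, x: List[float]) -> List[float]:
        return matvec(self.weight, x)


class FusedExperts:
    # Hypothetical fused layout: weights[expert][out][in]
    def __init__(self, weights: List[Matrix]):
        self.weights = weights

    def __call__(self, expert_idx: int, x: List[float]) -> List[float]:
        return matvec(self.weights[expert_idx], x)


class LinearizedExperts:
    # One Linear per expert, built from a copy of the fused weights
    def __init__(self, fused: FusedExperts):
        self.experts = [Linear([row[:] for row in w]) for w in fused.weights]

    def __call__(self, expert_idx: int, x: List[float]) -> List[float]:
        return self.experts[expert_idx](x)


fused = FusedExperts([
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0
    [[2.0, 1.0], [0.0, 3.0]],  # expert 1
])
linearized = LinearizedExperts(fused)
x = [1.0, 2.0]
```

Both layouts compute the same outputs. The linearized one exposes each expert as a separate module, which is the property that per-expert calibration relies on.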
Multi-GPU Model-Free PTQ
Model-free PTQ now supports distributing quantization jobs across multiple GPUs when available. This feature automatically detects available GPUs and parallelizes the quantization workflow, significantly reducing processing time for large models.
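One simple way to picture this kind of parallelization is a round-robin assignment of checkpoint shards to devices. The sketch below is an assumption for illustration only, not the scheduling strategy used by LLM Compressor. The assign_shards helper and the shard file names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def assign_shards(shards: List[str], devices: List[str]) -> Dict[str, List[str]]:
    # Hypothetical helper: deal shards out to devices in round-robin order
    if not devices:
        raise ValueError("at least one device is required")
    plan = defaultdict(list)
    for index, shard in enumerate(shards):
        plan[devices[index % len(devices)]].append(shard)
    return dict(plan)


# Illustrative shard names; real checkpoints may be named differently
shards = [f"model-{i:05d}-of-00005.safetensors" for i in range(1, 6)]
plan = assign_shards(shards, ["cuda:0", "cuda:1"])
```

Each device then processes its own list independently, so wall-clock time shrinks roughly with the number of devices as long as the shards are similar in size.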
Simplified Dataset Interface
The dataset split interface has been refactored to remove legacy multi-stage logic that previously supported separate datasets for training, oneshot, and eval in a single command. Since training and eval tasks are no longer supported in the same command, the API has been simplified.
Old interface:
    oneshot(
        model,
        dataset="ultrachat",
        splits={"calibration": "train_sft[:100]"}
    )
New interface:
    oneshot(
        model,
        dataset="ultrachat",
        split="train_sft[:100]"
    )
The new API is backwards compatible and will issue warnings when using the old dictionary-based splits argument.
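A backwards-compatible shim of this kind could look roughly like the sketch below, which maps a legacy dictionary onto the new single-string argument and emits a deprecation warning. The resolve_split helper is hypothetical and is not the code used by oneshot. The warning category and error handling are assumptions as well.

```python
import warnings
from typing import Dict, Optional


def resolve_split(
    split: Optional[str] = None,
    splits: Optional[Dict[str, str]] = None,
) -> Optional[str]:
    # Hypothetical helper: accept the new `split` string or a legacy `splits` dict
    if splits is None:
        return split
    if split is not None:
        raise ValueError("pass either `split` or `splits`, not both")
    if len(splits) != 1:
        raise ValueError("legacy `splits` must contain exactly one entry")
    warnings.warn(
        "`splits` is deprecated; pass `split` as a string instead",
        DeprecationWarning,
        stacklevel=2,
    )
    # The single dictionary value is the split expression to use
    return next(iter(splits.values()))


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = resolve_split(splits={"calibration": "train_sft[:100]"})

modern = resolve_split(split="train_sft[:100]")
```

Both calls resolve to the same split expression. Only the legacy form triggers a warning.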
Nemotron 3 Ultra Examples
This release adds model-free PTQ examples for NVIDIA's Nemotron-3-Ultra-550B model.
Pre-quantized FP8 checkpoints are available on HuggingFace:
- NVIDIA-Nemotron-3-Ultra-550B-A55B-FP8-dynamic
- NVIDIA-Nemotron-3-Ultra-550B-A55B-FP8-block
See the model-free PTQ examples for usage details.
Breaking Changes
- The minimum transformers version has been bumped up to v5.9
New Contributors
- @JINO-ROHIT made their first contribution in #2773
- @u7k4rs6 made their first contribution in #2779
- @soyr-redhat made their first contribution in #2794
- @arpitkh101 made their first contribution in #2589
- @Priya95715 made their first contribution in #2768
Full Changelog: 0.11.0...0.12.0