LLM Compressor v0.11.0

This release focuses on distributed computing enhancements, quantization lifecycle improvements, and expanded model support. Major highlights include DDP support for AWQ and SmoothQuant with significant speedups (up to about 3.3x on 4 GPUs), a comprehensive refactor of the Compressed Tensors API, and observer and lifecycle refactors that simplify quantization workflows. New model support includes Qwen 3.5/3.6, Gemma 4, Kimi K2.6, and experimental DeepSeek-V4 support, along with quantized checkpoints.

Note: LLM Compressor v0.11.0 removes support for Sparse24 formats and sparse model compression. This decision was based on limited community interest and maintainability concerns. Support for sparse compression may be reintroduced in a future release. For Sparse24 compression support, please use LLM Compressor v0.10.0.2.

Key Highlights ✨

DDP Performance: AWQ and SmoothQuant now support DDP. With 4 GPUs, AWQ runs 2.9-3.2x faster with up to 51% less memory per GPU, and SmoothQuant runs 3.28x faster.
Compressed Tensors Refactor: Simplified API with clear entrypoints, removed sparsity support, and a streamlined compressor architecture.
Quantization Lifecycle: Unified calibration timing (now at epoch end) and decoupled observation from qparam calculation.
Extended Quantization Support: GPTQ actorder now works across all weight strategies, and AWQ has been refactored for NVFP4 compatibility.
Converter Entrypoint: New tool and framework for converting AutoAWQ and ModelOpt NVFP4 checkpoints to Compressed-Tensors, as well as decompressing Compressed-Tensors checkpoints.
Large Model Support: DDP+GPTQ+disk offloading fixes for models like Qwen3-VL-235B-A22B.

DDP and Lifecycle Updates

AWQ DDP Support: Added DDP (Distributed Data Parallel) functionality for AWQ resulting in significant speedups and reduced GPU memory usage:

Model	Single-GPU Time	DDP Time (4 GPUs)	Speedup	Single-GPU Memory	DDP Memory	Memory Reduction
Llama-3-8B	7.02 min	2.40 min	2.9x	10.20 GB	4.99 GB	51%
Llama-3-8B (masked)	8.13 min	2.67 min	3.0x	10.14 GB	4.98 GB	51%
Qwen3-30B-A3B	459.65 min	143.90 min	3.2x	4.13 GB	3.36 GB	19%

Accuracy metrics remain comparable between DDP and single-GPU approaches.
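The Speedup and Memory Reduction columns in the AWQ table follow directly from the time and memory columns. The short check below recomputes them; the figures are copied from the table, and the variable and function names are only illustrative.

```python
# (single-GPU minutes, 4-GPU DDP minutes, single-GPU GB, DDP GB), copied from the table
awq_rows = {
    "Llama-3-8B": (7.02, 2.40, 10.20, 4.99),
    "Llama-3-8B (masked)": (8.13, 2.67, 10.14, 4.98),
    "Qwen3-30B-A3B": (459.65, 143.90, 4.13, 3.36),
}


def summarize(single_min, ddp_min, single_gb, ddp_gb):
    # Speedup is the ratio of wall-clock times; memory reduction is the
    # relative drop in per-GPU memory, expressed as a whole percent.
    speedup = round(single_min / ddp_min, 1)
    memory_reduction_pct = round(100 * (1 - ddp_gb / single_gb))
    return speedup, memory_reduction_pct


summary = {model: summarize(*row) for model, row in awq_rows.items()}
for model, (speedup, reduction) in summary.items():
    print(f"{model}: {speedup}x faster, {reduction}% less memory per GPU")
```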

SmoothQuant DDP Support: Added DDP support for SmoothQuant resulting in significant speedups:

GPUs	Total Time	Peak GPU Mem	Speedup
1 GPU	94.1 min	8.93 GB	1.00x
2 GPU	58.7 min	7.06 GB	1.60x
4 GPU	28.7 min	7.06 GB	3.28x

Special thanks to @dzhengAP for their excellent contributions to the SmoothQuantModifier!

Observer Refactor: Decoupled observation from quantization parameter calculation, allowing natural separation of responsibilities where observer.forward() records statistics about observed tensors while get_qparams() performs qparam calculation. This simplifies design and expands the types of observers supported. Key changes:
- Observers now have update_statistics_from_observed() for forward pass and get_qparams() for parameter calculation
- Global scale logic now entirely contained in observers (observers have references to fused weight observers for global_scale calculation)
- Removed module references from observers, simplified observer utilities
- Fixed imatrix observer synchronization in DDP and imatrix+global_scale bug
- Consolidated synchronization logic with new activation_statistics concept for activation observers and one weight observer
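To illustrate the split described above, here is a minimal toy observer in plain Python. Only the method names update_statistics_from_observed() and get_qparams() come from these notes; the class, its fields, and the asymmetric min/max formula are assumptions made for illustration and are not the library's implementation.

```python
from dataclasses import dataclass


@dataclass
class ToyMinMaxObserver:
    """Toy observer: statistics are recorded separately from qparam calculation."""

    num_bits: int = 8
    min_val: float = float("inf")
    max_val: float = float("-inf")

    def update_statistics_from_observed(self, values):
        # Forward-pass step: only accumulate the running min/max.
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))

    def get_qparams(self):
        # Qparam step: derive an asymmetric scale/zero-point from the statistics.
        qmax = 2**self.num_bits - 1
        # Include zero in the range so that 0.0 is exactly representable.
        lo, hi = min(self.min_val, 0.0), max(self.max_val, 0.0)
        scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for an all-zero range
        zero_point = round(-lo / scale)
        return scale, zero_point


observer = ToyMinMaxObserver()
for batch in ([-1.0, 0.5], [0.25, 3.0]):
    observer.update_statistics_from_observed(batch)

# Qparams are computed once, after all batches have been observed.
scale, zero_point = observer.get_qparams()
```

Because statistics and qparams are separate steps, the statistics can be modified (for example, synchronized across ranks) before any qparam is derived.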
DDP Support for Activation Quantization: Added DDP support for quantization schemes with activation quantization. Extended QuantizationModifier to support distributed activation calibration via PR #2391 (merged Mar 27, 2026).

Implementation: At SEQUENTIAL_EPOCH_END and CALIBRATION_EPOCH_END, activation observer min/max values are all-reduced across ranks. Scale/zero-point are then recomputed from the global statistics so all ranks have identical quantization parameters.

Key Changes:
- Added synchronize(), recompute_qparams(), recompute_global_scale() to Observer base class
- Added sync_activation_observers() to QuantizationMixin (shared by QuantizationModifier and GPTQModifier)
- Batched all async dist.all_reduce operations so they are waited on once, matching the GPTQ DDP pattern
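The synchronization step can be pictured with a small simulation. The sketch below does not use torch.distributed; it fakes four ranks with Python lists, and the helper names all_reduce and qparams_from_range are hypothetical. It only shows why reducing min with MIN and max with MAX, then recomputing, leaves every rank with identical quantization parameters.

```python
def all_reduce(values, op):
    """Simulate an all-reduce: every rank ends up with the same reduced value."""
    reduced = op(values)
    return [reduced for _ in values]


def qparams_from_range(min_val, max_val, num_bits=8):
    # Asymmetric scale/zero-point from a (min, max) range that includes zero.
    qmax = 2**num_bits - 1
    lo, hi = min(min_val, 0.0), max(max_val, 0.0)
    scale = (hi - lo) / qmax or 1.0
    return scale, round(-lo / scale)


# Local activation statistics seen by each of 4 ranks on its own data shard.
local_min = [-0.8, -1.5, -0.2, -1.1]
local_max = [2.0, 1.7, 3.1, 2.4]

# At epoch end: reduce min with MIN and max with MAX across ranks...
global_min = all_reduce(local_min, min)
global_max = all_reduce(local_max, max)

# ...then every rank recomputes its qparams from the same global statistics.
per_rank_qparams = [
    qparams_from_range(lo, hi) for lo, hi in zip(global_min, global_max)
]
```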
DDP+GPTQ+Disk Offloading for Large Models: Added fixes and features to enable DDP+GPTQ+disk offloading to work for very large models (e.g., Qwen3-VL-235B-A22B). Key improvements include:
- Reduced shared memory overload and mmap issues for big models with DDP + CPU/disk offloading
- Fixed MoE calibration context to use same offloading as original module (previously reverted to CPU offloading causing issues)
- Only store original modules when needed to avoid mmap issues
- Added synchronization steps during model saving to prevent thread timeout issues
- Added sync points for MoE calibration context to handle NCCL timeout when different threads take varying time on large models
- Fixed NVFP4 DDP support on A100 (NCCL broadcast workaround for FP8)
- Reduced memory requirements of moe_calibration_context by removing retained module references after replacement
Distributed Model Compression: Accelerated the model compression step (bit packing) by assigning modules across ranks and compressing them in parallel. This greatly reduces runtime for large models, with the speedup scaling linearly with the number of GPUs available.
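As a rough picture of how modules might be divided among ranks, the sketch below uses a round-robin split. The actual assignment strategy is not described in these notes, so both the strategy and the assign_modules helper are assumptions.

```python
def assign_modules(module_names, world_size):
    """Round-robin assignment of modules to ranks (hypothetical strategy)."""
    return {rank: module_names[rank::world_size] for rank in range(world_size)}


# Illustrative module names; any list of compressible modules would do.
modules = [f"layers.{i}.mlp.down_proj" for i in range(10)]
assignment = assign_modules(modules, world_size=4)

# Each rank would compress only its own share, in parallel with the others.
for rank, names in assignment.items():
    print(rank, names)
```

With an even split, the largest per-rank share shrinks in proportion to the number of ranks, which is where the near-linear scaling comes from.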
Quantization Lifecycle Refactor: Changed the quantization lifecycle so that weight and activation calibration both happen at epoch end (previously, weight calibration happened at the start for QuantizationModifier but at the end for other modifiers). Benefits include simpler code, faster runtime from reduced onloading and offloading during quantization, and quantization that is now disabled for every modifier during calibration (previously modifier-dependent).
Microscale Calibration Refactor: Refactored calibration for microscale formats that require a fused global_scale calculation. Rather than treating global scale as a generic qparam in the observer with additional post-modifications, the observer is now entirely responsible for global_scale. Observers are now fused (made aware of the other observers with which they share a global_scale) so they can calculate a joint global_scale. Note: this requires that all fused observers have generated statistics through their forward method. This greatly simplifies global_scale handling while maintaining accuracy.
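A toy version of fused observers may help make this concrete. In the sketch below, the class, the fuse helper, and the get_global_scale method are hypothetical. The formula (FP8 E4M3 max times FP4 E2M1 max, divided by the joint absolute maximum) is the convention commonly used for NVFP4 and is assumed here rather than taken from these notes.

```python
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
FP4_E2M1_MAX = 6.0    # largest finite FP4 E2M1 value


class ToyFusedObserver:
    """Toy observer that shares a global_scale with the observers it is fused to."""

    def __init__(self):
        self.absmax = None
        self.fused_with = [self]

    def forward(self, values):
        # Record statistics only; the global scale is computed on demand.
        self.absmax = max(abs(v) for v in values)

    def get_global_scale(self):
        # Every fused observer must have seen data before a joint scale exists.
        if any(obs.absmax is None for obs in self.fused_with):
            raise RuntimeError("all fused observers must run forward() first")
        joint_absmax = max(obs.absmax for obs in self.fused_with)
        return FP8_E4M3_MAX * FP4_E2M1_MAX / joint_absmax


def fuse(observers):
    # Make each observer aware of the full group it shares a global_scale with.
    for obs in observers:
        obs.fused_with = list(observers)


q, k, v = ToyFusedObserver(), ToyFusedObserver(), ToyFusedObserver()
fuse([q, k, v])
q.forward([0.5, -1.0])
k.forward([2.0, 0.1])
v.forward([-4.0, 3.0])
shared_scale = q.get_global_scale()
```

Here all three observers return the same value because the scale is derived from the joint maximum (4.0) rather than from each observer's own statistics.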

New Model Support

Qwen 3.5 and Qwen 3.6: Calibration support has been added as part of this release with instructions summarized in the documentation for Qwen3.5 and Qwen3.6. Several quantized checkpoints have also been released, including:
- RedHatAI/Qwen3.6-35B-A3B-NVFP4
- RedHatAI/Qwen3.6-35B-A3B-FP8-dynamic
Gemma 4: Calibration support has been added with details listed in the documentation for Gemma 4. Several quantized checkpoints have also been released, including:
Kimi K2.6: This model was originally released in W4A16 packed quantized format. Decompression support has been enabled through the converters entrypoint and calibration support has also been added with details listed in the documentation for Kimi K2.6. Quantized checkpoints have also been released:
- RedHatAI/Kimi-K2.6-NVFP4
- RedHatAI/Kimi-K2.6-FP8-BLOCK
DeepSeek-V4: Added support for quantizing the DeepSeek-V4 Flash and Pro models. These features are currently available via experimental branches and are planned for integration in the next release of LLM Compressor. More details can be found here. Sample checkpoint:
- RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8

Converter Entrypoint (Compressed-Tensors)

Model Format Conversion: Added Converter entrypoint to enable decompression and conversion of models from various packed quantized formats to Compressed-Tensors format. Currently supports:
- AutoAWQ to CT conversion
- Compressed-Tensors Decompression
- ModelOpt NVFP4 to CT Conversion
- FP8 Block Decompression (popularized by DeepSeek)
More details: https://docs.vllm.ai/projects/llm-compressor/en/latest/guides/entrypoints/convert/

Compressed Tensors

Compressed Tensors Refactor: Major simplification of compression API and architecture to reduce complexity, define easy-to-use APIs for module and state dict compression/decompression, and prepare for distributed parallel compression.

Architectural Changes:
- Simplified compressors: Removed the separate "quantization" and "sparsity" compressors and their hierarchy. Each format now has exactly one compressor. Compressors define which quantization schemes they support, and each module is compressed by the first compressor, in priority order, that supports it.
- Module compression API: Each Compressor class implements Compressor.can_compress(), Compressor.compress(), and Compressor.decompress() methods. You can use the top-level compress_module() and decompress_module() functions to automatically infer which compressor to use for a module.
- Removed CompressedLinear wrapper class: ModelCompressor now adds a pre_forward hook that triggers decompression on the first forward pass.
- Added QuantizationStatus.DECOMPRESSED state: the weight has already been permanently quantized-dequantized (qdq), as distinct from FROZEN, which still performs weight qdq during the forward pass for emulation.
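The priority-order dispatch described above can be sketched with two toy compressors. Only the method and function names (can_compress, compress, decompress, compress_module, decompress_module) come from these notes. The signatures, the dict-based "modules", and both toy compressor classes are assumptions for illustration and do not reflect the real Compressed-Tensors classes.

```python
class PackedInt4Compressor:
    name = "toy-packed-int4"

    def can_compress(self, module):
        return module["num_bits"] == 4 and len(module["weight"]) % 2 == 0

    def compress(self, module):
        w = module["weight"]  # unsigned 4-bit integers in [0, 15]
        # Pack two 4-bit values into each byte (low nibble first).
        packed = [w[i] | (w[i + 1] << 4) for i in range(0, len(w), 2)]
        return {"num_bits": 4, "weight": packed, "format": self.name}

    def decompress(self, module):
        unpacked = []
        for byte in module["weight"]:
            unpacked.extend([byte & 0xF, byte >> 4])
        return {"num_bits": 4, "weight": unpacked}


class PassthroughCompressor:
    name = "toy-passthrough"

    def can_compress(self, module):
        return True  # fallback: accepts anything

    def compress(self, module):
        return {**module, "weight": list(module["weight"]), "format": self.name}

    def decompress(self, module):
        return {"num_bits": module["num_bits"], "weight": list(module["weight"])}


# Priority order: the first compressor that supports a module wins.
COMPRESSORS = [PackedInt4Compressor(), PassthroughCompressor()]


def compress_module(module):
    compressor = next(c for c in COMPRESSORS if c.can_compress(module))
    return compressor.compress(module)


def decompress_module(module):
    compressor = next(c for c in COMPRESSORS if c.name == module["format"])
    return compressor.decompress(module)


int4 = {"num_bits": 4, "weight": [1, 15, 7, 0]}
int8 = {"num_bits": 8, "weight": [200, 3]}
compressed4, compressed8 = compress_module(int4), compress_module(int8)
```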
Breaking Changes:
- Removed all sparsity compressors and deprecated sparsity-related config arguments
- Removed CompressedLinear class
- marlin24 sparsity no longer supported

AWQ Refactor

Transform-Based Modifier: Refactored AWQ to be a transform-based modifier (a modifier that modifies weights in place without applying quantization) as part of an ongoing effort to make AWQ compatible with more quantization formats, including NVFP4. This keeps AWQ separate from static activation calibration and makes for a cleaner implementation.

GPTQ ActOrder Support

Extended Activation Ordering: Extended GPTQ activation ordering (actorder) support beyond the original GROUP-only strategy to work across all weight quantization strategies: GROUP, TENSOR_GROUP, CHANNEL, TENSOR, and BLOCK.

Strategy	Modifier-level actorder (before)	After
GROUP	propagated	propagated
TENSOR_GROUP	silently ignored	propagated
CHANNEL	silently ignored	propagated; GROUP → fallback
TENSOR	silently ignored	propagated; GROUP → fallback
BLOCK	silently ignored	propagated; GROUP → fallback
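For readers unfamiliar with activation ordering, the general GPTQ idea is to process weight columns in order of decreasing Hessian diagonal, so the most activation-sensitive columns are quantized first, and then to undo the permutation. The sketch below shows only that permutation bookkeeping in plain Python. The helper names and sample values are illustrative, and it does not show how LLM Compressor applies actorder for each strategy.

```python
def actorder_permutation(hessian_diag):
    """Column order with the largest Hessian diagonal entries first."""
    return sorted(
        range(len(hessian_diag)), key=lambda i: hessian_diag[i], reverse=True
    )


def invert_permutation(perm):
    # inverse[old_index] = new_index, so indexing by it restores the original order.
    inverse = [0] * len(perm)
    for new_index, old_index in enumerate(perm):
        inverse[old_index] = new_index
    return inverse


hessian_diag = [0.2, 3.0, 0.7, 1.5]  # illustrative per-column sensitivities
weight_row = [10, 11, 12, 13]        # stand-in for one row of weights

perm = actorder_permutation(hessian_diag)
permuted = [weight_row[i] for i in perm]  # columns in quantization order

# ...quantization would run on `permuted` here...

inverse = invert_permutation(perm)
restored = [permuted[i] for i in inverse]  # back in the original column order
```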

MXFP4 Linear Quantization

FlashInfer Backend Support: Enabled MXFP4 Linear Quantization support in vLLM using the FlashInfer backend, allowing end-to-end support for MXFP4 checkpoints beyond Marlin. Benchmark results on Meta-Llama-3-8B-Instruct (gsm8k_cot_llama):

Backend	Flexible-Extract	Strict-Match
FlashInfer	0.6892 ± 0.0127	0.6846 ± 0.0128
Marlin (VLLM_MXFP4_USE_MARLIN=1)	0.7604 ± 0.0118	0.7551 ± 0.0118
Dense (meta-llama/Meta-Llama-3-8B-Instruct)	0.7998 ± 0.0110	0.7991 ± 0.0110

Model Saving

MTP Layer Saving: Fixed an issue where final checkpoints omitted MTP (multi-token prediction) layers because those layers are not loaded through the AutoModel pathway. Model saving now detects the presence of MTP layers and updates the safetensors in the final checkpoint accordingly.

New Contributors

@dik654 made their first contribution in #2368
@Yatimai made their first contribution in #2418
@omkar-334 made their first contribution in #2414
@JinRiYao2001 made their first contribution in #2443
@rtj1 made their first contribution in #2330
@2imi9 made their first contribution in #2467
@dzhengAP made their first contribution in #2471
@markypizz made their first contribution in #2503
@changjonathanc made their first contribution in #2477
@zeel2104 made their first contribution in #2533
@wiliyam made their first contribution in #2556
@vkduy made their first contribution in #2567
@xingzihai made their first contribution in #2555
@liwei109 made their first contribution in #2464
@aayush7511 made their first contribution in #2493
@sakunkun made their first contribution in #2597
@Nottlespike made their first contribution in #2547
@Alone-wl made their first contribution in #2634
@elwhyjay made their first contribution in #2609
@jayakumarpujar made their first contribution in #2639
@prdeepakbabu made their first contribution in #2644
@rk119 made their first contribution in #2616
@changwangss made their first contribution in #2688
@juju812 made their first contribution in #2635
@dshane1903 made their first contribution in #2704
@AsadShahid04 made their first contribution in #2719
@orestis-z made their first contribution in #2725

Full Changelog: 0.10.0...0.11.0
