nunchaku-tech/nunchaku

Nunchaku is a high-performance inference engine optimized for 4-bit neural networks, as introduced in our paper SVDQuant. For the underlying quantization library, check out DeepCompressor.

Join our user groups on Slack, Discord and WeChat to engage in discussions with the community! More details can be found here. If you have any questions, run into issues, or are interested in contributing, don’t hesitate to reach out!

News

  • [2025-07-13] 🚀 The official Nunchaku documentation is now live! Explore comprehensive guides and resources to help you get started.
  • [2025-06-29] 🔥 Support FLUX.1-Kontext! Try out our example script to see it in action! Our demo is available at this link!
  • [2025-06-01] 🚀 Release v0.3.0! This update adds support for multiple-batch inference, ControlNet-Union-Pro 2.0, initial integration of PuLID, and introduces Double FB Cache. You can now load Nunchaku FLUX models as a single file, and our upgraded 4-bit T5 encoder now matches FP8 T5 in quality!
  • [2025-04-16] 🎥 Released tutorial videos in both English and Chinese to assist with installation and usage.
  • [2025-04-09] 📢 Published the April roadmap and an FAQ to help the community get started and stay up to date with Nunchaku’s development.
  • [2025-04-05] 🚀 Nunchaku v0.2.0 released! This release brings multi-LoRA and ControlNet support with even faster performance powered by FP16 attention and First-Block Cache. We've also added compatibility for 20-series GPUs — Nunchaku is now more accessible than ever!
More
  • [2025-03-07] 🚀 Nunchaku v0.1.4 released! We now support a 4-bit text encoder and per-layer CPU offloading, reducing FLUX's minimum memory requirement to just 4 GiB while maintaining a 2–3× speedup. This update also fixes various issues related to resolution, LoRA, pinned memory, and runtime stability. Check out the release notes for full details!
  • [2025-02-20] 🚀 Support for NVFP4 precision on the NVIDIA RTX 5090! NVFP4 delivers superior image quality compared to INT4, offering a ~3× speedup on the RTX 5090 over BF16. Learn more in our blog, check out the examples for usage, and try our demo online!
  • [2025-02-18] 🔥 Customized LoRA conversion and model quantization instructions are now available! ComfyUI workflows now support customized LoRA, along with FLUX.1-Tools!
  • [2025-02-11] 🎉 SVDQuant has been selected as an ICLR 2025 Spotlight! FLUX.1-tools Gradio demos are now available! Check here for usage details! Our new depth-to-image demo is also online—try it out!
  • [2025-02-04] 🚀 4-bit FLUX.1-tools is here! Enjoy a 2–3× speedup over the original models. Check out the examples for usage. ComfyUI integration is coming soon!
  • [2025-01-23] 🚀 4-bit SANA support is here! Experience a 2–3× speedup compared to the 16-bit model. Check out the usage example and the deployment guide for more details. Explore our live demo at svdquant.mit.edu!
  • [2025-01-22] 🎉 SVDQuant has been accepted to ICLR 2025!
  • [2024-12-08] Added ComfyUI support. Please check mit-han-lab/ComfyUI-nunchaku for usage.
  • [2024-11-07] 🔥 Our latest W4A4 diffusion model quantization work, SVDQuant, is publicly released! Check out DeepCompressor for the quantization library.

Overview

Nunchaku is a high-performance inference engine for low-bit neural networks. It implements SVDQuant, a post-training quantization technique for 4-bit weights and activations that maintains visual fidelity well. On the 12B FLUX.1-dev, it achieves a 3.6× memory reduction compared to the BF16 model. By eliminating CPU offloading, it delivers an 8.7× speedup over the 16-bit model on a 16GB laptop RTX 4090 GPU, running 3× faster than the NF4 W4A16 baseline. On PixArt-Σ, it demonstrates significantly superior visual quality over other W4A4 and even W4A8 baselines. "E2E" means end-to-end latency, including the text encoder and VAE decoder.
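The 3.6× figure can be sanity-checked with quick arithmetic. Note that attributing the overhead above 4 bits to quantization scales and the 16-bit low-rank branch is our reading of the numbers, not a figure from the paper:

```python
# BF16 stores 16 bits per parameter; a 3.6x smaller model implies an
# effective per-parameter footprint of roughly 16 / 3.6 bits, i.e. the
# 4-bit weights plus the overhead of scales and the 16-bit low-rank branch.
bf16_bits = 16
reduction = 3.6
effective_bits = bf16_bits / reduction
print(f"{effective_bits:.2f} bits/param")  # ~4.44 bits per parameter
```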

SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
Muyang Li*, Yujun Lin*, Zhekai Zhang*, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han
MIT, NVIDIA, CMU, Princeton, UC Berkeley, SJTU, and Pika Labs


Method

Quantization Method -- SVDQuant

Overview of SVDQuant. Stage 1: Originally, both the activation $\boldsymbol{X}$ and weights $\boldsymbol{W}$ contain outliers, making 4-bit quantization challenging. Stage 2: We migrate the outliers from activations to weights, resulting in the updated activation $\hat{\boldsymbol{X}}$ and weights $\hat{\boldsymbol{W}}$. While $\hat{\boldsymbol{X}}$ becomes easier to quantize, $\hat{\boldsymbol{W}}$ now becomes more difficult. Stage 3: SVDQuant further decomposes $\hat{\boldsymbol{W}}$ into a low-rank component $\boldsymbol{L}_1\boldsymbol{L}_2$ and a residual $\hat{\boldsymbol{W}}-\boldsymbol{L}_1\boldsymbol{L}_2$ with SVD. Thus, the quantization difficulty is alleviated by the low-rank branch, which runs at 16-bit precision.
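The three stages above can be sketched numerically. This is a toy NumPy illustration, not Nunchaku's actual kernels: the per-tensor quantizer, the square-root smoothing scale, and the rank are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_4bit(t):
    # Toy symmetric per-tensor 4-bit quantizer (real SVDQuant quantizes per-group);
    # returns the dequantized tensor so we can measure the error it introduces.
    scale = np.abs(t).max() / 7.0
    return np.clip(np.round(t / scale), -8, 7) * scale

# Toy activation X and weight W, each with an injected outlier
X = rng.normal(size=(64, 128))
X[:, 0] *= 50.0                      # outlier channel in the activations
W = rng.normal(size=(128, 128))
W[5, :] *= 30.0                      # outlier row in the weights

# Stage 2: migrate activation outliers into the weights with a per-channel
# scale s, so that X @ W == (X / s) @ (s[:, None] * W) exactly
s = np.sqrt(np.abs(X).max(axis=0) / np.abs(W).max(axis=1))
X_hat, W_hat = X / s, s[:, None] * W

# Stage 3: peel off a rank-r component L1 @ L2 via SVD; quantize only the residual
U, S, Vt = np.linalg.svd(W_hat, full_matrices=False)
r = 16
L1, L2 = U[:, :r] * S[:r], Vt[:r]
residual = W_hat - L1 @ L2

ref = X @ W
naive = quantize_4bit(X) @ quantize_4bit(W)          # quantize everything directly
svdq = (quantize_4bit(X_hat) @ quantize_4bit(residual)
        + X_hat @ (L1 @ L2))                         # 16-bit low-rank branch
err_naive = np.linalg.norm(naive - ref) / np.linalg.norm(ref)
err_svdq = np.linalg.norm(svdq - ref) / np.linalg.norm(ref)
print(err_naive, err_svdq)  # the low-rank branch should cut the error noticeably
```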

Nunchaku Engine Design

(a) Naïvely running the low-rank branch with rank 32 introduces a 57% latency overhead due to the extra read of 16-bit inputs in the Down Projection and the extra write of 16-bit outputs in the Up Projection. Nunchaku optimizes this overhead with kernel fusion. (b) The Down Projection and Quantize kernels use the same input, while the Up Projection and 4-Bit Compute kernels share the same output. To reduce data-movement overhead, we fuse the first two kernels together, and likewise the latter two.
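The fusion in (b) is a rearrangement of the same dataflow: the down-projection shares one read of the input with the quantizer, and the up-projection accumulates into the 4-bit GEMM's output. The sketch below illustrates only this equivalence in plain NumPy; kernel names and shapes are stand-ins, and `Wq` stands in for the dequantized 4-bit residual weight.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 128))                        # 16-bit input activation
L1, L2 = rng.normal(size=(128, 32)), rng.normal(size=(32, 128))  # low-rank branch
Wq = rng.normal(size=(128, 128))                      # stand-in for the 4-bit weight

def quantize(t):
    # Stand-in for the activation Quantize kernel (symmetric 4-bit, dequantized back)
    scale = np.abs(t).max() / 7.0
    return np.clip(np.round(t / scale), -8, 7) * scale

# Unfused: four "kernels", each making its own trip through 16-bit memory
down = X @ L1                    # Down Projection (reads X)
xq = quantize(X)                 # Quantize (reads X a second time)
main = xq @ Wq                   # 4-Bit Compute (writes a 16-bit output)
out_unfused = main + down @ L2   # Up Projection (reads/writes 16-bit again)

# Fused: Down Projection + Quantize share one read of X;
# 4-Bit Compute + Up Projection accumulate into a single output write
def fused_down_quantize(X, L1):
    return X @ L1, quantize(X)        # one pass over X yields both results

def fused_compute_up(xq, Wq, down, L2):
    return xq @ Wq + down @ L2        # up-projection folded into the GEMM epilogue

down_f, xq_f = fused_down_quantize(X, L1)
out_fused = fused_compute_up(xq_f, Wq, down_f, L2)
print(np.allclose(out_unfused, out_fused))  # same math, fewer memory round trips
```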

Performance

SVDQuant reduces the 12B FLUX.1 model size by 3.6× and cuts the 16-bit model's memory usage by 3.5×. With Nunchaku, our INT4 model runs 3.0× faster than the NF4 W4A16 baseline on both desktop and laptop NVIDIA RTX 4090 GPUs. Notably, on the laptop 4090, it achieves a total 10.1× speedup by eliminating CPU offloading. Our NVFP4 model is also 3.1× faster than both BF16 and NF4 on the RTX 5090 GPU.

Getting Started

Roadmap

Please check here for the summer roadmap.

Contact Us

For enterprises interested in adopting SVDQuant or Nunchaku, including technical consulting, sponsorship opportunities, or partnership inquiries, please contact us at muyangli@mit.edu.

Related Projects

Citation

If you find Nunchaku useful or relevant to your research, please cite our paper:

@inproceedings{
  li2024svdquant,
  title={SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models},
  author={Li*, Muyang and Lin*, Yujun and Zhang*, Zhekai and Cai, Tianle and Li, Xiuyu and Guo, Junxian and Xie, Enze and Meng, Chenlin and Zhu, Jun-Yan and Han, Song},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}

Acknowledgments

We thank MIT-IBM Watson AI Lab, MIT and Amazon Science Hub, MIT AI Hardware Program, National Science Foundation, Packard Foundation, Dell, LG, Hyundai, and Samsung for supporting this research. We thank NVIDIA for donating the DGX server.

We use img2img-turbo to train the sketch-to-image LoRA. Our text-to-image and image-to-image UIs are built upon playground-v2.5 and img2img-turbo, respectively. Our safety checker is borrowed from hart.

Nunchaku is also inspired by many open-source libraries, including (but not limited to) TensorRT-LLM, vLLM, QServe, AWQ, FlashAttention-2, and Atom.

Star History

Star History Chart