🚀 Cache-DiT v1.5.0 Release Notes
Release Date: 2026-06-16
Comparison: v1.3.0...v1.5.0 (176 commits)
Full Changelog: GitHub Releases
📋 Overview
Cache-DiT v1.5.0 is a major feature release spanning 3 months (2026-03-12 ~ 2026-06-16), covering 176 PRs. This release is organized around four core modules: SVDQuant W4A4 Quantization (PTQ/DQ/NVFP4/Converter CLI), DMD Calibrator (exponential-basis Dynamic Mode Decomposition), Bucket-style Layerwise CPU Offload (compute-communication overlapped offloading), and Ray Wrapper (transparent distributed inference). Additional highlights include FLUX.2-klein-kv series support, full CUDA Graph integration, extensive kernel refactoring, quantization enhancements, parallelism framework improvements, and comprehensively updated documentation.
✨ Core Highlights
1. 💎 SVDQuant W4A4 Quantization
Cache-DiT v1.5.0 natively integrates a complete SVDQuant W4A4 quantization workflow — the most significant feature of this release. Instead of relying on third-party libraries, users can now perform end-to-end W4A4 quantization directly through Cache-DiT's cache_dit.quantize() / cache_dit.load() API.
📊 PTQ (Post-Training Quantization)
Supports svdq_int4_r{rank} and svdq_nvfp4_r{rank} quant types:
- INT4 PTQ (≥sm80): Collect activation statistics via calibrate_fn → SVD low-rank decomposition → INT4 packing → runtime W4A4 GEMM. Supports three calibration precision levels: low (recommended default, ~18× speedup), medium, high. Serialize to {quant_type}.safetensors + quant_config.json; restore via cache_dit.load().
- NVFP4 PTQ (≥sm120, Blackwell): Designed for RTX 5090 and other Blackwell GPUs. Currently only runtime_kernel="v1" is supported for NVFP4.
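The core idea behind the "SVD low-rank decomposition → INT4 packing" steps can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: it uses a rank-1 factor found by power iteration instead of a full SVD, a single per-tensor scale, and no activation smoothing. The helper names (`rank1_factor`, `quantize_int4`, `svd_lowrank_int4`) are hypothetical and are not part of Cache-DiT.

```python
import math


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rank1_factor(W, iters=50):
    """Return (u, v) with W ~ outer(u, v), found by power iteration."""
    Wt = [list(col) for col in zip(*W)]
    v = [1.0] * len(W[0])
    u = [0.0] * len(W)
    for _ in range(iters):
        u = matvec(W, v)
        norm = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / norm for x in u]
        v = matvec(Wt, u)  # v carries the singular value
    return u, v


def quantize_int4(R):
    """Symmetric per-tensor INT4: integer codes in [-8, 7] plus one scale."""
    max_abs = max(abs(x) for row in R for x in row)
    scale = max(max_abs, 1e-12) / 7.0
    codes = [[max(-8, min(7, round(x / scale))) for x in row] for row in R]
    return codes, scale


def svd_lowrank_int4(W):
    """Keep a rank-1 branch in full precision, quantize only the residual."""
    u, v = rank1_factor(W)
    low_rank = [[ui * vj for vj in v] for ui in u]
    residual = [[w - l for w, l in zip(wr, lr)] for wr, lr in zip(W, low_rank)]
    codes, scale = quantize_int4(residual)
    recon = [[l + c * scale for l, c in zip(lr, cr)]
             for lr, cr in zip(low_rank, codes)]
    return codes, scale, residual, recon


# Toy weight: a dominant rank-1 structure plus a small perturbation.
a = [1.0, 2.0, 3.0, 4.0]
b = [4.0, 3.0, 2.0, 1.0]
W = [[ai * bj for bj in b] for ai in a]
W[0][0] += 0.01

codes, scale, residual, recon = svd_lowrank_int4(W)
max_residual = max(abs(x) for row in residual for x in row)
max_error = max(abs(w - r) for wr, rr in zip(W, recon) for w, r in zip(wr, rr))
```

Because the low-rank branch absorbs the large-magnitude structure, the residual that reaches the 4-bit quantizer is small, so the quantization step (and therefore the rounding error) shrinks accordingly.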
Performance (FLUX.2-klein-4B, 1024×1024, L20):
| Stage | Latency (s) | Memory (GiB) | Transformer Weight (GiB) |
|---|---|---|---|
| BF16 baseline | 2.13 | 17.32 | 7.22 |
| SVDQuant INT4 | 1.24 | 12.39 | 2.28 |
| SVDQuant + compile | 1.02 | 12.39 | 2.28 |
- Transformer weight reduction: ~3.2× compression (7.22 → 2.28 GiB)
- End-to-end latency: ~1.7× speedup (2.13 → 1.24s), ~2.1× with compile (2.13 → 1.02s)
- PSNR > 29 dB, near-lossless visual quality
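The ratios quoted in the bullets follow directly from the table. A quick check, using only the numbers reported above (the variable names are illustrative):

```python
# Figures copied from the FLUX.2-klein-4B / L20 table above.
bf16 = {"latency_s": 2.13, "memory_gib": 17.32, "weight_gib": 7.22}
int4 = {"latency_s": 1.24, "memory_gib": 12.39, "weight_gib": 2.28}
int4_compile = {"latency_s": 1.02, "memory_gib": 12.39, "weight_gib": 2.28}

weight_compression = bf16["weight_gib"] / int4["weight_gib"]
speedup = bf16["latency_s"] / int4["latency_s"]
speedup_compile = bf16["latency_s"] / int4_compile["latency_s"]
memory_saved_gib = bf16["memory_gib"] - int4["memory_gib"]

print(f"weights: {weight_compression:.1f}x smaller")
print(f"latency: {speedup:.1f}x, with compile {speedup_compile:.1f}x")
print(f"peak memory saved: {memory_saved_gib:.2f} GiB")
```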
NVFP4 Performance (RTX 5090, FLUX.2-klein-4B, 1024×1024):
| Stage | Latency (s) | Speedup | Memory (GiB) |
|---|---|---|---|
| BF16 baseline | 0.97 | 1.00× | 17.32 |
| NVFP4 PTQ | 0.58 | 1.69× | 12.50 |
| NVFP4 + compile | 0.47 | 2.05× | 12.50 |
⚡ DQ (Dynamic Quantization)
Zero-calibration quantization via _dq suffix types (e.g., svdq_int4_r128_dq):
- identity (default): Apply SVD low-rank decomposition directly to the original weight matrix — no calibration, no serialization.
- weight / weight_inv: Weight-statistics-only heuristic smooth strategies (experimental).
- few_shot: Collect a small number of real inference forward passes at runtime, then quantize in-place with configurable relaxation strategies (7 strategies: auto / stable_auto / power / log / rank / top / fixed). Supports few_shot_auto_compile for deferred compilation after quantization.
DQ Performance (FLUX.2-klein-4B, 1024×1024, identity, rank=128): 1.28s, PSNR 28.71 dB.
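The quant type strings encode the data format, the low-rank rank, and whether the type is dynamic. The sketch below shows how such a name could be decomposed. It is a hypothetical helper, not Cache-DiT's parser; in particular, treating the hyphenated CLI spelling (`svdq-int4-r128-dq`) as equivalent to the underscore spelling is an assumption, and the minimum compute capabilities are simply the ones listed in the PTQ section.

```python
import re
from typing import NamedTuple


class SVDQType(NamedTuple):
    dtype: str      # "int4" or "nvfp4"
    rank: int       # rank of the low-rank branch
    dynamic: bool   # True for the calibration-free "_dq" variants


# Minimum SM version per format, as stated in the PTQ section.
MIN_SM = {"int4": 80, "nvfp4": 120}

_PATTERN = re.compile(r"^svdq_(int4|nvfp4)_r(\d+)(_dq)?$")


def parse_quant_type(name: str) -> SVDQType:
    """Split a name such as 'svdq_int4_r128_dq' into its components."""
    normalized = name.strip().lower().replace("-", "_")
    match = _PATTERN.match(normalized)
    if match is None:
        raise ValueError(f"not an SVDQ quant type: {name!r}")
    return SVDQType(match.group(1), int(match.group(2)),
                    match.group(3) is not None)


parsed = parse_quant_type("svdq_int4_r128_dq")
print(parsed, "needs sm >=", MIN_SM[parsed.dtype])
```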
🔧 SVDQ Converter CLI
New cache-dit-convert command-line tool for one-click model conversion to SVDQ W4A4:
cache-dit-convert --model-path black-forest-labs/FLUX.2-klein-4B \
--save-dir ./FLUX.2-klein-4B-svdq \
--quant-type svdq-int4-r128-dq
Supports INT4/NVFP4 DQ conversion, custom smooth strategies, and multiple precision options. Outputs {quant_type}.safetensors + quant_config.json.
🔀 Fused MLP
New fused_gelu_mlp / fused_gelu_proj passes (enable via svdq_kwargs["fused_mlp"]=True), fusing the first GEMM + GELU activation into a single kernel for reduced kernel launch overhead and intermediate activation memory.
🔗 Parallelism Compatibility (Cache-DiT Exclusive)
SVDQuant in Cache-DiT fully supports Context Parallelism (Ulysses / Ring / USP) and Tensor Parallelism (PyTorch DTensor). Users can layer quantization on top of distributed parallelism for extreme memory compression and throughput gains. SVDQW4A4ShardLinear (dtensor.py) provides native TP sharding support. This is a differentiating capability of Cache-DiT's SVDQuant implementation versus other W4A4 solutions.
⚙️ Quantization Configuration Enhancements
- Regional Quantization (regional_quantize=True + repeated_blocks): Quantize only repeated transformer blocks, keeping embedding layers at full precision.
- Hybrid Precision Plan (precision_plan): Assign different quant types to different sub-layers by name pattern.
- FP8 Per-Tensor Fallback (per_tensor_fallback=True, default on): Auto-fallback from per-row to per-tensor for TP-incompatible layers.
- TorchAO Backend Refactor: Cleaner QuantizeBackend enum (AUTO / TORCHAO / CACHE_DIT / NONE).
- Quantize API Refactor: Deprecated legacy kwargs, unified under QuantizeConfig + svdq_kwargs.
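To make the "by name pattern" idea behind a hybrid precision plan concrete, here is a minimal sketch of pattern-based resolution. Everything in it is an assumption made for illustration: the glob-style patterns, the first-match-wins rule, and the `resolve_quant_type` helper are hypothetical and do not describe how Cache-DiT actually interprets `precision_plan`. The two FP8 type names mirror the example given in the Chinese edition of these notes.

```python
from fnmatch import fnmatchcase


def resolve_quant_type(layer_name, precision_plan, default=None):
    """Return the quant type of the first pattern that matches layer_name."""
    for pattern, quant_type in precision_plan.items():
        if fnmatchcase(layer_name, pattern):
            return quant_type
    return default


# Hypothetical plan: per-tensor FP8 for query projections,
# weight-only FP8 for key projections, a default for everything else.
plan = {
    "*attn.to_q": "float8_per_tensor",
    "*attn.to_k": "float8_weight_only",
}

for name in ("transformer_blocks.0.attn.to_q",
             "transformer_blocks.0.attn.to_k",
             "transformer_blocks.0.attn.to_v"):
    print(name, "->", resolve_quant_type(name, plan, default="svdq_int4_r128"))
```

Dictionaries preserve insertion order in Python, so more specific patterns can be listed first when patterns overlap.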
📦 cache-dit-cu13 Pre-built Wheel
Pre-compiled SVDQuant wheel for CUDA 13 users: pip install cache-dit-cu13 — no source build needed.
2. 💾 Bucket-style Layerwise CPU Offload
Cache-DiT v1.5.0 introduces a novel bucket-style layerwise offload mechanism that overcomes the "every layer waits for its own transfer" inefficiency of traditional approaches.
Core Design:
- Bucket Pipeline: Divide target modules into small contiguous buckets; prefetch the next bucket asynchronously while the current one executes.
- Dual Independent Copy Stream Pools: Separate CUDA stream pools for onload (H2D) and offload (D2H).
- Persistent Bins: Distribute the persistent budget evenly across the target sequence.
- Flexible Resource Controls: transfer_buckets, persistent_buckets, persistent_bins, prefetch_limit, max_copy_streams, max_inflight_prefetch_bytes.
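Why prefetching the next bucket helps can be seen with a small timeline model. The sketch below is a simplified simulation, not Cache-DiT's scheduler: it assumes one copy stream, identical buckets, and a hypothetical `prefetch_depth` parameter that says how many buckets may be loaded ahead of the one currently executing.

```python
def pipelined_latency(num_buckets, compute_s, transfer_s, prefetch_depth):
    """Finish time for num_buckets buckets with a single copy stream.

    prefetch_depth is how many buckets may be loaded ahead of the bucket
    that is currently executing (0 means no overlap at all).
    """
    compute_done = []
    copy_free_at = 0.0
    for i in range(num_buckets):
        # A device slot frees up once an older bucket has finished computing.
        blocker = i - prefetch_depth - 1
        slot_free_at = compute_done[blocker] if blocker >= 0 else 0.0
        transfer_done = max(copy_free_at, slot_free_at) + transfer_s
        copy_free_at = transfer_done
        prev_done = compute_done[-1] if compute_done else 0.0
        # A bucket runs once its weights have arrived and its predecessor is done.
        compute_done.append(max(prev_done, transfer_done) + compute_s)
    return compute_done[-1] if compute_done else 0.0


serial = pipelined_latency(4, compute_s=2.0, transfer_s=1.0, prefetch_depth=0)
overlapped = pipelined_latency(4, compute_s=2.0, transfer_s=1.0, prefetch_depth=1)
print(serial, overlapped)  # prints 12.0 9.0
```

With no prefetch every bucket pays compute plus transfer; with one bucket of lookahead only the first transfer is exposed, as long as a transfer is no slower than a bucket's compute.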
Performance (FLUX.1-dev, L20):
| Config | Memory | Latency |
|---|---|---|
| No offload | ~38 GiB | 23.4s |
| Diffusers sequential | ~1 GiB | 335s |
| Layerwise (transfer=4, persistent=32, bins=4) | ~16 GiB | 24.6s |
Only ~1.2s added latency versus no-offload baseline for a 38→16 GiB memory reduction.
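For reference, the trade-off in the table works out as follows (figures copied from the table; the memory values are the approximate ones reported there):

```python
# (memory in GiB, latency in seconds) from the FLUX.1-dev / L20 table.
no_offload = (38.0, 23.4)
sequential = (1.0, 335.0)
layerwise = (16.0, 24.6)

added_latency_s = layerwise[1] - no_offload[1]
overhead_pct = 100 * added_latency_s / no_offload[1]
memory_ratio = no_offload[0] / layerwise[0]
speedup_vs_sequential = sequential[1] / layerwise[1]

print(f"added latency: {added_latency_s:.1f}s ({overhead_pct:.1f}%)")
print(f"memory: {memory_ratio:.3f}x lower than no offload")
print(f"{speedup_vs_sequential:.1f}x faster than sequential offload")
```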
torch.compile Compatible: Apply offload before compile; offload hooks work correctly after compilation.
CLI Quick Start:
python3 -m cache_dit.generate flux \
--layerwise-offload --layerwise-async-transfer \
--layerwise-transfer-buckets 4 --layerwise-persistent-buckets 32 \
--layerwise-persistent-bins 4 --layerwise-prefetch-limit \
--layerwise-max-inflight-prefetch-bytes 8gib --compile
3. 🌩️ Ray Wrapper (Transparent Distributed Inference)
The Ray Wrapper makes distributed inference completely transparent to user code. No torchrun, no dist.init_process_group, no manual model sharding — just use_ray=True, and Cache-DiT handles everything.
Two Wrapper Levels:
| Level | Description | Best For |
|---|---|---|
| Pipeline Wrapper (recommended) | Ray manages the entire pipeline execution | Full feature support (cache, quant, parallelism), simplest, fastest. |
| Transformer Wrapper | Only the transformer runs on Ray workers | Lightweight, but slightly slower |
Key Features:
- ray_transfer_fn: User-defined per-worker model loading, bypassing serialization overhead and custom module class resolution issues.
- ray_use_compile: Automatic per-worker compilation.
- ray_runtime_env: Custom module import handling via PYTHONPATH.
- Supports all parallelism strategies: TP, Ulysses, Ring.
- LoRA support: fuse before enabling (TP requires fused LoRAs).
Performance (FLUX.2-klein-base-9B):
| Config | Latency |
|---|---|
| Baseline (single GPU) | 47.41s |
| Ray TP=2 + compile | 24.57s |
Minimal Example:
cache_dit.enable_cache(
pipe,
parallelism_config=ParallelismConfig(tp_size=2, use_ray=True),
)
image = pipe(prompt="A cat...").images[0]  # Code unchanged
4. 🔮 DMD Calibrator (Dynamic Mode Decomposition)
DMD (Dynamic Mode Decomposition) is an exponential-basis feed-forward calibrator, serving as a drop-in alternative to TaylorSeer's polynomial basis.
Mathematical Principle: The cached feature stream is modeled as a linear dynamical system
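For orientation, the textbook form of Dynamic Mode Decomposition is written out below. It is given as general background, not as a statement of what Cache-DiT implements; the symbols ($x_t$ for the cached feature at step $t$, $A$ for the fitted linear operator, $\Phi$ and $\Lambda$ for its modes and eigenvalues) are assumptions introduced here.

```latex
x_{t+1} \approx A\,x_t, \qquad
A = \Phi\,\Lambda\,\Phi^{\dagger}, \qquad
x_{t+k} \approx \Phi\,\Lambda^{k}\,b, \quad b = \Phi^{\dagger} x_t,
\qquad \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_r)
```

Each mode evolves as $\lambda_i^{k} = e^{k \log \lambda_i}$, which is where the "exponential basis" comes from.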
TaylorSeer vs DMD:
| Aspect | TaylorSeer (Polynomial) | DMD (Exponential) |
|---|---|---|
| Basis | Polynomial | Exponential |
| Extrapolation | Diverges | Conditionally bounded |
| Snapshots needed | 2+ (1st order) | ≥ 4 uniformly spaced |
| Best for | DiT-class denoising (DDPM) | Flow-matching generators (Hunyuan3D, etc.) |
| Noise sensitivity | Low | Moderate (SVD truncation suppresses noise) |
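The difference between the two bases shows up even in a scalar toy problem. The sketch below compares a first-order (linear) extrapolation with a one-dimensional DMD-style fit on a geometrically decaying sequence. It is a hedged illustration only: the function names are hypothetical, the scalar least-squares fit is a simplification of full DMD, and the `ridge` term is merely a loose analogue of the `dmd_ridge` option.

```python
def taylor_first_order(history, steps_ahead):
    """Linear extrapolation from the last two snapshots."""
    slope = history[-1] - history[-2]
    return history[-1] + steps_ahead * slope


def scalar_dmd(history, steps_ahead, ridge=1e-8):
    """Fit x[k+1] ~ lam * x[k] by ridge least squares, then roll forward."""
    num = sum(nxt * cur for cur, nxt in zip(history, history[1:]))
    den = sum(cur * cur for cur in history[:-1]) + ridge
    lam = num / den
    return history[-1] * lam ** steps_ahead


history = [0.5 ** k for k in range(4)]   # 1, 0.5, 0.25, 0.125
truth = 0.5 ** 6                         # value three steps past the history

linear_pred = taylor_first_order(history, steps_ahead=3)
dmd_pred = scalar_dmd(history, steps_ahead=3)
print(f"truth={truth}, linear={linear_pred}, dmd={dmd_pred:.6f}")
```

On this decaying signal the linear extrapolation overshoots below zero, while the exponential fit stays close to the true value because the fitted ratio has magnitude below one.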
Usage:
cache_dit.enable_cache(
pipe,
cache_config=DBCacheConfig(...),
calibrator_config=DMDCalibratorConfig(
dmd_history=6, dmd_rank=0, dmd_ridge=1e-8,
),
)
CLI: python -m cache_dit.generate flux --cache --dmd --dmd-history 6
🔧 Other Enhancements
🧩 FLUX.2-klein-kv Series
📈 CUDA Graph
- Full CUDA Graph support (#942-#952)
- CUDA Graph + fp8 rowwise integration (#952)
- descent_tuning enabled by default (#935)
- compile full-graph CLI option (#946)
🏗️ Kernel & Infrastructure Refactoring
- Triton kernel refactoring (3/N, #907-#909)
- Communication kernels registered as torch.library ops (#905)
- CuTeDSL communication kernels (fp8, #977), merge-attn-states kernel (#973)
- Unified all2all/ring communication API (#985)
- Async Ulysses refactoring (#986)
- Unified ops registration policy (#931, #939, #967)
🔗 Parallelism Improvements
- record_plans for TP/TE-P planners (#916-#917)
- sub cp_plan support (#989)
- CP & VAE-P planner refactoring for better logging (#918)
⚡ Compilation Optimizations
- Removed manual graph breaks (#885)
- CUDA Graph for dynamic compile (#951)
- SVDQuant + compile compatibility; offload + compile (#1014)
- Log suppression improvements (#934, #923)
🤝 Community Integrations
- MindIE-SD: Huawei Ascend NPU attention/compilation backend support (@blian6, #1004)
- TensorRT-LLM community link (#991)
📦 Dependency Updates
- PyTorch → 2.11.0 (#900)
- uv for dependency management (#992)
- Python version compatibility fix (@FNGarvin, #1025)
- TorchAO ≥ 0.17.0
📝 Documentation
- New/reworked docs: QUANTIZATION.md, OFFLOAD.md, RAY.md, CACHE_API.md (DMD), CUDA_GRAPH.md, COMPILE.md
- Formatting and typo fixes (#886-#887, #899, #903, #913)
- FAQ update: Flash Attention 2 install guide (#915)
- Community link fixes (#928, #940)
⚠️ Breaking Changes
| Change | PR | Migration |
|---|---|---|
| Serving module deprecated | #933 | Migrate to SGLang Diffusion or vLLM-Omni |
| Native Diffusers parallelism backend deprecated | #1017 | Use Cache-DiT native parallelism backends (better performance) |
| Quantize API legacy kwargs deprecated | #910-#911 | Migrate to unified QuantizeConfig + svdq_kwargs. Legacy grad_ckpt, reorder_before_quantize etc. removed |
👥 New Contributors
Thank you to the new contributors who joined the Cache-DiT community:
- @blian6 — MindIE-SD NPU attention/compilation backend support
- @FNGarvin — Python version compatibility fix
- @Archerkattri — DMD Calibrator
🙏 Contributors
Thank you to all contributors for this release: @DefTruth, @Archerkattri, @FNGarvin, @blian6.
For the full list of changes, see GitHub Release v1.5.0 and the full changelog.
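🇨🇳 Chinese Version
📋 Overview
Cache-DiT v1.5.0 is a major feature update spanning 3 months (2026-03-12 ~ 2026-06-16) and covering 176 PRs. This release is built around four core modules: SVDQuant W4A4 Quantization (PTQ/DQ/NVFP4/Converter CLI), DMD Calibrator (a Dynamic Mode Decomposition calibrator with an exponential basis), Bucket-style Layerwise CPU Offload (layerwise offloading with compute-communication overlap), and Ray Wrapper (a wrapper that makes distributed inference transparent). It also includes FLUX.2-klein-kv series support, full CUDA Graph integration, extensive kernel refactoring and quantization enhancements, parallelism framework improvements, and comprehensively updated documentation.
✨ Core Highlights
1. 💎 SVDQuant W4A4 Quantization
Cache-DiT v1.5.0 natively integrates a complete SVDQuant W4A4 quantization workflow, the most important feature of this release. Rather than relying on third-party libraries, users can now run the full W4A4 pipeline, from calibration to inference, directly through Cache-DiT's cache_dit.quantize() / cache_dit.load() API.
📊 PTQ (Post-Training Quantization)
Supports two quant types, svdq_int4_r{rank} and svdq_nvfp4_r{rank}:
- INT4 PTQ (≥sm80): Collect activation statistics via calibrate_fn → SVD low-rank decomposition → INT4 packing → runtime W4A4 GEMM. Supports three calibration precision levels: low (recommended default, ~18x speedup), medium, high. Serializes to {quant_type}.safetensors + quant_config.json; restore in one step via cache_dit.load().
- NVFP4 PTQ (≥sm120, Blackwell): Designed for RTX 5090 and other Blackwell GPUs; weights are packed in the NVFP4 format. Currently only runtime_kernel="v1" is supported.
Performance (FLUX.2-klein-4B, 1024×1024, L20):
| Stage | Latency (s) | Memory (GiB) | Transformer Weight (GiB) |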
|---|---|---|---|
| BF16 baseline | 2.13 | 17.32 | 7.22 |
| SVDQuant INT4 | 1.24 | 12.39 | 2.28 |
| SVDQuant + compile | 1.02 | 12.39 | 2.28 |
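- Transformer weights: ~3.2× compression (7.22 → 2.28 GiB)
- End-to-end latency: ~1.7× speedup (2.13 → 1.24s), ~2.1× with compile stacked on top (2.13 → 1.02s)
- PSNR stays at 29+ dB, with nearly lossless visual quality
NVFP4 Performance (RTX 5090, FLUX.2-klein-4B, 1024×1024):
| Stage | Latency (s) | Speedup | Memory (GiB) |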
|---|---|---|---|
| BF16 baseline | 0.97 | 1.00× | 17.32 |
| NVFP4 PTQ | 0.58 | 1.69× | 12.50 |
| NVFP4 + compile | 0.47 | 2.05× | 12.50 |
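⚡ DQ (Dynamic Quantization)
Zero-shot quantization that needs no calibration data, selected with the _dq type suffix (e.g., svdq_int4_r128_dq):
- identity (default): Apply SVD low-rank decomposition directly to the original weight matrix; no calibration, no serialization
- weight / weight_inv: Heuristic smooth strategies based only on weight statistics (experimental)
- few_shot: Collect activation statistics from a small number of forward passes at runtime, then quantize on the fly. Supports 7 relaxation strategies (auto / stable_auto / power / log / rank / top / fixed), configurable through few_shot_steps, few_shot_relax_factor, few_shot_relax_top_ratio, and few_shot_relax_strategy. Supports few_shot_auto_compile to trigger torch.compile automatically once quantization completes
DQ Performance (FLUX.2-klein-4B, 1024×1024, identity smooth, rank=128): 1.28s, PSNR 28.71 dB.
🔧 SVDQ Converter CLI
New cache-dit-convert command-line tool that converts a pretrained model to the SVDQ W4A4 format in one step: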
cache-dit-convert --model-path black-forest-labs/FLUX.2-klein-4B \
--save-dir ./FLUX.2-klein-4B-svdq \
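--quant-type svdq-int4-r128-dq
Supports INT4/NVFP4 DQ conversion, custom smooth strategies, and multiple precision options. Produces {quant_type}.safetensors + quant_config.json, which can be loaded directly via cache_dit.load().
🔀 Fused MLP
New fused_gelu_mlp / fused_gelu_proj passes (enable via svdq_kwargs["fused_mlp"]=True) fuse the first GEMM + GELU activation into a single kernel, reducing kernel launch overhead and intermediate activation memory.
🔗 Parallelism Compatibility (Cache-DiT Exclusive)
SVDQuant in Cache-DiT fully supports Context Parallelism (Ulysses / Ring / USP) and Tensor Parallelism (PyTorch DTensor). Users can combine quantization with parallelism in distributed inference for extreme memory compression and throughput gains. SVDQW4A4ShardLinear (dtensor.py) provides native TP sharding support. This capability differentiates Cache-DiT's SVDQuant from other W4A4 implementations.
⚙️ Quantization Configuration Enhancements
- Regional Quantization (regional_quantize=True + repeated_blocks): Quantize only the repeated transformer blocks, keeping sensitive layers such as embeddings at full precision
- Hybrid Precision Plan (precision_plan): Assign different quant types to different sub-layers by layer-name pattern (e.g., float8_per_tensor for attn.to_q, float8_weight_only for attn.to_k)
- FP8 Per-Tensor Fallback (per_tensor_fallback=True, default on): Under TP, layers that do not support per-row quantization automatically fall back to per-tensor, eliminating skip warnings and improving coverage (144 → 144, 0 skip)
- TorchAO Backend Refactor: Cleaner QuantizeBackend enum (AUTO / TORCHAO / CACHE_DIT / NONE)
- Quantize API Refactor: Legacy kwargs deprecated and unified under QuantizeConfig + svdq_kwargs
📦 cache-dit-cu13 Pre-built Wheel
Pre-compiled SVDQuant wheel for CUDA 13 users: pip install cache-dit-cu13, which removes the need to build the SVDQ kernels from source.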
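2. 💾 Bucket-style Layerwise CPU Offload
Cache-DiT v1.5.0 introduces a new bucket-style layerwise offload mechanism that fixes the "every layer waits for its transfer" inefficiency of traditional layerwise offload.
Core Design:
- Bucket Pipeline: Split the target modules into small contiguous buckets and prefetch the next bucket asynchronously while the current one executes, overlapping compute with communication
- Dual Independent Copy Stream Pools: onload (H2D) and offload (D2H) each have their own CUDA stream pool
- Persistent Bins: Spread the persistent budget evenly across the target sequence so that hot weights do not cluster in the prefix
- Flexible Resource Controls: transfer_buckets (prefetch depth), persistent_buckets (number of persistent buckets), persistent_bins (number of bins the persistent buckets are spread over), prefetch_limit (conservative prefetch limit), max_copy_streams (number of concurrent copy streams), max_inflight_prefetch_bytes (prefetch byte budget)
Performance (FLUX.1-dev, L20):
| Config | Memory | Latency |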
|---|---|---|
| No offload | ~38 GiB | 23.4s |
| Native diffusers offload | ~25 GiB | 56s |
| Native diffusers sequential | ~1 GiB | 335s |
| Layerwise (transfer=4, persistent=32, bins=4) | ~16 GiB | 24.6s |
Only 1.2s of added latency (vs 23.4s with no offload) reduces memory from 38 GiB to 16 GiB.
torch.compile Compatible: Apply offload first and compile afterwards; the offload hooks keep working after compilation.
CLI Quick Start:
python3 -m cache_dit.generate flux \
--layerwise-offload --layerwise-async-transfer \
--layerwise-transfer-buckets 4 --layerwise-persistent-buckets 32 \
--layerwise-persistent-bins 4 --layerwise-prefetch-limit \
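--layerwise-max-inflight-prefetch-bytes 8gib --compile
3. 🌩️ Ray Wrapper (Transparent Distributed Inference)
The Ray Wrapper makes distributed inference completely transparent to user code. You need no torchrun, no dist.init_process_group, and no manual model sharding; just set use_ray=True and Cache-DiT takes care of the rest.
Two Wrapper Levels:
| Level | Description | Best For |
|---|---|---|
| Pipeline Wrapper (recommended) | Ray manages the entire pipeline execution, including the text encoder, VAE, and scheduler | Full feature support (cache, quantization, parallelism), simplest user experience |
| Transformer Wrapper | Only the transformer runs on Ray; the other components stay in the main process | Lightweight, but slower than the pipeline level |
Key Features:
- ray_transfer_fn: User-defined model loading logic for each worker, bypassing serialization/deserialization overhead and resolving class resolution issues for custom modules
- ray_use_compile: Automatic compilation inside each worker
- ray_runtime_env: Custom module import handling via PYTHONPATH
- Supports all parallelism strategies, including TP, Ulysses, and Ring
- LoRA support: fusing before use is recommended (TP does not support unfused LoRA)
Performance (FLUX.2-klein-base-9B):
| Config | Latency |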
|---|---|
| Baseline (single GPU) | 47.41s |
| Ray Wrapper TP=2 + compile | 24.57s |
Minimal Example:
cache_dit.enable_cache(
pipe,
parallelism_config=ParallelismConfig(tp_size=2, use_ray=True),
)
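image = pipe(prompt="A cat...").images[0]  # Code completely unchanged
4. 🔮 DMD Calibrator (Dynamic Mode Decomposition Calibrator)
DMD (Dynamic Mode Decomposition) is a feed-forward calibrator with an exponential basis, offered as an alternative to TaylorSeer (polynomial basis).
Mathematical Principle: The cached feature stream is modeled as a linear dynamical system
TaylorSeer vs DMD:
| Aspect | TaylorSeer (Polynomial) | DMD (Exponential) |
|---|---|---|
| Basis | Polynomial | Exponential |
| Extrapolation | Diverges | Conditionally bounded |
| Snapshots needed | 2+ (1st order) | ≥ 4 uniformly spaced |
| Best for | DiT-class denoising (DDPM) | Flow-matching generators (Hunyuan3D, etc.) |
| Noise sensitivity | Low | Moderate (SVD truncation suppresses noise) |
Usage: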
cache_dit.enable_cache(
pipe,
cache_config=DBCacheConfig(...),
calibrator_config=DMDCalibratorConfig(
dmd_history=6, dmd_rank=0, dmd_ridge=1e-8,
),
)
CLI: python -m cache_dit.generate flux --cache --dmd --dmd-history 6
🔧 Other Enhancements
🧩 FLUX.2-klein-kv Series Support
📈 CUDA Graph
- Full CUDA Graph support (#942-#952)
- CUDA Graph + fp8 rowwise used together (#952)
- descent_tuning enabled by default (#935)
- compile full-graph CLI option (#946)
🏗️ Kernel & Infrastructure Refactoring
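- Triton kernel refactoring (3/N, #907-#909)
- Communication kernels registered as torch.library ops (#905)
- CuTeDSL communication kernels (fp8, #977), merge-attn-states kernel (#973)
- Unified all2all/ring communication API (#985)
- Async Ulysses refactoring (#986)
- Unified ops registration policy (#931, #939, #967)
🔗 Parallelism Improvements
⚡ Compilation Optimizations
- Removed manual graph breaks (#885)
- CUDA Graph for dynamic compile (#951)
- SVDQuant + compile compatibility (#1014 offload + compile)
- Log suppression improvements (#934, #923)
🤝 Community Integrations
📦 Dependency Updates
📝 Documentation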
- New/substantially updated docs: QUANTIZATION.md, OFFLOAD.md, RAY.md, CACHE_API.md (DMD), CUDA_GRAPH.md, COMPILE.md
- Formatting and typo fixes (#886-#887, #899, #903, #913)
- FAQ update: Flash Attention 2 install guide (#915)
- Community link fixes (#928, #940)
⚠️ Breaking Changes
| Change | PR | Migration |
|---|---|---|
| Serving module deprecated | #933 | Migrating to SGLang Diffusion or vLLM-Omni is recommended |
| Native Diffusers parallelism backend deprecated | #1017 | Use Cache-DiT native parallelism backends (better performance) |
| Quantize API legacy kwargs deprecated | #910-#911 | Use the unified QuantizeConfig + svdq_kwargs; legacy parameters such as grad_ckpt and reorder_before_quantize have been removed |
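👥 New Contributors
Thanks to the following new contributors for joining the Cache-DiT community:
- @blian6: MindIE-SD NPU attention/compilation backend support
- @FNGarvin: Python version compatibility fix
- @Archerkattri: DMD Calibrator
🙏 Acknowledgements
Thanks to all contributors: @DefTruth, @Archerkattri, @FNGarvin, @blian6.
For the full list of changes, see GitHub Release v1.5.0.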