🚀 Cache-DiT v1.5.0 Release Notes

Release Date: 2026-06-16
Comparison: v1.3.0...v1.5.0 (176 commits)
Full Changelog: GitHub Releases

📋 Overview

Cache-DiT v1.5.0 is a major feature release spanning 3 months (2026-03-12 ~ 2026-06-16), covering 176 PRs. This release is organized around four core modules: SVDQuant W4A4 Quantization (PTQ/DQ/NVFP4/Converter CLI), DMD Calibrator (exponential-basis Dynamic Mode Decomposition), Bucket-style Layerwise CPU Offload (compute-communication overlapped offloading), and Ray Wrapper (transparent distributed inference). Additional highlights include FLUX.2-klein-kv series support, full CUDA Graph integration, extensive kernel refactoring, quantization enhancements, parallelism framework improvements, and comprehensively updated documentation.

✨ Core Highlights

1. 💎 SVDQuant W4A4 Quantization

Cache-DiT v1.5.0 natively integrates a complete SVDQuant W4A4 quantization workflow — the most significant feature of this release. Instead of relying on third-party libraries, users can now perform end-to-end W4A4 quantization directly through Cache-DiT's cache_dit.quantize() / cache_dit.load() API.

📊 PTQ (Post-Training Quantization)

Supports svdq_int4_r{rank} and svdq_nvfp4_r{rank} quant types:

INT4 PTQ (≥sm80): Collect activation statistics via calibrate_fn → SVD low-rank decomposition → INT4 packing → runtime W4A4 GEMM. Supports three calibration precision levels: low (recommended default, ~18× speedup), medium, high. Serialize to {quant_type}.safetensors + quant_config.json; restore via cache_dit.load().
NVFP4 PTQ (≥sm120, Blackwell): Designed for RTX 5090 and other Blackwell GPUs. Currently only runtime_kernel="v1" is supported for NVFP4.
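To make the "INT4 packing" step concrete, here is a minimal pure-Python sketch of symmetric 4-bit quantization with two codes packed per byte. It illustrates the general technique only: the function names, the single per-group scale, and the low-nibble-first byte layout are assumptions for illustration, not Cache-DiT's actual kernels or storage format.

```python
def quantize_int4(values):
    """Symmetric INT4 quantization of one group: returns (codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0  # largest magnitude maps to code 7
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def pack_int4(codes):
    """Pack signed 4-bit codes two per byte (assumed layout: low nibble first)."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    packed = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed):
    """Inverse of pack_int4: recover signed codes in the range [-8, 7]."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes


def dequantize(codes, scale):
    return [c * scale for c in codes]


weights = [0.70, -0.35, 0.10, -0.70, 0.00, 0.21, -0.49, 0.63]
codes, scale = quantize_int4(weights)
packed = pack_int4(codes)  # 8 weights occupy 4 bytes
restored = dequantize(unpack_int4(packed), scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(len(packed), codes, max_error)
```

The round-trip error is bounded by half the scale, which is why outliers (which inflate the scale) hurt 4-bit accuracy and why a low-rank branch that absorbs them is useful.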

Performance (FLUX.2-klein-4B, 1024×1024, L20):

| Stage | Latency (s) | Memory (GiB) | Transformer Weight (GiB) |
| --- | --- | --- | --- |
| BF16 baseline | 2.13 | 17.32 | 7.22 |
| SVDQuant INT4 | 1.24 | 12.39 | 2.28 |
| SVDQuant + compile | 1.02 | 12.39 | 2.28 |

Transformer weight reduction: ~3.2× compression (7.22 → 2.28 GiB)
End-to-end latency: ~1.7× speedup (2.13 → 1.24s), ~2.1× with compile (2.13 → 1.02s)
PSNR > 29 dB, near-lossless visual quality
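The ratios quoted above follow directly from the INT4 table. The short check below recomputes them; the variable names are just for this example.

```python
# Figures copied from the FLUX.2-klein-4B / L20 table above.
bf16_weight_gib, int4_weight_gib = 7.22, 2.28
bf16_latency_s, int4_latency_s, int4_compile_latency_s = 2.13, 1.24, 1.02

compression = bf16_weight_gib / int4_weight_gib
speedup = bf16_latency_s / int4_latency_s
speedup_compile = bf16_latency_s / int4_compile_latency_s

print(f"weight compression: {compression:.1f}x")
print(f"latency speedup: {speedup:.1f}x, with compile: {speedup_compile:.1f}x")
```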

NVFP4 Performance (RTX 5090, FLUX.2-klein-4B, 1024×1024):

| Stage | Latency (s) | Speedup | Memory (GiB) |
| --- | --- | --- | --- |
| BF16 baseline | 0.97 | 1.00× | 17.32 |
| NVFP4 PTQ | 0.58 | 1.69× | 12.50 |
| NVFP4 + compile | 0.47 | 2.05× | 12.50 |

⚡ DQ (Dynamic Quantization)

Zero-calibration quantization via _dq suffix types (e.g., svdq_int4_r128_dq):

identity (default): Apply SVD low-rank decomposition directly to the original weight matrix — no calibration, no serialization.
weight / weight_inv: Weight-statistics-only heuristic smooth strategies (experimental).
few_shot: Collect a small number of real inference forwards at runtime, then quantize in-place with configurable relaxation strategies (7 strategies: auto/stable_auto/power/log/rank/top/fixed). Supports few_shot_auto_compile for deferred compilation after quantization.
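For orientation, the identity strategy can be read as the plain low-rank-plus-residual split used by SVDQuant-style methods. The notation below ($W$ for the weight, $r$ for the rank in the quant type name, $Q$ for a 4-bit quantizer) is assumed for this sketch and is not taken from the Cache-DiT source; the smoothing strategies would additionally rescale $W$ and $X$ before this step.

```latex
\begin{aligned}
W &= U \Sigma V^\top, \qquad
L_1 = U_{:,1:r}\,\Sigma_{1:r,1:r}, \qquad
L_2 = V_{:,1:r}^\top, \\
R &= W - L_1 L_2, \\
X W &\approx \underbrace{X L_1 L_2}_{\text{low-rank branch, 16-bit}}
      \;+\; \underbrace{Q(X)\,Q(R)}_{\text{residual branch, W4A4}}
\end{aligned}
```

The low-rank branch absorbs the dominant singular directions, so the residual $R$ has a smaller dynamic range and loses less when quantized to 4 bits.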

DQ Performance (FLUX.2-klein-4B, 1024×1024, identity, rank=128): 1.28s, PSNR 28.71 dB.

🔧 SVDQ Converter CLI

New cache-dit-convert command-line tool for one-click model conversion to SVDQ W4A4:

cache-dit-convert --model-path black-forest-labs/FLUX.2-klein-4B \
  --save-dir ./FLUX.2-klein-4B-svdq \
  --quant-type svdq-int4-r128-dq

Supports INT4/NVFP4 DQ conversion, custom smooth strategies, and multiple precision options. Outputs {quant_type}.safetensors + quant_config.json.

🔀 Fused MLP

New fused_gelu_mlp / fused_gelu_proj passes (enable via svdq_kwargs["fused_mlp"]=True), fusing the first GEMM + GELU activation into a single kernel for reduced kernel launch overhead and intermediate activation memory.
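The idea behind the fusion can be shown with a toy pure-Python sketch: the unfused path materializes the full pre-activation matrix and then makes a second pass for GELU, while the fused path applies GELU as each output element is produced. This is a conceptual illustration with hypothetical function names, not the actual fused_gelu_mlp kernel.

```python
import math


def gelu(x):
    """Exact (erf-based) GELU."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def dot(row, col):
    acc = 0.0
    for a, b in zip(row, col):
        acc += a * b
    return acc


def mlp_unfused(x, w):
    # Pass 1: GEMM writes the whole pre-activation buffer.
    pre = [[dot(row, col) for col in zip(*w)] for row in x]
    # Pass 2: a separate elementwise pass reads it back and applies GELU.
    return [[gelu(v) for v in row] for row in pre]


def mlp_fused(x, w):
    # Single pass: GELU is applied in the GEMM epilogue, so no intermediate buffer.
    return [[gelu(dot(row, col)) for col in zip(*w)] for row in x]


x = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]    # 2 x 3 activations
w = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.7]]  # 3 x 2 weights
assert mlp_fused(x, w) == mlp_unfused(x, w)
print(mlp_fused(x, w))
```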

🔗 Parallelism Compatibility (Cache-DiT Exclusive)

SVDQuant in Cache-DiT fully supports Context Parallelism (Ulysses / Ring / USP) and Tensor Parallelism (PyTorch DTensor). Users can layer quantization on top of distributed parallelism for extreme memory compression and throughput gains. SVDQW4A4ShardLinear (dtensor.py) provides native TP sharding support. This is a differentiating capability of Cache-DiT's SVDQuant implementation versus other W4A4 solutions.

⚙️ Quantization Configuration Enhancements

Regional Quantization (regional_quantize=True + repeated_blocks): Quantize only repeated transformer blocks, keeping embedding layers at full precision.
Hybrid Precision Plan (precision_plan): Assign different quant types to different sub-layers by name pattern.
FP8 Per-Tensor Fallback (per_tensor_fallback=True, default on): Auto-fallback from per-row to per-tensor for TP-incompatible layers.
TorchAO Backend Refactor: Cleaner QuantizeBackend enum (AUTO / TORCHAO / CACHE_DIT / NONE).
Quantize API Refactor: Deprecated legacy kwargs, unified under QuantizeConfig + svdq_kwargs.
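As a rough illustration of assigning quant types by name pattern, the sketch below resolves a layer name against a small pattern table with Python's standard fnmatch module. The dictionary layout, the resolve_quant_type helper, the glob syntax, and the default value are all hypothetical; the real precision_plan format may differ.

```python
from fnmatch import fnmatch

# Hypothetical plan: pattern -> quant type, first match wins.
precision_plan = {
    "*.attn.to_q": "float8_per_tensor",
    "*.attn.to_k": "float8_weight_only",
}
default_quant_type = "float8_per_tensor"  # illustrative default only


def resolve_quant_type(layer_name, plan, default):
    """Return the quant type of the first pattern that matches layer_name."""
    for pattern, quant_type in plan.items():
        if fnmatch(layer_name, pattern):
            return quant_type
    return default


for name in (
    "transformer_blocks.0.attn.to_q",
    "transformer_blocks.0.attn.to_k",
    "transformer_blocks.0.ff.net.0.proj",
):
    print(name, "->", resolve_quant_type(name, precision_plan, default_quant_type))
```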

📦 cache-dit-cu13 Pre-built Wheel

Pre-compiled SVDQuant wheel for CUDA 13 users: pip install cache-dit-cu13 — no source build needed.

2. 💾 Bucket-style Layerwise CPU Offload

Cache-DiT v1.5.0 introduces a novel bucket-style layerwise offload mechanism that overcomes the "every layer waits for its own transfer" inefficiency of traditional approaches.

Core Design:

Bucket Pipeline: Divide target modules into small contiguous buckets; prefetch the next bucket asynchronously while the current one executes.
Dual Independent Copy Stream Pools: Separate CUDA stream pools for onload (H2D) and offload (D2H).
Persistent Bins: Distribute the persistent budget evenly across the target sequence.
Flexible Resource Controls: transfer_buckets, persistent_buckets, persistent_bins, prefetch_limit, max_copy_streams, max_inflight_prefetch_bytes.
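The bucket pipeline and persistent bins can be pictured with a small scheduling sketch. Everything here is an assumption made for illustration: the plan_offload helper, its parameter names, and the rule "keep the first few buckets of each bin resident" are not the library's actual semantics for transfer_buckets, persistent_buckets, or persistent_bins.

```python
def make_buckets(modules, bucket_size):
    """Split the module sequence into small contiguous buckets."""
    return [modules[i:i + bucket_size] for i in range(0, len(modules), bucket_size)]


def plan_offload(modules, bucket_size, resident_budget, bins):
    buckets = make_buckets(modules, bucket_size)
    n = len(buckets)

    # Spread the resident budget evenly: split bucket indices into `bins`
    # contiguous segments and keep the first few buckets of each on the GPU.
    per_bin = resident_budget // bins
    bin_len = -(-n // bins)  # ceiling division
    resident = set()
    for b in range(bins):
        start = b * bin_len
        resident.update(range(start, min(start + per_bin, n)))

    # While bucket i executes, the next bucket is prefetched (H2D) unless it
    # is already resident.
    steps = []
    for i in range(n):
        nxt = i + 1
        prefetch = nxt if nxt < n and nxt not in resident else None
        steps.append((i, prefetch))
    return buckets, sorted(resident), steps


modules = [f"block_{i}" for i in range(12)]
buckets, resident, steps = plan_offload(modules, bucket_size=2, resident_budget=2, bins=2)
print("resident buckets:", resident)
for execute, prefetch in steps:
    print(f"execute bucket {execute}, prefetch bucket {prefetch}")
```

In a real implementation the prefetch would be issued on a separate copy stream so that the transfer overlaps with the compute of the current bucket.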

Performance (FLUX.1-dev, L20):

| Config | Memory | Latency |
| --- | --- | --- |
| No offload | ~38 GiB | 23.4s |
| Diffusers sequential | ~1 GiB | 335s |
| Layerwise (transfer=4, persistent=32, bins=4) | ~16 GiB | 24.6s |

Only ~1.2s added latency versus no-offload baseline for a 38→16 GiB memory reduction.

torch.compile Compatible: Apply offload before compile; offload hooks work correctly after compilation.

CLI Quick Start:

python3 -m cache_dit.generate flux \
  --layerwise-offload --layerwise-async-transfer \
  --layerwise-transfer-buckets 4 --layerwise-persistent-buckets 32 \
  --layerwise-persistent-bins 4 --layerwise-prefetch-limit \
  --layerwise-max-inflight-prefetch-bytes 8gib --compile

3. 🌩️ Ray Wrapper (Transparent Distributed Inference)

The Ray Wrapper makes distributed inference completely transparent to user code. No torchrun, no dist.init_process_group, no manual model sharding — just use_ray=True, and Cache-DiT handles everything.

Two Wrapper Levels:

| Level | Description | Best For |
| --- | --- | --- |
| Pipeline Wrapper (recommended) | Ray manages the entire pipeline execution | Full feature support (cache, quant, parallelism), simplest, fastest. |
| Transformer Wrapper | Only the transformer runs on Ray workers | Lightweight, but slightly slower |

Key Features:

ray_transfer_fn: User-defined per-worker model loading, bypassing serialization overhead and custom module class resolution issues.
ray_use_compile: Automatic per-worker compilation.
ray_runtime_env: Custom module import handling via PYTHONPATH.
Supports all parallelism strategies: TP, Ulysses, Ring.
LoRA support: fuse before enabling (TP requires fused LoRAs).

Performance (FLUX.2-klein-base-9B):

| Config | Latency |
| --- | --- |
| Baseline (single GPU) | 47.41s |
| Ray TP=2 + compile | 24.57s |

Minimal Example:

cache_dit.enable_cache(
  pipe,
  parallelism_config=ParallelismConfig(tp_size=2, use_ray=True),
)
image = pipe(prompt="A cat...").images[0]  # Code unchanged

4. 🔮 DMD Calibrator (Dynamic Mode Decomposition)

DMD (Dynamic Mode Decomposition) is an exponential-basis feed-forward calibrator, serving as a drop-in alternative to TaylorSeer's polynomial basis.

Mathematical Principle: The cached feature stream is modeled as a linear dynamical system $Y_{t+1} \approx A \cdot Y_t$. The propagator $A$ is identified via one economy SVD, then eigendecomposed for extrapolation via $\lambda^k$. Unlike TaylorSeer's polynomial extrapolation (diverges as $t^n \to \infty$), DMD is bounded when $\lvert\lambda\rvert \leq 1$.
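For readers unfamiliar with DMD, the textbook "exact DMD" recipe behind that description is sketched below, using snapshots $Y_1, \dots, Y_m$ of the cached feature. This is the standard formulation, offered as background; how Cache-DiT's calibrator applies rank truncation (dmd_rank) or regularization (dmd_ridge) is not spelled out in these notes and is not assumed here.

```latex
\begin{aligned}
X &= [\,Y_1, \dots, Y_{m-1}\,], \qquad X' = [\,Y_2, \dots, Y_m\,], \qquad X' \approx A X, \\
X &\approx U_r \Sigma_r V_r^{*} \quad \text{(economy SVD, truncated to rank } r\text{)}, \\
\tilde{A} &= U_r^{*} X' V_r \Sigma_r^{-1}, \qquad \tilde{A} W = W \Lambda, \quad \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_r), \\
\Phi &= X' V_r \Sigma_r^{-1} W, \qquad b = \Phi^{\dagger} Y_m, \\
Y_{m+k} &\approx \Phi \, \Lambda^{k} \, b .
\end{aligned}
```

Each mode contributes a factor $\lambda_i^k$, so the forecast cannot grow in magnitude when every $\lvert\lambda_i\rvert \leq 1$.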

TaylorSeer vs DMD:

| Aspect | TaylorSeer (Polynomial) | DMD (Exponential) |
| --- | --- | --- |
| Basis | $Y(t) \approx \sum \frac{d^nY}{dt^n} \frac{t^n}{n!}$ | $Y(t) \approx \Phi \cdot \text{diag}(\lambda^t) \cdot b$ |
| Extrapolation | Diverges as $t^n \to \infty$ | Bounded when $\lvert\lambda\rvert \leq 1$ |
| Snapshots needed | 2+ (1st order) | ≥ 4 uniformly spaced |
| Best for | DiT-class denoising (DDPM) | Flow-matching generators (Hunyuan3D, etc.) |
| Noise sensitivity | Low | Moderate (SVD truncation suppresses noise) |
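The difference in extrapolation behaviour shows up even in a scalar toy problem. The sketch below fits a one-dimensional analogue of DMD (a single $\lambda$ by least squares) and compares it with a first-order finite-difference extrapolation on a decaying sequence. It is a toy with hypothetical helper names, not either calibrator's implementation.

```python
def fit_lambda(history):
    """1-D least-squares fit of y[t+1] ≈ lam * y[t]."""
    num = sum(a * b for a, b in zip(history[:-1], history[1:]))
    den = sum(a * a for a in history[:-1])
    return num / den


def exp_extrapolate(history, k):
    """Exponential-basis forecast k steps ahead: y_last * lam**k."""
    return history[-1] * fit_lambda(history) ** k


def taylor1_extrapolate(history, k):
    """First-order (linear) forecast k steps ahead from the last difference."""
    slope = history[-1] - history[-2]
    return history[-1] + k * slope


history = [0.8 ** t for t in range(6)]  # six uniformly spaced snapshots
for k in (1, 5, 20):
    truth = 0.8 ** (len(history) - 1 + k)
    print(k, truth, exp_extrapolate(history, k), taylor1_extrapolate(history, k))
```

On this sequence the exponential forecast keeps shrinking with the horizon, while the linear forecast keeps moving along its last slope and eventually overshoots far past zero.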

Usage:

cache_dit.enable_cache(
  pipe,
  cache_config=DBCacheConfig(...),
  calibrator_config=DMDCalibratorConfig(
    dmd_history=6, dmd_rank=0, dmd_ridge=1e-8,
  ),
)

CLI: python -m cache_dit.generate flux --cache --dmd --dmd-history 6

🔧 Other Enhancements

🧩 FLUX.2-klein-kv Series

TP + compile integration (#888)
fp8 per-row + TP support (#896)
Async Ulysses support (#877)

📈 CUDA Graph

Full CUDA Graph support (#942-#952)
CUDA Graph + fp8 rowwise integration (#952)
descent_tuning enabled by default (#935)
compile full-graph CLI option (#946)

🏗️ Kernel & Infrastructure Refactoring

Triton kernel refactoring (3/N, #907-#909)
Communication kernels registered as torch.library ops (#905)
CuTeDSL communication kernels (fp8, #977), merge-attn-states kernel (#973)
Unified all2all/ring communication API (#985)
Async Ulysses refactoring (#986)
Unified ops registration policy (#931, #939, #967)

🔗 Parallelism Improvements

record_plans for TP/TE-P planners (#916-#917)
sub cp_plan support (#989)
CP & VAE-P planner refactoring for better logging (#918)

⚡ Compilation Optimizations

Removed manual graph breaks (#885)
CUDA Graph for dynamic compile (#951)
SVDQuant + compile compatibility; offload + compile (#1014)
Log suppression improvements (#934, #923)

🤝 Community Integrations

MindIE-SD: Huawei Ascend NPU attention/compilation backend support (@blian6, #1004)
TensorRT-LLM community link (#991)

📦 Dependency Updates

PyTorch → 2.11.0 (#900)
uv for dependency management (#992)
Python version compatibility fix (@FNGarvin, #1025)
TorchAO ≥ 0.17.0

📝 Documentation

New/reworked docs: QUANTIZATION.md, OFFLOAD.md, RAY.md, CACHE_API.md (DMD), CUDA_GRAPH.md, COMPILE.md
Formatting and typo fixes (#886-#887, #899, #903, #913)
FAQ update: Flash Attention 2 install guide (#915)
Community link fixes (#928, #940)

⚠️ Breaking Changes

| Change | PR | Migration |
| --- | --- | --- |
| Serving module deprecated | #933 | Migrate to SGLang Diffusion or vLLM-Omni |
| Native Diffusers parallelism backend deprecated | #1017 | Use Cache-DiT native parallelism backends (better performance) |
| Quantize API legacy kwargs deprecated | #910-#911 | Migrate to unified `QuantizeConfig` + `svdq_kwargs`. Legacy `grad_ckpt`, `reorder_before_quantize` etc. removed |

👥 New Contributors

Thank you to the new contributors who joined the Cache-DiT community:

@blian6 — MindIE-SD NPU attention/compilation backend support
@FNGarvin — Python version compatibility fix
@Archerkattri — DMD Calibrator

🙏 Contributors

Thank you to all contributors for this release: @DefTruth, @Archerkattri, @FNGarvin, @blian6.

For the full list of changes, see GitHub Release v1.5.0 and the full changelog.

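🇨🇳 Chinese Version

📋 Overview

Cache-DiT v1.5.0 is a major feature update spanning 3 months (2026-03-12 ~ 2026-06-16) and covering 176 PRs. The release centers on four core modules: SVDQuant W4A4 quantization (PTQ/DQ/NVFP4/converter CLI), DMD Calibrator (a dynamic mode decomposition calibrator with an exponential basis), Bucket-style Layerwise CPU Offload (layerwise offloading with compute-communication overlap), and Ray Wrapper (a wrapper that makes distributed inference transparent). It also includes FLUX.2-klein-kv series support, full CUDA Graph integration, extensive kernel refactoring and quantization enhancements, parallelism framework improvements, and comprehensively updated documentation.

✨ Core Highlights

1. 💎 SVDQuant W4A4 Quantization

Cache-DiT v1.5.0 natively integrates a complete SVDQuant W4A4 quantization workflow, the most important feature of this release. Rather than relying on third-party libraries, users can now run the full W4A4 pipeline, from calibration to inference, directly through Cache-DiT's cache_dit.quantize() / cache_dit.load() API.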

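📊 PTQ (Post-Training Quantization)

Supports two quant types, svdq_int4_r{rank} and svdq_nvfp4_r{rank}:

INT4 PTQ (≥sm80): Collect activation statistics via calibrate_fn → SVD low-rank decomposition → INT4 packing → runtime W4A4 GEMM. Supports three calibration precision levels: low (recommended default, ~18× speedup), medium, high. Serialize to {quant_type}.safetensors + quant_config.json; restore in one step via cache_dit.load().
NVFP4 PTQ (≥sm120, Blackwell): Designed for RTX 5090 and other Blackwell GPUs; weights are packed in the NVFP4 format. Currently only runtime_kernel="v1" is supported.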

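Performance (FLUX.2-klein-4B, 1024×1024, L20):

| Stage | Latency (s) | Memory (GiB) | Transformer Weight (GiB) |
| --- | --- | --- | --- |
| BF16 baseline | 2.13 | 17.32 | 7.22 |
| SVDQuant INT4 | 1.24 | 12.39 | 2.28 |
| SVDQuant + compile | 1.02 | 12.39 | 2.28 |

Transformer weights: ~3.2× compression (7.22 → 2.28 GiB)
End-to-end latency: ~1.7× speedup (2.13 → 1.24s), ~2.1× with compile (2.13 → 1.02s)
PSNR stays at 29+ dB, with near-lossless visual quality

NVFP4 Performance (RTX 5090, FLUX.2-klein-4B, 1024×1024):

| Stage | Latency (s) | Speedup | Memory (GiB) |
| --- | --- | --- | --- |
| BF16 baseline | 0.97 | 1.00× | 17.32 |
| NVFP4 PTQ | 0.58 | 1.69× | 12.50 |
| NVFP4 + compile | 0.47 | 2.05× | 12.50 |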

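⚡ DQ (Dynamic Quantization)

Zero-shot quantization that needs no calibration data, selected with the _dq type suffix (e.g., svdq_int4_r128_dq):

identity (default): Apply SVD low-rank decomposition directly to the original weight matrix; no calibration and no serialization required.
weight / weight_inv: Heuristic smooth strategies based on weight statistics only (experimental).
few_shot: Collect activation statistics from a small number of forward passes at runtime, then quantize on the fly. Supports 7 relaxation strategies (auto/stable_auto/power/log/rank/top/fixed), configurable via few_shot_steps, few_shot_relax_factor, few_shot_relax_top_ratio, and few_shot_relax_strategy. Supports few_shot_auto_compile to trigger torch.compile automatically once quantization completes.

DQ Performance (FLUX.2-klein-4B, 1024×1024, identity smooth, rank=128): 1.28s, PSNR 28.71 dB.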

🔧 SVDQ Converter CLI

New cache-dit-convert command-line tool for one-click conversion of a pretrained model to the SVDQ W4A4 format:

cache-dit-convert --model-path black-forest-labs/FLUX.2-klein-4B \
  --save-dir ./FLUX.2-klein-4B-svdq \
  --quant-type svdq-int4-r128-dq

Supports INT4/NVFP4 DQ conversion, custom smooth strategies, and multiple precision options. Produces {quant_type}.safetensors + quant_config.json, which can be loaded directly via cache_dit.load().

🔀 Fused MLP

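New fused_gelu_mlp / fused_gelu_proj passes (enable via svdq_kwargs["fused_mlp"]=True) fuse the first GEMM + GELU activation into a single kernel, reducing kernel launch overhead and intermediate activation memory.

🔗 Parallelism Compatibility (Cache-DiT Exclusive)

SVDQuant in Cache-DiT fully supports Context Parallelism (Ulysses / Ring / USP) and Tensor Parallelism (PyTorch DTensor). Users can combine quantization with parallelism in distributed inference for extreme memory compression and throughput gains. SVDQW4A4ShardLinear (dtensor.py) provides native TP sharding support. This is what differentiates Cache-DiT's SVDQuant from other W4A4 implementations.

⚙️ Quantization Configuration Enhancements

Regional Quantization (regional_quantize=True + repeated_blocks): Quantize only the repeated transformer blocks, keeping sensitive layers such as embeddings at full precision.
Hybrid Precision Plan (precision_plan): Assign different quant types to different sub-layers by layer-name pattern (e.g., float8_per_tensor for attn.to_q and float8_weight_only for attn.to_k).
FP8 Per-Tensor Fallback (per_tensor_fallback=True, default on): Under TP, layers that do not support per-row quantization fall back to per-tensor automatically, which eliminates skip warnings and improves coverage (144 → 144, 0 skipped).
TorchAO Backend Refactor: Cleaner QuantizeBackend enum (AUTO / TORCHAO / CACHE_DIT / NONE).
Quantize API Refactor: Deprecated legacy kwargs, unified under QuantizeConfig + svdq_kwargs.

📦 cache-dit-cu13 Pre-built Wheel

Pre-compiled SVDQuant wheel for CUDA 13 users: pip install cache-dit-cu13, which avoids building the SVDQ kernels from source.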

2. 💾 Bucket-style Layerwise CPU Offload

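Cache-DiT v1.5.0 introduces a new bucket-style layerwise offload mechanism that removes the "every layer waits for its transfer" inefficiency of traditional layerwise offload.

Core Design:

Bucket Pipeline: Split the target modules into small contiguous buckets and prefetch the next bucket asynchronously while the current one executes, overlapping compute with communication.
Dual Independent Copy Stream Pools: onload (H2D) and offload (D2H) each have their own CUDA stream pool.
Persistent Bins: Distribute the persistent budget evenly across the target sequence so that hot weights do not cluster in the prefix.
Flexible Resource Controls: transfer_buckets (prefetch depth), persistent_buckets (number of resident buckets), persistent_bins (number of bins the resident buckets are spread across), prefetch_limit (conservative prefetch limit), max_copy_streams (number of concurrent copy streams), max_inflight_prefetch_bytes (byte budget for in-flight prefetches).

Performance (FLUX.1-dev, L20):

| Config | Memory | Latency |
| --- | --- | --- |
| No offload | ~38 GiB | 23.4s |
| Native Diffusers offload | ~25 GiB | 56s |
| Native Diffusers sequential | ~1 GiB | 335s |
| Layerwise (transfer=4, persistent=32, bins=4) | ~16 GiB | 24.6s |

Only 1.2s of added latency (versus 23.4s with no offload) brings memory down from 38 GiB to 16 GiB.

torch.compile Compatible: Apply offload first and compile afterwards; offload hooks work correctly after compilation.

CLI Quick Start: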

python3 -m cache_dit.generate flux \
  --layerwise-offload --layerwise-async-transfer \
  --layerwise-transfer-buckets 4 --layerwise-persistent-buckets 32 \
  --layerwise-persistent-bins 4 --layerwise-prefetch-limit \
  --layerwise-max-inflight-prefetch-bytes 8gib --compile

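3. 🌩️ Ray Wrapper (Transparent Distributed Inference)

The Ray Wrapper makes distributed inference completely transparent to user code. You need no torchrun, no dist.init_process_group, and no manual model sharding: just set use_ray=True and Cache-DiT takes over.

Two Wrapper Levels:

| Level | Description | Best For |
| --- | --- | --- |
| Pipeline Wrapper (recommended) | Ray manages the entire pipeline execution, including the text encoder, VAE, and scheduler | Full feature support (cache, quantization, parallelism), simplest user experience |
| Transformer Wrapper | Only the transformer runs on Ray; other components stay in the main process | Lightweight, but slower than the pipeline level |

Key Features:

ray_transfer_fn: User-defined model loading logic for each worker, bypassing serialization/deserialization overhead and resolving class-resolution issues for custom modules.
ray_use_compile: Automatic compilation inside each worker.
ray_runtime_env: Custom module import handling via PYTHONPATH.
Supports all parallelism strategies, including TP, Ulysses, and Ring.
LoRA support: fusing before use is recommended (TP does not support unfused LoRAs).

Performance (FLUX.2-klein-base-9B):

| Config | Latency |
| --- | --- |
| Baseline (single GPU) | 47.41s |
| Ray Wrapper TP=2 + compile | 24.57s |

Minimal Example: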

cache_dit.enable_cache(
  pipe,
  parallelism_config=ParallelismConfig(tp_size=2, use_ray=True),
)
image = pipe(prompt="A cat...").images[0]  # Code unchanged

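4. 🔮 DMD Calibrator (Dynamic Mode Decomposition)

DMD (Dynamic Mode Decomposition) is a feed-forward calibrator with an exponential basis, offered as an alternative to TaylorSeer (polynomial basis).

Mathematical Principle: The cached feature stream is modeled as a linear dynamical system $Y_{t+1} \approx A \cdot Y_t$. The propagator $A$ is identified with a single SVD and then eigendecomposed, so extrapolation uses $\lambda^k$. In contrast to TaylorSeer's polynomial extrapolation (where $t^n$ diverges), DMD stays bounded when $\lvert\lambda\rvert \leq 1$.

TaylorSeer vs DMD:

| Aspect | TaylorSeer (Polynomial) | DMD (Exponential) |
| --- | --- | --- |
| Basis | $Y(t) \approx \sum \frac{d^nY}{dt^n} \frac{t^n}{n!}$ | $Y(t) \approx \Phi \cdot \text{diag}(\lambda^t) \cdot b$ |
| Extrapolation | Diverges as $t^n \to \infty$ | Bounded when $\lvert\lambda\rvert \leq 1$ |
| Snapshots needed | 2+ (1st order) | ≥ 4 uniformly spaced |
| Best for | DiT-class denoising (DDPM) | Flow-matching generators (Hunyuan3D, etc.) |
| Noise sensitivity | Low | Moderate (SVD truncation suppresses noise) |

Usage: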

cache_dit.enable_cache(
  pipe,
  cache_config=DBCacheConfig(...),
  calibrator_config=DMDCalibratorConfig(
    dmd_history=6, dmd_rank=0, dmd_ridge=1e-8,
  ),
)

CLI: python -m cache_dit.generate flux --cache --dmd --dmd-history 6

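🔧 Other Enhancements

🧩 FLUX.2-klein-kv Series Support

TP + compile integration (#888)
fp8 per-row + TP support (#896)
Async Ulysses support (#877)

📈 CUDA Graph

Full CUDA Graph support (#942-#952)
CUDA Graph + fp8 rowwise used together (#952)
descent_tuning enabled by default (#935)
compile full-graph CLI option (#946)

🏗️ Kernel & Infrastructure Refactoring

Triton kernel refactoring (3/N, #907-#909)
Communication kernels registered as torch.library ops (#905)
CuTeDSL communication kernels (fp8, #977), merge-attn-states kernel (#973)
Unified all2all/ring communication API (#985)
Async Ulysses refactoring (#986)
Unified ops registration policy (#931, #939, #967)

🔗 Parallelism Improvements

record_plans function support (TP/TE-P planners, #916-#917)
sub cp_plan support (#989)
CP & VAE-P planner refactoring for better logging (#918)

⚡ Compilation Optimizations

Removed manual graph breaks (#885)
CUDA Graph for dynamic compile (#951)
SVDQuant + compile compatibility; offload + compile (#1014)
Log suppression improvements (#934, #923)

🤝 Community Integrations

MindIE-SD: attention/compilation backend support for Huawei Ascend NPUs (@blian6, #1004)
TensorRT-LLM community link (#991)

📦 Dependency Updates

PyTorch → 2.11.0 (#900)
uv replaces pip for dependency management (#992)
Python version compatibility fix (@FNGarvin, #1025)
TorchAO ≥ 0.17.0

📝 Documentation

New or substantially updated docs: QUANTIZATION.md, OFFLOAD.md, RAY.md, CACHE_API.md (DMD), CUDA_GRAPH.md, COMPILE.md
Formatting and typo fixes (#886-#887, #899, #903, #913)
FAQ update: Flash Attention 2 install guide (#915)
Community link fixes (#928, #940)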

⚠️ Breaking Changes

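| Change | PR | Migration |
| --- | --- | --- |
| Serving module deprecated | #933 | Migrating to SGLang Diffusion or vLLM-Omni is recommended |
| Native Diffusers parallelism backend deprecated | #1017 | Use Cache-DiT native parallelism backends (better performance) |
| Quantize API legacy kwargs deprecated | #910-#911 | Use the unified `QuantizeConfig` + `svdq_kwargs`; legacy parameters such as `grad_ckpt` and `reorder_before_quantize` have been removed |

👥 New Contributors

Thank you to the new contributors who joined the Cache-DiT community:

@blian6: MindIE-SD NPU attention/compilation backend support
@FNGarvin: Python version compatibility fix
@Archerkattri: DMD Calibrator

🙏 Acknowledgements

Thank you to all contributors: @DefTruth, @Archerkattri, @FNGarvin, @blian6.

For the full list of changes, see GitHub Release v1.5.0.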
