v1.2.0 Major Release: NPU, TE-P, VAE-P, CN-P, ...

@DefTruth DefTruth released this 16 Jan 09:04
· 290 commits to main since this release
853df89

Overview

v1.2.0 is a major release following v1.1.0. It brings many updates that further improve the ease of use and performance of Cache-DiT. We sincerely thank all Cache-DiT contributors. The main updates are:

  • 🎉New Models Support
  • 🎉Request level cache context
  • 🎉HTTP Serving Support
  • 🎉Context Parallelism Optimization
  • 🎉Text Encoder Parallelism
  • 🎉Auto Encoder (VAE) Parallelism
  • 🎉ControlNet Parallelism
  • 🎉Ascend NPU Support
  • 🎉Community Integration

🔥New Models Support

  • Qwen-Image:
    • Image: Qwen-Image-2512, Qwen-Image-Layered
    • Edit: Qwen-Image-Edit-2511, Qwen-Image-Edit-2509
    • ControlNet: Qwen-Image-ControlNet, Qwen-Image-ControlNet-Inpainting
  • Qwen-Image-Lightning: Qwen-Image-Lightning series, Qwen-Image-Edit-Lightning series
  • Wan: Wan 2.1 VACE, Wan 2.2 VACE
  • Z-Image: Z-Image-Turbo, Z-Image-Turbo-Fun-ControlNet-2.0, Z-Image-Turbo-Fun-ControlNet-2.1
  • FLUX.2: FLUX.2-dev, FLUX.2-Klein-4B, FLUX.2-Klein-base-4B, FLUX.2-Klein-9B, FLUX.2-Klein-base-9B
  • LTX-2: LTX-2-I2V, LTX-2-T2V by @BBuf
  • Ovis-Image: Ovis-Image
  • LongCat-Image: LongCat-Image, LongCat-Image-Edit
  • Nunchaku INT4 Models: Z-Image-Turbo, Qwen-Image-Edit-2511

🔥Request level cache context

If each user request needs a different num_inference_steps rather than a fixed value, use the refresh_context API: before running inference for a request, update the cache context with that request's actual number of steps. See 📚run_cache_refresh for an example.

import cache_dit
from cache_dit import DBCacheConfig
from diffusers import DiffusionPipeline

# Init cache context with num_inference_steps=None (default)
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image")
pipe = cache_dit.enable_cache(pipe.transformer, cache_config=DBCacheConfig(num_inference_steps=None))

# Assume num_inference_steps is 28, and we want to refresh the context
cache_dit.refresh_context(pipe.transformer, num_inference_steps=28, verbose=True)
output = pipe(...) # Just call the pipe as normal.
stats = cache_dit.summary(pipe.transformer) # Then, get the summary

# Update the cache context with new num_inference_steps=50.
cache_dit.refresh_context(pipe.transformer, num_inference_steps=50, verbose=True)
output = pipe(...) # Just call the pipe as normal.
stats = cache_dit.summary(pipe.transformer) # Then, get the summary

# Update the cache context with new cache_config.
cache_dit.refresh_context(
    pipe.transformer,
    cache_config=DBCacheConfig(
        residual_diff_threshold=0.1,
        max_warmup_steps=10,
        max_cached_steps=20,
        max_continuous_cached_steps=4,
        # The cache settings should all be located in the cache config 
        # if cache config is provided. Otherwise, we will skip it.
        num_inference_steps=50,
    ),
    verbose=True,
)
output = pipe(...) # Just call the pipe as normal.
stats = cache_dit.summary(pipe.transformer) # Then, get the summary
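
In a serving loop, the pattern above amounts to refreshing the context right before each pipeline call. The following is a minimal sketch under stated assumptions: `Request` and `handle_request` are hypothetical names introduced for illustration and are not part of cache-dit; `refresh` stands in for `cache_dit.refresh_context`, and `run` stands in for the pipeline call.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Request:
    """Hypothetical per-user request; not part of cache-dit."""
    prompt: str
    num_inference_steps: int


def handle_request(
    req: Request,
    transformer: Any,
    refresh: Callable[..., Any],
    run: Callable[..., Any],
) -> Any:
    # Refresh the cache context with this request's actual step count
    # (refresh is assumed to behave like cache_dit.refresh_context).
    refresh(transformer, num_inference_steps=req.num_inference_steps)
    # Then call the pipeline as normal, using the same step count.
    return run(prompt=req.prompt, num_inference_steps=req.num_inference_steps)
```

With real objects, the call would presumably look like `handle_request(req, pipe.transformer, cache_dit.refresh_context, pipe)`. Passing the callables in keeps the helper easy to exercise with stubs.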

🔥HTTP Serving Support

  • Built-in HTTP serving with simple REST APIs, by @BBuf: deploy cache-dit models behind an HTTP API for text-to-image, image editing, multi-image editing, and text/image-to-video generation.

🔥Context Parallelism Optimization

  • UAA (Ulysses Anything Attention): supports any sequence length and any number of attention heads, by @DefTruth @gameofdimension @tingkuanpei
  • Async Ulysses CP: supports async Ulysses QKV projection for FLUX.1, FLUX.2, Z-Image, and Qwen-Image, by @DefTruth
  • Async FP8 Ulysses: supports async FP8 all-to-all communication for Ulysses, by @triple-mu
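
Ulysses-style sequence parallelism typically expects the sequence length and the head count to divide evenly across ranks, so supporting arbitrary sizes implies handling uneven shards. The sketch below shows one simple way to compute such shard sizes. It is an illustration only, not cache-dit's actual UAA implementation, and `shard_sizes` is a hypothetical helper.

```python
from typing import List


def shard_sizes(total: int, world_size: int) -> List[int]:
    """Split `total` items into `world_size` contiguous shards
    whose sizes differ by at most one."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    base, extra = divmod(total, world_size)
    # The first `extra` ranks each take one additional item.
    return [base + 1 if rank < extra else base for rank in range(world_size)]


print(shard_sizes(4097, 2))  # [2049, 2048]
print(shard_sizes(24, 5))    # [5, 5, 5, 5, 4]
```

The same split applies to either dimension: a sequence of 4097 tokens on 2 ranks, or 24 heads on 5 ranks.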

🔥Text Encoder Parallelism

cache-dit currently supports text encoder parallelism for the T5Encoder, UMT5Encoder, Llama, Gemma 1/2/3, Mistral, Mistral-3, Qwen-3, Qwen-2.5 VL, Glm, and Glm-4 model series, which covers almost 🔥ALL pipelines in diffusers.

When using Tensor Parallelism or Context Parallelism, users can set extra_parallel_modules (via parallel_kwargs in parallelism_config) to specify additional modules to parallelize beyond the main transformer, e.g., text_encoder in Flux2Pipeline. This further reduces the per-GPU memory requirement and slightly improves the inference performance of the text encoder.

# pip3 install "cache-dit[parallelism]"
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig

# Transformer Tensor Parallelism + Text Encoder Tensor Parallelism
cache_dit.enable_cache(
    pipe, 
    cache_config=DBCacheConfig(...),
    parallelism_config=ParallelismConfig(
        tp_size=2,
        parallel_kwargs={
            "extra_parallel_modules": [pipe.text_encoder], # FLUX.2
        },
    ),
)
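
As a rough illustration of the per-GPU memory saving, the sketch below estimates weight memory per GPU under tensor parallelism. It assumes the weights shard evenly across `tp_size` ranks and ignores activations, replicated layers such as norms and embeddings, and communication buffers. `per_gpu_weight_gib` and the parameter count in the usage lines are hypothetical, not measurements of any real model.

```python
def per_gpu_weight_gib(num_params: int, bytes_per_param: int, tp_size: int) -> float:
    """Estimate per-GPU weight memory in GiB, assuming an even split across ranks."""
    if tp_size < 1:
        raise ValueError("tp_size must be >= 1")
    total_bytes = num_params * bytes_per_param
    return total_bytes / tp_size / 2**30


# Hypothetical encoder with 4 * 2**30 parameters in 2-byte precision (e.g. bf16).
print(per_gpu_weight_gib(4 * 2**30, 2, tp_size=1))  # 8.0
print(per_gpu_weight_gib(4 * 2**30, 2, tp_size=2))  # 4.0
```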

🔥Auto Encoder (VAE) Parallelism

cache-dit currently supports auto encoder (VAE) parallelism for the AutoencoderKL, AutoencoderKLQwenImage, AutoencoderKLWan, and AutoencoderKLHunyuanVideo series, which covers almost 🔥ALL pipelines in diffusers. This further reduces the per-GPU memory requirement and slightly improves the inference performance of the auto encoder. Enable it through the extra_parallel_modules entry of parallel_kwargs in parallelism_config, for example:

# pip3 install "cache-dit[parallelism]"
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig

# Transformer Context Parallelism + Text Encoder Tensor Parallelism + VAE Data Parallelism
cache_dit.enable_cache(
    pipe, 
    cache_config=DBCacheConfig(...),
    parallelism_config=ParallelismConfig(
        ulysses_size=2,
        parallel_kwargs={
            "extra_parallel_modules": [pipe.text_encoder, pipe.vae], # FLUX.1
        },
    ),
)

🔥ControlNet Parallelism

cache-dit also supports ControlNet parallelism for specific models, such as Z-Image-Turbo with ControlNet. Enable it through the extra_parallel_modules entry of parallel_kwargs in parallelism_config, for example:

# pip3 install "cache-dit[parallelism]"
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig

# Transformer Context Parallelism + Text Encoder Tensor Parallelism 
# + VAE Data Parallelism + ControlNet Context Parallelism
cache_dit.enable_cache(
    pipe, 
    cache_config=DBCacheConfig(...),
    parallelism_config=ParallelismConfig(
        ulysses_size=2,
        # case: Z-Image-Turbo-Fun-ControlNet-2.1
        parallel_kwargs={
            "extra_parallel_modules": [pipe.text_encoder, pipe.vae, pipe.controlnet],
        },
    ),
)
# torchrun --nproc_per_node=2 parallel_cache.py
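
Because pipelines differ in which components they carry, the list passed as extra_parallel_modules can be built from whatever is present. This is a minimal sketch: `collect_extra_parallel_modules` is a hypothetical helper, not a cache-dit API, and the attribute names are the ones used in the examples above. The stand-in pipeline exists only to show the behaviour.

```python
from types import SimpleNamespace
from typing import Any, List, Sequence


def collect_extra_parallel_modules(
    pipe: Any,
    names: Sequence[str] = ("text_encoder", "vae", "controlnet"),
) -> List[Any]:
    """Return the listed pipeline components that exist and are not None."""
    modules = []
    for name in names:
        module = getattr(pipe, name, None)
        if module is not None:
            modules.append(module)
    return modules


# Stand-in pipeline without a ControlNet (strings replace real modules).
fake_pipe = SimpleNamespace(text_encoder="text_encoder", vae="vae", controlnet=None)
print(collect_extra_parallel_modules(fake_pipe))  # ['text_encoder', 'vae']
```

The result could then be passed as `parallel_kwargs={"extra_parallel_modules": collect_extra_parallel_modules(pipe)}`. Whether a given module type can actually be parallelized still depends on the supported series listed in the sections above.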

🔥Ascend NPU Support

Cache-DiT now provides native support for Ascend NPU (by @gameofdimension @luren55 @DefTruth). In principle, nearly all models supported by Cache-DiT can run on Ascend NPU with most of Cache-DiT's optimization techniques, including:

  • Hybrid Cache Acceleration (DBCache, DBPrune, TaylorSeer, SCM and more)
  • Context Parallelism (w/ Extended Diffusers' CP APIs, UAA, Async Ulysses, ...)
  • Tensor Parallelism (w/ PyTorch native DTensor and Tensor Parallelism APIs)
  • Text Encoder Parallelism (w/ PyTorch native DTensor and Tensor Parallelism APIs)
  • Auto Encoder (VAE) Parallelism (w/ Data or Tile Parallelism, to avoid OOM)
  • ControlNet Parallelism (w/ Context Parallelism for ControlNet module)
  • Built-in HTTP serving deployment support with simple REST APIs

Please refer to Ascend NPU Supported Matrix for more details.

🔥Community Integration

Full Changelogs

New Contributors

Full Changelog: v1.1.0...v1.2.0