## **9. Accelerate Inference and Reduce Memory**

> Original Source: https://huggingface.co/docs/diffusers/main/optimization/fp16

```
> Accelerate inference
> Reduce memory usage
> Diffusers supports: PyTorch 2.0, xFormers, Token merging, DeepCache, TGATE, xDiT, ParaAttention
> Optimized Model Format: JAX/Flax, ONNX, OpenVINO, CoreML
```

In [4]:
import time
import functools
import torch
import tomesd
from diffusers import DiffusionPipeline
from diffusers import StableDiffusionXLPipeline
from diffusers import StableDiffusionPipeline
from diffusers import StableDiffusion3Pipeline
from diffusers import UNet2DConditionModel, LCMScheduler
from dataclasses import dataclass

from torch.nn.attention import SDPBackend, sdpa_kernel
from torchao import apply_dynamic_quant

from diffusers import AutoModel
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

from diffusers.hooks import apply_group_offloading
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video
from diffusers.hooks import apply_layerwise_casting

from DeepCache import DeepCacheSDHelper
from diffusers import PixArtAlphaPipeline

from diffusers import DPMSolverMultistepScheduler
from tgate import TgatePixArtLoader
from tgate import TgateSDXLLoader
from tgate import TgateSDXLDeepCacheLoader

from xfuser import xFuserArgs, xDiTParallel
from xfuser.config import FlexibleArgumentParser
from xfuser.core.distributed import get_world_group

from diffusers import FluxPipeline
import torch.distributed as dist

import jax
import jax.tools.colab_tpu

from diffusers.utils import make_image_grid

from optimum.onnxruntime import ORTStableDiffusionPipeline
from optimum.onnxruntime import ORTStableDiffusionXLPipeline
from optimum.intel import OVStableDiffusionPipeline
from optimum.intel import OVStableDiffusionXLPipeline

from huggingface_hub import snapshot_download
from pathlib import Path

----
### **Accelerate inference**
- Diffusion models are slow at inference because generation is an iterative process where noise is gradually refined into an image or video over a certain number of “steps”.
  - To speed up this process, you can experiment with different schedulers, reduce the precision of the model weights for faster computations, use more memory-efficient attention mechanisms, and more.
  - Combine and use these techniques together to make inference faster than using any single technique on its own.

#### Model data type
- The precision and data type of the model weights affect inference speed because a higher precision requires more memory to load and more time to perform the computations.
  - PyTorch loads model weights in float32 or full precision by default, so changing the data type is a simple way to quickly get faster inference.
 
- `bfloat16`
  - `bfloat16` is similar to `float16` but it is more robust to numerical errors.
  - Hardware support for `bfloat16` varies, but most modern GPUs are capable of supporting `bfloat16`.

In [None]:
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
pipeline(prompt, num_inference_steps=30).images[0]

- `float16`
  - `float16` is similar to `bfloat16` but may be more prone to numerical errors.

In [None]:
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
pipeline(prompt, num_inference_steps=30).images[0]
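
As a rough, library-free illustration of the trade-off between these data types, the sketch below estimates weight memory from a parameter count and compares the numeric range of `float16` and `bfloat16`. The helper names and the parameter count are assumptions made for this example, and `bfloat16` is emulated by truncating a `float32` bit pattern (real hardware typically rounds instead).

```python
import struct

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

def weight_memory_gib(num_params, dtype):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

def fits_float16(value):
    """Return True if `value` can be stored in IEEE half precision without overflow."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False

def to_bfloat16(value):
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 pattern."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

num_params = 2_600_000_000  # hypothetical parameter count
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(num_params, dtype):.2f} GiB")

print(fits_float16(70000.0))  # float16 overflows just above 65504
print(to_bfloat16(70000.0))   # bfloat16 keeps float32's exponent range, with a coarser mantissa
```

Halving the bytes per parameter halves the weight memory for both 16-bit types. The difference is that `bfloat16` trades mantissa bits for the wider exponent range of `float32`, which is why it is less likely to overflow.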

- `TensorFloat-32 (tf32)`
  - `TensorFloat-32 (tf32)` mode is supported on NVIDIA Ampere GPUs and it computes the convolution and matrix multiplication operations in `tf32`.
  - Storage and other operations are kept in `float32`. This enables significantly faster computations when combined with `bfloat16` or `float16`.
  - PyTorch only enables `tf32` mode for convolutions by default and you’ll need to explicitly enable it for matrix multiplications.

In [None]:
torch.backends.cuda.matmul.allow_tf32 = True

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
pipeline(prompt, num_inference_steps=30).images[0]

#### Scaled dot product attention
- Scaled dot product attention (SDPA) implements several attention backends: FlashAttention, xFormers, and a native C++ implementation.
  - It automatically selects the optimal backend for your hardware.

- SDPA is enabled by default if you’re using `PyTorch >= 2.0` and no additional changes are required to your code.
  - You could try experimenting with other attention backends though if you’d like to choose your own.
  - The example below uses the `torch.nn.attention.sdpa_kernel` context manager to enable efficient attention.

In [None]:
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
  image = pipeline(prompt, num_inference_steps=30).images[0]

#### `torch.compile`
- `torch.compile` accelerates inference by compiling PyTorch code and operations into optimized kernels.
  - Diffusers typically compiles the more compute-intensive models like the UNet, transformer, or VAE.
  - Enable the following compiler settings for maximum speed (refer to the full list for more options).

In [None]:
torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True

- Load and compile the UNet and VAE.
- There are several different modes you can choose from, but `"max-autotune"` optimizes for the fastest speed by compiling to a CUDA graph.
  - CUDA graphs reduce overhead by launching multiple GPU operations through a single CPU operation.
- Changing the memory layout to `channels_last` also optimizes memory and inference speed.

In [None]:
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipeline.unet.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)
pipeline.unet = torch.compile(
    pipeline.unet, mode="max-autotune", fullgraph=True
)
pipeline.vae.decode = torch.compile(
    pipeline.vae.decode,
    mode="max-autotune",
    fullgraph=True
)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
pipeline(prompt, num_inference_steps=30).images[0]

- Compilation is slow the first time, but once compiled, it is significantly faster.
  - Try to only use the compiled pipeline on the same type of inference operations.
  - Calling the compiled pipeline on a different image size retriggers compilation which is slow and inefficient.

<br>

- **Graph breaks**
  - It is important to specify `fullgraph=True` in `torch.compile` to ensure there are no graph breaks in the underlying model.
  - This allows you to take advantage of torch.compile without any performance degradation.
  - For the UNet and VAE, this changes how you access the return variables.
  ```
    - latents = unet(
    -   latents, timestep=timestep, encoder_hidden_states=prompt_embeds
    -).sample
    
    + latents = unet(
    +   latents, timestep=timestep, encoder_hidden_states=prompt_embeds, return_dict=False
    +)[0]
  ```

<br>

- **GPU sync**
  - The `step()` function is called on the scheduler each time after the denoiser makes a prediction, and the sigmas variable is indexed.
    - When placed on the GPU, it introduces latency because of the communication sync between the CPU and GPU.
    - It becomes more evident when the denoiser has already been compiled.
  - In general, the sigmas should stay on the CPU to avoid the communication sync and latency.

#### Dynamic quantization
- Dynamic quantization improves inference speed by reducing precision to enable faster math operations.
  - This particular type of quantization determines how to scale the activations based on the data at runtime rather than using a fixed scaling factor.
  - As a result, the scaling factor is more accurately aligned with the data.
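
To make the difference between a fixed and a runtime-derived scaling factor concrete, here is a minimal pure-Python sketch of symmetric `int8` quantization. It is not how `torchao` implements dynamic quantization. The helper names, the sample activations, and the assumed static calibration range of [-1, 1] are all made up for illustration.

```python
def quantize_int8(values, scale):
    """Map floats to int8 codes with a given scale, clamping to the int8 range."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]

def max_error(values, scale):
    """Largest absolute round-trip error for a given scale."""
    restored = dequantize(quantize_int8(values, scale), scale)
    return max(abs(a - b) for a, b in zip(values, restored))

activations = [0.01, -0.02, 0.03]  # made-up activations with a small range

static_scale = 1.0 / 127  # fixed ahead of time for an assumed range of [-1, 1]
dynamic_scale = max(abs(v) for v in activations) / 127  # derived from the data at runtime

print(f"static scale error:  {max_error(activations, static_scale):.6f}")
print(f"dynamic scale error: {max_error(activations, dynamic_scale):.6f}")
```

Because the dynamic scale is fitted to the values actually observed, the full `int8` range is used and the round-trip error is smaller than with a scale chosen for a much wider range.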

- The example below applies dynamic `int8` quantization to the UNet and VAE with the `torchao` library.

In [None]:
torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True
torch._inductor.config.force_fuse_int_mm_with_mul = True
torch._inductor.config.use_mixed_mm = True

- Filter out some linear layers in the UNet and VAE which don’t benefit from dynamic quantization with the `dynamic_quant_filter_fn`.

In [None]:
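pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

# Simplified filter (an assumption for this example): only quantize linear layers
# with more than 16 input features. Extend it to exclude layer shapes that
# don't benefit from quantization.
def dynamic_quant_filter_fn(mod, *args):
    return isinstance(mod, torch.nn.Linear) and mod.in_features > 16

# Note: apply_dynamic_quant may not be available in newer torchao releases
apply_dynamic_quant(pipeline.unet, dynamic_quant_filter_fn)
apply_dynamic_quant(pipeline.vae, dynamic_quant_filter_fn)

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
pipeline(prompt, num_inference_steps=30).images[0]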

#### Fused projection matrices
- An input is projected into three subspaces, represented by the projection matrices Q, K, and V, in an attention block.
  - These projections are typically calculated separately, but you can horizontally combine these into a single matrix and perform the projection in a single step.
  - It increases the size of the matrix multiplications of the input projections and also improves the impact of quantization.

In [None]:
pipeline.fuse_qkv_projections()
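
The equivalence behind fusing can be checked with a small pure-Python sketch. It shows only the underlying linear algebra, not the Diffusers implementation. The tiny integer matrices and the helper names are assumptions chosen so the results compare exactly.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def hconcat(*matrices):
    """Concatenate matrices with the same number of rows along the column axis."""
    return [[v for m in matrices for v in m[i]] for i in range(len(matrices[0]))]

x = [[1, 2], [3, 4]]  # 2 tokens, hidden size 2
w_q = [[1, 0], [0, 1]]
w_k = [[2, 1], [1, 2]]
w_v = [[0, 3], [3, 0]]

# Separate projections: three matrix multiplications
q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)

# Fused projection: one larger multiplication, then split the columns
fused = matmul(x, hconcat(w_q, w_k, w_v))
q_f = [row[0:2] for row in fused]
k_f = [row[2:4] for row in fused]
v_f = [row[4:6] for row in fused]

print(q == q_f and k == k_f and v == v_f)
```

The fused version produces the same Q, K, and V values with one wider matrix multiplication instead of three narrow ones.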

----
### **Reduce Memory Usage**
- Modern diffusion models like Flux and Wan have billions of parameters that take up a lot of memory on your hardware for inference.
  - This is challenging because common GPUs often don’t have sufficient memory.
  - To overcome the memory limitations, you can use more than one GPU (if available), offload some of the pipeline components to the CPU, and more.

- Keep in mind these techniques may need to be adjusted depending on the model.
  - ex. a transformer-based diffusion model may not benefit equally from these inference speed optimizations as a UNet-based model.

#### Multiple GPUs
- If you have access to more than one GPU, there are a few options for efficiently loading and distributing a large model across your hardware.

- **Sharded checkpoints**
  - Loading large checkpoints in several shards is useful because the shards are loaded one at a time.
  - This keeps memory usage low, only requiring enough memory for the model size and the largest shard size.
  - We recommend sharding when the `fp32` checkpoint is greater than 5GB. The default shard size is 5GB.
  - Shard a checkpoint in `save_pretrained()` with the `max_shard_size` parameter.

In [None]:
unet = AutoModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
unet.save_pretrained("sdxl-unet-sharded", max_shard_size="5GB")

# Load the sharded checkpoint saved above
unet = AutoModel.from_pretrained(
    "sdxl-unet-sharded", torch_dtype=torch.float16
)
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    unet=unet,
    torch_dtype=torch.float16
).to("cuda")
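
To illustrate why sharding bounds peak memory while loading, the sketch below greedily packs tensors into shards under a size limit. This is a simplified model and not how `save_pretrained()` is implemented. The tensor names, their sizes, and the `plan_shards` helper are hypothetical.

```python
def plan_shards(tensor_sizes, max_shard_size):
    """Greedily pack named tensors (sizes in bytes) into shards no larger than max_shard_size."""
    shards, current, current_size = [], [], 0
    for name, size in tensor_sizes.items():
        # Start a new shard when adding this tensor would exceed the limit
        if current and current_size + size > max_shard_size:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    return shards

GB = 10**9
# Hypothetical tensor groups and sizes
sizes = {"down_blocks": 3 * GB, "mid_block": 1 * GB, "up_blocks": 4 * GB, "conv_out": 1 * GB}

shards = plan_shards(sizes, max_shard_size=5 * GB)
print(shards)

largest = max(sum(sizes[name] for name in shard) for shard in shards)
print(f"model size + largest shard = {(sum(sizes.values()) + largest) / GB:.0f} GB")
```

With shards capped at 5GB, loading needs room for the model plus at most one 5GB shard at a time, rather than the model plus a second full copy of the checkpoint.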

- **Device placement**
  - The `device_map` parameter controls how the model components in a pipeline are distributed across devices.
  - The balanced device placement strategy evenly splits the pipeline across all available devices.
  - You can inspect a pipeline’s device map with `hf_device_map`.
  - The `device_map` parameter also works at the model level.
    - This is useful for loading large models, such as the Flux diffusion transformer which has 12.5B parameters.
    - Instead of `"balanced"`, set it to `"auto"` to automatically distribute a model across the fastest device first before moving to slower devices.
  - For more fine-grained control, pass a `max_memory` dictionary to enforce the maximum GPU memory to use on each device.
    - If a device is not in `max_memory`, it is ignored and pipeline components won’t be distributed to it.

In [None]:
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    device_map="balanced"
)

print(pipeline.hf_device_map)

transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", 
    subfolder="transformer",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

In [None]:
max_memory = {0:"1GB", 1:"1GB"}
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    device_map="balanced",
    max_memory=max_memory
)

- Diffusers uses the maximum memory of all devices by default, but if the models don’t fit on the GPUs, you’ll need to use a single GPU and offload to the CPU with the methods below.
  - `enable_model_cpu_offload()` only works on a single GPU, but a very large model may not fit on it.
  - `enable_sequential_cpu_offload()` may work, but it is extremely slow and also limited to a single GPU.
- Use the `reset_device_map()` method to reset the `device_map`.
  - This is necessary if you want to use methods like `.to()`, `enable_sequential_cpu_offload()`, and `enable_model_cpu_offload()` on a pipeline that was device-mapped.

In [None]:
pipeline.reset_device_map()

#### VAE slicing
- VAE slicing saves memory by splitting a large batch of inputs into single-item batches and processing them separately.
  - This method works best when generating more than one image at a time.
  - ex. if you’re generating 4 images at once, decoding would increase peak activation memory by `4x`.
  - VAE slicing reduces this by only decoding 1 image at a time instead of all 4 images at once.

- Call `enable_vae_slicing()` to enable sliced VAE.
  - You can expect a small increase in performance when decoding multi-image batches and no performance impact for single-image batches.

In [None]:
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
pipeline.enable_vae_slicing()
pipeline(["An astronaut riding a horse on Mars"]*32).images[0]
print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
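
The idea behind slicing can be sketched without a real VAE. Below, `decode` is a stand-in function (an assumption for this example, not the Diffusers decoder) that records the batch size it receives, so the peak batch size with and without slicing can be compared.

```python
seen_batch_sizes = []  # batch sizes the stand-in decoder has been asked to process

def decode(latents):
    """Stand-in for a VAE decoder; peak activation memory grows with len(latents)."""
    seen_batch_sizes.append(len(latents))
    return [f"image({z})" for z in latents]

def decode_sliced(latents):
    """Decode one latent at a time and stitch the results back into a single batch."""
    images = []
    for z in latents:
        images.extend(decode([z]))
    return images

latents = ["z0", "z1", "z2", "z3"]

full = decode(latents)
print(max(seen_batch_sizes))  # the decoder saw all 4 latents at once

seen_batch_sizes.clear()
sliced = decode_sliced(latents)
print(max(seen_batch_sizes), sliced == full)  # one latent at a time, same result
```

The outputs are identical, but the decoder only ever holds the activations for a single image.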

#### VAE tiling
- VAE tiling saves memory by dividing an image into smaller overlapping tiles instead of processing the entire image at once.
  - This also reduces peak memory usage because the GPU is only processing a tile at a time.

- Call `enable_vae_tiling()` to enable VAE tiling.
  - The generated image may have some tone variation from tile-to-tile because they’re decoded separately, but there shouldn’t be any obvious seams between the tiles.
  - Tiling is disabled for resolutions lower than a pre-specified (but configurable) limit.
  - ex. this limit is 512x512 for the VAE in `StableDiffusionPipeline`.

In [None]:
pipeline = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipeline.enable_vae_tiling()

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-sdxl-init.png")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
pipeline(prompt, image=init_image, strength=0.5).images[0]
print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
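
As an illustration of how an image can be covered by overlapping tiles, the sketch below computes tile boxes along each axis. The tile size, the overlap, and the choice to align the last tile to the image edge are assumptions for this example. The Diffusers implementation differs in its details, for instance in how it blends overlapping regions.

```python
def tile_starts(length, tile_size, overlap):
    """Start offsets of overlapping tiles that cover `length` pixels along one axis."""
    if length <= tile_size:
        return [0]  # small inputs are processed in one piece
    stride = tile_size - overlap
    starts = list(range(0, length - tile_size, stride))
    starts.append(length - tile_size)  # align the last tile to the edge
    return starts

def tile_boxes(height, width, tile_size=512, overlap=64):
    """(top, left, bottom, right) boxes for every tile of an image."""
    return [
        (top, left, min(top + tile_size, height), min(left + tile_size, width))
        for top in tile_starts(height, tile_size, overlap)
        for left in tile_starts(width, tile_size, overlap)
    ]

boxes = tile_boxes(1024, 1024)
print(len(boxes), boxes[0], boxes[-1])

# An image no larger than one tile is left untiled
print(tile_boxes(256, 256))
```

Each tile is at most 512x512, so the memory needed at any moment depends on the tile size rather than the full image resolution.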

#### CPU offloading
- CPU offloading selectively moves weights from the GPU to the CPU.
  - When a component is required, it is transferred to the GPU and when it isn’t required, it is moved to the CPU.
  - This method works on submodules rather than whole models.
  - It saves memory by avoiding storing the entire model on the GPU.

- CPU offloading dramatically reduces memory usage, but it is also extremely slow because submodules are passed back and forth multiple times between devices.
  - It can often be impractical due to how slow it is.

- Don’t move the pipeline to CUDA before calling `enable_sequential_cpu_offload()`, otherwise the amount of memory saved is only minimal (refer to this issue for more details).
  - This is a stateful operation that installs hooks on the model.

- Call `enable_sequential_cpu_offload()` to enable it on a pipeline.

In [None]:
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipeline.enable_sequential_cpu_offload()

pipeline(
    prompt="An astronaut riding a horse on Mars",
    guidance_scale=0.,
    height=768,
    width=1360,
    num_inference_steps=4,
    max_sequence_length=256,
).images[0]
print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

#### Model offloading
- Model offloading moves entire models to the GPU instead of selectively moving some layers or model components.
  - One of the main pipeline models, usually the text encoder, UNet, and VAE, is placed on the GPU while the other components are held on the CPU.
  - Components like the UNet that run multiple times stay on the GPU until they are completely finished and no longer needed.
  - This eliminates the communication overhead of CPU offloading and makes model offloading a faster alternative.
  - The tradeoff is memory savings won’t be as large.

- If models are reused outside the pipeline after hooks have been installed (see Removing Hooks for more details), you need to run the entire pipeline and models in the expected order to properly offload them.
  - This is a stateful operation that installs hooks on the model.

- Call `enable_model_cpu_offload()` to enable it on a pipeline.

In [None]:
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()

pipeline(
    prompt="An astronaut riding a horse on Mars",
    guidance_scale=0.,
    height=768,
    width=1360,
    num_inference_steps=4,
    max_sequence_length=256,
).images[0]
print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

#### Group offloading
- Group offloading moves groups of internal layers (`torch.nn.ModuleList` or `torch.nn.Sequential`) to the CPU.
  - It uses less memory than model offloading and it is faster than CPU offloading because it reduces communication overhead.

- Group offloading may not work with all models if the forward implementation contains weight-dependent device casting of inputs because it may clash with group offloading’s device casting mechanism.

- Call `enable_group_offload()` to enable it for standard Diffusers model components that inherit from `ModelMixin`.
  - For other model components that don’t inherit from `ModelMixin`, such as a generic `torch.nn.Module`, use `apply_group_offloading()` instead.
  - The `offload_type` parameter can be set to `block_level` or `leaf_level`.

- `block_level` offloads groups of layers based on the `num_blocks_per_group` parameter.
  - ex. if `num_blocks_per_group=2` on a model with 40 layers, 2 layers are onloaded and offloaded at a time (20 total onloads/offloads).
  - This drastically reduces memory requirements.
- `leaf_level` offloads individual layers at the lowest level and is equivalent to CPU offloading.
  - It can be made faster by using streams, so you don't have to give up inference speed.
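
The `block_level` arithmetic above can be sketched in a few lines. The `block_level_groups` helper is hypothetical and only models how layers are grouped, not how Diffusers moves them between devices.

```python
def block_level_groups(num_layers, num_blocks_per_group):
    """Split layer indices into consecutive groups that are onloaded and offloaded together."""
    layers = list(range(num_layers))
    return [
        layers[i:i + num_blocks_per_group]
        for i in range(0, num_layers, num_blocks_per_group)
    ]

groups = block_level_groups(num_layers=40, num_blocks_per_group=2)
print(len(groups))  # number of onload/offload rounds
print(groups[:3])   # the first few groups of layer indices

# With one layer per group, every layer is moved on its own
print(len(block_level_groups(40, 1)))
```

Larger groups mean fewer transfers but more layers resident on the GPU at once, which is the memory versus speed trade-off this parameter controls.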

In [None]:
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Use the enable_group_offload method for Diffusers model implementations
pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level")
pipeline.vae.enable_group_offload(onload_device=onload_device, offload_type="leaf_level")

# Use the apply_group_offloading method for other model components
apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2)

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
export_to_video(video, "output.mp4", fps=8)

#### CUDA stream
- The `use_stream` parameter can be activated for CUDA devices that support asynchronous data transfer streams to reduce overall execution time compared to CPU offloading.
  - It overlaps data transfer and computation by using layer prefetching.
  - The next layer to be executed is loaded onto the GPU while the current layer is still being executed.
  - It can increase CPU memory significantly so ensure you have 2x the amount of memory as the model size.
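
A simple timing model shows why prefetching helps. This is an idealized sketch: the per-layer times are made up, and it assumes the transfer of the next layer overlaps perfectly with the computation of the current one.

```python
def sequential_time(transfer, compute):
    """Each layer is transferred and then executed, with no overlap."""
    return sum(transfer) + sum(compute)

def prefetch_time(transfer, compute):
    """While layer i runs, layer i+1 is transferred on a separate stream."""
    total = transfer[0]  # the first transfer cannot be hidden behind any compute
    for i, c in enumerate(compute):
        next_transfer = transfer[i + 1] if i + 1 < len(transfer) else 0
        total += max(c, next_transfer)
    return total

transfer = [4, 4, 4, 4]  # made-up per-layer host-to-device copy times (ms)
compute = [5, 5, 5, 5]   # made-up per-layer execution times (ms)

print(sequential_time(transfer, compute), prefetch_time(transfer, compute))
```

When computing a layer takes at least as long as transferring the next one, nearly all of the transfer cost is hidden. When transfers are slower than compute, they remain the bottleneck.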

- Set `record_stream=True` for more of a speedup at the cost of slightly increased memory usage.
  - Refer to the `torch.Tensor.record_stream` docs to learn more.

- When `use_stream=True` on VAEs with tiling enabled, make sure to do a dummy forward pass (possible with dummy inputs as well) before inference to avoid device mismatch errors.
  - This may not work on all implementations, so feel free to open an issue if you encounter any problems.

- If you’re using `block_level` group offloading with `use_stream` enabled, the `num_blocks_per_group` parameter should be set to 1, otherwise a warning will be raised.

In [None]:
pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True, record_stream=True)

- The `low_cpu_mem_usage` parameter can be set to `True` to reduce CPU memory usage when using streams during group offloading.
  - It is best for `leaf_level` offloading and when CPU memory is bottlenecked.
  - Memory is saved by creating pinned tensors on the fly instead of pre-pinning them.
  - However, this may increase overall execution time.

#### Layerwise casting
- Layerwise casting stores weights in a smaller data format (for example, `torch.float8_e4m3fn` and `torch.float8_e5m2`) to use less memory and upcasts those weights to a higher precision like `torch.float16` or `torch.bfloat16` for computation.
  - Certain layers (normalization and modulation related weights) are skipped because storing them in fp8 can degrade generation quality.
  - Call `enable_layerwise_casting()` to set the storage and computation datatypes.

In [None]:
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b",
    subfolder="transformer",
    torch_dtype=torch.bfloat16
)
transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16)

pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    transformer=transformer,
    torch_dtype=torch.bfloat16
).to("cuda")
prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
print(f"Max memory reserved: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
export_to_video(video, "output.mp4", fps=8)

- The `apply_layerwise_casting()` method can also be used if you need more control and flexibility.
  - It can be partially applied to model layers by calling it on specific internal modules.
  - Use the `skip_modules_pattern` or `skip_modules_classes` parameters to specify modules to avoid, such as the normalization and modulation layers.

In [None]:
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b",
    subfolder="transformer",
    torch_dtype=torch.bfloat16
)

# skip modules whose names match "norm" (the normalization layers)
apply_layerwise_casting(
    transformer,
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
    skip_modules_pattern=["norm"],
    non_blocking=True,
)

#### `torch.channels_last`
- `torch.channels_last` flips how tensors are stored from (batch size, channels, height, width) to (batch size, height, width, channels).
  - This aligns the tensors with how the hardware sequentially accesses the tensors stored in memory and avoids skipping around in memory to access the pixel values.

In [None]:
print(pipeline.unet.conv_out.state_dict()["weight"].stride())  # (2880, 9, 3, 1)
pipeline.unet.to(memory_format=torch.channels_last)  # in-place operation
print(
    pipeline.unet.conv_out.state_dict()["weight"].stride()
)  # (2880, 1, 960, 320) having a stride of 1 for the 2nd dimension proves that it works
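
The stride values in the comments above can be reproduced by hand. The sketch below computes element strides for both layouts, assuming the `conv_out` weight has shape (4, 320, 3, 3). The shape and the helper names are assumptions for this example.

```python
def contiguous_strides(shape):
    """Element strides for the default (N, C, H, W) memory layout."""
    n, c, h, w = shape
    return (c * h * w, h * w, w, 1)

def channels_last_strides(shape):
    """Element strides when the same logical shape is stored physically as (N, H, W, C)."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)

shape = (4, 320, 3, 3)  # assumed shape of the conv_out weight

print(contiguous_strides(shape))
print(channels_last_strides(shape))
```

A stride of 1 on the channel dimension means that all channel values for one pixel sit next to each other in memory.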

#### `torch.jit.trace`
- `torch.jit.trace` records the operations a model performs on a sample input and creates a new, optimized representation of the model based on the recorded execution path.
  - During tracing, the model is optimized to reduce overhead from Python and dynamic control flows and operations are fused together for more efficiency.
  - The returned executable or `ScriptFunction` can be compiled.

In [None]:
# torch disable grad
torch.set_grad_enabled(False)

# set variables
n_experiments = 2
unet_runs_per_experiment = 50

# load sample inputs
def generate_inputs():
    sample = torch.randn((2, 4, 64, 64), device="cuda", dtype=torch.float16)
    timestep = torch.rand(1, device="cuda", dtype=torch.float16) * 999
    encoder_hidden_states = torch.randn((2, 77, 768), device="cuda", dtype=torch.float16)
    return sample, timestep, encoder_hidden_states


pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")
unet = pipeline.unet
unet.eval()
unet.to(memory_format=torch.channels_last)  # use channels_last memory format
unet.forward = functools.partial(unet.forward, return_dict=False)  # set return_dict=False as default

# warmup
for _ in range(3):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet(*inputs)

# trace
print("tracing..")
unet_traced = torch.jit.trace(unet, inputs)
unet_traced.eval()
print("done tracing")

# warmup and optimize graph
for _ in range(5):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet_traced(*inputs)

# benchmarking
with torch.inference_mode():
    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet_traced(*inputs)
        torch.cuda.synchronize()
        print(f"unet traced inference took {time.time() - start_time:.2f} seconds")
    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet(*inputs)
        torch.cuda.synchronize()
        print(f"unet inference took {time.time() - start_time:.2f} seconds")

# save the model
unet_traced.save("unet_traced.pt")

- Replace the pipeline’s UNet with the traced version.

In [None]:
@dataclass
class UNet2DConditionOutput:
    sample: torch.Tensor

pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")

# use jitted unet
unet_traced = torch.jit.load("unet_traced.pt")

# del pipeline.unet
class TracedUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.in_channels = pipeline.unet.config.in_channels
        self.device = pipeline.unet.device

    def forward(self, latent_model_input, t, encoder_hidden_states):
        sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)

pipeline.unet = TracedUNet()

with torch.inference_mode():
    image = pipeline([prompt] * 1, num_inference_steps=50).images[0]

#### Memory-efficient attention
- The Transformer attention mechanism is memory-intensive, especially for long sequences, so you can try using different and more memory-efficient attention types.
  - By default, if PyTorch >= 2.0 is installed, scaled dot-product attention (SDPA) is used. You don’t need to make any additional changes to your code.

- SDPA supports `FlashAttention` and xFormers as well as a native C++ PyTorch implementation.
  - It automatically selects the optimal implementation based on your input.
  - You can explicitly use xFormers with the `enable_xformers_memory_efficient_attention()` method.
  - Call `disable_xformers_memory_efficient_attention()` to disable it.

In [None]:
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
pipeline.enable_xformers_memory_efficient_attention()

# pipeline.disable_xformers_memory_efficient_attention()

-----
### **PyTorch 2.0**
> Original Source: https://huggingface.co/docs/diffusers/main/optimization/torch2.0

- Diffusers supports the latest optimizations from `PyTorch 2.0` which include:
  - A memory-efficient attention implementation, scaled dot product attention, without requiring any extra dependencies such as xFormers.
  - `torch.compile`, a just-in-time (JIT) compiler to provide an extra performance boost when individual models are compiled.


---------
### **xFormers**
- We recommend `xFormers` for both inference and training.
  - In our tests, the optimizations performed in the attention blocks allow for both faster speed and reduced memory consumption.

- You can use `enable_xformers_memory_efficient_attention()` for faster inference and reduced memory consumption, as shown in the Memory-efficient attention section above.

---------
### **Token merging**
- `Token merging (ToMe)` progressively merges redundant tokens/patches in the forward pass of a Transformer-based network, which can reduce the inference latency of `StableDiffusionPipeline`.

- The `apply_patch` function exposes a number of arguments to help strike a balance between pipeline inference speed and the quality of the generated tokens.
  - The most important argument is `ratio`, which controls the number of tokens that are merged during the forward pass.

- `ToMe` can largely preserve the quality of the generated images while boosting inference speed.
  - By increasing the ratio, you can speed up inference even further, but at the cost of some degraded image quality.
  - To test the quality of the generated images, we sampled a few prompts from Parti Prompts and performed inference with the `StableDiffusionPipeline` with the following settings:

In [None]:
pipeline = StableDiffusionPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True,
).to("cuda")

# Apply ToMe; a higher ratio merges more tokens (faster, but lower quality)
tomesd.apply_patch(pipeline, ratio=0.5)

image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
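
- The effect of `ratio` can be illustrated with a toy, dependency-free sketch. It assumes a simplified bipartite matching (even-indexed tokens are destinations, odd-indexed tokens are sources) and plain averaging. The helper names `cosine` and `merge_tokens` are hypothetical, and the real `tomesd` implementation differs in how it partitions, merges, and later unmerges tokens.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def merge_tokens(tokens, ratio):
    """Toy bipartite merge: the most similar source tokens are averaged
    into their best-matching destination token."""
    dst = [list(t) for t in tokens[0::2]]
    src = [list(t) for t in tokens[1::2]]
    # Number of tokens to remove, capped by the number of sources.
    r = min(int(len(tokens) * ratio), len(src))

    # For every source token, find its most similar destination token.
    matches = []
    for i, s in enumerate(src):
        j = max(range(len(dst)), key=lambda idx: cosine(s, dst[idx]))
        matches.append((cosine(s, dst[j]), i, j))
    matches.sort(reverse=True)  # most similar pairs first

    sums = [list(d) for d in dst]
    counts = [1] * len(dst)
    merged = set()
    for _, i, j in matches[:r]:
        sums[j] = [a + b for a, b in zip(sums[j], src[i])]
        counts[j] += 1
        merged.add(i)

    merged_dst = [[v / c for v in s] for s, c in zip(sums, counts)]
    kept_src = [s for i, s in enumerate(src) if i not in merged]
    return merged_dst + kept_src

tokens = [[1, 0], [1, 0.1], [0, 1], [0.1, 1], [1, 1], [1, 0.9], [-1, 0], [-1, 0.1]]
half = merge_tokens(tokens, ratio=0.5)      # 8 tokens -> 4 tokens
quarter = merge_tokens(tokens, ratio=0.25)  # 8 tokens -> 6 tokens
```

- Fewer tokens means less attention work per layer, which is where the speed-up comes from. Note that this sketch does not preserve token order.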

----
### **DeepCache**
- `DeepCache` accelerates `StableDiffusionPipeline` and `StableDiffusionXLPipeline` by strategically caching and reusing high-level features while efficiently updating low-level features by taking advantage of the U-Net architecture.
- Load and enable the `DeepCacheSDHelper`:

In [None]:
pipe = StableDiffusionPipeline.from_pretrained('stable-diffusion-v1-5/stable-diffusion-v1-5', torch_dtype=torch.float16).to("cuda")

helper = DeepCacheSDHelper(pipe=pipe)
helper.set_params(cache_interval=3, cache_branch_id=0)
helper.enable()

image = pipe("a photo of an astronaut on a moon").images[0]

- The `set_params` method accepts two arguments: `cache_interval` and `cache_branch_id`.
  - `cache_interval` means the frequency of feature caching, specified as the number of steps between each cache operation.
  - `cache_branch_id` identifies which branch of the network (ordered from the shallowest to the deepest layer) is responsible for executing the caching processes.
  - Opting for a lower `cache_branch_id` or a larger `cache_interval` can lead to faster inference speed at the expense of reduced image quality (ablation experiments of these two hyperparameters can be found in the paper).
  - Once those arguments are set, use the enable or disable methods to activate or deactivate the `DeepCacheSDHelper`.
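
- As a rough mental model of `cache_interval`, the sketch below lists which denoising steps would run the full U-Net and which would reuse cached high-level features. It assumes a uniform schedule that refreshes the cache every `cache_interval` steps starting at step 0. `deepcache_schedule` is a hypothetical helper, not part of the `DeepCache` package.

```python
def deepcache_schedule(num_inference_steps, cache_interval):
    """Label each step "full" (cache refreshed) or "cached" (features reused).

    Assumption: the cache is refreshed uniformly every `cache_interval` steps.
    """
    return [
        "full" if step % cache_interval == 0 else "cached"
        for step in range(num_inference_steps)
    ]

schedule = deepcache_schedule(num_inference_steps=10, cache_interval=3)
full_steps = [i for i, kind in enumerate(schedule) if kind == "full"]
```

- A larger `cache_interval` leaves fewer "full" steps, which matches the speed versus quality trade-off described above.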

----
### **T-GATE**
- `T-GATE` accelerates inference for `Stable Diffusion`, `PixArt`, and `Latent Consistency Model` pipelines by skipping the cross-attention calculation once it converges.
  - This method doesn’t require any additional training and it can speed up inference by 10-50%.
    - `T-GATE` is also compatible with other optimization methods like `DeepCache`.
   
- Create a `TgateLoader` with a pipeline, the gate step (the time step at which to stop calculating the cross-attention), and the number of inference steps.
  - Call the `tgate` method on the pipeline with a prompt, the gate step, and the number of inference steps.

- Accelerate `PixArtAlphaPipeline`, `StableDiffusionXLPipeline`, `StableDiffusionXLPipeline` with `DeepCache`, and `latent-consistency/lcm-sdxl` with `T-GATE`:

In [None]:
# PixArt
pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16)

gate_step = 8
inference_step = 25
pipe = TgatePixArtLoader(
       pipe,
       gate_step=gate_step,
       num_inference_steps=inference_step,
).to("cuda")

image = pipe.tgate(
       "An alpaca made of colorful building blocks, cyberpunk.",
       gate_step=gate_step,
       num_inference_steps=inference_step,
).images[0]

In [None]:
# Stable Diffusion XL
pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            variant="fp16",
            use_safetensors=True,
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

gate_step = 10
inference_step = 25
pipe = TgateSDXLLoader(
       pipe,
       gate_step=gate_step,
       num_inference_steps=inference_step,
).to("cuda")

image = pipe.tgate(
       "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.",
       gate_step=gate_step,
       num_inference_steps=inference_step
).images[0]

In [None]:
# Stable Diffusion XL with DeepCache
pipe = StableDiffusionXLPipeline.from_pretrained(
            "stabilityai/stable-diffusion-xl-base-1.0",
            torch_dtype=torch.float16,
            variant="fp16",
            use_safetensors=True,
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

gate_step = 10
inference_step = 25
pipe = TgateSDXLDeepCacheLoader(
       pipe,
       cache_interval=3,
       cache_branch_id=0,
).to("cuda")

image = pipe.tgate(
       "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.",
       gate_step=gate_step,
       num_inference_steps=inference_step
).images[0]

In [None]:
# Latent Consistency Model
unet = UNet2DConditionModel.from_pretrained(
    "latent-consistency/lcm-sdxl",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    unet=unet,
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

gate_step = 1
inference_step = 4
pipe = TgateSDXLLoader(
       pipe,
       gate_step=gate_step,
       num_inference_steps=inference_step,
       lcm=True
).to("cuda")

image = pipe.tgate(
       "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.",
       gate_step=gate_step,
       num_inference_steps=inference_step
).images[0]
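
- The role of the gate step can be sketched with a toy loop that computes cross-attention only before `gate_step` and reuses the last computed output afterwards. This is a conceptual illustration under that assumption. `run_with_gate` and the `cross_attention` callable are hypothetical, and the caching details of the actual `tgate` package are not reproduced.

```python
def run_with_gate(num_inference_steps, gate_step, cross_attention):
    """Toy denoising loop: compute cross-attention until `gate_step`,
    then reuse the cached output for the remaining steps."""
    cached = None
    outputs = []
    computed = 0
    for step in range(num_inference_steps):
        if step < gate_step:
            cached = cross_attention(step)  # still converging: compute it
            computed += 1
        outputs.append(cached)              # after the gate: reuse the cache
    return outputs, computed

outputs, computed = run_with_gate(
    num_inference_steps=25,
    gate_step=10,
    cross_attention=lambda step: f"attn@{step}",
)
```

- With the SDXL settings used above (25 steps, gate step 10), cross-attention would be computed in only 10 of the 25 steps in this simplified model.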

----
### **xDiT**
- `xDiT` is an inference engine designed for the large scale parallel deployment of Diffusion Transformers (DiTs).
  - `xDiT` provides a suite of efficient parallel approaches for Diffusion Models, as well as GPU kernel accelerations.

- There are four parallel methods supported in `xDiT`, including Unified Sequence Parallelism, PipeFusion, CFG parallelism and data parallelism.
  - The four parallel methods in xDiT can be configured in a hybrid manner, optimizing communication patterns to best suit the underlying network hardware.

- Optimization orthogonal to parallelization focuses on accelerating single GPU performance.
  - In addition to utilizing well-known Attention optimization libraries, we leverage compilation acceleration technologies such as torch.compile and onediff.
 
- Using `xDiT` to accelerate inference of a Diffusers model.

In [None]:
def main():
    # Parse xDiT runtime arguments from the command line
    parser = FlexibleArgumentParser(description="xFuser Arguments")
    args = xFuserArgs.add_cli_args(parser).parse_args()
    engine_args = xFuserArgs.from_cli_args(args)
    engine_config, input_config = engine_args.create_config()

    # Load the pipeline onto the GPU assigned to this process
    local_rank = get_world_group().local_rank
    pipe = StableDiffusion3Pipeline.from_pretrained(
        pretrained_model_name_or_path=engine_config.model_config.model,
        torch_dtype=torch.float16,
    ).to(f"cuda:{local_rank}")

    # do anything you want with pipeline here

    # Wrap the Diffusers pipeline so xDiT can parallelize it
    pipe = xDiTParallel(pipe, engine_config, input_config)

    pipe(
        height=input_config.height,
        width=input_config.width,
        prompt=input_config.prompt,
        num_inference_steps=input_config.num_inference_steps,
        output_type=input_config.output_type,
        generator=torch.Generator(device="cuda").manual_seed(input_config.seed),
    )

    if input_config.output_type == "pil":
        pipe.save("results", "stable_diffusion_3")

if __name__ == "__main__":
    main()

- We only need to use `xFuserArgs` from `xDiT` to get configuration parameters, and pass these parameters along with the pipeline object from the Diffusers library into `xDiTParallel` to complete the parallelization of a specific pipeline in Diffusers.
  - `xDiT` runtime parameters can be viewed in the command line using `-h`, and you can refer to this usage example for more details.
  - `xDiT` needs to be launched using torchrun to support its multi-node, multi-GPU parallel capabilities.
  - The following command can be used for 8-GPU parallel inference:
  ```
  torchrun --nproc_per_node=8 ./inference.py --model models/FLUX.1-dev --data_parallel_degree 2 --ulysses_degree 2 --ring_degree 2 --prompt "A snowy mountain" "A small dog" --num_inference_steps 50
  ```
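
- In the command above, the three degrees multiply to the number of launched processes (2 x 2 x 2 = 8). The sketch below checks that relationship before launching. It assumes that the parallel degrees must multiply to the world size, and `check_parallel_config` is a hypothetical helper rather than an `xDiT` API.

```python
import math

def check_parallel_config(world_size, **degrees):
    """Return True when the product of all parallel degrees equals the
    number of launched processes. The keyword names are illustrative."""
    return math.prod(degrees.values()) == world_size

ok = check_parallel_config(
    world_size=8,            # torchrun --nproc_per_node=8
    data_parallel_degree=2,
    ulysses_degree=2,
    ring_degree=2,
)
mismatch = check_parallel_config(world_size=8, ulysses_degree=2, ring_degree=2)
```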

----
### **ParaAttention**
- Large image generation models, such as `FLUX.1-dev`, can be an inference challenge for real-time applications and deployment because of their size.
  - `ParaAttention` is a library that implements context parallelism and first block cache, and can be combined with other techniques (`torch.compile`, `fp8` dynamic quantization), to accelerate inference.
  - This section applies `ParaAttention` to `FLUX.1-dev`; the original guide also covers `HunyuanVideo`, with both run on NVIDIA L20 GPUs.
  - As a baseline, `FLUX.1-dev` generates a 1024x1024 resolution image in 28 steps in 26.36 seconds.

#### First Block Cache
- Caching the output of the transformer blocks in the model and reusing them in the next inference steps reduces the computation cost and makes inference faster.
  - However, it is hard to decide when to reuse the cache while still ensuring high-quality generated images or videos.
  - `ParaAttention` directly uses the residual difference of the first transformer block output to approximate the difference among model outputs.
  - When the difference is small enough, the residual difference of previous inference steps is reused.
  - In other words, the denoising step is skipped.

- This achieves a 2x speedup on `FLUX.1-dev` inference with very good quality.
  - To apply first block cache on `FLUX.1-dev`, call `apply_cache_on_pipe` as shown below.
  - `0.08` is the default residual difference value for FLUX models.

In [None]:
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

apply_cache_on_pipe(pipe, residual_diff_threshold=0.08)

# Enable memory savings
# pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()

begin = time.time()
image = pipe(
    "A cat holding a sign that says hello world",
    num_inference_steps=28,
).images[0]
end = time.time()
print(f"Time: {end - begin:.2f}s")

print("Saving image to flux.png")
image.save("flux.png")
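
- The reuse decision described above can be sketched as a relative-difference test on the first block's residual. This is a minimal illustration on plain lists, assuming the difference is measured as mean absolute change relative to the previous residual's mean magnitude. `should_reuse_cache` is a hypothetical name, and the exact metric used inside `ParaAttention` may differ.

```python
def should_reuse_cache(prev_residual, curr_residual, threshold=0.08):
    """Return True when the first-block residual changed little enough
    that the cached result from the previous step can be reused."""
    n = len(prev_residual)
    mean_diff = sum(abs(c - p) for c, p in zip(curr_residual, prev_residual)) / n
    mean_prev = sum(abs(p) for p in prev_residual) / n
    if mean_prev == 0:
        return False  # nothing to compare against: run the full model
    return mean_diff / mean_prev < threshold

prev = [1.0, 2.0, 3.0]
reuse_small_change = should_reuse_cache(prev, [1.05, 2.0, 3.0])  # small change
reuse_large_change = should_reuse_cache(prev, [2.0, 2.0, 3.0])   # large change
```

- Raising the threshold makes the test pass more often, so more steps are skipped, which speeds up inference but risks lower quality.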

#### `fp8` quantization
- `fp8` with dynamic quantization further speeds up inference and reduces memory usage.
  - Both the activations and weights must be quantized in order to use the 8-bit NVIDIA Tensor Cores.
  - Use `float8_weight_only` and `float8_dynamic_activation_float8_weight` to quantize the text encoder and transformer model.

- The default quantization method is per tensor quantization, but if your GPU supports row-wise quantization, you can also try it for better accuracy.
  - `torch.compile` with `mode="max-autotune-no-cudagraphs"` or `mode="max-autotune"` selects the best kernel for performance.
  - Compilation can take a long time if it’s the first time the model is called, but it is worth it once the model has been compiled.

In [None]:
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

apply_cache_on_pipe(
    pipe,
    residual_diff_threshold=0.12,  # Use a larger value to make the cache take effect
)

from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only

quantize_(pipe.text_encoder, float8_weight_only())
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
pipe.transformer = torch.compile(
   pipe.transformer, mode="max-autotune-no-cudagraphs",
)

# Enable memory savings
# pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()

for i in range(2):
    begin = time.time()
    image = pipe(
        "A cat holding a sign that says hello world",
        num_inference_steps=28,
    ).images[0]
    end = time.time()
    if i == 0:
        print(f"Warm up time: {end - begin:.2f}s")
    else:
        print(f"Time: {end - begin:.2f}s")

print("Saving image to flux.png")
image.save("flux.png")
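
- To illustrate the difference between per-tensor and row-wise quantization mentioned above, the sketch below computes only the scaling factors. It assumes the `float8_e4m3fn` format, whose largest finite value is 448. The function names are hypothetical, and the actual rounding and kernels used by `torchao` are not shown.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of float8_e4m3fn

def per_tensor_scale(matrix):
    """One scale for the whole tensor, set by its largest magnitude."""
    return max(abs(v) for row in matrix for v in row) / FP8_E4M3_MAX

def row_wise_scales(matrix):
    """One scale per row, so small-valued rows keep more resolution."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in matrix]

weights = [
    [0.5, -1.0],     # small-magnitude row
    [100.0, 448.0],  # large-magnitude row dominates the per-tensor scale
]
tensor_scale = per_tensor_scale(weights)
row_scales = row_wise_scales(weights)
```

- With a single per-tensor scale, the small-magnitude row is divided by the same factor as the large one. With row-wise scales it gets its own, much smaller scale, which is why row-wise quantization can be more accurate.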

#### Context Parallelism
- Context Parallelism parallelizes inference and scales with multiple GPUs.
  - The ParaAttention compositional design allows you to combine Context Parallelism with First Block Cache and dynamic quantization.
  - If the inference process needs to be persistent and serviceable, it is suggested to use `torch.multiprocessing` to write your own inference processor.
  - This can eliminate the overhead of launching the process and loading and recompiling the model.
  - This combines `First Block Cache`, `fp8` dynamic quantization, `torch.compile`, and Context Parallelism for the fastest inference speed.

In [None]:
dist.init_process_group()

torch.cuda.set_device(dist.get_rank())

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

from para_attn.context_parallel import init_context_parallel_mesh
from para_attn.context_parallel.diffusers_adapters import parallelize_pipe
from para_attn.parallel_vae.diffusers_adapters import parallelize_vae

mesh = init_context_parallel_mesh(
    pipe.device.type,
    max_ring_dim_size=2,
)
parallelize_pipe(
    pipe,
    mesh=mesh,
)
parallelize_vae(pipe.vae, mesh=mesh._flatten())

from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

apply_cache_on_pipe(
    pipe,
    residual_diff_threshold=0.12,  # Use a larger value to make the cache take effect
)

from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only

quantize_(pipe.text_encoder, float8_weight_only())
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
torch._inductor.config.reorder_for_compute_comm_overlap = True
pipe.transformer = torch.compile(
   pipe.transformer, mode="max-autotune-no-cudagraphs",
)

# Enable memory savings
# pipe.enable_model_cpu_offload(gpu_id=dist.get_rank())
# pipe.enable_sequential_cpu_offload(gpu_id=dist.get_rank())

for i in range(2):
    begin = time.time()
    image = pipe(
        "A cat holding a sign that says hello world",
        num_inference_steps=28,
        output_type="pil" if dist.get_rank() == 0 else "pt",
    ).images[0]
    end = time.time()
    if dist.get_rank() == 0:
        if i == 0:
            print(f"Warm up time: {end - begin:.2f}s")
        else:
            print(f"Time: {end - begin:.2f}s")

if dist.get_rank() == 0:
    print("Saving image to flux.png")
    image.save("flux.png")

dist.destroy_process_group()

- Save to `run_flux.py` and launch it with torchrun.
```
# Use --nproc_per_node to specify the number of GPUs
torchrun --nproc_per_node=2 run_flux.py
```
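
- Conceptually, context parallelism gives each GPU a slice of the token sequence. The sketch below shows only that splitting step on a plain list, assuming contiguous, equally sized chunks. `split_sequence` is a hypothetical helper, and the communication between ranks that attention requires is not shown.

```python
def split_sequence(tokens, world_size):
    """Split a token sequence into `world_size` contiguous chunks, one per rank."""
    if len(tokens) % world_size != 0:
        raise ValueError("sequence length must be divisible by world_size")
    chunk = len(tokens) // world_size
    return [tokens[rank * chunk:(rank + 1) * chunk] for rank in range(world_size)]

# Two processes, as in `torchrun --nproc_per_node=2`
chunks = split_sequence(list(range(8)), world_size=2)
```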

----
### **Optimized Model Formats | JAX/Flax**
- Diffusers supports `Flax` for super fast inference on Google TPUs, such as those available in Colab, Kaggle or Google Cloud Platform.
- Make sure you’re using a TPU backend.
  - While JAX does not run exclusively on TPUs, you’ll get the best performance on a TPU because each server has 8 TPU accelerators working in parallel.
  - If you are running this guide in Colab, select Runtime in the menu above, select the option Change runtime type, and then select TPU under the Hardware accelerator setting.
- Import JAX and quickly check whether you’re using a TPU:

In [None]:
jax.tools.colab_tpu.setup_tpu()

num_devices = jax.device_count()
device_type = jax.devices()[0].device_kind

print(f"Found {num_devices} JAX devices of type {device_type}.")
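# Use `assert condition, message`; wrapping both in one pair of parentheses
# creates a tuple, which is always truthy, so the check would never fail.
assert "TPU" in device_type, (
    "Available device is not a TPU, please select TPU from Runtime > Change runtime type > Hardware accelerator"
)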
# Found 8 JAX devices of type Cloud TPU.

#### Load a model
- `Flax` is a functional framework, so models are stateless and parameters are stored outside of them.
  - Loading a pretrained `Flax` pipeline returns both the pipeline and the model weights (or parameters).
  - You will use `bfloat16`, a more efficient half-float type that is supported by TPUs (you can also use `float32` for full precision if you want).

In [None]:
import jax.numpy as jnp
from diffusers import FlaxStableDiffusionPipeline

dtype = jnp.bfloat16
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    variant="bf16",
    dtype=dtype,
)

#### Inference
- TPUs usually have 8 devices working in parallel, so let’s use the same prompt for each device.
  - This means you can perform inference on 8 devices at once, with each device generating one image.
  - You’ll get 8 images in the same amount of time it takes for one chip to generate a single image.
- After replicating the prompt, get the tokenized text ids by calling the `prepare_inputs` function on the pipeline.
  - The length of the tokenized text is set to 77 tokens as required by the configuration of the underlying CLIP text model.

In [None]:
prompt = "A cinematic film still of Morgan Freeman starring as Jimi Hendrix, portrait, 40mm lens, shallow depth of field, close up, split lighting, cinematic"
prompt = [prompt] * jax.device_count()
prompt_ids = pipeline.prepare_inputs(prompt)
prompt_ids.shape

In [None]:
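from flax.jax_utils import replicate
from flax.training.common_utils import shard

# parameters: copy the weights to every device
p_params = replicate(params)

# arrays: split the inputs across devices
prompt_ids = shard(prompt_ids)
prompt_ids.shape
# (8, 1, 77) on an 8-device TPU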

- This shape means each one of the 8 devices receives as an input a jnp array with shape (1, 77), where 1 is the batch size per device.
  - On TPUs with sufficient memory, you could have a batch size larger than 1 if you want to generate multiple images (per chip) at once.
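
- The reshaping performed by `shard` can be mimicked without JAX to see where the `(8, 1, 77)` shape comes from. The sketch below is a pure-Python analogue on nested lists. `shard_rows` is a hypothetical helper, and the placeholder token ids are all zeros.

```python
def shard_rows(rows, num_devices):
    """Reshape (N, L) into (num_devices, N // num_devices, L) using nested lists."""
    if len(rows) % num_devices != 0:
        raise ValueError("batch size must be divisible by the number of devices")
    per_device = len(rows) // num_devices
    return [rows[d * per_device:(d + 1) * per_device] for d in range(num_devices)]

placeholder_ids = [[0] * 77 for _ in range(8)]   # 8 prompts, 77 tokens each
sharded = shard_rows(placeholder_ids, num_devices=8)
shape = (len(sharded), len(sharded[0]), len(sharded[0][0]))
```

- With 16 prompts on 8 devices, the same helper would give each device a batch of 2.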

- Create a random number generator to pass to the generation function.
  - This is standard procedure in `Flax`, which is very serious and opinionated about random numbers.
  - All functions that deal with random numbers are expected to receive a generator to ensure reproducibility, even when you’re training across multiple distributed devices.
- The helper function below uses a seed to initialize a random number generator.
  - As long as you use the same seed, you’ll get the exact same results.
  - Feel free to use different seeds when exploring results later in the guide.
  - The key returned by the helper function, `rng`, is split 8 times so each device receives a different generator and generates a different image.

In [None]:
def create_key(seed=0):
    return jax.random.PRNGKey(seed)

rng = create_key(0)
rng = jax.random.split(rng, jax.device_count())

- To take advantage of JAX’s optimized speed on a TPU, pass `jit=True` to the pipeline to compile the JAX code into an efficient representation and to ensure the model runs in parallel across the 8 devices.

In [None]:
%%time
images = pipeline(prompt_ids, p_params, rng, jit=True)[0]

# CPU times: user 56.2 s, sys: 42.5 s, total: 1min 38s
# Wall time: 1min 29s

images = images.reshape((images.shape[0] * images.shape[1],) + images.shape[-3:])
images = pipeline.numpy_to_pil(images)
make_image_grid(images, rows=2, cols=4)

#### Using different prompts

In [None]:
prompts = [
    "Labrador in the style of Hokusai",
    "Painting of a squirrel skating in New York",
    "HAL-9000 in the style of Van Gogh",
    "Times Square under water, with fish and a dolphin swimming around",
    "Ancient Roman fresco showing a man working on his laptop",
    "Close-up photograph of young black woman against urban background, high quality, bokeh",
    "Armchair in the shape of an avocado",
    "Clown astronaut in space, with Earth in the background",
]

prompt_ids = pipeline.prepare_inputs(prompts)
prompt_ids = shard(prompt_ids)

images = pipeline(prompt_ids, p_params, rng, jit=True).images
images = images.reshape((images.shape[0] * images.shape[1],) + images.shape[-3:])
images = pipeline.numpy_to_pil(images)

make_image_grid(images, 2, 4)

#### How does parallelization work?
- The `Flax` pipeline in Diffusers automatically compiles the model and runs it in parallel on all available devices.
  - `JAX` parallelization can be done in multiple ways.
  - The easiest one revolves around using the `jax.pmap` function to achieve single-program multiple-data (SPMD) parallelization.
  - It means running several copies of the same code, each on different data inputs.
  - More sophisticated approaches are possible, and you can go over to the `JAX` documentation to explore this topic in more detail if you are interested.

- `jax.pmap` does two things:
  - Compiles (or "jits") the code, similar to `jax.jit()`.
    - Compilation does not happen when you call `pmap`; it happens only the first time the pmapped function is called.
  - Ensures the compiled code runs in parallel on all available devices.

- To demonstrate, call pmap on the pipeline’s `_generate` method (this is a private method that generates images and may be renamed or removed in future releases of Diffusers):

In [None]:
# `pmap` lives in the `jax` namespace
p_generate = jax.pmap(pipeline._generate)

- After calling pmap, the prepared function `p_generate` will:
  - Make a copy of the underlying function, `pipeline._generate`, on each device.
  - Send each device a different portion of the input arguments (this is why it’s necessary to call the shard function).
    - In this case, `prompt_ids` has shape `(8, 1, 77)` so the array is split into 8 and each copy of `_generate` receives an input with shape `(1, 77)`.
  - The most important thing to pay attention to here is the batch size (1 in this example), and the input dimensions that make sense for your code.
    - You don’t have to change anything else to make the code work in parallel.

- The first time you call the pipeline takes more time, but the calls afterward are much faster.
  - The `block_until_ready` function is used to correctly measure inference time because `JAX` uses asynchronous dispatch and returns control to the Python loop as soon as it can.
  - You don’t need to use that in your code; blocking occurs automatically when you want to use the result of a computation that has not yet been materialized.

In [None]:
%%time
images = p_generate(prompt_ids, p_params, rng)
images = images.block_until_ready()

# CPU times: user 1min 15s, sys: 18.2 s, total: 1min 34s
# Wall time: 1min 15s

print(images.shape)
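
- The two behaviours attributed to `jax.pmap` above (compilation deferred to the first call, then one copy of the function per device slice) can be imitated with a small pure-Python stand-in. `ToyPmap` is a hypothetical teaching aid: it runs sequentially and only counts a pretend compilation, so it shows the calling pattern rather than real parallelism.

```python
class ToyPmap:
    """Conceptual stand-in for jax.pmap: "compiles" lazily on the first call
    and applies `fn` to each slice along the leading (device) axis."""

    def __init__(self, fn):
        self.fn = fn
        self.compile_count = 0
        self._compiled = False

    def __call__(self, *sharded_args):
        if not self._compiled:
            # Compilation happens here, not when the wrapper is created.
            self.compile_count += 1
            self._compiled = True
        num_devices = len(sharded_args[0])
        # Each "device" receives its own slice of every argument.
        return [self.fn(*(arg[d] for arg in sharded_args)) for d in range(num_devices)]

p_double = ToyPmap(lambda batch: [2 * x for x in batch])
compiles_before = p_double.compile_count
first = p_double([[1], [2], [3], [4]])   # first call triggers "compilation"
second = p_double([[5], [6], [7], [8]])  # later calls reuse it
```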

----
### **Optimized Model Formats | ONNX**
- `Optimum` provides a Stable Diffusion pipeline compatible with ONNX Runtime.

#### Stable Diffusion
- To load and run inference, use the `ORTStableDiffusionPipeline`.
  - If you want to load a PyTorch model and convert it to the ONNX format on-the-fly, set `export=True`:

In [None]:
model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
pipeline = ORTStableDiffusionPipeline.from_pretrained(model_id, export=True)
prompt = "sailing ship in storm by Leonardo da Vinci"
image = pipeline(prompt).images[0]
pipeline.save_pretrained("./onnx-stable-diffusion-v1-5")

- To export the pipeline in the `ONNX` format offline and use it later for inference, use the `optimum-cli` export command:
```
optimum-cli export onnx --model stable-diffusion-v1-5/stable-diffusion-v1-5 sd_v15_onnx/
```

- Then to perform inference (you don’t have to specify `export=True` again):

In [None]:
model_id = "sd_v15_onnx"
pipeline = ORTStableDiffusionPipeline.from_pretrained(model_id)
prompt = "sailing ship in storm by Leonardo da Vinci"
image = pipeline(prompt).images[0]

#### Stable Diffusion XL
- To load and run inference with `SDXL`, use the `ORTStableDiffusionXLPipeline`.

- To export the pipeline in the ONNX format and use it later for inference, use the optimum-cli export command:
```
optimum-cli export onnx --model stabilityai/stable-diffusion-xl-base-1.0 --task stable-diffusion-xl sd_xl_onnx/
```

In [None]:
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipeline = ORTStableDiffusionXLPipeline.from_pretrained(model_id)
prompt = "sailing ship in storm by Leonardo da Vinci"
image = pipeline(prompt).images[0]

----
### **Optimized Model Formats | OpenVINO**
- Optimum provides Stable Diffusion pipelines compatible with `OpenVINO` to perform inference on a variety of Intel processors.

#### Stable Diffusion
- To load and run inference, use the `OVStableDiffusionPipeline`.
  - If you want to load a PyTorch model and convert it to the `OpenVINO` format on-the-fly, set `export=True`:


In [None]:
model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
pipeline = OVStableDiffusionPipeline.from_pretrained(model_id, export=True)
prompt = "sailing ship in storm by Rembrandt"
image = pipeline(prompt).images[0]

# Don't forget to save the exported model
pipeline.save_pretrained("openvino-sd-v1-5")

- To further speed up inference, statically reshape the model.
  - If you change any parameters such as the output height or width, you’ll need to statically reshape your model again.

In [None]:
# Define the shapes related to the inputs and desired outputs
batch_size, num_images, height, width = 1, 1, 512, 512

# Statically reshape the model
pipeline.reshape(batch_size, height, width, num_images)
# Compile the model before inference
pipeline.compile()

image = pipeline(
    prompt,
    height=height,
    width=width,
    num_images_per_prompt=num_images,
).images[0]
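
- One way to see why a change in height or width forces another static reshape is to compute the latent tensor shape the denoiser would receive. The sketch assumes Stable Diffusion v1.5 defaults (4 latent channels and a VAE that downsamples by a factor of 8). `latent_shape` is a hypothetical helper, and how Optimum fixes the input shapes of each submodel is not shown.

```python
def latent_shape(batch_size, num_images, height, width,
                 latent_channels=4, vae_scale_factor=8):
    """Shape of the latent tensor for the given output settings."""
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError("height and width must be multiples of the VAE scale factor")
    return (
        batch_size * num_images,
        latent_channels,
        height // vae_scale_factor,
        width // vae_scale_factor,
    )

shape_512 = latent_shape(1, 1, 512, 512)
shape_768 = latent_shape(1, 1, 768, 512)
```

- A model compiled for the first shape cannot accept the second, so the pipeline has to be reshaped and compiled again.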

#### Stable Diffusion XL
- To load and run inference with `SDXL`, use the `OVStableDiffusionXLPipeline`:

In [None]:
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipeline = OVStableDiffusionXLPipeline.from_pretrained(model_id)
prompt = "sailing ship in storm by Rembrandt"
image = pipeline(prompt).images[0]

----
### **Optimized Model Formats | CoreML**
- `Core ML` is the model format and machine learning library supported by Apple frameworks.
  - If you are interested in running Stable Diffusion models inside your macOS or iOS/iPadOS apps, this guide will show you how to convert existing PyTorch checkpoints into the Core ML format and use them for inference with Python or Swift.
- Core ML models can leverage all the compute engines available in Apple devices: the CPU, the GPU, and the Apple Neural Engine (or ANE, a tensor-optimized accelerator available in Apple Silicon Macs and modern iPhones/iPads).
  - Depending on the model and the device it’s running on, Core ML can mix and match compute engines too, so some portions of the model may run on the CPU while others run on GPU, for example.

- You can also run the diffusers Python codebase on Apple Silicon Macs using the mps accelerator built into PyTorch. This approach is explained in depth in the mps guide, but it is not compatible with native apps.

#### Core ML Inference in Python
- Install the following libraries to run Core ML inference in Python:
```
pip install huggingface_hub
pip install git+https://github.com/apple/ml-stable-diffusion
```

- To run inference in Python, use one of the versions stored in the packages folders because the compiled ones are only compatible with Swift.
  - You may choose whether you want to use original or split_einsum attention.
  - This is how you’d download the original attention variant from the Hub to a directory called models:

In [None]:
repo_id = "apple/coreml-stable-diffusion-v1-4"
variant = "original/packages"

model_path = Path("./models") / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_"))
snapshot_download(repo_id, allow_patterns=f"{variant}/*", local_dir=model_path, local_dir_use_symlinks=False)
print(f"Model downloaded at {model_path}")

- Once you have downloaded a snapshot of the model, you can test it using Apple’s Python script.
```
python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i ./models/coreml-stable-diffusion-v1-4_original_packages/original/packages -o </path/to/output/image> --compute-unit CPU_AND_GPU --seed 93
```

- Pass the path of the downloaded checkpoint to the script with the `-i` flag. `--compute-unit` indicates the hardware you want to allow for inference.
  - It must be one of the following options: ALL, CPU_AND_GPU, CPU_ONLY, CPU_AND_NE.
  - You may also provide an optional output path, and a seed for reproducibility.

- The inference script assumes you’re using the original version of the Stable Diffusion model, `CompVis/stable-diffusion-v1-4`.
  - If you use another model, you have to specify its Hub id in the inference command line, using the `--model-version` option.
  - This works for models already supported and custom models you trained or fine-tuned yourself.

- If you want to use `stable-diffusion-v1-5/stable-diffusion-v1-5`:
```
python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" --compute-unit ALL -o output --seed 93 -i models/coreml-stable-diffusion-v1-5_original_packages --model-version stable-diffusion-v1-5/stable-diffusion-v1-5
```

#### Core ML inference in Swift
- Running inference in Swift is slightly faster than in Python because the models are already compiled in the mlmodelc format.
  - This is noticeable on app startup when the model is loaded but shouldn’t be noticeable if you run several generations afterward.
- To run inference in Swift on your Mac, you need one of the compiled checkpoint versions.
  - We recommend you download them locally using Python code similar to the previous example, but with one of the compiled variants:

In [None]:
repo_id = "apple/coreml-stable-diffusion-v1-4"
variant = "original/compiled"

model_path = Path("./models") / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_"))
snapshot_download(repo_id, allow_patterns=f"{variant}/*", local_dir=model_path, local_dir_use_symlinks=False)
print(f"Model downloaded at {model_path}")

- To run inference, please clone Apple’s repo:
```
git clone https://github.com/apple/ml-stable-diffusion
cd ml-stable-diffusion
```

- And then use Apple’s command line tool, Swift Package Manager:
```
swift run StableDiffusionSample --resource-path models/coreml-stable-diffusion-v1-4_original_compiled --compute-units all "a photo of an astronaut riding a horse on mars"
```

- You have to specify in `--resource-path` one of the checkpoints downloaded in the previous step, so please make sure it contains compiled Core ML bundles with the extension `.mlmodelc`.
  - The `--compute-units` has to be one of these values: all, cpuOnly, cpuAndGPU, cpuAndNeuralEngine.
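
- The Python script and the Swift tool spell their compute-unit options differently. The lookup below pairs the values listed in this section. The pairing is inferred from the option names and `swift_compute_unit` is a hypothetical helper, so treat it as a convenience sketch rather than an official mapping.

```python
# Python `--compute-unit` value -> Swift `--compute-units` value
PYTHON_TO_SWIFT_COMPUTE_UNITS = {
    "ALL": "all",
    "CPU_ONLY": "cpuOnly",
    "CPU_AND_GPU": "cpuAndGPU",
    "CPU_AND_NE": "cpuAndNeuralEngine",
}

def swift_compute_unit(python_value):
    """Translate a Python CLI compute-unit value to its Swift spelling."""
    try:
        return PYTHON_TO_SWIFT_COMPUTE_UNITS[python_value]
    except KeyError:
        allowed = ", ".join(sorted(PYTHON_TO_SWIFT_COMPUTE_UNITS))
        raise ValueError(f"unknown compute unit {python_value!r}; expected one of: {allowed}")

swift_value = swift_compute_unit("CPU_AND_GPU")
```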