## **8. Quantization**

> Original Source: https://huggingface.co/docs/diffusers/main/quantization/overview

```
> bitsandbytes
> gguf
> torchao
> quanto
```

- **Quantization** focuses on representing data with fewer bits while also trying to preserve the precision of the original data.
  - This often means converting a data type to represent the same information with fewer bits.
  - If your model weights are stored as 32-bit floating points and they’re quantized to 16-bit floating points, this halves the model size which makes it easier to store and reduces memory usage.
  - Lower precision can also speed up inference because calculations with fewer bits take less time.
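
The size arithmetic above can be sketched in a few lines. The parameter count (12 billion) is an assumed, illustrative figure rather than a measurement of any particular checkpoint, and the estimate ignores the small overhead of quantization scales.

```python
def weight_size_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Hypothetical parameter count, chosen only for illustration.
NUM_PARAMS = 12_000_000_000

sizes = {
    name: weight_size_gib(NUM_PARAMS, bits)
    for name, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]
}

for name, size in sizes.items():
    print(f"{name:>8}: {size:6.2f} GiB")
```

Each halving of the bit width halves the weight storage, which is why 4-bit weights need roughly an eighth of the space of `float32` weights.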

In [6]:
import torch
from accelerate import init_empty_weights

from diffusers import (
    AutoModel,
    BitsAndBytesConfig,
    DiffusionPipeline,
    FluxPipeline,
    FluxTransformer2DModel,
    GGUFQuantizationConfig,
    QuantoConfig,
    SD3Transformer2DModel,
    TorchAoConfig,
)
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers.quantizers import PipelineQuantizationConfig

from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from transformers import T5EncoderModel

-----
### **Pipeline-level quantization**
- There are two ways you can use `PipelineQuantizationConfig` depending on the level of control you want over the quantization specifications of each model in the pipeline.
  - For basic use cases, you only need to define `quant_backend`, `quant_kwargs`, and `components_to_quantize`.
  - For more granular control, provide a `quant_mapping` that gives the quantization specifications for the individual model components.

- **Simple quantization**
  - Initialize `PipelineQuantizationConfig` with the following parameters.
    - `quant_backend` specifies which quantization backend to use. Currently supported backends include: `bitsandbytes_4bit`, `bitsandbytes_8bit`, `gguf`, `quanto`, and `torchao`.
    - `quant_kwargs` contains the specific quantization arguments to use.
    - `components_to_quantize` specifies which components of the pipeline to quantize.
      - Typically, you should quantize the most compute intensive components like the transformer.
      - The text encoder is another component to consider quantizing if a pipeline has more than one, as `FluxPipeline` does.
      - The example below quantizes the transformer and the T5 text encoder (`text_encoder_2`) in `FluxPipeline` while keeping the CLIP model intact.

In [None]:
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)

- Pass the `pipeline_quant_config` to `from_pretrained()` to quantize the pipeline.

In [None]:
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("photo of a cute dog").images[0]

#### quant_mapping
- The `quant_mapping` argument provides more flexible options for how to quantize each individual component in a pipeline, like combining different quantization backends.
  - Initialize `PipelineQuantizationConfig` and pass a `quant_mapping` to it.
  - The `quant_mapping` allows you to specify the quantization options for each component in the pipeline such as the transformer and text encoder.

The example below uses two quantization backends, `QuantoConfig` and `transformers.BitsAndBytesConfig`, for the transformer and text encoder.

In [None]:
pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={
        "transformer": QuantoConfig(weights_dtype="int8"),
        "text_encoder_2": TransformersBitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
        ),
    }
)

- There is a separate bitsandbytes backend in Transformers.
  - You need to import and use `transformers.BitsAndBytesConfig` for components that come from `Transformers`.
  - `text_encoder_2` in `FluxPipeline` is a `T5EncoderModel` from Transformers, so you need to use `transformers.BitsAndBytesConfig` instead of `diffusers.BitsAndBytesConfig`.

- Pass the `pipeline_quant_config` to `from_pretrained()` to quantize the pipeline.

In [None]:
pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={
        "transformer": DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
        "text_encoder_2": TransformersBitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
        ),
    }
)

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("photo of a cute dog").images[0]

---------
### **bitsandbytes**
- **bitsandbytes** is the easiest option for quantizing a model to 8-bit and 4-bit.
  - 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16.
  - This reduces the degradative effect outlier values have on a model’s performance.
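
The mixed-precision idea described above can be illustrated with a minimal sketch. It uses plain Python floats in place of `fp16`, handles a single hidden-state vector, and is not the bitsandbytes implementation; the function names and the sample values are hypothetical.

```python
def absmax_quantize(values):
    """Quantize a vector to int8 range with absmax scaling; return (ints, scale)."""
    scale = (max(abs(v) for v in values) / 127) or 1.0
    return [round(v / scale) for v in values], scale


def int8_matvec_with_outliers(x, W, threshold=6.0):
    """Compute x @ W, keeping outlier features of x in full precision.

    x: list of K hidden-state values; W: K x N weight matrix (list of rows).
    """
    K, N = len(W), len(W[0])
    outlier_idx = [k for k in range(K) if abs(x[k]) >= threshold]
    regular_idx = [k for k in range(K) if abs(x[k]) < threshold]

    # High-precision path: outlier features times the matching weight rows.
    out = [sum(x[k] * W[k][n] for k in outlier_idx) for n in range(N)]

    if regular_idx:
        # int8 path: quantize the remaining features and each weight column.
        xq, x_scale = absmax_quantize([x[k] for k in regular_idx])
        for n in range(N):
            wq, w_scale = absmax_quantize([W[k][n] for k in regular_idx])
            acc = sum(a * b for a, b in zip(xq, wq))  # integer accumulation
            out[n] += acc * x_scale * w_scale  # dequantize and add
    return out


# Hypothetical values: the third feature (40.0) is an outlier.
x = [0.5, -1.2, 40.0, 0.8]
W = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6], [0.7, -0.8]]

exact = [sum(x[k] * W[k][n] for k in range(4)) for n in range(2)]
mixed = int8_matvec_with_outliers(x, W, threshold=6.0)
naive = int8_matvec_with_outliers(x, W, threshold=float("inf"))  # everything in int8

mixed_error = max(abs(m - e) for m, e in zip(mixed, exact))
naive_error = max(abs(n - e) for n, e in zip(naive, exact))
print(f"mixed error: {mixed_error:.4f}, all-int8 error: {naive_error:.4f}")
```

With the outlier routed around the int8 path, the remaining values share a much smaller quantization scale, so the rounding error in this example is lower than when everything is quantized together.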

- 4-bit quantization compresses a model even further, and it is commonly used with `QLoRA` to finetune quantized LLMs.
- Quantize a model by passing a `BitsAndBytesConfig` to `from_pretrained()`.
  - This works for any model in any modality, as long as it supports loading with `Accelerate` and contains `torch.nn.Linear` layers.
 
- **8-bit**
  - Quantizing a model in 8-bit halves the memory usage:
    - bitsandbytes is supported in both `Transformers` and `Diffusers`, so you can quantize both the `FluxTransformer2DModel` and `T5EncoderModel`.
    - For Ada and higher-series GPUs, we recommend changing `torch_dtype` to `torch.bfloat16`.

In [None]:
quant_config = TransformersBitsAndBytesConfig(load_in_8bit=True,)

text_encoder_2_8bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True,)

transformer_8bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

- **4-bit**
  - Quantizing a model in 4-bit reduces your memory usage by 4x:

In [None]:
quant_config = TransformersBitsAndBytesConfig(load_in_4bit=True,)

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True,)

transformer_4bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

- By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`.
  - You can change the data type of these modules with the `torch_dtype` parameter.

In [None]:
transformer_4bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float32,
)

- Generate an image using our quantized models.
  - Setting `device_map="auto"` automatically fills all available space on the GPU(s) first, then the CPU, and finally, the hard drive (the absolute slowest option) if there is still not enough memory.

In [None]:
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer_4bit,
    text_encoder_2=text_encoder_2_4bit,
    torch_dtype=torch.float16,
    device_map="auto",
)

pipe_kwargs = {
    "prompt": "A cat holding a sign that says hello world",
    "height": 1024,
    "width": 1024,
    "guidance_scale": 3.5,
    "num_inference_steps": 50,
    "max_sequence_length": 512,
}

image = pipe(**pipe_kwargs, generator=torch.manual_seed(0)).images[0]

- When there is enough memory, you can also directly move the pipeline to the GPU with `.to("cuda")` and apply `enable_model_cpu_offload()` to optimize GPU memory usage.
  - Once a model is quantized, you can push the model to the Hub with the `push_to_hub()` method.
  - The quantization `config.json` file is pushed first, followed by the quantized model weights.
  - You can also save the serialized 4-bit models locally with `save_pretrained()`.

- Training with 8-bit and 4-bit weights is only supported for training extra parameters.
- Check your memory footprint with the `get_memory_footprint()` method:

In [None]:
print(transformer_4bit.get_memory_footprint())

- This only reports the memory footprint of the model parameters; it does not estimate the inference memory requirements.
- Quantized models can be loaded with `from_pretrained()` without specifying the `quantization_config` parameter:

In [None]:
# No quantization_config is passed here: the checkpoint was saved already quantized.

model_4bit = AutoModel.from_pretrained(
    "hf-internal-testing/flux.1-dev-nf4-pkg", subfolder="transformer"
)

#### 8-bit (LLM.int8() algorithm)
- **Outlier threshold**
  - An “outlier” is a hidden state value greater than a certain threshold, and these values are computed in `fp16`.
  - While the values are usually normally distributed (`[-3.5, 3.5]`), this distribution can be very different for large models (`[-60, 6]` or `[6, 60]`).
    - 8-bit quantization works well for values `~5`, but beyond that, there is a significant performance penalty.
    - A good default threshold value is `6`, but a lower threshold may be needed for more unstable models (small models or finetuning).

- To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in `BitsAndBytesConfig`:


In [None]:
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True, llm_int8_threshold=10,
)

model_8bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
)

- **Skip module conversion**
  - For some models, you don’t need to quantize every module to 8-bit, and doing so can actually cause instability.
  - For example, for diffusion models like `Stable Diffusion 3`, the `proj_out` module can be skipped using the `llm_int8_skip_modules` parameter in `BitsAndBytesConfig`:

In [None]:
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True, llm_int8_skip_modules=["proj_out"],
)

model_8bit = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=quantization_config,
)

#### 4-bit (QLoRA algorithm)
- This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization.

- **Compute data type**
  - To speed up computation, you can change the data type from `float32` (the default value) to `bf16` using the `bnb_4bit_compute_dtype` parameter in `BitsAndBytesConfig`:

In [None]:
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

- **Normal Float 4 (NF4)**
  - NF4 is a 4-bit data type from the QLoRA paper, adapted for weights initialized from a normal distribution.
    - You should use NF4 for training 4-bit base models.
    - This can be configured with the `bnb_4bit_quant_type` parameter in the `BitsAndBytesConfig`:
  - For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance.
    - However, to remain consistent with the model weights, use the same data type for `bnb_4bit_compute_dtype` and `torch_dtype`.

In [None]:
quant_config = TransformersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

transformer_4bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
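
To give a feel for why a data type "adapted for weights initialized from a normal distribution" helps, here is a simplified sketch that builds 16 levels from evenly spaced quantiles of a standard normal distribution and quantizes one block of weights against them. This is an approximation of the idea only: it is not the exact NF4 table (which, among other differences, represents zero exactly), and the helper names and sample weights are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """Build 2**bits levels from evenly spaced N(0, 1) quantiles, scaled to [-1, 1]."""
    n = 2**bits
    probs = [(i + 0.5) / n for i in range(n)]
    quantiles = [NormalDist().inv_cdf(p) for p in probs]
    top = max(abs(v) for v in quantiles)
    return [v / top for v in quantiles]


def quantize_block(block, levels):
    """Return (level indices, absmax scale) for one block of weights."""
    scale = max(abs(w) for w in block) or 1.0
    idx = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / scale))
        for w in block
    ]
    return idx, scale


def dequantize_block(idx, scale, levels):
    """Look up each index in the codebook and rescale."""
    return [levels[i] * scale for i in idx]


levels = normal_quantile_levels()
block = [0.02, -0.11, 0.35, -0.07, 0.0, 0.19, -0.42, 0.05]  # hypothetical weights
idx, scale = quantize_block(block, levels)
restored = dequantize_block(idx, scale, levels)
max_error = max(abs(a - b) for a, b in zip(block, restored))
```

The levels are packed more densely near zero, where most normally distributed weights fall, so small weights are represented more finely than with 16 evenly spaced levels.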

- **Nested quantization**
  - Nested quantization is a technique that can save additional memory at no additional performance cost.
  - This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter.

In [None]:
quant_config = TransformersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

transformer_4bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)
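
The savings from nested quantization can be estimated with simple arithmetic. The sketch below assumes the block sizes reported in the QLoRA paper (64 weights per first-level scale, 256 first-level scales per second-level scale); the defaults in a given bitsandbytes version may differ.

```python
def scale_overhead_bits(block_size, scale_bits):
    """Bits of scale storage amortized over each weight in a block."""
    return scale_bits / block_size


WEIGHT_BLOCK = 64   # weights sharing one scale (assumed)
SCALE_BLOCK = 256   # first-level scales sharing one second-level scale (assumed)

# Single quantization: one float32 scale per block of weights.
single = scale_overhead_bits(WEIGHT_BLOCK, 32)

# Nested quantization: 8-bit scales, plus one float32 scale per SCALE_BLOCK of them.
nested = scale_overhead_bits(WEIGHT_BLOCK, 8) + scale_overhead_bits(
    WEIGHT_BLOCK * SCALE_BLOCK, 32
)

saving = single - nested
print(f"single: {single:.3f}, nested: {nested:.3f}, saving: {saving:.3f} bits/param")
```

Under these assumptions the overhead drops from 0.5 to about 0.127 bits per parameter, a saving of roughly 0.37 bits, in line with the approximate 0.4 figure quoted above.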

#### Dequantizing bitsandbytes models
- Once quantized, you can dequantize a model to its original precision, but this might result in a small loss of quality.
  - Make sure you have enough GPU RAM to fit the dequantized model.

In [None]:
quant_config = TransformersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

text_encoder_2_4bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

transformer_4bit = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

text_encoder_2_4bit.dequantize()
transformer_4bit.dequantize()

------------
### **GGUF**
- The **GGUF** file format is typically used to store models for inference with GGML and supports a variety of block-wise quantization options.
  - Diffusers supports loading checkpoints prequantized and saved in the GGUF format via `from_single_file` loading with Model classes.
  - Loading GGUF checkpoints via Pipelines is currently not supported.

- Since GGUF is a single-file format, use `from_single_file()` to load the model and pass in the `GGUFQuantizationConfig`.
  - When using GGUF checkpoints, the quantized weights remain in a low-memory dtype (typically `torch.uint8`) and are dynamically dequantized and cast to the configured `compute_dtype` during each module’s forward pass through the model.
  - The `GGUFQuantizationConfig` allows you to set the `compute_dtype`.

- The functions used for dynamic dequantization are based on the great work done by `city96`, who created the PyTorch ports of the original NumPy implementation by compilade.

- Supported Quantization Types
  - `BF16`, `Q4_0`, `Q4_1`, `Q5_0`, `Q5_1`, `Q8_0`, `Q2_K`, `Q3_K`, `Q4_K`, `Q5_K`, `Q6_K`

In [None]:
ckpt_path = (
    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"
)
transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
prompt = "A cat holding a sign that says hello world"
image = pipe(prompt, generator=torch.manual_seed(0)).images[0]
image.save("flux-gguf.png")
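
Block-wise quantization with on-the-fly dequantization can be sketched in plain Python. The example assumes a `Q8_0`-style layout in which 32 int8 values share one float16 scale; it is an illustration only, not the Diffusers or GGML code path, which operates on packed tensors. The helper names and sample weights are hypothetical.

```python
import struct

BLOCK_SIZE = 32  # weights per block in the assumed Q8_0-style layout


def to_fp16(value):
    """Round a Python float to the nearest float16 value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def quantize_q8_0(weights):
    """Pack weights into (float16 scale, int8 values) blocks."""
    if len(weights) % BLOCK_SIZE:
        raise ValueError("weight count must be a multiple of BLOCK_SIZE")
    blocks = []
    for start in range(0, len(weights), BLOCK_SIZE):
        chunk = weights[start:start + BLOCK_SIZE]
        scale = to_fp16(max(abs(w) for w in chunk) / 127)
        quants = [
            max(-127, min(127, round(w / scale))) if scale else 0 for w in chunk
        ]
        blocks.append((scale, quants))
    return blocks


def dequantize_q8_0(blocks):
    """Expand blocks back to floats, as would happen just before a forward pass."""
    return [scale * q for scale, quants in blocks for q in quants]


# Hypothetical weights: 64 values in [-0.32, 0.31], i.e. two blocks.
weights = [((i * 37) % 64 - 32) / 100 for i in range(64)]
blocks = quantize_q8_0(weights)
restored = dequantize_q8_0(blocks)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Only the compact blocks need to stay resident; the float values are recomputed each time they are needed, trading a little compute for lower memory use.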

-----
### **torchao**
- `TorchAO` is an architecture optimization library for PyTorch.
  - It provides high-performance dtypes, optimization techniques, and kernels for inference and training, featuring composability with native PyTorch features like `torch.compile`, `FullyShardedDataParallel` (FSDP), and more.

- Quantize a model by passing `TorchAoConfig` to `from_pretrained()` (you can also load pre-quantized models).
  - This works for any model in any modality, as long as it supports loading with `Accelerate` and contains `torch.nn.Linear` layers.

- The example below only quantizes the weights to `int8`.

In [None]:
model_id = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16

quantization_config = TorchAoConfig("int8wo")
transformer = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
)
pipe = FluxPipeline.from_pretrained(
    model_id,
    transformer=transformer,
    torch_dtype=dtype,
)
pipe.to("cuda")

# Without quantization: ~31.447 GB
# With quantization: ~20.40 GB
print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
).images[0]
image.save("output.png")

- `TorchAO` is fully compatible with `torch.compile`, setting it apart from other quantization methods.
  - This makes it easy to speed up inference with just one line of code.
 
- `torchao` also supports an automatic quantization API through autoquant.
  - Autoquantization determines the best quantization strategy applicable to a model by comparing the performance of each technique on chosen input types and shapes.
  - Currently, this can be used directly on the underlying modeling components.
  - Diffusers will also expose an autoquant configuration option in the future.

- The `TorchAoConfig` class accepts three parameters:
  - `quant_type`: A string value mentioning one of the quantization types below.
  - `modules_to_not_convert`: A list of full or partial module names for which quantization should not be performed.
    - For example, to skip quantization of the `FluxTransformer2DModel`’s first block, specify `modules_to_not_convert=["single_transformer_blocks.0"]`.
  - `kwargs`: A dict of keyword arguments to pass to the underlying quantization method which will be invoked based on `quant_type`.

In [None]:
# In the above code, add the following after initializing the transformer
transformer = torch.compile(transformer, mode="max-autotune", fullgraph=True)

#### Supported quantization types
- `torchao` supports weight-only quantization and weight and dynamic-activation quantization for `int8`, `float3-float8`, and `uint1-uint7`.
  - Weight-only quantization stores the model weights in a specific low-bit data type but performs computation with a higher-precision data type, like `bfloat16`.
  - This lowers the memory requirements from model weights but retains the memory peaks for activation computation.

- Dynamic activation quantization stores the model weights in a low-bit dtype, while also quantizing the activations on-the-fly to save additional memory.
  - This lowers the memory requirements from model weights, while also lowering the memory overhead from activation computations.
  - This may come at a quality tradeoff at times, so it is recommended to test different models thoroughly.

- Some quantization methods are aliases (for example, `int8wo` is the commonly used shorthand for `int8_weight_only`).
  - This allows using the quantization methods described in the torchao docs as-is, while also making it convenient to remember their shorthand notations.

- Refer to the official torchao documentation for a better understanding of the available quantization methods and the exhaustive list of configuration options available.
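
The difference between the two modes can be shown with a minimal sketch for a single output neuron. It uses plain Python in place of torchao kernels, and the function names and sample values are hypothetical.

```python
def quantize_int8(values):
    """Absmax-quantize a vector to the int8 range; return (ints, scale)."""
    scale = (max(abs(v) for v in values) / 127) or 1.0
    return [round(v / scale) for v in values], scale


def linear_weight_only(x, w_q, w_scale):
    """Weight-only: dequantize the stored int8 weights, then compute in floating point."""
    w = [q * w_scale for q in w_q]
    return sum(a * b for a, b in zip(x, w))


def linear_dynamic_activation(x, w_q, w_scale):
    """Dynamic activation: also quantize x at call time and accumulate in integers."""
    x_q, x_scale = quantize_int8(x)
    acc = sum(a * b for a, b in zip(x_q, w_q))
    return acc * x_scale * w_scale


weights = [0.12, -0.40, 0.33, 0.05]     # hypothetical weights
x = [1.5, -0.7, 2.1, 0.3]               # hypothetical activations
w_q, w_scale = quantize_int8(weights)   # done once, at load time

reference = sum(a * b for a, b in zip(x, weights))
y_wo = linear_weight_only(x, w_q, w_scale)
y_dyn = linear_dynamic_activation(x, w_q, w_scale)
print(f"reference={reference:.4f} weight-only={y_wo:.4f} dynamic={y_dyn:.4f}")
```

In the weight-only path the activations stay in floating point, so only weight storage shrinks; in the dynamic path the activations are also rounded, which adds a second source of error and is why testing output quality is recommended.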


#### Serializing and Deserializing quantized models
- To serialize a quantized model in a given dtype, first load the model with the desired quantization dtype and then save it using the `save_pretrained()` method.

In [None]:
quantization_config = TorchAoConfig("int8wo")
transformer = AutoModel.from_pretrained(
    "black-forest-labs/Flux.1-Dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
transformer.save_pretrained("/path/to/flux_int8wo", safe_serialization=False)

- To load a serialized quantized model, use the `from_pretrained()` method.

In [None]:
transformer = AutoModel.from_pretrained("/path/to/flux_int8wo", torch_dtype=torch.bfloat16, use_safetensors=False)
pipe = FluxPipeline.from_pretrained("black-forest-labs/Flux.1-Dev", transformer=transformer, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A cat holding a sign that says hello world"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.0).images[0]
image.save("output.png")

- If you are using `torch<=2.6.0`, some quantization methods, such as `uint4wo`, cannot be loaded directly and may result in an `UnpicklingError` when trying to load the models, but work as expected when saving them.
  - In order to work around this, one can load the state dict manually into the model.
  - This requires using `weights_only=False` in `torch.load`, so it should be run only if the weights were obtained from a trusted source.

In [None]:
# Serialize the model
transformer = AutoModel.from_pretrained(
    "black-forest-labs/Flux.1-Dev",
    subfolder="transformer",
    quantization_config=TorchAoConfig("uint4wo"),
    torch_dtype=torch.bfloat16,
)
transformer.save_pretrained("/path/to/flux_uint4wo", safe_serialization=False, max_shard_size="50GB")
# ...

# Load the model
state_dict = torch.load("/path/to/flux_uint4wo/diffusion_pytorch_model.bin", weights_only=False, map_location="cpu")
with init_empty_weights():
    transformer = AutoModel.from_config("/path/to/flux_uint4wo/config.json")
transformer.load_state_dict(state_dict, strict=True, assign=True)

-----
### **Quanto**
- **Quanto** is a PyTorch quantization backend for Optimum. It has been designed with versatility and simplicity in mind:
  - All features are available in eager mode (works with non-traceable models)
  - Supports quantization aware training
  - Quantized models are compatible with torch.compile
  - Quantized models are device agnostic (e.g., CUDA, XPU, MPS, CPU)

- Quantize a model by passing the `QuantoConfig` object to the `from_pretrained()` method.
  - Although the `Quanto` library does allow quantizing `nn.Conv2d` and `nn.LayerNorm` modules, Diffusers only supports quantizing the weights in the `nn.Linear` layers of a model.
  - The following snippet demonstrates how to apply `float8` quantization with `Quanto`.

In [None]:
model_id = "black-forest-labs/FLUX.1-dev"
quantization_config = QuantoConfig(weights_dtype="float8")
transformer = FluxTransformer2DModel.from_pretrained(
      model_id,
      subfolder="transformer",
      quantization_config=quantization_config,
      torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
).images[0]
image.save("output.png")

#### Skipping Quantization on specific modules
- It is possible to skip applying quantization on certain modules using the `modules_to_not_convert` argument in the `QuantoConfig`.
  - Ensure that the modules passed in to this argument match the keys of the modules in the `state_dict`.

In [None]:
model_id = "black-forest-labs/FLUX.1-dev"
quantization_config = QuantoConfig(weights_dtype="float8", modules_to_not_convert=["proj_out"])
transformer = FluxTransformer2DModel.from_pretrained(
      model_id,
      subfolder="transformer",
      quantization_config=quantization_config,
      torch_dtype=torch.bfloat16,
)
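
One way to check that a module name matches the `state_dict` keys before loading is a small pre-flight helper like the one below. The helper and the sample keys are hypothetical, and the prefix-matching rule is an assumption made for this illustration; it is not necessarily the rule Diffusers applies internally.

```python
def matching_keys(state_dict_keys, modules_to_not_convert):
    """Map each requested module name to the state_dict keys that belong to it."""
    matches = {}
    for name in modules_to_not_convert:
        prefix = name + "."
        matches[name] = [
            key
            for key in state_dict_keys
            if key == name or key.startswith(prefix) or ("." + prefix) in key
        ]
    return matches


# Hypothetical keys, shaped like a transformer's state_dict.
keys = [
    "x_embedder.weight",
    "transformer_blocks.0.attn.to_q.weight",
    "proj_out.weight",
    "proj_out.bias",
]

found = matching_keys(keys, ["proj_out", "proj_output"])
missing = [name for name, hits in found.items() if not hits]
print("no matching keys for:", missing)
```

With a loaded model, the keys could be taken from `transformer.state_dict().keys()`; any name reported as missing is likely a typo and would leave that module quantized.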

#### Using from_single_file with the Quanto Backend
- `QuantoConfig` is compatible with `from_single_file()`.

In [None]:
ckpt_path = "https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/flux1-dev.safetensors"
quantization_config = QuantoConfig(weights_dtype="float8")
transformer = FluxTransformer2DModel.from_single_file(ckpt_path, quantization_config=quantization_config, torch_dtype=torch.bfloat16)

#### Saving Quantized models
- Diffusers supports serializing Quanto models using the `save_pretrained()` method.
  - The serialization and loading requirements are different for models quantized directly with the `Quanto` library and models quantized with Diffusers using `Quanto` as the backend.
  - It is currently not possible to load models quantized directly with Quanto into Diffusers using `from_pretrained()`.

In [None]:
model_id = "black-forest-labs/FLUX.1-dev"
quantization_config = QuantoConfig(weights_dtype="float8")
transformer = FluxTransformer2DModel.from_pretrained(
      model_id,
      subfolder="transformer",
      quantization_config=quantization_config,
      torch_dtype=torch.bfloat16,
)
# save quantized model to reuse
transformer.save_pretrained("<your quantized model save path>")

# you can reload your quantized model with
model = FluxTransformer2DModel.from_pretrained("<your quantized model save path>")

#### Using torch.compile with Quanto
- Currently the `Quanto` backend supports `torch.compile` for the following quantization types:
  - `int8` weights

In [None]:
model_id = "black-forest-labs/FLUX.1-dev"
quantization_config = QuantoConfig(weights_dtype="int8")
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
transformer = torch.compile(transformer, mode="max-autotune", fullgraph=True)

pipe = FluxPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.to("cuda")
images = pipe("A cat holding a sign that says hello").images[0]
images.save("flux-quanto-compile.png")