## **7. Training**

> Original Source: https://huggingface.co/docs/diffusers/main/training/overview

```
> Create a Dataset for Training
> Adapt a Model to a New Task
> Models
  - Unconditional Image Generation
  - Text-to-Image
  - Stable Diffusion XL
  - Kandinsky 2.2, Wuerstchen, ControlNet, T2I-Adapter, InstructPix2Pix, etc.
> Methods
  - Textual Inversion
  - DreamBooth, LoRA, Custom Diffusion, Latent Consistency Distillation

```

------------------------
### **Install**
- Diffusers provides a collection of training scripts for you to train your own diffusion models.
- Make sure you can successfully run the latest versions of the example scripts by installing the library from source in a new virtual environment:

```
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

- Then navigate to the folder of the training script (for example, DreamBooth) and install the `requirements.txt` file.
  - Some training scripts have a specific requirement file for SDXL, LoRA or Flax.
  - If you’re using one of these scripts, make sure you install its corresponding requirements file.
```
cd examples/dreambooth
pip install -r requirements.txt
# to train SDXL with DreamBooth
pip install -r requirements_sdxl.txt
```

- To speed up training and reduce memory usage, we recommend:
  - using `PyTorch 2.0` or higher to automatically use scaled dot product attention during training (you don’t need to make any changes to the training code)
  - installing `xFormers` to enable memory-efficient attention

In [2]:
import numpy as np
import torch
import jax

from diffusers import DiffusionPipeline
from diffusers import StableDiffusionPipeline
from diffusers import FlaxStableDiffusionPipeline
from diffusers import AutoModel
import torch_xla.core.xla_model as xm

from flax.jax_utils import replicate
from flax.training.common_utils import shard

from datasets import load_dataset
from accelerate.utils import write_basic_config

-----------------
### **Create a Dataset for Training**
- There are many datasets on the Hub to train a model on, but if you can’t find one you’re interested in or want to use your own, you can create a dataset with the `Datasets` library.
  - The dataset structure depends on the task you want to train your model on.
  - The most basic dataset structure is **a directory of images** for tasks like unconditional image generation.
  - Another dataset structure may be **a directory of images and a text file containing their corresponding text captions** for tasks like text-to-image generation.
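
As a rough illustration of the second layout, the sketch below writes a `metadata.jsonl` caption file next to a few image files, which is the convention the `ImageFolder` builder in `Datasets` reads extra columns from. The folder name, file names, and captions are made up for the example, and the image files are empty placeholders rather than real images.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical captions; file names and text are illustrative only.
captions = {
    "0001.png": "a red bicycle leaning against a wall",
    "0002.png": "a bowl of ramen on a wooden table",
}

data_dir = Path(tempfile.mkdtemp()) / "train"
data_dir.mkdir(parents=True)

# Empty placeholder files stand in for real images in this sketch.
for file_name in captions:
    (data_dir / file_name).touch()

# One JSON object per line; `file_name` links each caption to its image.
with open(data_dir / "metadata.jsonl", "w", encoding="utf-8") as f:
    for file_name, text in captions.items():
        f.write(json.dumps({"file_name": file_name, "text": text}) + "\n")

# Read the file back to check the structure.
lines = (data_dir / "metadata.jsonl").read_text(encoding="utf-8").splitlines()
records = [json.loads(line) for line in lines]
print(records[0])
```

With real images in place, loading this folder with the `imagefolder` builder should expose the extra `text` field as a column next to `image`, which could then be named in `--caption_column`.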


#### Provide a dataset as a folder
- For unconditional generation, you can provide your own dataset as a folder of images.
  - The training script uses the `ImageFolder` builder from `Datasets` to automatically build a dataset from the folder.
  - Your directory structure should look like:
```
data_dir/xxx.png
data_dir/xxy.png
data_dir/[...]/xxz.png
```

- Pass the path to the dataset directory to the `--train_data_dir` argument, and then you can start training:
```
accelerate launch train_unconditional.py \
    --train_data_dir <path-to-train-directory> \
    <other-arguments>
```


#### Upload your data to the Hub
- Start by creating a dataset with the `ImageFolder` feature, which creates an image column containing the PIL-encoded images.
  - You can use the `data_dir` or `data_files` parameters to specify the location of the dataset.
  - The `data_files` parameter supports mapping specific files to dataset splits like train or test:

In [None]:
# example 1: local folder
dataset = load_dataset("imagefolder", data_dir="path_to_your_folder")

# example 2: local files (supported formats are tar, gzip, zip, xz, rar, zstd)
dataset = load_dataset("imagefolder", data_files="path_to_zip_file")

# example 3: remote files (supported formats are tar, gzip, zip, xz, rar, zstd)
dataset = load_dataset(
    "imagefolder",
    data_files="https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip",
)

# example 4: providing several splits
dataset = load_dataset(
    "imagefolder", data_files={"train": ["path/to/file1", "path/to/file2"], "test": ["path/to/file3", "path/to/file4"]}
)

- Then use the `push_to_hub` method to upload the dataset to the Hub.
  - The dataset is available for training by passing the dataset name to the `--dataset_name` argument:

In [None]:
# assuming you have run the huggingface-cli login command in a terminal
dataset.push_to_hub("name_of_your_dataset")

# if you want to push to a private repo, simply pass private=True:
dataset.push_to_hub("name_of_your_dataset", private=True)

In [None]:
accelerate launch --mixed_precision="fp16"  train_text_to_image.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
  --dataset_name="name_of_your_dataset" \
  <other-arguments>

-----
### **Adapt a model to a new task**
- Many diffusion systems share the same components, allowing you to adapt a pretrained model for one task to an entirely different task.
  - This section shows how to adapt a pretrained text-to-image model for inpainting by initializing and modifying the architecture of a pretrained `UNet2DConditionModel`.

- Configure `UNet2DConditionModel` parameters
  - A `UNet2DConditionModel` by default accepts 4 channels in the input sample.
    - Load a pretrained text-to-image model like `stable-diffusion-v1-5/stable-diffusion-v1-5` and take a look at the number of `in_channels`:
   
- Inpainting requires 9 channels in the input sample. You can check this value in a pretrained inpainting model like `runwayml/stable-diffusion-inpainting`:

In [None]:
pipeline = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True)
pipeline.unet.config["in_channels"]

pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-inpainting", use_safetensors=True)
pipeline.unet.config["in_channels"]

- To adapt your text-to-image model for inpainting, you’ll need to change the number of `in_channels` from 4 to 9.
- Initialize a `UNet2DConditionModel` with the pretrained text-to-image model weights, and change `in_channels` to 9.
  - Changing the number of `in_channels` means you need to set `ignore_mismatched_sizes=True` and `low_cpu_mem_usage=False` to avoid a size mismatch error because the shape is different now.
 
- The pretrained weights of the other components from the text-to-image model are initialized from their checkpoints, but the input channel weights (`conv_in.weight`) of the unet are randomly initialized.
  - It is important to finetune the model for inpainting because otherwise the model returns noise.

In [None]:
model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
unet = AutoModel.from_pretrained(
    model_id,
    subfolder="unet",
    in_channels=9,
    low_cpu_mem_usage=False,
    ignore_mismatched_sizes=True,
    use_safetensors=True,
)
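
To see why the size mismatch arises, it helps to compare the shape of the first convolution's weight before and after the change. The sketch below only does the arithmetic; the 320 output channels and 3x3 kernel are assumptions about the Stable Diffusion v1.5 UNet, and the split of the 9 channels into noisy latents, mask, and masked-image latents follows the usual inpainting layout rather than anything stated in this guide.

```python
from math import prod

# Assumed conv_in geometry for the SD v1.5 UNet (not stated in this guide).
OUT_CHANNELS, KERNEL = 320, 3

def conv_in_shape(in_channels):
    """Shape of a Conv2d weight: (out_channels, in_channels, kH, kW)."""
    return (OUT_CHANNELS, in_channels, KERNEL, KERNEL)

# Usual inpainting input layout: noisy latents + mask + masked-image latents.
inpaint_channels = {"noisy_latents": 4, "mask": 1, "masked_image_latents": 4}

pretrained = conv_in_shape(4)
adapted = conv_in_shape(sum(inpaint_channels.values()))

print("checkpoint conv_in.weight:", pretrained, prod(pretrained), "values")
print("adapted conv_in.weight:   ", adapted, prod(adapted), "values")
print("shapes match:", pretrained == adapted)
```

Because the two shapes differ, the checkpoint tensor cannot be copied into the new layer as-is, which is consistent with `conv_in.weight` being randomly initialized and needing finetuning.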

----
### **Model | Unconditional image generation**
- Unconditional image generation models are not conditioned on text or images during training.
  - They only generate images that resemble their training data distribution.

```
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

- Then navigate to the example folder containing the training script and install the required dependencies:
```
cd examples/unconditional_image_generation
pip install -r requirements.txt
```

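- Initialize an `Accelerate` environment with `accelerate config`.
  - To set up a default `Accelerate` environment without choosing any configurations, run `accelerate config default` instead:

```
# interactive setup
accelerate config

# non-interactive setup with default settings
accelerate config default
```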

- If your environment doesn’t support an interactive shell, like a notebook, you can use:

In [None]:
write_basic_config()

#### Script parameters
- The training script provides many parameters to help you customize your training run.
  - All of the parameters and their descriptions are found in the `parse_args()` function.
  - It provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you’d like.


- To speed up training with mixed precision using the bf16 format, add the `--mixed_precision` parameter to the training command:
```
accelerate launch train_unconditional.py \
--mixed_precision="bf16"
```
- Some basic and important parameters to specify include:
  - `--dataset_name`: the name of the dataset on the Hub or a local path to the dataset to train on
  - `--output_dir`: where to save the trained model
  - `--push_to_hub`: whether to push the trained model to the Hub
  - `--checkpointing_steps`: frequency of saving a checkpoint as the model trains; if training is interrupted, you can continue from the latest checkpoint by adding `--resume_from_checkpoint` to your training command

<br>

#### Training script
- The code for preprocessing the dataset and the training loop is found in the `main()` function.
  - If you need to adapt the training script, this is where you’ll need to make your changes.
- The `train_unconditional` script initializes a `UNet2DModel` if you don’t provide a model configuration.
  - You can configure the UNet here if you’d like:

In [None]:
model = UNet2DModel(
    sample_size=args.resolution,
    in_channels=3,
    out_channels=3,
    layers_per_block=2,
    block_out_channels=(128, 128, 256, 256, 512, 512),
    down_block_types=(
        "DownBlock2D",
        "DownBlock2D",
        "DownBlock2D",
        "DownBlock2D",
        "AttnDownBlock2D",
        "DownBlock2D",
    ),
    up_block_types=(
        "UpBlock2D",
        "AttnUpBlock2D",
        "UpBlock2D",
        "UpBlock2D",
        "UpBlock2D",
        "UpBlock2D",
    ),
)

- The script then initializes a scheduler and an optimizer:

In [None]:
# Initialize the scheduler
accepts_prediction_type = "prediction_type" in set(inspect.signature(DDPMScheduler.__init__).parameters.keys())
if accepts_prediction_type:
    noise_scheduler = DDPMScheduler(
        num_train_timesteps=args.ddpm_num_steps,
        beta_schedule=args.ddpm_beta_schedule,
        prediction_type=args.prediction_type,
    )
else:
    noise_scheduler = DDPMScheduler(num_train_timesteps=args.ddpm_num_steps, beta_schedule=args.ddpm_beta_schedule)

# Initialize the optimizer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=args.learning_rate,
    betas=(args.adam_beta1, args.adam_beta2),
    weight_decay=args.adam_weight_decay,
    eps=args.adam_epsilon,
)

- Then it loads a dataset and you can specify how to preprocess it:

In [None]:
dataset = load_dataset("imagefolder", data_dir=args.train_data_dir, cache_dir=args.cache_dir, split="train")

augmentations = transforms.Compose(
    [
        transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
        transforms.CenterCrop(args.resolution) if args.center_crop else transforms.RandomCrop(args.resolution),
        transforms.RandomHorizontalFlip() if args.random_flip else transforms.Lambda(lambda x: x),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ]
)
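
The last two transforms are worth a closer look: `ToTensor()` scales pixel values to `[0, 1]`, and `Normalize([0.5], [0.5])` then computes `(x - 0.5) / 0.5`, which moves them to `[-1, 1]`. A plain-Python sketch of that arithmetic, using a few hand-picked pixel values rather than a real image:

```python
def to_unit_range(pixel):
    """Mimic ToTensor(): map an 8-bit value in [0, 255] to [0.0, 1.0]."""
    return pixel / 255.0

def normalize(value, mean=0.5, std=0.5):
    """Mimic Normalize([0.5], [0.5]): (x - mean) / std."""
    return (value - mean) / std

pixels = [0, 64, 128, 255]  # illustrative 8-bit intensities
scaled = [normalize(to_unit_range(p)) for p in pixels]

for p, s in zip(pixels, scaled):
    print(f"{p:3d} -> {s:+.3f}")
```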

- The training loop handles everything else such as adding noise to the images, predicting the noise residual, calculating the loss, saving checkpoints at specified steps, and saving and pushing the model to the Hub.
  - If you want to learn more about how the training loop works, check out the *Understanding pipelines, models and schedulers* tutorial, which breaks down the basic pattern of the denoising process.
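
The core of one training step can be sketched without any deep learning library. The example below follows the standard DDPM forward process, `x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise`, on a four-value "image". The linear beta range, the toy input, and the stand-in "model" that always predicts zero are assumptions for illustration, not values taken from the training script.

```python
import math
import random

def alphas_cumprod(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    out, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        out.append(running)
    return out

def add_noise(x0, noise, alpha_bar):
    """Forward process: mix the clean sample with Gaussian noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * n for x, n in zip(x0, noise)]

def mse(pred, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

rng = random.Random(0)
schedule = alphas_cumprod(1000)

x0 = [-1.0, -0.5, 0.5, 1.0]                 # toy "image" in [-1, 1]
noise = [rng.gauss(0.0, 1.0) for _ in x0]   # the target the model must predict
t = rng.randrange(len(schedule))            # random timestep for this sample
x_t = add_noise(x0, noise, schedule[t])

predicted_noise = [0.0] * len(x0)           # stand-in for the UNet's output
loss = mse(predicted_noise, noise)
print(f"t={t}, alpha_bar={schedule[t]:.4f}, loss={loss:.4f}")
```

In the real script the prediction comes from the UNet given `x_t` and `t`, and the loss is backpropagated through it; everything else in the loop is bookkeeping around this step.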
 
#### Launch the script
- Single GPU
```
accelerate launch train_unconditional.py \
  --dataset_name="huggan/flowers-102-categories" \
  --output_dir="ddpm-ema-flowers-64" \
  --mixed_precision="fp16" \
  --push_to_hub
```

- Multi-GPU
  - Add the `--multi_gpu` parameter to the training command:
```
accelerate launch --multi_gpu train_unconditional.py \
  --dataset_name="huggan/flowers-102-categories" \
  --output_dir="ddpm-ema-flowers-64" \
  --mixed_precision="fp16" \
  --push_to_hub
```

- The training script creates and saves a checkpoint file in your repository.
  - Now you can load and use your trained model for inference:

In [None]:
pipeline = DiffusionPipeline.from_pretrained("anton-l/ddpm-butterflies-128").to("cuda")
image = pipeline().images[0]

----
### **Model | Text-to-Image**
- Text-to-image models like Stable Diffusion are conditioned to generate images given a text prompt.
  - Training a model can be taxing on your hardware, but if you enable `gradient_checkpointing` and `mixed_precision`, it is possible to train a model on a single 24GB GPU.
  - If you’re training with larger batch sizes or want to train faster, it’s better to use GPUs with more than 30GB of memory.
    - Reduce your memory footprint by enabling memory-efficient attention with xFormers.
    - JAX/Flax training is also supported for efficient training on TPUs and GPUs, but it doesn’t support gradient checkpointing, gradient accumulation or xFormers.
  - A GPU with at least 30GB of memory or a TPU v3 is recommended for training with Flax.

- Explore the `train_text_to_image.py` training script to help you become familiar with it, and how you can adapt it for your own use-case.
  - Before running the script, make sure you install the library from source:
```
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

- Navigate to the example folder containing the training script and install the required dependencies for the script you’re using:
```
# PyTorch
cd examples/text_to_image
pip install -r requirements.txt

# Flax
cd examples/text_to_image
pip install -r requirements_flax.txt
```

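- Initialize an `Accelerate` environment with `accelerate config`.
  - To set up a default `Accelerate` environment without choosing any configurations, run `accelerate config default` instead:

```
# interactive setup
accelerate config

# non-interactive setup with default settings
accelerate config default
```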

- If your environment doesn’t support an interactive shell, like a notebook, you can use:

In [None]:
write_basic_config()

#### Script parameters
- The training script provides many parameters to help you customize your training run. 
  - All of the parameters and their descriptions are found in the `parse_args()` function. 
  - This function provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you’d like.

- To speed up training with mixed precision using the fp16 format, add the `--mixed_precision` parameter to the training command:
```
accelerate launch train_text_to_image.py \
  --mixed_precision="fp16"
```

- Some basic and important parameters include:
  - `--pretrained_model_name_or_path`: the name of the model on the Hub or a local path to the pretrained model
  - `--dataset_name`: the name of the dataset on the Hub or a local path to the dataset to train on
  - `--image_column`: the name of the image column in the dataset to train on
  - `--caption_column`: the name of the text column in the dataset to train on
  - `--output_dir`: where to save the trained model
  - `--push_to_hub`: whether to push the trained model to the Hub
  - `--checkpointing_steps`: frequency of saving a checkpoint as the model trains; if training is interrupted for some reason, you can continue from the latest checkpoint by adding `--resume_from_checkpoint` to your training command

- **Min-SNR weighting**
    - The Min-SNR weighting strategy can help with training by rebalancing the loss to achieve faster convergence.
      - The training script supports predicting either epsilon (noise) or `v_prediction`, and Min-SNR is compatible with both prediction types.
      - This weighting strategy is only supported by PyTorch and is unavailable in the Flax training script.
    
    - Add the `--snr_gamma` parameter and set it to the recommended value of `5.0`:
    ```
    accelerate launch train_text_to_image.py \
      --snr_gamma=5.0
    ```
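
As a sketch of what `--snr_gamma` does, the loss at each timestep is scaled by `min(SNR, gamma) / SNR` for epsilon prediction and by `min(SNR, gamma) / (SNR + 1)` for `v_prediction`, where `SNR = alpha_bar / (1 - alpha_bar)`. The plain-Python version below assumes a linear beta schedule with 1000 steps, and the function names are made up for the example.

```python
def linear_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    out, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        out.append(running)
    return out

def snr(alpha_bar):
    """Signal-to-noise ratio of the noised sample at one timestep."""
    return alpha_bar / (1.0 - alpha_bar)

def min_snr_weight(alpha_bar, gamma=5.0, prediction_type="epsilon"):
    """Per-timestep loss weight with the SNR clipped at gamma."""
    s = snr(alpha_bar)
    clipped = min(s, gamma)
    if prediction_type == "epsilon":
        return clipped / s
    if prediction_type == "v_prediction":
        return clipped / (s + 1.0)
    raise ValueError(f"unsupported prediction type: {prediction_type}")

schedule = linear_alphas_cumprod()
for t in (0, 250, 500, 999):
    w_eps = min_snr_weight(schedule[t])
    w_v = min_snr_weight(schedule[t], prediction_type="v_prediction")
    print(f"t={t:3d}  snr={snr(schedule[t]):10.4f}  eps={w_eps:.4f}  v={w_v:.4f}")
```

Low-noise timesteps (high SNR) receive small weights, while heavily noised timesteps keep a weight of 1 under epsilon prediction, which is how the loss gets rebalanced.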

#### Training script
- The dataset preprocessing code and training loop are found in the `main()` function. If you need to adapt the training script, this is where you’ll need to make your changes.
- The `train_text_to_image` script starts by loading a scheduler and tokenizer.
  - You can choose to use a different scheduler here if you want.
  - It then loads the UNet model:

In [None]:
noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
tokenizer = CLIPTokenizer.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="tokenizer", revision=args.revision
)

# excerpt from the script: `input_dir` and `model` are defined by the surrounding code
load_model = UNet2DConditionModel.from_pretrained(input_dir, subfolder="unet")
model.register_to_config(**load_model.config)

model.load_state_dict(load_model.state_dict())

- Text and image columns of the dataset need to be preprocessed.
  - The `tokenize_captions` function handles tokenizing the inputs, and the `train_transforms` function specifies the type of transforms to apply to the image.
  - Both of these functions are bundled into `preprocess_train`:

In [None]:
def preprocess_train(examples):
    images = [image.convert("RGB") for image in examples[image_column]]
    examples["pixel_values"] = [train_transforms(image) for image in images]
    examples["input_ids"] = tokenize_captions(examples)
    return examples
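
To make the batched preprocessing concrete, here is a dependency-free sketch of the caption side. A toy whitespace tokenizer stands in for `CLIPTokenizer`, and the vocabulary, padding id, and maximum length are invented for the example; the point is only that every caption in the batch becomes a fixed-length list of ids.

```python
import random

MAX_LENGTH = 8   # hypothetical; the real tokenizer defines its own limit
PAD_ID = 0
vocab = {}       # toy vocabulary built on the fly

def toy_tokenize(text):
    """Map words to ids, then truncate and pad to MAX_LENGTH."""
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.lower().split()]
    ids = ids[:MAX_LENGTH]
    return ids + [PAD_ID] * (MAX_LENGTH - len(ids))

def tokenize_captions(examples, caption_column="text", is_train=True):
    """Pick one caption per example and tokenize it."""
    captions = []
    for caption in examples[caption_column]:
        if isinstance(caption, str):
            captions.append(caption)
        else:
            # Several captions for one image: sample one while training.
            captions.append(random.choice(caption) if is_train else caption[0])
    return [toy_tokenize(c) for c in captions]

batch = {"text": ["a ninja with orange hair", ["a green frog", "a frog in a pond"]]}
input_ids = tokenize_captions(batch, is_train=False)
print(input_ids)
```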

#### Launch the script
- Once you’ve made all your changes or you’re okay with the default configuration, you’re ready to launch the training script.
- **PyTorch**
    - Let’s train on the Naruto BLIP captions dataset to generate your own Naruto characters.
    - Set the environment variables `MODEL_NAME` and `dataset_name` to the model and the dataset (either from the Hub or a local path).
    - If you’re training on more than one GPU, add the `--multi_gpu` parameter to the accelerate launch command.
        ```
        export MODEL_NAME="stable-diffusion-v1-5/stable-diffusion-v1-5"
        export dataset_name="lambdalabs/naruto-blip-captions"
        
        accelerate launch --mixed_precision="fp16"  train_text_to_image.py \
          --pretrained_model_name_or_path=$MODEL_NAME \
          --dataset_name=$dataset_name \
          --use_ema \
          --resolution=512 --center_crop --random_flip \
          --train_batch_size=1 \
          --gradient_accumulation_steps=4 \
          --gradient_checkpointing \
          --max_train_steps=15000 \
          --learning_rate=1e-05 \
          --max_grad_norm=1 \
          --enable_xformers_memory_efficient_attention \
          --lr_scheduler="constant" --lr_warmup_steps=0 \
          --output_dir="sd-naruto-model" \
          --push_to_hub
        ```

- **Flax**
  - Training with Flax can be faster on TPUs and GPUs.
    - Flax is more efficient on a TPU, but GPU performance is also great.
    - Set the environment variables `MODEL_NAME` and `dataset_name` to the model and the dataset (either from the Hub or a local path).
    ```
    export MODEL_NAME="stable-diffusion-v1-5/stable-diffusion-v1-5"
    export dataset_name="lambdalabs/naruto-blip-captions"
    
    python train_text_to_image_flax.py \
      --pretrained_model_name_or_path=$MODEL_NAME \
      --dataset_name=$dataset_name \
      --resolution=512 --center_crop --random_flip \
      --train_batch_size=1 \
      --max_train_steps=15000 \
      --learning_rate=1e-05 \
      --max_grad_norm=1 \
      --output_dir="sd-naruto-model" \
      --push_to_hub
    ```

- Use your newly trained model for inference:

In [None]:
# PyTorch
pipeline = StableDiffusionPipeline.from_pretrained("path/to/saved_model", torch_dtype=torch.float16, use_safetensors=True).to("cuda")

image = pipeline(prompt="yoda").images[0]
image.save("yoda-naruto.png")

# Flax
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained("path/to/saved_model", dtype=jax.numpy.bfloat16)

prompt = "yoda naruto"
prng_seed = jax.random.PRNGKey(0)
num_inference_steps = 50

num_samples = jax.device_count()
prompt = num_samples * [prompt]
prompt_ids = pipeline.prepare_inputs(prompt)

# shard inputs and rng
params = replicate(params)
prng_seed = jax.random.split(prng_seed, jax.device_count())
prompt_ids = shard(prompt_ids)

images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
# numpy_to_pil returns a list of PIL images (one per device); save the first one
images[0].save("yoda-naruto.png")

#### Next Steps
- Learn how to load `LoRA` weights for inference if you trained your model with LoRA.
- Learn more about how certain parameters like guidance scale or techniques such as prompt weighting can help you control inference in the Text-to-image task guide.

----
### **Model | Stable Diffusion XL**
- Stable Diffusion XL (SDXL) is a larger and more powerful iteration of the Stable Diffusion model, capable of producing higher resolution images.
  - SDXL’s UNet is 3x larger and the model adds a second text encoder to the architecture.
  - Depending on the hardware available to you, this can be very computationally intensive and it may not run on a GPU with limited memory like a Tesla T4.
  - To help fit this larger model into memory and to speed up training, try enabling `gradient_checkpointing`, `mixed_precision`, and `gradient_accumulation_steps`.
  - You can reduce your memory usage even more by enabling memory-efficient attention with xFormers and using bitsandbytes’ 8-bit optimizer.

- Explore the `train_text_to_image_sdxl.py` training script to help you become more familiar with it, and how you can adapt it for your own use-case.
- Before running the script, make sure you install the library from source:
```
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

- Navigate to the example folder containing the training script and install the required dependencies for the script you’re using:
```
cd examples/text_to_image
pip install -r requirements_sdxl.txt
```

- Initialize an `Accelerate` environment:
```
accelerate config
```

- To set up a default `Accelerate` environment without choosing any configurations:
```
accelerate config default
```

- If your environment doesn’t support an interactive shell, like a notebook, you can use:

In [None]:
write_basic_config()

#### Script parameters
- The training script provides many parameters to help you customize your training run.
  - All of the parameters and their descriptions are found in the `parse_args()` function.
  - This function provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you’d like.

- To speed up training with mixed precision using the bf16 format, add the `--mixed_precision` parameter to the training command:
```
accelerate launch train_text_to_image_sdxl.py \
  --mixed_precision="bf16"
```

- Most of the parameters are identical to the parameters in the Text-to-image training guide, so this guide focuses on the parameters that are relevant to training SDXL.
  - `--pretrained_vae_model_name_or_path`: path to a pretrained VAE; the SDXL VAE is known to suffer from numerical instability, so this parameter allows you to specify a better VAE
  - `--proportion_empty_prompts`: the proportion of image prompts to replace with empty strings
  - `--timestep_bias_strategy`: where (earlier vs. later) in the timestep to apply a bias, which can encourage the model to either learn low or high frequency details
  - `--timestep_bias_multiplier`: the weight of the bias to apply to the timestep
  - `--timestep_bias_begin`: the timestep to begin applying the bias
  - `--timestep_bias_end`: the timestep to end applying the bias
  - `--timestep_bias_portion`: the proportion of timesteps to apply the bias to
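
For instance, `--proportion_empty_prompts` can be pictured as a per-caption coin flip. The sketch below is a simplified stand-in for what the training script does, not its actual code; the helper name and captions are hypothetical.

```python
import random

def drop_prompts(captions, proportion_empty_prompts, rng):
    """Replace each caption with "" with the given probability."""
    return ["" if rng.random() < proportion_empty_prompts else c for c in captions]

rng = random.Random(0)
captions = [f"caption {i}" for i in range(1000)]

dropped = drop_prompts(captions, 0.2, rng)
empty_fraction = sum(c == "" for c in dropped) / len(dropped)
print(f"empty fraction: {empty_fraction:.3f}")
```

Training on a share of empty prompts is commonly done so the model also learns an unconditional prediction, which classifier-free guidance relies on at inference time.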

- **Min-SNR weighting**
  - The Min-SNR weighting strategy can help with training by rebalancing the loss to achieve faster convergence. 
  - The training script supports predicting either epsilon (noise) or `v_prediction`, and Min-SNR is compatible with both prediction types.
  - This weighting strategy is only supported by PyTorch and is unavailable in the Flax training script.
  - Add the `--snr_gamma` parameter and set it to the recommended value of `5.0`:
```
accelerate launch train_text_to_image_sdxl.py \
  --snr_gamma=5.0
```

#### Training script
- The training script is also similar to the Text-to-image training guide, but it’s been modified to support SDXL training.
  - This guide will focus on the code that is unique to the SDXL training script.
  - It starts by creating functions to tokenize the prompts to calculate the prompt embeddings, and to compute the image embeddings with the VAE.
- You’ll also find a function to generate the timestep weights depending on the number of timesteps and the timestep bias strategy to apply.
  - Within the `main()` function, in addition to loading a tokenizer, the script loads a second tokenizer and text encoder because the SDXL architecture uses two of each:

In [None]:
tokenizer_one = AutoTokenizer.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="tokenizer", revision=args.revision, use_fast=False
)
tokenizer_two = AutoTokenizer.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="tokenizer_2", revision=args.revision, use_fast=False
)

text_encoder_cls_one = import_model_class_from_model_name_or_path(
    args.pretrained_model_name_or_path, args.revision
)
text_encoder_cls_two = import_model_class_from_model_name_or_path(
    args.pretrained_model_name_or_path, args.revision, subfolder="text_encoder_2"
)

- The prompt and image embeddings are computed first and kept in memory, which isn’t typically an issue for a smaller dataset, but for larger datasets it can lead to memory problems.
  - If this is the case, you should save the pre-computed embeddings to disk separately and load them into memory during the training process (see this PR for more discussion about this topic).

In [None]:
text_encoders = [text_encoder_one, text_encoder_two]
tokenizers = [tokenizer_one, tokenizer_two]
compute_embeddings_fn = functools.partial(
    encode_prompt,
    text_encoders=text_encoders,
    tokenizers=tokenizers,
    proportion_empty_prompts=args.proportion_empty_prompts,
    caption_column=args.caption_column,
)

train_dataset = train_dataset.map(compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint)
train_dataset = train_dataset.map(
    compute_vae_encodings_fn,
    batched=True,
    batch_size=args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps,
    new_fingerprint=new_fingerprint_for_vae,
)

- After calculating the embeddings, the text encoders, VAE, and tokenizers are deleted to free up some memory:

In [None]:
del text_encoders, tokenizers, vae
gc.collect()
torch.cuda.empty_cache()

- The training loop takes care of the rest.
  - If you chose to apply a timestep bias strategy, you’ll see that the timestep weights are calculated and used to sample the timesteps at which noise is added:

In [None]:
weights = generate_timestep_weights(args, noise_scheduler.config.num_train_timesteps).to(
    model_input.device
)
timesteps = torch.multinomial(weights, bsz, replacement=True).long()

noisy_model_input = noise_scheduler.add_noise(model_input, noise, timesteps)
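
A simplified picture of the timestep bias: start from uniform weights over all timesteps, multiply the weights in the biased region by the multiplier, renormalize, and sample timesteps from the result. The sketch below handles only an "earlier" or "later" strategy with a portion, uses `random.choices` in place of `torch.multinomial`, and is an approximation of the idea rather than the script's `generate_timestep_weights`.

```python
import random

def timestep_weights(num_timesteps, strategy="later", multiplier=2.0, portion=0.25):
    """Sampling probabilities over timesteps with one region up-weighted."""
    weights = [1.0] * num_timesteps
    num_to_bias = int(portion * num_timesteps)
    if strategy == "later":
        biased = range(num_timesteps - num_to_bias, num_timesteps)
    elif strategy == "earlier":
        biased = range(0, num_to_bias)
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    for t in biased:
        weights[t] *= multiplier
    total = sum(weights)
    return [w / total for w in weights]  # normalize so the weights sum to 1

rng = random.Random(0)
weights = timestep_weights(1000, strategy="later", multiplier=2.0, portion=0.25)

# One timestep per sample in a batch of 4, drawn with replacement.
timesteps = rng.choices(range(1000), weights=weights, k=4)
print(timesteps)
```

With these settings the last quarter of the timesteps is twice as likely to be drawn as the rest, so more training signal goes to that part of the noise schedule.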

#### Launch the script
- Once you’ve made all your changes or you’re okay with the default configuration, you’re ready to launch the training script.
- Train on the Naruto BLIP captions dataset to generate your own Naruto characters.
  - Set the environment variables `MODEL_NAME` and `DATASET_NAME` to the model and the dataset (either from the Hub or a local path).
  - You should also specify a VAE other than the SDXL VAE (either from the Hub or a local path) with `VAE_NAME` to avoid numerical instabilities.
  ```
  export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
  export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"
  export DATASET_NAME="lambdalabs/naruto-blip-captions"

  accelerate launch train_text_to_image_sdxl.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --pretrained_vae_model_name_or_path=$VAE_NAME \
    --dataset_name=$DATASET_NAME \
    --enable_xformers_memory_efficient_attention \
    --resolution=512 \
    --center_crop \
    --random_flip \
    --proportion_empty_prompts=0.2 \
    --train_batch_size=1 \
    --gradient_accumulation_steps=4 \
    --gradient_checkpointing \
    --max_train_steps=10000 \
    --use_8bit_adam \
    --learning_rate=1e-06 \
    --lr_scheduler="constant" \
    --lr_warmup_steps=0 \
    --mixed_precision="fp16" \
    --report_to="wandb" \
    --validation_prompt="a cute Sundar Pichai creature" \
    --validation_epochs 5 \
    --checkpointing_steps=5000 \
    --output_dir="sdxl-naruto-model" \
    --push_to_hub
  ```

- Use your newly trained SDXL model for inference
  - `PyTorch XLA` allows you to run PyTorch on XLA devices such as TPUs, which can be faster.
  - The initial warmup step takes longer because the model needs to be compiled and optimized.
  - Subsequent calls to the pipeline on an input with the same length as the original prompt are much faster because it can reuse the optimized graph.

In [None]:
# PyTorch
pipeline = DiffusionPipeline.from_pretrained("path/to/your/model", torch_dtype=torch.float16).to("cuda")

prompt = "A naruto with green eyes and red legs."
image = pipeline(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("naruto.png")

# PyTorch XLA
from time import time

device = xm.xla_device()
pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0").to(device)

prompt = "A naruto with green eyes and red legs."
inference_steps = 30  # assumed value; the original snippet leaves this undefined

# first call: includes graph compilation
start = time()
image = pipeline(prompt, num_inference_steps=inference_steps).images[0]
print(f'Compilation time is {time()-start} sec')
image.save("naruto.png")

# second call: reuses the compiled graph
start = time()
image = pipeline(prompt, num_inference_steps=inference_steps).images[0]
print(f'Inference time is {time()-start} sec after compilation')

#### Next steps
- Read the `Stable Diffusion XL` guide to learn how to use it for a variety of different tasks (text-to-image, image-to-image, inpainting), how to use its refiner model, and the different types of micro-conditionings.
- Check out the `DreamBooth` and `LoRA` training guides to learn how to train a personalized `SDXL` model with just a few example images.

-----------------
- Other training pipelines also follow the sequence:
  - script parameters → training script → launch the script.

- The related repositories are as follows:

### **Model | Kandinsky 2.2**
> Original Source: https://huggingface.co/docs/diffusers/main/training/kandinsky
```
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .

cd examples/kandinsky2_2/text_to_image
pip install -r requirements.txt
```


### **Model | Wuerstchen**
> Original Source: https://huggingface.co/docs/diffusers/main/training/wuerstchen
```
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .

cd examples/wuerstchen/text_to_image
pip install -r requirements.txt
```


### **Model | ControlNet**
> Original Source: https://huggingface.co/docs/diffusers/main/training/controlnet
```
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .

cd examples/controlnet
pip install -r requirements.txt
```

### **Model | T2I-Adapter**
> Original Source: https://huggingface.co/docs/diffusers/main/training/t2i_adapters
```
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .

cd examples/t2i_adapter
pip install -r requirements.txt
```


### **Model | InstructPix2Pix**
> Original Source: https://huggingface.co/docs/diffusers/main/training/instructpix2pix
```
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .

cd examples/instruct_pix2pix
pip install -r requirements.txt
```

----
### **Method | Textual Inversion**
- **Textual Inversion** is a training technique for personalizing image generation models with just a few example images of what you want it to learn.
  - This technique works by learning and updating the text embeddings (the new embeddings are tied to a special word you must use in the prompt) to match the example images you provide.

- If you’re training on a GPU with limited VRAM, try enabling the `gradient_checkpointing` and `mixed_precision` parameters in the training command.
  - You can also reduce your memory footprint by using memory-efficient attention with xFormers.
  - JAX/Flax training is also supported for efficient training on TPUs and GPUs, but it doesn’t support gradient checkpointing or xFormers.
  - With the same configuration and setup as PyTorch, the Flax training script should be at least `~70%` faster.

- Explore the `textual_inversion.py` script to help you become more familiar with it, and how you can adapt it for your own use-case.
  - Before running the script, make sure you install the library from source:
```
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

- Navigate to the example folder with the training script and install the required dependencies for the script you’re using:
```
cd examples/textual_inversion
pip install -r requirements.txt
```

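- Initialize an `Accelerate` environment with `accelerate config`.
  - To set up a default `Accelerate` environment without choosing any configurations, run `accelerate config default` instead:
```
accelerate config
# or, to accept the defaults without answering any prompts:
accelerate config default
```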

- If your environment doesn’t support an interactive shell, like a notebook, you can use:

In [None]:
from accelerate.utils import write_basic_config

write_basic_config()

#### Script parameters
- The training script has many parameters to help you tailor the training run to your needs.
  - All of the parameters and their descriptions are listed in the `parse_args()` function.
  - Where applicable, Diffusers provides default values for each parameter such as the training batch size and learning rate, but feel free to change these values in the training command if you’d like.

- To increase the number of gradient accumulation steps above the default value of 1:
```
accelerate launch textual_inversion.py \
  --gradient_accumulation_steps=4
```
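
- As a quick sanity check on what this flag changes, the sketch below computes the effective batch size, i.e. the number of samples that contribute to one optimizer update. The helper name `effective_batch_size` is hypothetical and only used for illustration:

```python
def effective_batch_size(train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_processes: int = 1) -> int:
    """Number of samples that contribute to a single optimizer update."""
    return train_batch_size * gradient_accumulation_steps * num_processes


# One sample per step, accumulated over 4 steps on a single process
print(effective_batch_size(train_batch_size=1, gradient_accumulation_steps=4))  # 4
```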

- Some other basic and important parameters to specify include:
  - `--pretrained_model_name_or_path`: the name of the model on the Hub or a local path to the pretrained model
  - `--train_data_dir`: path to a folder containing the training dataset (example images)
  - `--output_dir`: where to save the trained model
  - `--push_to_hub`: whether to push the trained model to the Hub
  - `--checkpointing_steps`: how often to save a checkpoint as the model trains; if training is interrupted for some reason, you can continue from that checkpoint by adding `--resume_from_checkpoint` to your training command
  - `--num_vectors`: the number of vectors to learn the embeddings with; increasing this parameter helps the model learn better but it comes with increased training costs
  - `--placeholder_token`: the special word to tie the learned embeddings to (you must use the word in your prompt for inference)
  - `--initializer_token`: a single word that roughly describes the object or style you’re trying to train on
  - `--learnable_property`: whether you’re training the model to learn a new “style” (for example, Van Gogh’s painting style) or “object”
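
- As a rough illustration of how flags like these are declared, the sketch below builds a reduced argument parser with Python's `argparse`. It is not the script's actual `parse_args()`: the default values and the `choices` restriction are assumptions made for this example.

```python
import argparse


def parse_args(argv=None):
    """Reduced, illustrative parser; defaults here are assumptions."""
    parser = argparse.ArgumentParser(description="Textual Inversion flags (sketch)")
    parser.add_argument("--pretrained_model_name_or_path", type=str, required=True)
    parser.add_argument("--train_data_dir", type=str, required=True)
    parser.add_argument("--output_dir", type=str, default="text-inversion-model")
    parser.add_argument("--push_to_hub", action="store_true")
    parser.add_argument("--checkpointing_steps", type=int, default=500)
    parser.add_argument("--resume_from_checkpoint", type=str, default=None)
    parser.add_argument("--num_vectors", type=int, default=1)
    parser.add_argument("--placeholder_token", type=str, required=True)
    parser.add_argument("--initializer_token", type=str, required=True)
    parser.add_argument("--learnable_property", type=str, default="object",
                        choices=["object", "style"])
    return parser.parse_args(argv)


# Passing a list stands in for the flags typed on the command line
args = parse_args([
    "--pretrained_model_name_or_path", "stable-diffusion-v1-5/stable-diffusion-v1-5",
    "--train_data_dir", "./cat",
    "--placeholder_token", "<cat-toy>",
    "--initializer_token", "toy",
    "--num_vectors", "2",
])
print(args.placeholder_token, args.num_vectors, args.learnable_property)
```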

#### Training script
- Unlike some of the other training scripts, `textual_inversion.py` has a custom dataset class, `TextualInversionDataset`, for creating a dataset.
- You can customize the image size, placeholder token, interpolation method, whether to crop the image, and more.
- If you need to change how the dataset is created, you can modify `TextualInversionDataset`.
  - Find the dataset preprocessing code and training loop in the `main()` function.
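
- To give a feel for what such a dataset does with the placeholder token, the sketch below pairs each image with a caption built from a prompt template. The template lists are a small illustrative subset and `build_prompt` is a hypothetical helper, not part of the script:

```python
import random

# Illustrative templates; the real dataset class may use a longer list
OBJECT_TEMPLATES = ["a photo of a {}", "a rendering of a {}", "a close-up photo of a {}"]
STYLE_TEMPLATES = ["a painting in the style of {}", "a rendering in the style of {}"]


def build_prompt(placeholder_token, learnable_property="object", rng=None):
    """Pick a template for the given property and fill in the placeholder token."""
    rng = rng or random.Random()
    templates = STYLE_TEMPLATES if learnable_property == "style" else OBJECT_TEMPLATES
    return rng.choice(templates).format(placeholder_token)


print(build_prompt("<cat-toy>"))            # caption for an "object" concept
print(build_prompt("<van-gogh>", "style"))  # caption for a "style" concept
```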

- The script starts by loading the tokenizer, scheduler and model:

In [None]:
# Excerpt from `main()`; `args` is the namespace returned by `parse_args()`
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

# Load tokenizer
if args.tokenizer_name:
    tokenizer = CLIPTokenizer.from_pretrained(args.tokenizer_name)
elif args.pretrained_model_name_or_path:
    tokenizer = CLIPTokenizer.from_pretrained(args.pretrained_model_name_or_path, subfolder="tokenizer")

# Load scheduler and models
noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
text_encoder = CLIPTextModel.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="text_encoder", revision=args.revision
)
vae = AutoencoderKL.from_pretrained(args.pretrained_model_name_or_path, subfolder="vae", revision=args.revision)
unet = UNet2DConditionModel.from_pretrained(
    args.pretrained_model_name_or_path, subfolder="unet", revision=args.revision
)

- Next, the special placeholder token is added to the tokenizer, and the embedding layer is resized to account for the new token.
  - The script then creates a dataset from `TextualInversionDataset`:

In [None]:
train_dataset = TextualInversionDataset(
    data_root=args.train_data_dir,
    tokenizer=tokenizer,
    size=args.resolution,
    placeholder_token=(" ".join(tokenizer.convert_ids_to_tokens(placeholder_token_ids))),
    repeats=args.repeats,
    learnable_property=args.learnable_property,
    center_crop=args.center_crop,
    set="train",
)
train_dataloader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.train_batch_size, shuffle=True, num_workers=args.dataloader_num_workers
)
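
- The placeholder-token step described above can be pictured with a toy vocabulary. The sketch below is not the script's code: it uses a plain dictionary and nested lists instead of a real tokenizer and embedding layer, and the `_1`, `_2`, ... suffix for extra vectors is an assumed naming scheme.

```python
def add_placeholder_tokens(vocab, embeddings, placeholder_token,
                           initializer_token, num_vectors=1):
    """Append new tokens to a toy vocabulary and copy the initializer's embedding."""
    if initializer_token not in vocab:
        raise ValueError(f"unknown initializer token: {initializer_token}")
    names = [placeholder_token] + [f"{placeholder_token}_{i}" for i in range(1, num_vectors)]
    init_row = embeddings[vocab[initializer_token]]
    new_ids = []
    for name in names:
        if name in vocab:
            raise ValueError(f"token already exists: {name}")
        vocab[name] = len(embeddings)       # the new id is the next free row
        embeddings.append(list(init_row))   # "resize" the table by one row
        new_ids.append(vocab[name])
    return new_ids


vocab = {"a": 0, "toy": 1, "train": 2}
embeddings = [[0.1, 0.2], [0.5, -0.3], [0.0, 0.9]]
ids = add_placeholder_tokens(vocab, embeddings, "<cat-toy>", "toy", num_vectors=2)
print(ids)  # [3, 4]
```

- Starting the new rows from the initializer token's embedding gives training a sensible starting point instead of random values.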

- The training loop handles everything else from predicting the noisy residual to updating the embedding weights of the special placeholder token.
  - Check out the *Understanding pipelines, models and schedulers* tutorial, which breaks down the basic pattern of the denoising process.
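
- To make "updating the embedding weights of the special placeholder token" concrete, the sketch below shows one simple way to restrict an update to the placeholder rows: apply a gradient step to the whole table, then restore every other row. It is a toy with nested lists and plain SGD, offered under those assumptions; the actual script may implement this differently.

```python
def sgd_step_placeholder_only(embeddings, grads, placeholder_ids, lr):
    """Apply an SGD step, then undo it for every row that is not a placeholder."""
    frozen = [list(row) for row in embeddings]   # snapshot before the update
    for row, grad in zip(embeddings, grads):
        for j, g in enumerate(grad):
            row[j] -= lr * g
    keep = set(placeholder_ids)
    for i in range(len(embeddings)):
        if i not in keep:
            embeddings[i] = frozen[i]            # restore frozen rows
    return embeddings


emb = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
sgd_step_placeholder_only(emb, grads, placeholder_ids=[2], lr=0.5)
print(emb)  # [[1.0, 1.0], [2.0, 2.0], [2.5, 3.5]]
```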

#### Launch the script
- You’ll download some images of a cat toy and store them in a directory.

In [None]:
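from huggingface_hub import snapshot_download

# Download the example cat toy images from the Hub into ./cat
local_dir = "./cat"
snapshot_download(
    "diffusers/cat_toy_example", local_dir=local_dir, repo_type="dataset", ignore_patterns=".gitattributes"
)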

- Set the environment variable `MODEL_NAME` to a model id on the Hub or a path to a local model, and `DATA_DIR` to the path where you just downloaded the cat images to.
  - The script creates and saves the following files to your repository:
    - `learned_embeds.bin`: the learned embedding vectors corresponding to your example images
    - `token_identifier.txt`: the special placeholder token
    - `type_of_concept.txt`: the type of concept you’re training on (either “object” or “style”)

- A full training run takes ~1 hour on a single V100 GPU.

- If you’re interested in following along with the training process, you can periodically save generated images as training progresses. Add the following parameters to the training command:

```
--validation_prompt="A <cat-toy> train"
--num_validation_images=4
--validation_steps=100
```

- **PyTorch**
```
export MODEL_NAME="stable-diffusion-v1-5/stable-diffusion-v1-5"
export DATA_DIR="./cat"

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" \
  --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 \
  --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="textual_inversion_cat" \
  --push_to_hub
```

- **Flax**
```
export MODEL_NAME="duongna/stable-diffusion-v1-4-flax"
export DATA_DIR="./cat"

python textual_inversion_flax.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" \
  --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 \
  --scale_lr \
  --output_dir="textual_inversion_cat" \
  --push_to_hub
```
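
- The PyTorch command above combines `--scale_lr` with `--train_batch_size=1` and `--gradient_accumulation_steps=4`. The sketch below works out the resulting learning rate, assuming `--scale_lr` multiplies the base rate by the batch size, the accumulation steps and the number of processes; check the script's source to confirm. `scaled_learning_rate` is a hypothetical helper.

```python
def scaled_learning_rate(base_lr, train_batch_size,
                         gradient_accumulation_steps, num_processes=1):
    """Learning rate after scaling (assumed rule: multiply by all three factors)."""
    return base_lr * train_batch_size * gradient_accumulation_steps * num_processes


# Values from the PyTorch launch command, on a single process
lr = scaled_learning_rate(5.0e-04, train_batch_size=1, gradient_accumulation_steps=4)
print(lr)
```

- Under that assumption, the optimizer runs at four times the value passed to `--learning_rate` on a single GPU, which is worth keeping in mind when changing the batch size or the number of devices.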

- Use your newly trained model for inference like:

In [None]:
# PyTorch
pipeline = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipeline.load_textual_inversion("sd-concepts-library/cat-toy")
image = pipeline("A <cat-toy> train", num_inference_steps=50).images[0]
image.save("cat-train.png")

# Flax
model_path = "path-to-your-trained-model"
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(model_path, dtype=jax.numpy.bfloat16)

prompt = "A <cat-toy> train"
prng_seed = jax.random.PRNGKey(0)
num_inference_steps = 50

num_samples = jax.device_count()
prompt = num_samples * [prompt]
prompt_ids = pipeline.prepare_inputs(prompt)

# shard inputs and rng
params = replicate(params)
prng_seed = jax.random.split(prng_seed, jax.device_count())
prompt_ids = shard(prompt_ids)

images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
images[0].save("cat-train.png")  # `images` is a list of PIL images; save the first one

#### Next Steps
- Learn how to load Textual Inversion embeddings and also use them as negative embeddings.
- Learn how to use Textual Inversion for inference with `Stable Diffusion 1/2` and `Stable Diffusion XL`.

-----------------
- Other training pipelines also follow the sequence:
  - script parameters → training script → launch the script.

- The setup commands for each guide are as follows:

### **Method | DreamBooth**
> Original Source: https://huggingface.co/docs/diffusers/main/training/dreambooth
```
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .

cd examples/dreambooth
pip install -r requirements.txt
```

### **Method | LoRA**
> Original Source: https://huggingface.co/docs/diffusers/main/training/lora
```
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .

cd examples/text_to_image
pip install -r requirements.txt
```

### **Method | Custom Diffusion**
> Original Source: https://huggingface.co/docs/diffusers/main/training/custom_diffusion
```
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .

cd examples/custom_diffusion
pip install -r requirements.txt
pip install clip-retrieval
```

### **Method | Latent Consistency Distillation**
> Original Source: https://huggingface.co/docs/diffusers/main/training/lcm_distill
```
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .

cd examples/consistency_distillation
pip install -r requirements.txt
```