## **6. T2I <sup>Text-to-image</sup> Pipelines**

> Original Source: https://huggingface.co/docs/diffusers/v0.33.1/en/using-diffusers/sdxl_turbo

```
> Stable Diffusion XL
> Stable Diffusion XL Turbo
> Kandinsky
> IP-Adapter
> OmniGen
> Perturbed-Attention Guidance(PAG)
> ControlNet
> T2I-Adapter
> Latent Consistency Model
> Textual inversion
> DiffEdit
> Trajectory Consistency Distillation-LoRA
```

In [27]:
import torch
import numpy as np
from PIL import Image
import cv2

from transformers import pipeline

from diffusers import DiffusionPipeline
from diffusers import DDPMScheduler

from diffusers.utils import load_image, make_image_grid

-----
### **Stable Diffusion XL**
- `Stable Diffusion XL (SDXL)` is a powerful text-to-image generation model that iterates on the previous Stable Diffusion models in three key ways:
  - The `UNet` is 3x larger and `SDXL` combines a second text encoder (`OpenCLIP ViT-bigG/14`) with the original text encoder to significantly increase the number of parameters
  - Introduces size and crop-conditioning to preserve training data from being discarded and gain more control over how a generated image should be cropped
  - Introduces a two-stage model process; the base model (can also be run as a standalone model) generates an image as an input to the refiner model which adds additional high-quality details.

<br>

- Install:
```
pip install -q diffusers transformers accelerate "invisible-watermark>=0.2.0"
```

- We recommend installing the `invisible-watermark` library to help identify images that are generated. If the invisible-watermark library is installed, it is used by default. To disable the watermarker:

```
pipeline = StableDiffusionXLPipeline.from_pretrained(..., add_watermarker=False)
```
<br>

#### Load model checkpoints
- Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the `from_pretrained()` method:

In [8]:
from diffusers import StableDiffusionXLPipeline
from diffusers import AutoPipelineForText2Image

from diffusers import StableDiffusionXLImg2ImgPipeline, AutoPipelineForImage2Image
from diffusers import StableDiffusionXLInpaintPipeline, AutoPipelineForInpainting

from diffusers import DiffusionPipeline

In [None]:
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16"
).to("cuda")

- You can also use the `from_single_file()` method to load a model checkpoint stored in a single file format (`.ckpt` or `.safetensors`) from the Hub or locally:

In [None]:
pipeline = StableDiffusionXLPipeline.from_single_file(
    "https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0.safetensors",
    torch_dtype=torch.float16
).to("cuda")

refiner = StableDiffusionXLImg2ImgPipeline.from_single_file(
    "https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/blob/main/sd_xl_refiner_1.0.safetensors", torch_dtype=torch.float16
).to("cuda")

#### Text-to-image
- For text-to-image, pass a text prompt. 
- By default, `SDXL` generates a 1024x1024 image for the best results.
  - You can try setting the height and width parameters to 768x768 or 512x512, but anything below 512x512 is not likely to work.

In [None]:
pipeline_text2image = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipeline_text2image(prompt=prompt).images[0]
image

#### Refine image quality
- `SDXL` includes a refiner model specialized in denoising low-noise stage images to generate higher-quality images from the base model. There are two ways to use the refiner:
  - Use the base and refiner models together to produce a refined image
  - Use the base model to produce an image, and subsequently use the refiner model to add more details to the image (this is how `SDXL` was originally trained)

<br>

- **Base + refiner model**
  - When you use the base and refiner model together to generate an image, this is known as an ensemble of expert denoisers.
  - The ensemble of expert denoisers approach requires fewer overall denoising steps versus passing the base model’s output to the refiner model, so it should be significantly faster to run.
  - However, you won’t be able to inspect the base model’s output because it still contains a large amount of noise.
  - As an ensemble of expert denoisers, the base model serves as the expert during the high-noise diffusion stage and the refiner model serves as the expert during the low-noise diffusion stage. Load the base and refiner model:

In [None]:
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
).to("cuda")

- To use this approach, you need to **define the number of timesteps for each model** to run through their respective stages.
  - For the base model, this is controlled by the `denoising_end` parameter and for the refiner model, it is controlled by the `denoising_start` parameter.
 
- The `denoising_end` and `denoising_start` parameters should be a float between 0 and 1.
  - These parameters are represented as a proportion of discrete timesteps as defined by the scheduler.
  - If you’re also using the strength parameter, it’ll be ignored because the number of denoising steps is determined by the discrete timesteps the model is trained on and the declared fractional cutoff.

- Let’s set `denoising_end=0.8` so the base model performs the first 80% of denoising the high-noise timesteps and set `denoising_start=0.8` so the refiner model performs the last 20% of denoising the low-noise timesteps.
  - The base model output should be in latent space instead of a PIL image.

In [None]:
prompt = "A majestic lion jumping from a big stone at night"

image = base(
    prompt=prompt,
    num_inference_steps=40,
    denoising_end=0.8,
    output_type="latent",
).images
image = refiner(
    prompt=prompt,
    num_inference_steps=40,
    denoising_start=0.8,
    image=image,
).images[0]
image
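
To make the 80/20 split more concrete, the sketch below mimics how a fractional cutoff could be mapped onto discrete timesteps. It is a simplified illustration rather than the library's implementation: `split_timesteps` is a hypothetical helper, and it assumes 1000 training timesteps with evenly spaced inference timesteps, whereas the real schedule depends on the scheduler configuration.

```python
def split_timesteps(num_inference_steps, cutoff_fraction, num_train_timesteps=1000):
    """Split an evenly spaced timestep schedule between base and refiner.

    Hypothetical helper for illustration; real schedulers may space timesteps differently.
    """
    step = num_train_timesteps // num_inference_steps
    # Highest-noise timestep first, e.g. 975, 950, ..., 0 for 40 steps
    timesteps = [i * step for i in reversed(range(num_inference_steps))]
    # Convert the fraction into a discrete timestep where the hand-off happens
    cutoff = round(num_train_timesteps * (1 - cutoff_fraction))
    base_steps = [t for t in timesteps if t >= cutoff]      # high-noise stage
    refiner_steps = [t for t in timesteps if t < cutoff]    # low-noise stage
    return base_steps, refiner_steps


base_steps, refiner_steps = split_timesteps(num_inference_steps=40, cutoff_fraction=0.8)
print(len(base_steps), len(refiner_steps))
```

Under these assumptions, 40 steps with a cutoff of 0.8 give the base model 32 steps and the refiner the remaining 8, so the total work stays at 40 steps.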

#### Base to refiner model
- `SDXL` gets a boost in image quality by using the refiner model to add additional high-quality details to the fully-denoised image from the base model, in an image-to-image setting.

In [None]:
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
).to("cuda")

- You can use `SDXL` refiner with a different base model.
  - You can use the `Hunyuan-DiT` or `PixArt-Sigma` pipelines to generate images with better prompt adherence.
  - Once you have generated an image, you can pass it to the `SDXL` refiner model to enhance final generation quality.
- Set the model output to latent space:

In [None]:
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

image = base(prompt=prompt, output_type="latent").images[0]

- For inpainting, load the base and the refiner model in the `StableDiffusionXLInpaintPipeline`, remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner.
- Pass the generated image to the refiner model:

In [None]:
image = refiner(prompt=prompt, image=image[None, :]).images[0]

#### Micro-conditioning
- `SDXL` training involves several additional conditioning techniques, which are referred to as micro-conditioning.
  - These include original image size, target image size, and cropping parameters.
  - The micro-conditionings can be used at inference time to create high-quality, centered images.

- You can use both micro-conditioning and negative micro-conditioning parameters thanks to classifier-free guidance.
  - They are available in the `StableDiffusionXLPipeline`, `StableDiffusionXLImg2ImgPipeline`, `StableDiffusionXLInpaintPipeline`, and `StableDiffusionXLControlNetPipeline`.
 
#### Size conditioning
- There are two types of size conditioning:
  - `original_size` conditioning comes from upscaled images in the training batch (because it would be wasteful to discard the smaller images which make up almost 40% of the total training data).
    - Using the default value of (1024, 1024) produces higher-quality images that resemble the 1024x1024 images in the dataset.
  - `target_size` conditioning comes from finetuning SDXL to support different image aspect ratios.
    - During inference, if you use the default value of (1024, 1024), you’ll get an image that resembles the composition of square images in the dataset.
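    - We recommend using the same value for `target_size` and `original_size`, but feel free to experiment with other options!
- You can also specify negative size conditions (`negative_original_size` and `negative_target_size`) to steer generation away from certain image resolutions: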

In [None]:
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(
    prompt=prompt,
    negative_original_size=(512, 512),
    negative_target_size=(1024, 1024),
).images[0]

#### Crop conditioning
- Images generated by previous Stable Diffusion models may sometimes appear to be cropped.
  - This is because images are actually cropped during training so that all the images in a batch have the same size.
  - By conditioning on crop coordinates, `SDXL` learns that no cropping - coordinates (0, 0) - usually correlates with centered subjects and complete faces.

In [None]:
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipeline(prompt=prompt, crops_coords_top_left=(256, 0)).images[0]
image
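
One way to picture micro-conditioning is as six extra numbers handed to the model alongside the prompt. The sketch below is only an illustration under that assumption: `build_micro_conditioning` is a hypothetical helper, and a real pipeline would still need to embed these values before using them.

```python
def build_micro_conditioning(original_size, crops_coords_top_left, target_size):
    """Pack the three micro-conditioning tuples into one flat list of six numbers."""
    return list(original_size + crops_coords_top_left + target_size)


# Positive conditioning: full-resolution source, no cropping, square target
positive = build_micro_conditioning((1024, 1024), (0, 0), (1024, 1024))
# Negative conditioning: steer away from the look of upscaled 512x512 images
negative = build_micro_conditioning((512, 512), (0, 0), (1024, 1024))
print(positive)
print(negative)
```

Changing `crops_coords_top_left` to `(256, 0)` in this picture corresponds to asking for an image that looks as if it had been cropped 256 pixels from the top-left corner along the first coordinate.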

- Specify negative cropping coordinates to steer generation away from certain cropping parameters:

In [None]:
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(
    prompt=prompt,
    negative_original_size=(512, 512),
    negative_crops_coords_top_left=(0, 0),
    negative_target_size=(1024, 1024),
).images[0]
image

#### Use a different prompt for each text-encoder
- SDXL uses two text-encoders, so it is possible to pass a different prompt to each text-encoder, which can improve quality.
  - Pass your original prompt to `prompt` and the second prompt to `prompt_2`:

In [None]:
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

# prompt is passed to OAI CLIP-ViT/L-14
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# prompt_2 is passed to OpenCLIP-ViT/bigG-14
prompt_2 = "Van Gogh painting"
image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0]
image

#### Optimizations
- SDXL is a large model, and you may need to optimize memory to get it to run on your hardware.
- Offload the model to the CPU with `enable_model_cpu_offload()` for out-of-memory errors:
```
- base.to("cuda")
- refiner.to("cuda")
+ base.enable_model_cpu_offload()
+ refiner.enable_model_cpu_offload()
```
- Use `torch.compile` for a ~20% speed-up (requires torch>=2.0):
```
+ base.unet = torch.compile(base.unet, mode="reduce-overhead", fullgraph=True)
+ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True)
```
- Enable xFormers to run `SDXL` if you are using torch<2.0:
```
+ base.enable_xformers_memory_efficient_attention()
+ refiner.enable_xformers_memory_efficient_attention()
```
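
The choice between the last two options depends only on the installed PyTorch version. As a rough sketch (the helper `pick_attention_optimization` is hypothetical; it only parses a version string and does not touch any pipeline):

```python
def pick_attention_optimization(torch_version):
    """Return which speed-up from the list above applies to a PyTorch version string."""
    # Keep only the major version, ignoring local suffixes such as "+cu121"
    major = int(torch_version.split("+")[0].split(".")[0])
    return "torch.compile" if major >= 2 else "xformers"


print(pick_attention_optimization("2.3.1+cu121"))
print(pick_attention_optimization("1.13.1"))
```

In a notebook you could call it as `pick_attention_optimization(torch.__version__)` and branch on the result before applying either optimization.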


-----
### **Stable Diffusion XL Turbo**
- `SDXL Turbo` is an adversarial time-distilled `Stable Diffusion XL (SDXL)` model capable of running inference in as little as 1 step.

<br>

#### Load model checkpoints
- Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the `from_pretrained()` method:

In [None]:
from diffusers import AutoPipelineForText2Image
from diffusers import StableDiffusionXLPipeline, EulerAncestralDiscreteScheduler

from diffusers.utils import load_image, make_image_grid

In [None]:
pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16")
pipeline = pipeline.to("cuda")

- You can also use the `from_single_file()` method to load a model checkpoint stored in a single file format (`.ckpt` or `.safetensors`) from the Hub or locally.
  - You need to set `timestep_spacing="trailing"` (feel free to experiment with the other scheduler config values to get better results):

In [None]:
pipeline = StableDiffusionXLPipeline.from_single_file(
    "https://huggingface.co/stabilityai/sdxl-turbo/blob/main/sd_xl_turbo_1.0_fp16.safetensors",
    torch_dtype=torch.float16, variant="fp16")
pipeline = pipeline.to("cuda")
pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(pipeline.scheduler.config, timestep_spacing="trailing")
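
To see why the timestep spacing matters for a model that runs in one to four steps, the sketch below compares two common spacing strategies. It is a simplified, pure-Python approximation that assumes 1000 training timesteps; the helper names are hypothetical, and real schedulers apply further details such as step offsets.

```python
def leading_timesteps(num_inference_steps, num_train_timesteps=1000):
    """Leading spacing: evenly spaced steps counted up from timestep 0."""
    ratio = num_train_timesteps // num_inference_steps
    return [i * ratio for i in reversed(range(num_inference_steps))]


def trailing_timesteps(num_inference_steps, num_train_timesteps=1000):
    """Trailing spacing: evenly spaced steps counted down from the last timestep."""
    ratio = num_train_timesteps / num_inference_steps
    return [round(num_train_timesteps - i * ratio) - 1 for i in range(num_inference_steps)]


for steps in (1, 4):
    print(steps, leading_timesteps(steps), trailing_timesteps(steps))
```

With a single step, the leading schedule in this sketch starts at timestep 0 (almost no noise), while the trailing schedule starts at 999 (the noisiest timestep), which is the situation a one-step model has to denoise.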

#### Text-to-image
- By default, `SDXL Turbo` generates a 512x512 image, and that resolution gives the best results.
  - You can try setting the height and width parameters to 768x768 or 1024x1024, but you should expect quality degradations when doing so.

- Make sure to set `guidance_scale` to `0.0` to disable classifier-free guidance, as the model was trained without it.
  - A single inference step is enough to generate high quality images. Increasing the number of steps to 2, 3 or 4 should improve image quality.

In [None]:
pipeline_text2image = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16")
pipeline_text2image = pipeline_text2image.to("cuda")

prompt = "A cinematic shot of a baby racoon wearing an intricate italian priest robe."

image = pipeline_text2image(prompt=prompt, guidance_scale=0.0, num_inference_steps=1).images[0]
image
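
For context, classifier-free guidance combines an unconditional and a text-conditioned noise prediction. The sketch below shows the usual combination rule on plain lists. `apply_guidance` is a hypothetical helper, and treating scales of 1.0 or less as "guidance off" is an assumption about how a pipeline might skip the extra unconditional pass.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and text-conditioned predictions (toy version)."""
    if guidance_scale <= 1.0:
        # Guidance disabled: only the text-conditioned prediction is used
        return list(noise_text)
    # Push the prediction away from the unconditional one, towards the text-conditioned one
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0]
text = [1.0, 3.0]
print(apply_guidance(uncond, text, guidance_scale=0.0))
print(apply_guidance(uncond, text, guidance_scale=5.0))
```

With guidance disabled the text-conditioned prediction passes through unchanged, which is the behavior a model trained without guidance expects.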

#### Speed-up `SDXL Turbo` even more
- Compile the UNet if you are using PyTorch version 2.0 or higher.
  - The first inference run will be very slow, but subsequent ones will be much faster.

- When using the default VAE, keep it in float32 to avoid costly dtype conversions before and after each generation.
  - You only need to do this once before your first generation:

In [None]:
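# Compile the UNet of the SDXL Turbo pipeline loaded above; the first call is slow while it compiles
pipeline_text2image.unet = torch.compile(pipeline_text2image.unet, mode="reduce-overhead", fullgraph=True)
# Keep the VAE in float32 to avoid dtype conversions around each generation
pipeline_text2image.upcast_vae()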

-----
### **Kandinsky**
- The Kandinsky models are a series of multilingual text-to-image generation models.
  - The `Kandinsky 2.0` model uses two multilingual text encoders and concatenates those results for the UNet.
  - `Kandinsky 2.1` changes the architecture to include an image prior model (CLIP) to generate a mapping between text and image embeddings and uses a `Modulating Quantized Vectors (MoVQ)` decoder - which adds a spatial conditional normalization layer to increase photorealism - to decode the latents into images.
  - `Kandinsky 2.2` improves on the previous model by replacing the image encoder of the image prior model with a larger `CLIP-ViT-G` model to improve quality.
    - The only difference with `Kandinsky 2.1` is `Kandinsky 2.2` doesn’t accept prompt as an input when decoding the latents. Instead, `Kandinsky 2.2` only accepts `image_embeds` during decoding.
  - `Kandinsky 3` simplifies the architecture and shifts away from the two-stage generation process involving the prior model and diffusion model and uses `Flan-UL2` to encode text, a `UNet` with BigGan-deep blocks, and `Sber-MoVQGAN` to decode the latents into images.
    - Text understanding and generated image quality are primarily achieved by using a larger text encoder and UNet.


In [17]:
from diffusers import KandinskyPriorPipeline, KandinskyPipeline
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
from diffusers import Kandinsky3Pipeline

from diffusers import AutoPipelineForText2Image
from diffusers import AutoPipelineForImage2Image
from diffusers import AutoPipelineForInpainting

from diffusers import KandinskyImg2ImgPipeline, KandinskyPriorPipeline
from diffusers import KandinskyV22Img2ImgPipeline
from diffusers import Kandinsky3Img2ImgPipeline

from diffusers import KandinskyInpaintPipeline
from diffusers import KandinskyV22InpaintPipeline, KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline

from diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22ControlnetImg2ImgPipeline

from diffusers.models.attention_processor import AttnAddedKVProcessor2_0

from diffusers.utils import load_image
from diffusers.utils import make_image_grid

#### Text-to-image
- To use the Kandinsky models for any task, you always start by setting up the prior pipeline to encode the prompt and generate the image embeddings.
  - The prior pipeline also generates `negative_image_embeds` that correspond to the negative prompt "".
  - You can pass an actual `negative_prompt` to the prior pipeline, but this’ll increase the effective batch size of the prior pipeline by 2x.

In [None]:
prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16).to("cuda")
pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16).to("cuda")

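prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality" # optional to include a negative prompt, but results are usually better
image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt=negative_prompt, guidance_scale=1.0).to_tuple()
# Decode the image embeddings into an image with the text-to-image pipeline
image = pipeline(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]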

- Diffusers also provides an end-to-end API with the `KandinskyCombinedPipeline` and `KandinskyV22CombinedPipeline`, meaning you don’t have to separately load the prior and text-to-image pipeline.
  - The combined pipeline automatically loads both the prior model and the decoder.
  - You can still set different values for the prior pipeline with the `prior_guidance_scale` and `prior_num_inference_steps` parameters if you want.

- Use the `AutoPipelineForText2Image` to automatically call the combined pipelines under the hood:

In [None]:
pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()

prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"

image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0]
image

#### Interpolation
- Interpolation allows you to explore the latent space between the image and text embeddings which is a cool way to see some of the prior model’s intermediate outputs.
  - Load the prior pipeline and two images you’d like to interpolate:

In [None]:
prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg")
make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2)

- Specify the text or images to interpolate, and set the weights for each text or image.
  - Experiment with the weights to see how they affect the interpolation!

In [None]:
images_texts = ["a cat", img_1, img_2]
weights = [0.3, 0.3, 0.4]

- Call the interpolate function to generate the embeddings, and then pass them to the pipeline to generate the image:

In [None]:
prompt = ""
prior_out = prior_pipeline.interpolate(images_texts, weights)

# The embeddings come from the Kandinsky 2.1 prior, so decode them with the matching 2.1 pipeline
pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda")

image = pipeline(prompt, **prior_out, height=768, width=768).images[0]
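
Conceptually, the weights describe a weighted blend of the individual embeddings. The sketch below illustrates that idea with toy vectors. `interpolate_embeddings` is a hypothetical helper written for this note, and the 2-dimensional lists merely stand in for real text and image embeddings.

```python
def interpolate_embeddings(embeddings, weights):
    """Weighted sum of equal-length embedding vectors."""
    if len(embeddings) != len(weights):
        raise ValueError("need exactly one weight per embedding")
    dim = len(embeddings[0])
    return [sum(w * e[i] for e, w in zip(embeddings, weights)) for i in range(dim)]


# Toy 2-dimensional stand-ins for the text, cat and painting embeddings
text_emb, cat_emb, painting_emb = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
mixed = interpolate_embeddings([text_emb, cat_emb, painting_emb], [0.3, 0.3, 0.4])
print(mixed)
```

Raising one weight pulls the blended vector towards that input, which is why experimenting with the weights changes how strongly each text or image shows up in the result.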

#### ControlNet
- ControlNet enables conditioning large pretrained diffusion models with additional inputs such as a depth map or edge detection.
  - You can condition `Kandinsky 2.2` with a depth map so the model understands and preserves the structure of the depth image.
 
- Use the depth-estimation Pipeline from `Transformers` to process the image and retrieve the depth map:

In [None]:
img = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
).resize((768, 768))

def make_hint(image, depth_estimator):
    image = depth_estimator(image)["depth"]
    image = np.array(image)
    image = image[:, :, None]
    image = np.concatenate([image, image, image], axis=2)
    detected_map = torch.from_numpy(image).float() / 255.0
    hint = detected_map.permute(2, 0, 1)
    return hint

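from transformers import pipeline as hf_pipeline  # aliased because `pipeline` now refers to a Diffusers pipeline

depth_estimator = hf_pipeline("depth-estimation")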
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")

- Load the prior pipeline and the `KandinskyV22ControlnetPipeline`:

In [None]:
prior_pipeline = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

pipeline = KandinskyV22ControlnetPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
).to("cuda")

- Generate the image embeddings from a prompt and negative prompt:

In [None]:
prompt = "A robot, 4k photo"
negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"

generator = torch.Generator(device="cuda").manual_seed(43)

image_emb, zero_image_emb = prior_pipeline(
    prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator
).to_tuple()

- Pass the image embeddings and the depth image to the `KandinskyV22ControlnetPipeline` to generate an image:

In [13]:
image = pipeline(image_embeds=image_emb, negative_image_embeds=zero_image_emb, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0]

#### Optimizations
- Kandinsky is unique because it requires a prior pipeline to generate the mappings, and a second pipeline to decode the latents into an image.
  - Optimization efforts should be focused on the second pipeline because that is where the bulk of the computation is done.
  - Here are some tips to improve Kandinsky during inference.

- Enable xFormers if you’re using PyTorch < 2.0:

In [None]:
pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
pipe.enable_xformers_memory_efficient_attention()

- Enable `torch.compile` if you’re using PyTorch >= 2.0 to automatically use scaled dot-product attention (SDPA):

In [None]:
pipe.unet.to(memory_format=torch.channels_last)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

- Relying on SDPA is the same as explicitly setting the attention processor to `AttnAddedKVProcessor2_0`:

In [None]:
pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0())

- Offload the model to the CPU with `enable_model_cpu_offload()` to avoid out-of-memory errors:

In [None]:
pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

- The text-to-image pipeline uses the `DDIMScheduler` but you can replace it with another scheduler like `DDPMScheduler` to see how that affects the tradeoff between inference speed and image quality:

In [None]:
scheduler = DDPMScheduler.from_pretrained("kandinsky-community/kandinsky-2-1", subfolder="ddpm_scheduler")
pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", scheduler=scheduler, torch_dtype=torch.float16, use_safetensors=True).to("cuda")

-----
### **IP-Adapter**
- `IP-Adapter` is an image prompt adapter that can be plugged into diffusion models to enable image prompting without any changes to the underlying model.
  - This adapter can be reused with other models finetuned from the same base model and it can be combined with other adapters like ControlNet.
  - The key idea behind `IP-Adapter` is the decoupled cross-attention mechanism which adds a separate cross-attention layer just for image features instead of using the same cross-attention layer for both text and image features.
    - This allows the model to learn more image-specific features.

<br>

#### General tasks
- `set_ip_adapter_scale()` method controls the amount of text or image conditioning to apply to the model.
  - A value of `1.0` means the model is only conditioned on the image prompt.
  - Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt.
  - Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results.

- Try adding `low_cpu_mem_usage=True` to the `load_ip_adapter()` method to speed up loading.

In [21]:
from diffusers import AutoPipelineForText2Image
from diffusers import AutoPipelineForImage2Image
from diffusers import AutoPipelineForInpainting

from diffusers.image_processor import IPAdapterMaskProcessor
from diffusers import StableDiffusionPipeline, DDIMScheduler, AutoPipelineForText2Image
from diffusers import DiffusionPipeline, LCMScheduler

from insightface.app import FaceAnalysis
from insightface.utils import face_align

from transformers import CLIPVisionModelWithProjection

- Crafting the precise text prompt to generate the image you want can be difficult because it may not always capture what you’d like to express.
  - Adding an image alongside the text prompt helps the model better understand what it should generate and can lead to more accurate results.

- Load a `Stable Diffusion XL (SDXL)` model and insert an `IP-Adapter` into the model with the `load_ip_adapter()` method.
  - Use the subfolder parameter to load the `SDXL` model weights.

In [None]:
pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
pipeline.set_ip_adapter_scale(0.6)
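
To build intuition for what the scale does, the sketch below models decoupled cross-attention as a text branch plus a scaled image branch. This is a toy illustration on plain lists: `decoupled_cross_attention` is a hypothetical helper, and the numbers are made up rather than real attention outputs.

```python
def decoupled_cross_attention(text_attn, image_attn, scale):
    """Add the image branch, weighted by the adapter scale, to the text branch."""
    return [t + scale * i for t, i in zip(text_attn, image_attn)]


text_attn = [0.2, 0.4]    # toy output of the text cross-attention
image_attn = [1.0, -1.0]  # toy output of the separate image cross-attention
print(decoupled_cross_attention(text_attn, image_attn, scale=0.0))  # text branch only
print(decoupled_cross_attention(text_attn, image_attn, scale=0.6))  # image prompt mixed in
```

In this picture, a larger scale lets the image prompt contribute more to the combined features, while a scale of 0 leaves the text-conditioned result untouched.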

- Create a text prompt and load an image prompt before passing them to the pipeline to generate an image.

In [None]:
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
generator = torch.Generator(device="cpu").manual_seed(0)
images = pipeline(
    prompt="a polar bear sitting in a chair drinking a milkshake",
    ip_adapter_image=image,
    negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
    num_inference_steps=100,
    generator=generator,
).images
images[0]

#### Configure parameters
- There are a couple of `IP-Adapter` parameters that are useful to know about and can help you with your image generation tasks.
  - These parameters can make your workflow more efficient or give you more control over image generation.

<br>

- **Image embeddings**
  - IP-Adapter enabled pipelines provide the `ip_adapter_image_embeds` parameter to accept precomputed image embeddings.
  - This is particularly useful in scenarios where you need to run the `IP-Adapter` pipeline multiple times because you have more than one image.
    - ex. `Multi IP-Adapter` is a specific use case where you provide multiple styling images to generate a specific image in a specific style.
      - Loading and encoding multiple images each time you use the pipeline would be inefficient.
      - Instead, you can precompute and save the image embeddings to disk (which can save a lot of space if you’re using high-quality images) and load them when you need them.
    - This parameter also gives you the flexibility to load embeddings from other sources.
    - Call the `prepare_ip_adapter_image_embeds()` method to encode and generate the image embeddings.
      - Then you can save them to disk with `torch.save`.
  - If you’re using IP-Adapter with `ip_adapter_image_embeds` instead of `ip_adapter_image`, you can set `load_ip_adapter(image_encoder_folder=None, ...)` because you don’t need to load an encoder to generate the image embeddings.

In [None]:
image_embeds = pipeline.prepare_ip_adapter_image_embeds(
    ip_adapter_image=image,
    ip_adapter_image_embeds=None,
    device="cuda",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
)

torch.save(image_embeds, "image_embeds.ipadpt")

image_embeds = torch.load("image_embeds.ipadpt")
images = pipeline(
    prompt="a polar bear sitting in a chair drinking a milkshake",
    ip_adapter_image_embeds=image_embeds,
    negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
    num_inference_steps=100,
    generator=generator,
).images

- **IP-Adapter masking**
  - Binary masks specify which portion of the output image should be assigned to an `IP-Adapter`.
  - This is useful for composing more than one `IP-Adapter` image. For each input `IP-Adapter` image, you must provide a binary mask.
  - To start, preprocess the input `IP-Adapter` images with `IPAdapterMaskProcessor.preprocess()` to generate their masks.
    - For optimal results, provide the output height and width to `IPAdapterMaskProcessor.preprocess()`.
    - This ensures masks with different aspect ratios are appropriately stretched.
    - If the input masks already match the aspect ratio of the generated image, you don’t have to set the height and width.

In [None]:
mask1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask1.png")
mask2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask2.png")

output_height = 1024
output_width = 1024

processor = IPAdapterMaskProcessor()
masks = processor.preprocess([mask1, mask2], height=output_height, width=output_width)

- When there is more than one input `IP-Adapter` image, load them as a list and provide the `IP-Adapter` scale list.
  - Each of the input `IP-Adapter` images here corresponds to one of the masks generated above.

In [None]:
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name=["ip-adapter-plus-face_sdxl_vit-h.safetensors"])
pipeline.set_ip_adapter_scale([[0.7, 0.7]])  # one scale for each image-mask pair

face_image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png")
face_image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl2.png")

ip_images = [[face_image1, face_image2]]

masks = [masks.reshape(1, masks.shape[0], masks.shape[2], masks.shape[3])]
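
The reshape above can be hard to read at a glance. The sketch below tracks only the tensor shape, assuming the processor returns one single-channel mask per input, i.e. a shape of `(num_masks, 1, height, width)`. `group_masks_for_one_adapter` is a hypothetical helper used purely to show the bookkeeping.

```python
def group_masks_for_one_adapter(shape):
    """Map (num_masks, 1, H, W) to (1, num_masks, H, W), mirroring the reshape above."""
    num_masks, channels, height, width = shape
    if channels != 1:
        raise ValueError("expected single-channel masks")
    return (1, num_masks, height, width)

preprocessed_shape = (2, 1, 1024, 1024)  # assumed shape for two preprocessed masks
grouped_shape = group_masks_for_one_adapter(preprocessed_shape)
print(grouped_shape)
```

Because a single `IP-Adapter` handles both face images here, both masks end up in one tensor, and that tensor is wrapped in a one-element list (one entry per loaded adapter).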

- Now pass the preprocessed masks to `cross_attention_kwargs` in the pipeline call.

In [None]:
generator = torch.Generator(device="cpu").manual_seed(0)
num_images = 1

image = pipeline(
    prompt="2 girls",
    ip_adapter_image=ip_images,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=20,
    num_images_per_prompt=num_images,
    generator=generator,
    cross_attention_kwargs={"ip_adapter_masks": masks}
).images[0]
image

#### Face model
- IP-Adapter’s image prompting and compatibility with other adapters and models make it a versatile tool for a variety of use cases.

- Generating accurate faces is challenging because they are complex and nuanced.
- Diffusers supports two `IP-Adapter` checkpoints specifically trained to generate faces from the `h94/IP-Adapter` repository:
  - `ip-adapter-full-face_sd15.safetensors` is conditioned with images of cropped faces and removed backgrounds
  - `ip-adapter-plus-face_sd15.safetensors` uses patch embeddings and is conditioned with images of cropped faces
- Diffusers supports all `IP-Adapter` checkpoints trained with face embeddings extracted by insightface face models.
  - Supported models are from the `h94/IP-Adapter-FaceID` repository.

- For face models, use the `h94/IP-Adapter` checkpoint.
  - It is also recommended to use `DDIMScheduler` or `EulerDiscreteScheduler` for face models.

In [None]:
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-full-face_sd15.bin")

pipeline.set_ip_adapter_scale(0.5)

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_einstein_base.png")
generator = torch.Generator(device="cpu").manual_seed(26)

image = pipeline(
    prompt="A photo of Einstein as a chef, wearing an apron, cooking in a French restaurant",
    ip_adapter_image=image,
    negative_prompt="lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=100,
    generator=generator,
).images[0]
image

- To use `IP-Adapter` FaceID models, first extract face embeddings with insightface.
  - Then pass the list of tensors to the pipeline as `ip_adapter_image_embeds`.

In [None]:
pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_ip_adapter("h94/IP-Adapter-FaceID", subfolder=None, weight_name="ip-adapter-faceid_sd15.bin", image_encoder_folder=None)
pipeline.set_ip_adapter_scale(0.6)

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png")

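from insightface.app import FaceAnalysis

ref_images_embeds = []
app = FaceAnalysis(name="buffalo_l", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))
# load_image returns an RGB PIL image; swap channels to the BGR order insightface expects
image = cv2.cvtColor(np.asarray(image), cv2.COLOR_RGB2BGR)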
faces = app.get(image)
image = torch.from_numpy(faces[0].normed_embedding)
ref_images_embeds.append(image.unsqueeze(0))
ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0)
neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda")

generator = torch.Generator(device="cpu").manual_seed(42)

images = pipeline(
    prompt="A photo of a girl",
    ip_adapter_image_embeds=[id_embeds],
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=20, num_images_per_prompt=1,
    generator=generator
).images
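
The chain of `unsqueeze`, `stack`, and `cat` calls above is easier to follow when only the shapes are tracked. The sketch below does that in plain Python, assuming each face embedding is a 512-dimensional vector. The helper functions are hypothetical shape-only stand-ins for the corresponding `torch` operations.

```python
def unsqueeze(shape, dim):
    # Insert a new axis of size 1 at position `dim`.
    return shape[:dim] + (1,) + shape[dim:]

def stack(shapes, dim=0):
    # Stack equally shaped tensors along a new axis.
    first = shapes[0]
    if any(s != first for s in shapes):
        raise ValueError("all shapes must match")
    return first[:dim] + (len(shapes),) + first[dim:]

def cat(shapes, dim=0):
    # Concatenate along an existing axis.
    first = shapes[0]
    total = sum(s[dim] for s in shapes)
    return first[:dim] + (total,) + first[dim + 1:]

embedding = (512,)                      # assumed size of one normed face embedding
per_face = unsqueeze(embedding, 0)      # (1, 512)
stacked = stack([per_face], dim=0)      # (1, 1, 512)
positive = unsqueeze(stacked, 0)        # (1, 1, 1, 512)
negative = positive                     # zeros_like keeps the shape
id_embeds_shape = cat([negative, positive], dim=0)
print(id_embeds_shape)
```

The leading axis of size 2 holds the all-zero negative embedding followed by the real one, which is the layout classifier-free guidance expects in this example.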

- Both `IP-Adapter` FaceID Plus and `Plus v2` models require CLIP image embeddings.
  - You can prepare face embeddings as shown previously, then you can extract and pass CLIP embeddings to the hidden image projection layers.

In [None]:
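from insightface.app import FaceAnalysis
from insightface.utils import face_align

# Reload the reference image: `image` was overwritten with an embedding tensor in the previous cell
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png")

ref_images_embeds = []
ip_adapter_images = []
app = FaceAnalysis(name="buffalo_l", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))
# load_image returns an RGB PIL image; swap channels to the BGR order insightface expects
image = cv2.cvtColor(np.asarray(image), cv2.COLOR_RGB2BGR)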
faces = app.get(image)
ip_adapter_images.append(face_align.norm_crop(image, landmark=faces[0].kps, image_size=224))
image = torch.from_numpy(faces[0].normed_embedding)
ref_images_embeds.append(image.unsqueeze(0))
ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0)
neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda")

clip_embeds = pipeline.prepare_ip_adapter_image_embeds(
  [ip_adapter_images], None, torch.device("cuda"), num_images, True)[0]

pipeline.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False # True if Plus v2

#### Multi IP-Adapter
- More than one `IP-Adapter` can be used at the same time to generate specific images in more diverse styles.
  - You can use `IP-Adapter-Face` to generate consistent faces and characters, and `IP-Adapter Plus` to generate those faces in a specific style.

- Load the image encoder with `CLIPVisionModelWithProjection`.

In [None]:
from transformers import CLIPVisionModelWithProjection

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter",
    subfolder="models/image_encoder",
    torch_dtype=torch.float16,
)

- Load a base model, scheduler, and the IP-Adapters.
  - The IP-Adapters to use are passed as a list to the `weight_name` parameter:
  - `ip-adapter-plus_sdxl_vit-h` uses patch embeddings and a ViT-H image encoder
  - `ip-adapter-plus-face_sdxl_vit-h` has the same architecture but it is conditioned with images of cropped faces

In [None]:
pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    image_encoder=image_encoder,
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="sdxl_models",
  weight_name=["ip-adapter-plus_sdxl_vit-h.safetensors", "ip-adapter-plus-face_sdxl_vit-h.safetensors"]
)
pipeline.set_ip_adapter_scale([0.7, 0.3])
pipeline.enable_model_cpu_offload()

- Load an image prompt and a folder containing images of a certain style you want to use.

In [None]:
face_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/women_input.png")
style_folder = "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/style_ziggy"
style_images = [load_image(f"{style_folder}/img{i}.png") for i in range(10)]

- Pass the image prompt and style images as a list to the `ip_adapter_image` parameter, and run the pipeline:

In [None]:
generator = torch.Generator(device="cpu").manual_seed(0)

image = pipeline(
    prompt="wonderwoman",
    ip_adapter_image=[style_images, face_image],
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=50, num_images_per_prompt=1,
    generator=generator,
).images[0]
image
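
With several adapters loaded, the `weight_name` list, the scale list, and the `ip_adapter_image` list all need one entry per adapter, in the same order. The sketch below checks that alignment before a pipeline call. `count_image_prompts` is a hypothetical helper, and the file names are placeholders rather than real images.

```python
def count_image_prompts(weight_names, scales, images):
    """Return how many image prompts each adapter receives, or raise if the lists disagree."""
    if not (len(weight_names) == len(scales) == len(images)):
        raise ValueError("expected one scale and one image entry per adapter")
    return {
        name: len(entry) if isinstance(entry, list) else 1
        for name, entry in zip(weight_names, images)
    }

weight_names = [
    "ip-adapter-plus_sdxl_vit-h.safetensors",
    "ip-adapter-plus-face_sdxl_vit-h.safetensors",
]
scales = [0.7, 0.3]
style_images = [f"img{i}.png" for i in range(10)]  # placeholder names
face_image = "women_input.png"                     # placeholder name

counts = count_image_prompts(weight_names, scales, [style_images, face_image])
print(counts)
```

The first adapter receives the ten style images and the second receives the single face image, matching the order used in `load_ip_adapter`.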

#### Instant Generation
- `Latent Consistency Models (LCM)` are diffusion models that can generate images in as few as 4 steps, whereas other diffusion models like `SDXL` typically require many more.
  - This is why image generation with an `LCM` feels “instantaneous”.
  - `IP-Adapters` can be plugged into an `LCM-LoRA` model to instantly generate images with an image prompt.

- `IP-Adapter` weights need to be loaded first, then you can use `load_lora_weights()` to load the `LoRA` style and weight you want to apply to your image.

- Try a lower `IP-Adapter` scale to condition image generation more on the `herge_style` checkpoint, and remember to use the special token `herge_style` in your prompt to trigger and apply the style.

In [None]:
from diffusers import LCMScheduler

model_id = "sd-dreambooth-library/herge-style"
lcm_lora_id = "latent-consistency/lcm-lora-sdv1-5"

pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipeline.load_lora_weights(lcm_lora_id)
pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
pipeline.enable_model_cpu_offload()

pipeline.set_ip_adapter_scale(0.4)

prompt = "herge_style woman in armor, best quality, high quality"
generator = torch.Generator(device="cpu").manual_seed(0)

ip_adapter_image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png")
image = pipeline(
    prompt=prompt,
    ip_adapter_image=ip_adapter_image,
    num_inference_steps=4,
    guidance_scale=1,
    generator=generator,  # use the seeded generator defined above for reproducibility
).images[0]
image
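
A back-of-the-envelope count shows why the settings above feel fast. The sketch assumes classifier-free guidance needs both a conditional and an unconditional noise prediction per step, and that it is skipped when the guidance scale is 1 or lower. The 50-step, 7.5-scale baseline is an illustrative choice, not a figure from this guide.

```python
def noise_predictions(num_inference_steps, guidance_scale):
    # Assumption: guidance adds a second (unconditional) prediction per step
    # and is skipped entirely when guidance_scale <= 1.
    per_step = 2 if guidance_scale > 1 else 1
    return num_inference_steps * per_step

lcm_cost = noise_predictions(num_inference_steps=4, guidance_scale=1)
baseline_cost = noise_predictions(num_inference_steps=50, guidance_scale=7.5)
print(lcm_cost, baseline_cost)
```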

#### Structural control
- To control image generation to an even greater degree, you can combine `IP-Adapter` with a model like ControlNet.
  - A `ControlNet` is also an adapter that can be inserted into a diffusion model to allow for conditioning on an additional control image.
  - The control image can be depth maps, edge maps, pose estimations, and more.
- Load a `ControlNetModel` checkpoint conditioned on depth maps, insert it into a diffusion model, and load the `IP-Adapter`.

In [None]:
controlnet_model_path = "lllyasviel/control_v11f1p_sd15_depth"
controlnet = ControlNetModel.from_pretrained(controlnet_model_path, torch_dtype=torch.float16)

pipeline = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16)
pipeline.to("cuda")
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")

- Load the IP-Adapter image and depth map.

In [None]:
ip_adapter_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/statue.png")
depth_map = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/depth.png")

- Pass the depth map and `IP-Adapter` image to the pipeline to generate an image.

In [None]:
generator = torch.Generator(device="cpu").manual_seed(33)
image = pipeline(
    prompt="best quality, high quality",
    image=depth_map,
    ip_adapter_image=ip_adapter_image,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=50,
    generator=generator,
).images[0]
image

#### Style & Layout control
- `InstantStyle` is a plug-and-play method on top of `IP-Adapter` that disentangles style and layout from the image prompt to control image generation.
  - You can generate images following only the style or layout from the image prompt, with significantly improved diversity.
  - This is achieved by activating `IP-Adapters` only in specific parts of the model.
  - By default, `IP-Adapters` are inserted into all layers of the model.
  - Use the `set_ip_adapter_scale()` method with a dictionary to assign scales to `IP-Adapter` at different layers.

In [None]:
pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")

scale = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
pipeline.set_ip_adapter_scale(scale)

- This will activate `IP-Adapter` at the second layer in the model’s down-part block 2 and up-part block 0.
  - The former is the layer where IP-Adapter injects layout information and the latter injects style.
  - By inserting `IP-Adapter` into these two layers, you can generate images that follow both the style and layout of the image prompt, with content more closely aligned to the text prompt.
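
To make the dictionary format concrete, the sketch below expands a partial scale dictionary into one value per attention layer. It rests on two assumptions: layers that are not listed fall back to a scale of 0, and the block layout in `ASSUMED_LAYOUT` is only a plausible example. `expand_scales` is a hypothetical helper, not the library's own code.

```python
# Assumed number of attention layers per block; real values depend on the UNet.
ASSUMED_LAYOUT = {
    "down": {"block_1": 2, "block_2": 2},
    "up": {"block_0": 3, "block_1": 3},
}

def expand_scales(scale, layout, default=0.0):
    """Return a flat {layer_name: scale} mapping, filling unlisted layers with `default`."""
    expanded = {}
    for part, blocks in layout.items():
        for block, num_layers in blocks.items():
            given = scale.get(part, {}).get(block)
            if given is None:
                given = [default] * num_layers
            if len(given) != num_layers:
                raise ValueError(f"{part}.{block} expects {num_layers} scales")
            for index, value in enumerate(given):
                expanded[f"{part}.{block}.attn_{index}"] = value
    return expanded

scale = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
expanded = expand_scales(scale, ASSUMED_LAYOUT)
active = sorted(name for name, value in expanded.items() if value > 0)
print(active)
```

Only the second attention layer of down block 2 and of up block 0 ends up active, which matches the layout and style layers described above.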

In [None]:
style_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg")

generator = torch.Generator(device="cpu").manual_seed(26)
image = pipeline(
    prompt="a cat, masterpiece, best quality, high quality",
    ip_adapter_image=style_image,
    negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry",
    guidance_scale=5,
    num_inference_steps=30,
    generator=generator,
).images[0]
image

- Inserting `IP-Adapter` into all layers often generates images that focus too heavily on the image prompt and lose diversity.
  - Activate `IP-Adapter` only in the style layer and then call the pipeline again.

In [None]:
scale = {
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
pipeline.set_ip_adapter_scale(scale)

generator = torch.Generator(device="cpu").manual_seed(26)
image = pipeline(
    prompt="a cat, masterpiece, best quality, high quality",
    ip_adapter_image=style_image,
    negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry",
    guidance_scale=5,
    num_inference_steps=30,
    generator=generator,
).images[0]
image

----
### **OmniGen**
- Unlike existing text-to-image models, `OmniGen` is a single model designed to handle a variety of tasks (e.g., text-to-image, image editing, controllable generation).
  - It has a minimalist architecture, consisting of only a VAE and a transformer module, for joint modeling of text and images.
  - It can process arbitrary mixed text-and-image (multimodal) inputs as instructions for image generation, rather than relying solely on text.
  - For more information, refer to the OmniGen paper. This guide walks through using OmniGen for various tasks and use cases.

In [22]:
from diffusers import OmniGenPipeline

- Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the `from_pretrained()` method.

In [None]:
pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1-diffusers", torch_dtype=torch.bfloat16)

- By default, `OmniGen` generates a 1024x1024 image.
  - Try setting the `height` and `width` parameters to generate images of a different size.

In [None]:
pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt = "Realistic photo. A young woman sits on a sofa, holding a book and facing the camera. She wears delicate silver hoop earrings adorned with tiny, sparkling diamonds that catch the light, with her long chestnut hair cascading over her shoulders. Her eyes are focused and gentle, framed by long, dark lashes. She is dressed in a cozy cream sweater, which complements her warm, inviting smile. Behind her, there is a table with a cup of water in a sleek, minimalist blue mug. The background is a serene indoor setting with soft natural light filtering through a window, adorned with tasteful art and flowers, creating a cozy and peaceful ambiance. 4K, HD."
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=3,
    generator=torch.Generator(device="cpu").manual_seed(111),
).images[0]
image.save("output.png")

#### Image Edit
- `OmniGen` supports multimodal inputs.
  - When the input includes an image, you need to add a placeholder `<|image_1|>` in the text prompt to represent the image.
  - It is recommended to enable `use_input_image_size_as_output` to keep the edited image the same size as the original image.

In [None]:
pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt="<img><|image_1|></img> Remove the woman's earrings. Replace the mug with a clear glass filled with sparkling iced cola."
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png")]
image = pipe(
    prompt=prompt, 
    input_images=input_images, 
    guidance_scale=2, 
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(222)
).images[0]
image.save("output.png")
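
Since each input image needs a matching placeholder in the prompt, a quick check before calling the pipeline can catch mismatches early. The sketch below assumes placeholders are numbered consecutively from 1, as in the examples in this guide. `placeholder_indices` and `matches_inputs` are hypothetical helpers.

```python
import re

PLACEHOLDER = re.compile(r"<\|image_(\d+)\|>")

def placeholder_indices(prompt):
    """Return the sorted, de-duplicated image indices referenced in the prompt."""
    return sorted({int(index) for index in PLACEHOLDER.findall(prompt)})

def matches_inputs(prompt, num_input_images):
    """True when the prompt references exactly images 1..num_input_images."""
    return placeholder_indices(prompt) == list(range(1, num_input_images + 1))

edit_prompt = "<img><|image_1|></img> Remove the woman's earrings."
ok = matches_inputs(edit_prompt, num_input_images=1)
missing = matches_inputs("Remove the woman's earrings.", num_input_images=1)
print(ok, missing)
```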

- `OmniGen` has some interesting features, such as visual reasoning.

In [None]:
prompt="If the woman is thirsty, what should she take? Find it in the image and highlight it in blue. <img><|image_1|></img>"
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png")]
image = pipe(
    prompt=prompt, 
    input_images=input_images, 
    guidance_scale=2, 
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(0)
).images[0]
image.save("output.png")

#### Controllable Generation
- `OmniGen` can handle several classic computer vision tasks.
  - `OmniGen` can detect human skeletons in input images, which can be used as control conditions to generate new images.

In [None]:
pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt="Detect the skeleton of human in this image: <img><|image_1|></img>"
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png")]
image1 = pipe(
    prompt=prompt, 
    input_images=input_images, 
    guidance_scale=2, 
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(333)
).images[0]
image1.save("image1.png")

prompt="Generate a new photo using the following picture and text as conditions: <img><|image_1|></img>\n A young boy is sitting on a sofa in the library, holding a book. His hair is neatly combed, and a faint smile plays on his lips, with a few freckles scattered across his cheeks. The library is quiet, with rows of shelves filled with books stretching out behind him."
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/skeletal.png")]
image2 = pipe(
    prompt=prompt, 
    input_images=input_images, 
    guidance_scale=2, 
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(333)
).images[0]
image2.save("image2.png")

- `OmniGen` can also directly use relevant information from input images to generate new images.

In [None]:
pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt="Following the pose of this image <img><|image_1|></img>, generate a new photo: A young boy is sitting on a sofa in the library, holding a book. His hair is neatly combed, and a faint smile plays on his lips, with a few freckles scattered across his cheeks. The library is quiet, with rows of shelves filled with books stretching out behind him."
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png")]
image = pipe(
    prompt=prompt, 
    input_images=input_images, 
    guidance_scale=2, 
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(0)
).images[0]
image.save("output.png")

#### ID and Object Preserving
- `OmniGen` can generate multiple images based on the people and objects in the input image and supports inputting multiple images simultaneously.
  - `OmniGen` can extract desired objects from an image containing multiple objects based on instructions.

In [None]:
pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt="A man and a woman are sitting at a classroom desk. The man is the man with yellow hair in <img><|image_1|></img>. The woman is the woman on the left of <img><|image_2|></img>"
input_image_1 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/3.png")
input_image_2 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/4.png")
input_images=[input_image_1, input_image_2]
image = pipe(
    prompt=prompt, 
    input_images=input_images, 
    height=1024,
    width=1024,
    guidance_scale=2.5, 
    img_guidance_scale=1.6,
    generator=torch.Generator(device="cpu").manual_seed(666)
).images[0]
image.save("output.png")

In [None]:
pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt="A woman is walking down the street, wearing a white long-sleeve blouse with lace details on the sleeves, paired with a blue pleated skirt. The woman is <img><|image_1|></img>. The long-sleeve blouse and a pleated skirt are <img><|image_2|></img>."
input_image_1 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/emma.jpeg")
input_image_2 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/dress.jpg")
input_images=[input_image_1, input_image_2]
image = pipe(
    prompt=prompt, 
    input_images=input_images, 
    height=1024,
    width=1024,
    guidance_scale=2.5, 
    img_guidance_scale=1.6,
    generator=torch.Generator(device="cpu").manual_seed(666)
).images[0]
image.save("output.png")

#### Optimization when Using Multiple Images
- For the text-to-image task, `OmniGen` requires relatively little memory and time (9GB of memory and 31s for a 1024x1024 image on an A800 GPU).
  - However, when using input images, the computational cost increases.

- Like other pipelines, you can reduce memory usage by offloading the model: `pipe.enable_model_cpu_offload()` or `pipe.enable_sequential_cpu_offload()`.
  - Decrease computational overhead by reducing the `max_input_image_size`.
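
A rough estimate shows why smaller input images help. The sketch assumes one transformer token per 16x16 pixel area (8x VAE downsampling followed by 2x2 patches). That ratio is an assumption about the model, and `cap_size` is a hypothetical stand-in for whatever resizing the pipeline applies.

```python
def image_tokens(height, width, pixels_per_token_side=16):
    # Assumed ratio: one token per 16x16 pixel area.
    return (height // pixels_per_token_side) * (width // pixels_per_token_side)

def cap_size(height, width, max_side):
    """Scale down so the longer side is at most `max_side`, keeping the aspect ratio."""
    longest = max(height, width)
    if longest <= max_side:
        return height, width
    ratio = max_side / longest
    return int(height * ratio), int(width * ratio)

full_tokens = image_tokens(1024, 1024)
capped_tokens = image_tokens(*cap_size(1024, 1024, max_side=512))
print(full_tokens, capped_tokens)
```

Under these assumptions, halving the side length cuts the number of tokens per input image by a factor of four, and attention cost grows faster than linearly with sequence length.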

----
### **Perturbed-Attention Guidance(PAG)**
- `Perturbed-Attention Guidance(PAG)` is a new diffusion sampling guidance that improves sample quality across both unconditional and conditional settings, achieving this without requiring further training or the integration of external modules.
  - PAG is designed to progressively enhance the structure of synthesized samples throughout the denoising process by considering the self-attention mechanisms’ ability to capture structural information.
  - It involves generating intermediate samples with degraded structure by substituting selected self-attention maps in diffusion U-Net with an identity matrix, and guiding the denoising process away from these degraded samples.

<br>

- **General tasks**
  - You can apply `PAG` to the `StableDiffusionXLPipeline` for tasks such as text-to-image, image-to-image, and inpainting.
  - To enable `PAG` for a specific task, load the pipeline using the AutoPipeline API with the `enable_pag=True` flag and the `pag_applied_layers` argument.

In [26]:
from diffusers import AutoPipelineForText2Image, ControlNetModel
from diffusers import AutoPipelineForImage2Image
from diffusers import AutoPipelineForInpainting

from transformers import CLIPVisionModelWithProjection

- The `pag_applied_layers` argument allows you to specify which layers `PAG` is applied to.
  - Use the `set_pag_applied_layers` method to update these layers after the pipeline has been created.
  - See the "Configure parameters" section below to learn more about applying `PAG` to other layers.
- If you already have a pipeline created and loaded, you can enable `PAG` on it using the `from_pipe` API with the `enable_pag` flag.
  - The `PAG` pipeline is created based on the pipeline and task you specified.
  - In the second cell below, `AutoPipelineForText2Image` receives a `StableDiffusionXLPipeline`, so a `StableDiffusionXLPAGPipeline` is created accordingly.
  - Note that this does not require additional memory, and you will have both `StableDiffusionXLPipeline` and `StableDiffusionXLPAGPipeline` loaded and ready to use.
  - The Diffusers documentation on reusing pipelines covers the `from_pipe` API in more detail.

In [None]:
pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    enable_pag=True,
    pag_applied_layers=["mid"],
    torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()

In [None]:
pipeline_sdxl = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
pipeline = AutoPipelineForText2Image.from_pipe(pipeline_sdxl, enable_pag=True)

- To generate an image, you will also need to pass a `pag_scale`.
  - When `pag_scale` increases, images gain more semantically coherent structures and exhibit fewer artifacts.
  - However, an overly large guidance scale can lead to smoother textures and slight saturation in the images, similar to CFG.
  - `pag_scale=3.0` is used in the official demo and works well in most use cases, but feel free to experiment and select the appropriate value for your needs.
    - `PAG` is disabled when `pag_scale=0`.

In [None]:
prompt = "an insect robot preparing a delicious meal, anime style"

for pag_scale in [0.0, 3.0]:
    generator = torch.Generator(device="cpu").manual_seed(0)
    images = pipeline(
        prompt=prompt,
        num_inference_steps=25,
        guidance_scale=7.0,
        generator=generator,
        pag_scale=pag_scale,
    ).images
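
The effect of `pag_scale` can be sketched on toy numbers. The code assumes the commonly cited combination rule, in which the perturbed-attention term is added on top of classifier-free guidance; the three "predictions" are made-up two-element vectors, not model outputs.

```python
def guided_noise(uncond, cond, perturbed, guidance_scale, pag_scale):
    # Assumed rule: CFG term plus a term pushing away from the perturbed prediction.
    return [
        u + guidance_scale * (c - u) + pag_scale * (c - p)
        for u, c, p in zip(uncond, cond, perturbed)
    ]

uncond = [0.0, 1.0]      # toy unconditional prediction
cond = [1.0, 3.0]        # toy text-conditioned prediction
perturbed = [0.5, 2.0]   # toy prediction with perturbed self-attention

cfg_only = guided_noise(uncond, cond, perturbed, guidance_scale=7.0, pag_scale=0.0)
with_pag = guided_noise(uncond, cond, perturbed, guidance_scale=7.0, pag_scale=3.0)
print(cfg_only, with_pag)
```

With `pag_scale=0.0` the extra term vanishes and the result is plain CFG, which is consistent with `PAG` being disabled at that value.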

#### `PAG` with `ControlNet`
- To use `PAG` with `ControlNet`, first create a `controlnet`.
  - Pass the controlnet and other PAG arguments to the `from_pretrained` method of the `AutoPipeline` for the specified task.

In [None]:
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    enable_pag=True,
    pag_applied_layers="mid",
    torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()

- If you already have a controlnet pipeline and want to enable `PAG`, you can use the `from_pipe` API: `AutoPipelineForText2Image.from_pipe(pipeline_controlnet, enable_pag=True)`

- You can use the pipeline in the same way you normally use `ControlNet` pipelines, with the added option to specify a `pag_scale` parameter.
  - Note that `PAG` works well for unconditional generation; the example below uses an empty prompt with `guidance_scale=0`.

In [None]:
from diffusers.utils import load_image
canny_image = load_image(
    "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/pag_control_input.png"
)

controlnet_conditioning_scale = 0.5  # example value; was undefined in the original cell

for pag_scale in [0.0, 3.0]:
    generator = torch.Generator(device="cpu").manual_seed(1)
    images = pipeline(
        prompt="",
        controlnet_conditioning_scale=controlnet_conditioning_scale,
        image=canny_image,
        num_inference_steps=50,
        guidance_scale=0,
        generator=generator,
        pag_scale=pag_scale,
    ).images
    images[0]

#### `PAG` with `IP-Adapter`
- `IP-Adapter` is a popular model that can be plugged into diffusion models to enable image prompting without any changes to the underlying model.
  - You can enable `PAG` on a pipeline with `IP-Adapter` loaded.

In [None]:
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter",
    subfolder="models/image_encoder",
    torch_dtype=torch.float16
)

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    image_encoder=image_encoder,
    enable_pag=True,
    torch_dtype=torch.float16
).to("cuda")

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.bin")

pag_scale = 5.0
ip_adapter_scale = 0.8

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")

pipeline.set_ip_adapter_scale(ip_adapter_scale)
generator = torch.Generator(device="cpu").manual_seed(0)
images = pipeline(
    prompt="a polar bear sitting in a chair drinking a milkshake",
    ip_adapter_image=image,
    negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
    num_inference_steps=25,
    guidance_scale=3.0,
    generator=generator,
    pag_scale=pag_scale,
).images
images[0]

#### Configure parameters
- The `pag_applied_layers` argument allows you to specify which layers `PAG` is applied to.
  - By default, it applies only to the mid blocks.
    - Changing this setting will significantly impact the output.
  - You can use the `set_pag_applied_layers` method to adjust the `PAG` layers after the pipeline is created, helping you find the optimal layers for your model.

In [None]:
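prompt = "an insect robot preparing a delicious meal, anime style"
# Example values (these names were undefined in the original cell); adjust as needed
pag_layers = ["mid"]  # replace with the layers you want to test
guidance_scale = 7.0
pag_scale = 3.0
pipeline.set_pag_applied_layers(pag_layers)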
generator = torch.Generator(device="cpu").manual_seed(0)
images = pipeline(
    prompt=prompt,
    num_inference_steps=25,
    guidance_scale=guidance_scale,
    generator=generator,
    pag_scale=pag_scale,
).images
images[0]

----
### **ControlNet**
- `ControlNet` is a type of model for controlling image diffusion models by conditioning the model with an additional input image.
  - There are many types of conditioning inputs (canny edge, user sketching, human pose, depth, and more) you can use to control a diffusion model.
  - This is hugely useful because it affords you greater control over image generation, making it easier to generate specific images without experimenting with different text prompts or denoising values as much.

- A `ControlNet` model has two sets of weights (or blocks) connected by a zero-convolution layer:
  - a locked copy keeps everything a large pretrained diffusion model has learned
  - a trainable copy is trained on the additional conditioning input

- Since the locked copy preserves the pretrained model, training and implementing a `ControlNet` on a new conditioning input is as fast as finetuning any other model because you aren’t training the model from scratch.
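
The role of the zero-convolution layers can be illustrated with a toy block. The sketch uses scalar multiplications as stand-ins for 1x1 convolutions and two made-up functions for the locked and trainable copies, so it shows the idea rather than the real architecture.

```python
def conv1x1(x, weight, bias=0.0):
    # Scalar stand-in for a 1x1 convolution.
    return [weight * value + bias for value in x]

def locked_block(x):
    # Stand-in for the frozen pretrained weights.
    return [2.0 * value + 1.0 for value in x]

def trainable_block(x):
    # Stand-in for the trainable copy.
    return [3.0 * value - 0.5 for value in x]

def controlnet_block(x, condition, zero_w_in, zero_w_out):
    # The condition enters, and the residual leaves, through "zero convolutions".
    injected = [a + b for a, b in zip(x, conv1x1(condition, zero_w_in))]
    residual = conv1x1(trainable_block(injected), zero_w_out)
    return [a + b for a, b in zip(locked_block(x), residual)]

x = [0.5, -1.0]
condition = [1.0, 1.0]
at_init = controlnet_block(x, condition, zero_w_in=0.0, zero_w_out=0.0)
after_training = controlnet_block(x, condition, zero_w_in=0.1, zero_w_out=0.2)
print(at_init, after_training)
```

With the connecting weights at zero, the output equals that of the locked copy alone, so training starts from the pretrained model's behavior and the conditioning signal is introduced gradually.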

In [3]:
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
from diffusers import StableDiffusionControlNetImg2ImgPipeline
from diffusers import StableDiffusionControlNetInpaintPipeline

from diffusers import StableDiffusionXLControlNetPipeline, AutoencoderKL

- For text-to-image, you normally pass a text prompt to the model.
  - But with `ControlNet`, you can specify an additional conditioning input.
  - Let’s condition the model with a canny image, a white outline of an image on a black background.
  - `ControlNet` can use the canny image as a control to guide the model to generate an image with the same outline.

- Load an image and use the opencv-python library to extract the canny image:

In [None]:
original_image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)

image = np.array(original_image)

low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
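
The two thresholds passed to `cv2.Canny` control hysteresis: strong responses are always kept, and weaker ones are kept only when connected to a strong edge. The sketch below is a simplified one-dimensional version of that idea in plain Python; it is not OpenCV's implementation, and the gradient values are made up.

```python
def hysteresis_1d(magnitudes, low, high):
    """Keep strong responses (>= high) and weak ones (>= low) that touch a kept neighbour."""
    keep = [m >= high for m in magnitudes]
    weak = [low <= m < high for m in magnitudes]
    changed = True
    while changed:
        changed = False
        for i, is_weak in enumerate(weak):
            if is_weak and not keep[i]:
                left = i > 0 and keep[i - 1]
                right = i + 1 < len(keep) and keep[i + 1]
                if left or right:
                    keep[i] = True
                    changed = True
    # Edges are white (255) on a black (0) background, like the canny image above.
    return [255 if kept else 0 for kept in keep]

gradient = [50, 150, 250, 120, 90, 130, 30]  # toy gradient magnitudes
edges = hysteresis_1d(gradient, low=100, high=200)
print(edges)
```

The isolated value of 130 is dropped because it sits between the thresholds and has no kept neighbour, whereas 150 and 120 survive by touching the strong response of 250.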

- Load a `ControlNet` model conditioned on canny edge detection and pass it to the `StableDiffusionControlNetPipeline`.
  - Use the faster `UniPCMultistepScheduler` and enable model offloading to speed up inference and reduce memory usage.

In [None]:
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
)

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

- Pass your prompt and canny image to the pipeline:

In [None]:
output = pipe(
    "the mona lisa", image=canny_image
).images[0]
make_image_grid([original_image, canny_image, output], rows=1, cols=3)

#### ControlNet with Stable Diffusion XL
- We’ve trained two full-sized `ControlNet` models for `SDXL` conditioned on canny edge detection and depth maps.
  - We’re also experimenting with creating smaller versions of these `SDXL`-compatible `ControlNet` models so they are easier to run on resource-constrained hardware.

- Use an `SDXL ControlNet` conditioned on canny images to generate an image.
  - Start by loading an image and preparing the canny image:

In [None]:
original_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
)

image = np.array(original_image)

low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
make_image_grid([original_image, canny_image], rows=1, cols=2)

- Load an `SDXL ControlNet` model conditioned on canny edge detection and pass it to the `StableDiffusionXLControlNetPipeline`.
  - You can also enable model offloading to reduce memory usage.

In [None]:
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True
)
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    vae=vae,
    torch_dtype=torch.float16,
    use_safetensors=True
)
pipe.enable_model_cpu_offload()

- Pass your prompt (and optionally a negative prompt if you’re using one) and canny image to the pipeline:

In [None]:
prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
negative_prompt = 'low quality, bad quality, sketches'

image = pipe(
    prompt,
    negative_prompt=negative_prompt,
    image=canny_image,
    controlnet_conditioning_scale=0.5,
).images[0]
make_image_grid([original_image, canny_image, image], rows=1, cols=3)

- `StableDiffusionXLControlNetPipeline` can also be run in guess mode by setting `guess_mode=True`:

In [None]:
prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
negative_prompt = "low quality, bad quality, sketches"

original_image = load_image(
    "https://hf.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
)

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
)
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, vae=vae, torch_dtype=torch.float16, use_safetensors=True
)
pipe.enable_model_cpu_offload()

image = np.array(original_image)
image = cv2.Canny(image, 100, 200)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)

image = pipe(
    prompt, negative_prompt=negative_prompt, controlnet_conditioning_scale=0.5, image=canny_image, guess_mode=True,
).images[0]
make_image_grid([original_image, canny_image, image], rows=1, cols=3)

#### MultiControlNet
- Compose multiple `ControlNet` conditionings from different image inputs to create a `MultiControlNet`.
  - Mask conditionings such that they don’t overlap
  - Experiment with the `controlnet_conditioning_scale` parameter to determine how much weight to assign to each conditioning input
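
- How `controlnet_conditioning_scale` weights each conditioning can be sketched as a weighted sum of per-ControlNet residuals. This is a simplified illustration on plain lists: `combine_residuals` and the sample values are hypothetical, and the real pipeline works on tensors for each UNet block.

```python
# Simplified sketch: each ControlNet produces a residual that is scaled by
# its conditioning scale, and the scaled residuals are summed.
# `combine_residuals` is a hypothetical helper, not a diffusers API.

def combine_residuals(residuals, scales):
    if len(residuals) != len(scales):
        raise ValueError("need one scale per ControlNet residual")
    combined = [0.0] * len(residuals[0])
    for residual, scale in zip(residuals, scales):
        combined = [c + scale * r for c, r in zip(combined, residual)]
    return combined

pose_residual = [1.0, 0.0, 2.0]   # stand-in for the openpose ControlNet output
canny_residual = [0.0, 5.0, 5.0]  # stand-in for the canny ControlNet output

combined = combine_residuals([pose_residual, canny_residual], scales=[1.0, 0.8])
print(combined)
```

- Where one conditioning is zero (for example, a masked-out region), only the other conditioning contributes, which is why non-overlapping masks keep the inputs from competing.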

In [None]:
original_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
)
image = np.array(original_image)

low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)

# zero out middle columns of image where pose will be overlaid
zero_start = image.shape[1] // 4
zero_end = zero_start + image.shape[1] // 2
image[:, zero_start:zero_end] = 0

image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
make_image_grid([original_image, canny_image], rows=1, cols=2)

- Load a list of `ControlNet` models that correspond to each conditioning, and pass them to the `StableDiffusionXLControlNetPipeline`.
  - Use the faster `UniPCMultistepScheduler` and enable model offloading to reduce memory usage.

In [None]:
controlnets = [
    ControlNetModel.from_pretrained(
        "thibaud/controlnet-openpose-sdxl-1.0", torch_dtype=torch.float16
    ),
    ControlNetModel.from_pretrained(
        "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
    ),
]

vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnets, vae=vae, torch_dtype=torch.float16, use_safetensors=True
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

In [None]:
prompt = "a giant standing in a fantasy landscape, best quality"
negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"

generator = torch.manual_seed(1)

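# `openpose_image` is not created earlier in this notebook. One way to build it is with
# controlnet_aux (assumed to be installed); the sample image URL is assumed to be the one
# used in the upstream diffusers ControlNet guide.
from controlnet_aux import OpenposeDetector

openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
pose_source_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png"
)
openpose_image = openpose(pose_source_image)

images = [openpose_image.resize((1024, 1024)), canny_image.resize((1024, 1024))]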

images = pipe(
    prompt,
    image=images,
    num_inference_steps=25,
    generator=generator,
    negative_prompt=negative_prompt,
    num_images_per_prompt=3,
    controlnet_conditioning_scale=[1.0, 0.8],
).images
make_image_grid([original_image, canny_image, openpose_image,
                images[0].resize((512, 512)), images[1].resize((512, 512)), images[2].resize((512, 512))], rows=2, cols=3)

----
### **T2I-Adapter**
- `T2I-Adapter` is a lightweight adapter for controlling and providing more accurate structure guidance for text-to-image models.
  - It works by learning an alignment between the internal knowledge of the text-to-image model and an external control signal, such as edge detection or depth estimation.

- The `T2I-Adapter` design is simple: the condition is passed to four feature extraction blocks and three downsample blocks.
  - This makes it fast and easy to train different adapters for different conditions which can be plugged into the text-to-image model.
  - `T2I-Adapter` is similar to `ControlNet` except it is smaller (~77M parameters) and faster because it only runs once during the diffusion process.
  - The downside is that performance may be slightly worse than `ControlNet`.
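
- The "runs once" point can be sketched with a toy sampling loop. All names below are hypothetical stand-ins rather than diffusers APIs, and the toy ignores that a real ControlNet also receives the noisy latent and timestep. It only counts how often each kind of control module is evaluated.

```python
# Toy sketch: a ControlNet-style module runs inside the denoising loop,
# while a T2I-Adapter-style module runs once before it.
# All names are hypothetical stand-ins, not diffusers APIs.

def make_counting_module():
    calls = {"count": 0}

    def module(control_image):
        calls["count"] += 1
        # Stand-in "features": just scale the control values.
        return [0.5 * v for v in control_image]

    return module, calls

def denoise_step(latent, features):
    # Stand-in update that nudges the latent toward the control features.
    return [0.9 * z + 0.1 * f for z, f in zip(latent, features)]

def sample_controlnet_style(latent, control_image, module, num_steps):
    for _ in range(num_steps):
        features = module(control_image)  # evaluated at every step
        latent = denoise_step(latent, features)
    return latent

def sample_adapter_style(latent, control_image, module, num_steps):
    features = module(control_image)      # evaluated once, then reused
    for _ in range(num_steps):
        latent = denoise_step(latent, features)
    return latent

controlnet, controlnet_calls = make_counting_module()
adapter, adapter_calls = make_counting_module()

sample_controlnet_style([1.0, 1.0], [0.0, 2.0], controlnet, num_steps=30)
sample_adapter_style([1.0, 1.0], [0.0, 2.0], adapter, num_steps=30)

print(controlnet_calls["count"], adapter_calls["count"])
```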

- Make sure you have the following libraries installed.
```
pip install -q diffusers accelerate controlnet-aux==0.0.7
```

<br>

- Text-to-image models rely on a prompt to generate an image, but sometimes, text alone may not be enough to provide more accurate structural guidance.
  - `T2I-Adapter` allows you to provide an additional control image to guide the generation process.
  - ex. you can provide a canny image (a white outline of an image on a black background) to guide the model to generate an image with a similar structure.

In [None]:
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter
from controlnet_aux.canny import CannyDetector
from diffusers import StableDiffusionXLAdapterPipeline, EulerAncestralDiscreteScheduler, AutoencoderKL
from diffusers import MultiAdapter

#### Stable Diffusion 1.5
- Create a canny image with the opencv-python library.
- Load a `T2I-Adapter` conditioned on canny images and pass it to the `StableDiffusionAdapterPipeline`.


In [None]:
image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png")
image = np.array(image)

low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)
image = Image.fromarray(image)

In [None]:
adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_canny_sd15v2", torch_dtype=torch.float16)
pipeline = StableDiffusionAdapterPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    adapter=adapter,
    torch_dtype=torch.float16,
)
pipeline.to("cuda")

- Pass your prompt and control image to the pipeline.

In [None]:
generator = torch.Generator("cuda").manual_seed(0)

image = pipeline(
    prompt="cinematic photo of a plush and soft midcentury style rug on a wooden floor, 35mm photograph, film, professional, 4k, highly detailed",
    image=image,
    generator=generator,
).images[0]
image

#### Stable Diffusion XL
- Create a canny image with the `controlnet-aux` library.
- Load a `T2I-Adapter` conditioned on canny images and pass it to the `StableDiffusionXLAdapterPipeline`.

In [None]:
canny_detector = CannyDetector()

image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png")
image = canny_detector(image, detect_resolution=384, image_resolution=1024)

In [None]:
scheduler = EulerAncestralDiscreteScheduler.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="scheduler")
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-canny-sdxl-1.0", torch_dtype=torch.float16)
pipeline = StableDiffusionXLAdapterPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    adapter=adapter,
    vae=vae,
    scheduler=scheduler,
    torch_dtype=torch.float16,
    variant="fp16",
)
pipeline.to("cuda")

- Pass your prompt and control image to the pipeline.

In [None]:
generator = torch.Generator("cuda").manual_seed(0)

image = pipeline(
  prompt="cinematic photo of a plush and soft midcentury style rug on a wooden floor, 35mm photograph, film, professional, 4k, highly detailed",
  image=image,
  generator=generator,
).images[0]
image

#### MultiAdapter
- `T2I-Adapters` are also composable, allowing you to use more than one adapter to impose multiple control conditions on an image.
  - ex. you can use a pose map to provide structural control and a depth map for depth control.
    - This is enabled by the `MultiAdapter` class.

- Condition a text-to-image model with a pose and depth adapter.
  - Create your pose and depth images and place them in a list.

In [None]:
pose_image = load_image(
    "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png"
)
depth_image = load_image(
    "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png"
)
cond = [pose_image, depth_image]
prompt = ["Santa Claus walking into an office room with a beautiful city view"]

- Load the corresponding pose and depth adapters as a list in the `MultiAdapter` class.

In [None]:
adapters = MultiAdapter(
    [
        T2IAdapter.from_pretrained("TencentARC/t2iadapter_keypose_sd14v1"),
        T2IAdapter.from_pretrained("TencentARC/t2iadapter_depth_sd14v1"),
    ]
)
adapters = adapters.to(torch.float16)

- Load a `StableDiffusionAdapterPipeline` with the adapters, and pass your prompt and conditioned images to it.
  - Use the `adapter_conditioning_scale` to adjust the weight of each adapter on the image.

In [None]:
pipeline = StableDiffusionAdapterPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
    adapter=adapters,
).to("cuda")

image = pipeline(prompt, cond, adapter_conditioning_scale=[0.7, 0.7]).images[0]
image

----
### **Latent Consistency Model**
- `Latent Consistency Models (LCMs)` enable fast high-quality image generation by directly predicting the reverse diffusion process in latent space rather than pixel space.
  - `LCMs` try to predict the noiseless image from the noisy image in contrast to typical diffusion models that iteratively remove noise from the noisy image.
  - By avoiding the iterative sampling process, `LCMs` are able to generate high-quality images in 2-4 steps instead of 20-30 steps.

- `LCMs` are distilled from pretrained models which requires ~32 hours of A100 compute.
  - To speed this up, `LCM-LoRAs` train a `LoRA` adapter, which has far fewer parameters to train than the full model.
  - The `LCM-LoRA` can be plugged into a diffusion model once it has been trained.
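
- The parameter saving can be made concrete with a little arithmetic. The layer size and rank below are illustrative assumptions, not the actual `LCM-LoRA` configuration. The sketch compares one full weight matrix with the two low-rank factors a `LoRA` trains in its place.

```python
# Worked arithmetic with assumed, illustrative layer sizes and rank.

def full_weight_params(d_out, d_in):
    # A dense layer trains every entry of its d_out x d_in weight matrix.
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    # LoRA trains two low-rank factors: B (d_out x rank) and A (rank x d_in).
    return rank * (d_out + d_in)

d_out, d_in, rank = 1280, 1280, 64  # hypothetical attention projection and rank
full = full_weight_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)

print(full, lora, lora / full)
```

- For these assumed sizes the adapter trains about a tenth of the layer's parameters, and the ratio shrinks further as the rank is lowered.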

In [2]:
from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, LCMScheduler
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers import DiffusionPipeline

from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter

- To use `LCMs`, you need to load the `LCM` checkpoint for your supported model into `UNet2DConditionModel` and replace the scheduler with the `LCMScheduler`.

- Typically, batch size is doubled inside the pipeline for classifier-free guidance.
  - But `LCM` applies guidance with guidance embeddings and doesn’t need to double the batch size, which leads to faster inference.
  - The downside is that negative prompts don’t work with `LCM` because they don’t have any effect on the denoising process.
- The ideal range for `guidance_scale` is `[3., 13.]` because that is what the `UNet` was trained with.
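
- For contrast, standard classifier-free guidance combines an unconditional and a conditional noise prediction, which is why the batch is normally doubled. The sketch below assumes the usual formulation `uncond + scale * (cond - uncond)` on plain lists; `cfg_combine` is a hypothetical helper, not a diffusers API.

```python
# Minimal sketch of classifier-free guidance on plain lists.
# `cfg_combine` is a hypothetical helper, not a diffusers API.

def cfg_combine(noise_uncond, noise_cond, guidance_scale):
    # Push the prediction away from the unconditional estimate
    # and toward the prompt-conditioned one.
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]

noise_uncond = [0.0, 1.0]  # prediction for the empty (or negative) prompt
noise_cond = [1.0, 3.0]    # prediction for the text prompt

guided = cfg_combine(noise_uncond, noise_cond, guidance_scale=8.0)
unguided = cfg_combine(noise_uncond, noise_cond, guidance_scale=1.0)
print(guided, unguided)
```

- In this formulation a negative prompt takes the place of the empty prompt in the unconditional branch. An `LCM` has no such branch, which is consistent with negative prompts having no effect.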

In [None]:
unet = UNet2DConditionModel.from_pretrained(
    "latent-consistency/lcm-sdxl",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16, variant="fp16",
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"
generator = torch.manual_seed(0)
image = pipe(
    prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=8.0
).images[0]
image

- To use `LCM-LoRAs`, you need to replace the scheduler with the `LCMScheduler` and load the `LCM-LoRA` weights with the `load_lora_weights()` method.
- Typically, batch size is doubled inside the pipeline for classifier-free guidance.
  - But `LCM` applies guidance with guidance embeddings and doesn’t need to double the batch size, which leads to faster inference.
  - The downside is that negative prompts don’t work with `LCM` because they don’t have any effect on the denoising process.
  - You could use guidance with `LCM-LoRAs`, but it is very sensitive to high `guidance_scale` values and can lead to artifacts in the generated image.
  - The best values we’ve found are between `[1.0, 2.0]`.

- Replace `stabilityai/stable-diffusion-xl-base-1.0` with any finetuned model.
  - Try using the animagine-xl checkpoint to generate anime images with `SDXL`.

In [None]:
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    variant="fp16",
    torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"
generator = torch.manual_seed(42)
image = pipe(
    prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=1.0
).images[0]
image

#### Adapters: LoRA
- `LCMs` are compatible with adapters like `LoRA`, `ControlNet`, `T2I-Adapter`, and `AnimateDiff`.
  - You can bring the speed of `LCMs` to these adapters to generate images in a certain style or condition the model on another input like a canny image.

- `LoRA` adapters can be rapidly finetuned to learn a new style from just a few images and plugged into a pretrained model to generate images in that style.
  - Load the `LCM` checkpoint for your supported model into `UNet2DConditionModel` and replace the scheduler with the `LCMScheduler`.
  - Use the `load_lora_weights()` method to load the LoRA weights into the `LCM` and generate a styled image in a few steps.

In [None]:
unet = UNet2DConditionModel.from_pretrained(
    "latent-consistency/lcm-sdxl",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16, variant="fp16",
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("TheLastBen/Papercut_SDXL", weight_name="papercut.safetensors", adapter_name="papercut")

prompt = "papercut, a cute fox"
generator = torch.manual_seed(0)
image = pipe(
    prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=8.0
).images[0]
image

#### Adapters: ControlNet
- `ControlNets` are adapters that can be trained on a variety of inputs like canny edge, pose estimation, or depth.
- `ControlNet` can be inserted into the pipeline to provide additional conditioning and control to the model for more accurate generation.

- You can find additional ControlNet models trained on other inputs in lllyasviel’s repository.

<br>

- Load a `ControlNet` model trained on canny images and pass it to the `ControlNetModel`.
  - Then you can load an `LCM` model into `StableDiffusionControlNetPipeline` and replace the scheduler with the `LCMScheduler`.
  - Now pass the canny image to the pipeline and generate an image.


In [None]:
image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
).resize((512, 512))

image = np.array(image)

low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7",
    controlnet=controlnet,
    torch_dtype=torch.float16,
    safety_checker=None,
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

generator = torch.manual_seed(0)
image = pipe(
    "the mona lisa",
    image=canny_image,
    num_inference_steps=4,
    generator=generator,
).images[0]
make_image_grid([canny_image, image], rows=1, cols=2)

#### Adapters: T2I-Adapter
- `T2I-Adapter` is an even more lightweight adapter than `ControlNet`, that provides an additional input to condition a pretrained model with.
  - It is faster than `ControlNet` but the results may be slightly worse.

- Load a `T2IAdapter` trained on canny images and pass it to the `StableDiffusionXLAdapterPipeline`.
  - Load an `LCM` checkpoint into `UNet2DConditionModel` and replace the scheduler with the `LCMScheduler`.
    - Pass the canny image to the pipeline and generate an image.

In [None]:
# detect the canny map in low resolution to avoid high-frequency details
image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
).resize((384, 384))

image = np.array(image)

low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image).resize((1024, 1216))

# `variant` was misspelled as `varient`, so the fp16 weights were not explicitly requested
adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-canny-sdxl-1.0", torch_dtype=torch.float16, variant="fp16").to("cuda")

unet = UNet2DConditionModel.from_pretrained(
    "latent-consistency/lcm-sdxl",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    unet=unet,
    adapter=adapter,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

prompt = "the mona lisa, 4k picture, high quality"
negative_prompt = "extra digit, fewer digits, cropped, worst quality, low quality, glitch, deformed, mutated, ugly, disfigured"

generator = torch.manual_seed(0)
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    image=canny_image,
    num_inference_steps=4,
    guidance_scale=5,
    adapter_conditioning_scale=0.8,
    adapter_conditioning_factor=1,
    generator=generator,
).images[0]

----
### **Textual inversion**
- The `StableDiffusionPipeline` supports textual inversion, a technique that enables a model like Stable Diffusion to learn a new concept from just a few sample images.
  - This gives you more control over the generated images and allows you to tailor the model towards specific concepts.
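
- The core idea can be sketched with a toy vocabulary: the model's weights stay frozen, and the new concept is stored as one extra token with its own learned embedding vector. The vocabulary, vectors, placeholder token, and helper names below are all hypothetical and do not reflect how diffusers stores embeddings internally.

```python
# Toy sketch of the idea behind textual inversion: existing weights are left
# untouched, and only a new token with its own embedding vector is added.
# The vocabulary, vectors, and helpers are hypothetical.

vocab = {"a": 0, "photo": 1, "of": 2}
embedding_table = [[0.1, 0.0], [0.3, 0.2], [0.5, 0.9]]

def add_concept(vocab, embedding_table, token, learned_vector):
    # Register a placeholder token and append its learned vector.
    if token in vocab:
        raise ValueError(f"token {token!r} already exists")
    vocab[token] = len(embedding_table)
    embedding_table.append(list(learned_vector))
    return vocab[token]

def embed(prompt_tokens, vocab, embedding_table):
    # Look up one embedding vector per token in the prompt.
    return [embedding_table[vocab[t]] for t in prompt_tokens]

new_id = add_concept(vocab, embedding_table, "<my-concept>", [0.7, -0.4])
prompt_embeddings = embed(["a", "photo", "of", "<my-concept>"], vocab, embedding_table)
print(new_id, prompt_embeddings[-1])
```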

In [4]:
from diffusers import StableDiffusionPipeline
from diffusers import AutoPipelineForText2Image

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

#### Stable Diffusion 1 and 2
- Pick a Stable Diffusion checkpoint and a pre-learned concept from the `Stable Diffusion Conceptualizer`:

In [None]:
pretrained_model_name_or_path = "stable-diffusion-v1-5/stable-diffusion-v1-5"
repo_id_embeds = "sd-concepts-library/cat-toy"

pipeline = StableDiffusionPipeline.from_pretrained(
    pretrained_model_name_or_path, torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

pipeline.load_textual_inversion(repo_id_embeds)

- Create a prompt with the pre-learned concept by using the special placeholder token `<cat-toy>`, and choose the number of samples and rows of images you’d like to generate:

In [None]:
prompt = "a graffiti in a favela wall with a <cat-toy> on it"

num_samples_per_row = 2
num_rows = 2

- Then run the pipeline (feel free to adjust parameters like `num_inference_steps` and `guidance_scale` to see how they affect image quality), collect the generated images, and visualize them with `make_image_grid`:

In [None]:
all_images = []
for _ in range(num_rows):
    images = pipeline(prompt, num_images_per_prompt=num_samples_per_row, num_inference_steps=50, guidance_scale=7.5).images
    all_images.extend(images)

grid = make_image_grid(all_images, num_rows, num_samples_per_row)
grid

#### Stable Diffusion XL
- `Stable Diffusion XL (SDXL)` can also use textual inversion vectors for inference.
  - In contrast to Stable Diffusion 1 and 2, SDXL has two text encoders so you’ll need two textual inversion embeddings - one for each text encoder model.
 
- Download the `SDXL` textual inversion embeddings and have a closer look at their structure:
  - There are two tensors, `"clip_g"` and `"clip_l"`. `"clip_g"` corresponds to the bigger text encoder in `SDXL` and refers to `pipe.text_encoder_2` and `"clip_l"` refers to `pipe.text_encoder`.
  - Load each tensor separately by passing them along with the correct text encoder and tokenizer to `load_textual_inversion()`:

In [6]:
file = hf_hub_download("dn118/unaestheticXL", filename="unaestheticXLv31.safetensors")
state_dict = load_file(file)
state_dict

{'clip_g': tensor([[ 0.0077, -0.0112,  0.0065,  ...,  0.0195,  0.0159,  0.0275],
         [ 0.0320, -0.0239,  0.0241,  ..., -0.0164,  0.0284, -0.0135],
         [-0.0303,  0.0069, -0.0071,  ...,  0.0100, -0.0251,  0.0164],
         ...,
         [ 0.0136, -0.0042,  0.0027,  ..., -0.0277, -0.0232, -0.0380],
         [ 0.0121, -0.0066,  0.0176,  ..., -0.0292,  0.0065, -0.0139],
         [-0.0170,  0.0213,  0.0143,  ..., -0.0302, -0.0240, -0.0362]],
        dtype=torch.float16),
 'clip_l': tensor([[ 0.0023,  0.0192,  0.0213,  ..., -0.0385,  0.0048, -0.0011],
         [-0.0079, -0.0240,  0.0062,  ..., -0.0042,  0.0103,  0.0328],
         [ 0.0096,  0.0127,  0.0181,  ..., -0.0076, -0.0272, -0.0204],
         ...,
         [ 0.0210,  0.0003,  0.0207,  ...,  0.0063, -0.0131,  0.0299],
         [ 0.0160, -0.0136,  0.0269,  ...,  0.0242,  0.0356, -0.0205],
         [ 0.0475, -0.0508, -0.0145,  ...,  0.0070, -0.0089, -0.0163]],
        dtype=torch.float16)}

In [None]:
pipe = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

pipe.load_textual_inversion(state_dict["clip_g"], token="unaestheticXLv31", text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2)
pipe.load_textual_inversion(state_dict["clip_l"], token="unaestheticXLv31", text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer)

# the embedding should be used as a negative embedding, so we pass it as a negative prompt
generator = torch.Generator().manual_seed(33)
image = pipe("a woman standing in front of a mountain", negative_prompt="unaestheticXLv31", generator=generator).images[0]
image

----
### **DiffEdit**
- **Image editing** typically requires providing a mask of the area to be edited.
  - DiffEdit automatically generates the mask for you based on a text query, making it easier overall to create a mask without image editing software.
- The DiffEdit algorithm works in three steps:
  - the diffusion model denoises an image conditioned on some query text and reference text which produces different noise estimates for different areas of the image; the difference is used to infer a mask to identify which area of the image needs to be changed to match the query text
  - the input image is encoded into latent space with DDIM
  - the latents are decoded with the diffusion model conditioned on the text query, using the mask as a guide such that pixels outside the mask remain the same as in the input image
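
- The mask inference in the first step and the masked blending in the third step can be sketched on flat lists. This is a simplified illustration: `infer_mask`, `blend`, the normalization by the peak difference, and the 0.5 threshold are assumptions made for the sketch, not the pipeline's exact implementation.

```python
# Simplified sketch of two DiffEdit ideas on flat lists of values.
# `infer_mask` and `blend` are hypothetical helpers; the real pipeline works
# on latent tensors and averages over several noise samples.

def infer_mask(source_noise, target_noise, threshold=0.5):
    # Areas where the two noise estimates disagree most are the ones to edit.
    diffs = [abs(t - s) for s, t in zip(source_noise, target_noise)]
    peak = max(diffs)
    if peak == 0:
        return [0] * len(diffs)
    return [1 if d / peak > threshold else 0 for d in diffs]

def blend(original, edited, mask):
    # Keep the original value wherever the mask is 0.
    return [e if m else o for o, e, m in zip(original, edited, mask)]

source_noise = [0.1, 0.2, 0.9, 0.4]  # estimate conditioned on the reference text
target_noise = [0.1, 0.25, 0.1, 0.9]  # estimate conditioned on the query text

mask = infer_mask(source_noise, target_noise)
result = blend([10, 20, 30, 40], [11, 21, 31, 41], mask)
print(mask, result)
```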

- The `StableDiffusionDiffEditPipeline` requires an image mask and a set of partially inverted latents.
  - The image mask is generated from the `generate_mask()` function, and includes two parameters, `source_prompt` and `target_prompt`.
    - These parameters determine what to edit in the image.
    - ex. to change a bowl of fruits to a bowl of pears, set `source_prompt` to "a bowl of fruits" and `target_prompt` to "a bowl of pears".

In [None]:
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers import BlipForConditionalGeneration, BlipProcessor

- The partially inverted latents are generated from the `invert()` function, and it is generally a good idea to include a prompt or caption describing the image to help guide the inverse latent sampling process.
- Let’s load the pipeline, scheduler, inverse scheduler, and enable some optimizations to reduce memory usage:

In [None]:
source_prompt = "a bowl of fruits"
target_prompt = "a bowl of pears"

pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
    safety_checker=None,
    use_safetensors=True,
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()

img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
raw_image = load_image(img_url).resize((768, 768))
raw_image

- Use the `generate_mask()` function to generate the image mask.
  - You’ll need to pass it the `source_prompt` and `target_prompt` to specify what to edit in the image:

In [None]:
source_prompt = "a bowl of fruits"
target_prompt = "a basket of pears"
mask_image = pipeline.generate_mask(
    image=raw_image,
    source_prompt=source_prompt,
    target_prompt=target_prompt,
)
Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768))

- Create the inverted latents with `invert()`, passing it a caption that describes the image:
  - Then pass the image mask and inverted latents to the pipeline.
  - The `target_prompt` becomes the prompt now, and the `source_prompt` is used as the `negative_prompt`:

In [None]:
inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image).latents

In [None]:
output_image = pipeline(
    prompt=target_prompt,
    mask_image=mask_image,
    image_latents=inv_latents,
    negative_prompt=source_prompt,
).images[0]
mask_image = Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768))
make_image_grid([raw_image, mask_image, output_image], rows=1, cols=3)

#### Generate source and target embeddings
- The source and target embeddings can be automatically generated with the `Flan-T5` model instead of creating them manually.
- Provide some initial text to prompt the model to generate the source and target prompts.
- Create a utility function to generate the prompts:

In [None]:
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float16)

source_concept = "bowl"
target_concept = "basket"

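# Parentheses are needed so the adjacent string literals are joined into one prompt;
# without them, the second line of each pair is a separate, discarded expression.
source_text = (
    f"Provide a caption for images containing a {source_concept}. "
    "The captions should be in English and should be no longer than 150 characters."
)

target_text = (
    f"Provide a caption for images containing a {target_concept}. "
    "The captions should be in English and should be no longer than 150 characters."
)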

In [None]:
@torch.no_grad()
def generate_prompts(input_prompt):
    input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda")

    outputs = model.generate(
        input_ids, temperature=0.8, num_return_sequences=16, do_sample=True, max_new_tokens=128, top_k=10
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

source_prompts = generate_prompts(source_text)
target_prompts = generate_prompts(target_text)
print(source_prompts)
print(target_prompts)

- Load the text encoder model used by the `StableDiffusionDiffEditPipeline` to encode the text.
  - You’ll use the text encoder to compute the text embeddings

In [None]:
pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16, use_safetensors=True
)
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()


def embed_prompts(sentences, tokenizer, text_encoder, device="cuda"):
    embeddings = []
    for sent in sentences:
        text_inputs = tokenizer(
            sent,
            padding="max_length",
            max_length=tokenizer.model_max_length,
            truncation=True,
            return_tensors="pt",
        )
        text_input_ids = text_inputs.input_ids
        prompt_embeds = text_encoder(text_input_ids.to(device), attention_mask=None)[0]
        embeddings.append(prompt_embeds)
    return torch.concatenate(embeddings, dim=0).mean(dim=0).unsqueeze(0)

source_embeds = embed_prompts(source_prompts, pipeline.tokenizer, pipeline.text_encoder)
target_embeds = embed_prompts(target_prompts, pipeline.tokenizer, pipeline.text_encoder)

- Pass the embeddings to the `generate_mask()` and `invert()` functions, and to the pipeline, to generate the image:

In [None]:
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)

img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
raw_image = load_image(img_url).resize((768, 768))

# Pass the averaged embeddings in place of the text prompts
# (a prompt and its `*_embeds` counterpart should not be passed together).
mask_image = pipeline.generate_mask(
    image=raw_image,
    source_prompt_embeds=source_embeds,
    target_prompt_embeds=target_embeds,
)

inv_latents = pipeline.invert(
    prompt_embeds=source_embeds,
    image=raw_image,
).latents

output_image = pipeline(
    mask_image=mask_image,
    image_latents=inv_latents,
    prompt_embeds=target_embeds,
    negative_prompt_embeds=source_embeds,
).images[0]
mask_image = Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768))
make_image_grid([raw_image, mask_image, output_image], rows=1, cols=3)

#### Generate a caption for inversion
- While you can use the `source_prompt` as a caption to help generate the partially inverted latents, you can also use the BLIP model to automatically generate a caption.

In [None]:
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base", torch_dtype=torch.float16, low_cpu_mem_usage=True
)

- Create a utility function to generate a caption from the input image:

In [None]:
@torch.no_grad()
def generate_caption(images, caption_generator, caption_processor):
    text = "a photograph of"

    inputs = caption_processor(images, text, return_tensors="pt").to(device="cuda", dtype=caption_generator.dtype)
    caption_generator.to("cuda")
    outputs = caption_generator.generate(**inputs, max_new_tokens=128)

    # offload caption generator
    caption_generator.to("cpu")

    caption = caption_processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return caption

- Load an input image and generate a caption for it using the `generate_caption` function:

In [None]:
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
raw_image = load_image(img_url).resize((768, 768))
caption = generate_caption(raw_image, model, processor)
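The generated caption can then take the place of `source_prompt` when computing the partially inverted latents. A minimal sketch, assuming the `pipeline`, `raw_image`, and `caption` objects from the cells above; the wrapper name `invert_with_caption` is hypothetical and only there to make the call easy to reuse.

```python
def invert_with_caption(pipeline, image, caption):
    """Return partially inverted latents, using a generated caption as the inversion prompt."""
    return pipeline.invert(prompt=caption, image=image).latents


# Usage with the objects defined earlier in this section:
# inv_latents = invert_with_caption(pipeline, raw_image, caption)
```

The resulting latents are passed to the pipeline as `image_latents`, exactly as in the earlier example.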

----
### **Trajectory Consistency Distillation-LoRA**
- `Trajectory Consistency Distillation (TCD)` enables a model to generate higher quality and more detailed images with fewer steps.
- Because errors are effectively mitigated during the distillation process, `TCD` also performs well with a large number of inference steps.

- A major advantage of TCD is:
  - Better than Teacher: TCD demonstrates superior generative quality with both few and many inference steps and exceeds the performance of `DPM-Solver++(2S)` with `Stable Diffusion XL (SDXL)`.

- For large models like `SDXL`, `TCD` is trained with `LoRA` to reduce memory usage.
  - This is also useful because you can reuse LoRAs between different finetuned models, as long as they share the same base model, without further training.
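
To get a feel for why training with `LoRA` reduces memory usage, it helps to count trainable parameters. This is a rough sketch, assuming the standard LoRA formulation in which a frozen `d_out x d_in` weight is adapted through two low-rank factors of rank `r`. The layer size and rank below are illustrative and are not the actual TCD-LoRA configuration.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable parameter counts for one linear layer."""
    full = d_out * d_in            # fine-tuning the whole weight matrix
    lora = rank * (d_out + d_in)   # factors B (d_out x rank) and A (rank x d_in)
    return full, lora


full, lora = lora_param_counts(d_out=1280, d_in=1280, rank=64)
print(f"full: {full}, lora: {lora}, ratio: {lora / full:.2f}")
```

Only the low-rank factors need gradients and optimizer state, which is where the memory saving comes from. The same small set of weights is also what gets reused across finetuned models that share a base model.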
 
<br>

#### General tasks
- Let’s use the `StableDiffusionXLPipeline` and the `TCDScheduler`.
  - Use the `load_lora_weights()` method to load the `SDXL-compatible TCD-LoRA` weights.

- A few tips to keep in mind for TCD-LoRA inference:
  - Keep `num_inference_steps` between 4 and 50.
  - Set `eta` (which controls the stochasticity at each step) between 0 and 1.
  - Use a higher `eta` when increasing the number of inference steps, but note that a larger `eta` in `TCDScheduler` leads to blurrier images.
  - An `eta` of `0.3` is recommended to produce good results.
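
The ranges above can be checked before launching a generation. The helper below is a small sketch; `check_tcd_settings` is a hypothetical name, not a `diffusers` function, and it only encodes the ranges suggested in the tips.

```python
def check_tcd_settings(num_inference_steps, eta):
    """Return a list of warnings for settings outside the suggested TCD-LoRA ranges."""
    warnings = []
    if not 4 <= num_inference_steps <= 50:
        warnings.append(
            f"num_inference_steps={num_inference_steps} is outside the suggested 4-50 range"
        )
    if not 0.0 <= eta <= 1.0:
        warnings.append(f"eta={eta} is outside the 0-1 range")
    return warnings


print(check_tcd_settings(4, 0.3))   # settings used in the examples of this section
print(check_tcd_settings(2, 1.5))   # both values fall outside the suggested ranges
```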

In [None]:
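from diffusers import StableDiffusionXLPipeline, TCDScheduler

# Used by the Depth ControlNet example
from transformers import DPTImageProcessor, DPTForDepthEstimation
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

# Used by the IP-Adapter example; `ip_adapter` is not part of diffusers and
# comes from the tencent-ailab/IP-Adapter repository, so it must be installed separately
from ip_adapter import IPAdapterXL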

In [None]:
device = "cuda"
base_model_id = "stabilityai/stable-diffusion-xl-base-1.0"
tcd_lora_id = "h1t/TCD-SDXL-LoRA"

pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, variant="fp16").to(device)
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)

pipe.load_lora_weights(tcd_lora_id)
pipe.fuse_lora()

prompt = "Painting of the orange cat Otto von Garfield, Count of Bismarck-Schönhausen, Duke of Lauenburg, Minister-President of Prussia. Depicted wearing a Prussian Pickelhaube and eating his favorite meal - lasagna."

image = pipe(
    prompt=prompt,
    num_inference_steps=4,
    guidance_scale=0,
    eta=0.3,
    generator=torch.Generator(device=device).manual_seed(0),
).images[0]

#### Community models
- `TCD-LoRA` also works with many community finetuned models and plugins.
  - Load the `animagine-xl-3.0` checkpoint which is a community finetuned version of SDXL for generating anime images.

In [None]:
device = "cuda"
base_model_id = "cagliostrolab/animagine-xl-3.0"
tcd_lora_id = "h1t/TCD-SDXL-LoRA"

pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, variant="fp16").to(device)
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)

pipe.load_lora_weights(tcd_lora_id)
pipe.fuse_lora()

prompt = "A man, clad in a meticulously tailored military uniform, stands with unwavering resolve. The uniform boasts intricate details, and his eyes gleam with determination. Strands of vibrant, windswept hair peek out from beneath the brim of his cap."

image = pipe(
    prompt=prompt,
    num_inference_steps=8,
    guidance_scale=0,
    eta=0.3,
    generator=torch.Generator(device=device).manual_seed(0),
).images[0]

- `TCD-LoRA` also supports other LoRAs trained on different styles.
  - Load the `TheLastBen/Papercut_SDXL` LoRA and combine it with the `TCD-LoRA` using the `set_adapters()` method.

In [None]:
device = "cuda"
base_model_id = "stabilityai/stable-diffusion-xl-base-1.0"
tcd_lora_id = "h1t/TCD-SDXL-LoRA"
styled_lora_id = "TheLastBen/Papercut_SDXL"

pipe = StableDiffusionXLPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, variant="fp16").to(device)
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)

pipe.load_lora_weights(tcd_lora_id, adapter_name="tcd")
pipe.load_lora_weights(styled_lora_id, adapter_name="style")
pipe.set_adapters(["tcd", "style"], adapter_weights=[1.0, 1.0])

prompt = "papercut of a winter mountain, snow"

image = pipe(
    prompt=prompt,
    num_inference_steps=4,
    guidance_scale=0,
    eta=0.3,
    generator=torch.Generator(device=device).manual_seed(0),
).images[0]
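
The adapter weights do not have to be equal. As a small sketch, a hypothetical helper (`activate_adapters`, not part of `diffusers`) can turn a name-to-weight mapping into the two lists that `set_adapters` is called with above, which makes it easier to experiment with, for example, a weaker style LoRA.

```python
def activate_adapters(pipe, weights_by_name):
    """Activate the named LoRA adapters on `pipe` with the given per-adapter weights."""
    names = list(weights_by_name)
    weights = [weights_by_name[name] for name in names]
    pipe.set_adapters(names, adapter_weights=weights)
    return names, weights


# Usage with the pipeline from the cell above (adapter names must match
# the `adapter_name` values passed to `load_lora_weights`):
# activate_adapters(pipe, {"tcd": 1.0, "style": 0.8})
```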

#### Adapters
- TCD-LoRA is very versatile, and it can be combined with other adapter types like ControlNets, IP-Adapter, and AnimateDiff.
- **Depth ControlNet**

In [None]:
device = "cuda"
depth_estimator = DPTForDepthEstimation.from_pretrained("Intel/dpt-hybrid-midas").to(device)
feature_extractor = DPTImageProcessor.from_pretrained("Intel/dpt-hybrid-midas")

def get_depth_map(image):
    image = feature_extractor(images=image, return_tensors="pt").pixel_values.to(device)
    with torch.no_grad(), torch.autocast(device):
        depth_map = depth_estimator(image).predicted_depth

    depth_map = torch.nn.functional.interpolate(
        depth_map.unsqueeze(1),
        size=(1024, 1024),
        mode="bicubic",
        align_corners=False,
    )
    depth_min = torch.amin(depth_map, dim=[1, 2, 3], keepdim=True)
    depth_max = torch.amax(depth_map, dim=[1, 2, 3], keepdim=True)
    depth_map = (depth_map - depth_min) / (depth_max - depth_min)
    image = torch.cat([depth_map] * 3, dim=1)

    image = image.permute(0, 2, 3, 1).cpu().numpy()[0]
    image = Image.fromarray((image * 255.0).clip(0, 255).astype(np.uint8))
    return image

base_model_id = "stabilityai/stable-diffusion-xl-base-1.0"
controlnet_id = "diffusers/controlnet-depth-sdxl-1.0"
tcd_lora_id = "h1t/TCD-SDXL-LoRA"

controlnet = ControlNetModel.from_pretrained(
    controlnet_id,
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    base_model_id,
    controlnet=controlnet,
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()

pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)

pipe.load_lora_weights(tcd_lora_id)
pipe.fuse_lora()

prompt = "stormtrooper lecture, photorealistic"

image = load_image("https://huggingface.co/lllyasviel/sd-controlnet-depth/resolve/main/images/stormtrooper.png")
depth_image = get_depth_map(image)

controlnet_conditioning_scale = 0.5  # recommended for good generalization

image = pipe(
    prompt,
    image=depth_image,
    num_inference_steps=4,
    guidance_scale=0,
    eta=0.3,
    controlnet_conditioning_scale=controlnet_conditioning_scale,
    generator=torch.Generator(device=device).manual_seed(0),
).images[0]

grid_image = make_image_grid([depth_image, image], rows=1, cols=2)

- **IP Adapter**

In [None]:
device = "cuda"
base_model_path = "stabilityai/stable-diffusion-xl-base-1.0"
# Local paths: the image encoder and IP-Adapter checkpoint must be downloaded beforehand
image_encoder_path = "sdxl_models/image_encoder"
ip_ckpt = "sdxl_models/ip-adapter_sdxl.bin"
tcd_lora_id = "h1t/TCD-SDXL-LoRA"

pipe = StableDiffusionXLPipeline.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
    variant="fp16"
)
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)

pipe.load_lora_weights(tcd_lora_id)
pipe.fuse_lora()

ip_model = IPAdapterXL(pipe, image_encoder_path, ip_ckpt, device)

ref_image = load_image("https://raw.githubusercontent.com/tencent-ailab/IP-Adapter/main/assets/images/woman.png").resize((512, 512))

prompt = "best quality, high quality, wearing sunglasses"

image = ip_model.generate(
    pil_image=ref_image,
    prompt=prompt,
    scale=0.5,
    num_samples=1,
    num_inference_steps=4,
    guidance_scale=0,
    eta=0.3,
    seed=0,
)[0]

grid_image = make_image_grid([ref_image, image], rows=1, cols=2)