## **[Optional] Inpainting Pipelines**

> Original Source: https://huggingface.co/docs/diffusers/v0.33.1/en/using-diffusers/sdxl_turbo

```
> Stable Diffusion XL
> Kandinsky
> IP-Adapter
> Perturbed-Attention Guidance(PAG)
> ControlNet
> Latent Consistency Model(LCM)
> Trajectory Consistency Distillation-LoRA
```


- Stable Diffusion **inpainting** is a generative process that fills in missing or masked parts of an image using deep learning.
  - Unlike traditional inpainting, it operates in a latent space, where images are first compressed into a lower-dimensional representation.
- The user provides three main inputs: the original image, a binary mask indicating the region to modify, and an optional text prompt to guide the generation.
  - During the inpainting process, the model keeps the unmasked (black) regions fixed and only generates new content for the masked (white) areas.
  - It introduces noise into the masked region and then iteratively denoises it, guided by the surrounding context and the text prompt.
    - Because this is done in latent space, the model can generate high-quality, semantically meaningful results more efficiently.

- By leveraging powerful text-to-image capabilities, users can not only restore missing parts but also transform the image creatively.
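
The "keep the unmasked region fixed" step above can be pictured as a per-element blend between the original latents and the freshly denoised latents. The following is a minimal sketch using plain Python lists instead of tensors; `blend_latents` is a hypothetical helper written for illustration, not a diffusers function, and real pipelines may apply this blend differently depending on the checkpoint.

```python
def blend_latents(original, generated, mask):
    """Keep `original` where mask == 0 and take `generated` where mask == 1."""
    return [
        [m * g + (1.0 - m) * o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


# Toy 2x2 "latents": the right column is masked (1.0), the left column is kept (0.0).
original = [[0.1, 0.2], [0.3, 0.4]]
generated = [[0.9, 0.8], [0.7, 0.6]]
mask = [[0.0, 1.0], [0.0, 1.0]]

blended = blend_latents(original, generated, mask)
print(blended)
```

Only the masked column changes; the unmasked column is copied from the original, which is why the surrounding context stays intact.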

In [None]:
import torch
import numpy as np
from PIL import Image
import cv2

from transformers import pipeline

from diffusers import DiffusionPipeline
from diffusers import DDPMScheduler

from diffusers.utils import load_image, make_image_grid

-----
### **Stable Diffusion XL**
- `Stable Diffusion XL (SDXL)` is a powerful text-to-image generation model that iterates on the previous Stable Diffusion models in three key ways:
  - The `UNet` is 3x larger and `SDXL` combines a second text encoder (`OpenCLIP ViT-bigG/14`) with the original text encoder to significantly increase the number of parameters
  - Introduces size and crop-conditioning to preserve training data from being discarded and gain more control over how a generated image should be cropped
  - Introduces a two-stage model process; the base model (can also be run as a standalone model) generates an image as an input to the refiner model which adds additional high-quality details

- Install:
```
pip install -q diffusers transformers accelerate "invisible-watermark>=0.2.0"
```

- We recommend installing the `invisible-watermark` library to help identify images that are generated. If the invisible-watermark library is installed, it is used by default. To disable the watermarker:

```
pipeline = StableDiffusionXLPipeline.from_pretrained(..., add_watermarker=False)
```

<br>

- Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the `from_pretrained()` method:

In [None]:
from diffusers import StableDiffusionXLPipeline
from diffusers import AutoPipelineForText2Image

from diffusers import StableDiffusionXLImg2ImgPipeline, AutoPipelineForImage2Image
from diffusers import StableDiffusionXLInpaintPipeline, AutoPipelineForInpainting

from diffusers import DiffusionPipeline

- For inpainting, you’ll need the original image and a mask of what you want to replace in the original image.
  - Create a prompt to describe what you want to replace the masked area with.

In [None]:
pipeline_text2image = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

In [None]:
# use from_pipe to avoid consuming additional memory when loading a checkpoint
pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda")

img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
mask_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png"

init_image = load_image(img_url)
mask_image = load_image(mask_url)

prompt = "A deep sea diver floating"
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0]
make_image_grid([init_image, mask_image, image], rows=1, cols=3)

- The refiner model can also be used for inpainting in the `StableDiffusionXLInpaintPipeline`:

In [None]:
base = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

refiner = StableDiffusionXLInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
).to("cuda")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = load_image(img_url)
mask_image = load_image(mask_url)

prompt = "A majestic tiger sitting on a bench"
num_inference_steps = 75
high_noise_frac = 0.7

image = base(
    prompt=prompt,
    image=init_image,
    mask_image=mask_image,
    num_inference_steps=num_inference_steps,
    denoising_end=high_noise_frac,
    output_type="latent",
).images
image = refiner(
    prompt=prompt,
    image=image,
    mask_image=mask_image,
    num_inference_steps=num_inference_steps,
    denoising_start=high_noise_frac,
).images[0]
make_image_grid([init_image, mask_image, image.resize((512, 512))], rows=1, cols=3)
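
To get a feel for how `denoising_end` and `denoising_start` divide the work between the base and refiner models, the sketch below estimates how many of the scheduled steps each stage handles. `split_steps` is a hypothetical helper for illustration only; the exact split in practice depends on the scheduler's timestep spacing, so treat the numbers as approximate.

```python
def split_steps(num_inference_steps, high_noise_frac):
    """Approximate how many steps the base and refiner each run."""
    base_steps = int(num_inference_steps * high_noise_frac)
    refiner_steps = num_inference_steps - base_steps
    return base_steps, refiner_steps


# Same values as the example above: 75 steps with a 0.7 hand-off point.
base_steps, refiner_steps = split_steps(75, 0.7)
print(base_steps, refiner_steps)
```

Under this approximation the base model handles the high-noise portion of the schedule and the refiner finishes the remaining low-noise steps.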

-----
### **Kandinsky**
- The Kandinsky models are a series of multilingual text-to-image generation models.
  - The `Kandinsky 2.0` model uses two multilingual text encoders and concatenates those results for the UNet.
  - `Kandinsky 2.1` changes the architecture to include an image prior model (CLIP) to generate a mapping between text and image embeddings and uses a `Modulating Quantized Vectors (MoVQ)` decoder - which adds a spatial conditional normalization layer to increase photorealism - to decode the latents into images.
  - `Kandinsky 2.2` improves on the previous model by replacing the image encoder of the image prior model with a larger `CLIP-ViT-G` model to improve quality.
    - The only difference with `Kandinsky 2.1` is `Kandinsky 2.2` doesn’t accept prompt as an input when decoding the latents. Instead, `Kandinsky 2.2` only accepts `image_embeds` during decoding.
  - `Kandinsky 3` simplifies the architecture and shifts away from the two-stage generation process involving the prior model and diffusion model and uses `Flan-UL2` to encode text, a `UNet` with BigGan-deep blocks, and `Sber-MoVQGAN` to decode the latents into images.
    - Text understanding and generated image quality are primarily achieved by using a larger text encoder and UNet.


In [None]:
from diffusers import KandinskyPriorPipeline, KandinskyPipeline
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
from diffusers import Kandinsky3Pipeline

from diffusers import AutoPipelineForText2Image
from diffusers import AutoPipelineForImage2Image
from diffusers import AutoPipelineForInpainting

from diffusers import KandinskyImg2ImgPipeline, KandinskyPriorPipeline
from diffusers import KandinskyV22Img2ImgPipeline
from diffusers import Kandinsky3Img2ImgPipeline

from diffusers import KandinskyInpaintPipeline
from diffusers import KandinskyV22InpaintPipeline, KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline

from diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22ControlnetImg2ImgPipeline

from diffusers.models.attention_processor import AttnAddedKVProcessor2_0

from diffusers.utils import load_image
from diffusers.utils import make_image_grid

- The Kandinsky models use white pixels to represent the masked area now instead of black pixels.
  - If you are using `KandinskyInpaintPipeline` in production, you need to change the mask to use white pixels:

In [None]:
# For PIL input
import PIL.ImageOps
mask = PIL.ImageOps.invert(mask)

# For PyTorch and NumPy input
mask = 1 - mask

- For inpainting, you’ll need the original image, a mask of the area to replace in the original image, and a text prompt of what to inpaint. Load the prior pipeline.
- Load an initial image and create a mask.
  - Generate the embeddings with the prior pipeline:

In [None]:
prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyV22InpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda")

init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
mask = np.zeros((768, 768), dtype=np.float32)
# mask area above cat's head
mask[:250, 250:-250] = 1

prompt = "a hat"
prior_output = prior_pipeline(prompt)

- Pass the initial image, the mask, and the embeddings from the prior pipeline to the inpainting pipeline to generate an image:
  - You can also use the end-to-end `KandinskyInpaintCombinedPipeline` and `KandinskyV22InpaintCombinedPipeline` to call the prior and decoder pipelines together under the hood.

In [None]:
output_image = pipeline(image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0]
mask = Image.fromarray((mask*255).astype('uint8'), 'L')
make_image_grid([init_image, mask, output_image], rows=1, cols=3)

In [None]:
pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
mask = np.zeros((768, 768), dtype=np.float32)
# mask area above cat's head
mask[:250, 250:-250] = 1
prompt = "a hat"

output_image = pipe(prompt=prompt, image=init_image, mask_image=mask).images[0]
mask = Image.fromarray((mask*255).astype('uint8'), 'L')
make_image_grid([init_image, mask, output_image], rows=1, cols=3)

-----
### **IP-Adapter**
- IP-Adapter is an image prompt adapter that can be plugged into diffusion models to enable image prompting without any changes to the underlying model.
  - This adapter can be reused with other models finetuned from the same base model and it can be combined with other adapters like ControlNet.
  - The key idea behind `IP-Adapter` is the decoupled cross-attention mechanism which adds a separate cross-attention layer just for image features instead of using the same cross-attention layer for both text and image features.
    - This allows the model to learn more image-specific features.
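
The decoupled cross-attention idea can be sketched in a few lines: the same query attends to text features and to image features through two separate attention calls, and the image result is added with a weight. This is a simplified, single-query illustration in pure Python; the function names and the toy vectors are assumptions made for this sketch, and the real adapter uses learned key/value projections inside the UNet.

```python
import math


def attention(query, keys, values):
    """Single-query scaled dot-product attention over lists of key/value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]


def decoupled_cross_attention(query, text_kv, image_kv, scale):
    """Add a separately computed image-attention result to the text-attention result."""
    text_out = attention(query, *text_kv)
    image_out = attention(query, *image_kv)
    return [t + scale * i for t, i in zip(text_out, image_out)]


query = [1.0, 0.0]
text_kv = ([[1.0, 0.0]], [[2.0, 4.0]])      # (keys, values) from the text prompt
image_kv = ([[0.0, 1.0]], [[10.0, 20.0]])   # (keys, values) from the image prompt

print(decoupled_cross_attention(query, text_kv, image_kv, scale=0.5))
```

With `scale=0.0` the image branch contributes nothing and the output equals plain text cross-attention, which is why the adapter can be attached without changing the underlying model.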
   
<br>

- The `set_ip_adapter_scale()` method controls the amount of text or image conditioning to apply to the model.
  - A value of `1.0` means the model is only conditioned on the image prompt.
  - Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt.
  - Typically, a value of 0.5 achieves a good balance between the two prompt types and produces good results.

- Try passing `low_cpu_mem_usage=True` to the `load_ip_adapter()` method to speed up loading.

In [None]:
from diffusers import AutoPipelineForText2Image
from diffusers import AutoPipelineForImage2Image
from diffusers import AutoPipelineForInpainting

from diffusers.image_processor import IPAdapterMaskProcessor
from diffusers import StableDiffusionPipeline, DDIMScheduler, AutoPipelineForText2Image
from diffusers import DiffusionPipeline, LCMScheduler

from insightface.app import FaceAnalysis
from insightface.utils import face_align

from transformers import CLIPVisionModelWithProjection

- `IP-Adapter` is also useful for inpainting because the image prompt allows you to be much more specific about what you’d like to generate.
  - Pass a prompt, the original image, mask image, and the IP-Adapter image prompt to the pipeline to generate an image.

In [None]:
pipeline = AutoPipelineForInpainting.from_pretrained("diffusers/stable-diffusion-xl-1.0-inpainting-0.1", torch_dtype=torch.float16).to("cuda")
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
pipeline.set_ip_adapter_scale(0.6)

mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_mask.png")
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_1.png")
ip_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_gummy.png")

generator = torch.Generator(device="cpu").manual_seed(4)
images = pipeline(
    prompt="a cute gummy bear waving",
    image=image,
    mask_image=mask_image,
    ip_adapter_image=ip_image,
    generator=generator,
    num_inference_steps=100,
).images
images[0]

----
### **Perturbed-Attention Guidance(PAG)**
- `Perturbed-Attention Guidance(PAG)` is a new diffusion sampling guidance that improves sample quality across both unconditional and conditional settings, achieving this without requiring further training or the integration of external modules.
  - `PAG` is designed to progressively enhance the structure of synthesized samples throughout the denoising process by considering the self-attention mechanisms’ ability to capture structural information.
  - It involves generating intermediate samples with degraded structure by substituting selected self-attention maps in diffusion U-Net with an identity matrix, and guiding the denoising process away from these degraded samples.
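
The "guide away from degraded samples" idea can be written as a small update on the predicted noise. The sketch below combines classifier-free guidance with a PAG term on toy one-element lists; `combine_guidance` is a hypothetical name, and the formula is an assumption about how the two terms are commonly combined rather than a copy of the library's internals.

```python
def combine_guidance(uncond, cond, perturbed, guidance_scale, pag_scale):
    """Push the prediction toward `cond`, away from `uncond` and from `perturbed`."""
    return [
        u + guidance_scale * (c - u) + pag_scale * (c - p)
        for u, c, p in zip(uncond, cond, perturbed)
    ]


uncond = [0.0]      # prediction without the text prompt
cond = [1.0]        # prediction with the text prompt
perturbed = [0.5]   # prediction with self-attention maps replaced by identity

print(combine_guidance(uncond, cond, perturbed, guidance_scale=7.5, pag_scale=3.0))
```

Setting `pag_scale=0.0` reduces the expression to ordinary classifier-free guidance.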

<br>

- You can apply PAG to the `StableDiffusionXLPipeline` for tasks such as text-to-image, image-to-image, and inpainting.
- To enable PAG for a specific task, load the pipeline using the AutoPipeline API with the `enable_pag=True` flag and the `pag_applied_layers` argument.

In [None]:
from diffusers import AutoPipelineForText2Image, ControlNetModel
from diffusers import AutoPipelineForImage2Image
from diffusers import AutoPipelineForInpainting

from transformers import CLIPVisionModelWithProjection

- You can enable `PAG` on an existing inpainting pipeline with `from_pipe()`.
  - This also works when the existing pipeline was loaded for a different task:

In [None]:
pipeline = AutoPipelineForInpainting.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    enable_pag=True,
    torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()

In [None]:
# enable PAG on an existing inpainting pipeline
pipeline_inpaint = AutoPipelineForInpainting.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
pipeline = AutoPipelineForInpainting.from_pipe(pipeline_inpaint, enable_pag=True)

# enable PAG when converting a text-to-image pipeline to inpainting
pipeline_t2i = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
pipeline = AutoPipelineForInpainting.from_pipe(pipeline_t2i, enable_pag=True)

In [None]:
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
init_image = load_image(img_url).convert("RGB")
mask_image = load_image(mask_url).convert("RGB")

prompt = "A majestic tiger sitting on a bench"

pag_scale = 3.0
guidance_scale = 7.5

generator = torch.Generator(device="cpu").manual_seed(1)
images = pipeline(
    prompt=prompt,
    image=init_image,
    mask_image=mask_image,
    strength=0.8,
    num_inference_steps=50,
    guidance_scale=guidance_scale,
    generator=generator,
    pag_scale=pag_scale,
).images
images[0]

----
### **ControlNet**
- `ControlNet` is a type of model for controlling image diffusion models by conditioning the model with an additional input image.
  - There are many types of conditioning inputs (canny edge, user sketching, human pose, depth, and more) you can use to control a diffusion model.
  - This is hugely useful because it affords you greater control over image generation, making it easier to generate specific images without experimenting with different text prompts or denoising values as much.

- A `ControlNet` model has two sets of weights (or blocks) connected by a zero-convolution layer:
  - a locked copy keeps everything a large pretrained diffusion model has learned
  - a trainable copy is trained on the additional conditioning input
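
The role of the zero-convolution layer can be illustrated with a toy model: because its weights start at zero, the trainable copy contributes nothing at first and the output equals that of the locked copy. The sketch below is a heavily simplified stand-in using scalar weights and Python lists; all names are hypothetical and it omits the real architecture's convolutions and multiple blocks.

```python
def zero_conv(features, weight=0.0, bias=0.0):
    """Stand-in for a 1x1 convolution whose weight and bias start at zero."""
    return [weight * f + bias for f in features]


def controlled_block(x, condition, locked_block, trainable_block, weight=0.0):
    """Add the zero-conv output of the trainable copy to the locked copy's output."""
    locked_out = locked_block(x)
    control_out = trainable_block([xi + ci for xi, ci in zip(x, condition)])
    return [l + z for l, z in zip(locked_out, zero_conv(control_out, weight))]


def double(values):
    # Toy "network block" used for both the locked and the trainable copy.
    return [2.0 * v for v in values]


x = [1.0, 2.0]
condition = [0.5, 0.5]

print(controlled_block(x, condition, double, double, weight=0.0))  # same as double(x)
print(controlled_block(x, condition, double, double, weight=1.0))  # control signal added
```

Starting from zero means training begins from the pretrained model's behavior and only gradually lets the conditioning input influence the result.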

<br>

- Since the locked copy preserves the pretrained model, training and implementing a `ControlNet` on a new conditioning input is as fast as finetuning any other model because you aren’t training the model from scratch.

In [None]:
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
from diffusers import StableDiffusionControlNetImg2ImgPipeline
from diffusers import StableDiffusionControlNetInpaintPipeline

from diffusers import StableDiffusionXLControlNetPipeline, AutoencoderKL

- For inpainting, you need an initial image, a mask image, and a prompt describing what to replace the mask with.
  - `ControlNet` models allow you to add another control image to condition a model with.
  - `ControlNet` can use the inpainting mask as a control to guide the model to generate an image within the mask area.

<br>

- Load an initial image and a mask image:

In [None]:
init_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint.jpg"
)
init_image = init_image.resize((512, 512))

mask_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint-mask.jpg"
)
mask_image = mask_image.resize((512, 512))
make_image_grid([init_image, mask_image], rows=1, cols=2)

- Create a function to prepare the control image from the initial and mask images.
  - Create a tensor to mark the pixels in `init_image` as masked if the corresponding pixel in `mask_image` is over a certain threshold.

In [None]:
def make_inpaint_condition(image, image_mask):
    image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
    image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0

    assert image.shape[:2] == image_mask.shape[:2], "image and image_mask must have the same size"
    image[image_mask > 0.5] = -1.0  # set as masked pixel
    image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)
    image = torch.from_numpy(image)
    return image

control_image = make_inpaint_condition(init_image, mask_image)

- Load a `ControlNet` model conditioned on inpainting and pass it to the `StableDiffusionControlNetInpaintPipeline`.
  - Use the faster `UniPCMultistepScheduler` and enable model offloading to speed up inference and reduce memory usage.

In [None]:
controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
)

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

In [None]:
output = pipe(
    "corgi face with large ears, detailed, pixar, animated, disney",
    num_inference_steps=20,
    eta=1.0,
    image=init_image,
    mask_image=mask_image,
    control_image=control_image,
).images[0]
make_image_grid([init_image, mask_image, output], rows=1, cols=3)

----
### **Latent Consistency Model(LCM)**
- `Latent Consistency Models (LCMs)` enable fast high-quality image generation by directly predicting the reverse diffusion process in the latent rather than pixel space.
  - `LCMs` try to predict the noiseless image from the noisy image in contrast to typical diffusion models that iteratively remove noise from the noisy image.
  - By avoiding the iterative sampling process, `LCMs` are able to generate high-quality images in 2-4 steps instead of 20-30 steps.

- `LCMs` are distilled from pretrained models which requires ~32 hours of A100 compute.
  - To speed this up, `LCM-LoRAs` train a `LoRA` adapter, which has far fewer parameters to train than the full model.
  - The `LCM-LoRA` can be plugged into a diffusion model once it has been trained.

In [None]:
from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, LCMScheduler, AutoPipelineForInpainting
from diffusers import DiffusionPipeline

- To use `LCM-LoRAs` for inpainting, you need to replace the scheduler with the `LCMScheduler` and load the `LCM-LoRA` weights with the `load_lora_weights()` method.

In [None]:
pipe = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png")

prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
generator = torch.manual_seed(0)
image = pipe(
    prompt=prompt,
    image=init_image,
    mask_image=mask_image,
    generator=generator,
    num_inference_steps=4,
    guidance_scale=4,
).images[0]
image

----
### **Trajectory Consistency Distillation-LoRA**
- `Trajectory Consistency Distillation (TCD)` enables a model to generate higher quality and more detailed images with fewer steps.
- Owing to the effective error mitigation during the distillation process, `TCD` demonstrates superior performance even under conditions of large inference steps.

- A major advantage of TCD is:
  - Better than Teacher: TCD demonstrates superior generative quality at both small and large inference steps and exceeds the performance of `DPM-Solver++(2S)` with `Stable Diffusion XL (SDXL)`.

- For large models like `SDXL`, `TCD` is trained with `LoRA` to reduce memory usage.
  - This is also useful because you can reuse LoRAs between different finetuned models, as long as they share the same base model, without further training.
 
<br>

#### General tasks
- Let’s use `AutoPipelineForInpainting` with an SDXL inpainting checkpoint and the `TCDScheduler`.
  - Use the `load_lora_weights()` method to load the `SDXL-compatible TCD-LoRA` weights.

- A few tips to keep in mind for TCD-LoRA inference are to:
  - Keep the `num_inference_steps` between 4 and 50
  - Set `eta` (used to control stochasticity at each step) between 0 and 1.
  - You should use a higher eta when increasing the number of inference steps, but the downside is that a larger eta in `TCDScheduler` leads to blurrier images.
  - A value of `0.3` is recommended to produce good results.
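
The tips above can be turned into a quick sanity check before calling the pipeline. `check_tcd_settings` is a hypothetical helper written for this note, not part of diffusers; the ranges are taken directly from the list above.

```python
def check_tcd_settings(num_inference_steps, eta):
    """Return a list of warnings for settings outside the ranges suggested above."""
    problems = []
    if not 4 <= num_inference_steps <= 50:
        problems.append("num_inference_steps should be between 4 and 50")
    if not 0.0 <= eta <= 1.0:
        problems.append("eta should be between 0 and 1")
    return problems


print(check_tcd_settings(num_inference_steps=8, eta=0.3))     # no warnings
print(check_tcd_settings(num_inference_steps=100, eta=1.5))   # two warnings
```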

In [None]:
from diffusers import AutoPipelineForInpainting, TCDScheduler

In [None]:
device = "cuda"
base_model_id = "diffusers/stable-diffusion-xl-1.0-inpainting-0.1"
tcd_lora_id = "h1t/TCD-SDXL-LoRA"

pipe = AutoPipelineForInpainting.from_pretrained(base_model_id, torch_dtype=torch.float16, variant="fp16").to(device)
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)

pipe.load_lora_weights(tcd_lora_id)
pipe.fuse_lora()

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = load_image(img_url).resize((1024, 1024))
mask_image = load_image(mask_url).resize((1024, 1024))

prompt = "a tiger sitting on a park bench"

image = pipe(
  prompt=prompt,
  image=init_image,
  mask_image=mask_image,
  num_inference_steps=8,
  guidance_scale=0,
  eta=0.3,
  strength=0.99,  # make sure to use `strength` below 1.0
  generator=torch.Generator(device=device).manual_seed(0),
).images[0]

grid_image = make_image_grid([init_image, mask_image, image], rows=1, cols=3)
grid_image  # display the grid in the notebook