## **4. Inference Techniques**
> Original Source: https://huggingface.co/docs/diffusers/v0.33.1/en/using-diffusers/overview_techniques

```
> Create a server
> Distributed Inference
> Merge LoRAs
> Scheduler Features
> Pipeline Callbacks
> Reproducible Pipelines
> Controlling Image Quality
> Prompt Techniques

```

- **Pipeline functionality**: these techniques modify the pipeline or extend it for other applications.
  - Pipeline callbacks add new features to a pipeline and a pipeline can also be extended for distributed inference.
- **Improve inference quality**: these techniques increase the visual quality of the generated images.
  - Enhance your prompts with GPT2 to create better images with lower effort.

In [22]:
import torch
import gc
import copy

from accelerate import PartialState
from diffusers import DiffusionPipeline, EulerDiscreteScheduler, DDIMPipeline, DDIMScheduler
from diffusers import StableDiffusionPipeline  # used in the callback examples
from diffusers import FluxPipeline
from diffusers import FluxTransformer2DModel
from diffusers import AutoencoderKL, UNet2DConditionModel
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
from diffusers import AutoPipelineForText2Image

from diffusers.image_processor import VaeImageProcessor
from diffusers.schedulers import AysSchedules, UniPCMultistepScheduler
from diffusers.callbacks import SDXLCFGCutoffCallback, IPAdapterScaleCutoffCallback
from diffusers.utils import load_image

from transformers import GenerationConfig, GPT2LMHeadModel, GPT2Tokenizer, LogitsProcessor, LogitsProcessorList

import torch.distributed as dist
import torch.multiprocessing as mp
from peft import get_peft_model, LoraConfig, PeftModel

from sd_embed.embedding_funcs import get_weighted_text_embeddings_sdxl
from sd_embed.embedding_funcs import get_weighted_text_embeddings_sd15

-------------
### **Create a Server**
- Diffusers’ pipelines can be used as an inference engine for a server.
  - It supports **concurrent and multithreaded requests to generate images** that may be requested by multiple users at the same time.
 
- The example server uses the `StableDiffusion3Pipeline`.
  - Start by navigating to the `examples/server` folder and installing all of the dependencies.
 
```
pip install .
pip install -r requirements.txt
```

- Launch the server with the following command.

```
python server.py
```

- The server is accessed at `http://localhost:8000`.
  - You can curl this model with the following command.

```
curl -X POST -H "Content-Type: application/json" --data '{"model": "something", "prompt": "a kitten in front of a fireplace"}' http://localhost:8000/v1/images/generations
```

- If you need to upgrade some dependencies, you can use either `pip-tools` or `uv`.

```
uv pip compile requirements.in -o requirements.txt
```


- The server is built with `FastAPI`. The endpoint for `v1/images/generations` is shown below.

In [None]:
@app.post("/v1/images/generations")
async def generate_image(image_input: TextToImageInput):
    try:
        loop = asyncio.get_event_loop()
        scheduler = shared_pipeline.pipeline.scheduler.from_config(shared_pipeline.pipeline.scheduler.config)
        pipeline = StableDiffusion3Pipeline.from_pipe(shared_pipeline.pipeline, scheduler=scheduler)
        generator = torch.Generator(device="cuda")
        generator.manual_seed(random.randint(0, 10000000))
        output = await loop.run_in_executor(None, lambda: pipeline(image_input.prompt, generator = generator))
        logger.info(f"output: {output}")
        image_url = save_image(output.images[0])
        return {"data": [{"url": image_url}]}
    except Exception as e:
        if isinstance(e, HTTPException):
            raise e
        elif hasattr(e, 'message'):
            raise HTTPException(status_code=500, detail=e.message + traceback.format_exc())
        raise HTTPException(status_code=500, detail=str(e) + traceback.format_exc())

- The `generate_image` function is defined as **asynchronous** with the `async` keyword so that `FastAPI` knows the function won’t necessarily return a result right away.
  - Once the function reaches a point where it needs to `await` another task, the main thread goes back to answering other HTTP requests. This is shown in the code below with the `await` keyword.
 
- **The execution of the pipeline function is placed onto a new thread**, and the main thread performs other things until a result is returned from the pipeline.
  - Another important aspect of this implementation is creating a pipeline from `shared_pipeline`.
  - The goal behind this is to avoid loading the underlying model more than once onto the GPU while still allowing for each new request that is running on a separate thread to have its own generator and scheduler.
  - The scheduler, in particular, is not thread-safe, and it will cause errors like: `IndexError: index 21 is out of bounds for dimension 0 with size 21` if you try to use the same scheduler across multiple threads.

In [None]:
output = await loop.run_in_executor(None, lambda: pipeline(image_input.prompt, generator = generator))
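
The per-request scheduler pattern can be illustrated without a GPU. The sketch below is only a stand-in: `ToyScheduler`, `run_request`, and `handle` are hypothetical names, and the "scheduler" is just a counter over a list of timesteps. It shows blocking work pushed to the default thread pool with `run_in_executor`, with each request building its own stateful object so that concurrent requests cannot advance each other's step index.

```python
import asyncio


class ToyScheduler:
    """Hypothetical stand-in for a stateful scheduler with an internal step index."""

    def __init__(self, num_steps):
        self.timesteps = list(range(num_steps, 0, -1))
        self.step_index = 0

    def step(self):
        # A scheduler shared across threads could be advanced past the end here.
        t = self.timesteps[self.step_index]
        self.step_index += 1
        return t


def run_request(prompt, num_steps=21):
    # Blocking work: each request builds its own scheduler, so state is never shared.
    scheduler = ToyScheduler(num_steps)
    visited = [scheduler.step() for _ in range(num_steps)]
    return f"{prompt}: {len(visited)} steps"


async def handle(prompt):
    # Hand the blocking call to the default thread pool and await its result.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, run_request, prompt)


async def main():
    # Two "requests" are served concurrently; gather keeps the input order.
    return await asyncio.gather(handle("a dog"), handle("a cat"))


results = asyncio.run(main())
print(results)
```

In the real server the expensive model weights stay shared, and only the lightweight scheduler and generator are created per request.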

------------------
### **Distributed Inference**
- On distributed setups, you can run inference across multiple GPUs with `Accelerate` or `PyTorch Distributed`, which is useful for generating with multiple prompts in parallel.

#### `Accelerate`
- `Accelerate` is a library designed to make it easy to train or run inference across distributed setups and simplifies the process of setting up the distributed environment, allowing you to focus on your PyTorch code.

- To begin, create a Python file and initialize an `accelerate.PartialState` to create a distributed environment.
  - Your setup is automatically detected so you don’t need to explicitly define the rank or `world_size`.
  - Move the `DiffusionPipeline` to `distributed_state.device` to assign a GPU to each process.
- Use the `split_between_processes` utility as a context manager to automatically distribute the prompts between the number of processes.

In [None]:
pipeline = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
)
distributed_state = PartialState()
pipeline.to(distributed_state.device)

with distributed_state.split_between_processes(["a dog", "a cat"]) as prompt:
    result = pipeline(prompt).images[0]
    result.save(f"result_{distributed_state.process_index}.png")
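
To make the idea of splitting prompts between processes concrete, here is a simplified sketch. It is not Accelerate's implementation: `split_prompts` is a hypothetical helper that gives each process a contiguous slice and hands any leftover items to the lowest-ranked processes, which is one reasonable way to divide the work.

```python
def split_prompts(prompts, num_processes, process_index):
    """Return the contiguous slice of `prompts` owned by one process."""
    per_process, extras = divmod(len(prompts), num_processes)
    # The first `extras` processes each take one additional item.
    start = process_index * per_process + min(process_index, extras)
    end = start + per_process + (1 if process_index < extras else 0)
    return prompts[start:end]


prompts = ["a dog", "a cat", "a bird"]
shards = [split_prompts(prompts, num_processes=2, process_index=rank) for rank in range(2)]
print(shards)
```

With two prompts and two processes, each process receives exactly one prompt, as in the example above.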

- Use the `--num_processes` argument to specify the number of GPUs to use, and call `accelerate launch` to run the script. Launcher options go before the script name; anything after it is passed to the script itself.
```
accelerate launch --num_processes=2 run_distributed.py
```

#### PyTorch Distributed
- PyTorch supports `DistributedDataParallel` which enables data parallelism.
  - Create a Python file and import `torch.distributed` and `torch.multiprocessing` to set up the distributed process group and to spawn the processes for inference on each GPU.

In [4]:
sd = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
)


- You’ll want to create a function to run inference; `init_process_group` handles creating a distributed environment with the type of backend to use, the rank of the current process, and the `world_size` or the number of processes participating.
  - If you’re running inference in parallel over 2 GPUs, then the `world_size` is 2.

- Move the `DiffusionPipeline` to rank and use `get_rank` to assign a GPU to each process, where each process handles a different prompt:

In [5]:
def run_inference(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    sd.to(rank)

    if torch.distributed.get_rank() == 0:
        prompt = "a dog"
    elif torch.distributed.get_rank() == 1:
        prompt = "a cat"

    image = sd(prompt).images[0]
    # build the filename from the prompt, e.g. "a_dog.png"
    image.save(f"./{prompt.replace(' ', '_')}.png")

- To run the distributed inference, call `mp.spawn` to run the `run_inference` function on the number of GPUs defined in `world_size`:

In [None]:
def main():
    world_size = 2
    mp.spawn(run_inference, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()

- Once you’ve completed the inference script, use the `--nproc_per_node` argument to specify the number of GPUs to use and call `torchrun` to run the script. As with other launchers, the option goes before the script name:

```
torchrun --nproc_per_node=2 run_distributed.py
```

#### Model sharding
- Modern diffusion systems such as Flux are very large and have multiple models.
- ex. `Flux.1-Dev` is made up of two text encoders - `T5-XXL` and `CLIP-L` - a diffusion transformer, and a VAE.
- **Model sharding** is a technique that distributes models across GPUs when the models don’t fit on a single GPU.
- The example below assumes two 16GB GPUs are available for inference.
  - Start by computing the text embeddings with the text encoders.
  - Keep the text encoders on two GPUs by setting `device_map="balanced"`.
    - The balanced strategy evenly distributes the model on all available GPUs.
  - Use the `max_memory` parameter to allocate the maximum amount of memory for each text encoder on each GPU.

In [None]:
prompt = "a photo of a dog with cat-like look"

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=None,
    vae=None,
    device_map="balanced",
    max_memory={0: "16GB", 1: "16GB"},
    torch_dtype=torch.bfloat16
)
with torch.no_grad():
    print("Encoding prompts.")
    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
        prompt=prompt, prompt_2=None, max_sequence_length=512
    )

- Once the text embeddings are computed, remove them from the GPU to make space for the diffusion transformer.

In [None]:
def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()

del pipeline.text_encoder
del pipeline.text_encoder_2
del pipeline.tokenizer
del pipeline.tokenizer_2
del pipeline

flush()

- Load the diffusion transformer next which has 12.5B parameters.
  - Set `device_map="auto"` to automatically distribute the model across two 16GB GPUs.
  - The auto strategy is backed by `Accelerate` and available as a part of the Big Model Inference feature.
  - It starts by distributing a model across the fastest device first (GPU) before moving to slower devices like the CPU and hard drive if needed.
  - The trade-off of storing model parameters on slower devices is slower inference latency.
  - You can try `print(pipeline.hf_device_map)` to see how the various models are distributed across devices.
    - This is useful for tracking the device placement of the models.
    - You can also try `print(transformer.hf_device_map)` to see how the transformer model is sharded across devices.

In [10]:
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", 
    subfolder="transformer",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    vae=None,
    transformer=transformer,
    torch_dtype=torch.bfloat16
)

print("Running denoising.")
height, width = 768, 1360
latents = pipeline(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    num_inference_steps=50,
    guidance_scale=3.5,
    height=height,
    width=width,
    output_type="latent",
).images
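
The "fill the fastest device first, then spill over" idea behind `device_map="auto"` can be sketched with a toy placement function. This is only an illustration under simplifying assumptions, not Accelerate's algorithm (a real implementation has to handle more details): `assign_devices`, the block names, and the sizes in GB are all hypothetical.

```python
def assign_devices(layer_sizes_gb, budgets_gb):
    """Greedily place layers on devices, visiting devices in priority order."""
    remaining = dict(budgets_gb)  # insertion order is the priority order
    devices = list(remaining)
    current = 0
    placement = {}
    for name, size in layer_sizes_gb.items():
        # Move on to the next (slower) device once the current one is full.
        while current < len(devices) and remaining[devices[current]] < size:
            current += 1
        if current == len(devices):
            raise MemoryError(f"no device has room for {name}")
        placement[name] = devices[current]
        remaining[devices[current]] -= size
    return placement


layers = {f"block_{i}": 6 for i in range(5)}  # five hypothetical 6GB blocks
budgets = {0: 16, 1: 16, "cpu": 64}           # two 16GB GPUs, then CPU RAM
placement = assign_devices(layers, budgets)
print(placement)
```

In this toy run the last block lands on the CPU, which is the situation where inference slows down because parameters live on a slower device.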

- Remove the pipeline and transformer from memory as they’re no longer needed.

In [None]:
del pipeline.transformer
del pipeline

flush()

- Decode the latents with the VAE into an image. The VAE is typically small enough to be loaded on a single GPU.

In [None]:
ckpt_id = "black-forest-labs/FLUX.1-dev"  # same checkpoint as the previous steps
vae = AutoencoderKL.from_pretrained(ckpt_id, subfolder="vae", torch_dtype=torch.bfloat16).to("cuda")
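# assumed to match how FluxPipeline derives the factor in this Diffusers version (8 for this VAE)
vae_scale_factor = 2 ** (len(vae.config.block_out_channels) - 1)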
image_processor = VaeImageProcessor(vae_scale_factor=vae_scale_factor)

with torch.no_grad():
    print("Running decoding.")
    latents = FluxPipeline._unpack_latents(latents, height, width, vae_scale_factor)
    latents = (latents / vae.config.scaling_factor) + vae.config.shift_factor

    image = vae.decode(latents, return_dict=False)[0]
    image = image_processor.postprocess(image, output_type="pil")
    image[0].save("split_transformer.png")

-----------
### **Merge LoRAs**
- Merging multiple LoRA weights together can produce images that are a blend of different styles.
- To improve inference speed and reduce the memory usage of merged LoRAs, you’ll also see how to use the `fuse_lora()` method to fuse the LoRA weights with the original weights of the underlying model.
  - Load a `Stable Diffusion XL (SDXL)` checkpoint and the `ostris/ikea-instructions-lora-sdxl` and `lordjia/by-feng-zikai` LoRAs with the `load_lora_weights()` method.
  - You’ll need to assign each LoRA an `adapter_name` to combine them later.

In [None]:
pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea")
pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng")

#### `set_adapters`
- The `set_adapters()` method merges LoRA adapters by concatenating their weighted matrices.
- Use the adapter name to specify which LoRAs to merge, and the `adapter_weights` parameter to control the scaling for each LoRA.
  - If `adapter_weights=[0.5, 0.5]`, then the merged LoRA output is an average of both LoRAs.
  - Try adjusting the adapter weights to see how it affects the generated image!

In [None]:
pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8])

generator = torch.manual_seed(0)
prompt = "A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai"
image = pipeline(prompt, generator=generator, cross_attention_kwargs={"scale": 1.0}).images[0]
image
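
To build intuition for what the adapter weights do, the sketch below treats each LoRA as a low-rank update `B @ A` and combines the updates as a weighted sum. This is an assumption about the arithmetic for illustration only, not PEFT's code: `matmul`, `weighted_lora_delta`, and the tiny matrices are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def weighted_lora_delta(adapters, weights):
    """Return the sum of w_i * (B_i @ A_i) over all adapters."""
    total = None
    for (B, A), w in zip(adapters, weights):
        scaled = [[w * v for v in row] for row in matmul(B, A)]
        if total is None:
            total = scaled
        else:
            total = [[t + s for t, s in zip(t_row, s_row)] for t_row, s_row in zip(total, scaled)]
    return total


# Two rank-1 toy adapters acting on a 2x2 weight matrix.
ikea = ([[1], [0]], [[2, 0]])
feng = ([[0], [1]], [[0, 4]])

merged = weighted_lora_delta([ikea, feng], weights=[0.5, 0.5])
print(merged)
```

With equal weights of 0.5 the merged update is the average of the two individual updates; raising one weight shifts the result toward that adapter's style.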

#### `add_weighted_adapter`
- There are three steps to merge LoRAs with the `add_weighted_adapter` method:
  - Create a `PeftModel` from the underlying model and LoRA checkpoint.
  - Load a base UNet model and the LoRA adapters.
  - Merge the adapters using the `add_weighted_adapter` method and the merging method of your choice.
- Start by loading a UNet that corresponds to the UNet in the LoRA checkpoint.

In [None]:
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
    subfolder="unet",
).to("cuda")

- Load the SDXL pipeline and the LoRA checkpoints, starting with the `ostris/ikea-instructions-lora-sdxl` LoRA.

In [None]:
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    variant="fp16",
    torch_dtype=torch.float16,
    unet=unet
).to("cuda")
pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea")

- Create a `PeftModel` from the loaded LoRA checkpoint by combining the SDXL UNet and the LoRA UNet from the pipeline.

In [14]:
sdxl_unet = copy.deepcopy(unet)
ikea_peft_model = get_peft_model(
    sdxl_unet,
    pipeline.unet.peft_config["ikea"],
    adapter_name="ikea"
)

original_state_dict = {f"base_model.model.{k}": v for k, v in pipeline.unet.state_dict().items()}
ikea_peft_model.load_state_dict(original_state_dict, strict=True)

- Repeat this process to create a `PeftModel` from the `lordjia/by-feng-zikai` LoRA.

In [None]:
pipeline.delete_adapters("ikea")
sdxl_unet.delete_adapters("ikea")

pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng")
pipeline.set_adapters(adapter_names="feng")

feng_peft_model = get_peft_model(
    sdxl_unet,
    pipeline.unet.peft_config["feng"],
    adapter_name="feng"
)

original_state_dict = {f"base_model.model.{k}": v for k, v in pipeline.unet.state_dict().items()}
feng_peft_model.load_state_dict(original_state_dict, strict=True)

- Load a base UNet model and then load the adapters onto it.

In [None]:
base_unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
    subfolder="unet",
).to("cuda")

model = PeftModel.from_pretrained(base_unet, "stevhliu/ikea_peft_model", use_safetensors=True, subfolder="ikea", adapter_name="ikea")
model.load_adapter("stevhliu/feng_peft_model", use_safetensors=True, subfolder="feng", adapter_name="feng")

- Merge the adapters using the `add_weighted_adapter` method and the merging method of your choice (learn more about other merging methods in this blog post).

In [None]:
model.add_weighted_adapter(
    adapters=["ikea", "feng"],
    weights=[1.0, 1.0],
    combination_type="dare_linear",
    adapter_name="ikea-feng"
)
model.set_adapters("ikea-feng")

model = model.to(dtype=torch.float16, device="cuda")

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", unet=model, variant="fp16", torch_dtype=torch.float16,
).to("cuda")

image = pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai", generator=torch.manual_seed(0)).images[0]
image

#### fuse_lora
- Both the `set_adapters()` and `add_weighted_adapter` methods require loading the base model and the LoRA adapters separately which incurs some overhead.
  - The `fuse_lora()` method allows you to **fuse the LoRA weights directly with the original weights of the underlying model**.
  - You’re only loading the model once, which can increase inference speed and lower memory usage.

- You can use PEFT to easily fuse or unfuse multiple adapters directly into the model weights (both UNet and text encoder) using the `fuse_lora()` method, which can lead to a speed-up in inference and lower VRAM usage.

- Fuse these LoRAs into the UNet with the `fuse_lora()` method.
  - The `lora_scale` parameter controls how much to scale the output by with the LoRA weights.
  - It is important to make the `lora_scale` adjustments in the `fuse_lora()` method because it won’t work if you try to pass scale to the `cross_attention_kwargs` in the pipeline.

In [None]:
pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea")
pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng")

pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8])

pipeline.fuse_lora(adapter_names=["ikea", "feng"], lora_scale=1.0)

- Then you should use `unload_lora_weights()` to unload the LoRA weights since they’ve already been fused with the underlying base model.
- Call `save_pretrained()` to save the fused pipeline locally or you could call `push_to_hub()` to push the fused pipeline to the Hub.

In [None]:
pipeline.unload_lora_weights()
# save locally
pipeline.save_pretrained("path/to/fused-pipeline")
# save to the Hub
pipeline.push_to_hub("fused-ikea-feng")

- You can quickly load the fused pipeline and use it for inference without needing to separately load the LoRA adapters.

In [None]:
pipeline = DiffusionPipeline.from_pretrained(
    "username/fused-ikea-feng", torch_dtype=torch.float16,
).to("cuda")

image = pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai", generator=torch.manual_seed(0)).images[0]
image

- You can call `unfuse_lora()` to restore the original model’s weights (for example, if you want to use a different `lora_scale` value).
  - This only works if you’ve only fused one LoRA adapter to the original model.
  - If you’ve fused multiple LoRAs, you’ll need to reload the model.

In [None]:
pipeline.unfuse_lora()
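
Arithmetically, fusing can be thought of as adding the scaled low-rank update to the base weight, and unfusing as subtracting it again. The sketch below illustrates that round trip on a toy 2x2 weight, assuming the standard LoRA update `W + lora_scale * (B @ A)`; `matmul`, `fuse`, `unfuse`, and the matrices are hypothetical and this is not the Diffusers implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse(weight, B, A, lora_scale=1.0):
    """Return weight + lora_scale * (B @ A)."""
    delta = matmul(B, A)
    return [[w + lora_scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]


def unfuse(fused, B, A, lora_scale=1.0):
    """Undo `fuse` by subtracting the same scaled update."""
    delta = matmul(B, A)
    return [[w - lora_scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(fused, delta)]


base = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]

fused = fuse(base, B, A, lora_scale=0.5)
restored = unfuse(fused, B, A, lora_scale=0.5)
print(fused, restored)
```

After fusing, a forward pass only needs the single fused matrix, which is where the speed and memory savings come from. Unfusing needs the original `B`, `A`, and scale to still be available.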

#### torch.compile
- `torch.compile` can speed up your pipeline even more, but the LoRA weights must be fused first and then unloaded.
  - The UNet is compiled because it is such a computationally intensive component of the pipeline.

In [None]:
# load base model and LoRAs
pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea")
pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng")

# activate both LoRAs and set adapter weights
pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8])

# fuse LoRAs and unload weights
pipeline.fuse_lora(adapter_names=["ikea", "feng"], lora_scale=1.0)
pipeline.unload_lora_weights()

# torch.compile
pipeline.unet.to(memory_format=torch.channels_last)
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)

image = pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai", generator=torch.manual_seed(0)).images[0]

-----------
### **Scheduler Features**
- The scheduler is an important component of any diffusion model because it controls the entire denoising (or sampling) process.
  - With Diffusers, you can modify the scheduler configuration to use custom noise schedules, sigmas, and rescale the noise schedule.
  - Changing these parameters can have profound effects on inference quality and speed.

#### Timestep schedules
- The timestep or noise schedule determines the amount of noise at each sampling step.
  - The scheduler uses this to generate an image with the corresponding amount of noise at each step.

- `Align Your Steps (AYS)` is a method for optimizing a sampling schedule to generate a high-quality image in as little as 10 steps.
  - The optimal 10-step schedule for Stable Diffusion XL is:

In [None]:
sampling_schedule = AysSchedules["StableDiffusionXLTimesteps"]
print(sampling_schedule)

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "SG161222/RealVisXL_V4.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config, algorithm_type="sde-dpmsolver++")

prompt = "A cinematic shot of a cute little rabbit wearing a jacket and doing a thumbs up"
generator = torch.Generator(device="cpu").manual_seed(2487854446)
image = pipeline(
    prompt=prompt,
    negative_prompt="",
    generator=generator,
    timesteps=sampling_schedule,
).images[0]

#### Timestep spacing
- The way sample steps are selected in the schedule can affect the quality of the generated image, especially w.r.t rescaling the noise schedule, which can enable a model to generate much brighter or darker images.
- Diffusers provides three timestep spacing methods:
  - `Leading` creates evenly spaced steps
  - `Linspace` includes the first and last steps and evenly selects the remaining intermediate steps
  - `Trailing` only includes the last step and evenly selects the remaining intermediate steps starting from the end
    - It is recommended to use the `trailing` spacing method because it generates higher quality images with more details when there are fewer sample steps.
    - But the difference in quality is not as obvious for more standard sample step values.

In [None]:
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "SG161222/RealVisXL_V4.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config, timestep_spacing="trailing")

prompt = "A cinematic shot of a cute little black cat sitting on a pumpkin at night"
generator = torch.Generator(device="cpu").manual_seed(2487854446)
image = pipeline(
    prompt=prompt,
    negative_prompt="",
    generator=generator,
    num_inference_steps=5,
).images[0]
image
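
The three spacing methods are easiest to compare on concrete numbers. The following is a simplified sketch modeled on common scheduler formulas, assuming 1000 training timesteps and 5 inference steps; real schedulers differ in details such as an extra step offset, and the function names here are hypothetical.

```python
def leading(num_train_timesteps, num_inference_steps, steps_offset=0):
    """Evenly spaced steps starting from 0, so the final train timestep is not visited."""
    ratio = num_train_timesteps // num_inference_steps
    return [i * ratio + steps_offset for i in reversed(range(num_inference_steps))]


def linspace(num_train_timesteps, num_inference_steps):
    """Include both the first and last train timesteps, evenly spaced in between."""
    if num_inference_steps == 1:
        return [num_train_timesteps - 1]
    step = (num_train_timesteps - 1) / (num_inference_steps - 1)
    return [round(i * step) for i in reversed(range(num_inference_steps))]


def trailing(num_train_timesteps, num_inference_steps):
    """Start from the last train timestep and step backwards evenly."""
    ratio = num_train_timesteps / num_inference_steps
    return [round(num_train_timesteps - i * ratio) - 1 for i in range(num_inference_steps)]


print(leading(1000, 5))
print(linspace(1000, 5))
print(trailing(1000, 5))
```

Only `linspace` and `trailing` begin at the noisiest timestep (999 here), which matters most when there are very few sampling steps.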

#### Sigmas
- The `sigmas` parameter is **the amount of noise added at each timestep according to the timestep schedule**.
  - Like the timesteps parameter, you can customize the sigmas parameter to control how much noise is added at each step.
  - When you use a custom sigmas value, the timesteps are calculated from the custom sigmas value and the default scheduler configuration is ignored.

- You can manually pass the sigmas for something like the `10-step AYS schedule` from before to the pipeline.

In [None]:
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipeline = DiffusionPipeline.from_pretrained(
  model_id,
  torch_dtype=torch.float16,
  variant="fp16",
).to("cuda")
pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)

sigmas = [14.615, 6.315, 3.771, 2.181, 1.342, 0.862, 0.555, 0.380, 0.234, 0.113, 0.0]
prompt = "anthropomorphic capybara wearing a suit and working with a computer"
generator = torch.Generator(device='cuda').manual_seed(123)
image = pipeline(
    prompt=prompt,
    num_inference_steps=10,
    sigmas=sigmas,
    generator=generator
).images[0]

print(f" timesteps: {pipeline.scheduler.timesteps}")
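
To see where values like the leading `14.615` come from, the sketch below computes sigmas from a training noise schedule using `sigma = sqrt((1 - alpha_bar) / alpha_bar)`. The schedule settings (a scaled-linear beta schedule from 0.00085 to 0.012 over 1000 steps) are an assumption based on common Stable Diffusion configurations, not something stated in this section, and `training_sigmas` is a hypothetical helper.

```python
import math


def training_sigmas(num_train_timesteps=1000, beta_start=0.00085, beta_end=0.012):
    """Sigma at each training timestep for an assumed scaled-linear beta schedule."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    sigmas = []
    alpha_bar = 1.0
    for i in range(num_train_timesteps):
        # Scaled-linear: interpolate in sqrt(beta) space, then square.
        beta = (lo + (hi - lo) * i / (num_train_timesteps - 1)) ** 2
        alpha_bar *= 1.0 - beta
        sigmas.append(math.sqrt((1.0 - alpha_bar) / alpha_bar))
    return sigmas


sigmas_train = training_sigmas()
sigma_min, sigma_max = sigmas_train[0], sigmas_train[-1]
print(round(sigma_min, 4), round(sigma_max, 4))
```

Under these assumptions the largest sigma is close to the first entry of the custom list above, and the trailing `0.0` in that list corresponds to the fully denoised sample.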

#### Karras sigmas
- Karras schedulers use the timestep schedule and sigmas from the "Elucidating the Design Space of Diffusion-Based Generative Models" paper.
  - This scheduler variant applies a smaller amount of noise per step as it approaches the end of the sampling process compared to other schedulers, and can increase the level of detail in the generated image.

- Enable Karras sigmas by setting `use_karras_sigmas=True` in the scheduler.
- Karras sigmas should not be used for models that weren’t trained with them.
  - ex. The base `Stable Diffusion XL` model shouldn’t use Karras sigmas, but the `DreamShaperXL` model can since it was trained with Karras sigmas.


In [None]:
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "SG161222/RealVisXL_V4.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config, algorithm_type="sde-dpmsolver++", use_karras_sigmas=True)

prompt = "A cinematic shot of a cute little rabbit wearing a jacket and doing a thumbs up"
generator = torch.Generator(device="cpu").manual_seed(2487854446)
image = pipeline(
    prompt=prompt,
    negative_prompt="",
    generator=generator,
).images[0]
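
The shape of a Karras schedule can be checked numerically. The sketch below uses the interpolation formula from the paper with `rho=7`; the `sigma_min` and `sigma_max` defaults are assumed values typical of Stable Diffusion models, and `karras_sigmas` is a hypothetical helper rather than the scheduler's own method.

```python
def karras_sigmas(num_steps, sigma_min=0.0292, sigma_max=14.6146, rho=7.0):
    """Interpolate linearly in sigma**(1/rho) space, then map back with **rho."""
    min_inv = sigma_min ** (1 / rho)
    max_inv = sigma_max ** (1 / rho)
    return [
        (max_inv + i / (num_steps - 1) * (min_inv - max_inv)) ** rho
        for i in range(num_steps)
    ]


sigmas_karras = karras_sigmas(10)
# Amount of noise removed between consecutive steps.
gaps = [a - b for a, b in zip(sigmas_karras, sigmas_karras[1:])]
print([round(s, 3) for s in sigmas_karras])
print([round(g, 3) for g in gaps])
```

The gaps shrink steadily toward the end of the schedule, which matches the description of smaller noise changes per step late in sampling.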

#### Rescale noise schedule
- Common noise schedules allow some signal to leak into the last timestep.
  - This signal leakage at inference can cause models to only generate images with medium brightness.
  - By enforcing a zero signal-to-noise ratio (SNR) for the timestep schedule and sampling from the last timestep, the model can be improved to generate very bright or dark images.

- Load the `ptx0/pseudo-journey-v2` checkpoint, which was trained with `v_prediction`, and set the following parameters in the `DDIMScheduler`:
  - `rescale_betas_zero_snr=True` to rescale the noise schedule to zero SNR
  - `timestep_spacing="trailing"` to start sampling from the last timestep

- Set `guidance_rescale` in the pipeline to prevent over-exposure.
  - A lower value increases brightness but some of the details may appear washed out.



In [None]:
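# load the checkpoint named above before swapping in the rescaled scheduler
pipeline = DiffusionPipeline.from_pretrained("ptx0/pseudo-journey-v2")
pipeline.scheduler = DDIMScheduler.from_config(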
    pipeline.scheduler.config, rescale_betas_zero_snr=True, timestep_spacing="trailing"
)
pipeline.to("cuda")
prompt = "cinematic photo of a snowy mountain at night with the northern lights aurora borealis overhead, 35mm photograph, film, professional, 4k, highly detailed"
generator = torch.Generator(device="cpu").manual_seed(23)
image = pipeline(prompt, guidance_rescale=0.7, generator=generator).images[0]
image
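
The rescaling step itself is short. The sketch below follows the usual shift-and-scale recipe on the square roots of the cumulative alphas: shift so the final value becomes zero, then scale so the first value is unchanged. It works on a tiny made-up schedule; `rescale_alphas_bar_zero_snr` and the example values are hypothetical, and the scheduler's own implementation works on betas rather than on this list.

```python
import math


def rescale_alphas_bar_zero_snr(alphas_bar):
    """Rescale cumulative alphas so the last timestep carries no signal."""
    sqrt_bar = [math.sqrt(a) for a in alphas_bar]
    first, last = sqrt_bar[0], sqrt_bar[-1]
    # Shift so the last value is zero, then scale so the first value is preserved.
    rescaled = [(s - last) * first / (first - last) for s in sqrt_bar]
    return [s * s for s in rescaled]


def snr(alpha_bar):
    """Signal-to-noise ratio for a given cumulative alpha."""
    return alpha_bar / (1.0 - alpha_bar)


alphas_bar = [0.99, 0.7, 0.3, 0.005]  # toy schedule with leftover signal at the end
rescaled = rescale_alphas_bar_zero_snr(alphas_bar)
print(snr(alphas_bar[-1]), snr(rescaled[-1]))
```

Before rescaling, the last timestep still has a small nonzero SNR; afterwards it is exactly zero, so sampling has to start from that last timestep (hence `timestep_spacing="trailing"`).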

-----------
### **Pipeline Callbacks**
- The denoising loop of a pipeline can be modified with custom defined functions using the `callback_on_step_end` parameter.
  - The callback function is executed at the end of each step, and modifies the pipeline attributes and variables for the next step.
  - This is really useful for dynamically adjusting certain pipeline attributes or modifying tensor variables.


#### Official callbacks
- We provide a list of callbacks you can plug into an existing pipeline and modify the denoising loop.
  - `SDCFGCutoffCallback`: Disables the CFG after a certain number of steps for all SD 1.5 pipelines, including text-to-image, image-to-image, inpaint, and controlnet.
  - `SDXLCFGCutoffCallback`: Disables the CFG after a certain number of steps for all SDXL pipelines, including text-to-image, image-to-image, inpaint, and controlnet.
  - `IPAdapterScaleCutoffCallback`: Disables the IP Adapter after a certain number of steps for all pipelines supporting IP-Adapter.
- If you want to add a new official callback, feel free to open a feature request or submit a PR.

- To set up a callback, you need to specify the number of denoising steps after which the callback comes into effect.
  - `cutoff_step_ratio`: Float number with the ratio of the steps.
  - `cutoff_step_index`: Integer number with the exact number of the step.

In [None]:
callback = SDXLCFGCutoffCallback(cutoff_step_ratio=0.4)
# can also be used with cutoff_step_index
# callback = SDXLCFGCutoffCallback(cutoff_step_ratio=None, cutoff_step_index=10)

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config, use_karras_sigmas=True)

prompt = "a sports car at the road, best quality, high quality, high detail, 8k resolution"

generator = torch.Generator(device="cpu").manual_seed(2628670641)

out = pipeline(
    prompt=prompt,
    negative_prompt="",
    guidance_scale=6.5,
    num_inference_steps=25,
    generator=generator,
    callback_on_step_end=callback,
)

out.images[0].save("official_callback.png")
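
The two cutoff parameters describe the same thing in different units. A minimal sketch of how they could resolve to a single step number is shown below, assuming an explicit index takes precedence over the ratio; `resolve_cutoff_step` is a hypothetical helper, not part of the callbacks API.

```python
def resolve_cutoff_step(num_timesteps, cutoff_step_ratio=1.0, cutoff_step_index=None):
    """Return the step at which the callback takes effect."""
    if cutoff_step_index is not None:
        # An explicit step index is used as-is.
        return cutoff_step_index
    # Otherwise convert the ratio into a step count, rounding down.
    return int(num_timesteps * cutoff_step_ratio)


by_ratio = resolve_cutoff_step(25, cutoff_step_ratio=0.4)
by_index = resolve_cutoff_step(25, cutoff_step_ratio=None, cutoff_step_index=10)
print(by_ratio, by_index)
```

With 25 inference steps, a ratio of 0.4 and an index of 10 both point at the same step, which is why the two forms in the example above are interchangeable.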

#### Dynamic classifier-free guidance
- Dynamic classifier-free guidance (CFG) is a feature that allows you to disable CFG after a certain number of inference steps which can help you save compute with minimal cost to performance.
  - `pipeline` (or the pipeline instance) provides access to important properties such as `num_timesteps` and `guidance_scale`.
    - You can modify these properties by updating the underlying attributes. You’ll disable CFG by setting `pipeline._guidance_scale=0.0`.
  - `step_index` and `timestep` tell you where you are in the denoising loop.
    - Use `step_index` to turn off CFG after reaching 40% of `num_timesteps`.
  - `callback_kwargs` is a dict that contains tensor variables you can modify during the denoising loop.
    - It only includes variables specified in the `callback_on_step_end_tensor_inputs` argument, which is passed to the pipeline’s `__call__` method.
    - Different pipelines may use different sets of variables, so please check a pipeline’s `_callback_tensor_inputs` attribute for the list of variables you can modify.
    - Some common variables include `latents` and `prompt_embeds`. For this function, change the batch size of `prompt_embeds` after setting `guidance_scale=0.0` in order for it to work properly.

In [None]:
def callback_dynamic_cfg(pipeline, step_index, timestep, callback_kwargs):
    # once 40% of the steps are done, disable CFG for the remaining steps
    if step_index == int(pipeline.num_timesteps * 0.4):
        # keep only the conditional half of the batch
        prompt_embeds = callback_kwargs["prompt_embeds"]
        prompt_embeds = prompt_embeds.chunk(2)[-1]

        # update guidance_scale and prompt_embeds
        pipeline._guidance_scale = 0.0
        callback_kwargs["prompt_embeds"] = prompt_embeds
    return callback_kwargs

- Pass the callback function to the `callback_on_step_end` parameter and the `prompt_embeds` to `callback_on_step_end_tensor_inputs`.

In [None]:
pipeline = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipeline = pipeline.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"

generator = torch.Generator(device="cuda").manual_seed(1)
out = pipeline(
    prompt,
    generator=generator,
    callback_on_step_end=callback_dynamic_cfg,
    callback_on_step_end_tensor_inputs=['prompt_embeds']
)

out.images[0].save("out_custom_cfg.png")

#### Interrupt the diffusion process
- Stopping the diffusion process early is useful when building UIs that work with Diffusers because it allows users to stop the generation process if they’re unhappy with the intermediate results.
  - You can incorporate this into your pipeline with a callback.

- This callback function should take the following arguments: `pipeline`, `i`, `t`, and `callback_kwargs` (this must be returned).
  - Set the pipeline’s `_interrupt` attribute to `True` to stop the diffusion process after a certain number of steps.

In [None]:
pipeline = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5")
pipeline.enable_model_cpu_offload()
num_inference_steps = 50

def interrupt_callback(pipeline, i, t, callback_kwargs):
    stop_idx = 10
    if i == stop_idx:
        pipeline._interrupt = True

    return callback_kwargs

pipeline(
    "A photo of a cat",
    num_inference_steps=num_inference_steps,
    callback_on_step_end=interrupt_callback,
)
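
- To see why setting `_interrupt` stops generation, here is a toy denoising loop. It is only a sketch: `ToyPipeline` is a hypothetical stand-in, not a Diffusers class, and it assumes the real loop simply skips the remaining steps once the flag is set.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline; not part of Diffusers."""

    def __init__(self):
        self._interrupt = False
        self.steps_run = 0

    def __call__(self, num_inference_steps, callback_on_step_end=None):
        self._interrupt = False
        self.steps_run = 0
        for i in range(num_inference_steps):
            # Skip the remaining work once a callback has requested a stop.
            if self._interrupt:
                continue
            self.steps_run += 1  # placeholder for one denoising step
            if callback_on_step_end is not None:
                callback_on_step_end(self, i, None, {})
        return self.steps_run


def interrupt_callback(pipeline, i, t, callback_kwargs):
    # Request a stop after the step with index 10 has finished.
    if i == 10:
        pipeline._interrupt = True
    return callback_kwargs


toy = ToyPipeline()
steps = toy(50, callback_on_step_end=interrupt_callback)
print(steps)  # 11 (steps 0 through 10 ran, the rest were skipped)
```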

#### IP Adapter Cutoff
- IP Adapter is an image prompt adapter that can be used for diffusion models without any changes to the underlying model.
  - We can use the IP Adapter Cutoff Callback to disable the IP Adapter after a certain number of steps.
  - To set up the callback, you need to specify the number of denoising steps after which the callback comes into effect.
    - `cutoff_step_ratio`: Float number with the ratio of the steps.
    - `cutoff_step_index`: Integer number with the exact number of the step.
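
- The two options can be thought of as resolving to a single step number. The helper below is a hypothetical sketch (`resolve_cutoff_step` is not a Diffusers function) and assumes an explicit index takes precedence over the ratio.

```python
def resolve_cutoff_step(num_timesteps, cutoff_step_ratio=1.0, cutoff_step_index=None):
    """Hypothetical helper: an explicit index wins, otherwise the ratio is scaled by the step count."""
    if cutoff_step_index is not None:
        return cutoff_step_index
    return int(num_timesteps * cutoff_step_ratio)


print(resolve_cutoff_step(50, cutoff_step_index=5))    # 5
print(resolve_cutoff_step(50, cutoff_step_ratio=0.4))  # 20
```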

In [None]:
pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", 
    torch_dtype=torch.float16
).to("cuda")


pipeline.load_ip_adapter(
    "h94/IP-Adapter", 
    subfolder="sdxl_models", 
    weight_name="ip-adapter_sdxl.bin"
)

pipeline.set_ip_adapter_scale(0.6)


callback = IPAdapterScaleCutoffCallback(
    cutoff_step_ratio=None, 
    cutoff_step_index=5
)

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png"
)

generator = torch.Generator(device="cuda").manual_seed(2628670641)

images = pipeline(
    prompt="a tiger sitting in a chair drinking orange juice",
    ip_adapter_image=image,
    negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
    generator=generator,
    num_inference_steps=50,
    callback_on_step_end=callback,
).images

images[0].save("custom_callback_img.png")

#### Display image after each generation step
- Display an image after each generation step by accessing and converting the latents after each step into an image.
  - The latent space is compressed to 128x128, so the images are also 128x128 which is useful for a quick preview.
 
- Use the function below to convert the SDXL latents (4 channels) to RGB tensors (3 channels).

In [None]:
from PIL import Image

def latents_to_rgb(latents):
    weights = (
        (60, -60, 25, -70),
        (60,  -5, 15, -50),
        (60,  10, -5, -35),
    )

    weights_tensor = torch.t(torch.tensor(weights, dtype=latents.dtype).to(latents.device))
    biases_tensor = torch.tensor((150, 140, 130), dtype=latents.dtype).to(latents.device)
    rgb_tensor = torch.einsum("...lxy,lr -> ...rxy", latents, weights_tensor) + biases_tensor.unsqueeze(-1).unsqueeze(-1)
    image_array = rgb_tensor.clamp(0, 255).byte().cpu().numpy().transpose(1, 2, 0)

    return Image.fromarray(image_array)
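
- The `einsum` in `latents_to_rgb` is a per-pixel linear projection from 4 latent channels to 3 color channels. The sketch below checks that reading against an explicit matrix multiplication; the random `weights` are placeholders, not the coefficients used above.

```python
import torch

latents = torch.randn(4, 128, 128)  # (latent channels, height, width)
weights = torch.randn(4, 3)         # placeholder 4 -> 3 projection matrix

# Same contraction pattern as in latents_to_rgb: sum over the latent channel l.
rgb = torch.einsum("...lxy,lr -> ...rxy", latents, weights)

# Equivalent formulation: move channels last, multiply, move channels first again.
rgb_manual = (latents.permute(1, 2, 0) @ weights).permute(2, 0, 1)

print(rgb.shape)                                    # torch.Size([3, 128, 128])
print(torch.allclose(rgb, rgb_manual, atol=1e-4))   # True

# Height x width x channels layout, as expected by image libraries.
hwc = rgb.permute(1, 2, 0)
print(hwc.shape)                                    # torch.Size([128, 128, 3])
```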

- Create a function to decode and save the latents into an image.

In [None]:
def decode_tensors(pipe, step, timestep, callback_kwargs):
    latents = callback_kwargs["latents"]

    image = latents_to_rgb(latents[0])
    image.save(f"{step}.png")

    return callback_kwargs

- Pass the `decode_tensors` function to the `callback_on_step_end` parameter to decode the tensors after each step.
  - You also need to specify what you want to modify in the `callback_on_step_end_tensor_inputs` parameter, which in this case is the latents.

In [None]:
from diffusers import AutoPipelineForText2Image
import torch
from PIL import Image

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
).to("cuda")

image = pipeline(
    prompt="A croissant shaped like a cute bear.",
    negative_prompt="Deformed, ugly, bad anatomy",
    callback_on_step_end=decode_tensors,
    callback_on_step_end_tensor_inputs=["latents"],
).images[0]

-----------
### **Reproducible Pipelines**
- Diffusion models are inherently random, which is what allows them to generate different outputs every time they are run.
  - But there are certain times when you want to generate the same output every time, like when you’re testing, replicating results, and even improving image quality.
  - While you can’t expect to get identical results across platforms, you can expect reproducible results across releases and platforms within a certain tolerance range (though even this may vary).

- This section shows how to control randomness for deterministic generation on a CPU and GPU.

#### Control randomness
- During inference, pipelines rely heavily on random sampling operations which include creating the Gaussian noise tensors to denoise and adding noise to the scheduling step.
  - Take a look at the tensor values in the `DDIMPipeline` after two inference steps.

In [None]:
import numpy as np

ddim = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32", use_safetensors=True)
image = ddim(num_inference_steps=2, output_type="np").images
print(np.abs(image).sum())

- Each time the pipeline is run, `torch.randn` uses a different random seed to create the Gaussian noise tensors.
  - This leads to a different result each time it is run and enables the diffusion pipeline to generate a different random image each time.
  - But if you need to reliably generate the same image, that depends on whether you’re running the pipeline on a CPU or GPU.

- It might seem unintuitive to pass Generator objects to a pipeline instead of the integer value representing the seed.
  - However, this is the recommended design when working with probabilistic models in PyTorch because a Generator is a random state that can be passed to multiple pipelines in a sequence.
  - As soon as the Generator is consumed, the state is changed in place which means even if you passed the same Generator to a different pipeline, it won’t produce the same result because the state is already changed.
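
- The in-place state change is easy to observe with plain PyTorch, without any pipeline. A minimal sketch:

```python
import torch

generator = torch.Generator(device="cpu").manual_seed(0)

first = torch.randn(3, generator=generator)
second = torch.randn(3, generator=generator)  # the state has advanced, so this differs

# Re-seeding resets the state and replays the same numbers.
generator.manual_seed(0)
replay = torch.randn(3, generator=generator)

print(torch.equal(first, second))  # False
print(torch.equal(first, replay))  # True
```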

- To generate reproducible results on a **CPU**, you’ll need to use a PyTorch Generator and set a seed.
  - Now when you run the code, it always prints a value of 1491.1711 because the Generator object with the seed is passed to all the random functions in the pipeline.

In [None]:
ddim = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32", use_safetensors=True)
generator = torch.Generator(device="cpu").manual_seed(0)
image = ddim(num_inference_steps=2, output_type="np", generator=generator).images
print(np.abs(image).sum())

- Writing a reproducible pipeline on a **GPU** is a bit trickier, and full reproducibility across different hardware is not guaranteed because matrix multiplication (which diffusion pipelines require a lot of) is less deterministic on a GPU than a CPU.
  - If you run the same code example from the CPU example, you’ll get a different result even though the seed is identical.
  - This is because the GPU uses a different random number generator than the CPU.

In [None]:
ddim = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32", use_safetensors=True)
ddim.to("cuda")
generator = torch.Generator(device="cuda").manual_seed(0)
image = ddim(num_inference_steps=2, output_type="np", generator=generator).images
print(np.abs(image).sum())

- To avoid this issue, Diffusers has a `randn_tensor()` function for creating random noise on the CPU, and then moving the tensor to a GPU if necessary.
  - The `randn_tensor()` function is used everywhere inside the pipeline.
  - Now you can call `torch.manual_seed` which automatically creates a CPU Generator that can be passed to the pipeline even if it is being run on a GPU.

In [None]:
ddim = DDIMPipeline.from_pretrained("google/ddpm-cifar10-32", use_safetensors=True)
ddim.to("cuda")
generator = torch.manual_seed(0)
image = ddim(num_inference_steps=2, output_type="np", generator=generator).images
print(np.abs(image).sum())
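
- The idea behind sampling noise on the CPU and moving it afterwards can be sketched in a few lines. `cpu_seeded_noise` is a hypothetical helper written for illustration; it is a simplification and not the implementation of `randn_tensor()`.

```python
import torch


def cpu_seeded_noise(shape, seed, device):
    """Hypothetical helper: sample on the CPU, then move to the target device."""
    generator = torch.Generator(device="cpu").manual_seed(seed)
    noise = torch.randn(shape, generator=generator, device="cpu")
    return noise.to(device)


# Falls back to the CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

a = cpu_seeded_noise((1, 3, 32, 32), seed=0, device=device)
b = cpu_seeded_noise((1, 3, 32, 32), seed=0, device=device)
print(torch.equal(a, b))  # True
```

  - Because the random numbers always come from the CPU generator, the starting noise is the same regardless of which device runs the rest of the pipeline.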

#### Deterministic algorithms
- You can also configure PyTorch to use deterministic algorithms to create a **reproducible pipeline**.
  - The downside is that deterministic algorithms may be slower than non-deterministic ones and you may observe a decrease in performance.
  - Non-deterministic behavior occurs when operations are launched in more than one CUDA stream.
    - To avoid this, set the environment variable `CUBLAS_WORKSPACE_CONFIG` to `:16:8` to only use one buffer size during runtime.

- PyTorch typically benchmarks multiple algorithms to select the fastest one, but if you want reproducibility, you should disable this feature because the benchmark may select different algorithms each time.
  - Call Diffusers’ `enable_full_determinism()` to enable deterministic algorithms.

In [None]:
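from diffusers import StableDiffusionPipeline
from diffusers.utils.testing_utils import enable_full_determinism

enable_full_determinism()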
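
- For reference, the settings discussed above can also be applied by hand with plain PyTorch. This is a sketch of the kind of configuration involved; the exact set of flags that `enable_full_determinism()` applies is an assumption here and may differ between releases.

```python
import os

import torch

# Use a single cuBLAS workspace size; set this before any CUDA work starts.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"

# Ask PyTorch to raise an error instead of silently using non-deterministic kernels.
torch.use_deterministic_algorithms(True)

# Disable cuDNN autotuning, which may pick a different algorithm on each run.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

print(torch.are_deterministic_algorithms_enabled())  # True
```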

- Now when you run the same pipeline twice with the same seed, you’ll get identical results.

In [None]:
pipe = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
g = torch.Generator(device="cuda")

prompt = "A bear is playing a guitar on Times Square"

g.manual_seed(0)
result1 = pipe(prompt=prompt, num_inference_steps=50, generator=g, output_type="latent").images

g.manual_seed(0)
result2 = pipe(prompt=prompt, num_inference_steps=50, generator=g, output_type="latent").images

print("L_inf dist =", abs(result1 - result2).max())
"L_inf dist = tensor(0., device='cuda:0')"

#### Deterministic batch generation
- A practical application of creating reproducible pipelines is deterministic batch generation.
- You generate a batch of images and select one image to improve with a more detailed prompt.
  - The main idea is to pass a list of `Generator` objects to the pipeline and tie each `Generator` to a seed so you can reuse it.

In [None]:
from diffusers.utils import make_image_grid

pipeline = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
)
pipeline = pipeline.to("cuda")

generator = [torch.Generator(device="cuda").manual_seed(i) for i in range(4)]
prompt = "Labrador in the style of Vermeer"
# keep the full list of 4 images so it can be arranged in a 2x2 grid
images = pipeline(prompt, generator=generator, num_images_per_prompt=4).images
make_image_grid(images, rows=2, cols=2)

- Let’s improve the first image (you can choose any image you want) which corresponds to the Generator with seed 0.
  - Add some additional text to your prompt and then make sure you reuse the same Generator with seed 0.
  - All the generated images should resemble the first image.

In [None]:
prompt = [prompt + t for t in [", highly realistic", ", artsy", ", trending", ", colorful"]]
generator = [torch.Generator(device="cuda").manual_seed(0) for i in range(4)]
images = pipeline(prompt, generator=generator).images
make_image_grid(images, rows=2, cols=2)
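
- The seed bookkeeping can be checked without running a diffusion model. The sketch below draws one noise tensor per `Generator`; `sample_noise` is a hypothetical helper and the tensor shape is arbitrary.

```python
import torch


def sample_noise(generators, shape=(4, 8, 8)):
    """Hypothetical helper: draw one noise tensor per generator."""
    return torch.stack([torch.randn(shape, generator=g) for g in generators])


# First pass: seeds 0..3 give four different starting points.
batch = sample_noise([torch.Generator().manual_seed(i) for i in range(4)])

# Second pass: seed 0 repeated four times gives four identical starting points.
repeat = sample_noise([torch.Generator().manual_seed(0) for _ in range(4)])

print(torch.equal(batch[0], repeat[0]))   # True
print(torch.equal(repeat[0], repeat[3]))  # True
print(torch.equal(batch[0], batch[1]))    # False
```

  - With identical starting noise, differences between the four refined images come from the prompt changes alone.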

-----------
### **Controlling Image Quality**
- The components of a diffusion model, like the UNet and scheduler, can be optimized to improve the quality of generated images leading to better details.
  - These techniques are especially useful if you don’t have the resources to simply use a larger model for inference.
  - You can enable these techniques during inference without any additional training.

- FreeU improves image details by rebalancing the UNet’s backbone and skip connection weights.
  - The skip connections can cause the model to overlook some of the backbone semantics which may lead to unnatural image details in the generated image.
  - This technique does not require any additional training and can be applied on the fly during inference for tasks like image-to-image and text-to-video.

- Use the `enable_freeu()` method on your pipeline and configure the scaling factors for the backbone (b1 and b2) and skip connections (s1 and s2).
  - The number after each scaling factor corresponds to the stage in the UNet where the factor is applied.
  - Take a look at the FreeU repository for reference hyperparameters for different models.

In [None]:
pipeline = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, safety_checker=None
).to("cuda")
pipeline.enable_freeu(s1=0.9, s2=0.2, b1=1.5, b2=1.6)
generator = torch.Generator(device="cpu").manual_seed(33)
prompt = ""
image = pipeline(prompt, generator=generator).images[0]

pipeline.disable_freeu()

-----------
### **Prompt Techniques**
- Prompts are important because they describe what you want a diffusion model to generate.
  - The best prompts are detailed, specific, and well-structured to help the model realize your vision.
  - But crafting a great prompt takes time and effort and sometimes it may not be enough because language and words can be imprecise.

#### Prompt engineering
- New diffusion models do a pretty good job of generating high-quality images from a basic prompt, but it is still **important to create a well-written prompt to get the best results**.
- Here are a few tips for writing a good prompt:

```
1. What is the image medium? Is it a photo, a painting, a 3D illustration, or something else?
2. What is the image subject? Is it a person, animal, object, or scene?
3. What details would you like to see in the image? This is where you can get really creative and have a lot of fun experimenting with different words to bring your image to life. For example, what is the lighting like? What is the vibe and aesthetic? What kind of art or illustration style are you looking for? The more specific and precise words you use, the better the model will understand what you want to generate.
```
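
- One way to apply the three questions is to assemble the prompt from its parts. `build_prompt` is a hypothetical helper for illustration only, and the example values are made up.

```python
def build_prompt(medium, subject, details):
    """Hypothetical helper that joins medium, subject, and details into one prompt."""
    parts = [f"{medium} of {subject}"] + list(details)
    return ", ".join(parts)


example_prompt = build_prompt(
    medium="a photo",
    subject="an astronaut riding a horse on mars",
    details=["golden hour lighting", "cinematic", "highly detailed"],
)
print(example_prompt)
# a photo of an astronaut riding a horse on mars, golden hour lighting, cinematic, highly detailed
```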

#### Prompt enhancing with GPT2
- Prompt enhancing is a technique for quickly improving prompt quality without spending too much effort constructing one.
  - It uses a model like GPT2 pretrained on Stable Diffusion text prompts to automatically enrich a prompt with additional important keywords to generate high-quality images.

- The technique works by curating a list of specific keywords and forcing the model to generate those words to enhance the original prompt.
  - Your prompt can be “a cat” and GPT2 can enhance the prompt to “cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain quality sharp focus beautiful detailed intricate stunning amazing epic”.
  - You should also use an offset noise LoRA to improve the contrast in bright and dark images and create better lighting overall.

In [None]:
styles = {
    "cinematic": "cinematic film still of {prompt}, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain",
    "anime": "anime artwork of {prompt}, anime style, key visual, vibrant, studio anime, highly detailed",
    "photographic": "cinematic photo of {prompt}, 35mm photograph, film, professional, 4k, highly detailed",
    "comic": "comic of {prompt}, graphic illustration, comic art, graphic novel art, vibrant, highly detailed",
    "lineart": "line art drawing {prompt}, professional, sleek, modern, minimalist, graphic, line art, vector graphics",
    "pixelart": " pixel-art {prompt}, low-res, blocky, pixel art style, 8-bit graphics",
}

words = [
    "aesthetic", "astonishing", "beautiful", "breathtaking", "composition", "contrasted", "epic", "moody", "enhanced",
    "exceptional", "fascinating", "flawless", "glamorous", "glorious", "illumination", "impressive", "improved",
    "inspirational", "magnificent", "majestic", "hyperrealistic", "smooth", "sharp", "focus", "stunning", "detailed",
    "intricate", "dramatic", "high", "quality", "perfect", "light", "ultra", "highly", "radiant", "satisfying",
    "soothing", "sophisticated", "stylish", "sublime", "terrific", "touching", "timeless", "wonderful", "unbelievable",
    "elegant", "awesome", "amazing", "dynamic", "trendy",
]

- You may have noticed in the words list, there are certain words that can be paired together to create something more meaningful.
  - The words “high” and “quality” can be combined to create “high quality”.

In [None]:
word_pairs = ["highly detailed", "high quality", "enhanced quality", "perfect composition", "dynamic light"]

def find_and_order_pairs(s, pairs):
    words = s.split()
    found_pairs = []
    for pair in pairs:
        pair_words = pair.split()
        if pair_words[0] in words and pair_words[1] in words:
            found_pairs.append(pair)
            words.remove(pair_words[0])
            words.remove(pair_words[1])

    for word in words[:]:
        for pair in pairs:
            if word in pair.split():
                words.remove(word)
                break
    ordered_pairs = ", ".join(found_pairs)
    remaining_s = ", ".join(words)
    return ordered_pairs, remaining_s
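
- A quick check of how `find_and_order_pairs` behaves on a short string. The function and `word_pairs` are repeated from the cell above so the snippet runs on its own; the input strings are made up.

```python
word_pairs = ["highly detailed", "high quality", "enhanced quality", "perfect composition", "dynamic light"]


def find_and_order_pairs(s, pairs):
    words = s.split()
    found_pairs = []
    # Collect every pair whose two words both appear, and remove those words.
    for pair in pairs:
        pair_words = pair.split()
        if pair_words[0] in words and pair_words[1] in words:
            found_pairs.append(pair)
            words.remove(pair_words[0])
            words.remove(pair_words[1])

    # Drop leftover words that belong to a pair but appeared without their partner.
    for word in words[:]:
        for pair in pairs:
            if word in pair.split():
                words.remove(word)
                break
    ordered_pairs = ", ".join(found_pairs)
    remaining_s = ", ".join(words)
    return ordered_pairs, remaining_s


print(find_and_order_pairs("high quality sharp focus highly detailed", word_pairs))
# ('highly detailed, high quality', 'sharp, focus')

print(find_and_order_pairs("quality sharp", word_pairs))
# ('', 'sharp')
```

  - Note that a pair word without its partner (such as “quality” in the second call) is dropped rather than kept as a single word.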

- Implement a custom `LogitsProcessor` class that assigns tokens in the words list a value of 0 and assigns tokens not in the words list a negative value so they aren’t picked during generation.
  - Generation is biased towards words in the words list.
  - After a word from the list is used, it is also assigned a negative value so it isn’t picked again.

In [None]:
class CustomLogitsProcessor(LogitsProcessor):
    def __init__(self, bias):
        super().__init__()
        self.bias = bias

    def __call__(self, input_ids, scores):
        if len(input_ids.shape) == 2:
            last_token_id = input_ids[0, -1]
            self.bias[last_token_id] = -1e10
        return scores + self.bias

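# Note: this cell uses `tokenizer`, so run the cell that loads the GPT2 tokenizer (further below) first.
word_ids = [tokenizer.encode(word, add_prefix_space=True)[0] for word in words]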
bias = torch.full((tokenizer.vocab_size,), -float("Inf")).to("cuda")
bias[word_ids] = 0
processor = CustomLogitsProcessor(bias)
processor_list = LogitsProcessorList([processor])
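
- The effect of the bias vector can be seen on a tiny made-up vocabulary. This sketch uses greedy `argmax` selection for simplicity, which is an assumption; the actual generation below samples instead. The token ids and scores are illustrative.

```python
import torch

vocab_size = 6
allowed_ids = [1, 4]  # illustrative token ids standing in for the words list

# Tokens outside the list get -inf, tokens in the list get 0.
bias = torch.full((vocab_size,), -float("inf"))
bias[allowed_ids] = 0

scores = torch.tensor([5.0, 1.0, 4.0, 3.0, 2.0, 0.5])

# Token 0 has the highest raw score but is not allowed, so token 4 wins.
first_pick = int(torch.argmax(scores + bias))
print(first_pick)  # 4

# Penalize the token that was just used so it is not picked again.
bias[first_pick] = -1e10
second_pick = int(torch.argmax(scores + bias))
print(second_pick)  # 1
```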

- Combine the prompt and the cinematic style prompt defined in the styles dictionary earlier.

In [None]:
prompt = "a cat basking in the sun on a roof in Turkey"
style = "cinematic"

prompt = styles[style].format(prompt=prompt)
prompt
"cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain"

- Load a GPT2 tokenizer and model from the `Gustavosta/MagicPrompt-Stable-Diffusion` checkpoint (this specific checkpoint is trained to generate prompts) to enhance the prompt.

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("Gustavosta/MagicPrompt-Stable-Diffusion")
model = GPT2LMHeadModel.from_pretrained("Gustavosta/MagicPrompt-Stable-Diffusion", torch_dtype=torch.float16).to(
    "cuda"
)
model.eval()

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
token_count = inputs["input_ids"].shape[1]
max_new_tokens = 50 - token_count

generation_config = GenerationConfig(
    penalty_alpha=0.7,
    top_k=50,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.eos_token_id,
    pad_token=model.config.pad_token_id,
    do_sample=True,
)

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        generation_config=generation_config,
        logits_processor=processor_list,
    )

- You can combine the input prompt and the generated prompt.
  - Feel free to take a look at what the generated prompt (`generated_part`) is, the word pairs that were found (`pairs`), and the remaining words (`words`).
  - This is all packed together in the `enhanced_prompt`.

In [None]:
output_tokens = [tokenizer.decode(generated_id, skip_special_tokens=True) for generated_id in generated_ids]
input_part, generated_part = output_tokens[0][: len(prompt)], output_tokens[0][len(prompt) :]
pairs, words = find_and_order_pairs(generated_part, word_pairs)
formatted_generated_part = pairs + ", " + words
enhanced_prompt = input_part + ", " + formatted_generated_part
enhanced_prompt
["cinematic film still of a cat basking in the sun on a roof in Turkey, highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain quality sharp focus beautiful detailed intricate stunning amazing epic"]

- Load a pipeline and the offset noise LoRA with a low weight to generate an image with the enhanced prompt.

In [None]:
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "RunDiffusion/Juggernaut-XL-v9", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

pipeline.load_lora_weights(
    "stabilityai/stable-diffusion-xl-base-1.0",
    weight_name="sd_xl_offset_example-lora_1.0.safetensors",
    adapter_name="offset",
)
pipeline.set_adapters(["offset"], adapter_weights=[0.2])

image = pipeline(
    enhanced_prompt,
    width=1152,
    height=896,
    guidance_scale=7.5,
    num_inference_steps=25,
).images[0]
image

#### Prompt weighting
- Prompt weighting provides a way to emphasize or de-emphasize certain parts of a prompt, allowing for more control over the generated image.
  - A prompt can include several concepts, which get turned into contextualized text embeddings.
  - The embeddings are used by the model to condition its cross-attention layers to generate an image (read the Stable Diffusion blog post to learn more about how it works).

- Prompt weighting works by increasing or decreasing the scale of the text embedding vector that corresponds to its concept in the prompt because you may not necessarily want the model to focus on all concepts equally.
  - The easiest way to prepare the prompt embeddings is to use Stable Diffusion Long Prompt Weighted Embedding (`sd_embed`).
  - Once you have the prompt-weighted embeddings, you can pass them to any pipeline that has a `prompt_embeds` (and optionally `negative_prompt_embeds`) parameter, such as `StableDiffusionPipeline`, `StableDiffusionControlNetPipeline`, and `StableDiffusionXLPipeline`.

- Make sure you have the latest version of `sd_embed` installed:

```
pip install git+https://github.com/xhinker/sd_embed.git@main
```

In [None]:
pipe = StableDiffusionXLPipeline.from_pretrained("Lykon/dreamshaper-xl-1-0", torch_dtype=torch.float16)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")

- To upweight or downweight a concept, surround the text with parentheses.
  - More parentheses apply a heavier weight to the text.
  - You can also append a numerical multiplier to the text to indicate how much you want to increase or decrease its weight.
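
- The arithmetic behind the syntax can be sketched as follows. This assumes the common convention in which each pair of parentheses multiplies the weight by 1.1 (so `((text))` gives about 1.21) and an explicit `(text:1.5)` sets the multiplier directly; check the `sd_embed` repository for the exact rules it applies. `paren_weight` and `apply_weight` are hypothetical helpers.

```python
def paren_weight(depth, base=1.1):
    """Hypothetical helper: weight implied by `depth` nested parentheses (assumed base 1.1)."""
    return base ** depth


def apply_weight(embedding, weight):
    """Scale a (toy) token embedding by its prompt weight."""
    return [value * weight for value in embedding]


print(round(paren_weight(1), 2))  # 1.1
print(round(paren_weight(2), 2))  # 1.21

# An explicit multiplier such as (text:1.5) scales the embedding by 1.5.
print(apply_weight([0.5, -1.0, 2.0], 1.5))  # [0.75, -1.5, 3.0]
```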

- Create a prompt and use a combination of parentheses and numerical multipliers to upweight various text.

In [None]:
prompt = """A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus. 
This imaginative creature features the distinctive, bulky body of a hippo, 
but with a texture and appearance resembling a golden-brown, crispy waffle. 
The creature might have elements like waffle squares across its skin and a syrup-like sheen. 
It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting, 
possibly including oversized utensils or plates in the background. 
The image should evoke a sense of playful absurdity and culinary fantasy.
"""

neg_prompt = """\
skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),\
(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),\
extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),\
(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,\
bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),\
(normal quality:2),lowres,((monochrome)),((grayscale))
"""

- Use the `get_weighted_text_embeddings_sdxl` function to generate the prompt embeddings and the negative prompt embeddings.
  - It’ll also generate the pooled and negative pooled prompt embeddings since you’re using the SDXL model.

In [None]:
( 
  prompt_embeds,
  prompt_neg_embeds,
  pooled_prompt_embeds,
  negative_pooled_prompt_embeds
) = get_weighted_text_embeddings_sdxl(
    pipe,
    prompt=prompt,
    neg_prompt=neg_prompt
)

image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=prompt_neg_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    num_inference_steps=30,
    height=1024,
    width=1024 + 512,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(2)
).images[0]
image

#### Textual inversion
- Textual inversion is a technique for learning a specific concept from some images which you can use to generate new images conditioned on that concept.
  - Create a pipeline and use the `load_textual_inversion()` function to load the textual inversion embeddings (feel free to browse the Stable Diffusion Conceptualizer for 100+ trained concepts):

In [None]:
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
  "stable-diffusion-v1-5/stable-diffusion-v1-5",
  torch_dtype=torch.float16,
).to("cuda")
pipe.load_textual_inversion("sd-concepts-library/midjourney-style")

prompt = """<midjourney-style> A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus. 
This imaginative creature features the distinctive, bulky body of a hippo, 
but with a texture and appearance resembling a golden-brown, crispy waffle. 
The creature might have elements like waffle squares across its skin and a syrup-like sheen. 
It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting, 
possibly including oversized utensils or plates in the background. 
The image should evoke a sense of playful absurdity and culinary fantasy.
"""

neg_prompt = """\
skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),\
(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),\
extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),\
(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,\
bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),\
(normal quality:2),lowres,((monochrome)),((grayscale))
"""

- Use the `get_weighted_text_embeddings_sd15` function to generate the prompt embeddings and the negative prompt embeddings.

In [None]:
( 
  prompt_embeds,
  prompt_neg_embeds,
) = get_weighted_text_embeddings_sd15(
    pipe,
    prompt=prompt,
    neg_prompt=neg_prompt
)

image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=prompt_neg_embeds,
    height=768,
    width=896,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(2)
).images[0]
image

#### DreamBooth
- DreamBooth is a technique for generating contextualized images of a subject given just a few images of the subject to train on.
  - It is similar to textual inversion, but DreamBooth trains the full model whereas textual inversion only fine-tunes the text embeddings.
  - This means you should use `from_pretrained()` to load the DreamBooth model:

In [None]:
pipe = DiffusionPipeline.from_pretrained("sd-dreambooth-library/dndcoverart-v1", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

prompt = """dndcoverart of A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus. 
This imaginative creature features the distinctive, bulky body of a hippo, 
but with a texture and appearance resembling a golden-brown, crispy waffle. 
The creature might have elements like waffle squares across its skin and a syrup-like sheen. 
It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting, 
possibly including oversized utensils or plates in the background. 
The image should evoke a sense of playful absurdity and culinary fantasy.
"""

neg_prompt = """\
skin spots,acnes,skin blemishes,age spot,(ugly:1.2),(duplicate:1.2),(morbid:1.21),(mutilated:1.2),\
(tranny:1.2),mutated hands,(poorly drawn hands:1.5),blurry,(bad anatomy:1.2),(bad proportions:1.3),\
extra limbs,(disfigured:1.2),(missing arms:1.2),(extra legs:1.2),(fused fingers:1.5),\
(too many fingers:1.5),(unclear eyes:1.2),lowers,bad hands,missing fingers,extra digit,\
bad hands,missing fingers,(extra arms and legs),(worst quality:2),(low quality:2),\
(normal quality:2),lowres,((monochrome)),((grayscale))
"""

(
    prompt_embeds,
    prompt_neg_embeds,
) = get_weighted_text_embeddings_sd15(
    pipe,
    prompt=prompt,
    neg_prompt=neg_prompt
)