# GenAI Applications

**Created by [Laia Tarrés](https://www.linkedin.com/in/laia-tarres) for the [Postgraduate course in artificial intelligence with deep learning](https://www.talent.upc.edu/ing/estudis/formacio/curs/310400/postgrau-artificial-intelligence-deep-learning/) in [UPC School](https://www.talent.upc.edu/ing/) (2024).**

**Disclaimer**: *this lab is a bit different. We will load and run inference with many models. If you run out of memory, shut down the environment and launch it again, rerunning only the requirements cell and the cell that ran out of memory.*

In this lab we will learn about more advanced applications using state-of-the-art models for images and videos; the same ideas can easily be extended to NLP tasks.

This lab builds on top of [🤗 Hugging Face](https://huggingface.co/), an open-source platform that hosts everything from datasets to models to ready-to-use applications. Tools like 🤗 Hugging Face have been transformative for the AI community by making open-source models easily accessible.

In this lab, we will not train any models, but we will see many "building blocks", which are open source components that will allow you to later build something else quickly.

The goal of the lab is not that you learn the specific examples, but that you learn about these building blocks so that you'll be able to combine them yourself for your own unique applications. Notice that Google Colab is limited to 16GB of GPU memory, so we will have to learn some tricks to run current state-of-the-art models.

In more detail, here is what you will do in this lab:

*   Get an introduction to Hugging Face, so you can learn what it is and how to discover models and tasks easily.
*   Use the 🧨Diffusers library to explore **text-to-image generation**, a task democratized and popularized by Stable Diffusion. This includes basic generation, LoRA to stylize the output images, and ControlNet to add spatial control to the generation, a task known as **controllable text-to-image generation**.
*   Explore **inpainting**, an image-to-image translation task.
*   Explore the **Text-to-Video** task, an extension of the famous text-to-image task.
*   Explore the **Image-to-Video** task.
*   Explore the **Text-to-3D** task and its extension, **Image-to-3D**.

## Introduction to Hugging Face

Hugging Face is a collaboration platform made up of several sections. The most used ones are:

*   **Hub**: the platform that hosts over 350k models, 75k datasets, and 150k demo apps (Spaces).
*   **Models**: the repositories for each of the models, which are easy to find through the Hub.
*   **Datasets**: a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks. Datasets are also easy to find in the Hub.
*   **Spaces**: a simple way to host ML demo apps directly on your profile.
*   **Docs**: the documentation for everything available in Hugging Face.

Hugging Face also maintains libraries that you might have seen before in other labs:


*   [Transformers](https://huggingface.co/docs/transformers/index)
*   [Diffusers](https://huggingface.co/docs/diffusers/index)
*   [Accelerate](https://huggingface.co/docs/accelerate/index)
*   ...

How do you find a model that you need for your project?


1.   Identify the task in machine learning terms (image classification, object detection, image segmentation, text-to-image, question answering... you can find a list of tasks and definitions in the [tasks](https://huggingface.co/tasks) tab).
2.   Go to the [Models](https://huggingface.co/models) page and filter by the task that you identified previously. This will return a list of open-source models that tackle this task.
3.   Narrow the search further, select different options (like libraries, dataset used to train, languages used, license type...), and sort them by downloads or trending.
4.   Read the model card of each candidate to make sure it provides information such as the task it performs, how it was trained, and its number of parameters. Models often come in different sizes, with the checkpoints listed in the model repository. Choosing a size is tricky, as it depends on the hardware that you are using. A simple trick: go to "Files and versions" and check the size (in GB) of each variant of the model.
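
The size check in step 4 can also be approximated from the parameter count alone. The following is a minimal sketch, assuming the weights dominate the checkpoint size and that 1 GB is 10^9 bytes; `estimate_weights_gb` is a hypothetical helper, not part of any Hugging Face library. The parameter counts are the ones quoted for Stable Diffusion v1 in the Text-to-Image section.

```python
# Bytes used to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def estimate_weights_gb(num_params, dtype="float32"):
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# 860M-parameter U-Net plus 123M-parameter text encoder.
sd_params = 860_000_000 + 123_000_000

for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{estimate_weights_gb(sd_params, dtype):.2f} GB")
```

This only counts the weights. Activations and intermediate buffers need additional memory during inference, so treat the result as a lower bound.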

Once found, how can we load and call the model for inference?
1.   The Model Card, when well written, contains a "Usage" section that explains how to load and use the model.
2.   Even easier, in some cases we can load the model through the Transformers or Diffusers library; the model page shows a button for this. With just a few lines of code, we can download the model and call it for inference. This easier route is the one we will follow. It uses a "pipeline" wrapper that not only defines and loads the model, but also checks the validity of the input, loads the model weights, preprocesses the data, etc.

Easy right? Now let's check some examples!



### Requirements

This takes a while, please be patient.

In [None]:
# torch should be 2.5.1+cu124 in google colab
#!pip install --upgrade diffusers[torch]==0.32.2
#!pip install transformers==4.48.3 scipy==1.13.1 ftfy accelerate==1.3.0
#!pip install peft==0.14.0
!pip install -q controlnet_aux
!command -v ffmpeg >/dev/null || (apt update && apt install -y ffmpeg)
!pip install -q mediapy

In [None]:
import diffusers
diffusers.logging.set_verbosity_error()
import torch

In [None]:
#auxiliary functions
from PIL import Image
import cv2
import numpy as np
import mediapy as media
from diffusers.utils import export_to_gif, load_image, export_to_video
import imageio

def image_grid(imgs, rows, cols):
    """Paste equally sized PIL images into a rows x cols grid, filling row by row."""
    assert len(imgs) == rows * cols

    w, h = imgs[0].size
    grid = Image.new("RGB", size=(cols * w, rows * h))

    for i, img in enumerate(imgs):
        # column index is i % cols, row index is i // cols
        grid.paste(img, box=(i % cols * w, i // cols * h))

    return grid

import requests
from io import BytesIO

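def download_image(url):
    """Download an image from a URL and return it as an RGB PIL image."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors instead of on a corrupt image
    return Image.open(BytesIO(response.content)).convert("RGB")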

## Text-to-Image

This task has been popularized and democratized by Stable Diffusion. Stable Diffusion is a [Latent Diffusion model](https://github.com/CompVis/latent-diffusion) developed by researchers from [CompVis](https://ommer-lab.com/), [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). It's trained on 512x512 images from a subset of the [LAION-5B](https://laion.ai/blog/laion-5b/) database. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on many consumer GPUs.

The model checkpoints were publicly released at the end of August 2022. Since then, it has revolutionized the Deep Learning field. You can check out the [official model card](https://huggingface.co/CompVis/stable-diffusion) for more information.

The model is implemented in the 🤗 Hugging Face [🧨 Diffusers library](https://github.com/huggingface/diffusers), which provides a dedicated pipeline for it, the [StableDiffusionPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline).

The pipeline is designed for end-to-end inference, generating images from text with just a few lines of code. We will be using Stable Diffusion version 1.4 ([CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4)), but there are other variants that you may want to try:
* [runwayml/stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5)
* [stabilityai/stable-diffusion-2-1-base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base)
* [stabilityai/stable-diffusion-2-1](https://huggingface.co/stabilityai/stable-diffusion-2-1). This version can produce images with a resolution of 768x768, while the others work at 512x512.


When we load the pipeline, we are actually loading multiple components: the CLIP text encoder, the denoising U-Net and the VAE decoder. This is detailed in the architecture diagram:

<p align="center">
    <img src="https://miro.medium.com/v2/resize:fit:720/format:webp/0*Y5H4xSKtQsQOp6OZ" alt="stablediffusion-architecture" width=800><br>
    <em>The diagram is taken from <a href=https://medium.com/latinxinai/text-to-image-with-stable-diffusion-4df16da2cfd5>here</a>.</em>
</p>

### Basic pipeline

To provide more detail, the stable diffusion model is composed of 3 different parts: **VAE**, **CLIP** and **denoising U-Net**.

First, this is a latent model, which means that the diffusion and denoising processes are applied in the latent space. The denoising is therefore done on a reduced latent representation instead of a high-resolution image, which reduces the time and cost of the computations while maintaining the quality of the generated image.
To convert images from and to the latent space, a **VAE** is used. The VAE model has two parts, an encoder and a decoder. The encoder is used to convert the image into a low dimensional latent representation, which will serve as the input to the U-Net model. The decoder, conversely, transforms the latent representation back into an image.
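
To get a feeling for how much smaller the latent representation is, here is a small arithmetic sketch. It assumes the values commonly used by Stable Diffusion v1, namely a VAE that downsamples each spatial dimension by a factor of 8 and produces 4 latent channels; `latent_shape` and `num_values` are hypothetical helpers written for this illustration.

```python
def latent_shape(height, width, downsample_factor=8, latent_channels=4):
    """Shape (channels, height, width) of the latent for an image of the given size."""
    if height % downsample_factor or width % downsample_factor:
        raise ValueError("height and width must be multiples of the downsampling factor")
    return (latent_channels, height // downsample_factor, width // downsample_factor)


def num_values(shape):
    """Total number of values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


image_shape = (3, 512, 512)          # RGB image at the default resolution
z_shape = latent_shape(512, 512)     # latent that the U-Net actually processes
compression = num_values(image_shape) / num_values(z_shape)

print(z_shape, compression)
```

Under these assumptions the U-Net works on 48 times fewer values than it would in pixel space.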

During latent diffusion training, the encoder is used to get the latent representations (latents) of the images for the forward diffusion process, which applies more and more noise at each step. During inference, the denoised latents generated by the reverse diffusion process are converted back into images using the VAE decoder. As we will see, during inference we only need the VAE decoder.

The **denoising U-Net** has an encoder part and a decoder part both comprised of ResNet and Transformer blocks. The encoder compresses an image representation into a lower resolution image representation and the decoder decodes the lower resolution image representation back to the original higher resolution image representation that is supposedly less noisy. More specifically, the U-Net output predicts the noise residual which can be used to compute the predicted denoised image representation.
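
The relation between the noise residual and the denoised representation can be sketched with the standard DDPM closed form, $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. This is a toy illustration on a list of four numbers, not the Diffusers implementation: the value of `alpha_bar` is hypothetical, and the U-Net is replaced by an oracle that returns the true noise.

```python
import math
import random


def add_noise(x0, noise, alpha_bar):
    """Forward diffusion in closed form: x_t = sqrt(a) * x0 + sqrt(1 - a) * noise."""
    return [math.sqrt(alpha_bar) * x + math.sqrt(1 - alpha_bar) * n
            for x, n in zip(x0, noise)]


def predict_x0(xt, predicted_noise, alpha_bar):
    """Invert the formula above, given a prediction of the noise."""
    return [(x - math.sqrt(1 - alpha_bar) * n) / math.sqrt(alpha_bar)
            for x, n in zip(xt, predicted_noise)]


rng = random.Random(0)
x0 = [0.5, -1.0, 0.25, 2.0]               # toy clean "latent"
noise = [rng.gauss(0, 1) for _ in x0]     # Gaussian noise added by the forward process
alpha_bar = 0.3                           # hypothetical cumulative signal level at step t

xt = add_noise(x0, noise, alpha_bar)
# Pretend the U-Net predicted the noise perfectly.
x0_hat = predict_x0(xt, noise, alpha_bar)
print(x0_hat)
```

With a perfect noise prediction the clean latent is recovered up to floating-point error. A real U-Net only approximates the noise, which is why the reverse process takes many small steps.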

To prevent the U-Net from losing important information while downsampling, short-cut connections are usually added between the downsampling ResNets of the encoder and the upsampling ResNets of the decoder. Additionally, the Stable Diffusion U-Net is able to condition its output on text embeddings via the cross-attention layers in the Transformer blocks. The cross-attention layers can be found in both the encoder and decoder part of the U-Net, after the ResNet and self-attention blocks.

The **text-encoder (CLIP)** is responsible for transforming the input prompt, e.g. "A corgi with a cute bow", into an embedding space that can be understood by the U-Net. Stable Diffusion does not train the text encoder and simply uses CLIP's already trained text encoder, CLIPTextModel.

The three modules are represented in the following figure, which is the figure presented in the original paper:
<p align="center">
    <img src="https://jalammar.github.io/images/stable-diffusion/article-Figure3-1-1536x762.png" alt="stablediffusion-original-architecture" width=600><br>
    <em>The diagram is taken from the <a href=https://arxiv.org/abs/2112.10752>original paper</a>.</em>
</p>


We have mentioned that Diffusers provides the **StableDiffusionPipeline** for easy text-to-image inference. The following code cell shows how to call it.

Notice that in addition to the model id [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4), we're also passing `torch_dtype` to the `from_pretrained` method.

To lower the memory usage so that it runs in free Google Colab, we tell `diffusers` to load the weights in float16 (half) precision by passing `torch_dtype=torch.float16`. The repository also offers a half-precision branch, [`fp16`](https://huggingface.co/CompVis/stable-diffusion-v1-4/tree/fp16), which stores the weights directly in float16.

In [None]:
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
# we move to GPU to have faster inferences
pipe = pipe.to("cuda")

Once we have the pipeline loaded, we can actually explore its components, to find what is being used under the hood. In the next cell, you will be able to see what is being used in terms of: scheduler, text_encoder, tokenizer, unet and vae.

In [None]:
#check the different components of the pipeline:
print(pipe)

With the pipeline loaded, we can run inference. This step is also simple: choose a prompt, pass it to the pipeline defined above, and take the first element of the returned `images` list. Since it is already a PIL image, we can display it directly in Google Colab.

In [None]:
prompt = "a photograph of a corgi with a cute bow, photorealistic, high quality, 8k"

image = pipe(prompt).images[0]  # a PIL image (https://pillow.readthedocs.io/en/stable/)

#display the image
image

# TODO: Run this cell again and check the differences caused by starting from different random noise. Also change the prompt to generate whatever your heart desires

Notice different things:

1.   The `height` and `width` of the generated image are 512x512. These are the defaults, but they can be adjusted to different aspect ratios.
2.   The pipeline runs 50 denoising steps (`num_inference_steps`), which takes about 7 seconds. In general, results improve with more steps. The default typically generates good results, but you can experiment with a smaller number for faster generation.
3.   We can set a `guidance_scale` for classifier-free guidance. You can find more information [here](https://arxiv.org/abs/2207.12598). Basically, it is a way to increase the adherence to the conditioning signal (text, in this case) as well as the overall sample quality. Values are typically between 7 and 9; with a very large value the images might look good but will be less diverse.
4.   Every time you run the above cell, even with the same prompt, you get a different image. This is because the random noise used to start the denoising process is different. If you want deterministic output, you can pass a random seed to the pipeline by using a `generator`.
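
The effect of `guidance_scale` in point 3 can be sketched with the usual classifier-free guidance combination of two noise predictions, one made without the prompt and one made with it. This is a toy example on short lists with made-up numbers, and the function name is hypothetical; in the real pipeline the two predictions are U-Net outputs with the shape of the latents.

```python
def classifier_free_guidance(noise_uncond, noise_text, guidance_scale):
    """Move from the unconditional prediction towards (and past) the text-conditioned one."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


noise_uncond = [0.0, 1.0]   # prediction with an empty prompt (made-up values)
noise_text = [1.0, 3.0]     # prediction with the actual prompt (made-up values)

print(classifier_free_guidance(noise_uncond, noise_text, 1.0))   # plain conditional prediction
print(classifier_free_guidance(noise_uncond, noise_text, 7.5))   # difference amplified 7.5 times
```

A scale of 1 returns the text-conditioned prediction unchanged. Larger values exaggerate the direction in which the prompt changes the prediction, which is why adherence increases while diversity drops.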

All of these parameters can be manually modified so that the generation is better and runs a bit faster. There are many more parameters that you can experiment with, check them in the [documentation](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline).


In [None]:
#The different parameters that the pipe takes. You can experiment with this!
prompt = "a photograph of a corgi with a cute bow, photorealistic, high quality, 8k"

height = 512                        # default height of Stable Diffusion
width = 512                         # default width of Stable Diffusion

num_inference_steps = 30            # Number of denoising steps

guidance_scale = 8                # Scale for classifier-free guidance

generator = torch.Generator("cuda").manual_seed(999) # Seeded generator used to create the initial latent noise

image = pipe(prompt, height=height, width=width, num_inference_steps=num_inference_steps, guidance_scale=guidance_scale, generator=generator).images[0] #Also play with inference steps

image

# TODO: change the seed, the height and width, the number of inference steps and the guidance scale, until you find something that you like

### LoRA

It is clear that Stable Diffusion has taken the world by storm. But it is also clear that, even though the model performs very well for some text-to-image prompts, there are cases where it does not perform so well. Fine-tuning the whole model on specific data (medical data, for example) is very computationally expensive, and this is where LoRAs come into play.

Low-Rank Adaptation of Large Language Models (LoRA) is an approach developed by Microsoft to reduce the number of trainable parameters by learning pairs of rank-decomposition matrices while freezing the original weights. LoRA fine-tunes the "residual" of the model instead of the entire model: i.e., it trains $\Delta W$ instead of $W$.

$W' = W + \Delta W$

Here $\Delta W$ is further decomposed into low-rank matrices: $\Delta W = AB^{\top}$, where $A \in \mathbb{R}^{n \times d}$, $B \in \mathbb{R}^{m \times d}$, and $d \ll n$. This is the key idea of LoRA. We can then fine-tune $A$ and $B$ instead of $W$, so the set of trained parameters is much smaller than $W$. This is represented in the figure:

<p align="left">
    <img src="https://images.datacamp.com/image/upload/v1705430151/image4_b814637cd2.png" alt="LoRA-architecture" width=200><br>
    <em>The diagram is taken from the <a href=https://arxiv.org/abs/2106.09685>paper</a>.</em>
</p>
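
The saving from the low-rank decomposition is easy to quantify: a full update of an $n \times m$ matrix has $n \cdot m$ parameters, while $A$ and $B$ together have $d \cdot (n + m)$. The sketch below uses hypothetical layer sizes chosen only for illustration; they are not taken from a specific Stable Diffusion layer.

```python
def full_update_params(n, m):
    """Trainable parameters when fine-tuning the whole n x m matrix W."""
    return n * m


def lora_params(n, m, d):
    """Trainable parameters of A (n x d) plus B (m x d)."""
    return d * (n + m)


n = m = 768   # hypothetical layer dimensions
d = 4         # hypothetical LoRA rank

full = full_update_params(n, m)
lora = lora_params(n, m, d)
print(full, lora, full / lora)
```

For these sizes the LoRA matrices hold 96 times fewer parameters than the full matrix, and the gap widens as the layer grows or the rank shrinks.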

This training trick is quite useful for fine-tuning customized models on top of a large general base model. Various text-to-image models have been built on top of the official Stable Diffusion. Now, with LoRA, you can efficiently train your own model with far fewer resources.

In our case, we will experiment with a LoRA that has been trained on a Pokemon dataset. Although the Stable Diffusion model that we are using has been trained on a vast amount of data, including some Pokemon images, it was not specifically designed for this purpose, so it struggles to generate this specific style when it is only indicated in the prompt.

In [None]:
prompt = ["Green dog with menacing face, pokemon style"]*3
images = pipe(prompt, height=height, width=width, num_inference_steps=num_inference_steps, guidance_scale=guidance_scale).images #Also play with inference steps
#Careful: if LoRA weights are already loaded, this cell will show the LoRA results -> call pipe.unload_lora_weights() first
grid = image_grid(images, rows=1, cols=3)
grid

For our example, we have selected parameters that have been trained on the [Pokemon Dataset](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions). More details on the selected LoRA can be found on the [model card](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4). Basically, to add the new LoRA weights to the model, we first need to find the version of Stable Diffusion that was used during the training of the LoRA. Since the weights have been trained on top of a specific Stable Diffusion version, using a different version means that our LoRA will not work.

**Question:** What version of stable diffusion did the authors use in this case?

**Answer:** ...

In [None]:
#Check that the model that we have loaded is compatible with the base model of the LoRA.
from huggingface_hub.repocard import RepoCard

card = RepoCard.load("sayakpaul/sd-model-finetuned-lora-t4")
base_model = card.data.to_dict()["base_model"]
base_model

To do inference, we need to add the LoRA weights to the pipeline. To do so, we can use the `pipe.load_lora_weights` function. Once the inference is done, we can unload the LoRA weights and go back to the base model by running `pipe.unload_lora_weights()`.
Notice that when calling the pipeline to run inference, we now pass an extra argument, `cross_attention_kwargs`, whose `scale` entry sets the weight given to the LoRA update. If it is set to 0, the LoRA weights have no effect; if it is set to 1, the LoRA update is applied at full strength. Typical values are around 0.5.
Also, notice that the LoRA weights are only 3.29M, which is all we need to load for the personalization.
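
Conceptually, the scale interpolates between the base weights and the fully adapted weights, $W' = W + \text{scale} \cdot AB^{\top}$. The following is a toy sketch with tiny hand-written matrices, under the assumption that the scale simply multiplies the low-rank update; the helper names are hypothetical and the library's internal implementation may differ in its details.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def apply_lora(w, a, b, scale):
    """Return W + scale * (A @ B^T)."""
    delta = matmul(a, transpose(b))
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights (2 x 2)
a = [[1.0], [2.0]]             # A: n x d with d = 1
b = [[3.0], [4.0]]             # B: m x d with d = 1

print(apply_lora(w, a, b, 0.0))   # base model only
print(apply_lora(w, a, b, 0.5))   # half-strength adaptation
print(apply_lora(w, a, b, 1.0))   # full LoRA update
```

Because $W$ itself is never modified, unloading the LoRA amounts to dropping $A$ and $B$, which is why switching back to the base model is cheap.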

In [None]:
model_path = "sayakpaul/sd-model-finetuned-lora-t4"
#pipe.unet.load_attn_procs(model_path)
pipe.load_lora_weights(model_path)
prompt = ["Green dog with menacing face, pokemon style"]*3 #with the LoRA loaded we no longer need to set the pokemon style in the prompt; try removing it
images = pipe(prompt, num_inference_steps=30, guidance_scale=7.5, cross_attention_kwargs={"scale": 0.6}).images
grid = image_grid(images, rows=1, cols=3)
grid

Notice that with these weights, the model is able to produce novel Pokemon with a much better style match. And we can do this by loading only a small number of parameters.
On the Hugging Face Hub, you can find other LoRAs by filtering the models with the lora tag. Be careful to check that the version of Stable Diffusion the LoRA was trained on matches the model that we are using (in this case, 1.4).

**Exercise**: Find a different LoRA and apply to the model, search [here](https://huggingface.co/models?other=lora). Then visualize some results!

In [None]:
pipe.unload_lora_weights()
# TODO: find a different lora and select the model path
model_path = "..."

pipe.load_lora_weights(model_path)
# TODO: define the prompt with the style of the LoRA that you have selected
prompt = ["..."] * 3  # three copies, to fill the 1x3 grid below

images = pipe(prompt, num_inference_steps=30, guidance_scale=7.5, cross_attention_kwargs={"scale": 0.6}).images
grid = image_grid(images, rows=1, cols=3)
grid

### ControlNet

Besides the capabilities that LoRA provides, there are forms of control that text alone is not able to provide. Some works tackle this by adding other forms of control besides text. The most commonly used are [T2I adapter](https://arxiv.org/abs/2302.08453) and [ControlNet](https://arxiv.org/abs/2302.05543).

ControlNet provides a minimal interface allowing users to customize the generation process up to a great extent. With [ControlNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/controlnet), users can easily condition the generation with different spatial contexts such as a depth map, a segmentation map, a scribble, keypoints, and so on!

ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.05543) by Lvmin Zhang and Maneesh Agrawala.
It introduces a framework that allows for supporting various spatial contexts that can serve as additional conditionings to Diffusion models such as Stable Diffusion.

Training ControlNet is comprised of the following steps:

1. Cloning the pre-trained parameters of a Diffusion model, such as Stable Diffusion's latent UNet (referred to as “trainable copy”), while also maintaining the pre-trained parameters separately (“locked copy”). This is done so that the locked parameter copy can preserve the vast knowledge learned from a large dataset, whereas the trainable copy is employed to learn task-specific aspects.
2. The trainable and locked copies of the parameters are connected via “zero convolution” layers (see [here](https://github.com/lllyasviel/ControlNet#controlnet) for more information), which are optimized as a part of the ControlNet framework. This is a training trick to preserve the semantics already learned by the frozen model while the new conditions are trained.
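
Why zero-initialized connections preserve the frozen model's behaviour can be seen in a toy sketch. Everything here is a simplified stand-in: the "blocks" are hypothetical element-wise functions rather than real network layers, the zero convolution is reduced to a scale and a shift, and the condition is added directly to the input.

```python
def locked_block(x):
    """Stand-in for a frozen Stable Diffusion block."""
    return [2.0 * v + 1.0 for v in x]


def trainable_block(x):
    """Stand-in for the trainable copy of the same block."""
    return [3.0 * v - 0.5 for v in x]


def zero_conv(x, weight, bias):
    """A 1x1 convolution reduced to a per-element scale and shift."""
    return [weight * v + bias for v in x]


def controlnet_block(x, condition, weight=0.0, bias=0.0):
    """Locked output plus the trainable branch, connected through the zero convolution."""
    conditioned = [v + c for v, c in zip(x, condition)]
    residual = zero_conv(trainable_block(conditioned), weight, bias)
    return [l + r for l, r in zip(locked_block(x), residual)]


x = [0.5, -1.0, 2.0]
condition = [1.0, 1.0, 0.0]   # e.g. an encoded edge map (made-up values)

print(controlnet_block(x, condition))               # zero-initialized: same as locked_block(x)
print(controlnet_block(x, condition, weight=0.1))   # after training has moved the weight
```

At initialization the trainable branch contributes nothing, so training starts exactly from the pre-trained model's output, and the condition only gains influence as the connection weights move away from zero.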

Pictorially, training a ControlNet looks like so:

<p align="center">
    <img src="https://github.com/lllyasviel/ControlNet/raw/main/github_page/sd.png" alt="controlnet-structure" width=600><br>
    <em>The diagram is taken from <a href=https://github.com/lllyasviel/ControlNet/blob/main/github_page/sd.png>here</a>.</em>
</p>

Every new type of conditioning requires training a new copy of ControlNet weights.
The paper proposed 8 different conditioning models that are all [supported](https://huggingface.co/lllyasviel?search=controlnet) in 🧨Diffusers!

For inference, both the pre-trained diffusion model weights and the trained ControlNet weights are needed. For example, using [Stable Diffusion v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5)
with a ControlNet checkpoint requires roughly 700 million more parameters than using the original Stable Diffusion model alone, which makes ControlNet a bit more memory-expensive for inference.

In [None]:
#Imports
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from controlnet_aux import OpenposeDetector

As a first example, we load the `girl with a pearl earring` image and extract its Canny edges. These edges will become our control image.

In [None]:
#load the image
image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)

#Extract the canny edges
image_canny = np.array(image)
low_threshold = 100
high_threshold = 200

image_canny = cv2.Canny(image_canny, low_threshold, high_threshold)
image_canny = image_canny[:, :, None]
image_canny = np.concatenate([image_canny, image_canny, image_canny], axis=2)
canny_image = Image.fromarray(image_canny)

grid = image_grid([image, canny_image], rows=1, cols=2)
grid

For ControlNet, we need to define a new pipeline that supports the extra conditioning. In this case it will be `StableDiffusionControlNetPipeline`, which takes similar parameters to `StableDiffusionPipeline` plus a `controlnet` argument that contains the specific ControlNet module that we are interested in.

In [None]:
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)
print(pipe)

Instead of using Stable Diffusion's default [PNDMScheduler](https://huggingface.co/docs/diffusers/main/en/api/schedulers/pndm), we use one of the currently fastest
diffusion model schedulers, called [UniPCMultistepScheduler](https://huggingface.co/docs/diffusers/main/en/api/schedulers/unipc).
Choosing an improved scheduler can drastically reduce inference time - in our case we are able to reduce the number of inference steps from 50 to 20 while more or less
keeping the same image generation quality. More information regarding schedulers can be found [here](https://huggingface.co/docs/diffusers/main/en/using-diffusers/schedulers).

In [None]:
from diffusers import UniPCMultistepScheduler

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

Instead of loading our pipeline directly to GPU, we instead enable smart CPU offloading which
can be achieved with the [`enable_model_cpu_offload` function](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/controlnet#diffusers.StableDiffusionControlNetPipeline.enable_model_cpu_offload).

Remember that during inference, diffusion models such as Stable Diffusion require not just one but multiple model components that are run sequentially.
In the case of Stable Diffusion with ControlNet, we first use the CLIP text encoder, then the diffusion model U-Net and ControlNet, then the VAE decoder, and finally run a safety checker.
Most components are only run once during the diffusion process and are thus not required to occupy GPU memory all the time. By enabling smart model offloading, we make sure
that each component is only loaded into GPU when it's needed, so that we can significantly reduce memory consumption without significantly slowing down inference.

**Note**: When running `enable_model_cpu_offload`, do not manually move the pipeline to GPU with `.to("cuda")` - once CPU offloading is enabled, the pipeline automatically takes care of GPU memory management.
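
The memory benefit of offloading can be sketched with simple arithmetic: without offloading all components sit on the GPU at once, while with offloading only the components of the current stage do. This is a simplified model that ignores activations and transfer overhead. The 123M and 860M figures are the ones quoted earlier in this lab; all other parameter counts, and the grouping into stages, are hypothetical values chosen for illustration.

```python
BYTES_FP16 = 2

# Parameters (in millions) that must be on the GPU during each stage.
# 123 and 860 are quoted in this lab; the other numbers are hypothetical.
stages = [
    ("text_encoder", 123),
    ("unet + controlnet", 860 + 360),
    ("vae_decoder", 84),
    ("safety_checker", 300),
]


def peak_weights_gb(stages, offload):
    """Peak GPU memory used by weights, in GB (1 GB = 1e9 bytes)."""
    sizes = [millions * 1e6 * BYTES_FP16 / 1e9 for _, millions in stages]
    # With offloading only one stage is resident at a time; otherwise all of them are.
    return max(sizes) if offload else sum(sizes)


print(f"without offload: ~{peak_weights_gb(stages, offload=False):.2f} GB")
print(f"with offload:    ~{peak_weights_gb(stages, offload=True):.2f} GB")
```

In this toy model the peak is set by the largest stage rather than by the sum of all stages, which is the saving that offloading is after.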

In [None]:
pipe.enable_model_cpu_offload()

The time has come to run some inference using the pipeline and the conditions. To do so, we will still provide a prompt to guide the image generation process, and combine it with the canny edge image we just created to allow more control over the generated image.

To keep it fun, we will generate some contemporary celebrities posing exactly as our 17th century painting. The text will control the content while the canny image will control the structure of the image.

One thing that we haven't mentioned yet is the `negative_prompt` argument, which we introduce here. It specifies what we don't want to see in the generated image.

In [None]:
prompt = ", best quality, extremely detailed"
prompt = [t + prompt for t in ["Sandra Oh", "Kim Kardashian", "rihanna", "taylor swift"]]
generator = [torch.Generator(device="cpu").manual_seed(2) for i in range(len(prompt))]

output = pipe(
    prompt,
    canny_image,
    negative_prompt=["monochrome, lowres, bad anatomy, worst quality, low quality"] * len(prompt),
    generator=generator,
    num_inference_steps=20,
)

image_grid(output.images, 2, 2)

Another application of ControlNet is to take a pose from one image and reuse it to generate a different image with the exact same pose. In this example, you can teach scientists how to do yoga using [Open Pose ControlNet](https://huggingface.co/lllyasviel/sd-controlnet-openpose).

In [None]:
urls = "yoga1.jpeg", "yoga2.jpeg", "yoga3.jpeg", "yoga4.jpeg"
imgs = [
    load_image("https://hf.co/datasets/YiYiXu/controlnet-testing/resolve/main/" + url)
    for url in urls
]

model = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")

poses = [model(img) for img in imgs]
#Reshape to visualize
poses_reshaped = [pose.resize((194,256)) for pose in poses]
image_grid(imgs+poses_reshaped, 2, 4)

In [None]:
# TODO: define the controlnet model that we need to use
controlnet = ControlNetModel.from_pretrained("...", torch_dtype=torch.float16)

model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    model_id,
    controlnet=controlnet,
    torch_dtype=torch.float16,
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

In [None]:
generator = [torch.Generator(device="cpu").manual_seed(3) for i in range(4)]
prompt = "scientist character, best quality, extremely detailed"
output = pipe(
    [prompt] * 4,
    poses,
    negative_prompt=["monochrome, lowres, bad anatomy, worst quality, low quality"] * 4,
    generator=generator,
    num_inference_steps=20,
)
image_grid(output.images, 2, 2)

## Image-to-Image

There are many image-to-image applications. In fact, ControlNet is an image-to-image method that gets its own section because of its popularity.
In this group of works, however, we focus on image editing, that is, taking an input image and editing it according to some other conditions.
One of the most popular examples is inpainting, a task that was already defined for the first version of Stable Diffusion.

The task of inpainting consists of masking a part of an image and generating novel content for the masked region, which in our case will be guided by text.

There are several approaches to incorporate Stable Diffusion for inpainting tasks.
The first method, introduced in the [StableDiffusion paper](https://arxiv.org/abs/2112.10752) proposes to further fine-tune the original StableDiffusion model, to be able to do the image editing.


1.   They first take a model trained for 595k steps for text-to-image generation on the “laion-aesthetics v2 5+” dataset.
2.   Then they train it for 440k steps for inpainting. For inpainting, the input of the denoising U-Net gets 5 extra channels: 4 for the latents of the encoded masked image and 1 for the mask itself. The extra weights needed to process these extra channels are zero-initialized, so that training starts from the behaviour of the text-to-image model.
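
The zero-initialization in step 2 can be illustrated with a toy 1x1 convolution, which at a single spatial position is just a dot product over the input channels. The weights and inputs below are made-up numbers and `conv1x1` is a hypothetical helper; the real first layer of the U-Net uses larger kernels and many output channels, but the argument is the same.

```python
def conv1x1(pixel, weights):
    """One output channel of a 1x1 convolution at one position: a dot product over channels."""
    assert len(pixel) == len(weights)
    return sum(p * w for p, w in zip(pixel, weights))


# Pre-trained weights for the 4 latent channels of the text-to-image model (made-up values).
pretrained_weights = [0.2, -0.1, 0.4, 0.3]
# Extended layer: 4 original channels + 4 masked-image latents + 1 mask = 9 input channels.
extended_weights = pretrained_weights + [0.0] * 5

noisy_latent = [1.0, 2.0, -1.0, 0.5]
extra_inputs = [0.7, -0.3, 0.9, 0.1, 1.0]   # masked-image latents and mask (made-up values)

before = conv1x1(noisy_latent, pretrained_weights)
after = conv1x1(noisy_latent + extra_inputs, extended_weights)
print(before, after)
```

Whatever the extra channels contain, the extended layer initially produces the same output as the original one, so the inpainting fine-tuning starts from the text-to-image model and gradually learns to use the mask and the masked image.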

Similarly to before, we will define a new pipeline, `StableDiffusionInpaintPipeline`, to correctly load the corresponding modules and later run inference. Note that the arguments are very similar to previous cases.

In [None]:
from diffusers import StableDiffusionInpaintPipeline
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16,
)
pipe.to("cuda")
pipe.enable_attention_slicing()
print(pipe)

In [None]:
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))

This application allows us to modify a selected area of the image. For this pipeline to work, we need three inputs: the image, the mask, and the prompt that the generated pixels will follow. In the following cells you can visualize the image and the mask that we are going to use.

In [None]:
image

In [None]:
mask_image

In [None]:
prompt = "a black cat sitting on a bench"

guidance_scale = 7.5
num_samples = 3
generator = torch.Generator(device="cuda").manual_seed(1) # change the seed to get different results

images = pipe(
    prompt=prompt,
    image=image,
    mask_image=mask_image,
    guidance_scale=guidance_scale,
    generator=generator,
    num_images_per_prompt=num_samples,
).images

In [None]:
# insert initial image in the list so we can compare side by side
images.insert(0, image)
#show the ground truth vs the inpainted images
image_grid(images, 1, num_samples + 1)

## Text-to-Video

Recently, some works have extended text-to-image pipelines to be text-to-video. This task is more complex because now the models have to learn temporal relationships.

Many approaches have been proposed, from defining new models that are able to generate multiple frames at the same time, to modifying only some particular layers of Stable Diffusion models to extend their capabilities.

[TextToVideoZero](https://arxiv.org/abs/2303.13439) tackles video generation by reusing a text-to-image Stable Diffusion model and proposes two modifications:

<p align="center">
    <img src="https://miro.medium.com/v2/resize:fit:2000/1*g-drahk33DVhmp1jtbQCNg.png" alt="text2zero-architecture" width=800><br>
    <em>The diagram is taken from the <a href=https://arxiv.org/abs/2303.13439>original paper</a>.</em>
</p>



1.   The model now takes multiple noisy latents and decodes them into multiple image latents. Instead of sampling the noisy latents independently at random, the authors enrich these latent codes with motion dynamics. This keeps the global scene and the background consistent over time.
2.   The self-attention layers inside all the transformer blocks are modified to do cross-frame attention instead of frame-level self-attention. This change helps preserve the context, appearance and identity of the foreground object.

With this, the model is able to generate 8 consecutive frames that have some temporal coherence, without much compute overhead. Similarly to before, it is already included in diffusers for easy loading and inference.
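The second modification can be illustrated with a small, dependency-free sketch. It is a simplification and not the diffusers implementation: it skips the learned query/key/value projections and uses the raw token vectors directly, and the function names are hypothetical. The only point is the difference in which keys and values each frame looks at: in cross-frame attention, every frame attends to the keys and values of the first frame.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention for one frame (lists of token vectors)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def frame_self_attention(frames_q, frames_k, frames_v):
    """Standard behavior: each frame attends only to its own tokens."""
    return [attend(q, k, v) for q, k, v in zip(frames_q, frames_k, frames_v)]

def cross_frame_attention(frames_q, frames_k, frames_v):
    """Cross-frame variant: every frame attends to the FIRST frame's keys/values."""
    return [attend(q, frames_k[0], frames_v[0]) for q in frames_q]

# 3 toy frames, 2 tokens per frame, 2 features per token.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
self_out = frame_self_attention(frames, frames, frames)
cross_out = cross_frame_attention(frames, frames, frames)

# Frame 0 is unchanged; later frames are now mixtures of frame-0 content.
print(cross_out[0] == self_out[0], cross_out[2] != self_out[2])
```

Because all frames draw their content from the same reference frame, the appearance of the foreground object stays anchored to that frame across the clip.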


In [None]:
import imageio  # used below to write the generated frames to an .mp4 file
from diffusers import TextToVideoZeroPipeline

model_id = "runwayml/stable-diffusion-v1-5"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "A panda is playing guitar on times square"
result = pipe(prompt=prompt).images
result = [(r * 255).astype("uint8") for r in result]
imageio.mimsave("generated_video.mp4", result, fps=4)

In [None]:
video = media.read_video("generated_video.mp4")
media.show_video(video, fps=4)

This approach can be extended to generate longer videos by generating the frames in chunks and concatenating them:

In [None]:
seed = 0
video_length = 24  #24 ÷ 4fps = 6 seconds
chunk_size = 8
prompt = "A panda is playing guitar on times square"

# Generate the video chunk-by-chunk
result = []
chunk_ids = np.arange(0, video_length, chunk_size - 1)
generator = torch.Generator(device="cuda")
for i in range(len(chunk_ids)):
    print(f"Processing chunk {i + 1} / {len(chunk_ids)}")
    ch_start = chunk_ids[i]
    ch_end = video_length if i == len(chunk_ids) - 1 else chunk_ids[i + 1]
    # Attach the first frame for Cross Frame Attention
    frame_ids = [0] + list(range(ch_start, ch_end))
    # Fix the seed for the temporal consistency
    generator.manual_seed(seed)
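    # TODO: do inference with the text-to-video model
    output = pipe(prompt=..., video_length=len(...), generator=..., frame_ids=...)
    # Drop the attached first frame (index 0) before storing the chunk
    result.append(output.images[1:])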

# Concatenate chunks and save
result = np.concatenate(result)
result = [(r * 255).astype("uint8") for r in result]
imageio.mimsave("generated_video_longer.mp4", result, fps=4)
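The chunk bookkeeping in the loop above can be checked without a GPU. The sketch below reproduces only the index arithmetic in plain Python (`plan_chunks` is a hypothetical helper, not part of diffusers), assuming that the extra first frame attached to each chunk is discarded before concatenation. It shows why the step is `chunk_size - 1`: one slot in every chunk is reserved for the attached frame 0.

```python
def plan_chunks(video_length, chunk_size):
    """Return the frame ids requested per chunk, with frame 0 attached in front."""
    starts = list(range(0, video_length, chunk_size - 1))
    plans = []
    for i, start in enumerate(starts):
        end = video_length if i == len(starts) - 1 else starts[i + 1]
        plans.append([0] + list(range(start, end)))
    return plans

plans = plan_chunks(video_length=24, chunk_size=8)
for ids in plans:
    print(len(ids), ids)

# Frames that survive once the attached first frame of each chunk is dropped.
kept = [frame for ids in plans for frame in ids[1:]]
print(len(kept))
```

With `video_length=24` and `chunk_size=8`, the model is called four times, on 8, 8, 8 and 4 frames, and the kept frames cover ids 0 to 23 exactly once.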

In [None]:
video = media.read_video("generated_video_longer.mp4")
media.show_video(video, fps=4)

Moreover, since this approach is based on Stable Diffusion and the base weights are kept, it is not limited to text-to-video synthesis. It is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing.

The idea here is to combine the techniques used to control or edit text-to-image generation with the video modifications, without the need to train extra models.

In [None]:
#Download the demo video
from huggingface_hub import hf_hub_download

filename = "__assets__/pix2pix video/camel.mp4"
repo_id = "PAIR/Text2Video-Zero"
video_path = hf_hub_download(repo_type="space", repo_id=repo_id, filename=filename)

#Read video from path
from PIL import Image
import imageio

reader = imageio.get_reader(video_path, "ffmpeg")
frame_count = 4
video = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]

#Visualize the video
video_aux = media.read_video(video_path)
media.show_video(video_aux, fps=4)

In [None]:
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor

model_id = "timbrooks/instruct-pix2pix"
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
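# batch_size is the batch size excluding the frame dimension; it is 3 here
# because InstructPix2Pix runs three guidance branches for every frame
pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=3))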

prompt = "make it Van Gogh Starry Night style"
result = pipe(prompt=[prompt] * len(video), image=video).images #This should take around 2 minutes
imageio.mimsave("edited_video.mp4", result, fps=4)

#Visualize the video
video_aux = media.read_video("edited_video.mp4")
media.show_video(video_aux, fps=4)

The possibilities with video editing are also endless, and some of the techniques that we have seen for adding control to image generation models also apply to video generation models. We can add LoRA or ControlNet to video models to have extra control over the generated video. You can find more information [here](https://huggingface.co/docs/diffusers/api/pipelines/text_to_video_zero).

## Image-to-Video

Given the vast number of openly available videos, one can take approaches similar to those used for Large Language Models, and try to generate video autoregressively, to obtain a Large Video Model.

This is the case for [Stable Video Diffusion](https://stability.ai/research/stable-video-diffusion-scaling-latent-video-diffusion-models-to-large-datasets) models. In this paper, the authors propose a three-stage training procedure:


1.   First, start from a pre-trained text-to-image model, in their case Stable Diffusion.
2.   Then, pretrain on videos, with an approach similar to the one we have seen in text-to-video.
3.   Finally, fine-tune on high-quality videos to obtain a more consistent, high-resolution text-to-video model.

Then, they fine-tune the model to perform image-to-video. To do this, they condition the model in two ways. First, they compute image embeddings with a CLIP image encoder and use them in place of the text embeddings. Second, similarly to the image-to-image pipeline, they concatenate the image latents to the input of the denoising UNet.


Given that these models are also openly available, techniques such as LoRA can also be used here. For example, there are LoRAs that produce specific motions, such as zoom out, zoom in, horizontal movement, etc.

For this model, we will only have one input: the image. The model has learned motion priors, so we do not condition on a prompt. We will use this image:

In [None]:
# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")

image = image.resize((1024, 576))

# TODO: try it again with your own image! Make sure that it is horizontal
image
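When trying your own image, a plain `resize((1024, 576))` will stretch it if its aspect ratio is not 16:9. One possible way around this, sketched here as an assumption rather than as part of the original lab, is to center-crop to the target aspect ratio first. The helper `center_crop_box` is hypothetical; it only computes the crop rectangle, which could then be passed to PIL's `Image.crop` before resizing.

```python
def center_crop_box(width, height, target_w=1024, target_h=576):
    """Return (left, top, right, bottom) of the largest centered box
    that has the same aspect ratio as target_w x target_h."""
    target_ratio = target_w / target_h
    if width / height > target_ratio:
        # Image is too wide: trim the left and right sides.
        new_w = round(height * target_ratio)
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Image is too tall (or already matches): trim top and bottom.
    new_h = round(width / target_ratio)
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)

# Example: a 4:3 photo of 1920x1440 pixels.
box = center_crop_box(1920, 1440)
print(box)

# Possible usage with a PIL image (not executed here):
# image = image.crop(box).resize((1024, 576))
```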

In [None]:
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video  # writes the generated frames to an .mp4 file

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

#generate the frames
generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=4, generator=generator, num_frames=8).frames[0]

export_to_video(frames, "generated_image2video.mp4", fps=7)

#visualizing the video
video = media.read_video("generated_image2video.mp4")
media.show_video(video, fps=7)
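The `decode_chunk_size=4` argument used above is one of the memory tricks needed on a 16GB GPU: it sets how many frames the VAE decodes at a time, so the latents are not all decoded in one pass. The sketch below illustrates only this batching idea with a stand-in decoder; `decode_in_chunks` and `fake_decode` are hypothetical names, not diffusers functions.

```python
def decode_in_chunks(latents, decode_fn, chunk_size):
    """Decode a sequence of latents a few frames at a time."""
    frames = []
    for start in range(0, len(latents), chunk_size):
        frames.extend(decode_fn(latents[start:start + chunk_size]))
    return frames

batch_sizes = []  # records how many frames each decoder call had to hold

def fake_decode(batch):
    """Stand-in for the VAE decoder: tags each latent instead of decoding it."""
    batch_sizes.append(len(batch))
    return [f"frame<{latent}>" for latent in batch]

latents = list(range(8))  # num_frames=8, as in the cell above
frames = decode_in_chunks(latents, fake_decode, chunk_size=4)

print(len(frames), batch_sizes)
print(round(len(frames) / 7, 2), "seconds at fps=7")
```

A smaller chunk size lowers peak memory at the cost of more decoder calls; a larger one is faster but needs more memory.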

## Text-to-3D

This task consists of generating a 3D asset from text. The main motivation is that these 3D assets can be used for video game development, interior design, and architecture.

There are many works that tackle this task, but one of the first to use diffusion is [Shap-E](https://huggingface.co/docs/diffusers/main/api/pipelines/shap_e). It is trained on a large dataset of 3D assets, which are post-processed to render more views of each object and to produce 16K instead of 4K point clouds.
Shap-E consists of two models that are trained independently:

1.   An encoder accepts the point clouds and rendered views of a 3D asset and outputs the parameters of implicit functions that represent the asset.
2.   A diffusion model is trained on the latents produced by the encoder to generate either neural radiance fields (NeRFs) or a textured 3D mesh, making it easier to render and use the 3D asset in downstream applications.

Since it is implemented in Diffusers, it is easy to download and run inference with, just like the other pipelines and models.

In [None]:
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_gif

repo = "openai/shap-e"
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

guidance_scale = 15.0
prompt = "a shark" #we are guiding it with a prompt

images = pipe(
    prompt,
    guidance_scale=guidance_scale,
    num_inference_steps=64,
    frame_size=256,
).images

gif_path = export_to_gif(images[0], "shark_3d.gif")

video = media.read_video("shark_3d.gif")
media.show_video(video, fps=4)

There is also a more complex task, Image-to-3D, which you can think of as an extension of Text-to-3D where the condition is an image, which contains many more details about the object. In this lab, we use a variant of Shap-E that is conditioned on CLIP image embeddings instead of CLIP text embeddings. This works relatively well, but the details are not preserved.

There are other approaches that work better for Image-to-3D, for example [Zero123](https://huggingface.co/spaces/cvlab/zero123-live), which requires more memory but has a live demo.

In [None]:
#This is the image that we are using as conditioning:
image_url = "https://hf.co/datasets/diffusers/docs-images/resolve/main/shap-e/corgi.png"
image = load_image(image_url).convert("RGB")
image

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

repo = "openai/shap-e-img2img"
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
pipe = pipe.to(device)

guidance_scale = 3.0

images = pipe(
    image,
    guidance_scale=guidance_scale,
    num_inference_steps=64,
    frame_size=256,
).images

gif_path = export_to_gif(images[0], "corgi_3d.gif")

video = media.read_video("corgi_3d.gif")
media.show_video(video, fps=4)

# Conclusions

It is now simpler than ever to explore state-of-the-art research on generative AI. This is mainly thanks to open-source tools like Hugging Face, with its diffusers and transformers libraries, but there are [many others](https://github.com/filipecalegario/awesome-generative-ai?tab=readme-ov-file#online-tools-and-applications), such as [OpenNN](https://www.opennn.net/), NVIDIA Deep Learning Studio, [AI-Flow](https://ai-flow.net/), models on [AWS](https://aws.amazon.com/es/ai/generative-ai/), and [RunwayML](http://runwayml.com/), to name a few.

The examples shown in this lab do not even scratch the surface of all the applications and open libraries that are out there to apply generative AI to real-world use cases, but we hope that this has encouraged you to go on and find or build your own applications by diving deep into state-of-the-art implementations.

## References

This notebook is heavily inspired by:
1.   [Diffusers documentation](https://huggingface.co/docs/diffusers/index)
2.   [DeepLearning.ai course](https://info.deeplearning.ai/e3t/Ctc/LX+113/cJhC404/VXf9pB7_LlN1W6RlhHh1Kn3YLW4vJbXQ5bmym1N28qGxb3qgyTW8wLKSR6lZ3mHW5qKbYb5bbvHyW1QJ0WL8pWv0jW7tXScv7X9tJzW4WYs5y7pfr0-W1VpPlk2d1Nl5W7Hv6H75hxTjKN2113sXhKLVmV7DK4z27fsTKW3cVtZK2PxHSBW3wQgjh8Z7Np8W8DCxph6ZzbP2W47_XgL2hDXvYVtpgCc6smWwwW2Bvxmh6R9b9tW83snjH7XyWhvW789xs11bCGX3W1QGd-b5BL9w0W4y7HT38krWg0W6dJ-Jg2KBj5BW4dts1B706_p6W3RwlP51M1LJCW4mfBF42RLlbMW6hDJRR6pdhVrW4VL7_Q92fkyQW3jHHsv6Z8zgrW31RMCX8J_VX6W3_JCsm4Yf13gW4mp1fb4t8Gd5f3c9RQq04)
3.   [Hugging Face docs](https://huggingface.co/docs)
4.   [Stable Diffusion official notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb)
5.   [LoRA1](https://github.com/haofanwang/Lora-for-Diffusers)
6.   [LoRA2](https://huggingface.co/docs/diffusers/v0.13.0/en/training/lora)
7.   [ControlNet official notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/controlnet.ipynb#scrollTo=PMdYxVKaqGeg)
8.   [Inpainting](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/in_painting_with_stable_diffusion_using_diffusers.ipynb#scrollTo=loXzMOIxv9OZ)