# Deep Learning - Exercise 12

The aim of the lecture is to get an overview of the possibilities in the generative artificial intelligence (GenAI) domain.

## 🔎 Do you know any famous models from this area?

* We will use the [Hugging Face](https://huggingface.co/) libraries

![meme01](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/ai_meme_02.jpg?raw=true)

## ⚡ Let's install the basic libraries first

* We will use the Hugging Face libraries to run the **Stable Diffusion** model
    * https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
    * 🔎 What is a **Text-to-Image** task?

* You can download pre-trained models from the Hub and use them through a simple, unified API
    * But I guess you already know this 🙂

* The main library is `diffusers`
    * The `diffusers` library, developed by Hugging Face, is designed for running, training, and deploying diffusion models
    * 📌 Diffusion models work by gradually denoising a sample drawn from a random (typically Gaussian) distribution until it resembles data from the training distribution
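
To make the denoising idea concrete, here is a toy sketch in plain Python. It assumes a DDPM-style formulation, in which a noisy sample is a weighted mix of clean data and Gaussian noise; the function names and the value of `alpha_bar` are hypothetical and chosen only for illustration. A real diffusion model does not know the noise: a neural network predicts it, and the removal is spread over many small steps.

```python
import math
import random


def add_noise(x0, alpha_bar, eps):
    """Forward direction: mix clean data x0 with Gaussian noise eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def remove_noise(xt, alpha_bar, eps_pred):
    """Reverse direction: estimate x0 from xt given a noise prediction."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(xt, eps_pred)]


random.seed(0)
x0 = [0.5, -1.0, 2.0]                       # stand-in for clean data
eps = [random.gauss(0.0, 1.0) for _ in x0]  # Gaussian noise
alpha_bar = 0.3                             # hypothetical noise level (1.0 = no noise)

xt = add_noise(x0, alpha_bar, eps)
# Here we cheat and pass the true noise; a trained model would predict it.
x0_hat = remove_noise(xt, alpha_bar, eps)

print("noisy sample:    ", xt)
print("recovered sample:", x0_hat)
```

With a perfect noise prediction the clean sample is recovered exactly (up to floating-point error), which is why the quality of the noise predictor determines the quality of the generated data.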

In [None]:
!pip install diffusers transformers accelerate safetensors invisible_watermark

## Running your own Stable Diffusion instance is quite easy
* You just need to download the pretrained model and load it onto the GPU
* There are many different models in the [HuggingFace Models Hub](https://huggingface.co/models)
    * 💡 Filter by task, e.g. Text-to-Image

* We can see that `DiffusionPipeline.from_pretrained` is called with several parameters
    * `torch_dtype`: the data type of the model's tensors. `torch.float16` means that the weights are loaded as 16-bit floating-point numbers
        * This roughly halves memory usage compared to 32-bit floats and can speed up computation, at the cost of some precision
    * `use_safetensors`: load the weights from `.safetensors` files. Safetensors is a file format for storing tensors that, unlike pickle-based checkpoints, cannot execute arbitrary code when loaded, and it is fast to load
    * `variant`: selects which variant of the checkpoint files to download. Here `fp16` selects the weights stored in 16-bit floating point, which also makes the download smaller
        * 💡 This is consistent with the choice of `torch.float16`
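
The memory and precision trade-off of 16-bit floats can be checked with the Python standard library alone. The sketch below is only illustrative: the parameter count is a hypothetical round number, not the size of any particular model, and `weights_size_gib` is a helper name made up for this example.

```python
import struct

BYTES_FP32 = struct.calcsize("f")  # 4 bytes per 32-bit float
BYTES_FP16 = struct.calcsize("e")  # 2 bytes per 16-bit float


def weights_size_gib(num_params, bytes_per_param):
    """Approximate size of the raw weights in GiB."""
    return num_params * bytes_per_param / 1024**3


num_params = 1_000_000_000  # hypothetical model with one billion parameters
size_fp32 = weights_size_gib(num_params, BYTES_FP32)
size_fp16 = weights_size_gib(num_params, BYTES_FP16)
print(f"fp32: {size_fp32:.2f} GiB, fp16: {size_fp16:.2f} GiB")

# Precision cost: round-trip 0.1 through a 16-bit float.
as_fp16 = struct.unpack("e", struct.pack("e", 0.1))[0]
print(f"0.1 stored as fp16: {as_fp16!r}")
```

The weights take half the space in fp16, while the round-tripped value is close to, but not exactly, the value we started with.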

In [None]:
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")

## You can specify the device to use
# pipe.to("cuda:1")

## Now the model is ready and you can start using it
* 💡 The most important part is the so-called **prompt**, the same concept as in ChatGPT

# Let's create our own image using the model

#### 💡 TIP: Run the code multiple times if you do not like the result 🙂

In [None]:
prompt = "A knight riding a majestic lion"

image = pipe(prompt=prompt).images[0]

image

## 📒 The image can be very easily saved

In [None]:
image.save('sd_output.png')

# 📌 The most difficult part is defining the prompt
* There are several *tips & tricks* for getting the most out of the model

## You can add comma-separated keywords after the main prompt to be more specific

In [None]:
prompt = "A knight riding a majestic lion, cyberpunk, japan city background"

image = pipe(prompt=prompt).images[0]

image
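
When you experiment with many keyword combinations, it can be convenient to assemble the prompt from parts. The helper below is a small sketch; `build_prompt` is a hypothetical name and not part of `diffusers`. It only produces the string that you would then pass to the pipeline as `prompt`.

```python
def build_prompt(subject, *keywords):
    """Join the main subject and any non-empty keywords with ', '."""
    parts = [subject.strip()]
    parts += [k.strip() for k in keywords if k.strip()]
    return ", ".join(parts)


prompt = build_prompt(
    "A knight riding a majestic lion",
    "cyberpunk",
    "japan city background",
)
print(prompt)
```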

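## Some tools let you change the emphasis on a keyword with brackets - e.g. `[cyberpunk]`
* 💡 A plain `diffusers` pipeline does not parse any weighting syntax, so here the brackets are simply tokenized as part of the prompt text; they may still change the result, but not as a defined emphasis operator
* In the AUTOMATIC1111 web UI, `(keyword)` increases and `[keyword]` decreases the attention given to a keyword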

In [None]:
prompt = "A knight riding a majestic lion, cyberpunk, japan city background, [black and white]"

image = pipe(prompt=prompt).images[0]

image

# A fun use case is to imitate the style of a specific painter
* 💡 You can see the results for many styles at https://www.urania.ai/top-sd-artists

## Leonid Afremov

In [None]:
prompt = "A knight riding a majestic lion, [Leonid Afremov]"

image = pipe(prompt=prompt).images[0]

image

## Vincent Van Gogh

In [None]:
prompt = "A knight riding a majestic lion, [Vincent Van Gogh], cyberpunk"

image = pipe(prompt=prompt).images[0]

image

![meme02](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/ai_meme_01.jpg?raw=true)

## ⚡ We need to restart the session now to avoid running out of GPU memory (OOM)

## We will load the base model and refiner separately
* 💡 The base model generates the (still noisy) latents
* 💡 The refiner is specialized for the final denoising steps of the latents, so it produces our final image

In [None]:
from diffusers import DiffusionPipeline
import torch

base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
base.to("cuda")
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
refiner.to("cuda")

## We can now tune the parameters of the models
* The most important parameters are these two:
    * `n_steps`
    * `high_noise_frac`

* Stable Diffusion XL base is trained on timesteps 0-999, and the Stable Diffusion XL refiner is fine-tuned from the base model on the low-noise timesteps 0-199
* We use the base model for the first 800 timesteps (high noise) and the refiner for the last 200 timesteps (low noise)
    * 📌 This is done by setting `high_noise_frac` to 0.8

* `n_steps` is the number of inference steps of the whole generative process; it is passed to both the base and the refiner as `num_inference_steps`
    * 💡 In diffusion models, each inference step is one denoising operation
    * Starting from a noisy latent representation, the model iteratively refines it over a series of steps, gradually reducing noise and adding detail
* 📌 The `n_steps` parameter controls the granularity of this process: the more steps, the more gradual and potentially detailed the transformation

* 💡 We can set `high_noise_frac = 1` to obtain a low-detail, unrefined image, and we can set `n_steps = 1` to get little more than noise
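
To see how `n_steps` and `high_noise_frac` interact, the following sketch estimates how many inference steps end up in the base model and how many in the refiner. It is a simplified approximation that assumes evenly spaced timesteps over 1000 training timesteps; real schedulers differ in spacing and offsets, and `split_steps` is a hypothetical helper, not a `diffusers` function.

```python
def split_steps(n_steps, high_noise_frac, num_train_timesteps=1000):
    """Split evenly spaced timesteps between the base model and the refiner.

    Returns (base_timesteps, refiner_timesteps), both ordered from high
    noise to low noise.
    """
    step = num_train_timesteps // n_steps
    timesteps = [i * step for i in reversed(range(n_steps))]
    # Timesteps at or above the cutoff are "high noise" and go to the base.
    cutoff = round(num_train_timesteps * (1.0 - high_noise_frac))
    base = [t for t in timesteps if t >= cutoff]
    refiner = [t for t in timesteps if t < cutoff]
    return base, refiner


base_ts, refiner_ts = split_steps(n_steps=20, high_noise_frac=0.8)
print("base steps:   ", len(base_ts))
print("refiner steps:", len(refiner_ts))

# With high_noise_frac = 1.0 the refiner has nothing left to do.
base_only, no_refiner = split_steps(n_steps=20, high_noise_frac=1.0)
print("base steps:   ", len(base_only), "refiner steps:", len(no_refiner))
```

Under these assumptions, 20 steps with `high_noise_frac = 0.8` give 16 steps to the base model and 4 to the refiner, which matches the 800/200 split of the training timesteps described above.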

### 🚀 A nice experiment is to set the `n_steps` parameter to 1, 3, 5, 10 and 20 in turn to see how the noise is refined

In [None]:
n_steps = 20
high_noise_frac = 0.8
prompt = "A knight riding a majestic lion"

# run both experts
latent_vector = base(
    prompt=prompt,
    num_inference_steps=n_steps,
    denoising_end=high_noise_frac,
    output_type="latent",
).images

image = refiner(
    prompt=prompt,
    num_inference_steps=n_steps,
    denoising_start=high_noise_frac,
    image=latent_vector,
).images[0]

image

## 📌 In case you hit an OOM error, here are the outputs:

* n_steps = 1

![n_steps = 1](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/n_steps_1.png?raw=true)

* n_steps = 3

![n_steps = 3](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/n_steps_3.png?raw=true)

* n_steps = 5

![n_steps = 5](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/n_steps_5.png?raw=true)

* n_steps = 10

![n_steps = 10](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/n_steps_10.png?raw=true)

* n_steps = 20

![n_steps = 20](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/n_steps_20.png?raw=true)

* high_noise_frac = 1.0

![high_noise_frac = 1.0](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/high_noise_frac.png?raw=true)

### ⚡ If you are interested in this topic, I recommend visiting [lexica.art](https://lexica.art/) or [https://www.reddit.com/r/StableDiffusion/](https://www.reddit.com/r/StableDiffusion/) 🙂
* 💡 You can also try the models in a web-based GUI on [HuggingFace Spaces](https://huggingface.co/spaces/stabilityai/stable-diffusion)