![nebullvm nebuly AI accelerate inference optimize DeepLearning](https://user-images.githubuserinstruction.com/38586138/201391643-a80407e5-2c28-409c-90c9-327795cd27e8.png)

# Accelerate Stable Diffusion with Speedster


Hi and welcome 👋

In this notebook we will see how, in just a few steps, you can speed up Stable Diffusion inference using the Speedster module from the open-source library nebullvm. In the first section we will use `Speedster` with its default configuration; then we will explore a more advanced option that involves the TensorRT plugins, which can accelerate Stable Diffusion further on GPU.

Let's jump to the code.

# Installation

Install Speedster:

In [None]:
!pip install speedster

Install deep learning compilers:

In [None]:
!python -m nebullvm.installers.auto_installer --frameworks diffusers --compilers all

# Environment check (GPU only)

**Please skip this section if you don't have a GPU**

If you want to optimize Stable Diffusion on an NVIDIA GPU, the following requirements must be installed on your machine for the optimization to work properly:
- `CUDA>=12.0`
- `tensorrt>=8.6.0`
- `torch<=1.13.1`

Starting from TensorRT 8.6, all the pre-built tensorrt wheels released by NVIDIA support only `CUDA>=12.0`. The Speedster auto-installer installs `tensorrt>=8.6.0` only if it detects `CUDA>=12.0`; otherwise it installs `tensorrt==8.5.3.1`. In that case, you will have to upgrade your CUDA version and then upgrade tensorrt to 8.6.0 or above to execute this notebook.

It should also be possible to run TensorRT 8.6 with CUDA 11, but this requires installing TensorRT in a different way; see this issue: https://github.com/NVIDIA/TensorRT/issues/2773. Otherwise, we highly recommend simply upgrading to CUDA 12.

For now, PyTorch>=2.0.0 is not supported due to an [issue](https://github.com/pytorch/pytorch/issues/97262) in the conversion to ONNX, so until it is fixed you must have torch<=1.13.1 to optimize Stable Diffusion successfully.

First of all, let's check the CUDA version installed on the machine:

In [None]:
import torch
import subprocess

if torch.cuda.is_available():
    cuda_version = subprocess.check_output(["nvidia-smi"])
    cuda_version = int(cuda_version.decode("utf-8").split("\n")[2].split("|")[-2].split(":")[-1].strip().split(".")[0])
    assert cuda_version >= 12, ("This notebook requires CUDA>=12.0 to be executed, please upgrade your CUDA version.")

If you have CUDA<12.0, you can upgrade it at this link: https://developer.nvidia.com/cuda-downloads
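
The cell above extracts the version by splitting the `nvidia-smi` banner at fixed positions, which can break if the banner layout changes. As a minimal sketch (not part of Speedster), the same value can be read with a regular expression. `parse_cuda_major` is a hypothetical helper name, and the sample banner line is only illustrative. Keep in mind that `nvidia-smi` reports the highest CUDA version supported by the installed driver.

```python
import re
from typing import Optional


def parse_cuda_major(smi_output: str) -> Optional[int]:
    """Return the major CUDA version found in `nvidia-smi` text, or None if absent."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    return int(match.group(1)) if match else None


# Illustrative banner line; real output depends on the installed driver.
sample = "| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |"
print(parse_cuda_major(sample))
```

In the notebook, the string passed to the helper would be the decoded output of `subprocess.check_output(["nvidia-smi"])`.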

Then, let's check the tensorrt version installed on the platform. Stable Diffusion optimization is supported starting from `tensorrt==8.6.0`

In [None]:
import tensorrt
from nebullvm.tools.utils import check_module_version

if torch.cuda.is_available():
    assert check_module_version(tensorrt, "8.6.0"), ("This notebook can be run only with tensorrt>=8.6.0, if using an older version you could have issues during the optimization. Please upgrade your version.")

If you have an older version, after ensuring you have `CUDA>=12.0` installed, you can upgrade your TensorRT version by running:
```
pip install -U tensorrt
```

Finally, let's check the PyTorch version

In [None]:
import torch

from nebullvm.tools.utils import check_module_version

assert check_module_version(torch, max_version="1.13.1+cu117"), ("This notebook can be run only with torch<=1.13.1, if using a newer version you could have issues during the optimization. Please downgrade your version.")
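
To make the upper-bound check concrete, here is a minimal standard-library sketch of what comparing a version such as `1.13.1+cu117` against a maximum amounts to. It is not how `check_module_version` is implemented; `release_tuple` and `is_at_most` are hypothetical names, and the sketch assumes plain `major.minor.patch` versions with an optional `+local` suffix.

```python
import re
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Turn '1.13.1+cu117' into (1, 13, 1), ignoring the local build suffix."""
    public = version.split("+", 1)[0]
    # Keep at most the first three numeric components (major, minor, patch).
    return tuple(int(part) for part in re.findall(r"\d+", public)[:3])


def is_at_most(version: str, maximum: str) -> bool:
    """True if `version` is less than or equal to `maximum`."""
    return release_tuple(version) <= release_tuple(maximum)


print(is_at_most("1.13.1+cu117", "1.13.1"))
print(is_at_most("2.0.0+cu118", "1.13.1"))
```

The first call accepts the version and the second rejects it, which is the situation the assertion above guards against.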

## Model and Dataset setup

Once we have ensured that the required libraries are installed, we have to choose the version of Stable Diffusion we want to optimize. Speedster officially supports the most used versions:
- `CompVis/stable-diffusion-v1-4`
- `runwayml/stable-diffusion-v1-5`
- `stabilityai/stable-diffusion-2-1-base`
- `stabilityai/stable-diffusion-2-1` (only on GPUs with at least 22GB of memory; to try it on a GPU with less memory, uncomment `pipe.enable_attention_slicing()` in the cell below)

Other Stable Diffusion versions from the Diffusers library should work but have never been tested. If you try a version not included among these and it works, please feel free to report it to us on [Discord](https://discord.com/invite/RbeQMu886J) so we can add it to the list of supported versions. If you try a version that does not work, you can open an issue and possibly a PR on [GitHub](https://github.com/nebuly-ai/nebullvm/issues).

For this notebook, we are going to select Stable Diffusion 1.4. Let's download and load it using the diffusers API:

In [None]:
import torch
from diffusers import StableDiffusionPipeline

# Select Stable Diffusion version
model_id = "CompVis/stable-diffusion-v1-4"

device = "cuda" if torch.cuda.is_available() else "cpu"

if device == "cuda":
    # On GPU we load by default the model in half precision, because it's faster and lighter.
    pipe = StableDiffusionPipeline.from_pretrained(model_id, revision='fp16', torch_dtype=torch.float16)
    # pipe.enable_attention_slicing() # Uncomment for stable-diffusion-2.1 on gpus with 16GB of memory like V100-16GB and T4
else:
    pipe = StableDiffusionPipeline.from_pretrained(model_id)


Let's now create an example dataset with some random sentences, that will be used later for the optimization process

In [None]:
input_data = [
    "a photo of an astronaut riding a horse on mars",
    "a monkey eating a banana in a forest",
    "white car on a road surrounded by palm trees",
    "a fridge full of bottles of beer",
    "madara uchiha throwing asteroids against people"
]

## Speed up inference with Speedster

It's now time to improve the inference speed. Let's use `Speedster`.

In [None]:
from speedster import optimize_model, save_model, load_model

Let's move the pipe back to the CPU to free up GPU memory; `Speedster` will automatically move it to the GPU when required.

In [None]:
import gc

# Move the pipe back to cpu
pipe.to("cpu")

# Clean memory
torch.cuda.empty_cache()
gc.collect()

Using Speedster is simple and straightforward! Just call the `optimize_model` function and provide the model, some example input data and the optimization time mode. Optionally, a `dynamic_info` dictionary can also be provided to support inputs with dynamic shape.

**Optimization of Stable Diffusion requires a lot of RAM. If you are running this notebook on Google Colab, make sure to use the high-RAM option, otherwise the kernel may crash. If the kernel still crashes with the high-RAM option, try adding `"torchscript"` to the `ignore_compilers` list.
If running on GPU, the optimization requires at least 16GB of GPU memory to exploit the best techniques for optimizing the model; otherwise it may fail with a memory error.**

In [None]:
optimized_model = optimize_model(
    model=pipe,
    input_data=input_data,
    optimization_time="unconstrained",
    ignore_compilers=["torch_tensor_rt", "tvm"],  # Some compilers have issues with Stable Diffusion, so it's better to skip them.
    metric_drop_ths=0.2,
)
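
If the kernel crashes even with the high-RAM option, the note above suggests also skipping `"torchscript"`. A minimal sketch of how the arguments would change is shown below; the flag `kernel_crashed_on_high_ram` is a hypothetical name introduced only for illustration, and the actual `optimize_model` call is left as a comment so the snippet runs on its own.

```python
# Baseline list taken from the optimization cell above.
ignore_compilers = ["torch_tensor_rt", "tvm"]

# Hypothetical flag: set it to True if the kernel crashed during optimization.
kernel_crashed_on_high_ram = True
if kernel_crashed_on_high_ram:
    ignore_compilers.append("torchscript")

optimize_kwargs = {
    "optimization_time": "unconstrained",
    "ignore_compilers": ignore_compilers,
    "metric_drop_ths": 0.2,
}
print(optimize_kwargs)

# In the notebook, these would be passed along with the model and the data:
# optimized_model = optimize_model(model=pipe, input_data=input_data, **optimize_kwargs)
```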

If running on GPU, you should obtain a speedup of about 124% on the UNet. We ran the optimization on a **3090Ti**; here are our results:
- **Original Model (PyTorch, fp16): 51.557 ms/batch**
- **Optimized Model (TensorRT, fp16): 23.055 ms/batch**
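
The quoted speedup follows directly from the two latencies. The short calculation below assumes the commas in the reported figures are decimal separators (about 51.6 ms and 23.1 ms per batch), which is the only reading consistent with the 124% figure.

```python
original_ms = 51.557   # PyTorch fp16 latency reported above
optimized_ms = 23.055  # TensorRT fp16 latency reported above

speedup_factor = original_ms / optimized_ms
speedup_percent = (speedup_factor - 1) * 100
print(f"{speedup_factor:.2f}x faster, i.e. about {speedup_percent:.0f}% speedup")
```

The optimized UNet is roughly 2.24 times as fast as the original, which corresponds to a speedup of about 124%.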

If the optimized model you obtained is not a TensorRT one, there was probably an error during the optimization. On Colab, the standard GPU may not have enough memory to run the optimization, so we suggest selecting a premium GPU with more memory.


If everything worked correctly, let's check the output of the optimized model

In [None]:
test_prompt = "futuristic llama with a cyberpunk city on the background"


In [None]:
optimized_model(test_prompt).images[0]

Let's run the prediction 10 times (2 warm-up runs followed by 8 timed runs) to calculate the average response time of the original model.

In [None]:
if device == "cuda":
    pipe = StableDiffusionPipeline.from_pretrained(model_id, revision='fp16', torch_dtype=torch.float16)
    # pipe.enable_attention_slicing() # Uncomment for stable-diffusion-2.1 on gpus with 16GB of memory like V100-16GB and T4
else:
    pipe = StableDiffusionPipeline.from_pretrained(model_id)

pipe.to(device)

In [None]:
import time

times = []

# Warmup for 2 iterations
for _ in range(2):
    with torch.no_grad():
        final_out = pipe(test_prompt).images[0]

# Benchmark
for _ in range(8):
    st = time.time()
    with torch.no_grad():
        final_out = pipe(test_prompt).images[0]
    times.append(time.time()-st)
original_model_time = sum(times)/len(times)
print(f"Average response time for original Stable Diffusion 1.4: {original_model_time} s")

Let's run the prediction 10 times (2 warm-up runs followed by 8 timed runs) to calculate the average response time of the optimized model.

In [None]:
times = []

# Warmup for 2 iterations
for _ in range(2):
    with torch.no_grad():
        final_out = optimized_model(test_prompt).images[0]

# Benchmark
for _ in range(8):
    st = time.time()
    with torch.no_grad():
        final_out = optimized_model(test_prompt).images[0]
    times.append(time.time()-st)
optimized_model_time = sum(times)/len(times)
print(f"Average response time for optimized Stable Diffusion 1.4: {optimized_model_time} s")
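
The two timing cells above follow the same pattern: a couple of untimed warm-up calls, then the mean over the timed calls. As a minimal sketch, this can be factored into one helper. `average_latency` is a hypothetical name, and a trivial stand-in workload replaces the pipeline so the snippet runs anywhere.

```python
import time
from statistics import mean
from typing import Callable


def average_latency(fn: Callable[[], object], warmup: int = 2, runs: int = 8) -> float:
    """Call `fn` `warmup` times untimed, then return the mean of `runs` timed calls (seconds)."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Stand-in workload; in the notebook you would pass something like
# `lambda: pipe(test_prompt).images[0]` instead.
calls = []
latency = average_latency(lambda: calls.append(1))
print(f"{len(calls)} calls, average {latency:.6f} s")
```

With both averages available, the end-to-end speedup is simply `original_model_time / optimized_model_time`.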

## Save and reload the optimized model

We can easily save the optimized model to disk with the following line:

In [None]:
save_model(optimized_model, "model_save_path")

We can then load the model again:

In [None]:
optimized_model = load_model("model_save_path", pipe=pipe)

Great! Was it easy? How are the results? Do you have any comments?
Share your optimization results and thoughts with <a href="https://discord.gg/RbeQMu886J" target="_blank"> our community on Discord</a>, where we chat about Speedster and AI acceleration.

Note that the acceleration of Speedster depends very much on the hardware configuration and your AI model. Given the same input model, Speedster can accelerate it by 10 times on some machines and perform poorly on others.

If you want to learn more about how Speedster works, look at other tutorials and performance benchmarks, check out the links below or write to us on Discord.

<center> 
    <a href="https://discord.com/invite/RbeQMu886J" target="_blank" style="text-decoration: none;"> Join the community </a> |
    <a href="https://nebuly.gitbook.io/nebuly/welcome/questions-and-contributions" target="_blank" style="text-decoration: none;"> Contribute to the library </a>
</center>

<center> 
    <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#key-concepts" target="_blank" style="text-decoration: none;"> How speedster works </a> •
    <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#documentation" target="_blank" style="text-decoration: none;"> Documentation </a> •
    <a href="https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/speedster#quick-start" target="_blank" style="text-decoration: none;"> Quick start </a> 
</center>