<a href="https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/AudioLDM-2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## AudioLDM 2, but faster ⚡️

AudioLDM 2 was proposed in [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734)
by Haohe Liu et al. AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate realistic sound effects, human speech and music.

In this Colab, we showcase how to use AudioLDM 2 in the Hugging Face 🧨 Diffusers library, exploring a range of code optimisations such as half-precision and flash attention, and model optimisations such as scheduler choice and negative prompting, to reduce the inference time by over **10 times**, with minimal degradation in quality of the output audio.

Read to the end to find out how to generate a 10 second audio sample in just 2 seconds!

<a name="cell-id"></a>
## Set-up environment

Let’s make sure we’re connected to a GPU to run this notebook. To get a GPU, click `Runtime` -> `Change runtime type`, then change `Hardware accelerator` from `None` to `GPU`. We can verify that we’ve been assigned a GPU and view its specifications through the `nvidia-smi` command:

In [1]:
!nvidia-smi

Wed Aug 30 14:30:41 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

We see here that we've got one Tesla T4 16GB GPU, although this may vary for you depending on GPU availability and Colab GPU assignment.
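The same check can be scripted from Python, which is handy if later cells should fail early when no GPU is attached. The sketch below is one possible approach that uses only the standard library; the helper name `query_gpu` is our own, and it simply shells out to `nvidia-smi` when that binary is present.

```python
import shutil
import subprocess


def query_gpu():
    """Return one 'name, total memory' string per visible NVIDIA GPU (empty list if none)."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = query_gpu()
print(gpus if gpus else "No GPU found: check the Colab runtime type")
```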

Next, we can install the required Python packages, namely:
1. 🧨 Diffusers for running the AudioLDM 2 diffusion pipeline
2. 🤗 Transformers for the CLAP, Flan-T5 and GPT2 models
3. 🤗 Accelerate for CPU offload features

We'll install the first two of these packages from the `main` branch of their respective repositories, since AudioLDM 2 is not yet in the latest PyPI release:

In [2]:
!pip install --quiet --upgrade git+https://github.com/huggingface/diffusers.git git+https://github.com/huggingface/transformers.git accelerate

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for diffusers (pyproject.toml) ... done
  Building wheel for transformers (pyproject.toml) ... done
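Since the packages come from `main`, it can help to record which versions were actually installed. The snippet below is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is our own, and `torch` is included in the list on the assumption that the runtime ships with it.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("diffusers", "transformers", "accelerate", "torch")):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```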


## Load the pipeline

For the purposes of this tutorial, we'll initialise the pipeline with the pre-trained weights from the base checkpoint, [cvssp/audioldm2](https://huggingface.co/cvssp/audioldm2). We can load the entirety of the pipeline using the [`.from_pretrained`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained) method, which will instantiate the pipeline and load the pre-trained weights:

In [None]:
from diffusers import AudioLDM2Pipeline

model_id = "cvssp/audioldm2"
pipe = AudioLDM2Pipeline.from_pretrained(model_id)

Loading pipeline components...:   0%|          | 0/11 [00:00<?, ?it/s]

The pipeline can be moved to the GPU in much the same way as a standard PyTorch `nn.Module`:

In [None]:
pipe.to("cuda");

Great! We'll define a `torch.Generator` and set a seed for reproducibility. This will allow us to tweak our prompts and observe the effect that they have on the generations by fixing the starting latents in the LDM:

In [None]:
import torch

generator = torch.Generator("cuda").manual_seed(0)

Now we're ready to perform our first generation! We'll use the same running example throughout this notebook, where we'll condition the audio generations on a fixed text prompt
and use the same seed throughout. The [`audio_length_in_s`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline.__call__.audio_length_in_s) argument controls the length of the generated audio. It defaults to the audio length that the LDM was trained on (10.24 seconds):

In [None]:
prompt = "The sound of Brazilian samba drums with waves gently crashing in the background"

audio = pipe(prompt, audio_length_in_s=10.24, generator=generator).audios[0]

  0%|          | 0/200 [00:00<?, ?it/s]

Cool! That run took about 35 seconds to generate. Let's have a listen to the output audio:

In [None]:
from IPython.display import Audio

Audio(audio, rate=16000)

Sounds much like our text prompt! The quality is good, but still has artefacts of background noise. We can provide the pipeline with a [*negative prompt*](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline.__call__.negative_prompt) to discourage the pipeline from generating certain features. In this case, we'll pass a negative prompt that discourages the model from generating low quality audio in the outputs. We'll omit the `audio_length_in_s` argument and leave it to take its default value:

In [None]:
negative_prompt = "Low quality, average quality."

audio = pipe(prompt, negative_prompt=negative_prompt, generator=generator.manual_seed(0)).audios[0]

  0%|          | 0/200 [00:00<?, ?it/s]

The inference time is unchanged when using a negative prompt\\({}^1\\); we simply replace the unconditional input to the LDM with the negative input. That means any gains we get in audio quality we get for free!

Let's take a listen to the resulting audio:

In [None]:
Audio(audio, rate=16000)

There's definitely an improvement in the overall audio quality: there are fewer noise artefacts and the audio generally sounds sharper.

\\({}^1\\) Note that in practice, we typically see a reduction in inference time going from our first generation to our second. This is due to a CUDA "warm-up" that occurs the first time we run the computation. The second generation is a better benchmark for our actual inference time.
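To account for the warm-up effect when timing the pipeline, one option is a small helper that discards the first call before measuring. The sketch below is an assumption on our part rather than part of the original benchmark; `benchmark` is a hypothetical helper, and a cheap stand-in workload is used so the cell runs without a GPU.

```python
import time


def benchmark(fn, warmup=1, runs=3):
    """Call `fn` `warmup` times untimed, then return the mean seconds over `runs` timed calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


# Stand-in workload. In the notebook this could wrap the pipeline call instead, e.g.
# benchmark(lambda: pipe(prompt, negative_prompt=negative_prompt, generator=generator.manual_seed(0)))
mean_s = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"mean time per run: {mean_s:.4f} s")
```

Averaging over a few runs also smooths out run-to-run jitter, at the cost of a longer benchmark.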

## Optimisation 1: Flash Attention

PyTorch 2.0 and upwards includes an optimised and memory-efficient implementation of the attention operation through the [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) (SDPA) function. This function automatically applies several in-built optimisations depending on the inputs, and runs faster and with less memory than the vanilla attention implementation. Overall, the SDPA function gives similar behaviour to *flash attention*, as proposed in the paper [FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135) by Dao et al.

These optimisations are enabled by default in Diffusers if PyTorch 2.0 or later is installed and `torch.nn.functional.scaled_dot_product_attention` is available. To use them, make sure PyTorch 2.0 or later is installed and simply use the pipeline as is 🚀

In [None]:
audio = pipe(prompt, negative_prompt=negative_prompt, generator=generator.manual_seed(0)).audios[0]

For more details on the use of SDPA in `diffusers`, refer to the corresponding [documentation](https://huggingface.co/docs/diffusers/optimization/torch2.0).
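If you're unsure whether your environment meets this condition, a quick check along the following lines can confirm it. This is a minimal sketch; the helper name `sdpa_available` is our own, and it only tests that the function exists, not which attention kernel PyTorch will select at run time.

```python
import importlib.util


def sdpa_available():
    """Return True if PyTorch is installed and exposes scaled_dot_product_attention."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch.nn.functional as F

    return hasattr(F, "scaled_dot_product_attention")


print("SDPA available:", sdpa_available())
```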

## Optimisation 2: Half-Precision

By default, the `AudioLDM2Pipeline` loads the model weights in float32 (full) precision. All the model computations are also performed in float32 precision. For inference, we can safely convert the model weights and computations to float16 (half) precision, which will give us an improvement to inference time and GPU memory, with an imperceptible change to generation quality.

We can load the weights in float16 precision by passing the [`torch_dtype`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained.torch_dtype) argument to `.from_pretrained`:

In [None]:
pipe = AudioLDM2Pipeline.from_pretrained(model_id, torch_dtype=torch.float16)

pipe.to("cuda");

Loading pipeline components...:   0%|          | 0/11 [00:00<?, ?it/s]

Let's run generation in float16 precision and listen to the audio outputs:

In [None]:
audio = pipe(prompt, negative_prompt=negative_prompt, generator=generator.manual_seed(0)).audios[0]

Audio(audio, rate=16000)

  0%|          | 0/200 [00:00<?, ?it/s]

The audio quality is largely unchanged from the full precision generation, with an inference speed-up of about 10 seconds. In our experience, we've not seen any significant audio degradation using `diffusers` pipelines with float16 precision, but consistently reap a substantial inference speed-up. Thus, we recommend using float16 precision by default.
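The memory side of this trade-off is simple arithmetic: float32 stores each parameter in 4 bytes and float16 in 2. The sketch below applies that to the 350M-parameter UNet figure quoted for the base checkpoint later in this notebook. It counts weights only, so activations and the other pipeline components are left out, and the helper name `weights_size_gb` is our own.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weights_size_gb(num_params, dtype):
    """Approximate size of the weights alone in GB (10^9 bytes); activations are not counted."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


unet_params = 350_000_000  # base-checkpoint UNet size quoted later in this notebook
for dtype in ("float32", "float16"):
    print(f"{dtype}: {weights_size_gb(unet_params, dtype):.2f} GB")
```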

## Optimisation 3: Scheduler

Another option is to reduce the number of inference steps. Choosing a more efficient scheduler can help decrease the number of steps without sacrificing the output audio quality. You can find which schedulers are compatible with the `AudioLDM2Pipeline` by calling the [`scheduler.compatibles`](https://huggingface.co/docs/diffusers/v0.20.0/en/api/schedulers/overview#diffusers.SchedulerMixin) attribute:

In [None]:
pipe.scheduler.compatibles

[diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler,
 diffusers.schedulers.scheduling_ddpm.DDPMScheduler,
 diffusers.schedulers.scheduling_pndm.PNDMScheduler,
 diffusers.utils.dummy_torch_and_torchsde_objects.DPMSolverSDEScheduler,
 diffusers.schedulers.scheduling_k_dpm_2_discrete.KDPM2DiscreteScheduler,
 diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler,
 diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler,
 diffusers.schedulers.scheduling_unipc_multistep.UniPCMultistepScheduler,
 diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler,
 diffusers.schedulers.scheduling_k_dpm_2_ancestral_discrete.KDPM2AncestralDiscreteScheduler,
 diffusers.schedulers.scheduling_ddim.DDIMScheduler,
 diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler,
 diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler,
 diffusers.schedulers.scheduling_deis_multistep.DEISMultiste

Alright! We've got a long list of schedulers to choose from 📝. By default, AudioLDM 2 uses the [`DDIMScheduler`](https://huggingface.co/docs/diffusers/api/schedulers/ddim), and requires 200 inference steps to get good quality audio generations. However, more performant schedulers, like [`DPMSolverMultistepScheduler`](https://huggingface.co/docs/diffusers/main/en/api/schedulers/multistep_dpm_solver#diffusers.DPMSolverMultistepScheduler), require only **20-25 inference steps** to achieve similar results.
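The fully qualified class paths printed above are quite verbose. If only the class names are of interest, a small helper such as the one below can condense them. This is just a sketch: `compatible_scheduler_names` is our own helper, and the stub classes are hypothetical stand-ins so that the snippet runs without loading the pipeline.

```python
def compatible_scheduler_names(scheduler):
    """Return the sorted class names listed in `scheduler.compatibles`."""
    return sorted(cls.__name__ for cls in scheduler.compatibles)


# Hypothetical stand-ins for real scheduler classes (illustration only).
class FakeSchedulerB:
    pass


class FakeSchedulerA:
    pass


class StubScheduler:
    compatibles = [FakeSchedulerB, FakeSchedulerA]


names = compatible_scheduler_names(StubScheduler())
print(names)
# In the notebook, call it on the real scheduler: compatible_scheduler_names(pipe.scheduler)
```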

Let's see how we can switch the AudioLDM 2 scheduler from DDIM to DPM Multistep. We'll use the [`ConfigMixin.from_config()`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method to load a [`DPMSolverMultistepScheduler`](https://huggingface.co/docs/diffusers/main/en/api/schedulers/multistep_dpm_solver#diffusers.DPMSolverMultistepScheduler) from the configuration of our original [`DDIMScheduler`](https://huggingface.co/docs/diffusers/api/schedulers/ddim):

In [None]:
from diffusers import DPMSolverMultistepScheduler

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

Let's set the number of inference steps to 20 and re-run the generation with the new scheduler:

In [None]:
audio = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=20, generator=generator.manual_seed(0)).audios[0]

  0%|          | 0/20 [00:00<?, ?it/s]

That took just **2 seconds** to generate the audio! Let's have a listen to the resulting generation:

In [None]:
Audio(audio, rate=16000)

More or less the same as our original audio sample, but only a fraction of the generation time! 🧨 Diffusers pipelines are designed to be *composable*, allowing you to swap out schedulers and other components for more performant counterparts with ease.
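Once you're happy with a generation, you may want to keep it as a file rather than only play it inline. The sketch below writes mono 16-bit PCM with the standard library's `wave` module, assuming the samples are floats in the range [-1, 1] at the 16 kHz rate used above. The helper name `save_wav` and the output path are our own choices, and a synthetic tone stands in for the generated `audio` so the cell runs on its own.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 16000  # same rate passed to `Audio(...)` in this notebook


def save_wav(samples, path, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1, 1] to a mono, 16-bit PCM WAV file."""
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overflow
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))


# Stand-in signal: one second of a 440 Hz tone. In the notebook, pass `audio` instead.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
wav_path = os.path.join(tempfile.gettempdir(), "audioldm2_sample.wav")
save_wav(tone, wav_path)
```

For long clips, a vectorised writer from an audio library will be faster than this per-sample loop.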

## What about memory?

The length of the audio sample we want to generate dictates the *width* of the latent variables we de-noise in the LDM. Since the memory of the self-attention layers in the UNet scales with sequence length (width) squared, generating very long audio samples might lead to out-of-memory errors. Our batch size also governs our memory usage, controlling the number of samples that we generate.

We've already mentioned that loading the model in float16 half precision gives strong memory savings. Using PyTorch 2.0 SDPA also gives a memory improvement, but this might not be sufficient for extremely large sequence lengths.
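To get a feel for the numbers, we can estimate how the attention memory grows relative to the default 10.24 second generation. The sketch below takes the quadratic scaling described above at face value and assumes memory grows linearly with batch size. It is a rough back-of-the-envelope estimate, not a measurement, and `relative_attention_memory` is our own helper.

```python
def relative_attention_memory(audio_length_s, batch_size=1, base_length_s=10.24, base_batch_size=1):
    """Attention memory relative to the baseline, assuming quadratic growth with
    audio length (latent width) and linear growth with batch size."""
    length_ratio = audio_length_s / base_length_s
    return (length_ratio ** 2) * (batch_size / base_batch_size)


print(f"{relative_attention_memory(60):.1f}x")                # one 60 second waveform
print(f"{relative_attention_memory(60, batch_size=4):.1f}x")  # four 60 second waveforms
```

Under these assumptions, a batch of four 60 second clips needs over a hundred times the attention memory of a single default-length clip, which is why long generations can exhaust a 16GB GPU.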

Let's try generating an audio sample 60 seconds in duration. We'll also generate 4 candidate audios by setting [`num_waveforms_per_prompt`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline.__call__.num_waveforms_per_prompt)`=4`. When [`num_waveforms_per_prompt`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline.__call__.num_waveforms_per_prompt)`>1`, automatic scoring is performed between the generated audios and the text prompt: the audios and text prompts are embedded in the CLAP audio-text embedding space, and then ranked based on their cosine similarity scores. The 'best' waveform is then the one at position `[0]`.
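The ranking step itself is easy to picture. The sketch below orders candidates by cosine similarity to a text embedding, using toy 3-dimensional vectors in place of real CLAP embeddings. It illustrates the idea only and is not the pipeline's internal code; the function names are our own.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(text_embedding, audio_embeddings):
    """Return candidate indices ordered from most to least similar to the text embedding."""
    scores = [cosine_similarity(text_embedding, emb) for emb in audio_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for CLAP embeddings (illustrative values only).
text = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0], [-1.0, 0.0, 0.0]]
ranking = rank_candidates(text, candidates)
print(ranking)  # candidate 1 points most nearly in the same direction as `text`
```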

We'll re-load the pipeline in float16 precision and run the generation. Note that re-loading restores the default `DDIMScheduler`, so the scheduler swap from the previous section has to be repeated if you want to keep it:

In [None]:
pipe = AudioLDM2Pipeline.from_pretrained(model_id, torch_dtype=torch.float16)

pipe.to("cuda")

audio = pipe(prompt, negative_prompt=negative_prompt, num_waveforms_per_prompt=4, audio_length_in_s=60, num_inference_steps=20, generator=generator.manual_seed(0)).audios[0]

Loading pipeline components...:   0%|          | 0/11 [00:00<?, ?it/s]

  0%|          | 0/20 [00:00<?, ?it/s]

OutOfMemoryError: ignored

Unless you have a GPU with a large amount of memory, the code above probably returned an OOM error. While the AudioLDM 2 pipeline involves several components, only the model being used has to be on the GPU at any one time. The remainder of the modules can be offloaded to the CPU. This technique, called *CPU offload*, can reduce memory usage, with a very low penalty to inference time.

We can enable CPU offload on our pipeline with the [`enable_model_cpu_offload()`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline.enable_model_cpu_offload) method:

In [None]:
pipe.enable_model_cpu_offload()

Running generation with CPU offload is then the same as before:

In [None]:
audio = pipe(prompt, negative_prompt=negative_prompt, num_waveforms_per_prompt=4, audio_length_in_s=60, num_inference_steps=20, generator=generator.manual_seed(0)).audios[0]

And with that, we can generate four 60-second audio samples, all in one call to the pipeline! Using the large AudioLDM 2 checkpoint will result in higher overall memory usage than the base checkpoint, since the UNet is over twice the size (750M parameters compared to 350M), so this memory saving trick is particularly beneficial here.
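The reason model-level offload helps can be seen with a toy calculation: without offload, the weights of every component sit on the GPU at once, whereas with offload only the component currently running does. The sizes below are placeholder units chosen purely for illustration, not measurements of AudioLDM 2, and the sketch ignores activation memory. The helper name `peak_weight_memory` and the component names are our own.

```python
def peak_weight_memory(component_sizes, offload):
    """Peak memory taken by weights: the sum of all components without offload,
    the largest single component with model-level offload."""
    sizes = list(component_sizes.values())
    return max(sizes) if offload else sum(sizes)


# Placeholder sizes in arbitrary units (illustration only, not measured values).
components = {"text_encoder": 4, "language_model": 2, "unet": 3, "vocoder": 1}
print("without offload:", peak_weight_memory(components, offload=False))
print("with offload:   ", peak_weight_memory(components, offload=True))
```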

## Conclusion

In this Colab, we showcased three optimisation methods that are available out of the box with 🧨 Diffusers, taking the generation time of AudioLDM 2 from about 35 seconds down to about 2 seconds. We also highlighted how to employ memory saving tricks, such as half-precision and CPU offload, to reduce peak memory usage for long audio samples or large checkpoint sizes.

Notebook by [Sanchit Gandhi](https://huggingface.co/sanchit-gandhi). Spectrogram image source: [Getting to Know the Mel Spectrogram](https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0). Waveform image source: [Aalto Speech Processing](https://speechprocessingbook.aalto.fi/Representations/Waveform.html).