## AudioLDM 2, but faster ⚡️

AudioLDM 2 was proposed in [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734) 
by Haohe Liu et al. AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate realistic sound effects, human speech and music.

While the generated audios are of high quality, running inference with the model is very slow: a single audio sample takes upwards of 30 seconds to generate, with a 
real-time factor of approximately 0.3 (1 second of audio takes 3 seconds to generate). This is due to a combination of factors, including a deep multi-stage modelling 
approach, large checkpoint sizes, and un-optimised code.

In this Colab, we showcase how to use AudioLDM 2 in the Hugging Face 🧨 Diffusers library, exploring a range of code optimisations such as half-precision, flash attention,
and compilation, and model optimisations such as scheduler choice and negative prompting, to reduce the inference time by over **10 times**, with minimal degradation in quality of the output audio.

Read to the end to find out how to generate a 10 second audio sample in just 1 second, with a real-time factor of 10!
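The real-time factor quoted in this introduction is the ratio of audio duration to generation time. Here is a minimal sketch of that arithmetic; the helper name `real_time_factor` is ours for illustration and is not part of 🧨 Diffusers:

```python
def real_time_factor(audio_seconds, generation_seconds):
    """Seconds of audio produced per second of compute (higher is faster)."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


# 10 s of audio generated in 30 s: slower than real time
slow = real_time_factor(10, 30)
# 10 s of audio generated in 1 s: ten times faster than real time
fast = real_time_factor(10, 1)
print(f"slow RTF: {slow:.2f}, fast RTF: {fast:.1f}")
```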

## Set-up environment

Let’s make sure we’re connected to a GPU to run this notebook. To get a GPU, click `Runtime` -> `Change runtime type`, then change `Hardware accelerator` from `None` to `GPU`. We can verify that we’ve been assigned a GPU and view its specifications through the `nvidia-smi` command:

In [None]:
!nvidia-smi

We see here that we've got one Tesla T4 16GB GPU, although this may vary for you depending on GPU availability and Colab GPU assignment.

Next, we can install the required Python packages, namely 🧨 Diffusers for running the AudioLDM 2 diffusion process, and 🤗 Transformers for the CLAP, Flan-T5 and GPT2 models. We'll install these packages from the `main` branch of their respective repositories, since AudioLDM 2 is not yet in the latest PyPI release:

In [None]:
!pip install --quiet --upgrade git+https://github.com/huggingface/diffusers.git git+https://github.com/huggingface/transformers.git

We'll also install the nightly version of PyTorch, to leverage the latest updates to [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html):

In [None]:
!pip install --quiet --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121

## Model overview

Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM 2
is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from text embeddings.

The overall generation process is summarised as follows:

1. Given a text input $\boldsymbol{x}$, two text encoder models are used to compute the text embeddings: the text-branch of [CLAP](https://huggingface.co/docs/transformers/main/en/model_doc/clap), and the text-encoder of [Flan-T5](https://huggingface.co/docs/transformers/main/en/model_doc/flan-t5):

$$
\boldsymbol{E}_{1} = \text{CLAP}\left(\boldsymbol{x} \right); \quad \boldsymbol{E}_{2} = \text{T5}\left(\boldsymbol{x}\right)
$$

2. These text embeddings are projected to a shared embedding space through individual linear projections:

$$
\boldsymbol{P}_{1} = \boldsymbol{W}_{\text{CLAP}} \boldsymbol{E}_{1}; \quad \boldsymbol{P}_{2} = \boldsymbol{W}_{\text{T5}}\boldsymbol{E}_{2}
$$

In the `diffusers` implementation, these projections are defined by the [AudioLDM2ProjectionModel](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2/AudioLDM2ProjectionModel).

3. A [GPT2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2) language model (LM) is used to auto-regressively generate a sequence of $N$ new embedding vectors, conditional on the projected CLAP and Flan-T5 embeddings:

$$
\boldsymbol{E}_{i} = \text{GPT2}\left(\boldsymbol{P}_{1}, \boldsymbol{P}_{2}, \boldsymbol{E}_{1:i-1}\right) \qquad \text{for } i=1,\dots,N
$$

4. The generated embedding vectors $\boldsymbol{E}_{1:N}$ and Flan-T5 text embeddings $\boldsymbol{E}_{2}$ are used as cross-attention conditioning in the LDM, which *de-noises* 
a random latent via a reverse diffusion process. The LDM is run in the reverse diffusion process for a total of $T$ inference steps:

$$
\boldsymbol{z}_{t} = \text{LDM}\left(\boldsymbol{z}_{t-1} | \boldsymbol{E}_{1:N}, \boldsymbol{E}_{2}\right) \qquad \text{for } t = 1, \dots, T
$$

where the initial latent variable $\boldsymbol{z}_{0}$ is drawn from a normal distribution $\mathcal{N} \left(\boldsymbol{0}, \boldsymbol{I} \right)$. The [UNet](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2/AudioLDM2UNet2DConditionModel) of the LDM is unique in 
the sense that it takes **two** sets of cross-attention embeddings, $\boldsymbol{E}_{1:N}$ from the GPT2 language model, and $\boldsymbol{E}_{2}$ from Flan-T5, as opposed to one cross-attention conditioning as in most other LDMs.

5. The final de-noised latents $\boldsymbol{z}_{T}$ are passed to the VAE decoder to recover the Mel spectrogram $\boldsymbol{s}$:

$$
\boldsymbol{s} = \text{VAE}_{\text{dec}} \left(\boldsymbol{z}_{T}\right)
$$

6. The Mel spectrogram is passed to the vocoder to obtain the output audio waveform $\boldsymbol{y}$:

$$
\boldsymbol{y} = \text{Vocoder}\left(\boldsymbol{s}\right)
$$
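The control flow of the six steps above can be sketched with toy stand-ins. Everything in this block is hypothetical: the function names, the 4-dimensional vectors and the arithmetic inside each stub are placeholders chosen only to show the order of operations and the two loops (auto-regressive generation and iterative de-noising). None of it reflects the real models or the Diffusers API.

```python
import random

DIM = 4  # toy embedding size (assumption; the real models are far larger)


def toy_encoder(text, seed):
    """Stand-in for a text encoder: deterministic vector per (seed, text)."""
    rng = random.Random(f"{seed}:{text}")
    return [rng.uniform(-1, 1) for _ in range(DIM)]


def project(embedding, scale):
    """Stand-in for a learned linear projection."""
    return [scale * v for v in embedding]


def mean_vector(vectors):
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(DIM)]


def toy_lm_step(p1, p2, previous):
    """Stand-in for GPT2: next vector depends on both projections and all previous outputs."""
    return mean_vector([p1, p2] + previous)


def toy_denoise_step(latent, lm_embeddings, t5_embedding):
    """Stand-in for the UNet: move the latent towards the conditioning."""
    target = mean_vector(lm_embeddings + [t5_embedding])
    return [0.5 * z + 0.5 * c for z, c in zip(latent, target)]


def generate(text, n_lm_vectors=3, num_inference_steps=5, seed=0):
    e1 = toy_encoder(text, seed=1)                 # step 1: CLAP text branch
    e2 = toy_encoder(text, seed=2)                 # step 1: Flan-T5 text encoder
    p1, p2 = project(e1, 0.5), project(e2, 2.0)    # step 2: individual projections
    lm_embeddings = []
    for _ in range(n_lm_vectors):                  # step 3: auto-regressive loop
        lm_embeddings.append(toy_lm_step(p1, p2, lm_embeddings))
    rng = random.Random(seed)
    latent = [rng.gauss(0, 1) for _ in range(DIM)] # initial latent drawn from N(0, I)
    for _ in range(num_inference_steps):           # step 4: T de-noising steps
        latent = toy_denoise_step(latent, lm_embeddings, e2)
    spectrogram = [2 * v for v in latent]          # step 5: stand-in VAE decoder
    # step 6: stand-in vocoder that doubles the number of samples
    return [v for v in spectrogram for _ in range(2)]


waveform = generate("samba drums")
print(len(waveform))
```

Note how the Flan-T5 embedding is used twice: once as input to the language model (via its projection) and once directly as conditioning in the de-noising loop.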

The diagram below demonstrates how a text input is passed through the text conditioning models, with the two prompt embeddings used as cross-conditioning in the LDM:

<p align="center">
  <img src="https://raw.githubusercontent.com/sanchit-gandhi/notebooks/main/audioldm2.png?raw=true" width="600"/>
</p>

Hugging Face 🧨 Diffusers provides an end-to-end inference pipeline class [`AudioLDM2Pipeline`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2) that wraps this multi-stage generation process into a single callable object, enabling you to generate audio samples from text in just a few lines of code. 

AudioLDM 2 comes in three variants. Two of these checkpoints are applicable to the general task of text-to-audio generation. The third checkpoint is trained exclusively on text-to-music generation. See the table below for details on the three official checkpoints, which can all be found on the [Hugging Face Hub](https://huggingface.co/models?search=cvssp/audioldm2):

| Checkpoint                                                      | Task          | Model Size | Training Data / h |
|-----------------------------------------------------------------|---------------|------------|-------------------|
| [audioldm2](https://huggingface.co/cvssp/audioldm2)             | Text-to-audio | 1.1B       | 1150k             |
| [audioldm2-music](https://huggingface.co/cvssp/audioldm2-music) | Text-to-music | 1.1B       | 665k              |
| [audioldm2-large](https://huggingface.co/cvssp/audioldm2-large) | Text-to-audio | 1.5B       | 1150k             |

Now that we've covered a high-level overview of how the AudioLDM 2 generation process works, let's put this theory into code!

## Load the pipeline

For the purposes of this tutorial, we'll initialise the pipeline with the pre-trained weights from the base checkpoint, [cvssp/audioldm2](https://huggingface.co/cvssp/audioldm2). We can load the entirety of the pipeline using the [`.from_pretrained`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained) method, which will instantiate the pipeline and load the pre-trained weights:

In [None]:
from diffusers import AudioLDM2Pipeline

model_id = "cvssp/audioldm2"
pipe = AudioLDM2Pipeline.from_pretrained(model_id)

The pipeline can be moved to the GPU in much the same way as a standard PyTorch `nn.Module`:

In [None]:
pipe.to("cuda");

Great! We'll define a Generator and set a seed for reproducibility. This will allow us to tweak our prompts and observe the effect that they have on the generations by fixing the starting latents in the LDM model:

In [None]:
import torch

generator = torch.Generator("cuda").manual_seed(0)

Now we're ready to perform our first generation! We'll use the same running example throughout this notebook, where we'll condition the audio generations on a fixed text prompt
and use the same seed throughout:

In [None]:
prompt = "The sound of Brazilian samba drums with waves gently crashing in the background"

audio = pipe(prompt, audio_length_in_s=10, generator=generator).audios[0]

Cool! That run took about 13 seconds to generate. Let's have a listen to the output audio:

In [None]:
from IPython.display import Audio

Audio(audio, rate=16000)

Sounds much like our text prompt! The quality is good, but still has artefacts of background noise. We can provide the pipeline with a [*negative prompt*](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline.__call__.negative_prompt) to discourage the pipeline from generating certain features. In this case, we'll pass a negative prompt that discourages the model from generating low quality audio in the outputs:

In [None]:
negative_prompt = "Low quality, average quality."

audio = pipe(prompt, negative_prompt=negative_prompt, audio_length_in_s=10, generator=generator.manual_seed(0)).audios[0]

The inference time is un-changed when using a negative prompt; we simply replace the unconditional input to the LDM with the negative input. That means any gains we get in audio quality we get for free!
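To see why the cost does not change: with classifier-free guidance, each de-noising step already evaluates the UNet on a conditional and an unconditional input and blends the two predictions. A negative prompt only changes what the second input is. Below is a minimal sketch of that blend, using plain Python lists in place of tensors. The function name and the guidance scale of 3.5 are illustrative assumptions, not values taken from this notebook:

```python
def guided_prediction(negative_pred, prompt_pred, guidance_scale):
    """Push the prediction away from the negative branch, towards the prompt branch."""
    return [n + guidance_scale * (p - n) for n, p in zip(negative_pred, prompt_pred)]


# toy noise predictions for a two-element latent
noise_negative = [0.0, 1.0]   # UNet output conditioned on the negative (or empty) prompt
noise_prompt = [1.0, 3.0]     # UNet output conditioned on the text prompt

guided = guided_prediction(noise_negative, noise_prompt, guidance_scale=3.5)
print(guided)
```

The number of UNet evaluations per step is the same whether the second branch is conditioned on an empty prompt or on "Low quality, average quality."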

Let's take a listen to the resulting audio:

In [None]:
Audio(audio, rate=16000)

There's definitely an improvement in the overall audio quality - there are fewer noise artefacts and the audio generally sounds sharper.

## Optimisation 1: Flash Attention

PyTorch 2.0 and upwards includes an optimised and memory-efficient implementation of the attention operation through the [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) (SDPA) function. This function automatically applies several in-built optimisations depending on the inputs. Overall, the SDPA function gives similar behaviour to the [flash attention](https://arxiv.org/abs/2205.14135) implementation.

These optimisations are enabled by default in Diffusers when PyTorch 2.0 or later is installed and `torch.nn.functional.scaled_dot_product_attention` is available. To use them, install PyTorch as described above and simply use the pipeline as is!
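If you are unsure whether your environment qualifies, a quick check along these lines can help. This is a small sketch with a helper name of our own choosing; it only tests whether the function exists in the installed PyTorch, not which attention kernel will actually be selected at run time:

```python
import importlib.util


def sdpa_available():
    """Return True if the installed PyTorch exposes scaled_dot_product_attention."""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed at all
    import torch.nn.functional as F

    return hasattr(F, "scaled_dot_product_attention")


print("SDPA available:", sdpa_available())
```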

## Optimisation 2: Half-Precision

By default, the `AudioLDM2Pipeline` loads the model weights in float32 (full) precision. All the model computations are also performed in float32 precision. For inference, we can safely convert the model weights and computations to float16 (half) precision, which will give us an improvement to inference time and GPU memory, with an imperceptible change to generation quality.
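As a rough back-of-the-envelope estimate of the memory saving, we can multiply the parameter count by the bytes per parameter. The sketch below uses the 1.1B figure from the checkpoint table and counts weights only; activations, CUDA context and other overheads are ignored, so real usage will be higher:

```python
def weight_memory_gb(num_parameters, bytes_per_parameter):
    """Approximate memory needed to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


base_parameters = 1.1e9  # base checkpoint size from the table above

fp32_gb = weight_memory_gb(base_parameters, 4)  # float32: 4 bytes per parameter
fp16_gb = weight_memory_gb(base_parameters, 2)  # float16: 2 bytes per parameter
print(f"float32: {fp32_gb:.1f} GB, float16: {fp16_gb:.1f} GB")
```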

We can load the weights in float16 precision by passing the [`torch_dtype`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained.torch_dtype) argument to `.from_pretrained`:

In [None]:
pipe = AudioLDM2Pipeline.from_pretrained(model_id, torch_dtype=torch.float16)

pipe.to("cuda");

Let's run generation in float16 precision and listen to the audio outputs:

In [None]:
audio = pipe(prompt, negative_prompt=negative_prompt, audio_length_in_s=10, generator=generator.manual_seed(0)).audios[0]

Audio(audio, rate=16000)

The audio quality is largely un-changed from the full precision generation, with an inference speed-up of about 2 seconds. In our experience, we've not seen any significant audio degradation using `diffusers` pipelines with float16 precision, thus we recommend using float16 precision by default.
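If you would like to measure these timings yourself rather than rely on the notebook's progress bar, a small wrapper like the one below is one option. The helper `timed` is our own illustration, not a Diffusers utility. It synchronises the GPU before and after the call (when PyTorch and CUDA are available) so that asynchronous kernels are included in the measurement:

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    try:
        import torch

        sync = torch.cuda.synchronize if torch.cuda.is_available() else None
    except ImportError:
        sync = None  # no PyTorch: fall back to plain wall-clock timing

    if sync is not None:
        sync()  # make sure earlier GPU work is finished before starting the clock
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if sync is not None:
        sync()  # wait for the GPU work launched by fn to finish
    return result, time.perf_counter() - start


# Example with a cheap stand-in function; in the notebook you could pass
# the pipeline instead, e.g. timed(pipe, prompt, audio_length_in_s=10)
total, seconds = timed(sum, [1, 2, 3])
print(total, f"{seconds:.6f}s")
```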

## Optimisation 3: Torch Compile

To get an additional speed-up, we can use the new `torch.compile` feature. Since the UNet is usually the most computationally expensive component of the pipeline, 
we wrap the UNet with `torch.compile`, leaving the rest of the sub-models (text encoders and VAE) as is:

In [None]:
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

After wrapping the UNet with `torch.compile`, the first inference step we run is typically going to be slow, due to the overhead of compiling the forward pass of the UNet. Let's run the pipeline forward with the compilation step to get this longer run out of the way. Note that the first inference step might take up to 2 minutes to compile, so be patient!

In [None]:
audio = pipe(prompt, negative_prompt=negative_prompt, audio_length_in_s=10, generator=generator.manual_seed(0)).audios[0]

Great! Now that the UNet is compiled, we can run the full diffusion process and reap the benefits of faster inference:

In [None]:
audio = pipe(prompt, negative_prompt=negative_prompt, audio_length_in_s=10, generator=generator.manual_seed(0)).audios[0]

Only 4 seconds to generate! In practice, you will only have to compile the UNet once, and then get faster inference for all successive generations. This means that the time taken to compile the model is amortised by the gains in subsequent inference time. For more information and options regarding `torch.compile`, refer to the [torch compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) docs.
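To get a feel for when compilation pays off, we can estimate the break-even point. The numbers below are rough assumptions pieced together from this notebook: a worst-case compile time of 2 minutes, about 11 seconds per float16 generation before compiling (13 seconds minus the roughly 2 second half-precision saving), and 4 seconds afterwards. Your own timings will differ:

```python
import math


def break_even_generations(compile_seconds, seconds_before, seconds_after):
    """Number of generations after which compilation has paid for itself."""
    saving_per_generation = seconds_before - seconds_after
    if saving_per_generation <= 0:
        raise ValueError("compilation must make each generation faster")
    return math.ceil(compile_seconds / saving_per_generation)


runs = break_even_generations(compile_seconds=120, seconds_before=11, seconds_after=4)
print(f"Compilation pays off after about {runs} generations")
```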

## Optimisation 4: Scheduler

Another option is to reduce the number of inference steps. Choosing a more efficient scheduler can help decrease the number of steps without sacrificing the output audio quality. You can find which schedulers are compatible with the `AudioLDM2Pipeline` by inspecting the [`scheduler.compatibles`](https://huggingface.co/docs/diffusers/v0.20.0/en/api/schedulers/overview#diffusers.SchedulerMixin) attribute:

In [None]:
pipe.scheduler.compatibles

Alright! We've got a long list of schedulers to choose from 📝. By default, AudioLDM 2 uses the [`DDIMScheduler`](https://huggingface.co/docs/diffusers/api/schedulers/ddim), and requires 200 inference steps to get good quality audio generations. However, more performant schedulers, like [`DPMSolverMultistepScheduler`](https://huggingface.co/docs/diffusers/main/en/api/schedulers/multistep_dpm_solver#diffusers.DPMSolverMultistepScheduler), require only **20-25 inference steps** to achieve similar results.

Let's see how we can switch the AudioLDM 2 scheduler from DDIM to DPM Multistep. We'll use the [`ConfigMixin.from_config()`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method to load a [`DPMSolverMultistepScheduler`](https://huggingface.co/docs/diffusers/main/en/api/schedulers/multistep_dpm_solver#diffusers.DPMSolverMultistepScheduler) from the configuration of our original [`DDIMScheduler`](https://huggingface.co/docs/diffusers/api/schedulers/ddim):

In [None]:
from diffusers import DPMSolverMultistepScheduler

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

Let's set the number of inference steps to 20 and re-run the generation. The UNet has to be re-compiled, so we'll run one full generation to get the compilation out of the way. Again, this will take up to 2 minutes:

In [None]:
audio = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=20, audio_length_in_s=10, generator=generator.manual_seed(0)).audios[0]

Now we're ready to benchmark! We can run generation again using the compiled UNet with the DPM Multistep scheduler:

In [None]:
audio = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=20, audio_length_in_s=10, generator=generator.manual_seed(0)).audios[0]

That took less than **1 second** to generate the audio! Let's have a listen to the resulting generation:

In [None]:
Audio(audio, rate=16000)

More or less the same as our original audio sample, but in only a fraction of the generation time! 🧨 Diffusers pipelines are designed to be *composable*, allowing you to swap out schedulers and other components for more performant counterparts with ease.
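As a rough sanity check on the sub-second figure, we can scale the compiled 200-step timing by the reduction in steps. This sketch assumes that run time is dominated by the de-noising loop and grows linearly with the number of steps, and it ignores the text encoders, GPT2, VAE and vocoder, so treat the result as a lower bound rather than a prediction:

```python
def estimated_denoising_seconds(measured_seconds, measured_steps, new_steps):
    """Linearly rescale a measured run time to a different number of steps."""
    return measured_seconds * new_steps / measured_steps


# about 4 s measured with 200 DDIM steps, rescaled to 20 DPM Multistep steps
estimate = estimated_denoising_seconds(4.0, measured_steps=200, new_steps=20)
print(f"Estimated de-noising time: {estimate:.1f} s")
```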

## What about memory?

The length of the audio sample we want to generate dictates the *width* of the latent variables we de-noise in the LDM. Since the memory of the self-attention layers in the UNet scales with the square of the sequence length (width), generating very long audio samples might lead to out-of-memory errors.
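To make the quadratic scaling concrete, the sketch below compares the attention memory for two audio lengths. It assumes the latent width grows in proportion to the audio duration and considers only the attention score matrices, so it is an order-of-magnitude illustration rather than a memory profile:

```python
def attention_memory_ratio(short_seconds, long_seconds):
    """Relative attention memory when sequence length scales with audio duration."""
    length_ratio = long_seconds / short_seconds
    return length_ratio ** 2  # score matrix has (length x length) entries


ratio = attention_memory_ratio(short_seconds=10, long_seconds=300)
print(f"A 300 s sample needs about {ratio:.0f}x the attention memory of a 10 s sample")
```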

We've already mentioned that loading the model in float16 half precision gives strong memory savings. Using PyTorch 2.0 SDPA also gives a memory improvement, but this might not be sufficient for extremely large sequence lengths.

Let's try generating an audio sample five minutes (300 seconds) in duration. We'll also generate three candidate audios by setting [`num_waveforms_per_prompt`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline.__call__.num_waveforms_per_prompt)`=3`. Once [`num_waveforms_per_prompt`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline.__call__.num_waveforms_per_prompt)`>1`, automatic scoring is performed between the generated audios and the text prompt: the audios and text prompts are embedded in the CLAP audio-text embedding space, and then ranked based on their cosine similarity scores.
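The ranking step can be illustrated with a few lines of plain Python. The two-dimensional vectors below are made-up stand-ins for CLAP embeddings, and the helper names are ours; the pipeline performs the equivalent computation internally with the real CLAP model:

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(text_embedding, audio_embeddings):
    """Return candidate indices sorted from most to least similar to the text."""
    scores = [cosine_similarity(text_embedding, e) for e in audio_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


text = [1.0, 0.0]                                   # toy text embedding
candidates = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]   # toy audio embeddings
order = rank_candidates(text, candidates)
print(order)
```

Here the second candidate points in almost the same direction as the text embedding, so it is ranked first.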

Since we've changed the width of the latent variables in the UNet, we'll have to perform another torch compilation step with the new latent variable shapes. In the interest of time, we'll re-load the pipeline without torch compile, such that we're not hit with a lengthy compilation step first up:

In [None]:
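pipe = AudioLDM2Pipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# re-loading restores the default DDIM scheduler, so switch back to DPM Multistep for 20-step inference
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda");

audio = pipe(prompt, negative_prompt=negative_prompt, num_waveforms_per_prompt=3, num_inference_steps=20, audio_length_in_s=300, generator=generator.manual_seed(0)).audios[0]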

Unless you have a GPU with a large amount of memory, the code above probably returned an out-of-memory (OOM) error. While the AudioLDM 2 pipeline involves several components, only the model being used has to be on the GPU at any one time. The remainder of the modules can be offloaded to the CPU. This technique, called *CPU offload*, can reduce memory usage, with a very low penalty to inference time.

We can enable CPU offload on our pipeline with the method [`enable_model_cpu_offload()`](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline.enable_model_cpu_offload):

In [None]:
pipe.enable_model_cpu_offload()

Running generation with CPU offload is then the same as before:

In [None]:
audio = pipe(prompt, negative_prompt=negative_prompt, num_waveforms_per_prompt=3, num_inference_steps=20, audio_length_in_s=300, generator=generator.manual_seed(0)).audios[0]

And with that, we can generate a five minute audio sample in one call to the pipeline! Using the large AudioLDM 2 checkpoint will result in higher overall memory usage than the base checkpoint, since the UNet is over twice the size (750M parameters compared to 350M), so this memory saving trick is particularly beneficial here.

## Conclusion

In this Colab, we showcased four optimisation methods that are available out of the box with 🧨 Diffusers, taking the generation time of AudioLDM 2 from about 13 seconds down to less than 1 second. We also highlighted how to employ memory saving tricks, such as half-precision and CPU offload, to reduce peak memory usage for long audio samples or large checkpoint sizes.

Notebook by [Sanchit Gandhi](https://huggingface.co/sanchit-gandhi). Spectrogram image source: [Getting to Know the Mel Spectrogram](https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0). Waveform image source: [Aalto Speech Processing](https://speechprocessingbook.aalto.fi/Representations/Waveform.html).