# Session 8: Diffusion Models for Music Generation

Agenda:
- Introduction to Diffusion Models
- Conditioning & Classifier-Free Guidance
- Hands On: Using Stable Audio

## Introduction to Diffusion Models

Image example from _developer.nvidia.com_.

![](./assets/diffusion.png)


### Some concepts

- $X_0$ is a random variable, distributed according to a distribution of interest
that admits density $p_0$.

- $X_t$ is a noisy version of $X_0$, where the noise is generated through a
noise model:

    $$
    X_t = \sqrt{1-\sigma^2_t}X_0 + \sigma_t Z, \; \; Z \sim \mathcal{N}(0, I_d),
    $$

    where $\sigma_0=0$, $\sigma_t=\sigma(t) \in [0, 1]$ is an increasing
    function of $t$ and $\lim_{t \to \infty} \sigma(t) = 1$.

- We call $\sigma(t)$ the _noise schedule_. It can take several functional
forms, e.g.:

    $$
    \sigma(t) = \sin \Big( \frac{t/T + s}{1 + s}\frac{\pi}{2} \Big)
    $$

    <center><img src="assets/cosine_noise.png" width="50%"/></center>

    In this case, we have a continuous time $t \in \mathbb{R}^+$. Some other
    models might use a finite set of noise levels with $t \in \mathbb{N}$.
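
As a rough illustration of the noise model and schedule above, the following NumPy sketch evaluates the sine schedule and applies the noising step. It assumes that $T$ is the time horizon and $s$ a small offset (neither is defined above) and uses $s = 0$ so that $\sigma_0 = 0$; the function names are hypothetical.

```python
import numpy as np

def sine_sigma(t, T=1.0, s=0.0):
    """Sine noise schedule: sigma(t) = sin(((t/T + s) / (1 + s)) * pi/2)."""
    return np.sin((np.asarray(t) / T + s) / (1.0 + s) * np.pi / 2.0)

def add_noise(x0, sigma, rng):
    """Noise model: X_t = sqrt(1 - sigma^2) * X_0 + sigma * Z, with Z ~ N(0, I)."""
    z = rng.standard_normal(x0.shape)
    return np.sqrt(1.0 - sigma**2) * x0 + sigma * z

rng = np.random.default_rng(0)
x0 = rng.standard_normal(100_000)  # toy "data" with unit variance (assumption)

for t in (0.0, 0.25, 0.5, 1.0):
    sigma = sine_sigma(t)
    xt = add_noise(x0, sigma, rng)
    print(f"t={t:.2f}  sigma={sigma:.3f}  var(X_t)={xt.var():.3f}")
```

With unit-variance toy data, the variance of $X_t$ should stay close to 1 at every noise level, while the signal is progressively replaced by noise.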

### Corruption Process
 
The _corruption process_ describes how the noise is added to our data. Common
choices are:

1. **Variance Exploding:** $X_t = X_0 + \sigma_t Z, \; \; \sigma_t \geq 0, \lim_{t \to \infty} \sigma_t = \infty$

2. **Variance Preserving:** $X_t = \sqrt{1-\sigma_t^2} X_0 + \sigma_t Z, \; \; 0 \leq \sigma_t \leq 1, \lim_{t \to \infty} \sigma_t = 1$

3. **Flow Matching:** $X_t = (1 - \sigma_t) X_0 + \sigma_t Z, \; \; 0 \leq \sigma_t \leq 1, \lim_{t \to \infty} \sigma_t = 1$
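
The three corruption processes can be compared side by side in a small sketch. The toy data distribution, the chosen noise levels, and the function name `corrupt` are assumptions made for illustration only.

```python
import numpy as np

def corrupt(x0, sigma, kind, rng):
    """Apply one of the three corruption processes to x0 at noise level sigma."""
    z = rng.standard_normal(x0.shape)
    if kind == "variance_exploding":
        return x0 + sigma * z
    if kind == "variance_preserving":
        return np.sqrt(1.0 - sigma**2) * x0 + sigma * z
    if kind == "flow_matching":
        return (1.0 - sigma) * x0 + sigma * z
    raise ValueError(f"unknown corruption process: {kind}")

rng = np.random.default_rng(0)
x0 = 2.0 + 0.5 * rng.standard_normal(100_000)  # toy data: mean 2, std 0.5

# Compare each process at a high noise level
for kind, sigma in [
    ("variance_exploding", 10.0),
    ("variance_preserving", 1.0),
    ("flow_matching", 1.0),
]:
    xt = corrupt(x0, sigma, kind, rng)
    print(f"{kind:20s} sigma={sigma:5.1f}  mean={xt.mean():6.3f}  std={xt.std():6.3f}")
```

At $\sigma_t = 1$ the variance-preserving and flow-matching processes return pure standard Gaussian noise, whereas the variance-exploding process keeps the data mean and lets the noise dominate through its scale.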

The corruption process can also be described by a stochastic differential equation (SDE):

$$
dX_t = f(X_t, t)dt + g(t)dB_t,
$$

where $f(X_t,t)$ is the _drift_, $g(t)$ is the _diffusion coefficient_, and
$B_t$ is a _Brownian motion_.

**Important:** This process is _reversible_, and we have:

$$
d\tilde{X}_t = \big( -f(\tilde{X}_t, T-t) + g^2(T-t) {\color{orange} \nabla \log p_{T-t} (\tilde{X}_t) } \big) dt + g(T-t) d\tilde{B}_t,
$$

where $\tilde{X}_t$ and $X_{T-t}$ follow the same distribution, $B_t$ and
$\tilde{B}_t$ follow the same distribution, $p_t$ is the density of $X_t$, and ${\color{orange} \nabla \log p_t(\cdot)}$
is the <span style="color: orange">**score function**</span>.
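
To make the reverse-time equation more concrete, here is a minimal sketch for a case where the score is known in closed form. It assumes one-dimensional Gaussian data $X_0 \sim \mathcal{N}(m, s^2)$ and a variance-preserving SDE with constant rate $\beta$, i.e. $f(x, t) = -\tfrac{1}{2}\beta x$ and $g(t) = \sqrt{\beta}$. These choices, the Euler-Maruyama discretization, and all function names are illustrative assumptions. In practice the score is not available analytically and is approximated by a neural network.

```python
import numpy as np

def gaussian_score(x, tau, m, s, beta):
    """Score of p_tau when X_0 ~ N(m, s^2) follows dX = -0.5*beta*X dt + sqrt(beta) dB."""
    decay = np.exp(-beta * tau)
    mean = m * np.sqrt(decay)           # E[X_tau]
    var = s**2 * decay + (1.0 - decay)  # Var[X_tau]
    return -(x - mean) / var

def sample_reverse_sde(n, m, s, beta=2.0, T=5.0, steps=500, seed=0):
    """Euler-Maruyama integration of the reverse-time SDE, starting from noise."""
    rng = np.random.default_rng(seed)
    dt = T / steps
    x = rng.standard_normal(n)  # p_T is approximately N(0, 1) when beta*T is large
    for k in range(steps):
        tau = T - k * dt  # forward time that matches reverse time k*dt
        # Reverse drift: -f(x, T-t) + g^2(T-t) * score_{T-t}(x)
        drift = 0.5 * beta * x + beta * gaussian_score(x, tau, m, s, beta)
        x = x + drift * dt + np.sqrt(beta * dt) * rng.standard_normal(n)
    return x

samples = sample_reverse_sde(n=20_000, m=3.0, s=0.5)
print(f"mean={samples.mean():.2f}, std={samples.std():.2f}  (target: 3.00, 0.50)")
```

Up to discretization and sampling error, the samples obtained at the end of the reverse process should follow the data distribution $\mathcal{N}(m, s^2)$.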

<div class="alert alert-info">

Instead of approximating $p(X)$ directly, as likelihood-based models do,
diffusion models try to approximate the score $\nabla \log p(X)$.

</div>

By repeatedly moving in the direction of the score function, we can follow a
trajectory that leads to a sample from $p_0$.

![](./assets/sgld.png)
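
The idea of following the score can be sketched with unadjusted Langevin dynamics, $x_{k+1} = x_k + \epsilon \nabla \log p(x_k) + \sqrt{2\epsilon}\, z_k$ with $z_k \sim \mathcal{N}(0, I)$. The example below assumes a two-dimensional Gaussian target whose score is known exactly; the step size, number of steps, and function names are arbitrary illustrative choices.

```python
import numpy as np

def target_score(x, mean, std):
    """Score of an isotropic Gaussian N(mean, std^2 I): grad_x log p(x)."""
    return -(x - mean) / std**2

def langevin_sample(score_fn, x_init, step_size=1e-2, n_steps=1000, seed=0):
    """Unadjusted Langevin dynamics: drift along the score plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step_size * score_fn(x) + np.sqrt(2.0 * step_size) * noise
    return x

mean = np.array([2.0, -1.0])
std = 0.5
x_init = np.zeros((5_000, 2))  # 5000 independent chains, all started at the origin

samples = langevin_sample(lambda x: target_score(x, mean, std), x_init)
print("empirical mean:", samples.mean(axis=0))
print("empirical std: ", samples.std(axis=0))
```

Each chain starts far from the target and drifts towards high-density regions, so the final points should be distributed approximately like the target Gaussian.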

### The UNet Architecture

From [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597) and [Attention U-Net: Learning Where to Look for the Pancreas](https://arxiv.org/abs/1804.03999).

![](./assets/unets.png)

### Conditioning & Classifier-Free Guidance

Taking the logarithm of Bayes's Theorem, $p(X|Y) = p(Y|X)\,p(X)/p(Y)$, and differentiating with respect to $X$ (the $\log p(Y)$ term does not depend on $X$ and vanishes), we get the conditional score function:

$$
\nabla \log p(X|Y) = \nabla \log p(Y|X) + \nabla \log p(X)
$$

This means that the **score function of our conditional model** is the sum of the _unconditional score function_ and a _conditioning term_. We usually scale the _conditioning term_ by a factor $\gamma$, which gives $\gamma \nabla \log p(Y|X)$.

We can either train a separate classifier to estimate $p(Y|X)$ (_Classifier Guidance_), or train our generative model to estimate both $p(X)$ and $p(X|Y)$ by using _conditioning dropout_: the conditioning signal is randomly dropped during training, so that a single model can be evaluated both with and without it (_Classifier-Free Guidance_, CFG). Using Bayes's Theorem, we further have that:

$$
\nabla \log p(Y|X) = \nabla \log p(X|Y) - \nabla \log p(X)
$$

Substituting this into the scaled sum $\nabla \log p(X) + \gamma \nabla \log p(Y|X)$, we obtain the guided score:

$$
\nabla \log p_\gamma(X|Y) = (1-\gamma) \nabla \log p(X) + \gamma \nabla \log p(X|Y)
$$

When $\gamma = 0$, we have the unconditional model, and when $\gamma = 1$, we have the conditional model. The interesting results of CFG happen when $\gamma > 1$, where the conditioning is amplified beyond the purely conditional model.
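
The guidance formula can be checked on a toy problem. The sketch below assumes $p(X) = \mathcal{N}(0, 1)$ and $p(X|Y) = \mathcal{N}(\mu, 1)$, for which both scores are known in closed form; the helper name `cfg_score` is hypothetical. In a real model, both scores would come from the same network, evaluated with and without the conditioning signal.

```python
import numpy as np

def cfg_score(score_uncond, score_cond, gamma):
    """Classifier-free guidance: (1 - gamma) * unconditional + gamma * conditional."""
    return (1.0 - gamma) * score_uncond + gamma * score_cond

mu = 2.0
x = np.linspace(-1.0, 5.0, 7)

score_uncond = -x         # score of p(X)   = N(0, 1)
score_cond = -(x - mu)    # score of p(X|Y) = N(mu, 1)

for gamma in (0.0, 1.0, 3.0):
    guided = cfg_score(score_uncond, score_cond, gamma)
    # In this toy case the guided score is exactly the score of N(gamma * mu, 1)
    assert np.allclose(guided, -(x - gamma * mu))
    print(f"gamma={gamma:.1f}  guided score is zero at x={gamma * mu:.1f}")
```

In this toy case the guided score corresponds to $\mathcal{N}(\gamma \mu, 1)$, so $\gamma > 1$ moves the mode past the conditional mean and further away from the unconditional one.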

### Inference-Time Optimization: DITTO

From [DITTO: Diffusion Inference-Time T-Optimization for Music Generation](https://arxiv.org/abs/2401.12179).

![](./assets/ditto.png)

## Hands On: Using Stable Audio

<div class="alert alert-info">

**IMPORTANT:** Make sure you use the Python 3.10 venv for this hands-on session.

</div>

### Using the API

In [None]:
!git clone https://github.com/Stability-AI/stable-audio-tools.git ../repositories/stable-audio-tools
!pip install ../repositories/stable-audio-tools

In [None]:
import huggingface_hub

# Replace "HF_TOKEN" with your own Hugging Face access token
huggingface_hub.login(token="HF_TOKEN")

In [None]:
import huggingface_hub
import json
import torch
from stable_audio_tools.models.factory import create_model_from_config
from stable_audio_tools.models.utils import load_ckpt_state_dict

repo_id = "stabilityai/stable-audio-open-1.0"
cache_dir = "../huggingface_hub_cache"

model_config_path = huggingface_hub.hf_hub_download(
    repo_id=repo_id,
    filename="model_config.json",
    cache_dir=cache_dir,
    force_download=False,
)

with open(model_config_path, "r") as f:
    model_config = json.load(f)

model = create_model_from_config(model_config)

model_ckpt_path = huggingface_hub.hf_hub_download(
    repo_id=repo_id,
    filename="model.safetensors",
    cache_dir=cache_dir,
    force_download=False,
)

model.load_state_dict(load_ckpt_state_dict(model_ckpt_path))

device = "cuda" if torch.cuda.is_available() else "cpu"

model.to(device).eval().requires_grad_(False)

In [None]:
from torchinfo import summary

summary(model)

In [None]:
import numpy as np
from stable_audio_tools.inference.generation import generate_diffusion_cond

sample_length = ...
sample_size = ...
conditioning = ...
seed = ...

generate_args = {
    "model": ...,
    "conditioning": ...,
    "negative_conditioning": ...,
    "steps": ...,
    "cfg_scale": ...,
    "cfg_interval": ...,
    "batch_size": ...,
    "sample_size": ...,
    "seed": ...,
    "device": ...,
    "sampler_type": ...,
    "sigma_min": ...,
    "sigma_max": ...,
    "init_audio": ...,
    "init_noise_level": ...,
    "callback": ...,
    "scale_phi": ...,
    "rho": ...
}

audio = ...

In [None]:
from IPython.display import Audio, display

display(Audio(audio.squeeze(0).cpu().numpy(), rate=model.sample_rate))

In [None]:
import librosa

y, sr = librosa.load("../session2_setup/assets/stargazing.wav", sr=model.sample_rate)

display(Audio(y, rate=sr))

input_audio = torch.from_numpy(y).to(device)

sample_length = ...
sample_size = ...
conditioning = ...
seed = ...

generate_args = {
    "model": ...,
    "conditioning": ...,
    "negative_conditioning": ...,
    "steps": ...,
    "cfg_scale": ...,
    "cfg_interval": ...,
    "batch_size": ...,
    "sample_size": ...,
    "seed": ...,
    "device": ...,
    "sampler_type": ...,
    "sigma_min": ...,
    "sigma_max": ...,
    "init_audio": ...,
    "init_noise_level": ...,
    "callback": ...,
    "scale_phi": ...,
    "rho": ...
}

cond_audio = ...

In [None]:
display(Audio(cond_audio.squeeze(0).cpu().numpy(), rate=model.sample_rate))

### Deeper Dive

In [None]:
# We will be using the downsampled latent size
latent_size = ...

# Set the seed for noise
seed = ...
torch.manual_seed(seed)

# Define the latent noise
noise = ...
print(noise.shape)

In [None]:
# Some config

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
torch.backends.cudnn.benchmark = False

In [None]:
# Inspecting the conditioner

print(model.conditioner)

In [None]:
# We create the conditioning tensors with the model's conditioner
conditioning_tensors = ...

print(conditioning_tensors["prompt"])
print("Shape of prompt embedding: ", conditioning_tensors["prompt"][0].shape)
print("Shape of seconds_start embedding: ", conditioning_tensors["seconds_start"][0].shape)
print("Shape of seconds_total embedding: ", conditioning_tensors["seconds_total"][0].shape)
# conditioning_tensors["prompt"]

In [None]:
# We build the cross attention conditioning inputs and the global conditioning
# inputs
conditioning_inputs = ...

print(conditioning_inputs)
print("Shape of the cross attention conditioning inputs: ", conditioning_inputs["cross_attn_cond"].shape)
print("Shape of the global conditioning inputs: ", conditioning_inputs["global_cond"].shape)

In [None]:
# What features are used for the conditioning?
print("Features used for cross attention:", model.cross_attn_cond_ids)
print("Features used for global conditioning:", model.global_cond_ids)

In [None]:
from stable_audio_tools.inference.utils import prepare_audio

io_channels = model.pretransform.io_channels

init_audio = ...

encoded_audio = model.pretransform.encode(init_audio)
print("Encoded audio shape: ", encoded_audio.shape)

In [None]:
import k_diffusion as K

denoiser = ...

In [None]:
import matplotlib.pyplot as plt

# Let's use the noise schedule and plot the sigmas

sigmas = ...

# Scale the initial noise by the first sigma
final_noise = ...

# Add the encoded audio to the noise
final_noise = ...

plt.figure(figsize=(10, 5))

...

plt.show()

In [None]:
extra_args = ...

# Use dpmpp-3m-sde to sample the noise
sampled = ...

In [None]:
# Decode the sampled audio
decoded_audio = ...

display(Audio(decoded_audio.squeeze(0).cpu().numpy(), rate=model.sample_rate))

In [None]:
# Let's play around with things!
# First, let's look at different noise levels (0, 50, 100, 150, 200, 250)

decoded_audio = ...

In [None]:
# Let's experiment with a linear noise schedule
linear_sigmas = ...

sampled = ...
decoded_audio = ...

display(Audio(decoded_audio.squeeze(0).cpu().numpy(), rate=model.sample_rate))

In [None]:
# How about starting with the input audio directly?

sampled = ...
decoded_audio = ...

display(Audio(decoded_audio.squeeze(0).cpu().numpy(), rate=model.sample_rate))

In [None]:
# Let's work with fewer steps

sigmas_20 = ...

sampled = ...
decoded_audio = ...

display(Audio(decoded_audio.squeeze(0).cpu().numpy(), rate=model.sample_rate))