# (Latent) Diffusion Models (to be continued soon)

As with the Variational Autoencoder (VAE), let's first look at the theory behind the model.

**Goal**: We want to sample from the true underlying distribution of our data $p^*(x)$. 

**Problem**: We don't know $p^*(x)$

**Solution**: Similar to the Variational Autoencoder, we aim to approximate the distribution by learning a model distribution $p_\theta(x)$

However, unlike the VAE, the Diffusion Model (DM) uses a chain of latent variables ($z_1, ..., z_N$). To get the model likelihood $p_\theta(x_0)$, we need to marginalize out all latent variables: $p_\theta(x_0)=\int p_\theta(x_0, z_{1:N})dz_1 ... dz_N$

This is intractable to compute. 

But let's first define what the joint distribution $p_\theta(x_0, z_{1:N})$ actually is. For an image, $x_0$, we may write $p_\theta(x_0, z_{1:N})=p(z_N)\left(\prod_{n>1}p_\theta(z_{n-1}|z_n)\right)p_\theta(x_0|z_1)$

Additionally, we define the individual distributions as:

- $p_\theta(z_{n-1}|z_n)=\mathcal{N}(\mu_\theta(z_n,n),\Sigma_\theta(z_n,n))$ --> $z_n$ is the previous latent state; n is where we are in the process
- $p_\theta(x_0|z_1)=\mathcal{N}(\mu_\theta(z_1,1),\Sigma_\theta(z_1,1))$
- $p(z_N)=\mathcal{N}(0,\mathcal{I})$ --> our start (complete noise)

Note: Usually, $\mu_\theta$ is learned with a neural network, while $\Sigma_\theta$ is fixed instead of learned.

<br />

### How can we learn the neural network?
As for the VAE, we use Variational Inference!

To learn our parameters, $\theta$, we use the ELBO method again. Namely, we try to approximate the distribution $p(z_{1:N}|x_0)$ with a distribution $q(z_{1:N})$ just to learn $\theta$.

<br />

### Choosing q (Forward Process)
We choose q to be the following factorization $q_{\phi(x_0)}(z_1,...,z_N)=q_{\phi(x_0)}(z_1)\prod_{n>1}q_{\phi(x_0)}(z_n|z_{n-1})$

Every image has its own q distribution. For the VAE, we used amortized inference to learn a network that predicts the optimal parameters of q for each image. For the DM, however, we do not learn the parameters of q. Instead, we fix them as follows:
- $q_{\phi(x_0)}(z_1)=\mathcal{N}(\sqrt{1-\beta_1}x_0,\beta_1\mathcal{I})$
- $q_{\phi(x_0)}(z_n|z_{n-1})=\mathcal{N}(\sqrt{1-\beta_n}z_{n-1},\beta_n\mathcal{I})$
- $0<\beta_1<...<\beta_N<1$ --> scaling the noise such that each diffusion step (n) makes the image noisier

--> This (Markov) process is called the **forward process** (i.e. the noising process)
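The forward process can be simulated one step at a time. Below is a minimal sketch in plain Python. The linear schedule from `1e-4` to `0.02` over `N=1000` steps is an assumption (the text only requires increasing $\beta_n$ in $(0,1)$), and the helper names `linear_beta_schedule` and `forward_step` are made up for illustration.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return beta_1..beta_N, increasing linearly (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def forward_step(z_prev, beta, rng):
    """Sample z_n ~ N(sqrt(1 - beta) * z_{n-1}, beta * I) for a flat list of pixels."""
    scale = math.sqrt(1.0 - beta)
    sigma = math.sqrt(beta)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in z_prev]

rng = random.Random(0)
betas = linear_beta_schedule(1000)
x0 = [1.0, -1.0, 0.5, 0.0]  # toy "image" with four pixels

z = x0
for beta in betas:  # produces z_1, z_2, ..., z_N in turn
    z = forward_step(z, beta, rng)
```

After all $N$ steps, `z` should be close to a sample of pure Gaussian noise, since the signal has been scaled down at every step.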

Remember: q is only a helper to learn $\theta$ and thereby approximate $\log p_\theta(x_0)$. 

We make it easy for ourselves by fixing its parameters and by letting q be a Gaussian distribution because a) it simplifies later calculations and b) it makes sense to assume that $q_{\phi(x_0)}(z_n|z_{n-1})$ is a Gaussian since $p_\theta(z_{n-1}|z_n)$ is a Gaussian.

Note: by formulating q in this way, we also impose that $p(z_{1:N}|x_0)$ approximates the noising trajectory and follows a Gaussian distribution

<br />

#### Reparameterization of q
With our parameterization of q, it is possible to sample $z_n$ (or any other $z_{n-i}$) directly from $x_0$, making training a lot faster:

$q_{\phi(x_0)}(z_n)=\mathcal{N}(\sqrt{\bar a_n}x_0,(1-\bar a_n)\mathcal{I})$ where $\bar a_n=\prod_{i=1}^{n}a_i$ and $a_i=1-\beta_i$

Using the reparameterization trick (see VAE), we can get any $z_n$ directly from $x_0$: $z_n=\sqrt{\bar a_n}x_0+\sqrt{(1-\bar a_n)}\epsilon$ where $\epsilon\sim\mathcal{N}(0,\mathcal{I})$
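The closed-form sampling of $z_n$ can be sketched in a few lines. This is a plain-Python illustration under an assumed linear $\beta$ schedule. The names `alpha_bars` and `sample_z_n` are hypothetical helpers, and `n` is 1-indexed as in the text.

```python
import math
import random

def alpha_bars(betas):
    """Cumulative products abar_n = prod_{i<=n} (1 - beta_i)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def sample_z_n(x0, n, abar, rng):
    """Draw z_n directly from x_0 (n is 1-indexed). Returns (z_n, eps)."""
    a = abar[n - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    z_n = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return z_n, eps

# Assumed linear schedule with N = 1000 steps.
betas = [1e-4 + i * (0.02 - 1e-4) / 999 for i in range(1000)]
abar = alpha_bars(betas)

rng = random.Random(0)
z_500, eps = sample_z_n([1.0, -1.0, 0.5, 0.0], 500, abar, rng)
```

Returning `eps` alongside `z_n` is a convenience: the same noise is the regression target later in training.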

<br />

### Learning $\theta$ using the ELBO (Backward Process)
The latent variable model $p_\theta(x_0)=\int p_\theta(x_0, z_{1:N})dz_1 ... dz_N$ is intractable and therefore, maximizing the likelihood to obtain $\theta$, i.e. $\underset{\theta}{\mathrm{max}}\:\sum_{x_0} \log p_\theta(x_0)$ is intractable.

But instead of maximizing the actual likelihood, we can maximize the ELBO.

In the VAE notebook we derived the ELBO to be: $\mathcal{L} = \mathbb{E}_{z \sim q(z)}[\log\left (\frac{p(x,z)}{q(z)}\right)]$ (Check the VAE notebook for the derivation)

For our case here, this is equal to (plug in the formulations for the distributions):

$\log p_\theta(x_0)\geq\mathbb{E}_{q(z_{1:N})}[\log p_\theta(x_0,z_{1:N})-\log q_{\phi(x_0)}(z_{1:N})]=\mathbb{E}_{q(z_{1:N})}[\log p(z_N)+\sum_{n>1}\log p_\theta(z_{n-1}|z_n)+\log p_\theta(x_0|z_1)-\sum_{n>1}\log q_{\phi(x_0)}(z_n|z_{n-1})-\log q_{\phi(x_0)}(z_1)]$

Now, let's match the parts for the KL Divergences

$=\mathbb{E}_{q(z_{1:N})}[\log p(z_N)+\sum_{n>1}\log \frac {p_\theta(z_{n-1}|z_n)}{q_{\phi(x_0)}(z_n|z_{n-1})}+\log \frac {p_\theta(x_0|z_1)}{q_{\phi(x_0)}(z_1)}]$

Note that: $q_{\phi(x_0)}(z_n|z_{n-1}) = \frac {q_{\phi(x_0)}(z_{n-1}|z_n)q_{\phi(x_0)}(z_n)} {q_{\phi(x_0)}(z_{n-1})}$. Therefore we can rewrite to:

$=\mathbb{E}_{q(z_{1:N})}[\log p(z_N)+\sum_{n>1}\log \left(\frac {p_\theta(z_{n-1}|z_n)}{q_{\phi(x_0)}(z_{n-1}|z_n)} \frac {q_{\phi(x_0)}(z_{n-1})} {q_{\phi(x_0)}(z_n)}\right) +\log \frac {p_\theta(x_0|z_1)}{q_{\phi(x_0)}(z_1)}]$

$=\mathbb{E}_{q(z_{1:N})}[\log p(z_N)+\sum_{n>1}\log \frac {p_\theta(z_{n-1}|z_n)}{q_{\phi(x_0)}(z_{n-1}|z_n)} +\sum_{n>1}\log \frac {q_{\phi(x_0)}(z_{n-1})} {q_{\phi(x_0)}(z_n)} +\log \frac {p_\theta(x_0|z_1)}{q_{\phi(x_0)}(z_1)}]$

Note that the second sum telescopes: $\sum_{n>1}\log \frac {q_{\phi(x_0)}(z_{n-1})} {q_{\phi(x_0)}(z_n)} = \log \frac {q_{\phi(x_0)}(z_{1})} {q_{\phi(x_0)}(z_2)} + \log \frac {q_{\phi(x_0)}(z_{2})} {q_{\phi(x_0)}(z_3)} + ... + \log \frac {q_{\phi(x_0)}(z_{N-1})} {q_{\phi(x_0)}(z_N)} = \log \frac {q_{\phi(x_0)}(z_{1})*\sout{q_{\phi(x_0)}(z_{2})}*...*\sout{q_{\phi(x_0)}(z_{N-1})}} {\sout{q_{\phi(x_0)}(z_2)}*\sout{q_{\phi(x_0)}(z_3)}*...*q_{\phi(x_0)}(z_N)} = \log q_{\phi(x_0)}(z_1) - \log q_{\phi(x_0)}(z_N)$

Thus, we get this expression in which we can cancel further terms:

$=\mathbb{E}_{q(z_{1:N})}[\log p(z_N)+\sum_{n>1}\log \frac {p_\theta(z_{n-1}|z_n)}{q_{\phi(x_0)}(z_{n-1}|z_n)} + \sout{\log q_{\phi(x_0)}(z_1)} - \log q_{\phi(x_0)}(z_N) + \log p_\theta(x_0|z_1) - \sout{\log q_{\phi(x_0)}(z_1)}]$

$=\mathbb{E}_{q(z_{1:N})}[\log \frac {p(z_N)}{q_{\phi(x_0)}(z_N)}+\sum_{n>1}\log \frac {p_\theta(z_{n-1}|z_n)}{q_{\phi(x_0)}(z_{n-1}|z_n)} + \log p_\theta(x_0|z_1)]$

Lastly, we distribute the expected value and flip the first two fractions by pulling out a minus


$=-\mathbb{E}_{q(z_{1:N})}[\log \frac {q_{\phi(x_0)}(z_N)}{p(z_N)}] - \sum_{n>1} \mathbb{E}_{q(z_{1:N})}[\log \frac {q_{\phi(x_0)}(z_{n-1}|z_n)}{p_\theta(z_{n-1}|z_n)}] + \mathbb{E}_{q(z_{1:N})}[\log p_\theta(x_0|z_1)]$

$=-\text{KL}[q_{\phi(x_0)}(z_N) || p(z_N)] - \sum_{n>1}\text{KL}[q_{\phi(x_0)}(z_{n-1}|z_n) || p_\theta(z_{n-1}|z_n)] + \mathbb{E}_{q(z_{1:N})}[\log p_\theta(x_0|z_1)]$

Note that the first term does not depend on $\theta$: $p(z_N)$ is a fixed standard Gaussian and $q$ has no learnable parameters. Since we optimize for $\theta$, we can get rid of the first term, which finally gets us to this expression:

$\mathbf{=\mathbb{E}_{z_1 \sim q_{\phi(x_0)}(z_1)} [\log p_\theta(x_0|z_1)] - \sum_{n>1}\text{KL}[q_{\phi(x_0)}(z_{n-1}|z_n) || p_\theta(z_{n-1}|z_n)]}$

where $q_{\phi(x_0)}(z_{n-1}|z_n)=\mathcal{N}(\widetilde{\mu}(x_0,z_n,n),\tilde{\beta}_n \mathcal{I})$ with

$\widetilde{\mu}(x_0,z_n,n)=\frac {\sqrt{a_n}(1-\bar a_{n-1})}{1-\bar a_n}z_n + \frac {\sqrt{\bar a_{n-1}}\beta_n}{1-\bar a_n}x_0$ and $\tilde{\beta}_n = \frac {1-\bar a_{n-1}}{1-\bar a_n}\beta_n$
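As a sanity check, the posterior parameters can be computed numerically. The sketch below uses the standard form of the posterior mean, $\widetilde{\mu}=\frac{\sqrt{\bar a_{n-1}}\beta_n}{1-\bar a_n}x_0+\frac{\sqrt{a_n}(1-\bar a_{n-1})}{1-\bar a_n}z_n$, under an assumed linear $\beta$ schedule. The function name `posterior_params` is hypothetical.

```python
import math

def posterior_params(n, betas, abar):
    """Coefficients of mu_tilde and the variance beta_tilde for step n (1-indexed, n > 1)."""
    beta_n = betas[n - 1]
    a_n = 1.0 - beta_n
    abar_n, abar_prev = abar[n - 1], abar[n - 2]
    coef_x0 = math.sqrt(abar_prev) * beta_n / (1.0 - abar_n)        # multiplies x_0
    coef_zn = math.sqrt(a_n) * (1.0 - abar_prev) / (1.0 - abar_n)   # multiplies z_n
    beta_tilde = (1.0 - abar_prev) / (1.0 - abar_n) * beta_n
    return coef_x0, coef_zn, beta_tilde

# Assumed linear schedule with N = 1000 steps.
betas = [1e-4 + i * (0.02 - 1e-4) / 999 for i in range(1000)]
abar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    abar.append(running)

coef_x0, coef_zn, beta_tilde = posterior_params(500, betas, abar)
```

One useful consistency check: averaging $\widetilde{\mu}$ over $z_n$ must give back the mean of $q_{\phi(x_0)}(z_{n-1})$, so `coef_x0 + coef_zn * sqrt(abar_n)` should equal `sqrt(abar_{n-1})`.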

Since all distributions are Gaussians, we can calculate the loss in closed-form.

So, we optimize two things with this equation:
1. With $\mathbb{E}_{z_1 \sim q_{\phi(x_0)}(z_1)} [\log p_\theta(x_0|z_1)]$, we optimize the reconstruction of the image $x_0$ from the latent variable $z_1$.
2. With $\sum_{n>1}\text{KL}[q_{\phi(x_0)}(z_{n-1}|z_n) || p_\theta(z_{n-1}|z_n)]$ we aim to make the transitions in p similar to those in q.

<br />

### Model Parameterization
We defined $p_\theta(z_{n-1}|z_n)=\mathcal{N}(\mu_\theta(z_n,n),\Sigma_\theta(z_n,n))$. But how should we choose $\mu_\theta$ and $\Sigma_\theta$?

As already mentioned, we fix $\Sigma_\theta$. For simplicity, we let $\Sigma_\theta:=\tilde \beta_n \mathcal{I}$

Additionally, to minimize $\sum_{n>1}\text{KL}[q_{\phi(x_0)}(z_{n-1}|z_n) || p_\theta(z_{n-1}|z_n)]$, it makes sense to define $\mu_\theta(z_n,n) := \widetilde{\mu}(x_0,z_n,n)$
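To see why matching the means is the natural choice, it helps to write out a single KL term. Using the standard closed form for two Gaussians that share the covariance $\tilde \beta_n \mathcal{I}$ (a textbook identity, not derived in this notebook), we get:

$\text{KL}[\mathcal{N}(\widetilde{\mu},\tilde \beta_n \mathcal{I}) || \mathcal{N}(\mu_\theta,\tilde \beta_n \mathcal{I})]=\frac{1}{2\tilde \beta_n}||\widetilde{\mu}(x_0,z_n,n)-\mu_\theta(z_n,n)||^2$

So each KL term is just a scaled squared distance between the two means, which is smallest when $\mu_\theta$ is as close as possible to $\widetilde{\mu}$.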

Notably, $\widetilde{\mu}$ requires $x_0$, which we don't have in the reverse process. However, we can estimate it:

By reparameterization: $z_n=\sqrt{\bar a_n}x_0 + \sqrt{(1-\bar a_n)}\epsilon$

Rewriting this, we can estimate $x_0$ given $z_n$ by predicting the noise ($\epsilon$):

$x_0 \approx f_\theta(z_n,n)=\frac{z_n - \sqrt{(1-\bar a_n)}\epsilon_\theta(z_n,n)}{\sqrt{\bar a_n}}$

Note what this means: we only need to predict the noise. From the predicted noise we get an estimate of $x_0$, and from that estimate we get $\widetilde{\mu}$!

At each step of the reverse process, we take the current sample $z_n$, predict the noise, compute the estimates of $x_0$ and $\widetilde{\mu}$, sample $z_{n-1}$ from the resulting Gaussian, and then repeat.
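A single reverse step can be sketched as follows. This is a plain-Python illustration under an assumed linear $\beta$ schedule. `reverse_step` is a hypothetical helper, and instead of a trained network we pass the true noise as the "prediction", purely to show that the $x_0$ estimate is then exact.

```python
import math
import random

def reverse_step(z_n, n, eps_pred, betas, abar, rng):
    """One reverse step z_n -> z_{n-1} for n > 1 (1-indexed), given predicted noise."""
    beta_n, abar_n, abar_prev = betas[n - 1], abar[n - 1], abar[n - 2]
    # 1) Estimate x_0 from z_n and the predicted noise.
    x0_hat = [(z - math.sqrt(1.0 - abar_n) * e) / math.sqrt(abar_n)
              for z, e in zip(z_n, eps_pred)]
    # 2) Plug the estimate into the posterior mean mu_tilde.
    coef_x0 = math.sqrt(abar_prev) * beta_n / (1.0 - abar_n)
    coef_zn = math.sqrt(1.0 - beta_n) * (1.0 - abar_prev) / (1.0 - abar_n)
    mean = [coef_x0 * x + coef_zn * z for x, z in zip(x0_hat, z_n)]
    # 3) Sample z_{n-1} ~ N(mean, beta_tilde * I).
    beta_tilde = (1.0 - abar_prev) / (1.0 - abar_n) * beta_n
    z_prev = [m + math.sqrt(beta_tilde) * rng.gauss(0.0, 1.0) for m in mean]
    return z_prev, x0_hat

# Assumed linear schedule with N = 1000 steps.
betas = [1e-4 + i * (0.02 - 1e-4) / 999 for i in range(1000)]
abar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    abar.append(running)

rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]
n = 500
eps = [rng.gauss(0.0, 1.0) for _ in x0]
z_n = [math.sqrt(abar[n - 1]) * x + math.sqrt(1.0 - abar[n - 1]) * e
       for x, e in zip(x0, eps)]

# Stand-in for the network: the true noise is used as the prediction.
z_prev, x0_hat = reverse_step(z_n, n, eps, betas, abar, rng)
```

In a real sampler, `eps_pred` would come from $\epsilon_\theta(z_n,n)$, and the step would be repeated from $n=N$ downwards.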

<br />

### The Final Loss
We have seen that in practice, we only need to predict the noise. For this reason, it is common to use a simpler loss:

$\mathbf{\mathcal{L_{DM}}=\mathbb{E}_{n,x_0,\epsilon}[||\epsilon-\epsilon_\theta(z_n,n)||^2] = \mathbb{E}_{n,x_0,\epsilon}[||\epsilon-\epsilon_\theta(\sqrt{\bar a_n}x_0 + \sqrt{(1-\bar a_n)}\epsilon,n)||^2]}$


### The Training Loop
1. Take a sample from the dataset
2. Generate a random integer between 1 and N from a uniform distribution (this will be the "timestep" that defines the amount of noise we add to the image): $n \sim \text{Uniform}(\{1,...,N\})$
3. Sample some noise: $\epsilon \sim \mathcal{N}(0,\mathcal{I})$
4. Take gradient step on $\nabla_\theta ||\epsilon-\epsilon_\theta(\sqrt{\bar a_n}x_0 + \sqrt{(1-\bar a_n)}\epsilon,n)||^2$
5. Repeat until convergence
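The loop above can be illustrated with a deliberately tiny toy setup. Everything here is an assumption made for illustration: the "dataset" is one-dimensional with values $\pm 1$, the "network" is a single weight `w` with $\epsilon_\theta(z_n,n)=w \cdot z_n$ (it ignores $n$), the schedule is linear, and the gradient is written out by hand instead of using autograd.

```python
import math
import random

def train_toy_denoiser(num_iters=4000, lr=0.005, num_steps=1000, seed=0):
    """Run the noise-prediction training loop on a 1-D toy problem and return w."""
    rng = random.Random(seed)
    betas = [1e-4 + i * (0.02 - 1e-4) / (num_steps - 1) for i in range(num_steps)]
    abar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        abar.append(running)

    w = 0.0  # single trainable parameter of the toy model eps_theta(z, n) = w * z
    for _ in range(num_iters):
        x0 = rng.choice([-1.0, 1.0])       # 1. take a sample from the (toy) dataset
        n = rng.randint(1, num_steps)      # 2. n ~ Uniform({1, ..., N}), both ends inclusive
        eps = rng.gauss(0.0, 1.0)          # 3. eps ~ N(0, 1)
        z_n = math.sqrt(abar[n - 1]) * x0 + math.sqrt(1.0 - abar[n - 1]) * eps
        residual = eps - w * z_n
        grad = -2.0 * residual * z_n       # d/dw of (eps - w * z_n)^2
        w -= lr * grad                     # 4. gradient step
    return w                               # 5. here: stop after a fixed budget

w = train_toy_denoiser()
```

With a real model, steps 1 to 4 stay the same. Only the model, the optimizer, and the batching change.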

<br />
<br />


### Latent Diffusion Models (LDM)
Operating in pixel space comes with high memory demands for DMs. Rombach et al. (2022) have shown that the computational cost of DMs can be greatly reduced by operating in a latent space instead. Furthermore, operating in the latent space requires less spatial downsampling, which improves the synthesis quality.

Therefore, the authors suggested feeding the input image through a Vector Quantized-Variational Autoencoder (VQ-VAE) to obtain a lower-dimensional representation, called the latent (not to be confused with the latent variables of the DM!). Namely, the encoder, $\mathcal{E}$, of the VQ-VAE is used to reduce the high-dimensional image, $x \in \mathbb{R}^{H \times W \times 3}$, to its latent representation, $l = \mathcal{E}(x)$. Subsequently, $l$ is passed to the modified DM, making it an LDM. The decoder, $\mathcal{D}$, of the VQ-VAE is then used to decode the denoised latent from the LDM back to the image space.

Our objective can then be slightly reformulated for the LDM: $\mathcal{L_{LDM}}=\mathbb{E}_{n,\mathcal{E}(x_0),\epsilon}[||\epsilon-\epsilon_\theta(z_n,n)||^2]$, where $z_n$ is now the noised version of the latent $\mathcal{E}(x_0)$ rather than of the image itself.

### Why a VQ-VAE instead of a normal Autoencoder?
VAEs learn a (multivariate) latent distribution where ideally images of similar semantic content are close together. 

```python
import torch
from torch import nn
from torch.nn import functional as F
```

### Conditioning the LDM
In order to condition the LDM, we need two main ingredients:
1. A way to associate a text prompt with certain images (e.g. associating the text "An image of a dog" and the image of a dog) --> This is accomplished by Contrastive Learning (CLIP) which helps us to learn "joint embeddings" that can be passed to the LDM.
2. A way to integrate the embedding into the model to condition it --> For this we need the "Attention" mechanism.

<br />

#### CLIP


<br />

#### Attention


### Classifier-Free Guidance
Classifier-Free Guidance is essentially a conditional diffusion model trained with a dropout probability on the conditioning. For a certain fraction of the training samples (e.g. 20%), we randomly drop the conditioning information (or replace it with a placeholder that represents the dropout case). This means our model learns to work with and without "guidance". An additional benefit is that, during generation, we can use a guidance scale that controls how strongly the model follows the conditioning, by combining the weighted conditioned and unconditioned outputs.
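Both ingredients, dropping the conditioning during training and mixing the two predictions during generation, can be sketched in a few lines of plain Python. The function names are hypothetical, and the combination rule `eps_uncond + scale * (eps_cond - eps_uncond)` is one common convention (other parameterizations of the scale exist).

```python
import random

def maybe_drop_condition(cond, null_cond, drop_prob, rng):
    """Training: replace the conditioning with a 'null' one with probability drop_prob."""
    return null_cond if rng.random() < drop_prob else cond

def guided_noise(eps_cond, eps_uncond, guidance_scale):
    """Sampling: eps_uncond + scale * (eps_cond - eps_uncond), element-wise."""
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

rng = random.Random(0)
cond = maybe_drop_condition("An image of a dog", "", drop_prob=0.2, rng=rng)

# Toy noise predictions standing in for two forward passes of the model.
eps_cond = [1.0, 2.0]
eps_uncond = [0.0, 1.0]
eps_guided = guided_noise(eps_cond, eps_uncond, guidance_scale=2.0)
```

In this convention, a scale of 0 ignores the conditioning, a scale of 1 recovers the plain conditional prediction, and larger values push the sample further towards the conditioning.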