# Score-based models

* Question: Can we create samples that are as nice as GAN samples without unstable adversarial training?

As in a GAN, the basic idea is to turn samples from $p_Z=\mathcal{N}(0,\mathbb{1})$ into samples from some target distribution $p_X$. Unlike in a GAN, the idea is to let the image emerge slowly from the noise.


```{figure} images/duoduo.jpg
---
height: 100px
---
We want a model that can go iteratively from right to left. [Source.](https://yang-song.net/blog/2021/score/)
```

## Score-based models ([source](https://yang-song.net/blog/2021/score/), [source](https://arxiv.org/abs/2011.13456))

For score-based models I highly recommend [this blog post](https://yang-song.net/blog/2021/score/) by Yang Song. This section is heavily based on it.

The basic idea is to reverse, step by step, the process of adding more and more Gaussian noise to a sample.

### First of two ingredients: Langevin dynamics

**Langevin dynamics** is given by the iteration

$$x_{i+1} = x_i + \epsilon \nabla \log p(x_i) + \sqrt{2\epsilon}\, z_i, \quad i=0,\dots,K,$$

where $p$ is some "sufficiently nice" probability density, $z_i\sim\mathcal{N}(0,\mathbb{1})$, and $x_0$ is arbitrary. For $\epsilon\to0$ and $K\to\infty$ the final iterate becomes a sample from $p$.

* In practice we can choose $\epsilon$ small and $K$ large.

```{figure} images/langevin.gif
---
height: 200px
---
Using Langevin dynamics to sample from a mixture of two Gaussians. [Source.](https://yang-song.net/blog/2021/score/)
```
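
The experiment in the figure can be sketched in a few lines of NumPy. This is a minimal illustration and not the code behind the animation: the 1D mixture (means, standard deviation, weights), the step size `eps`, and the number of steps are all assumptions chosen for the example, and the score is computed analytically instead of being learned.

```python
import numpy as np

def mixture_score(x, means=(-2.0, 2.0), std=0.5, weights=(0.5, 0.5)):
    """Score d/dx log p(x) of a 1D Gaussian mixture with a shared std."""
    x = np.asarray(x, dtype=float)[..., None]
    means = np.asarray(means)
    weights = np.asarray(weights)
    # Unnormalised log-density of each weighted component.
    log_comp = np.log(weights) - 0.5 * ((x - means) / std) ** 2
    # Responsibility of each component for x, computed in a stable way.
    resp = np.exp(log_comp - log_comp.max(axis=-1, keepdims=True))
    resp /= resp.sum(axis=-1, keepdims=True)
    # The mixture score is the responsibility-weighted component score.
    return (resp * (means - x) / std**2).sum(axis=-1)

def langevin(score, x0, eps=1e-2, num_steps=2000, rng=None):
    """Iterate x <- x + eps * score(x) + sqrt(2 * eps) * z."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(num_steps):
        z = rng.standard_normal(x.shape)
        x = x + eps * score(x) + np.sqrt(2 * eps) * z
    return x

rng = np.random.default_rng(0)
x0 = rng.uniform(-4, 4, size=5000)   # arbitrary starting points
samples = langevin(mixture_score, x0, rng=rng)
```

Each entry of `x0` is an independent chain. Because $\epsilon$ is finite, the resulting histogram only approximates the mixture density.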

### Second of two ingredients: Tweedie's formula

**Tweedie's formula** tells us that

$$\sigma^2 \nabla \log p_{\sigma}(x^\delta) = \mathbb{E}(x|x^\delta, \sigma) - x^\delta,$$

where $\eta\sim\mathcal{N}(0,\mathbb{1})$ and $p_{\sigma}$ is the density of the noisy samples $x^\delta = x+\sigma\eta$ with $x\sim p_X$, for some given $\sigma>0$.

We can interpret $\mathbb{E}(x|x+\sigma\eta, \sigma)$ as the "perfect denoiser" for the noise level $\sigma$.
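
As a quick sanity check of Tweedie's formula, consider scalar Gaussian data $x\sim\mathcal{N}(0,\tau^2)$. This toy case and the symbol $\tau$ are assumptions made only for this example. Then $x^\delta\sim\mathcal{N}(0,\tau^2+\sigma^2)$, so

$$\nabla \log p_{\sigma}(x^\delta) = -\frac{x^\delta}{\tau^2+\sigma^2}, \qquad \mathbb{E}(x|x^\delta,\sigma) = \frac{\tau^2}{\tau^2+\sigma^2}\,x^\delta,$$

and both sides of the formula agree:

$$\sigma^2 \nabla \log p_{\sigma}(x^\delta) = -\frac{\sigma^2}{\tau^2+\sigma^2}\,x^\delta = \frac{\tau^2}{\tau^2+\sigma^2}\,x^\delta - x^\delta = \mathbb{E}(x|x^\delta,\sigma) - x^\delta.$$

The denoiser shrinks $x^\delta$ towards the data mean, and the more so the larger the noise level $\sigma$ is compared to $\tau$.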

Since $\eta = (x^\delta - x)/\sigma$, we have $\mathbb{E}(\eta|x^\delta, \sigma) = \big(x^\delta - \mathbb{E}(x|x^\delta, \sigma)\big)/\sigma$, and Tweedie's formula gives us

$$-\sigma \nabla \log p_{\sigma}(x^\delta) = \mathbb{E}(\eta|x^\delta, \sigma).$$

**Here is what that means in practice.** Given a dataset $\mathcal{D}=\{x_i\}_i$, we can train a network

$$s_\theta:\mathbb{R}^n\times\mathbb{R}_{\ge0}\to\mathbb{R}^n$$

via the loss

$$L_\sigma(\theta) = \frac{1}{|\mathcal{D}|}\sum_{x\in \mathcal{D}} \mathbb{E}_{\eta\sim \mathcal{N}(0,\mathbb{1})} \|s_\theta(x + \sigma \eta, \sigma) - \eta\|^2$$

to ideally give us (the minimizer of a squared error is the conditional expectation)

$$s_\theta(x^\delta,\sigma) = \mathbb{E}\left[\eta|x^\delta,\sigma\right] = -\sigma \nabla \log p_{\sigma}(x^\delta), \quad x^\delta = x+\sigma\eta.$$
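
To see the loss at work without a neural network, the sketch below fits the simplest possible "model", a linear map $s(x^\delta,\sigma) = a\,x^\delta$ with a single parameter $a$, on scalar Gaussian toy data. The data distribution, the values of `tau` and `sigma`, and the linear model are assumptions made for illustration. For this toy data the conditional mean of the noise is known in closed form, $\mathbb{E}[\eta|x^\delta,\sigma] = \sigma x^\delta/(\tau^2+\sigma^2)$, so the fitted parameter can be compared against it.

```python
import numpy as np

rng = np.random.default_rng(0)
tau, sigma = 1.5, 0.8                       # hypothetical data std and noise level
x = tau * rng.standard_normal(200_000)      # toy dataset D
eta = rng.standard_normal(x.shape)          # one noise draw per data point
x_noisy = x + sigma * eta                   # x^delta = x + sigma * eta

def loss(a):
    """Monte Carlo estimate of L_sigma for the linear model s(x, sigma) = a * x."""
    return np.mean((a * x_noisy - eta) ** 2)

# Closed-form least-squares minimiser of the loss over the parameter a.
a_fit = np.dot(x_noisy, eta) / np.dot(x_noisy, x_noisy)
# Slope of the exact conditional mean E[eta | x_noisy] for Gaussian data.
a_opt = sigma / (tau**2 + sigma**2)
print(a_fit, a_opt, loss(a_fit), loss(0.0))
```

Minimizing the loss recovers the conditional mean of the noise up to sampling error, and the fitted model has a lower loss than the trivial predictor $s\equiv0$. A real network is trained with stochastic gradient descent instead of a closed-form solve.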

### Putting the ingredients together

We will train the model $s_\theta$ with the "Tweedie" loss from above so that it works for different noise levels, e.g., via the loss

$$L(\theta) = \sum_{i=1}^N \sigma_i^2 L_{\sigma_i}(\theta)$$

where $\sigma_i < \sigma_{i+1}$ and the $\sigma_i$ usually follow some geometric progression.
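
A geometric progression of noise levels and the corresponding loss weights can be set up as follows. The endpoints `sigma_min`, `sigma_max` and the number of levels `N` are hypothetical values, and `combined_loss` is an illustrative helper that expects the per-level losses $L_{\sigma_i}(\theta)$ to be computed elsewhere.

```python
import numpy as np

sigma_min, sigma_max, N = 0.01, 1.0, 10            # hypothetical values
sigmas = np.geomspace(sigma_min, sigma_max, N)     # sigma_1 < ... < sigma_N
ratios = sigmas[1:] / sigmas[:-1]                  # constant for a geometric progression

def combined_loss(per_level_losses):
    """L(theta) = sum_i sigma_i^2 * L_{sigma_i}(theta)."""
    return float(np.sum(sigmas**2 * np.asarray(per_level_losses)))
```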

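With that model we can now run a slightly modified Langevin dynamics. Writing $\sigma = \sigma_{N-i}$ for the noise level used in step $i$,

$$x_{i+1} = x_i + \epsilon_i \nabla\log p_{\sigma}(x_i) + \sqrt{2\epsilon_i}\, z_i$$

$$\approx x_i - \frac{\epsilon_i}{\sigma} \, s_\theta(x_i, \sigma) + \sqrt{2\epsilon_i}\, z_i$$

for $i=0,\dots,N-1$, i.e., $N$ iterations (usually on the order of $1000$). The second line uses $s_\theta(x,\sigma)\approx\mathbb{E}(\eta|x^\delta=x,\sigma) = -\sigma\nabla\log p_\sigma(x)$, which follows from Tweedie's formula and $\eta=(x^\delta-x)/\sigma$. Here one picks $\epsilon_i = \epsilon\, \sigma_{N-i}^2 / \sigma_1^2$ for some $\epsilon>0$, so the step size shrinks together with the noise level. One often also holds the noise level and the step size constant over a number of iterations before decreasing the index, i.e., for some $k$ one runs $Nk$ iterations and uses $\sigma_j$ with $j = \lceil \frac{Nk-i}{k} \rceil$ in iteration $i$.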

We can use this iteration to approximate samples from $p_X$.
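
The following is a minimal sketch of such a sampler on 1D toy data. Several things are assumptions made for illustration: the data distribution (two narrow modes at $\pm2$), the noise schedule, the values of `eps` and `k`, and the normalization of the step size by the smallest noise level. The exact score of $p_\sigma$ stands in for the trained network, so only the sampling loop is shown and nothing is trained.

```python
import numpy as np

MEANS = np.array([-2.0, 2.0])   # hypothetical toy data: two narrow modes
DATA_STD = 0.1

def noisy_score(x, sigma):
    """Exact score of p_sigma for the toy mixture (stand-in for the network)."""
    var = DATA_STD**2 + sigma**2              # each mode is blurred by the noise
    d = MEANS - np.asarray(x, dtype=float)[..., None]
    log_w = -0.5 * d**2 / var
    w = np.exp(log_w - log_w.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)        # responsibilities of the modes
    return (w * d).sum(axis=-1) / var

def annealed_langevin(score, sigmas, x0, eps=1e-3, k=20, rng=None):
    """Run k Langevin steps per noise level, from the largest sigma down."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for sigma in sigmas[::-1]:                    # sigma_N, ..., sigma_1
        step = eps * sigma**2 / sigmas[0] ** 2    # step size scales with sigma^2
        for _ in range(k):
            z = rng.standard_normal(x.shape)
            x = x + step * score(x, sigma) + np.sqrt(2 * step) * z
    return x

rng = np.random.default_rng(0)
sigmas = np.geomspace(0.05, 5.0, 20)              # sigma_1 < ... < sigma_N
x0 = sigmas[-1] * rng.standard_normal(5000)       # start from broad noise
samples = annealed_langevin(noisy_score, sigmas, x0, rng=rng)
```

The chains start from noise that is much broader than the data and end up concentrated around the two modes. Strictly speaking the final samples follow $p_{\sigma_1}$ rather than $p_X$, which is why the smallest noise level should be small compared to the scale of the data.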

```{note}
Using Langevin dynamics with this decaying noise level is called **annealed Langevin dynamics**.
```


```{figure} images/ald.gif
---
height: 200px
---
Langevin dynamics run at different noise levels $\sigma_i$. [Source.](https://yang-song.net/blog/2021/score/)
```

```{warning}
Question: Why do we use annealed Langevin dynamics? Shouldn't a single small $\sigma_i$ be enough? In theory, yes. In practice, the network $s_\theta$ only works for inputs similar to the ones it has seen during training. If we only used a single small $\sigma_i$, the network would not know how to deal with extremely noisy samples and would therefore fail for the early $x_i$.
```