# <center>Deep Generative Models</center>
## <center>Seminar 8</center>

<center>10.04.2025</center>

## Score matching

<center><img src="pics/score_matching.png" width=1200 /></center>

**Objective:**

$$
\frac{1}{2} \mathbb{E}_{\pi} \left\| \mathbf{s}_{\theta}(\mathbf{x}) - \nabla_{\mathbf{x}} \log \pi(\mathbf{x}) \right\|_2^2 \rightarrow \min_{\theta}
$$

**Problem:** We don't know $\pi(\mathbf{x})$ (denoted $p(\mathbf{x})$ in the picture), so the target score $\nabla_{\mathbf{x}} \log \pi(\mathbf{x})$ cannot be computed :(
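
To make the target concrete, here is a minimal sketch for a toy density that we *do* know. It computes the score $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ of an isotropic Gaussian with autograd and compares it with the closed form $-(\mathbf{x} - \boldsymbol{\mu}) / \sigma^2$. The helper names `gaussian_log_density` and `score_via_autograd` are illustrative and not part of the seminar code.

```python
import math

import torch


def gaussian_log_density(x: torch.Tensor, mu: float, sigma: float) -> torch.Tensor:
    """Log density of N(mu, sigma^2 I), evaluated row-wise on x of shape (B, D)."""
    d = x.shape[-1]
    quad = ((x - mu) ** 2).sum(dim=-1) / sigma ** 2
    return -0.5 * quad - d * math.log(sigma) - 0.5 * d * math.log(2 * math.pi)


def score_via_autograd(log_density_fn, x: torch.Tensor) -> torch.Tensor:
    """Score = gradient of the log density with respect to the input x."""
    x = x.detach().requires_grad_(True)
    # Rows are independent, so the gradient of the sum gives per-sample scores.
    log_p = log_density_fn(x).sum()
    (grad,) = torch.autograd.grad(log_p, x)
    return grad


torch.manual_seed(0)
x = torch.randn(4, 2)
mu, sigma = 1.0, 0.5  # toy parameters chosen for illustration

score = score_via_autograd(lambda z: gaussian_log_density(z, mu, sigma), x)
closed_form = -(x - mu) / sigma ** 2
print(torch.allclose(score, closed_form, atol=1e-5))
```

For real data no such `log_density_fn` is available, which is exactly the problem stated above.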

## Denoising score matching

Instead, we can train $\mathbf{s}_{\theta, \sigma}$ on the **noised distribution** $q(\mathbf{x}_\sigma) = \int \pi(\mathbf{x}) q(\mathbf{x}_\sigma \mid \mathbf{x}) \, d\mathbf{x}$, where $q(\mathbf{x}_\sigma \mid \mathbf{x}) = \mathcal{N}(\mathbf{x}, \sigma^2 \mathbf{I})$, using only the known conditional distribution.

$$
\frac{1}{2} \mathbb{E}_{q(\mathbf{x}_\sigma)} \left\| \mathbf{s}_{\theta, \sigma}(\mathbf{x}_\sigma) - \nabla_{\mathbf{x}_\sigma} \log q(\mathbf{x}_\sigma) \right\|_2^2 \rightarrow \min_{\theta}
$$
$$
\downarrow
$$
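$$
\frac{1}{2}\mathbb{E}_{\pi(\mathbf{x})} \mathbb{E}_{q(\mathbf{x}_\sigma \mid \mathbf{x})} \left\| \mathbf{s}_{\theta, \sigma}(\mathbf{x}_\sigma) - \nabla_{\mathbf{x}_\sigma} \log q(\mathbf{x}_\sigma \mid \mathbf{x}) \right\|_2^2 + \text{const}(\theta) \rightarrow \min_{\theta}
$$

Here $\text{const}(\theta)$ denotes a term that does not depend on $\theta$, so both objectives share the same minimizer.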
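
For the Gaussian kernel, the conditional score in the second objective has a closed form, which is what makes it tractable. Writing $\mathbf{x}_\sigma = \mathbf{x} + \sigma \cdot \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})$:

$$
\nabla_{\mathbf{x}_\sigma} \log q(\mathbf{x}_\sigma \mid \mathbf{x}) = \nabla_{\mathbf{x}_\sigma} \left( -\frac{\left\| \mathbf{x}_\sigma - \mathbf{x} \right\|_2^2}{2 \sigma^2} \right) = -\frac{\mathbf{x}_\sigma - \mathbf{x}}{\sigma^2} = -\frac{\boldsymbol{\epsilon}}{\sigma}
$$

Substituting this into the objective gives $\left\| \mathbf{s}_{\theta, \sigma}(\mathbf{x}_\sigma) + \frac{\boldsymbol{\epsilon}}{\sigma} \right\|_2^2$ inside the expectation, so the model is effectively trained to predict the (scaled, negated) noise that was added to $\mathbf{x}$.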


Once $\mathbf{s}_{\theta, \sigma}$ is trained, we can draw samples from $q(\mathbf{x}_\sigma)$ with **Langevin dynamics**:
$$
\mathbf{x}_l = \mathbf{x}_{l-1} + \frac{\eta}{2} \cdot \mathbf{s}_{\theta, \sigma}(\mathbf{x}_{l-1}) + \sqrt{\eta} \cdot \boldsymbol{\epsilon}_l,
$$
where $\eta$ is the step size and $\boldsymbol{\epsilon}_l \sim \mathcal{N}(0, \mathbf{I})$.
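
The update rule above can be sketched in a few lines. To keep the example checkable, it uses the exact score of a toy Gaussian target instead of a trained network; the target parameters, step size, and number of steps are assumptions chosen for illustration.

```python
import torch


def langevin_sample(score_fn, x_init, step_size, n_steps, generator=None):
    """Run Langevin dynamics with a fixed step size, starting from x_init."""
    x = x_init.clone()
    for _ in range(n_steps):
        noise = torch.randn(x.shape, generator=generator)
        # x_l = x_{l-1} + (eta / 2) * score(x_{l-1}) + sqrt(eta) * eps_l
        x = x + 0.5 * step_size * score_fn(x) + step_size ** 0.5 * noise
    return x


# Toy target (assumption): N(mu, std^2 I), whose score is known exactly.
mu, std = 3.0, 0.5


def gaussian_score(x):
    return -(x - mu) / std ** 2


gen = torch.Generator().manual_seed(0)
x0 = torch.randn(5000, 2, generator=gen)  # 5000 independent chains in 2D
samples = langevin_sample(gaussian_score, x0, step_size=0.01, n_steps=1000, generator=gen)

# The empirical mean and std should be close to mu and std.
print(samples.mean().item(), samples.std().item())
```

Because the step size is finite and there is no accept/reject correction, the chain only approximately targets the distribution; smaller steps reduce this bias but need more iterations.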

## **Question 1.** What $\sigma$ should we choose?

With a **small** $\sigma$, the problem comes from **the manifold hypothesis**:

> Real-world data tend to concentrate on *low-dimensional manifolds* embedded in a *high-dimensional space*.

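The key challenge is that the estimated score functions are inaccurate in low-density regions, where few data points are available for computing the score matching objective. With a small $\sigma$, the noised distribution stays concentrated near the data, so most of the space remains such a region.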
<center><img src="pics/inaccurate.png" width=1200 /></center>

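When the noise magnitude is sufficiently large, the noised samples populate low-density regions, which improves the accuracy of the estimated scores there. The price is that $q(\mathbf{x}_\sigma)$ with a **large** $\sigma$ is far from $\pi(\mathbf{x})$, so samples from it no longer look like real data.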
<center><img src="pics/accurate.png" width=1200 /></center>

## **Question 2.** How to change training and sampling to improve quality?

To achieve the best of both worlds, we use multiple scales of noise perturbations $\sigma_1 < \sigma_2 < \dots < \sigma_T$ **simultaneously**.

### Training
1. Get a sample $\mathbf{x}_0 \sim \pi(\mathbf{x})$.
2. Sample a noise level $t \sim \mathcal{U}\{1, T\}$ and the noise $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})$.
3. Get the noisy image $\mathbf{x}_t = \mathbf{x}_0 + \sigma_t \cdot \boldsymbol{\epsilon}$.
4. Compute loss $\mathcal{L} = \sigma_t^2 \cdot \left\| \mathbf{s}_{\theta, \sigma_t}(\mathbf{x}_t) + \frac{\boldsymbol{\epsilon}}{\sigma_t} \right\|^2$.
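
The four training steps can be sketched as follows. This is a toy setup under stated assumptions, not the seminar's model: `TinyScoreNet` is a hypothetical MLP for 2D data, the geometric noise schedule and the synthetic dataset are made up for illustration, and noise level indices are 0-based (index `0` is the smallest $\sigma$).

```python
import torch
from torch import nn


class TinyScoreNet(nn.Module):
    """Hypothetical toy score model, conditioned on the noise level index t."""

    def __init__(self, data_dim: int, num_levels: int, hidden: int = 64) -> None:
        super().__init__()
        self.embedding = nn.Embedding(num_levels, hidden)
        self.inp = nn.Linear(data_dim, hidden)
        self.out = nn.Sequential(
            nn.SiLU(), nn.Linear(hidden, hidden), nn.SiLU(), nn.Linear(hidden, data_dim)
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.out(self.inp(x) + self.embedding(t))


def dsm_loss(model, x0: torch.Tensor, sigmas: torch.Tensor) -> torch.Tensor:
    """Weighted denoising score matching loss for a batch x0 of clean samples."""
    # Step 2: sample a noise level index per example and Gaussian noise.
    t = torch.randint(0, len(sigmas), (x0.shape[0],))
    eps = torch.randn_like(x0)
    # Reshape sigma_t so that it broadcasts over all non-batch dimensions.
    sigma_t = sigmas[t].view(-1, *([1] * (x0.dim() - 1)))
    # Step 3: perturb the data.
    x_t = x0 + sigma_t * eps
    # Step 4: sigma_t^2 * ||s + eps / sigma_t||^2 == ||sigma_t * s + eps||^2.
    residual = sigma_t * model(x_t, t) + eps
    return residual.pow(2).flatten(1).sum(dim=1).mean()


torch.manual_seed(0)
sigmas = torch.logspace(-2, 0, steps=10)  # assumed schedule: 0.01, ..., 1.0
model = TinyScoreNet(data_dim=2, num_levels=len(sigmas))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data = 1.0 + 0.1 * torch.randn(512, 2)  # step 1: a synthetic "dataset"

for step in range(200):
    loss = dsm_loss(model, data, sigmas)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The $\sigma_t^2$ weight keeps the per-level losses on a comparable scale: without it, the $\frac{\boldsymbol{\epsilon}}{\sigma_t}$ target would make the smallest noise levels dominate the gradient.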

### Sampling

1. **Sample** $\mathbf{x}_0 \sim \mathcal{N}(0, \sigma_T^2 \cdot \mathbf{I}) \approx q(\mathbf{x}_T)$.
2. **Apply** $L$ steps of Langevin dynamics at the current noise level $\sigma_t$:

  $$
  \mathbf{x}_l = \mathbf{x}_{l-1} + \frac{\eta_t}{2} \cdot \mathbf{s}_{\theta, \sigma_t}(\mathbf{x}_{l-1}) + \sqrt{\eta_t} \cdot \boldsymbol{\epsilon}_l
  $$

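3. **Update** $\mathbf{x}_0 := \mathbf{x}_L$ and move to the next, smaller noise level $\sigma_{t-1}$. Repeat steps 2 and 3 until the smallest level $\sigma_1$ has been processed.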
<center><img src="pics/ald.gif" width=1200 /></center>
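
The sampling procedure above (annealed Langevin dynamics) can be sketched as below. Several things here are assumptions for illustration: `score_fn` takes the value of $\sigma$ directly rather than the index $t$, the step size follows the rule $\eta_t \propto \sigma_t^2$ (a common choice that the text above does not prescribe), and the trained network is replaced by the exact score of a noised toy Gaussian so that the result can be checked.

```python
import torch


def annealed_langevin(score_fn, sigmas, n_samples, dim,
                      steps_per_level=100, base_step=2e-4, generator=None):
    """Run L Langevin steps per noise level, from the largest sigma to the smallest.

    score_fn(x, sigma) should approximate the score of the sigma-noised data.
    """
    sigmas = sorted(sigmas, reverse=True)
    sigma_min = sigmas[-1]
    # Step 1: start from N(0, sigma_T^2 I), an approximation of q(x_T).
    x = sigmas[0] * torch.randn(n_samples, dim, generator=generator)
    for sigma in sigmas:
        # Assumed step-size rule: eta_t = base_step * sigma_t^2 / sigma_min^2.
        eta = base_step * (sigma / sigma_min) ** 2
        # Step 2: L Langevin steps at the current noise level.
        for _ in range(steps_per_level):
            noise = torch.randn(x.shape, generator=generator)
            x = x + 0.5 * eta * score_fn(x, sigma) + eta ** 0.5 * noise
        # Step 3: the final x of this level initializes the next, smaller level.
    return x


# Toy data distribution (assumption): N(data_mean, data_std^2 I).
data_mean, data_std = 2.0, 0.2


def noised_gaussian_score(x, sigma):
    # Adding N(0, sigma^2 I) noise to Gaussian data gives variance data_std^2 + sigma^2.
    return -(x - data_mean) / (data_std ** 2 + sigma ** 2)


# Ten geometrically spaced noise levels from 5.0 down to 0.05.
sigmas = [5.0 * 0.01 ** (k / 9) for k in range(10)]
gen = torch.Generator().manual_seed(0)
samples = annealed_langevin(noised_gaussian_score, sigmas, n_samples=2000, dim=2, generator=gen)
print(samples.mean().item(), samples.std().item())
```

With a trained model, `score_fn` would wrap the network call, for example by mapping each $\sigma_t$ to its index $t$ before the forward pass.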

### **Note!**

To condition our score model $\mathbf{s}_{\theta, \sigma_t}(\mathbf{x}_t)$ on $\sigma_t$, we pass the noise level index $t$ as an additional input to the model: $\mathbf{s}_{\theta}(\mathbf{x}_t, t)$. For example:

In [None]:
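import torch
from torch import nn


class ConditionedResnetBlock(nn.Module):
    """Residual conv block conditioned on the noise level index via a learned embedding."""

    def __init__(self, dim: int, num_embeddings: int) -> None:
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=1),
        )
        self.dim = dim
        # One learned vector of size `dim` per noise level index.
        self.embedding = nn.Embedding(num_embeddings=num_embeddings, embedding_dim=dim)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: (B, dim, H, W) feature map; y: (B,) integer noise level indices.
        # The embedding is reshaped to (B, dim, 1, 1) and broadcast over H and W.
        time_embed = self.embedding(y).view(-1, self.dim, 1, 1)
        return x + self.block(x + time_embed)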