# Denoising Diffusion Models

A denoising diffusion model consists of two processes:
1. **Forward diffusion process** - gradually adds noise to the input
2. **Reverse denoising process** - learns to generate data by denoising

Suppose we observe training data $\mathbf{x}$ and assume it is governed by an unobserved latent random variable $\mathbf{z}$.
Generation then consists of two steps:
1. A **latent value** $\mathbf{z}$ is drawn from a prior distribution $p(\mathbf{z})$.
2. An **observed value** $\mathbf{x}$ is drawn from the conditional distribution $p(\mathbf{x} | \mathbf{z})$.
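
The two-step generative story can be sketched in a few lines. This is only a toy illustration: the standard normal prior, the Gaussian likelihood $p(\mathbf{x} | \mathbf{z}) = \mathcal{N}(\mathbf{z}, 0.5^2)$, and the function name `sample_latent_model` are assumptions made for the example, not part of the notes.

```python
import numpy as np

def sample_latent_model(n, rng):
    """Ancestral sampling: draw z from the prior, then x given z."""
    z = rng.standard_normal(n)             # step 1: z ~ p(z) = N(0, 1)  (assumed prior)
    x = z + 0.5 * rng.standard_normal(n)   # step 2: x ~ p(x | z) = N(z, 0.5^2)  (assumed likelihood)
    return z, x

rng = np.random.default_rng(0)
z, x = sample_latent_model(100_000, rng)

# Under these assumptions the marginal variance of x is 1 + 0.25 = 1.25,
# so the empirical variance should land close to that value.
print(x.var())
```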

## Preliminaries for Denoising Diffusion Probabilistic Models (DDPM)

|Concept | Summary | 
|--------|---------|
| **Markov Chain** | $$p(x_{t+1} \| x_t, \dots, x_1) = p(x_{t+1} \| x_t)$$ |
| **Latent Value $\mathbf{z}$** | Generated from some prior distribution $p(\mathbf{z})$ |
| **Observed Value $\mathbf{x}$** | Generated from a conditional distribution $p(\mathbf{x} \| \mathbf{z})$ |
| **Marginal-Likelihood/Evidence** | $$p(\mathbf{x}) = \int p(\mathbf{z})p(\mathbf{x} \| \mathbf{z}) d\mathbf{z}$$ |
| **Posterior** | $$p(\mathbf{z} \| \mathbf{x}) = \frac{p(\mathbf{z}) \cdot p(\mathbf{x} \| \mathbf{z})}{p(\mathbf{x})}$$ |

Computing the **posterior** directly is usually intractable, because the evidence $p(\mathbf{x})$ in the denominator requires a high-dimensional integration over $\mathbf{z}$.
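
To make the evidence integral concrete, here is a minimal sketch that estimates $p(\mathbf{x})$ by Monte Carlo in a one-dimensional toy model where the exact answer is known. The model ($\mathbf{z} \sim \mathcal{N}(0, 1)$, $\mathbf{x} | \mathbf{z} \sim \mathcal{N}(\mathbf{z}, 0.25)$) and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def evidence_monte_carlo(x, n_samples, rng):
    """Estimate p(x) = E_{z ~ p(z)}[p(x | z)] by averaging over prior samples."""
    z = rng.standard_normal(n_samples)            # z ~ N(0, 1)  (assumed prior)
    return gaussian_pdf(x, mean=z, var=0.25).mean()

rng = np.random.default_rng(0)
x_obs = 0.7
estimate = evidence_monte_carlo(x_obs, 200_000, rng)

# For this toy model the integral has a closed form: p(x) = N(x; 0, 1 + 0.25).
exact = gaussian_pdf(x_obs, mean=0.0, var=1.25)
print(estimate, exact)
```

In one dimension a plain average over prior samples works well. The difficulty described above appears when $\mathbf{z}$ is high-dimensional and few prior samples explain the observed $\mathbf{x}$.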

We therefore turn to another method: **variational inference**.

We use a simpler distribution $q_{\phi}(\mathbf{z} \mid \mathbf{x})$ to approximate the **true posterior**.

The goal is to find a distribution that is simple enough to be tractable, yet close to the true posterior.

We measure the gap between the two distributions with the Kullback-Leibler (KL) divergence and choose the parameters $\phi$ that minimize it:

$$
\arg\min_{\phi} D_{KL} \left( q_{\phi}(\mathbf{z} \mid \mathbf{x}) \parallel p(\mathbf{z} \mid \mathbf{x}) \right)
$$
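
As a small worked example of the divergence being minimized, the sketch below evaluates the closed-form KL divergence between two univariate Gaussians. The Gaussian setting and the function name `kl_gaussian` are assumptions for illustration; the notes do not fix a particular family.

```python
import numpy as np

def kl_gaussian(m_q, v_q, m_p, v_p):
    """KL( N(m_q, v_q) || N(m_p, v_p) ) for univariate Gaussians with variances v_q, v_p."""
    return 0.5 * (np.log(v_p / v_q) + (v_q + (m_q - m_p) ** 2) / v_p - 1.0)

same = kl_gaussian(0.0, 1.0, 0.0, 1.0)      # identical distributions
shifted = kl_gaussian(0.0, 1.0, 1.0, 1.0)   # means differ by 1, unit variances
forward = kl_gaussian(0.0, 1.0, 0.0, 4.0)   # KL(q || p)
reverse = kl_gaussian(0.0, 4.0, 0.0, 1.0)   # KL(p || q)
print(same, shifted, forward, reverse)
```

The divergence is zero only when the two distributions coincide, and it is not symmetric, which is why the order of the arguments matters.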

$$
\begin{aligned}
&\textbf{Denoising Diffusion Probabilistic Models (DDPM)} \\
\\
&\textbf{1. Forward Process (Adding Noise):} \\
&q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t \mathbf{I}) \\
&x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}) \quad (\text{where } \alpha_t = 1 - \beta_t, \ \bar{\alpha}_t = \prod_{s=1}^t \alpha_s) \\
\\
&\textbf{2. Reverse Process (Denoising):} \\
&p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) \\
\\
&\textbf{3. Simplified Objective Function:} \\
&L_{\text{simple}}(\theta) = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, t) \|^2 \right] \\
\\
&\textbf{4. Sampling Step (Iterative Denoising):} \\
&x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z \quad \text{where } z \sim \mathcal{N}(0, \mathbf{I})
\end{aligned}
$$
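
The closed-form expression for $x_t$ follows from applying the one-step kernel $q(x_t | x_{t-1})$ repeatedly. The sketch below checks this numerically, using $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$. The linear schedule from $10^{-4}$ to $0.02$ with $T = 1000$ is an assumption for the example (a commonly used setting), not something these notes prescribe.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # alpha_bars[t-1] holds \bar{alpha}_t

# Write x_t = coef * x_0 + noise, with Var(noise) = var, and apply the
# one-step kernel x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps_t for T steps.
coef, var = 1.0, 0.0
for beta in betas:
    coef = np.sqrt(1.0 - beta) * coef
    var = (1.0 - beta) * var + beta

# The iterated values should match the closed form: sqrt(alpha_bar_T) and 1 - alpha_bar_T.
print(coef, np.sqrt(alpha_bars[-1]))
print(var, 1.0 - alpha_bars[-1])
```

With this schedule $\bar{\alpha}_T$ is close to zero, so $x_T$ is almost pure Gaussian noise regardless of $x_0$.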

### Training Algorithm for Denoising Diffusion Probabilistic Models (DDPM)

**1. Forward Process (Adding Noise):**
- For each training sample $x_0$:
  - For any timestep $t \in \{1, \dots, T\}$:
    - Sample noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$
    - Generate noisy data directly from $x_0$, without simulating the intermediate steps: $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon$
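
A minimal sketch of this noising step for a batch, where each sample gets its own timestep. The function name `q_sample`, the linear schedule, and the 1-based timestep convention are assumptions for illustration.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)    # alpha_bars[t-1] holds \bar{alpha}_t

def q_sample(x0, t, eps):
    """Return x_t for a batch x0 of shape (B, D) and 1-based timesteps t of shape (B,)."""
    ab = alpha_bars[t - 1][:, None]     # shape (B, 1), broadcasts over the feature axis
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 3))        # toy batch standing in for training data
eps = rng.standard_normal((4, 3))
t = np.array([1, 250, 500, 1000])
xt = q_sample(x0, t, eps)

# At t = 1 the sample is barely perturbed; at t = T it is almost entirely noise.
print(np.abs(xt[0] - x0[0]).max(), np.abs(xt[3] - eps[3]).max())
```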

**2. Training Objective:**
- Train a neural network $\epsilon_\theta(x_t, t)$ to predict the noise $\epsilon$ that was mixed into $x_t$.
- Minimize the loss:
  $$
  L_{\text{simple}}(\theta) = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]
  $$

**3. Training Steps:**
- For each batch:
  1. Sample $x_0$ from the dataset.
  2. Sample a timestep $t$ uniformly from $\{1, \dots, T\}$ (the noise level).
  3. Sample noise $\epsilon$.
  4. Compute $x_t$ using the forward process.
  5. Predict noise: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$.
  6. Compute the loss $\| \epsilon - \hat{\epsilon} \|^2$, backpropagate, and update $\theta$.
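
The six steps can be sketched end to end with NumPy. To keep the example free of a deep learning framework, the neural network is replaced by a hypothetical stand-in (one scale and one bias per timestep) whose gradients are written out by hand. The toy 1-D dataset, the schedule, and the learning rate are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-4, 0.2, T)        # assumed schedule for this toy example
alpha_bars = np.cumprod(1.0 - betas)

# Hypothetical stand-in for a neural network:
# eps_theta(x_t, t) = w[t] * x_t + b[t], with t used as a 0-based index.
w = np.zeros(T)
b = np.zeros(T)

def train_step(batch_size=256, lr=0.2):
    x0 = 2.0 + 0.5 * rng.standard_normal(batch_size)   # 1. sample x_0 (toy 1-D dataset)
    t = rng.integers(0, T, size=batch_size)            # 2. sample t uniformly
    eps = rng.standard_normal(batch_size)              # 3. sample noise
    ab = alpha_bars[t]
    xt = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps    # 4. forward process
    eps_hat = w[t] * xt + b[t]                         # 5. predict noise
    err = eps_hat - eps
    loss = np.mean(err ** 2)                           # 6. L_simple on the batch
    # Hand-written gradients of the batch loss; np.add.at accumulates
    # correctly when the same timestep appears several times in the batch.
    grad_w = np.zeros(T)
    grad_b = np.zeros(T)
    np.add.at(grad_w, t, 2.0 * err * xt / batch_size)
    np.add.at(grad_b, t, 2.0 * err / batch_size)
    w[:] -= lr * grad_w                                # plain gradient descent update
    b[:] -= lr * grad_b
    return loss

losses = [train_step() for _ in range(2000)]
print(losses[0], np.mean(losses[-100:]))
```

In practice the linear stand-in would be replaced by a network such as a U-Net and the manual gradients by automatic differentiation; the structure of the loop stays the same.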

**Summary:**  
The model learns to reverse the noise process by predicting the noise at each step, enabling generation by iterative denoising.


### Sampling Algorithm for Denoising Diffusion Probabilistic Models (DDPM)

**1. Start with pure noise:**  
- Initialize $x_T \sim \mathcal{N}(0, \mathbf{I})$ (random noise).

**2. Iterative Denoising:**  
- For $t = T$ down to $1$:
  - Predict noise: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$ using the trained model.
  - Compute mean: $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \hat{\epsilon} \right)$.
  - Sample $z \sim \mathcal{N}(0, \mathbf{I})$ (if $t > 1$; else $z = 0$).
  - Update, where $\sigma_t$ is the standard deviation of the reverse step (for example, $\sigma_t^2 = \beta_t$):  
    $$
    x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \hat{\epsilon} \right) + \sigma_t z
    $$

**3. Output:**  
- $x_0$ is the generated sample (e.g., image).
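
The sampling loop can be sketched as follows. To make the result checkable without training, the example assumes a toy dataset in which every sample equals a fixed vector `mu`; for that dataset the noise in $x_t$ can be recovered exactly, so `eps_model` below is an analytic stand-in for a trained network. The schedule and the choice $\sigma_t^2 = \beta_t$ are also assumptions.

```python
import numpy as np

T = 200
betas = np.linspace(1e-4, 0.05, T)       # assumed noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
mu = np.array([1.5, -0.5])               # toy dataset: every training sample equals mu

def eps_model(xt, t):
    """Exact noise predictor for the point-mass dataset (stand-in for a trained network)."""
    ab = alpha_bars[t - 1]
    return (xt - np.sqrt(ab) * mu) / np.sqrt(1.0 - ab)

def ddpm_sample(eps_model, shape, rng):
    x = rng.standard_normal(shape)                       # 1. x_T ~ N(0, I)
    for t in range(T, 0, -1):                            # 2. t = T, ..., 1
        a, ab = alphas[t - 1], alpha_bars[t - 1]
        eps_hat = eps_model(x, t)                        # predict noise
        mean = (x - (1.0 - a) / np.sqrt(1.0 - ab) * eps_hat) / np.sqrt(a)
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(betas[t - 1]) * z             # sigma_t^2 = beta_t (assumed choice)
    return x                                             # 3. x_0

samples = ddpm_sample(eps_model, (5, 2), np.random.default_rng(0))
print(samples)
```

Because the final step adds no noise and the predictor is exact, every row of `samples` should coincide with `mu` up to floating-point error. With a learned $\epsilon_\theta$ the samples would instead follow the model's approximation of the data distribution.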

**Summary:**  
Start from random noise and iteratively denoise using the learned model to generate a realistic sample.