```{contents}
```
## Diffusion Models


Diffusion Models are **generative models** that learn to **generate new data (like images or audio)** by *reversing a gradual noising process*.

Think of it like this:

> You take a clean image → slowly add random noise → it becomes pure static.
>
> Now, a diffusion model learns how to **reverse this process** — starting from pure noise, it learns how to **denoise** step by step until a realistic image reappears.

That’s the essence of diffusion.

---

### High-Level Idea

A diffusion model has two processes:

#### (a) **Forward Process (Diffusion)**

Gradually destroys data by adding Gaussian noise over many small steps.

$$
x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon
$$
where

* $x_0$: original data (e.g., image)
* $x_t$: noisy version at timestep $t$
* $\epsilon \sim \mathcal{N}(0, I)$: Gaussian noise
* $\alpha_t \in (0, 1)$: fraction of signal kept at step $t$; the added noise has variance $1-\alpha_t$, so a smaller $\alpha_t$ means more noise

After many steps, $x_T$ ≈ pure noise.

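The forward step above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the linear schedule for $1-\alpha_t$, the value `T = 1000`, and the names `forward_step` and `forward_jump` are assumptions made for the example. It also uses the standard closed form obtained by composing the Gaussian steps, $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, to jump to any timestep directly. With this schedule, both routes should end close to a standard normal regardless of the starting data.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
# Assumed linear schedule for beta_t = 1 - alpha_t (illustrative choice only).
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # alpha_bar_t = product of alpha_s for s <= t


def forward_step(x_prev, t):
    """One noising step (t is 1-indexed): sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(alphas[t - 1]) * x_prev + np.sqrt(1.0 - alphas[t - 1]) * eps


def forward_jump(x0, t):
    """Sample x_t directly from x_0 with the closed form; also returns the noise used."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps
    return x_t, eps


x0 = np.full(10_000, 3.0)  # toy "data": every entry equals 3

# Route 1: apply all T small steps in sequence.
x = x0.copy()
for t in range(1, T + 1):
    x = forward_step(x, t)

# Route 2: jump straight to t = T.
x_T_direct, _ = forward_jump(x0, T)

print("step-by-step: mean %.3f, std %.3f" % (x.mean(), x.std()))
print("closed form:  mean %.3f, std %.3f" % (x_T_direct.mean(), x_T_direct.std()))
```
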
---

#### (b) **Reverse Process (Denoising)**

The model learns to **reverse** the forward process:
$$
p_\theta(x_{t-1} | x_t)
$$
That is, given a noisy image $x_t$, predict a slightly *cleaner* version $x_{t-1}$.

This is done step by step — from noise → structure → realistic image.

---

### Training Objective

The model is trained to predict the **noise** that was added at each step.

$$
L = \mathbb{E}_{x_0, \epsilon, t}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]
$$

* $\epsilon_\theta(x_t, t)$: model’s predicted noise
* $\epsilon$: true noise added
* The model learns to accurately “remove” noise from any level of corruption.

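The objective above can be estimated on a batch as follows. This is a sketch under stated assumptions: the linear noise schedule, the uniform sampling of $t$, and the helper names `noise_prediction_loss` and `zero_model` are all hypothetical, and `zero_model` is an untrained stand-in where a neural network would normally go. The noisy input is drawn with the standard closed form $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, where $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$. Because `zero_model` always predicts zero, its loss should sit near the data dimension (8 here), which is the baseline a trained model has to beat.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # assumed linear schedule for 1 - alpha_t
alpha_bars = np.cumprod(1.0 - betas)  # cumulative product of alpha_t


def noise_prediction_loss(eps_model, x0_batch):
    """Monte Carlo estimate of E[ ||eps - eps_model(x_t, t)||^2 ] on one batch."""
    n = x0_batch.shape[0]
    t = rng.integers(1, T + 1, size=n)          # one random timestep per sample
    eps = rng.standard_normal(x0_batch.shape)   # the true noise
    a_bar = alpha_bars[t - 1][:, None]          # shape (n, 1) for broadcasting
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * eps
    pred = eps_model(x_t, t)
    return np.mean(np.sum((eps - pred) ** 2, axis=1))


def zero_model(x_t, t):
    """Hypothetical untrained stand-in: always predicts zero noise."""
    return np.zeros_like(x_t)


x0_batch = rng.standard_normal((4096, 8))  # toy 8-dimensional "data"
loss = noise_prediction_loss(zero_model, x0_batch)
print("baseline loss:", loss)
```
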
---

### Generation Workflow

Once trained, the generation process works **backward**:

1. Start with **pure noise** $x_T \sim \mathcal{N}(0, I)$.
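2. Use the trained model to predict and subtract noise step by step:
   $$
   x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z
   $$
   where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ is the cumulative signal fraction, $z \sim \mathcal{N}(0, I)$ is fresh noise (set to zero at the final step), and $\sigma_t^2 = 1-\alpha_t$ is a common choice.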
3. After repeating this for all $T$ steps, you get $x_0$ — a realistic image or data sample.

This process is **iterative denoising** — like watching a noisy static image slowly clear into a picture.

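The loop above can be sketched as follows, using the standard DDPM update, in which the predicted noise is scaled by $(1-\alpha_t)/\sqrt{1-\bar{\alpha}_t}$ with $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$. Everything specific here is an assumption for illustration: the schedule, `T = 200`, the choice $\sigma_t^2 = 1-\alpha_t$, and above all `oracle_eps`, which replaces a trained network with the exact noise for a toy dataset whose every point equals `mu`. With that oracle, every sample should land on `mu`, which makes the loop easy to sanity-check.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 200
betas = np.linspace(1e-4, 0.05, T)  # assumed schedule for 1 - alpha_t
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

mu = np.array([2.0, -1.0])  # toy dataset: every training point equals mu


def oracle_eps(x_t, t):
    """Exact noise for the point-mass toy data; stands in for a trained network."""
    a_bar = alpha_bars[t - 1]
    return (x_t - np.sqrt(a_bar) * mu) / np.sqrt(1.0 - a_bar)


def sample(eps_model, shape):
    """Iterative denoising from x_T ~ N(0, I) down to x_0."""
    x = rng.standard_normal(shape)
    for t in range(T, 0, -1):
        a, a_bar = alphas[t - 1], alpha_bars[t - 1]
        eps_hat = eps_model(x, t)
        mean = (x - (1.0 - a) / np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(a)
        # Fresh noise at every step except the last; sigma_t^2 = 1 - alpha_t (assumed).
        z = rng.standard_normal(shape) if t > 1 else 0.0
        x = mean + np.sqrt(1.0 - a) * z
    return x


samples = sample(oracle_eps, (5, 2))
print(samples)
```
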
---

### Intuitive Analogy

| **Process**         | **Analogy**                                                               |
| ------------------- | ------------------------------------------------------------------------- |
| Forward (Diffusion) | Adding sand grain by grain until the image becomes pure sand.             |
| Reverse (Denoising) | Carefully brushing away grains to reveal the hidden sculpture underneath. |

---

### Why It Works So Well

* Diffusion models **don’t rely on adversarial training** (unlike GANs), so they’re **more stable**.
* They can model **complex multimodal distributions** — generating very diverse and realistic samples.
* Trained on **massive datasets**, they learn rich representations of real-world textures, lighting, and structure.

---

### Key Equations Summary

| **Stage**     | **Equation**                                                                                   | **Purpose**            |
| ------------- | ---------------------------------------------------------------------------------------------- | ---------------------- |
| Forward       | $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t}\,x_{t-1}, (1-\alpha_t)I)$              | Add noise gradually    |
| Reverse       | $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$ | Remove noise gradually |
| Training Loss | $L = \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2$                                       | Learn to predict noise |

---

### Types of Diffusion Models

| **Model Type**                                     | **Description**                                           | **Examples**         |
| -------------------------------------------------- | --------------------------------------------------------- | -------------------- |
| **DDPM** (Denoising Diffusion Probabilistic Model) | Basic diffusion framework using Gaussian noise            | Ho et al., 2020      |
| **DDIM** (Denoising Diffusion Implicit Model)      | Non-Markovian, optionally deterministic sampling; faster generation with fewer steps | Song et al., 2021    |
| **Score-based Models**                             | Learn the score $\nabla_x \log p(x)$ by score matching; closely related to noise prediction | Song & Ermon, 2020   |
| **Latent Diffusion Models (LDM)**                  | Diffusion operates in compressed latent space (efficient) | **Stable Diffusion** |

---
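The fewer-step sampling listed for DDIM above can be sketched as a deterministic update that first estimates $x_0$ from the predicted noise and then re-noises that estimate to an earlier, possibly much earlier, timestep. This is an illustrative sketch of the zero-variance DDIM update only. The schedule, the evenly spaced choice of 20 timesteps, and the names `ddim_sample` and `oracle_eps` are assumptions, and `oracle_eps` is the exact noise for a toy dataset concentrated on a single point `mu` rather than a trained network. Here $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$. With the oracle, 20 steps out of 1000 should still recover `mu`.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))  # assumed schedule

mu = np.array([2.0, -1.0])  # toy dataset: a single point


def oracle_eps(x_t, t):
    """Exact noise for the point-mass toy data; stands in for a trained network."""
    a_bar = alpha_bars[t - 1]
    return (x_t - np.sqrt(a_bar) * mu) / np.sqrt(1.0 - a_bar)


def ddim_sample(eps_model, shape, num_steps):
    """Deterministic DDIM-style sampling over a subset of the T timesteps."""
    # Evenly spaced timesteps from T down to 1 (one simple way to pick them).
    timesteps = np.linspace(T, 1, num_steps).round().astype(int)
    x = rng.standard_normal(shape)
    for i, t in enumerate(timesteps):
        a_bar = alpha_bars[t - 1]
        eps_hat = eps_model(x, t)
        # Estimate the clean sample implied by the predicted noise.
        x0_hat = (x - np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(a_bar)
        if i + 1 == len(timesteps):
            x = x0_hat  # final step: return the clean estimate
        else:
            a_bar_prev = alpha_bars[timesteps[i + 1] - 1]
            # Move to the next (earlier) timestep without adding fresh noise.
            x = np.sqrt(a_bar_prev) * x0_hat + np.sqrt(1.0 - a_bar_prev) * eps_hat
    return x


samples = ddim_sample(oracle_eps, (5, 2), num_steps=20)
print(samples)
```
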

### Applications

| **Domain**                     | **Examples**                                         |
| ------------------------------ | ---------------------------------------------------- |
| **Text-to-Image Generation**   | Stable Diffusion, DALL·E 2, Imagen                   |
| **Super-Resolution**           | Enhancing low-res images                             |
| **Image Inpainting**           | Filling missing parts of an image                    |
| **Image-to-Image Translation** | Turning sketches → photos, day → night               |
| **Video and Audio Generation** | Motion synthesis, speech synthesis                   |
| **Medical Imaging**            | Generating realistic MRI scans for data augmentation |

---

### Comparison: GAN vs Diffusion

| **Aspect**       | **GAN**                            | **Diffusion Model**                    |
| ---------------- | ---------------------------------- | -------------------------------------- |
| Training         | Adversarial (unstable)             | Noise-prediction (stable)              |
| Diversity        | Prone to mode collapse             | High diversity                         |
| Control          | Difficult to condition             | Easily conditioned (text, class, etc.) |
| Generation Speed | Fast                               | Slower (iterative denoising)           |
| Output Quality   | Sharp but sometimes artifact-prone | Highly realistic and smooth            |

---

### Real-World Examples

* **Stable Diffusion (2022):**
  Text-to-image generation using diffusion in a *latent space* (compact representation of images).
  Prompts like

  > “A futuristic city skyline at sunset in cyberpunk style”

  generate photorealistic or stylized images.

* **DALL·E 3 (OpenAI):**
  Combines **transformers + diffusion** for fine-grained text-conditioned image generation.

---

### Summary

| **Component**         | **Role**                                        |
| --------------------- | ----------------------------------------------- |
| **Forward Diffusion** | Gradually adds Gaussian noise to destroy data   |
| **Reverse Diffusion** | Neural net learns to denoise step-by-step       |
| **Training Goal**     | Predict the noise added at each step, which implicitly fits the distribution of real data |
| **Output**            | Realistic new samples generated from noise      |