# Key Components of Diffusion Models

## Objectives

By the end of this notebook, you will be able to:

- Identify and describe the core components of diffusion models.
- Understand the role of the denoising neural network, noise schedule, and sampling process.
- Visualize and explain how each component contributes to generating realistic samples.
- Compare architecture types like U-Net and Transformer-based models.
- Connect theoretical understanding with mathematical intuition and visuals.


-----
## Core Components of Diffusion Models

Diffusion models are built upon several core components that define how they function and are trained. Understanding these components will help clarify how diffusion transforms noise into meaningful data, especially images.

##  1. Neural Network (Denoising Network)

###  Purpose

The neural network in diffusion models is trained to predict the noise component added at each timestep during the forward diffusion process.

Mathematically, at each timestep $t$, the model learns to estimate:

$$
\epsilon_\theta(x_t, t)
$$

Where:
- $x_t$: Noisy image at timestep $t$
- $t$: Current timestep
- $\epsilon_\theta$: Noise predicted by the network with parameters $\theta$

---

###  Common Architectures

####  U-Net

- Popular in image-based diffusion models.
- Has a symmetric encoder-decoder structure with skip connections.
- Enables the network to capture both global context and fine details.


<p align="center">
  <img src="https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/u-net-architecture.png" alt="Figure 1: Architecture of U-Net" width="600"/>
</p>

<p align="center">
  <img src="https://i.postimg.cc/bNWRMfN1/U-net-Architecture.png" alt="Figure 1: Architecture of U-Net" width="600"/>
</p>


<p align="center">
  <img src="https://i.postimg.cc/FFfpxKhh/Description.png" alt="Figure 1: Architecture of U-Net" width="300"/>
</p>



<p align="center"><b>Figure 1:</b> Architecture of U-Net</p>
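To make the role of the skip connections concrete, here is a shape-only sketch of a two-level U-shaped pass in NumPy. It is not a real U-Net: the learned convolutions are left out, and the helper names (`downsample`, `upsample`, `toy_unet_pass`) are hypothetical. It only shows how coarse features from the bottleneck are concatenated with finer encoder features on the way back up.

```python
import numpy as np

def downsample(x):
    """Halve the spatial resolution with 2x2 average pooling. x has shape (C, H, W)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """Double the spatial resolution by nearest-neighbour repetition."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def toy_unet_pass(x):
    """Shape-only sketch of a two-level U-Net (no learned layers)."""
    skip1 = x                        # full-resolution features (fine detail)
    skip2 = downsample(skip1)        # encoder level 1
    bottleneck = downsample(skip2)   # coarsest level (global context)
    # Decoder: upsample, then concatenate the matching encoder features.
    u2 = np.concatenate([upsample(bottleneck), skip2], axis=0)
    u1 = np.concatenate([upsample(u2), skip1], axis=0)
    return u1

x = np.random.default_rng(0).normal(size=(3, 32, 32))
out = toy_unet_pass(x)
print(out.shape)  # (9, 32, 32)
```

The output keeps the input resolution, and its last three channels are the untouched input, which is the path by which fine detail bypasses the bottleneck.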



####  Transformer-based Networks

- Used for text, image, and multimodal diffusion models.
- Leverages self-attention to model long-range dependencies.

<p align="center">
  <img src="https://erdem.pl/static/2f26fadad0f8290c51b1b8579c008aeb/41d3c/attention.png" alt="Figure 2: Self-Attention block" width="400"/>
</p>



<p align="center"><b>Figure 2:</b> Self-Attention block</p>
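The self-attention operation can be sketched in a few lines of NumPy. This is a minimal single-head version of scaled dot-product attention with random, untrained projection matrices; the sizes (16 tokens with 8 features) and the variable names are illustrative assumptions, not values from any particular model.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax."""
    scores = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence x of shape (N, D)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k))  # (N, N): how much each token attends to every other
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))  # e.g. 16 image patches with 8 features each
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (16, 8) (16, 16)
```

Every row of `weights` sums to 1, and every token can draw on every other token in a single layer, which is what is meant by modeling long-range dependencies.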

---

###  Function

At every step $t$, the denoising network receives:

- The noisy input image: $x_t$
- The current timestep: $t$

The model predicts the noise component:

$$
\epsilon_\theta(x_t, t)
$$

Then the sample for the previous step $x_{t-1}$ is computed as:

$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \cdot \epsilon_\theta(x_t, t) \right) + \sigma_t \cdot z
$$

Where:
- $\alpha_t \text{ and } \bar{\alpha}_t$ are noise schedule coefficients.
- $\sigma_t$ is the standard deviation of the added noise at timestep $t$.
- $z \sim \mathcal{N}(0, I)$ represents standard Gaussian noise.
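A single reverse update can be sketched as follows, with the coefficient $(1 - \alpha_t)/\sqrt{1 - \bar{\alpha}_t}$ applied to the predicted noise. The function name `reverse_step`, the 0-based timestep index, and the choice $\sigma_t^2 = \beta_t$ are assumptions made for this sketch. As a sanity check, if the network predicted the true noise perfectly at the first timestep and no fresh noise were added, the update would recover $x_0$ exactly.

```python
import numpy as np

def reverse_step(x_t, eps_pred, t, betas, z=None):
    """One reverse update x_t -> x_{t-1}. t is a 0-based index into betas."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    coef = (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps_pred) / np.sqrt(alphas[t])
    if z is None:                  # deterministic step (used for the final step)
        return mean
    sigma_t = np.sqrt(betas[t])    # assumption: sigma_t^2 = beta_t
    return mean + sigma_t * z

# Sanity check at the first timestep with a perfect noise prediction.
rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
x0 = rng.normal(size=(4, 4))
eps = rng.normal(size=(4, 4))
x1 = np.sqrt(1.0 - betas[0]) * x0 + np.sqrt(betas[0]) * eps  # forward step to t = 1
x0_rec = reverse_step(x1, eps, t=0, betas=betas)
print(np.allclose(x0_rec, x0))  # True
```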


---

### Intuition

> Think of the denoising network as a *noise cleaner*. At each step, it reverses a tiny part of the corruption caused by noise, gradually revealing the final image.

---

| Concept                      | Suggested Image Description                                    |
|------------------------------|----------------------------------------------------------------|
| U-Net Architecture           | Show a U-Net diagram with encoder, decoder, and skip paths    |
| Transformer Model (optional) | Display attention layers of Transformer blocks                |
| Denoising Process Flow       | Diagram of $x_t \rightarrow \epsilon_\theta(x_t, t) \rightarrow x_{t-1}$ |




## 2. Noise Schedule (β Schedule)

###  Purpose

The **noise schedule** defines how much noise is added to the data at each timestep during the forward diffusion process. It also controls the reverse process by influencing the denoising steps.

It is essentially a sequence of variance values $\beta_1, \beta_2, \ldots, \beta_T$ over $T$ timesteps, determining how rapidly or gradually the data is destroyed.

---

###  Mathematical Formulation

At any step $t$, the noisy image can be sampled directly from the clean image $x_0$ in closed form:

$$
x_t = \sqrt{\bar{\alpha}_t} \cdot x_0 + \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
$$

Where:

\begin{align*}
\alpha_t &= 1 - \beta_t &&\text{(signal retention at timestep } t\text{)} \\
\bar{\alpha}_t &= \prod_{s=1}^t \alpha_s &&\text{(cumulative product over time)} \\
\epsilon &\sim \mathcal{N}(0, I) &&\text{(standard Gaussian noise)}
\end{align*}
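The closed-form forward process translates almost line by line into NumPy. The sketch below assumes a linear schedule with $T = 1000$ and endpoints $10^{-4}$ and $0.02$ (illustrative values), and uses a hypothetical helper `q_sample` with a 0-based timestep index.

```python
import numpy as np

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t directly from x_0 using the closed-form forward process."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas                 # signal retention per step
alpha_bars = np.cumprod(alphas)      # cumulative signal retention

rng = np.random.default_rng(0)
x0 = np.ones((8, 8))                 # stand-in for a clean image
for t in (0, 499, 999):
    x_t, _ = q_sample(x0, t, alpha_bars, rng)
    # The signal weight sqrt(alpha_bar_t) shrinks as t grows.
    print(t + 1, float(np.sqrt(alpha_bars[t])), float(x_t.mean()))
```

Because $\bar{\alpha}_t$ decreases monotonically toward zero, the sample at the last timestep is dominated by the noise term.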


---

### Types of Noise Schedules

#### 1. **Linear Schedule**
- $\beta_t$ increases linearly from a small value to a larger one.

- Simple and widely used in early diffusion models.
- The per-step noise variance grows at a constant rate, which can wipe out most of the image signal well before the final timestep.

#### 2. **Cosine Schedule**
- Introduced in *Improved Denoising Diffusion Probabilistic Models* (Nichol & Dhariwal, 2021).
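- Adds noise slowly in the early timesteps and more rapidly later, so $\bar{\alpha}_t$ decays more gradually than under the linear schedule.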
- Produces sharper images and improves training stability.

#### 3. **Quadratic / Sigmoid Schedules**
- Non-linear increase or S-shaped growth in noise.
- Useful for adjusting training dynamics or optimizing model quality.
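The schedules above can be sketched as small NumPy functions. The default endpoints ($10^{-4}$, $0.02$), the cosine offset `s = 0.008`, and the clipping value are commonly used settings and are treated as assumptions here; the quadratic form shown (linear in $\sqrt{\beta_t}$) is one possible variant, not the only definition.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """beta_t grows linearly from beta_start to beta_end."""
    return np.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T, s=0.008, max_beta=0.999):
    """Define alpha_bar_t by a squared cosine, then derive beta_t from its ratios."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bars = f / f[0]
    betas = 1.0 - alpha_bars[1:] / alpha_bars[:-1]
    return np.clip(betas, 0.0, max_beta)  # avoid beta_t = 1 at the very end

def quadratic_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Assumed variant: linear in sqrt(beta_t), hence quadratic in beta_t."""
    return np.linspace(beta_start ** 0.5, beta_end ** 0.5, T) ** 2

T = 1000
lin = linear_beta_schedule(T)
cos = cosine_beta_schedule(T)
quad = quadratic_beta_schedule(T)

# Compare how much signal survives halfway through the forward process.
mid = T // 2 - 1
print(np.cumprod(1 - lin)[mid], np.cumprod(1 - cos)[mid])
```

Under these settings the cosine schedule retains noticeably more signal at the midpoint than the linear one.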

---

###  Intuition

> The noise schedule acts like a **corruption controller**. In the forward process, it determines **how aggressively** the image gets noised. In the reverse process, it **guides the denoiser** in how much correction to apply.

---





## 3. Sampling Procedure in Diffusion Models

### Purpose

The sampling process is where diffusion models generate data (e.g., images) from pure noise by running the reverse diffusion process. It begins with a randomly sampled noisy input:

$$
x_T \sim \mathcal{N}(0, I)
$$

Then it gradually denoises it using a trained neural network to reach the final output:

$$
x_0
$$

This is how the model synthesizes realistic samples without any data input at sampling time, in effect learning to "imagine from noise."

### Mathematical Formulation

The reverse denoising step at timestep $t$ is usually defined as:

$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \cdot \epsilon_\theta(x_t, t) \right) + \sigma_t \cdot z
$$

Where:
- $x_t$: Noisy sample at timestep $t$
- $\epsilon_\theta(x_t, t)$: Noise predicted by the denoising network
- $\alpha_t = 1 - \beta_t$: Signal retention at timestep $t$
- $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$: Cumulative product over timesteps
- $\sigma_t$: Standard deviation of the noise added at timestep $t$
- $z \sim \mathcal{N}(0, I)$: Standard Gaussian noise
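Putting the pieces together, the full sampling loop might look like the sketch below. No trained network is available here, so the example uses a hypothetical "oracle" noise predictor for a toy dataset in which every clean sample equals a constant `mu`; for that dataset the true noise can be computed exactly from $x_t$. The choice $\sigma_t^2 = \beta_t$ and the convention of adding no noise at the final step are assumptions of this sketch.

```python
import numpy as np

def sample(eps_model, shape, betas, rng):
    """Run the reverse process from x_T ~ N(0, I) down to x_0."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.normal(size=shape)                       # start from pure noise
    for t in reversed(range(len(betas))):            # index t stands for timestep t + 1
        eps_pred = eps_model(x, t)
        coef = (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_pred) / np.sqrt(alphas[t])
        # Assumption: sigma_t^2 = beta_t, and no noise is added at the last step.
        z = rng.normal(size=shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * z
    return x

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)
mu = np.full((2, 2), 3.0)                            # toy "dataset": every x_0 equals mu

def oracle_eps(x_t, t):
    """Exact noise for the toy dataset; a trained network would replace this."""
    return (x_t - np.sqrt(alpha_bars[t]) * mu) / np.sqrt(1.0 - alpha_bars[t])

x0 = sample(oracle_eps, mu.shape, betas, np.random.default_rng(0))
print(np.allclose(x0, mu))  # True
```

With a real model, `oracle_eps` would be replaced by a forward pass of the trained denoising network, and the result would be a new sample rather than a fixed constant.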





## 4. Training Objective in Diffusion Models

### Purpose

The goal of training a diffusion model is to teach the neural network to **predict the noise** added to a sample at each timestep during the forward diffusion process.

For a randomly drawn timestep $t$, the model's estimate of the noise $\epsilon$ is scored with the simplified objective:

$$
\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0, \epsilon, t} \left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right]
$$

Where:

* $x_0$: Original clean data sample
* $\epsilon$: Actual Gaussian noise added
* $x_t$: Noisy version of $x_0$ at timestep $t$
* $\epsilon_\theta(x_t, t)$: Noise predicted by the neural network with parameters $\theta$

<p align="center">
  <img src="https://i.postimg.cc/s215N1dj/Screenshot-2025-06-18-161207.png" alt="Figure 3: The training and sampling algorithms" width="800"/>
</p>



<p align="center"><b>Figure 3:</b> The training and sampling algorithms </p>

### Interpretation

> The model is trained to reverse the noise addition process by minimizing the **Mean Squared Error (MSE)** between the actual noise and the predicted noise.
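One stochastic estimate of this objective can be sketched in NumPy as follows. The helper name `simple_loss` and the placeholder model (which always predicts zero noise) are assumptions for illustration; a real training loop would use a neural network and backpropagate through this value. The sketch averages the squared error over all elements, which differs from the squared norm only by a constant factor.

```python
import numpy as np

def simple_loss(eps_model, x0_batch, alpha_bars, rng):
    """Monte Carlo estimate of the noise-prediction MSE for one batch."""
    batch = x0_batch.shape[0]
    t = rng.integers(0, len(alpha_bars), size=batch)   # one random timestep per sample
    eps = rng.normal(size=x0_batch.shape)              # the noise the model must recover
    # Reshape alpha_bar_t so it broadcasts over the non-batch dimensions.
    a = alpha_bars[t].reshape(batch, *([1] * (x0_batch.ndim - 1)))
    x_t = np.sqrt(a) * x0_batch + np.sqrt(1.0 - a) * eps
    eps_pred = eps_model(x_t, t)
    return np.mean((eps - eps_pred) ** 2)

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x0_batch = rng.normal(size=(64, 8, 8))                 # stand-in for a batch of images

def zero_model(x_t, t):
    """Placeholder 'network' that always predicts zero noise."""
    return np.zeros_like(x_t)

loss = simple_loss(zero_model, x0_batch, alpha_bars, rng)
print(loss)  # close to 1, the variance of the standard Gaussian noise
```

A model that ignores its input scores roughly 1 on this loss, so any trained network should land clearly below that baseline.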



### Variants of the Loss

There are multiple formulations of the training objective, including:

1. **Predicting the noise** $\epsilon$:
   This is the most common and simple approach.

2. **Predicting the original image** $x_0$:
   Loss is based on the difference between the true and reconstructed image.

3. **Predicting the mean of** $x_{t-1}$:
   Involves estimating parameters of the reverse distribution.
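The first two parameterizations carry the same information, because the forward equation $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ can be solved for either $x_0$ or $\epsilon$. A minimal sketch of the conversion, with hypothetical helper names:

```python
import numpy as np

def x0_from_eps(x_t, eps, alpha_bar_t):
    """Recover the clean image implied by a noise prediction."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)

def eps_from_x0(x_t, x0, alpha_bar_t):
    """Recover the noise implied by a clean-image prediction."""
    return (x_t - np.sqrt(alpha_bar_t) * x0) / np.sqrt(1.0 - alpha_bar_t)

rng = np.random.default_rng(0)
alpha_bar_t = 0.3                                   # illustrative value
x0 = rng.normal(size=(4, 4))
eps = rng.normal(size=(4, 4))
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

print(np.allclose(x0_from_eps(x_t, eps, alpha_bar_t), x0))   # True
print(np.allclose(eps_from_x0(x_t, x0, alpha_bar_t), eps))   # True
```

The variants therefore differ mainly in how errors are weighted across timesteps, not in what the network can represent.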






------
#  Connection Between Diffusion Models and Score-Based Generative Models (SGMs)

Diffusion models and **Score-Based Generative Models (SGMs)** are two families of generative models that are **closely related**. In fact, diffusion models can be seen as a special case of SGMs. Here's a breakdown of the connection:



###  What are Score-Based Generative Models?

Score-Based Generative Models learn a **score function**, which is the gradient of the log-probability density of data:

$$
\nabla_{x} \log p(x)
$$

This score tells us **which direction to move a data point to make it more likely** under the true data distribution.
Since the true distribution $p(x)$ is unknown, SGMs learn to approximate this score from noisy versions of the data.



###  How this connects to Diffusion Models

In **diffusion models**, we gradually add noise to data through a forward process, producing a sequence $x_0 \rightarrow x_1 \rightarrow \cdots \rightarrow x_T$.
At each timestep $t$, the model tries to reverse this process.

Diffusion models learn to predict the **noise** that was added to produce the noisy sample $x_t$, using a model $\epsilon_\theta(x_t, t)$.
It turns out this noise prediction is **mathematically related** to the score function:

> The predicted noise is **proportional** to the negative score:
>
> $$
> \epsilon_\theta(x_t, t) \propto -\nabla_{x_t} \log p(x_t)
> $$

This means that **diffusion models are implicitly learning the score function**, just like SGMs.
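The proportionality constant can be made explicit. Under the forward process, $x_t$ given $x_0$ is Gaussian with mean $\sqrt{\bar{\alpha}_t}\, x_0$ and covariance $(1 - \bar{\alpha}_t) I$, so the score of this conditional distribution is:

$$
\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1 - \bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}
$$

The second equality substitutes $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. Replacing the true noise with the network's prediction gives the approximation usually quoted for the marginal score:

$$
\nabla_{x_t} \log p(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}
$$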



###  Summary of the Connection

| Aspect            | Score-Based Models                         | Diffusion Models                    |
| ----------------- | ------------------------------------------ | ----------------------------------- |
| Goal              | Learn score function: $\nabla_x \log p(x)$ | Learn to predict noise $\epsilon$   |
| Method            | Train on noisy data using score matching   | Train on noisy data using MSE loss  |
| Generation method | Reverse SDE or Langevin Dynamics           | Reverse diffusion process           |
| Noise estimation  | Explicit score learning                    | Implicit score via noise prediction |
| Common Ground     | Learn how to denoise or reverse noise      | Learn how to denoise or reverse noise |



###  Intuition

Imagine you're climbing a hill (i.e., generating a sample). The **score function** tells you which direction is uphill (toward high data likelihood).
By learning to **undo the noise step-by-step**, diffusion models are effectively **learning how to climb back up the hill** from pure noise to realistic data — just like SGMs.
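The hill-climbing picture can be tried on a toy problem with Langevin dynamics, where the score is known exactly. The sketch below assumes a one-dimensional Gaussian target with mean 2 and standard deviation 1, so the score is $-(x - \mu)/\sigma^2$; the step size and number of steps are illustrative choices. In a score-based model, the analytic `score` function would be replaced by a learned network.

```python
import numpy as np

def langevin_sample(score_fn, x_init, step_size, n_steps, rng):
    """Unadjusted Langevin dynamics: follow the score uphill while injecting noise."""
    x = x_init.copy()
    for _ in range(n_steps):
        z = rng.normal(size=x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * z
    return x

mu, sigma = 2.0, 1.0                      # assumed toy target: N(2, 1)

def score(x):
    """Exact score of the Gaussian target, i.e. the gradient of its log-density."""
    return -(x - mu) / sigma ** 2

rng = np.random.default_rng(0)
x_init = rng.normal(size=5000) * 5.0      # start far from the target distribution
samples = langevin_sample(score, x_init, step_size=0.1, n_steps=500, rng=rng)
print(samples.mean(), samples.std())      # roughly 2 and 1
```

The injected noise keeps the chains from collapsing onto the single most likely point, so the result is a set of samples from the distribution rather than its mode.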




#  Diffusion Models: Advantages, Disadvantages & Applications

---

##  Advantages of Diffusion Models

### 1. High-Quality Sample Generation
- Achieve **state-of-the-art performance** in image and audio synthesis.
- Often surpass GANs in **visual fidelity and realism**.
- **Examples**: `DALL·E 2`, `Imagen`, `Stable Diffusion`.



### 2. Training Stability
- Based on a **well-defined and simple objective** (e.g., Mean Squared Error).
- Avoids **adversarial instability** seen in GANs.
- Results in **more predictable and stable training**.



### 3. Diverse Sample Generation
- Less prone to **mode collapse** (a problem in GANs where outputs lack variety).
- Generates **rich and varied samples** even from similar noise inputs.



### 4. Flexible Conditioning
- Can be easily **conditioned** on:
  - Text prompts
  - Class labels
  - Images
  - Audio
- Enables powerful tasks like **text-to-image generation** and **guided image editing**.

---

##  Disadvantages of Diffusion Models

### 1. Computational Cost
- **Slow inference** due to hundreds or thousands of sequential denoising steps.
- Much **slower than VAEs or GANs**, which generate a sample in a single forward pass.



### 2. High Memory Usage
- Architectures like large U-Nets or Transformers require:
  - **High GPU memory**
  - **Longer training time**
- Can be challenging for deployment on **resource-constrained systems**.



##  Applications of Diffusion Models

### 1. High-Fidelity Image Generation
- Generate photorealistic and artistic images from random noise.
- Used in systems like:
  - `DALL·E 2`
  - `Midjourney`
  - `Stable Diffusion`



### 2. Text-to-Image Synthesis
- Convert **natural language prompts** into detailed images.
- Example prompt: _“A cat wearing sunglasses in space”_



### 3. Image Editing & Inpainting
- **Modify or restore images** by:
  - Filling missing parts
  - Replacing specific regions
- Useful in **photo restoration**, **inpainting**, and **creative editing**.



### 4. Audio Generation
- Generate **music**, **speech**, and **sound effects**.
- Common models:
  - `DiffWave`
  - `AudioLDM`



### 5. Video Generation *(Emerging)*
- Generate **short videos** with consistent motion and style.
- Applications include:
  - Prompt-to-video generation
  - Frame interpolation
  - Motion-guided synthesis



