# Diffusion models overview

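This notebook gives a code-level overview of diffusion probabilistic models: a summary of the theory, followed by a minimal forward process, denoiser, and training loop on MNIST.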



### Table of contents
1. [Generating synthetic data for regression tasks](#synthetic)
2. [Creating gradient descent optimizer algorithms (INNOVATORS)](#optimizer)
3. [Performing the gradient descent training loop (BENCHMARKERS)](#train)
4. [Verifying gradient descent computation (BLOCKCHAIN)](#verify)

In [1]:
%load_ext autoreload
%autoreload 2

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np

# Use a GPU if one is available; otherwise fall back to the CPU
dev = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


<a id="synthetic"></a>
## 1. Generating synthetic data for regression tasks

The summary below covers the main concepts, organized by topic, and shows how they connect.

---

## **1. What is a DDPM?**

- **DDPM** (Denoising Diffusion Probabilistic Model) is a generative model that gradually **adds noise to data** (forward process) and learns to **reverse this noise** (reverse process).
- The forward process is a **Markov chain** that adds small Gaussian noise at each step.
- The reverse process is modeled by a neural network trained to predict the noise that was added (equivalently, to denoise the sample).
- Sampling: start with noise and apply the learned reverse process to generate data.
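
The sampling step described above can be sketched as follows. This is a minimal sketch, not a reference implementation: it assumes a linear beta schedule, a noise-predicting model with the call signature `model(x, t)`, and the simple variance choice sigma_t^2 = beta_t. The name `ddpm_sample` and the zero-output stand-in model are hypothetical and exist only so the loop can run.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, timesteps=1000, beta_start=1e-4, beta_end=0.02, device="cpu"):
    """Ancestral sampling sketch: start from N(0, I) and denoise step by step."""
    betas = torch.linspace(beta_start, beta_end, timesteps, device=device)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)  # x_T ~ N(0, I)
    for i in reversed(range(timesteps)):
        t = torch.full((shape[0],), i, device=device, dtype=torch.long)
        eps = model(x, t)  # predicted noise
        # Mean of x_{t-1} given x_t and the predicted noise
        mean = (x - betas[i] / torch.sqrt(1.0 - alphas_cumprod[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            x = mean + torch.sqrt(betas[i]) * torch.randn_like(x)  # assumes sigma_t^2 = beta_t
        else:
            x = mean  # no noise is added on the final step
    return x

# Hypothetical stand-in for a trained network: always predicts zero noise
zero_model = lambda x, t: torch.zeros_like(x)
samples = ddpm_sample(zero_model, shape=(4, 1, 28, 28), timesteps=50)
print(samples.shape)
```

With a trained noise predictor in place of `zero_model`, the same loop would produce samples from the learned distribution.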

---

## **2. Connection to Score-Based Models (SGMs)**

- SGMs learn the **score function** \( \nabla_x \log p_t(x) \), the gradient of the log-probability of data at time \( t \).
- SGMs define a **continuous-time noising process** using an **SDE**.
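- DDPMs can be viewed as a **discretization** of one particular SGM, the variance-preserving (VP) SDE, with a specific noise schedule.
- Predicting noise in DDPMs is equivalent to estimating the score function, up to a time-dependent scale factor: \( \nabla_x \log p_t(x_t) \approx -\epsilon_\theta(x_t, t) / \sqrt{1 - \bar{\alpha}_t} \).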
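
The link between noise prediction and the score can be checked numerically. The sketch below assumes the usual DDPM notation, with `alpha_bar_t` as the cumulative product of `1 - beta`, and uses a toy case where all data sits at `x_0 = 0`, so the exact score is known. The helper name `score_from_noise` is hypothetical.

```python
import torch

def score_from_noise(eps_pred, alpha_bar_t):
    """Convert a noise prediction into a score estimate: -eps / sqrt(1 - alpha_bar_t)."""
    return -eps_pred / torch.sqrt(1.0 - alpha_bar_t)

# Toy check: with x_0 = 0, the marginal is p_t = N(0, (1 - alpha_bar_t) I)
torch.manual_seed(0)
alpha_bar_t = torch.tensor(0.7)
eps = torch.randn(5)
x_t = torch.sqrt(1.0 - alpha_bar_t) * eps      # forward process applied to x_0 = 0
exact_score = -x_t / (1.0 - alpha_bar_t)       # score of N(0, (1 - alpha_bar_t) I)
matches = torch.allclose(score_from_noise(eps, alpha_bar_t), exact_score)
print(matches)
```

The scale factor depends on `t` through `alpha_bar_t`, so it is fixed at each timestep but not the same across timesteps.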

---

## **3. Reverse-Time SDE and its Derivation**

- The forward SDE:
  \[
  dx = f(x, t) \, dt + g(t) \, dW_t
  \]
- The **reverse-time SDE** (via Anderson's theorem):
  \[
  dx = \left[f(x, t) - g(t)^2 \nabla_x \log p_t(x) \right] dt + g(t) \, d\bar{W}_t
  \]
- It runs **backward in time**, and the key to reversing it is the **score function**.
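
To make the reverse-time SDE concrete, here is a small Euler-Maruyama sketch on a 1D toy problem where the score is available in closed form. Everything specific is an assumption chosen for illustration: the forward SDE is taken to be `dx = -0.5*beta*x dt + sqrt(beta) dW` with constant `beta`, and the data distribution is `N(m, s^2)`, so every marginal `p_t` is Gaussian. In practice a trained network would stand in for the exact score.

```python
import numpy as np

def reverse_sde_sample(n=10_000, m=2.0, s=0.5, beta=2.0, T=5.0, steps=1000, seed=0):
    """Integrate the reverse-time SDE from t = T down to t = 0 (toy 1D example)."""
    rng = np.random.default_rng(seed)
    dt = T / steps
    x = rng.standard_normal(n)  # approximate p_T by N(0, 1)
    for k in range(steps, 0, -1):
        t = k * dt
        # Closed-form Gaussian marginal p_t = N(mean_t, var_t) for this toy setup
        mean_t = m * np.exp(-0.5 * beta * t)
        var_t = s**2 * np.exp(-beta * t) + 1.0 - np.exp(-beta * t)
        score = -(x - mean_t) / var_t
        drift = -0.5 * beta * x - beta * score  # f(x, t) - g(t)^2 * score
        # Step backward in time: dt enters with a minus sign, the noise is fresh
        x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(n)
    return x

samples = reverse_sde_sample()
print(f"mean={samples.mean():.3f}, std={samples.std():.3f}")  # should land near m and s
```

Starting from pure noise, the samples end up close to the assumed data distribution, which is the behavior the reverse-time SDE is meant to deliver.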

---

## **4. Is Reverse-Time Wiener Process Just -dW?**

- Not exactly: \( d\bar{W}_t \neq -dW_t \)
- A common way to define the reverse Wiener process is \( \bar{W}_t = W_T - W_{T - t} \), which is again a standard Wiener process (it **has the same distribution**) but follows a different path from \( -W_t \).
- Reversing time involves reversing the entire trajectory, not just flipping noise signs.
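
A quick simulation illustrates the points above, taking the definition \( \bar{W}_t = W_T - W_{T - t} \) at face value on a discrete grid. The grid size and seed are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 1.0, 100_000
dt = T / n

# Forward Wiener path on a grid: W_0 = 0, increments ~ N(0, dt)
dW = rng.normal(0.0, np.sqrt(dt), size=n)
W = np.concatenate([[0.0], np.cumsum(dW)])

# Time-reversed path: W_bar_t = W_T - W_{T - t}
W_bar = W[-1] - W[::-1]
dW_bar = np.diff(W_bar)

starts_at_zero = np.isclose(W_bar[0], 0.0)
same_increments_reversed = np.allclose(dW_bar, dW[::-1])   # same increments, reverse order
variance_matches = np.isclose(dW_bar.var(), dt, rtol=0.05)  # increment variance ~ dt
is_sign_flip = np.allclose(W_bar, -W)                       # would hold if W_bar were just -W

print(starts_at_zero, same_increments_reversed, variance_matches, is_sign_flip)
```

The reversed path reuses the forward increments in reverse order without negating them, which is why it has Wiener statistics yet is not a sign-flipped copy of the forward path.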

---

## **5. Fokker-Planck Equations (FPE) and Time Reversal**

- The Fokker-Planck equation governs how the **density \( p(x, t) \)** evolves under an SDE.
- Forward FPE:
  \[
  \frac{\partial p}{\partial t} = -\nabla \cdot (fp) + \frac{g^2}{2} \nabla^2 p
  \]
- Reversing the FPE gives the reverse drift:
  \[
  \tilde{f}(x, t) = f(x, t) - g(t)^2 \nabla_x \log p(x, t)
  \]
- This confirms that the **reverse SDE** guides samples back to the data distribution using the score function.
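
The step from the forward FPE to the reverse drift can be sketched as follows. The derivation assumes a scalar diffusion coefficient \( g(t) \) and introduces a reversed time variable \( \tau = T - t \) with \( q(x, \tau) = p(x, T - \tau) \); both names are notation added here for illustration.

```latex
% Step 1: rewrite the diffusion term using \nabla p = p \, \nabla_x \log p
\frac{g^2}{2} \nabla^2 p = \nabla \cdot \left( \frac{g^2}{2} \, p \, \nabla_x \log p \right)

% Step 2: change to reversed time, so that \partial_\tau q = -\partial_t p
\frac{\partial q}{\partial \tau}
  = \nabla \cdot (f q) - \frac{g^2}{2} \nabla^2 q

% Step 3: add and subtract \frac{g^2}{2} \nabla^2 q, then absorb g^2 \nabla^2 q into the drift
\frac{\partial q}{\partial \tau}
  = \nabla \cdot (f q) - g^2 \nabla^2 q + \frac{g^2}{2} \nabla^2 q
  = -\nabla \cdot \left( \left[ -f + g^2 \nabla_x \log p \right] q \right) + \frac{g^2}{2} \nabla^2 q
```

The last line has the form of a forward FPE in \( \tau \) with drift \( -f + g^2 \nabla_x \log p \) and the same diffusion coefficient. Since \( d\tau = -dt \), this corresponds to the drift \( f - g^2 \nabla_x \log p \) when written in terms of \( dt \), matching the reverse drift above.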

---

## **Key Takeaways**

- DDPMs and SGMs are two perspectives on the same core idea: **modeling and reversing diffusion**.
- The reverse-time SDE is **not just flipping noise**; it's based on **score gradients**.
- The Fokker-Planck formalism shows how probability mass evolves and **justifies the reverse dynamics**.

---


In [None]:
def forward_diffusion(x_0, timesteps, beta_start=1e-4, beta_end=0.02):
    """
    Forward diffusion: samples x_t ~ q(x_t | x_0) in closed form at one
    randomly chosen timestep per image.

    Args:
    - x_0: Original data, a batch of images with shape (B, C, H, W)
    - timesteps: Number of diffusion steps
    - beta_start, beta_end: Endpoints of the linear noise schedule

    Returns:
    - x_t: Noisy images, same shape as x_0
    - t: Sampled timestep index for each image, shape (B,)
    - noise: The Gaussian noise that was added, same shape as x_0
    """
    device = x_0.device
    betas = torch.linspace(beta_start, beta_end, timesteps, device=device)  # Linear noise schedule
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)  # Cumulative product of alphas

    noise = torch.randn_like(x_0)  # Created on the same device as x_0
    # Use a separate name so the integer argument `timesteps` is not overwritten
    t = torch.randint(0, timesteps, (x_0.shape[0],), device=device)  # Random timestep for each image
    sqrt_alpha_cumprod = torch.sqrt(alphas_cumprod[t])[:, None, None, None]
    sqrt_one_minus_alpha_cumprod = torch.sqrt(1 - alphas_cumprod[t])[:, None, None, None]

    x_t = sqrt_alpha_cumprod * x_0 + sqrt_one_minus_alpha_cumprod * noise  # Noisy sample
    return x_t, t, noise
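
One thing worth checking with the linear schedule in `forward_diffusion` is how much of the original signal survives at the last timestep. The sketch below computes the final cumulative product for two step counts; the helper name `terminal_alpha_bar` is hypothetical, and the comparison value of 1000 steps is an assumption chosen because these beta endpoints are commonly paired with it.

```python
import torch

def terminal_alpha_bar(timesteps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta) at the final step of a linear schedule."""
    betas = torch.linspace(beta_start, beta_end, timesteps, dtype=torch.float64)
    return torch.cumprod(1.0 - betas, dim=0)[-1].item()

for T in (100, 1000):
    a = terminal_alpha_bar(T)
    print(f"T={T}: alpha_bar_T={a:.2e}, remaining signal scale={a ** 0.5:.3f}")
```

With only 100 steps, a sizeable share of the signal remains in `x_T`, so `x_T` is not close to pure Gaussian noise. When sampling starts from `N(0, I)`, it may help to use more steps or a larger `beta_end`.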


In [None]:
class SimpleDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)  # Input 1-channel (grayscale)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.conv3 = nn.Conv2d(64, 32, 3, padding=1)
        self.conv4 = nn.Conv2d(32, 1, 3, padding=1)

        self.relu = nn.ReLU()
    
    def forward(self, x, t):
        """
        Forward pass through denoising network.
        x: Noisy image
        t: Timestep information (not used here but could be added via embeddings)
        """
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = self.relu(self.conv3(x))
        x = self.conv4(x)  # Predict noise
        return x
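
Since the network above ignores `t`, it has to denoise every noise level with the same function. One way to add timestep information, sketched here under assumptions, is a sinusoidal embedding passed through a small MLP and added to the feature maps. The names `sinusoidal_embedding` and `TimeConditionedDenoiser`, the layer sizes, and the point where the embedding is injected are all illustrative choices, not part of the original model.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t, dim):
    """Map integer timesteps of shape (B,) to sinusoidal features of shape (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=1)

class TimeConditionedDenoiser(nn.Module):
    """Hypothetical small conv denoiser that also conditions on the timestep."""
    def __init__(self, channels=32, emb_dim=64):
        super().__init__()
        self.emb_dim = emb_dim
        self.time_mlp = nn.Sequential(
            nn.Linear(emb_dim, channels), nn.ReLU(), nn.Linear(channels, channels)
        )
        self.conv_in = nn.Conv2d(1, channels, 3, padding=1)
        self.conv_mid = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_out = nn.Conv2d(channels, 1, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x, t):
        emb = self.time_mlp(sinusoidal_embedding(t, self.emb_dim))  # (B, channels)
        h = self.relu(self.conv_in(x))
        h = h + emb[:, :, None, None]  # broadcast the time embedding over H and W
        h = self.relu(self.conv_mid(h))
        return self.conv_out(h)  # predicted noise, same shape as x

# Usage: same call signature as the unconditioned denoiser
torch.manual_seed(0)
time_model = TimeConditionedDenoiser()
x_demo = torch.randn(2, 1, 28, 28)
out_demo = time_model(x_demo, torch.tensor([0, 50]))
print(out_demo.shape)
```

Because the call signature is still `model(x, t)`, a module like this could be used in place of the unconditioned one without changing the surrounding code.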


In [None]:
# Hyperparameters
epochs = 10
timesteps = 100
lr = 1e-3
batch_size = 128

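# Load MNIST dataset, rescaled from [0, 1] to [-1, 1] so pixel values are centered like the Gaussian noise
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])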
train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transform, download=True)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Model, loss function, and optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleDenoiser().to(device)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.MSELoss()

# Training loop
for epoch in range(epochs):
    running_loss = 0.0
    for images, _ in train_loader:
        images = images.to(device)
        x_t, t, noise = forward_diffusion(images, timesteps)

        optimizer.zero_grad()
        noise_pred = model(x_t, t)
        loss = criterion(noise_pred, noise)  # MSE between predicted and true noise
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    # Report the mean loss over the epoch rather than the last batch only
    print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader):.4f}")

print("Training complete!")
