# Assignment 4: Advanced Image Generation
## Diffusion Models and Energy-Based Models

**Student**: my2878  
**Course**: Generative AI  
**Date**: November 2025

---

## Overview

This notebook implements two advanced generative models:

1. **Diffusion Model (DDPM)** - Denoising Diffusion Probabilistic Model
2. **Energy-Based Model (EBM)** - Using Langevin Dynamics

Both models are trained on the **CIFAR-10** dataset and integrated into a FastAPI application.

---

## Table of Contents

1. [Setup and Imports](#setup)
2. [Part 1: Diffusion Model Implementation](#diffusion)
3. [Part 2: Energy-Based Model Implementation](#energy)
4. [Part 3: Training on CIFAR-10](#training)
5. [Part 4: Theory Questions](#theory)
6. [Part 5: Results and Visualization](#results)
7. [Part 6: API Integration](#api)


---

## 1. Setup and Imports


In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
import math
from typing import Optional, Tuple
import warnings
warnings.filterwarnings('ignore')

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 
                     'mps' if torch.backends.mps.is_available() else 'cpu')
print(f'Using device: {device}')

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)


Using device: mps


---

## 2. Part 1: Diffusion Model Implementation

### 2.1 Sinusoidal Time Embedding

The sinusoidal time embedding provides a continuous representation of timesteps using sine and cosine functions.

**Mathematical Formula:**

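For timestep $t$ and embedding dimension $d$, each frequency index $i = 0, \dots, d/2 - 1$ contributes one sine and one cosine entry. The formula interleaves them; the implementation below uses the same frequencies but stores all sine terms first and all cosine terms second, which only permutes the dimensions: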

$$
\text{embedding}[2i] = \sin\left(\frac{t}{10000^{2i/d}}\right)
$$

$$
\text{embedding}[2i+1] = \cos\left(\frac{t}{10000^{2i/d}}\right)
$$
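
As a quick sanity check, the formula can be evaluated by hand for $t = 1$ and $d = 8$. The sketch below uses only the standard library and follows the interleaved layout literally; the helper name `interleaved_embedding` is illustrative and not part of the notebook.

```python
import math

def interleaved_embedding(t, d, max_period=10000):
    """Evaluate the formula literally: sin at even indices, cos at odd indices."""
    emb = [0.0] * d
    for i in range(d // 2):
        angle = t / (max_period ** (2 * i / d))
        emb[2 * i] = math.sin(angle)
        emb[2 * i + 1] = math.cos(angle)
    return emb

emb = interleaved_embedding(t=1, d=8)
sines, cosines = emb[0::2], emb[1::2]   # split the interleaved entries back apart
print("sines  :", [round(v, 6) for v in sines])
print("cosines:", [round(v, 6) for v in cosines])
```

For $d = 8$ the angles are $1$, $0.1$, $0.01$ and $0.001$, so the sine entries shrink toward zero while the cosine entries approach one.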


In [2]:
class SinusoidalTimeEmbedding(nn.Module):
    """
    Sinusoidal Time Embedding for diffusion timesteps.
    
    This provides a continuous, deterministic embedding for each timestep
    using sine and cosine functions at different frequencies.
    """
    
    def __init__(self, embedding_dim=128, max_period=10000):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.max_period = max_period
    
    def forward(self, timesteps):
        """
        Args:
            timesteps: (batch_size,) tensor of timestep indices
        
        Returns:
            (batch_size, embedding_dim) tensor of embeddings
        """
        device = timesteps.device
        half_dim = self.embedding_dim // 2
        
        # Calculate frequency scaling
        frequencies = torch.exp(
            -math.log(self.max_period) * torch.arange(half_dim, device=device) / half_dim
        )
        
        # Compute arguments: t * frequency
        args = timesteps[:, None].float() * frequencies[None, :]
        
        # Concatenate sin and cos components
        embedding = torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
        
        return embedding

# Test the time embedding
time_embed = SinusoidalTimeEmbedding(embedding_dim=8, max_period=10000)
t = torch.tensor([1])
embedding = time_embed(t)
print("Time Embedding for t=1, d=8:")
print(embedding.numpy()[0])
print("\nThis matches our theoretical calculation!")


Time Embedding for t=1, d=8:
[0.84147096 0.09983341 0.00999983 0.001      0.54030234 0.9950042
 0.99995    0.9999995 ]

This matches our theoretical calculation!


### 2.2 UNet Building Blocks

The UNet is assembled from residual blocks that inject the time embedding and a self-attention block over spatial positions.


In [3]:
class ResidualBlock(nn.Module):
    """Residual block with time embedding injection."""
    
    def __init__(self, in_channels, out_channels, time_emb_dim, dropout=0.1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.time_mlp = nn.Linear(time_emb_dim, out_channels)
        self.norm1 = nn.GroupNorm(8, out_channels)
        self.norm2 = nn.GroupNorm(8, out_channels)
        self.dropout = nn.Dropout(dropout)
        
        if in_channels != out_channels:
            self.residual_conv = nn.Conv2d(in_channels, out_channels, 1)
        else:
            self.residual_conv = nn.Identity()
    
    def forward(self, x, time_emb):
        residual = x
        x = self.conv1(x)
        x = self.norm1(x)
        x = F.relu(x)
        
        # Inject time embedding
        time_emb = self.time_mlp(F.relu(time_emb))
        x = x + time_emb[:, :, None, None]
        
        x = self.conv2(x)
        x = self.norm2(x)
        x = self.dropout(x)
        x = F.relu(x)
        
        return x + self.residual_conv(residual)


class AttentionBlock(nn.Module):
    """Self-attention block for spatial features."""
    
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)
        self.qkv = nn.Conv2d(channels, channels * 3, 1)
        self.proj = nn.Conv2d(channels, channels, 1)
    
    def forward(self, x):
        B, C, H, W = x.shape
        residual = x
        
        x = self.norm(x)
        qkv = self.qkv(x)
        q, k, v = torch.chunk(qkv, 3, dim=1)
        
        q = q.reshape(B, C, H * W).permute(0, 2, 1)
        k = k.reshape(B, C, H * W)
        v = v.reshape(B, C, H * W).permute(0, 2, 1)
        
        attn = torch.bmm(q, k) / math.sqrt(C)
        attn = F.softmax(attn, dim=-1)
        out = torch.bmm(attn, v)
        out = out.permute(0, 2, 1).reshape(B, C, H, W)
        out = self.proj(out)
        
        return out + residual

print("✓ UNet building blocks defined")


✓ UNet building blocks defined


### 2.3 Complete UNet Architecture

The UNet predicts the noise $\epsilon$ that was added to create the noisy image at timestep $t$.
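
For context, the noisy input is usually produced by the DDPM forward process $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. The sketch below illustrates that step under assumed settings: a linear $\beta$ schedule from 1e-4 to 0.02 over 1000 steps and a hypothetical helper named `q_sample`. The schedule used for the actual training run may differ.

```python
import torch

# Assumed schedule: linear betas over T steps (illustrative values only).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t for every t

def q_sample(x0, t, noise):
    """Return x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)  # broadcast over (C, H, W)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise

x0 = torch.randn(4, 3, 32, 32)      # stand-in for a batch of CIFAR-10 images
t = torch.randint(0, T, (4,))       # one timestep per sample
noise = torch.randn_like(x0)        # the epsilon the network is asked to predict
x_t = q_sample(x0, t, noise)

# A noise-prediction network eps_theta(x_t, t) would then be trained with an
# MSE loss between its output and `noise`.
print(x_t.shape)
```

Because $\bar\alpha_t$ decreases with $t$, larger timesteps keep less of $x_0$ and more of the noise.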


In [4]:
# Simplified UNet for CIFAR-10 (32x32 images)
class SimpleUNet(nn.Module):
    """Simplified UNet for CIFAR-10 diffusion model."""
    
    def __init__(self, in_channels=3, out_channels=3, time_emb_dim=128):
        super().__init__()
        
        # Time embedding
        self.time_embed = nn.Sequential(
            SinusoidalTimeEmbedding(time_emb_dim),
            nn.Linear(time_emb_dim, time_emb_dim * 4),
            nn.ReLU(),
            nn.Linear(time_emb_dim * 4, time_emb_dim * 4)
        )
        
        # Encoder
        self.init_conv = nn.Conv2d(in_channels, 64, 3, padding=1)
        self.down1 = ResidualBlock(64, 128, time_emb_dim * 4)
        self.down2 = ResidualBlock(128, 256, time_emb_dim * 4)
        self.downsample1 = nn.Conv2d(128, 128, 3, stride=2, padding=1)
        self.downsample2 = nn.Conv2d(256, 256, 3, stride=2, padding=1)
        
        # Bottleneck
        self.mid1 = ResidualBlock(256, 256, time_emb_dim * 4)
        self.mid_attn = AttentionBlock(256)
        self.mid2 = ResidualBlock(256, 256, time_emb_dim * 4)
        
        # Decoder
        self.up1 = ResidualBlock(512, 128, time_emb_dim * 4)  # 256 + 256
        self.up2 = ResidualBlock(256, 64, time_emb_dim * 4)   # 128 + 128
        self.upsample1 = nn.ConvTranspose2d(256, 256, 4, stride=2, padding=1)
        self.upsample2 = nn.ConvTranspose2d(128, 128, 4, stride=2, padding=1)
        
        # Output
        self.final_norm = nn.GroupNorm(8, 64)
        self.final_conv = nn.Conv2d(64, out_channels, 3, padding=1)
    
    def forward(self, x, t):
        # Time embedding
        t_emb = self.time_embed(t)
        
        # Initial conv
        x = self.init_conv(x)
        
        # Encoder
        x = self.down1(x, t_emb)
        skip1 = x
        x = self.downsample1(x)
        
        x = self.down2(x, t_emb)
        skip2 = x
        x = self.downsample2(x)
        
        # Bottleneck
        x = self.mid1(x, t_emb)
        x = self.mid_attn(x)
        x = self.mid2(x, t_emb)
        
        # Decoder
        x = self.upsample1(x)
        x = torch.cat([x, skip2], dim=1)
        x = self.up1(x, t_emb)
        
        x = self.upsample2(x)
        x = torch.cat([x, skip1], dim=1)
        x = self.up2(x, t_emb)
        
        # Output
        x = self.final_norm(x)
        x = F.relu(x)
        x = self.final_conv(x)
        
        return x

# Test UNet
unet = SimpleUNet().to(device)
num_params = sum(p.numel() for p in unet.parameters())
print(f"✓ UNet created with {num_params:,} parameters")

# Test forward pass
test_x = torch.randn(2, 3, 32, 32).to(device)
test_t = torch.tensor([0, 100]).to(device)
test_out = unet(test_x, test_t)
print(f"✓ Forward pass successful: {test_x.shape} -> {test_out.shape}")


✓ UNet created with 7,719,747 parameters
✓ Forward pass successful: torch.Size([2, 3, 32, 32]) -> torch.Size([2, 3, 32, 32])


---

## 3. Part 2: Theory Questions - Building Blocks of Energy Model

### Question 6: Basic Gradient Calculations

Understanding how PyTorch tracks gradients through computational graphs.


In [5]:
import torch

print("=" * 70)
print("Question 6: Basic Gradient Calculations")
print("=" * 70)

# Create a tensor with requires_grad=True
x = torch.tensor([2.0], requires_grad=True)

# Define a simple function y = x² + 3x
y = x**2 + 3 * x

# Backpropagate
y.backward()

# Print the gradient
print("\nOriginal code result:")
print("x.grad =", x.grad)
print("\nExpected: dy/dx = 2x + 3 = 2(2) + 3 = 7")


Question 6: Basic Gradient Calculations

Original code result:
x.grad = tensor([7.])

Expected: dy/dx = 2x + 3 = 2(2) + 3 = 7


#### Answer 6a: Expected Gradient

For $y = x^2 + 3x$, the gradient is:

$$
\frac{dy}{dx} = 2x + 3
$$

At $x = 2$:
$$
\frac{dy}{dx}\bigg|_{x=2} = 2(2) + 3 = 4 + 3 = 7
$$

**Expected gradient: 7.0**
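
The analytic result can be cross-checked numerically without autograd. The following is a minimal sketch using a central finite difference; the helper name `central_difference` and the step size `h` are illustrative choices.

```python
def f(x):
    return x**2 + 3 * x

def central_difference(func, x, h=1e-5):
    """Approximate func'(x) with a symmetric finite difference."""
    return (func(x + h) - func(x - h)) / (2 * h)

approx = central_difference(f, 2.0)
print(f"numerical dy/dx at x=2: {approx:.6f}")
```

For a quadratic the central difference has no truncation error, so any deviation from 7 comes from floating-point rounding only.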


In [6]:
print("\n" + "=" * 70)
print("Answer 6b: Setting requires_grad=False")
print("=" * 70)

# Create tensor with requires_grad=False
x_no_grad = torch.tensor([2.0], requires_grad=False)
y_no_grad = x_no_grad**2 + 3 * x_no_grad

try:
    y_no_grad.backward()
    print("x_no_grad.grad =", x_no_grad.grad)
except RuntimeError as e:
    print(f"\nError occurred: {e}")
    print("\nExplanation: When requires_grad=False, PyTorch does not build")
    print("a computational graph, so .backward() fails because gradients")
    print("cannot be computed. The tensor y_no_grad also has requires_grad=False.")



Answer 6b: Setting requires_grad=False

Error occurred: element 0 of tensors does not require grad and does not have a grad_fn

Explanation: When requires_grad=False, PyTorch does not build
a computational graph, so .backward() fails because gradients
cannot be computed. The tensor y_no_grad also has requires_grad=False.


In [7]:
print("\n" + "=" * 70)
print("Answer 6c: Default behavior of torch.tensor")
print("=" * 70)

# Create tensor without specifying requires_grad
x_default = torch.tensor([2.0])
y_default = x_default**2 + 3 * x_default

print(f"\nx_default.requires_grad: {x_default.requires_grad}")
print(f"y_default.requires_grad: {y_default.requires_grad}")

print("\nExplanation:")
print("By default, torch.tensor() creates tensors with requires_grad=False.")
print("From PyTorch documentation:")
print("  'requires_grad (bool, optional) – If autograd should record operations")
print("   on the returned tensor. Default: False.'")
print("\nTo enable gradient tracking, you must explicitly set requires_grad=True.")



Answer 6c: Default behavior of torch.tensor

x_default.requires_grad: False
y_default.requires_grad: False

Explanation:
By default, torch.tensor() creates tensors with requires_grad=False.
From PyTorch documentation:
  'requires_grad (bool, optional) – If autograd should record operations
   on the returned tensor. Default: False.'

To enable gradient tracking, you must explicitly set requires_grad=True.
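
Gradient tracking can also be switched on after a tensor has been created. The short sketch below uses the in-place method `requires_grad_()` on a tensor built with the default setting and then repeats the Question 6 computation.

```python
import torch

x = torch.tensor([2.0])      # requires_grad defaults to False
print("before:", x.requires_grad)

x.requires_grad_()           # in-place switch; x remains a leaf tensor
y = x**2 + 3 * x
y.backward()
print("after: ", x.requires_grad, x.grad)
```

Only operations performed after the switch are recorded, so `y` must be computed once tracking is enabled.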


---

### Question 7: Introduce Weights

When training neural networks, we're interested in gradients with respect to model weights.


In [8]:
print("\n" + "=" * 70)
print("Question 7: Introduce Weights")
print("=" * 70)

# x tracks gradients; w uses the default requires_grad=False
x = torch.tensor([2.0], requires_grad=True)
w = torch.tensor([1.0, 3.0])

# Define y = w[0]·x² + w[1]·x (equals x² + 3x for w = [1, 3])
y = w[0] * x**2 + w[1] * x

# Backpropagate
y.backward()

# Print the gradient
print("\nGradient with respect to x:")
print("x.grad =", x.grad)

print("\n" + "=" * 70)
print("Answer 7a: Gradient with respect to w")
print("=" * 70)

print("\nAttempting to access w.grad:")
print("w.grad =", w.grad)

print("\nExplanation:")
print("w.grad is None because w was created without requires_grad=True.")
print("By default, torch.tensor() sets requires_grad=False.")
print("PyTorch only computes and stores gradients for tensors that have")
print("requires_grad=True.")



Question 7: Introduce Weights

Gradient with respect to x:
x.grad = tensor([7.])

Answer 7a: Gradient with respect to w

Attempting to access w.grad:
w.grad = None

Explanation:
w.grad is None because w was created without requires_grad=True.
By default, torch.tensor() sets requires_grad=False.
PyTorch only computes and stores gradients for tensors that have
requires_grad=True.


#### Answer 7b: Modified Code to Calculate Gradient w.r.t. w

To compute gradients with respect to $w$, we need to set `requires_grad=True` for $w$.

Mathematical derivation:
$$
y = w_0 \cdot x^2 + w_1 \cdot x
$$

Gradients:
$$
\frac{\partial y}{\partial w_0} = x^2 = 2^2 = 4
$$

$$
\frac{\partial y}{\partial w_1} = x = 2
$$


In [9]:
print("\n" + "=" * 70)
print("Answer 7b: Modified Code - Computing w.grad")
print("=" * 70)

# Create tensors with requires_grad=True
x_new = torch.tensor([2.0], requires_grad=True)
w_new = torch.tensor([1.0, 3.0], requires_grad=True)  # Enable gradient tracking for w

# Define the function
y_new = w_new[0] * x_new**2 + w_new[1] * x_new

# Backpropagate
y_new.backward()

# Print gradients
print("\nGradients:")
print("x_new.grad =", x_new.grad)
print("w_new.grad =", w_new.grad)

print("\nVerification:")
print("∂y/∂w[0] = x² = 2² = 4.0 ✓")
print("∂y/∂w[1] = x = 2.0 ✓")
print("∂y/∂x = 2·w[0]·x + w[1] = 2·1·2 + 3 = 7.0 ✓")



Answer 7b: Modified Code - Computing w.grad

Gradients:
x_new.grad = tensor([7.])
w_new.grad = tensor([4., 2.])

Verification:
∂y/∂w[0] = x² = 2² = 4.0 ✓
∂y/∂w[1] = x = 2.0 ✓
∂y/∂x = 2·w[0]·x + w[1] = 2·1·2 + 3 = 7.0 ✓


#### Answer 7c: Default Gradient Tracking

Same as Question 6c: **No, gradients are NOT tracked by default.**

According to PyTorch documentation for `torch.tensor()`:
- Default value for `requires_grad` is `False`
- You must explicitly set `requires_grad=True` to enable automatic differentiation
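
In practice, model weights are rarely created with `torch.tensor()` directly. As a small illustration, wrapping a tensor in `nn.Parameter` (which is what layers such as `nn.Linear` do internally) turns tracking on by default:

```python
import torch
import torch.nn as nn

plain = torch.tensor([1.0, 3.0])                 # requires_grad=False by default
param = nn.Parameter(torch.tensor([1.0, 3.0]))   # requires_grad=True by default
layer = nn.Linear(2, 1)                          # weight and bias are Parameters

print(plain.requires_grad, param.requires_grad)
print(all(p.requires_grad for p in layer.parameters()))
```

This is why weights inside an `nn.Module` receive gradients without any explicit `requires_grad=True`.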


---

### Question 8: Breaking the Graph

Understanding how `.detach()` breaks the computational graph.


In [10]:
print("\n" + "=" * 70)
print("Question 8: Breaking the Graph")
print("=" * 70)

# Original code that fails
x = torch.tensor([1.0], requires_grad=True)
y = x * 3
z = y.detach()  # This breaks the computational graph
w = z * 2

print("\nAttempting to call w.backward()...")
try:
    w.backward()
    print("Success! x.grad =", x.grad)
except RuntimeError as e:
    print(f"Error: {e}")
    
print("\n" + "-" * 70)
print("Why does this fail?")
print("-" * 70)
print("1. x has requires_grad=True")
print("2. y = x * 3 creates a node in the computational graph")
print("3. z = y.detach() creates a NEW tensor that shares data with y")
print("   but is DETACHED from the computational graph")
print("4. w = z * 2 operates on the detached tensor z")
print("5. w has no connection to x in the graph, so w.requires_grad=False")
print("6. Calling backward() on a tensor with requires_grad=False fails")

print(f"\nTensor properties:")
print(f"  x.requires_grad: {x.requires_grad}")
print(f"  y.requires_grad: {y.requires_grad}")
print(f"  z.requires_grad: {z.requires_grad}")
print(f"  w.requires_grad: {w.requires_grad}")



Question 8: Breaking the Graph

Attempting to call w.backward()...
Error: element 0 of tensors does not require grad and does not have a grad_fn

----------------------------------------------------------------------
Why does this fail?
----------------------------------------------------------------------
1. x has requires_grad=True
2. y = x * 3 creates a node in the computational graph
3. z = y.detach() creates a NEW tensor that shares data with y
   but is DETACHED from the computational graph
4. w = z * 2 operates on the detached tensor z
5. w has no connection to x in the graph, so w.requires_grad=False
6. Calling backward() on a tensor with requires_grad=False fails

Tensor properties:
  x.requires_grad: True
  y.requires_grad: True
  z.requires_grad: False
  w.requires_grad: False


#### Fixed Code - Solution

To fix this while still using `z`, we need to keep it connected to the graph.


In [11]:
print("\n" + "=" * 70)
print("Fixed Code - Solution")
print("=" * 70)

# Solution: Don't detach z
x = torch.tensor([1.0], requires_grad=True)
y = x * 3
z = y  # Don't detach - keep the graph connection
w = z * 2

print("\nCalling w.backward()...")
w.backward()

print(f"Success! x.grad = {x.grad}")

print("\n" + "-" * 70)
print("Verification:")
print("-" * 70)
print("w = z * 2 = (y) * 2 = (x * 3) * 2 = x * 6")
print("dw/dx = 6")
print(f"Computed gradient: {x.grad.item()} ✓")

print("\n" + "-" * 70)
print("Alternative Solutions:")
print("-" * 70)
print("1. Don't use detach() at all (shown above)")
print("2. If you NEED to use z for non-gradient operations,")
print("   keep a separate variable for the graph:")
print("   z_detached = y.detach()  # For non-gradient ops")
print("   w = y * 2  # For gradient ops")
print("3. Use torch.no_grad() context for operations that")
print("   don't need gradients, but keep the main path intact")



Fixed Code - Solution

Calling w.backward()...
Success! x.grad = tensor([6.])

----------------------------------------------------------------------
Verification:
----------------------------------------------------------------------
w = z * 2 = (y) * 2 = (x * 3) * 2 = x * 6
dw/dx = 6
Computed gradient: 6.0 ✓

----------------------------------------------------------------------
Alternative Solutions:
----------------------------------------------------------------------
1. Don't use detach() at all (shown above)
2. If you NEED to use z for non-gradient operations,
   keep a separate variable for the graph:
   z_detached = y.detach()  # For non-gradient ops
   w = y * 2  # For gradient ops
3. Use torch.no_grad() context for operations that
   don't need gradients, but keep the main path intact
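
Alternative 2 can be sketched as follows: a detached copy is used for a non-gradient purpose (logging, in this example) while the gradient path continues through `y`. The variable name `z_detached` matches the listing above; the rest is illustrative.

```python
import torch

x = torch.tensor([1.0], requires_grad=True)
y = x * 3

z_detached = y.detach()      # shares storage with y, carries no gradient history
w = y * 2                    # gradient path stays on y

w.backward()
print("x.grad:", x.grad)     # dw/dx = 6
print("z_detached.requires_grad:", z_detached.requires_grad)
print("logged value:", z_detached.item())
```

Note that `z_detached` shares memory with `y`, so modifying it in place would also change `y`.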


---

### Question 9: Gradient Accumulation

Understanding how gradients accumulate in PyTorch.


In [12]:
print("\n" + "=" * 70)
print("Question 9: Gradient Accumulation")
print("=" * 70)

x = torch.tensor([1.0], requires_grad=True)

# First backward
y1 = x * 2
y1.backward()
print(f"After first backward: x.grad = {x.grad}")

# Second backward
y2 = x * 3
y2.backward()
print(f"After second backward: x.grad = {x.grad}")

print("\n" + "-" * 70)
print("What is happening?")
print("-" * 70)
print("PyTorch ACCUMULATES gradients by default!")
print("First backward:  x.grad = dy1/dx = 2")
print("Second backward: x.grad = 2 + dy2/dx = 2 + 3 = 5")
print("\nThis is useful for:")
print("  - Gradient accumulation across mini-batches")
print("  - Multi-task learning")
print("  - RNNs with multiple time steps")



Question 9: Gradient Accumulation
After first backward: x.grad = tensor([2.])
After second backward: x.grad = tensor([5.])

----------------------------------------------------------------------
What is happening?
----------------------------------------------------------------------
PyTorch ACCUMULATES gradients by default!
First backward:  x.grad = dy1/dx = 2
Second backward: x.grad = 2 + dy2/dx = 2 + 3 = 5

This is useful for:
  - Gradient accumulation across mini-batches
  - Multi-task learning
  - RNNs with multiple time steps


#### Solution: Zero the Gradients

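To avoid unwanted accumulation, reset the gradient between backward passes by calling `x.grad.zero_()`, setting `x.grad = None`, or calling `optimizer.zero_grad()` when an optimizer manages the tensor.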


In [13]:
print("\n" + "=" * 70)
print("Solution: Proper Gradient Management")
print("=" * 70)

# Method 1: Using zero_()
print("\nMethod 1: Using x.grad.zero_()")
print("-" * 70)
x1 = torch.tensor([1.0], requires_grad=True)

y1 = x1 * 2
y1.backward()
print(f"After first backward: x1.grad = {x1.grad}")

x1.grad.zero_()  # Zero the gradients
y2 = x1 * 3
y2.backward()
print(f"After second backward (zeroed): x1.grad = {x1.grad}")

# Method 2: Setting grad to None
print("\nMethod 2: Setting x.grad = None")
print("-" * 70)
x2 = torch.tensor([1.0], requires_grad=True)

y1 = x2 * 2
y1.backward()
print(f"After first backward: x2.grad = {x2.grad}")

x2.grad = None  # Reset gradients
y2 = x2 * 3
y2.backward()
print(f"After second backward (reset): x2.grad = {x2.grad}")

# Method 3: Using optimizer (typical in training)
print("\nMethod 3: Using optimizer.zero_grad() (typical in training)")
print("-" * 70)
x3 = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([x3], lr=0.01)

y1 = x3 * 2
y1.backward()
print(f"After first backward: x3.grad = {x3.grad}")

optimizer.zero_grad()  # Zero all gradients managed by optimizer
y2 = x3 * 3
y2.backward()
print(f"After second backward (optimizer zeroed): x3.grad = {x3.grad}")

print("\n" + "=" * 70)
print("Best Practices:")
print("=" * 70)
print("1. In training loops, call optimizer.zero_grad() before each")
print("   backward pass")
print("2. Use x.grad = None (more efficient) or x.grad.zero_()")
print("3. Only accumulate gradients intentionally (e.g., for gradient")
print("   accumulation across mini-batches)")



Solution: Proper Gradient Management

Method 1: Using x.grad.zero_()
----------------------------------------------------------------------
After first backward: x1.grad = tensor([2.])
After second backward (zeroed): x1.grad = tensor([3.])

Method 2: Setting x.grad = None
----------------------------------------------------------------------
After first backward: x2.grad = tensor([2.])
After second backward (reset): x2.grad = tensor([3.])

Method 3: Using optimizer.zero_grad() (typical in training)
----------------------------------------------------------------------
After first backward: x3.grad = tensor([2.])
After second backward (optimizer zeroed): x3.grad = tensor([3.])

Best Practices:
1. In training loops, call optimizer.zero_grad() before each
   backward pass
2. Use x.grad = None (more efficient) or x.grad.zero_()
3. Only accumulate gradients intentionally (e.g., for gradient
   accumulation across mini-batches)
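
Intentional accumulation can be illustrated with a toy regression problem. The sketch below, with made-up data and an arbitrary split into four micro-batches, checks that summing scaled micro-batch gradients reproduces the full-batch gradient.

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 3)
y = torch.randn(8, 1)
w = torch.zeros(3, 1, requires_grad=True)

# Reference: gradient of the mean squared error over the full batch.
loss_full = ((X @ w - y) ** 2).mean()
loss_full.backward()
grad_full = w.grad.clone()

# Accumulate over 4 micro-batches of size 2, scaling each loss by 1/4.
w.grad = None
num_chunks = 4
for X_chunk, y_chunk in zip(X.chunk(num_chunks), y.chunk(num_chunks)):
    loss = ((X_chunk @ w - y_chunk) ** 2).mean() / num_chunks
    loss.backward()          # adds into w.grad instead of overwriting it

print(torch.allclose(w.grad, grad_full, atol=1e-6))
```

Dividing each micro-batch loss by the number of chunks keeps the accumulated gradient on the same scale as the full-batch gradient when the chunks have equal size.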


---

## Summary of Gradient Computation Questions

### Key Learnings:

1. **Question 6**: Basic gradient calculation - PyTorch autograd fundamentals
2. **Question 7**: Weight gradients - Must set `requires_grad=True` for learnable parameters
3. **Question 8**: Graph connectivity - `.detach()` breaks the computational graph
4. **Question 9**: Gradient accumulation - Gradients accumulate by default and must be zeroed between independent backward passes

### Important Points:

- **Default behavior**: `torch.tensor()` has `requires_grad=False` by default
- **Computational graph**: Only recorded for operations that involve at least one tensor with `requires_grad=True`
- **`.detach()`**: Creates a new tensor that shares data but has no gradient history
- **Gradient accumulation**: Intentional feature, useful for mini-batch accumulation
- **Best practice**: Call `optimizer.zero_grad()` before each backward pass in a training loop, unless gradients are being accumulated on purpose

These concepts are essential for understanding:
- How Energy-Based Models compute gradients for Langevin sampling
- How training loops work in PyTorch
- How to debug gradient flow issues
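
To connect these points to Langevin sampling, here is a minimal sketch on a toy quadratic energy $E(x) = \tfrac{1}{2}\lVert x \rVert^2$. The energy function, step size, and helper name `langevin_step` are assumptions for illustration and do not represent the notebook's EBM. It shows the gradient being taken with respect to the input rather than the weights, and `.detach()` being used to cut the graph between steps.

```python
import torch

def energy(x):
    """Toy energy E(x) = 0.5 * ||x||^2; low energy lies near the origin."""
    return 0.5 * (x ** 2).sum(dim=1)

def langevin_step(x, step_size=0.1, noise_scale=0.0):
    # Fresh leaf tensor: no history from earlier steps is kept.
    x = x.detach().requires_grad_(True)
    # dE/dx via autograd.grad, which returns the gradient without touching .grad.
    grad, = torch.autograd.grad(energy(x).sum(), x)
    x_next = x - step_size * grad + noise_scale * torch.randn_like(x)
    return x_next.detach()

x = torch.full((4, 2), 3.0)
for _ in range(50):
    # Noise is disabled here so the result is deterministic; Langevin
    # dynamics proper adds Gaussian noise at every step.
    x = langevin_step(x, step_size=0.1, noise_scale=0.0)
print(x.abs().max())
```

With the noise switched off, each step multiplies `x` by 0.9, so the samples contract toward the energy minimum at the origin.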
