# Module 25: Stable Diffusion

**Text-to-Image Generation with Latent Diffusion**

---

## 1. Objectives

- ✅ Understand diffusion model theory
- ✅ Learn Stable Diffusion architecture
- ✅ Master the Diffusers library
- ✅ Implement text-to-image generation
- ✅ Explore ControlNet and fine-tuning

## 2. Prerequisites

- [Module 24: Multimodal Learning](../24_multimodal/24_multimodal.ipynb)
- Basic understanding of probability and neural networks

## 3. Diffusion Models - Theory

### Core Intuition

Diffusion models learn to **reverse a gradual noising process**:

```
Forward Process (Fixed):
Clean Image ──→ Add Noise ──→ Add Noise ──→ ... ──→ Pure Noise
    x₀      →      x₁     →      x₂     → ... →      xₜ

Reverse Process (Learned):
Pure Noise ──→ Denoise ──→ Denoise ──→ ... ──→ Clean Image
    xₜ     →    xₜ₋₁   →    xₜ₋₂   → ... →      x₀
```

### Forward Process (Adding Noise)

At each timestep $t$, we add Gaussian noise:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$$

Because every step is Gaussian, the steps compose in closed form, so $x_t$ can be sampled directly from $x_0$ for any timestep:

$$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) I)$$

Where $\bar{\alpha}_t = \prod_{s=1}^t (1 - \beta_s)$
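
As a quick numerical check of this closed form, the sketch below uses plain Python (no framework, so the arithmetic stays visible) with the same linear schedule values that `SimpleDiffusion` uses in this module ($\beta$ from 1e-4 to 0.02 over 1000 steps). It iterates the per-step noise variance and compares it with $1 - \bar{\alpha}_t$. The helper names `linear_betas` and `alpha_bars` are illustrative, not part of any library.

```python
import math

def linear_betas(num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 ... beta_T."""
    step = (beta_end - beta_start) / (num_timesteps - 1)
    return [beta_start + i * step for i in range(num_timesteps)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

betas = linear_betas()
abar = alpha_bars(betas)

# Applying q(x_t | x_{t-1}) repeatedly scales the old noise variance by
# (1 - beta_t) and adds beta_t. The closed form says this equals 1 - alpha_bar_t.
variance = 0.0
for t, beta in enumerate(betas):
    variance = (1.0 - beta) * variance + beta
    assert math.isclose(variance, 1.0 - abar[t], rel_tol=1e-9)

# How much signal and noise remain at a few timesteps
for t in (0, 250, 500, 750, 999):
    signal = math.sqrt(abar[t])
    noise = math.sqrt(1.0 - abar[t])
    print(f"t={t:4d}  signal coeff={signal:.4f}  noise coeff={noise:.4f}")
```

The signal coefficient shrinks toward zero while the noise coefficient approaches one, which is why $x_T$ is treated as pure noise.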

### Reverse Process (Denoising)

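A network $\epsilon_\theta$ is trained to predict the noise $\epsilon \sim \mathcal{N}(0, I)$ that was mixed into $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$. The simplified training objective is the mean squared error between the true and predicted noise: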

$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon} \left[ \|\epsilon - \epsilon_\theta(x_t, t)\|^2 \right]$$
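
One Monte Carlo sample of this objective can be sketched in plain Python. The "models" here are stand-ins: `make_oracle` cheats by using $x_0$ to recover the exact noise (loss near zero), while `zero_predictor` always predicts zero (loss equal to the mean squared noise). Both names, the schedule helper, and the 16-value "image" are hypothetical and only illustrate what the loss measures.

```python
import math
import random

def make_alpha_bars(num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for a linear beta schedule."""
    step = (beta_end - beta_start) / (num_timesteps - 1)
    out, running = [], 1.0
    for i in range(num_timesteps):
        running *= 1.0 - (beta_start + i * step)
        out.append(running)
    return out

def loss_sample(x0, predictor, abar, rng):
    """One sample of ||eps - eps_theta(x_t, t)||^2, averaged over elements."""
    t = rng.randrange(len(abar))                      # random timestep
    eps = [rng.gauss(0.0, 1.0) for _ in x0]           # true noise
    s, n = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    x_t = [s * x + n * e for x, e in zip(x0, eps)]    # forward process
    eps_hat = predictor(x_t, t)                       # model prediction
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(eps)

def make_oracle(x0, abar):
    """Hypothetical perfect predictor: solves x_t for eps using the known x0."""
    def predictor(x_t, t):
        s, n = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
        return [(xt - s * x) / n for xt, x in zip(x_t, x0)]
    return predictor

def zero_predictor(x_t, t):
    """Hypothetical untrained predictor: always outputs zero noise."""
    return [0.0] * len(x_t)

rng = random.Random(0)
abar = make_alpha_bars()
x0 = [rng.uniform(-1.0, 1.0) for _ in range(16)]

oracle_loss = loss_sample(x0, make_oracle(x0, abar), abar, rng)
zero_loss = loss_sample(x0, zero_predictor, abar, rng)
print(f"oracle loss={oracle_loss:.2e}, zero-predictor loss={zero_loss:.3f}")
```

A real training loop averages this quantity over many images, timesteps, and noise draws, and backpropagates through the network that replaces the predictor.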

In [None]:
# Install: pip install diffusers accelerate transformers torch

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Device: {device}")

In [None]:
class SimpleDiffusion:
    """Simplified diffusion process for understanding.
    
    This implements the forward and reverse diffusion math.
    """
    
    def __init__(self, num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
        self.num_timesteps = num_timesteps
        
        # Linear noise schedule
        self.betas = torch.linspace(beta_start, beta_end, num_timesteps)
        self.alphas = 1.0 - self.betas
        self.alpha_cumprod = torch.cumprod(self.alphas, dim=0)
        self.sqrt_alpha_cumprod = torch.sqrt(self.alpha_cumprod)
        self.sqrt_one_minus_alpha_cumprod = torch.sqrt(1.0 - self.alpha_cumprod)
    
    def add_noise(self, x_0, t, noise=None):
        """Forward process: add noise to clean data.
        
        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
        """
        if noise is None:
            noise = torch.randn_like(x_0)
        
        sqrt_alpha = self.sqrt_alpha_cumprod[t].view(-1, 1, 1, 1)
        sqrt_one_minus_alpha = self.sqrt_one_minus_alpha_cumprod[t].view(-1, 1, 1, 1)
        
        return sqrt_alpha * x_0 + sqrt_one_minus_alpha * noise
    
    def remove_noise(self, x_t, t, predicted_noise):
        """Reverse process: estimate x_{t-1} from x_t."""
        alpha = self.alphas[t].view(-1, 1, 1, 1)
        alpha_cumprod = self.alpha_cumprod[t].view(-1, 1, 1, 1)
        beta = self.betas[t].view(-1, 1, 1, 1)
        
        # Predicted clean image, shown for reference only; the DDPM update below does not use it
        sqrt_one_minus_alpha = self.sqrt_one_minus_alpha_cumprod[t].view(-1, 1, 1, 1)
        x_0_pred = (x_t - sqrt_one_minus_alpha * predicted_noise) / self.sqrt_alpha_cumprod[t].view(-1, 1, 1, 1)
        
        # Mean of the reverse distribution p(x_{t-1} | x_t)
        mean = (1 / torch.sqrt(alpha)) * (x_t - (beta / sqrt_one_minus_alpha) * predicted_noise)
        
        # Add noise with variance beta_t, except at the final step (t=0).
        # Assumes every item in the batch shares the same timestep.
        if t[0] > 0:
            noise = torch.randn_like(x_t)
            std = torch.sqrt(beta)
            return mean + std * noise
        return mean

# Visualize noising process
diffusion = SimpleDiffusion()
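# Synthetic "clean" image in [-1, 1]: a horizontal gradient, so the added noise is visible
x_0 = torch.linspace(-1, 1, 64).repeat(1, 3, 64, 1)  # shape (1, 3, 64, 64)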

fig, axes = plt.subplots(1, 5, figsize=(15, 3))
timesteps = [0, 250, 500, 750, 999]

for ax, t in zip(axes, timesteps):
    x_t = diffusion.add_noise(x_0, torch.tensor([t]))
    # Normalize for visualization
    img = (x_t[0].permute(1, 2, 0).numpy() + 1) / 2
    img = np.clip(img, 0, 1)
    ax.imshow(img)
    ax.set_title(f't={t}')
    ax.axis('off')

plt.suptitle('Forward Diffusion Process')
plt.tight_layout()
plt.show()
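
To see the reverse update run end to end without training a network, here is a toy case where the ideal noise predictor is known in closed form. If the data are scalars drawn from $\mathcal{N}(0, 1)$, then $x_t$ is also $\mathcal{N}(0, 1)$ at every step, and the expected noise given $x_t$ is $\sqrt{1-\bar{\alpha}_t}\, x_t$. This is a sketch under those assumptions (scalar data, analytic predictor, step variance $\beta_t$ as in `remove_noise`), written in plain Python; it is not a substitute for a trained U-Net, and all function names are illustrative.

```python
import math
import random

def make_schedule(num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear betas and their cumulative alpha_bar values."""
    step = (beta_end - beta_start) / (num_timesteps - 1)
    betas = [beta_start + i * step for i in range(num_timesteps)]
    abar, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        abar.append(running)
    return betas, abar

def ideal_noise(x_t, abar_t):
    """E[eps | x_t] when x_0 ~ N(0, 1): the optimal predictor for this toy data."""
    return math.sqrt(1.0 - abar_t) * x_t

def reverse_step(x_t, t, eps_hat, betas, abar, rng):
    """One DDPM step x_t -> x_{t-1}, adding noise with variance beta_t while t > 0."""
    beta = betas[t]
    mean = (x_t - beta / math.sqrt(1.0 - abar[t]) * eps_hat) / math.sqrt(1.0 - beta)
    if t > 0:
        return mean + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return mean

def sample(betas, abar, rng):
    """Start from pure noise and denoise through every timestep."""
    x = rng.gauss(0.0, 1.0)
    for t in reversed(range(len(betas))):
        x = reverse_step(x, t, ideal_noise(x, abar[t]), betas, abar, rng)
    return x

rng = random.Random(0)
betas, abar = make_schedule()
samples = [sample(betas, abar, rng) for _ in range(300)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"sample mean={mean:.3f}, variance={var:.3f} (data distribution: mean 0, variance 1)")
```

The generated samples should have roughly the mean and variance of the data distribution, which is the property the learned reverse process is meant to reproduce for images.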

## 4. Stable Diffusion Architecture

### Key Innovation: Latent Diffusion

Instead of diffusing in **pixel space** (512×512×3 ≈ 786K dimensions), Stable Diffusion runs the diffusion process in the **latent space** of a pretrained VAE (64×64×4 ≈ 16K dimensions).
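
The size difference is simple arithmetic. The snippet below assumes the shapes quoted above: an 8× spatial downsampling (512 to 64 per side) and 4 latent channels. The helper name `num_values` is illustrative.

```python
def num_values(height, width, channels):
    """Number of scalar values in an image or latent tensor."""
    return height * width * channels

downsample = 8  # 512 / 64, as implied by the shapes above
pixel_values = num_values(512, 512, 3)
latent_values = num_values(512 // downsample, 512 // downsample, 4)

print(f"pixel space:  {pixel_values:,} values")
print(f"latent space: {latent_values:,} values")
print(f"reduction:    {pixel_values // latent_values}x")
```

Every denoising step therefore operates on a tensor 48 times smaller than the image itself.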

```
┌───────────────────────────────────────────────────────────────┐
│                    Stable Diffusion Pipeline                  │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Text Prompt ──→ [CLIP Text Encoder] ──→ Text Embeddings      │
│                                               ↓               │
│  Random Noise ──→ [U-Net] ←── Cross-Attention                 │
│                      ↓                                        │
│              [Scheduler: DDPM/DDIM/...]                       │
│                      ↓                                        │
│              Denoised Latents                                 │
│                      ↓                                        │
│              [VAE Decoder] ──→ Generated Image                │
│                                                               │
└───────────────────────────────────────────────────────────────┘
```

### Components

| Component | Purpose | Size (approx., SD v1.x) |
|-----------|---------|------|
| VAE | Compress/decompress images | ~80M params |
| U-Net | Predict noise at each step | ~860M params |
| CLIP Text Encoder | Encode text prompts | ~123M params |
| Scheduler | Control denoising steps | N/A |

In [None]:
# U-Net building blocks

class ResidualBlock(nn.Module):
    """Residual block with time embedding."""
    
    def __init__(self, in_channels, out_channels, time_emb_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.time_mlp = nn.Linear(time_emb_dim, out_channels)
        self.norm1 = nn.GroupNorm(8, out_channels)
        self.norm2 = nn.GroupNorm(8, out_channels)
        
        if in_channels != out_channels:
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1)
        else:
            self.shortcut = nn.Identity()
    
    def forward(self, x, t_emb):
        h = self.norm1(F.silu(self.conv1(x)))
        h = h + self.time_mlp(t_emb)[:, :, None, None]  # Add time embedding
        h = self.norm2(F.silu(self.conv2(h)))
        return h + self.shortcut(x)


class CrossAttention(nn.Module):
    """Cross-attention for text conditioning."""
    
    def __init__(self, dim, context_dim, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(context_dim, dim)
        self.to_v = nn.Linear(context_dim, dim)
        self.to_out = nn.Linear(dim, dim)
    
    def forward(self, x, context):
        batch, seq_len, dim = x.shape
        
        q = self.to_q(x)
        k = self.to_k(context)
        v = self.to_v(context)
        
        # Reshape for multi-head attention
        q = q.view(batch, -1, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(batch, -1, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(batch, -1, self.n_heads, self.head_dim).transpose(1, 2)
        
        # Attention
        attn = torch.softmax(q @ k.transpose(-2, -1) / (self.head_dim ** 0.5), dim=-1)
        out = attn @ v
        
        out = out.transpose(1, 2).reshape(batch, seq_len, dim)
        return self.to_out(out)

print("U-Net blocks defined!")
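
`ResidualBlock` expects a time embedding `t_emb`, which the cell above does not construct. A common choice, borrowed from Transformer positional encodings, is a sinusoidal embedding of the timestep that is then passed through a small MLP. The function below is one variant in plain Python. The name `timestep_embedding`, the `max_period=10000` constant, and the sin-then-cos ordering are assumptions, and real implementations differ in such details.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Map a timestep to a dim-sized vector of sines and cosines."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Frequencies decay geometrically from 1 toward 1 / max_period
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = timestep_embedding(t=250, dim=8)
print([round(v, 3) for v in emb])
```

Nearby timesteps get similar vectors while distant ones differ, which gives the network a smooth signal for how noisy its input is.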

## 5. Using Diffusers Library

In [None]:
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch

# Load Stable Diffusion (pick a checkpoint that fits your VRAM)
model_id = "stabilityai/stable-diffusion-2-1-base"  # ~5GB
# model_id = "CompVis/stable-diffusion-v1-4"  # Older v1 checkpoint

# FP16 halves memory on GPU; fall back to FP32 on CPU, where half precision is poorly supported
pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    safety_checker=None  # Disable for speed (use responsibly!)
)

# Use faster scheduler
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Move to GPU if available
pipe = pipe.to(device)

# Enable memory optimizations
pipe.enable_attention_slicing()  # Reduces memory usage
# pipe.enable_xformers_memory_efficient_attention()  # Even better (needs xformers)

print("Stable Diffusion pipeline loaded!")

In [None]:
# Basic text-to-image generation

prompt = "A majestic lion in a cosmic nebula, digital art, highly detailed"
negative_prompt = "blurry, low quality, distorted"

# Generate image
with torch.inference_mode():
    image = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=25,  # More steps are slower; quality gains taper off
        guidance_scale=7.5,      # CFG scale: how strongly to push toward the prompt
        width=512,
        height=512
    ).images[0]

# Display
plt.figure(figsize=(8, 8))
plt.imshow(image)
plt.title(prompt[:50] + "...")
plt.axis('off')
plt.show()

# Save
image.save("generated_image.png")

## 6. Guidance Scale & Classifier-Free Guidance

### Theory

Classifier-Free Guidance (CFG) trades sample diversity for prompt adherence by extrapolating from the unconditional noise prediction toward the text-conditioned one:

$$\tilde{\epsilon} = \epsilon_\theta(x_t, \emptyset) + s \cdot (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \emptyset))$$

Where:
- $\epsilon_\theta(x_t, c)$ = noise prediction with text condition
- $\epsilon_\theta(x_t, \emptyset)$ = unconditional prediction (no text)
- $s$ = guidance scale (typically 7-15; $s = 1$ reduces to the plain conditional prediction)
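
The formula is a per-element linear extrapolation, which the following plain-Python sketch makes concrete. The function name `cfg_combine` and the three-element "noise predictions" are made up for illustration; in a real pipeline both predictions come from the same U-Net, run once with the prompt embedding and once with an empty-prompt embedding.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.10, -0.20, 0.05]   # illustrative unconditional prediction
eps_cond = [0.30, -0.10, 0.00]     # illustrative text-conditioned prediction

for s in (0.0, 1.0, 7.5):
    guided = cfg_combine(eps_uncond, eps_cond, s)
    print(f"s={s}: {[round(g, 3) for g in guided]}")
```

With $s = 0$ the result is the unconditional prediction, with $s = 1$ it is the conditional prediction, and with $s > 1$ it moves past the conditional prediction in the direction the prompt suggests.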

### Effect of Guidance Scale

| Scale | Effect |
|-------|--------|
| 1 | No extra guidance: plain conditional prediction, weak prompt adherence |
| 5-7 | Balanced |
| 10-15 | Strong prompt adherence |
| 20+ | Over-saturated, artifacts |

In [None]:
# Compare different guidance scales

prompt = "A serene Japanese garden with cherry blossoms"
scales = [1.0, 5.0, 7.5, 12.0]

fig, axes = plt.subplots(1, 4, figsize=(16, 4))

for ax, scale in zip(axes, scales):
    with torch.inference_mode():
        image = pipe(
            prompt=prompt,
            num_inference_steps=20,
            guidance_scale=scale,
            generator=torch.Generator(device).manual_seed(42)  # Reproducible
        ).images[0]
    
    ax.imshow(image)
    ax.set_title(f"Scale: {scale}")
    ax.axis('off')

plt.suptitle(f"Guidance Scale Comparison: '{prompt[:40]}...'")
plt.tight_layout()
plt.show()

## 7. Schedulers (Samplers)

Different schedulers offer speed/quality tradeoffs:

| Scheduler | Steps | Quality | Speed |
|-----------|-------|---------|-------|
| DDPM | 1000 | Best | Slow |
| DDIM | 50-100 | Great | Fast |
| DPM++ 2M | 20-30 | Great | Fast |
| Euler Ancestral | 20-30 | Good | Fast |
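
Fast schedulers can use 20 to 50 steps because they visit only a subset of the training timesteps and take larger jumps between them. The sketch below shows evenly spaced timestep selection and a deterministic DDIM update (the $\eta = 0$ case) in plain Python. The function names are hypothetical, and library schedulers add details that are omitted here, such as timestep offsets, clipping, and higher-order solvers.

```python
import math

def alpha_bars(num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for a linear beta schedule."""
    step = (beta_end - beta_start) / (num_timesteps - 1)
    out, running = [], 1.0
    for i in range(num_timesteps):
        running *= 1.0 - (beta_start + i * step)
        out.append(running)
    return out

def select_timesteps(num_train_steps, num_inference_steps):
    """Evenly spaced subset of the training timesteps, noisiest first."""
    stride = num_train_steps // num_inference_steps
    return [i * stride for i in range(num_inference_steps)][::-1]

def ddim_step(x_t, eps_hat, abar_t, abar_prev):
    """Deterministic DDIM update from noise level abar_t to abar_prev."""
    # Estimate the clean sample implied by the noise prediction
    x0_pred = (x_t - math.sqrt(1.0 - abar_t) * eps_hat) / math.sqrt(abar_t)
    # Re-noise that estimate to the (lower) target noise level
    return math.sqrt(abar_prev) * x0_pred + math.sqrt(1.0 - abar_prev) * eps_hat

abar = alpha_bars()
timesteps = select_timesteps(1000, 25)
print(timesteps[:3], "...", timesteps[-3:])

# Sanity check: with the true noise, a single jump to abar_prev = 1 recovers x_0
x0, eps = 0.7, -1.3
t = timesteps[0]
x_t = math.sqrt(abar[t]) * x0 + math.sqrt(1.0 - abar[t]) * eps
x0_back = ddim_step(x_t, eps, abar[t], abar_prev=1.0)
print(f"recovered x_0 = {x0_back:.6f}")
```

With a trained model the noise prediction is imperfect, so the sampler takes several such jumps and re-predicts the noise at each one.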

In [None]:
from diffusers import (
    DDPMScheduler,
    DDIMScheduler,
    DPMSolverMultistepScheduler,
    EulerAncestralDiscreteScheduler
)

schedulers = {
    "DDIM": DDIMScheduler,
    "DPM++ 2M": DPMSolverMultistepScheduler,
    "Euler A": EulerAncestralDiscreteScheduler
}

prompt = "A futuristic city at sunset, cyberpunk style"

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

for ax, (name, scheduler_class) in zip(axes, schedulers.items()):
    pipe.scheduler = scheduler_class.from_config(pipe.scheduler.config)
    
    with torch.inference_mode():
        image = pipe(
            prompt=prompt,
            num_inference_steps=25,
            generator=torch.Generator(device).manual_seed(42)
        ).images[0]
    
    ax.imshow(image)
    ax.set_title(name)
    ax.axis('off')

plt.suptitle("Scheduler Comparison (25 steps)")
plt.tight_layout()
plt.show()

## 8. Image-to-Image Generation

In [None]:
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image
import requests
from io import BytesIO

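# Load img2img pipeline (FP16 on GPU; FP32 on CPU, where half precision is poorly supported)
img2img_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32
).to(device)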

# Load an input image (fail early if the download does not succeed)
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Camponotus_flavomarginatus_ant.jpg/320px-Camponotus_flavomarginatus_ant.jpg"
response = requests.get(url, timeout=30)
response.raise_for_status()
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((512, 512))

# Transform with a prompt
prompt = "A robotic ant made of chrome and neon, sci-fi style"

with torch.inference_mode():
    output = img2img_pipe(
        prompt=prompt,
        image=init_image,
        strength=0.75,  # 0=no change, 1=complete transformation
        num_inference_steps=30
    ).images[0]

# Compare
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].imshow(init_image)
axes[0].set_title("Original")
axes[1].imshow(output)
axes[1].set_title("Transformed")
for ax in axes:
    ax.axis('off')
plt.tight_layout()
plt.show()
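
The `strength` parameter also determines how many denoising steps actually run: the input image is noised part of the way and only the remaining portion of the schedule is executed. The arithmetic below is a sketch that assumes the pipeline truncates the schedule as `int(num_inference_steps * strength)`; the helper name `img2img_steps` is hypothetical.

```python
def img2img_steps(num_inference_steps, strength):
    """Return (steps skipped, steps run) for a given img2img strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    return num_inference_steps - steps_to_run, steps_to_run

for strength in (0.25, 0.5, 0.75, 1.0):
    skipped, run = img2img_steps(30, strength)
    print(f"strength={strength}: skip {skipped} steps, run {run}")
```

Under that assumption, the call above with `strength=0.75` and `num_inference_steps=30` performs 22 denoising steps rather than 30.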

## 9. ControlNet for Guided Generation

ControlNet adds **spatial conditioning** to guide generation:

```
Control Input (edge/pose/depth)  ──→ ControlNet
                                        ↓
Text Prompt ───────────────────────→ U-Net ──→ Image
```
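
In the ControlNet paper, a trainable copy of the U-Net encoder processes the control image, and its outputs are added to the frozen U-Net's features through convolutions whose weights start at zero, so training begins from the unmodified base model. The scalar sketch below illustrates only that additive, zero-initialized connection. `zero_conv`, `add_control`, and the feature values are hypothetical stand-ins, not Diffusers API.

```python
def zero_conv(feature, weight=0.0, bias=0.0):
    """Stand-in for a 1x1 convolution whose weight and bias start at zero."""
    return [weight * f + bias for f in feature]

def add_control(base_feature, control_feature, weight=0.0):
    """Frozen U-Net feature plus the ControlNet residual."""
    residual = zero_conv(control_feature, weight=weight)
    return [b + r for b, r in zip(base_feature, residual)]

base_feature = [0.5, -1.0, 2.0]      # illustrative U-Net activations
control_feature = [1.0, 1.0, -3.0]   # illustrative ControlNet activations

# At initialization the control branch contributes nothing
print(add_control(base_feature, control_feature))
# Once the weight has been trained away from zero, the control signal shifts the features
print(add_control(base_feature, control_feature, weight=0.1))
```

Starting from a zero contribution means the pretrained model's behavior is preserved at the first training step and the control signal is learned gradually.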

In [None]:
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
import cv2
import numpy as np

# Load ControlNet for Canny edges (FP16 on GPU, FP32 on CPU)
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32
)

controlnet_pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32
).to(device)

def get_canny_edges(image, low=100, high=200):
    """Extract Canny edges from image."""
    image = np.array(image)
    edges = cv2.Canny(image, low, high)
    edges = np.stack([edges] * 3, axis=-1)
    return Image.fromarray(edges)

# Create edges from input
canny_image = get_canny_edges(init_image)

# Generate with edge control
prompt = "A beautiful butterfly, vibrant colors, macro photography"

with torch.inference_mode():
    controlled_output = controlnet_pipe(
        prompt=prompt,
        image=canny_image,
        num_inference_steps=25
    ).images[0]

# Show results
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].imshow(init_image)
axes[0].set_title("Original")
axes[1].imshow(canny_image)
axes[1].set_title("Canny Edges")
axes[2].imshow(controlled_output)
axes[2].set_title("ControlNet Output")
for ax in axes:
    ax.axis('off')
plt.tight_layout()
plt.show()

## 10. Fine-Tuning with LoRA

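LoRA (Low-Rank Adaptation) fine-tunes a model for a custom style or subject by training small low-rank update matrices while the base weights stay frozen, which keeps compute and checkpoint size low: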

In [None]:
# Load a LoRA-fine-tuned model
from diffusers import StableDiffusionPipeline

# Example: Loading a LoRA for a specific style
# pipe.load_lora_weights("path/to/lora/weights")

# For training LoRA, use the diffusers train_dreambooth_lora.py script:
# accelerate launch train_dreambooth_lora.py \
#   --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1" \
#   --instance_data_dir="./my_images" \
#   --instance_prompt="photo of sks dog" \
#   --output_dir="./lora_weights" \
#   --train_batch_size=1 \
#   --max_train_steps=500

print("LoRA fine-tuning example (see diffusers documentation for full training)")
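
The idea behind LoRA can be shown with tiny matrices: the frozen weight $W$ is used as $W + \frac{\alpha}{r} B A$, where $A$ and $B$ are the only trained parameters and $r$ is the rank. The plain-Python sketch below is illustrative; the helper names and the 768×768 layer size are assumptions chosen for the example, not values read from Stable Diffusion.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, lora_a, lora_b, alpha, rank):
    """Effective weight W + (alpha / rank) * B @ A."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def param_counts(d_out, d_in, rank):
    """(full fine-tuning parameters, LoRA parameters) for one linear layer."""
    return d_out * d_in, rank * (d_out + d_in)

# Tiny example: a frozen 3x3 weight with a rank-1 update
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lora_a = [[0.5, -0.5, 1.0]]       # rank x d_in (small random values in practice)
lora_b = [[0.0], [0.0], [0.0]]    # d_out x rank (zeros, so training starts exactly at W)

unchanged = lora_weight(w, lora_a, lora_b, alpha=1.0, rank=1) == w
print(f"effective weight equals W at initialization: {unchanged}")

full, lora = param_counts(768, 768, rank=4)
print(f"full layer: {full:,} params, LoRA: {lora:,} params ({full // lora}x fewer)")
```

Because only $A$ and $B$ are stored, a LoRA checkpoint is a small file that is applied on top of an existing base model.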

## 11. Prompt Engineering for Images

### Effective Prompts

| Element | Example |
|---------|--------|
| Subject | "A majestic dragon" |
| Style | "digital art, oil painting, anime" |
| Quality | "highly detailed, 4k, masterpiece" |
| Lighting | "golden hour, dramatic lighting" |
| Camera | "wide angle, portrait, macro" |

### Template
```
[subject], [style], [quality modifiers], [lighting], [artist reference]
```
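
The template can be turned into a small helper that joins whichever elements are provided. `build_prompt` is a hypothetical convenience function written for this example, not part of Diffusers.

```python
def build_prompt(subject, style=None, quality=None, lighting=None, artist=None):
    """Join the non-empty prompt elements in template order."""
    parts = [subject, style, quality, lighting, artist]
    return ", ".join(p.strip() for p in parts if p and p.strip())

prompt = build_prompt(
    subject="A majestic dragon",
    style="digital art",
    quality="highly detailed",
    lighting="dramatic lighting",
)
print(prompt)
```

Keeping the elements separate makes it easy to vary one of them, such as the style, while holding the rest of the prompt fixed.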

In [None]:
# Prompt examples
prompts = [
    # Basic
    "A cat",
    
    # With style
    "A cat, watercolor painting style",
    
    # With quality
    "A cat, watercolor painting, highly detailed, vibrant colors",
    
    # Full prompt
    "A majestic cat sitting on a throne, watercolor painting, highly detailed, "
    "golden hour lighting, by Studio Ghibli, masterpiece, 4k"
]

print("Prompt Progression:")
for i, p in enumerate(prompts, 1):
    print(f"\n{i}. {p}")

## 12. Interview Questions

**Q1: Explain the forward and reverse diffusion process.**
<details><summary>Answer</summary>

- Forward: Gradually add Gaussian noise to data over T steps until pure noise
- Reverse: Learn to predict/remove noise at each step to reconstruct clean data
- The network is trained to predict the noise that was added at a given timestep; that prediction determines each denoising step
</details>

**Q2: Why does Stable Diffusion work in latent space?**
<details><summary>Answer</summary>

Latent space is much smaller (64×64×4 vs 512×512×3, about 48× fewer values), so:
- Training is faster and cheaper
- Inference is faster (fewer values to denoise at each step)
- Little is lost perceptually, because the VAE preserves the perceptually important features
</details>

**Q3: What is classifier-free guidance?**
<details><summary>Answer</summary>

A technique to control prompt adherence without a separate classifier:
- During training, randomly replace the prompt with an empty one so the same network also learns the unconditional prediction
- At inference: blend conditional and unconditional predictions
- Higher guidance = stronger prompt following, but can over-saturate
</details>

## 13. Summary

| Concept | Key Point |
|---------|----------|
| Diffusion | Learn to reverse gradual noising |
| Latent Diffusion | Work in compressed space |
| CFG | Trade diversity for prompt adherence |
| Schedulers | Trade speed for quality |
| ControlNet | Add spatial conditioning |

## 14. References

- [DDPM Paper](https://arxiv.org/abs/2006.11239)
- [Latent Diffusion Paper](https://arxiv.org/abs/2112.10752)
- [Diffusers Library](https://huggingface.co/docs/diffusers/)
- [ControlNet Paper](https://arxiv.org/abs/2302.05543)

---
**🎉 Congratulations! You've completed the full NLP + Multimodal curriculum!**

Return to [Module 00: NLP Pipeline Overview](../00_nlp_pipeline/00_nlp_pipeline_overview.ipynb)