# Stable Diffusion: From Image Classification to Generative Modeling

In this lab, you'll explore diffusion models for image generation. Diffusion models learn how to turn random noise into images by gradually "denoising" it through a learned process. When trained on images paired with text prompts, these models can generate new images from a written description of what you'd like to see.

You'll use the Stable Diffusion 2 model and the Hugging Face diffusers library to:

- Generate images from text prompts
- Control generation with deterministic seeds
- Experiment with "negative prompts"
- Learn how pipeline parameters such as `num_inference_steps` and `guidance_scale` shape the output

## 1 - Introduction

### 1.1 Loading the libraries

Run the cell below to load the necessary libraries for this lab.

In [None]:
from diffusers import StableDiffusionPipeline
import matplotlib.pyplot as plt
import torch
from PIL import Image
import os
import numpy as np

### 1.2 Discriminative vs. Generative Approaches in Deep Learning

Up to this point in the course, you have engaged primarily with **discriminative models**—deep learning systems developed to recognize and categorize images (e.g., via convolutional neural networks). With tools such as **saliency maps** and **class activation maps**, you have explored how these models identify salient regions that inform their decisions.

In this lab, you will shift focus to **generative models**: neural networks capable of synthesizing new, realistic images from scratch by learning the underlying distribution of the training data.

For example, consider describing a novel scenario through a prompt such as:
> "A puppy riding a skateboard in Times Square."

With recent advancements in generative AI, it is possible for a model to produce images matching this description, even though the process begins not with a recognizable picture, but with random noise.

Let's load the model and check its output. You will go deeper into the noise details later!

In [None]:
# Define model and cache directory
repo_id = "stabilityai/stable-diffusion-2-base"
cache_dir = os.path.join(os.getcwd(), './models')
model_cache_dir_name = f"models--{repo_id.replace('/', '--')}"
model_cache_path = os.path.join(cache_dir, model_cache_dir_name)

# Check if this directory exists. If it does, the model is cached.
# Note: This cache detection logic is repeated for each model in this notebook.
# In production code, you might refactor this into a reusable helper function!
model_is_cached = os.path.isdir(model_cache_path)

if model_is_cached:
    print("Model cache found.")
    print("Setting local_files_only=True.")
else:
    print("Model cache not found")
    print("Setting local_files_only=False (will attempt download).")
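
The comment above suggests refactoring the cache check into a helper. A minimal sketch of what that could look like follows; the function name `is_model_cached` is hypothetical (it is not part of `diffusers` or `huggingface_hub`), and it only checks that the folder exists, not that the download completed.

```python
import os
import tempfile


def is_model_cached(repo_id: str, cache_dir: str) -> bool:
    """Return True if a Hugging Face-style cache folder exists for repo_id.

    Hypothetical helper mirroring the check above: each repo is expected
    under 'models--<org>--<name>' inside cache_dir.
    """
    folder_name = f"models--{repo_id.replace('/', '--')}"
    return os.path.isdir(os.path.join(cache_dir, folder_name))


# Quick self-check against a throwaway directory (nothing is downloaded).
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "models--stabilityai--stable-diffusion-2-base"))
    cached = is_model_cached("stabilityai/stable-diffusion-2-base", tmp)
    not_cached = is_model_cached("google/ddpm-ema-bedroom-256", tmp)

print(cached, not_cached)  # True False
```

With a helper like this, each model in the notebook would need only one call, for example `model_is_cached = is_model_cached(repo_id, cache_dir)`.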

In [None]:
# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pipeline
# Note: If you encounter CUDA out-of-memory errors, you can reduce memory usage by:
# 1. Keeping torch.float16 (as below); torch.float32 needs roughly twice the memory
# 2. Generating fewer images per call (a lower num_images_per_prompt)
# 3. Enabling attention slicing: pipe.enable_attention_slicing()
pipe = StableDiffusionPipeline.from_pretrained(
    repo_id, 
    torch_dtype=torch.float16, 
    variant="fp16", 
    cache_dir=cache_dir,
    local_files_only=model_is_cached
)

# Set the seed for reproducibility
generator = torch.Generator(device=device).manual_seed(42)

pipe = pipe.to(device)

> **Note on PyTorch Generators and Randomness**
>
> PyTorch `Generator` objects provide a way to produce sequences of random numbers in a **deterministic** ("non-random") order—if you set the same seed, you’ll always get the same sequence of numbers *from that generator*.  
>
> However, every time you generate an image, the generator “rolls” (uses up) some of its numbers depending on how many random draws the model needs. If you call the pipeline multiple times in a row using the **same generator**, each call will consume random numbers in sequence.  
>
> As a result, **even with the same prompt and generator/seed**, you may get different images on subsequent calls—unless you re-initialize or reset the generator’s seed before each run.  
>
> To consistently reproduce the same image for a prompt, always re-create or re-seed your generator in the exact same way before each generation.
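
The behaviour described in the note can be illustrated without a GPU. The sketch below uses Python's standard-library `random.Random` as a stand-in for `torch.Generator` (an assumption made so the example runs anywhere); the idea of a seeded generator whose state advances with every draw is the same.

```python
import random

# Stand-in for torch.Generator: a seeded pseudo-random number generator.
gen = random.Random(42)
first_call = [gen.gauss(0, 1) for _ in range(3)]   # "first image" consumes 3 draws
second_call = [gen.gauss(0, 1) for _ in range(3)]  # "second image" gets the next 3

# Re-creating the generator with the same seed restarts the sequence.
gen = random.Random(42)
after_reseed = [gen.gauss(0, 1) for _ in range(3)]

print(first_call == second_call)   # False: the state advanced between calls
print(first_call == after_reseed)  # True: same seed, same starting state
```

In the pipeline, the draws are the initial latent noise (and any noise the scheduler samples), so a second call with the same generator object starts from different noise.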

In [None]:
prompt = "A puppy riding a skateboard in Times Square."

In [None]:
# Using 40 steps provides a good balance between quality and speed
# (Typical range: 20-50 for fast generation, 50-150 for high quality)
images = pipe(
    prompt,                # What you want the model to create
    num_inference_steps=40, # How many denoising steps to use (more steps = more detail/compute)
    generator=generator    # Ensures reproducible noise/randomness
).images

In [None]:
images[0].show()

In [None]:
images = pipe(
    prompt,                # What you want the model to create
    num_inference_steps=40, # How many denoising steps to use (more steps = more detail/compute)
    generator=generator    # Ensures reproducible noise/randomness
).images

In [None]:
images[0].show()

Notice how the images are different, even with the same prompt! Each call advances the generator's internal state, so the second call starts from different random noise. However, if you re-create the generator before every pipeline call, the results will be **the same**. Let's try it out!

In [None]:
# First run
generator = torch.Generator(device=device).manual_seed(42)
images = pipe(
    prompt,                # What you want the model to create
    num_inference_steps=40, # How many denoising steps to use (more steps = more detail/compute)
    generator=generator    # Ensures reproducible noise/randomness
).images
images[0].show()

In [None]:
# Second run
generator = torch.Generator(device=device).manual_seed(42)
images = pipe(
    prompt,                # What you want the model to create
    num_inference_steps=40, # How many denoising steps to use (more steps = more detail/compute)
    generator=generator    # Ensures reproducible noise/randomness
).images
images[0].show()

**Explanation of `.images[0]`:**

Image generation pipelines can produce more than one image per call (for example, by passing `num_images_per_prompt=3`), so they always return a list for consistency, even when you generate a single image. Let's try it out!

In [None]:
image_list = pipe(
    prompt,                # What you want the model to create
    num_inference_steps=40, # How many denoising steps to use (more steps = more detail/compute)
    generator=generator,   # Ensures reproducible noise/randomness
    num_images_per_prompt = 3
).images

In [None]:
# Set up a row of 3 subplots
fig, axs = plt.subplots(1, 3, figsize=(15,5))

# Display each image
for idx, img in enumerate(image_list):
    axs[idx].imshow(img)
    axs[idx].axis('off')  # Hide axis

plt.tight_layout()
plt.show()

### 1.3 Parameters in Diffusion Models


#### 1.3.1 `num_inference_steps`
This parameter controls how many **denoising steps** the model takes. More steps typically yield more detailed, higher-quality images, but generation time grows roughly in proportion to the step count, and the visible improvement tends to level off.
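
To make "fewer steps" concrete: the model is trained on a fixed ladder of noise levels, and at inference the scheduler visits only a subset of them. The sketch below assumes 1,000 training noise levels and simple even spacing; `spaced_timesteps` is a hypothetical helper, and real schedulers add offsets and other details on top of this idea.

```python
def spaced_timesteps(num_inference_steps: int, num_train_timesteps: int = 1000) -> list[int]:
    """Pick evenly spaced timesteps, ordered from noisiest to cleanest.

    Simplified illustration only; not the exact rule any particular
    diffusers scheduler uses.
    """
    step_ratio = num_train_timesteps // num_inference_steps
    ascending = [i * step_ratio for i in range(num_inference_steps)]
    return ascending[::-1]


for n in (10, 50, 200):
    ts = spaced_timesteps(n)
    print(f"{n:3d} steps -> first {ts[:3]}, last {ts[-1]}, gap {ts[0] - ts[1]}")
```

With 10 steps, each update has to bridge a gap of 100 noise levels; with 200 steps the gap is only 5, which is why more steps give a smoother, more detailed trajectory at a higher compute cost.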

In [None]:
prompt = "a cute dog with a red bandana, sitting in a lush park"

# Re-seed before every call so all three runs start from the same noise
# and differ only in the number of denoising steps.

# Fewer steps (fast, less detail)
generator = torch.Generator(device=device).manual_seed(42)
image_fast = pipe(prompt, num_inference_steps=10, generator=generator).images[0]

# Default/standard (balanced quality)
generator = torch.Generator(device=device).manual_seed(42)
image_standard = pipe(prompt, num_inference_steps=50, generator=generator).images[0]

# More steps (slower, more detail)
generator = torch.Generator(device=device).manual_seed(42)
image_high_quality = pipe(prompt, num_inference_steps=200, generator=generator).images[0]

In [None]:
images = [image_fast, image_standard, image_high_quality]
titles = ["10 steps (fast)", "50 steps (standard)", "200 steps (high quality)"]

fig, axs = plt.subplots(1, 3, figsize=(15,5))
for idx, (img, title) in enumerate(zip(images, titles)):
    axs[idx].imshow(img)
    axs[idx].set_title(title)
    axs[idx].axis('off')
plt.tight_layout()
plt.show()

#### 1.3.2 `guidance_scale` (Classifier-Free Guidance Strength)

This parameter controls how closely the generated image should follow your prompt.

- **Lower values** (e.g., 5-7): Allow more creativity and variation, but may stray from your prompt
- **Medium values** (e.g., 7-10): Balanced adherence to prompt (default is typically 7.5)
- **Higher values** (e.g., 10-15): Force the image to match your prompt more literally
- **Too high** (e.g., >20): Can make images look unnatural or oversaturated

**Tip**: Start with 7.5 and adjust based on whether you want more creative freedom or stricter prompt adherence.
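
Under the hood, classifier-free guidance combines two noise predictions at every step: one made with the prompt and one made without it. The sketch below shows that combination on made-up numbers; `apply_guidance` is a hypothetical name, and the three-element lists stand in for full latent tensors.

```python
def apply_guidance(noise_uncond: list[float], noise_cond: list[float], guidance_scale: float) -> list[float]:
    """Classifier-free guidance: start at the unconditional prediction and
    move toward (and, for scales above 1, past) the prompt-conditioned one."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


# Toy noise predictions for three latent values (made-up numbers).
noise_uncond = [0.0, 1.0, -1.0]  # prediction with an empty prompt
noise_cond = [1.0, 1.0, 0.0]     # prediction with the text prompt

for scale in (0.0, 1.0, 7.5):
    print(scale, apply_guidance(noise_uncond, noise_cond, scale))
```

A scale of 1 reproduces the prompt-conditioned prediction, while larger values exaggerate the difference between the two predictions, which is why very high settings can look oversaturated. When a negative prompt is supplied, it takes the place of the empty prompt in the unconditional branch, so the result is pushed away from it.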

In [None]:
guidance_scales = [5, 7.5, 12]
images = []

prompt = "A surreal landscape with floating clocks, melting trees, and a purple sky, in the style of Salvador Dalí"
# Generate one image per guidance scale
for gs in guidance_scales:
    # Re-create the generator each time so every run starts from the same
    # noise and only guidance_scale changes
    generator = torch.Generator(device=device).manual_seed(42)
    img = pipe(prompt, generator=generator, guidance_scale=gs).images[0]
    images.append(img)

# Plot images side by side
fig, axs = plt.subplots(1, len(images), figsize=(5 * len(images), 5))
for idx, (img, gs) in enumerate(zip(images, guidance_scales)):
    axs[idx].imshow(img)
    axs[idx].set_title(f"guidance_scale = {gs}")
    axs[idx].axis('off')
plt.tight_layout()
plt.show()

#### 1.3.3 `negative_prompt`

This optional parameter tells the model what you do **NOT** want to see in the image.
For example, image generators often have trouble drawing hands, so you can list "deformed hands" alongside other unwanted elements:

In [None]:
prompt = "A group of realistic teddy bears eating pizza at a birthday party"
negative_prompt = "pepperoni, deformed hands, extra limbs, blurry, out of focus, text, watermark"

# Generate without negative prompt
generator = torch.Generator(device=device).manual_seed(42)
result_no_neg = pipe(prompt, generator=generator, num_inference_steps=30)

# Re-seed to ensure the same starting noise for fair comparison
generator = torch.Generator(device=device).manual_seed(42)
result_with_neg = pipe(prompt, negative_prompt=negative_prompt, generator=generator, num_inference_steps=30)

imgs = [result_no_neg.images[0], result_with_neg.images[0]]
titles = ['Without negative_prompt', 'With negative_prompt']

fig, axs = plt.subplots(1, 2, figsize=(10, 5))
for i, (img, title) in enumerate(zip(imgs, titles)):
    axs[i].imshow(img)
    axs[i].set_title(title)
    axs[i].axis('off')
plt.tight_layout()
plt.show()

### 1.4 Saving Intermediate Denoising Steps in the Stable Diffusion Pipeline

In many research and creative workflows, it’s useful to inspect or save the intermediate images produced during the denoising process of a diffusion model.
This lets you better understand, debug, or visualize how your prompt is transformed from noise into a coherent image.

With the Hugging Face diffusers library, you can easily capture these steps using the pipeline’s callback functionality.

#### 1.4.1 How the Callback Mechanism Works

The diffusion pipeline allows you to supply a callback function, which is called after each denoising step.
The callback receives:

- The current step index
- The current timestep
- The current latent representation (not yet decoded)
  
You can use this callback to:

- Save the latent at each step
- Decode it into an image (using the VAE) if you wish
- Visualize the denoising trajectory
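
The callback contract described above can be sketched with a toy loop. Everything here is a hypothetical stand-in: `toy_denoising_loop` is not the diffusers implementation, and the "latent" is a single float rather than a tensor. It only illustrates when the callback fires and what it receives.

```python
from typing import Callable, Optional


def toy_denoising_loop(
    timesteps: list[int],
    callback: Optional[Callable[[int, int, float], None]] = None,
    callback_steps: int = 1,
) -> float:
    """Hypothetical stand-in for a pipeline's denoising loop.

    The latent is halved at every step; a real pipeline would run the
    U-Net and scheduler here instead.
    """
    latent = 1.0
    for step_index, timestep in enumerate(timesteps):
        latent *= 0.5  # placeholder for one denoising update
        if callback is not None and step_index % callback_steps == 0:
            callback(step_index, timestep, latent)
    return latent


history = []


def record(step_index: int, timestep: int, latent: float) -> None:
    """Collect whatever the loop hands over at each callback invocation."""
    history.append((step_index, timestep, latent))


toy_denoising_loop([750, 500, 250, 0], callback=record, callback_steps=2)
print(history)  # [(0, 750, 0.5), (2, 250, 0.125)]
```

Setting `callback_steps=1` records every step, which is what the notebook does so that all 40 intermediate images can be plotted.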

In [None]:
# List to store intermediate images
all_steps = []

def save_intermediate_steps(step_index: int, timestep: int, latents: torch.Tensor):
    """
    Callback for saving intermediate images during denoising in Stable Diffusion v2.
    """
    with torch.no_grad():
        # 0.18215 is the VAE scaling factor used by Stable Diffusion 2
        # This rescales latents back to the range expected by the VAE decoder
        latents_input = latents / 0.18215
        image = pipe.vae.decode(latents_input).sample
        image = (image / 2 + 0.5).clamp(0, 1)  # Normalize to [0,1]
        image = image.cpu().permute(0, 2, 3, 1).numpy()[0]
    pil_image = Image.fromarray((image * 255).astype("uint8"))
    # Save to directory (optional)
    outdir = "intermediate_steps"
    os.makedirs(outdir, exist_ok=True)
    pil_image.save(f"{outdir}/step_{step_index:02d}.png")
    # Also collect for grid plotting
    all_steps.append((step_index, pil_image))

generator = torch.Generator(device).manual_seed(42)

prompt = "A puppy dog riding a skateboard in Times Square"
negative_prompt = "cars"

# Clear previous
all_steps.clear()


In [None]:
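# Run the pipe with the callback
# Note: `callback` and `callback_steps` are the legacy callback arguments; newer
# diffusers releases deprecate them in favor of `callback_on_step_end`.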
pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=40,
    generator=generator,
    callback=save_intermediate_steps,
    callback_steps=1  # Every step
)
# ----- Plot all 40 steps in an 8×5 grid -----
n_rows, n_cols = 8, 5  # 8 rows, 5 columns
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 24))
fig.suptitle('Stable Diffusion 2: Denoising Steps', fontsize=18)

for idx, (step, img) in enumerate(all_steps):
    row, col = divmod(idx, n_cols)
    axes[row, col].imshow(img)
    axes[row, col].set_title(f"Step {step+1}", fontsize=8)
    axes[row, col].axis('off')

plt.tight_layout()
plt.show()

### 1.5 What is a DDPM?

A **DDPM (Denoising Diffusion Probabilistic Model)** is a type of generative model that creates images through a two-phase process:

#### Forward Process (Diffusion)
- Gradually adds Gaussian noise to an image over many steps (typically 1000)
- Eventually transforms any image into pure random noise
- This process is fixed and doesn't require learning
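
The forward process has a convenient closed form: the noisy value at step t can be sampled directly from the clean value as x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, where ᾱ_t is the running product of (1 − β). The sketch below assumes the linear β schedule from the original DDPM paper (1e-4 to 0.02 over 1,000 steps) and uses a single pixel value instead of a full image.

```python
import math
import random

NUM_STEPS = 1000
# Assumed schedule: linear betas from 1e-4 to 0.02, as in the DDPM paper.
betas = [1e-4 + (0.02 - 1e-4) * t / (NUM_STEPS - 1) for t in range(NUM_STEPS)]

# alpha_bar[t] = product of (1 - beta) up to step t: the share of signal variance left.
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)


def add_noise(x0: float, t: int, rng: random.Random) -> float:
    """Sample x_t directly from x_0 (closed form of the forward process)."""
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(alpha_bar[t]) * x0 + math.sqrt(1.0 - alpha_bar[t]) * eps


rng = random.Random(0)
pixel = 0.8  # one pixel value, scaled to [-1, 1]
for t in (0, 250, 500, 999):
    kept = math.sqrt(alpha_bar[t])
    print(f"t={t:4d}  signal kept={kept:.3f}  x_t={add_noise(pixel, t, rng):+.3f}")
```

The "signal kept" column shrinks toward zero as t grows, which is the sense in which any image eventually becomes pure noise. Nothing here is learned: the schedule is fixed in advance.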

#### Reverse Process (Denoising)
- Learns to reverse the forward process step by step
- A neural network is trained to predict and remove noise at each step
- Starting from pure noise, it gradually denoises to generate a new image
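
The reverse process can be sketched on a single pixel as well. The example below is an illustration under strong assumptions: it reuses the linear β schedule from the DDPM paper, and it replaces the trained neural network with an "oracle" that already knows the clean value, so it shows the mechanics of the update rule rather than any learning.

```python
import math
import random

NUM_STEPS = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (NUM_STEPS - 1) for t in range(NUM_STEPS)]
alphas = [1.0 - b for b in betas]
alpha_bar, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bar.append(running)

TARGET = 0.8  # the "clean image": a single pixel value


def oracle_noise(x_t: float, t: int) -> float:
    """Stand-in for the trained network: returns the exact noise in x_t.

    It cheats by knowing TARGET; a real model has to learn this from data.
    """
    return (x_t - math.sqrt(alpha_bar[t]) * TARGET) / math.sqrt(1.0 - alpha_bar[t])


rng = random.Random(0)
x = rng.gauss(0.0, 1.0)  # start from pure noise
for t in reversed(range(NUM_STEPS)):
    eps_hat = oracle_noise(x, t)
    # Remove this step's share of the predicted noise (posterior mean).
    mean = (x - betas[t] / math.sqrt(1.0 - alpha_bar[t]) * eps_hat) / math.sqrt(alphas[t])
    # Probabilistic part: add fresh noise at every step except the last.
    noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0
    x = mean + math.sqrt(betas[t]) * noise

print(f"final sample: {x:.4f}  (target: {TARGET})")
```

Because the oracle is perfect, the chain lands on the target value. A trained network only approximates the noise, and it is the fresh noise injected at each step that lets the same model produce different images from different seeds.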

#### Key Characteristics:
- **Probabilistic**: Each denoising step involves sampling from a probability distribution
- **Iterative**: Requires many small denoising steps (unlike GANs, which generate an image in a single forward pass)
- **Stable Training**: Easier to train than GANs, with fewer stability issues
- **High Quality**: Produces very high-quality, diverse images

The mathematical foundation involves learning to approximate the reverse of a Markov chain that gradually adds noise to data.

---

## 2 - Comparing Pixel-space DDPM and Latent-space Stable Diffusion: How Does Denoising Progress Look?

Generative diffusion models can operate either on images directly (pixel space) or on compressed representations of images (latent space).
Understanding the intermediate steps of each approach is highly instructive for anyone learning about generative models.

#### Pixel-space DDPMs
- Operate directly on RGB image pixels
- The denoising process is intuitive—images become less noisy and more detailed step by step, and every intermediate step is a valid, viewable image
- This makes pixel-space DDPMs ideal for educational demos about the core idea behind denoising diffusion
- **Drawback**: Computationally expensive due to working with high-dimensional pixel data

#### Latent Diffusion Models (like Stable Diffusion)
- Operate on compressed codes ("latents") learned from image data
- Intermediate outputs may not look like natural images—structure becomes visible only in late steps
- This makes them **much more efficient and scalable**, but less intuitive to visualize at each step
- Use an autoencoder to compress images to latent space and decode back to pixels
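
A quick back-of-the-envelope calculation shows where the efficiency comes from. The sizes below are assumptions for illustration: a 512×512 RGB image and an autoencoder that downsamples each side by 8 into 4 latent channels, which matches the usual Stable Diffusion setup.

```python
# Assumed sizes: 512x512 RGB output, VAE that downsamples by 8 into 4 latent channels.
height, width, rgb_channels = 512, 512, 3
downsample, latent_channels = 8, 4

pixel_values = height * width * rgb_channels
latent_values = (height // downsample) * (width // downsample) * latent_channels

print(pixel_values)                   # 786432
print(latent_values)                  # 16384
print(pixel_values // latent_values)  # 48
```

Under these assumptions, every denoising step works on 48 times fewer values than it would in pixel space, and the autoencoder's decoder only has to run once at the end to turn the final latent into an image.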

In this section, you'll see both processes side by side, using classic DDPM (on pixels) and Stable Diffusion (on latents).

**You may see a message about .safetensors not being found and the model using .bin weights instead.**

This is a normal part of loading older Hugging Face models and does not affect their functionality.
You can safely ignore this warning. 

In [None]:
from diffusers import DDPMPipeline
from helper_utils import visualize_ddpm_denoising

In [None]:
# Define model and cache directory
model_id = "google/ddpm-ema-bedroom-256"  # Bedroom images (256x256)
cache_dir = os.path.join(os.getcwd(), './models')
model_cache_dir_name = f"models--{model_id.replace('/', '--')}"
model_cache_path = os.path.join(cache_dir, model_cache_dir_name)

# Check if this directory exists. If it does, the model is cached.
model_is_cached = os.path.isdir(model_cache_path)

if model_is_cached:
    print("Model cache found")
    print("Setting local_files_only=True.")
else:
    print("Model cache not found")
    print("Setting local_files_only=False (will attempt download).")

In [None]:
# Load the DDPM pipeline (from the local cache if available, otherwise download it)
ddpm_pipeline = DDPMPipeline.from_pretrained(
    model_id, 
    cache_dir=cache_dir, 
    local_files_only=model_is_cached
).to(device)

# Using 1000 steps for highest quality DDPM generation
# (DDPM typically requires more steps than Stable Diffusion for best results)
num_inference_steps = 1000

print(f"Using model: {model_id}")
print(f"Image resolution: 256x256 pixels")
print(f"Steps: {num_inference_steps}")

# Access the model and scheduler
model = ddpm_pipeline.unet
scheduler = ddpm_pipeline.scheduler

# Set up for visualization
scheduler.set_timesteps(num_inference_steps)

In [None]:
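# --- Run pixel-space DDPM comparison ---
# Override the 1000-step setting from the previous cell: 100 steps keep this
# demo fast, at some cost in image quality.
num_inference_steps = 100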
print("Generating DDPM denoising comparison...")
gradual_images, full_removal_images = visualize_ddpm_denoising(ddpm_pipeline, num_inference_steps=num_inference_steps)

In [None]:
num_splits = 5
# Use the actual step IDs saved
actual_step_ids = [gi[0] for gi in gradual_images]
num_inference_steps = actual_step_ids[-1]  # last step
step_indices = [int(np.round(i * num_inference_steps / (num_splits - 1))) for i in range(num_splits - 1)]
step_indices.append(num_inference_steps)

print("Plotted step indices:", step_indices)

# Speed up access: build {step: image} dictionaries
grad_dict = {s: img for (s, img) in gradual_images}
full_dict = {s: img for (s, img) in full_removal_images}

# Use closest available step if step is not present
def get_closest_img(dct, step):
    # Find the closest available key in the dict
    best = min(dct.keys(), key=lambda k: abs(k-step))
    return dct[best]

images_full = [get_closest_img(full_dict, s) for s in step_indices]
images_grad = [get_closest_img(grad_dict, s) for s in step_indices]
final_img = get_closest_img(grad_dict, num_inference_steps)

# Plot
fig, axes = plt.subplots(2, len(step_indices), figsize=(3.5 * len(step_indices), 6))
for i, (img, s) in enumerate(zip(images_full, step_indices)):
    axes[0, i].imshow(img)
    axes[0, i].set_title(f"Step {s}: All noise removed")
    axes[0, i].axis('off')
for i, (img, s) in enumerate(zip(images_grad, step_indices)):
    axes[1, i].imshow(img)
    axes[1, i].set_title("Final Result" if i == len(step_indices)-1 else f"Step {s}: Gradual")
    axes[1, i].axis('off')
plt.tight_layout()
plt.show()

**Notice how:**
- The "All noise removed" images start out blurry and imperfect at early steps, and approach the final image only near the very end.
- "Gradual" denoising images sharpen step by step, with the final result much cleaner.
- This demonstrates why diffusion models remove noise one step at a time, rather than all at once.
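
One way to see why a one-shot estimate is unreliable at early steps: if the "All noise removed" row is the usual estimate x̂_0 = (x_t − sqrt(1 − ᾱ_t)·ε̂) / sqrt(ᾱ_t) (an assumption, since the helper's code is not shown here), then any error in the predicted noise ε̂ is multiplied by sqrt(1 − ᾱ_t) / sqrt(ᾱ_t). The sketch below evaluates that factor under the linear β schedule from the DDPM paper, which is also an assumption.

```python
import math

NUM_STEPS = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (NUM_STEPS - 1) for t in range(NUM_STEPS)]
alpha_bar, running = [], 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bar.append(running)


def one_shot_error(t: int, noise_error: float) -> float:
    """Error in x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)
    caused by an error of `noise_error` in the predicted noise eps_hat."""
    return math.sqrt(1.0 - alpha_bar[t]) / math.sqrt(alpha_bar[t]) * noise_error


for t in (999, 750, 500, 250, 0):
    print(f"t={t:3d}  error in one-shot estimate: {one_shot_error(t, 0.01):.4f}")
```

At the noisiest timestep, a 0.01 error in the predicted noise becomes an error larger than the whole [-1, 1] pixel range, while near the end it is negligible. Taking many small steps lets the model correct itself as the noise level drops.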

## 3 - Playing with the Diffusion Model

Now it's your turn to experiment! The widget below allows you to easily generate images with different prompts and parameters.

**Suggestions to try:**
- **Different subjects**: animals, landscapes, objects, fantasy creatures
- **Art styles**: "in the style of Van Gogh", "pixel art", "watercolor painting", "3D render"
- **Lighting/mood**: "sunset lighting", "dramatic shadows", "soft pastel colors"
- **Negative prompts**: Try adding "blurry, distorted, low quality" to improve results
- **Parameter experimentation**: 
  - Low guidance_scale (3-5) for creative, artistic results
  - High guidance_scale (12-15) for literal interpretation
  - More steps (100+) for higher detail

**Common issues:**
- **Blurry images**: Try increasing `num_inference_steps` or adding negative prompts
- **Doesn't match prompt**: Increase `guidance_scale`
- **Unnatural colors/artifacts**: Lower `guidance_scale` or try different seeds
- **Slow generation**: Reduce `num_inference_steps` (try 25-30 for faster results)

Run the cell below to load the interactive widget!

In [None]:
from helper_utils import load_widget

In [None]:
load_widget(pipe)


## 4 - Conclusion

In this lab, you:
- Explored how diffusion models generate images step by step from noise.
- Compared classic DDPMs (in pixel space) with modern Stable Diffusion (in latent space).
- Used PyTorch and the Diffusers library to experiment with generation parameters and observe denoising in action.
- Used an interactive widget to generate images from your own prompts and parameters.

Continue to try new prompts, tweak settings, and see what amazing images you can create!