radroof22/ddpm_implementation

Where Does Classifier-Free Guidance Act?

An Empirical Analysis of Conditioning in Diffusion Models

Author: Rohan Mehta
Date: February 2026
Code: GitHub Link


Abstract

This work presents an investigation into the internal mechanisms of classifier-free guidance (CFG) in denoising diffusion probabilistic models (DDPMs). Through activation analysis, channel-wise comparisons, and ablation studies, we identify that class conditioning primarily affects the bottleneck region of the U-Net architecture, with early encoder and late decoder layers showing minimal sensitivity to conditioning. This finding suggests potential architectural simplifications and provides insight into how diffusion models incorporate semantic information.


1. Introduction

Diffusion models are probabilistic generative models that generate data by learning to reverse a gradual noising process. They iteratively denoise random noise until the result resembles the training data. Unlike autoregressive Transformer models, they do not generate output token by token: every denoising step updates the whole sample in parallel, although sampling still requires many sequential steps.

Classifier-free guidance (CFG) lets ML practitioners control the output of a diffusion model without training a separate classifier. Unlike classifier guidance, which needs a second model trained on noisy images, CFG uses a single network that receives a class embedding as an extra input.

The focus of this investigation is to identify where the conditioning really acts. Does the conditioning act across the entire model, or is it specialized in a certain region, implying most of the class-specific features are controlled in only a handful of layers?


2. Background

2.1 Denoising Diffusion Probabilistic Models

The forward process starts with clean data and iteratively adds Gaussian noise to the image. Because the accumulated noise is itself Gaussian, we can use the reparameterization trick to sample the noisy image at any timestep directly from the clean image, rather than stepping through every intermediate noise level. By the last timestep, the sample is approximately pure Gaussian noise. This procedure is fixed and has no learnable parameters.

But how do we generate usable data? The reverse process! The model learns to iteratively denoise the image from $x_T$ to $x_0$ by predicting the noise at each timestep. It is trained to match the noise that was actually added in the forward process, so it can subtract the correct amount of noise to generate a valid sample.

These denoising steps aren't trying to arrive at a specific image but rather iteratively move toward a region of the data manifold where the original training data is more likely to exist. The predicted noise is proportional to the negative of the score function (the gradient of the log probability density of the noisy data), so each step moves in the direction where data is more likely to occur.
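The closed-form jump to an arbitrary timestep can be sketched as follows. This is a minimal NumPy illustration, not the repository's code: the linear beta schedule from 1e-4 to 0.02 and the names `make_alpha_bar` and `q_sample` are assumptions, since the write-up only states that 1000 timesteps are used.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t from q(x_t | x_0) in one step; returns (x_t, noise)."""
    noise = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise
    return x_t, noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))  # stand-in for a CIFAR-10 image
x_mid, eps = q_sample(x0, 500, alpha_bar, rng)  # mid-diffusion sample
x_last, _ = q_sample(x0, 999, alpha_bar, rng)   # almost pure noise
```

Under this schedule the signal coefficient at the final timestep is very small, which is why the last sample is treated as Gaussian noise.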
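The link between the predicted noise and the score can be made explicit. The notation $\bar{\alpha}_t$ (the cumulative signal coefficient of the forward process) and $\epsilon_\theta$ (the network's noise prediction) is the standard DDPM notation and is an assumption here, since the write-up does not define it. With the forward process written as

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

the conditional density $q(x_t \mid x_0)$ is Gaussian with mean $\sqrt{\bar{\alpha}_t}\, x_0$ and variance $(1 - \bar{\alpha}_t) I$, so its score is

$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1 - \bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}$$

A network trained to predict $\epsilon$ therefore approximates a rescaled, sign-flipped score: $\epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar{\alpha}_t}\, \nabla_{x_t} \log q(x_t)$.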

2.2 Classifier-Free Guidance

The issue with the normal DDPM procedure is that we have no control over what it outputs. The model moves the sample toward a region of the data manifold where the training data has higher density, but what if that region is closer to cats when you want dogs? The simplest way to gain control is to add an embedding for each output class in your data, but how do you integrate these embeddings into the model and make use of them at sampling time?

During training, we randomly drop the class label with some probability (e.g., 10%), replacing it with a null token. This trains the model to produce both conditional and unconditional predictions. At inference time, we compute both and combine them:

$$\tilde{\epsilon} = \epsilon_{\text{uncond}} + w \cdot (\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$$

where $w$ is the guidance scale. When $w = 1$, this reduces to standard conditional generation. When $w > 1$, we amplify the difference between the conditional and unconditional predictions, pushing the output more strongly toward the conditioned class.

This works because $(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$ captures the "direction" that class conditioning adds to the prediction. Scaling this difference by $w > 1$ extrapolates beyond what the model learned, producing higher-fidelity class-specific samples at the cost of some diversity.
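The guidance formula is a one-line extrapolation, sketched below in NumPy. The function name `cfg_combine` and the random arrays standing in for the two noise predictions are illustrative assumptions; in practice both predictions would come from the same network, once with the class embedding and once with the null token.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((3, 32, 32))  # stand-in for the null-token prediction
eps_cond = rng.standard_normal((3, 32, 32))    # stand-in for the class-conditioned prediction

eps_w0 = cfg_combine(eps_uncond, eps_cond, 0.0)  # purely unconditional
eps_w1 = cfg_combine(eps_uncond, eps_cond, 1.0)  # purely conditional
eps_w3 = cfg_combine(eps_uncond, eps_cond, 3.0)  # extrapolated past the conditional prediction
```

With $w = 3$ the result sits two full "conditioning directions" beyond the conditional prediction, which is the amplification described above.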


3. Implementation

3.1 Architecture

| Component | Details |
| --- | --- |
| Base channels | 64 → 128 → 256 |
| Time embedding | Sinusoidal + MLP (256-dim) |
| Class embedding | Learned (256-dim) + null token for CFG |
| Normalization | GroupNorm |
| Resolution | 32×32 (CIFAR-10) |

3.2 Training Details

| Parameter | Value |
| --- | --- |
| Dataset | CIFAR-10 |
| Epochs | 1000 |
| Batch size | 128 |
| Learning rate | 2.5e-4 |
| EMA decay | 0.9999 |
| Timesteps | 1000 |
| Label dropout | 10% |
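The 10% label dropout can be sketched as a per-sample replacement of the class index by a null index before the embedding lookup. This is an illustration under assumptions: representing the null token as an extra index `NULL_CLASS = 10` and the helper name `drop_labels` are hypothetical, and the repository may implement the null token differently.

```python
import numpy as np

NUM_CLASSES = 10          # CIFAR-10
NULL_CLASS = NUM_CLASSES  # hypothetical: an extra embedding row acting as the null token

def drop_labels(labels, p_drop, rng):
    """Replace each label with NULL_CLASS independently with probability p_drop."""
    drop = rng.random(labels.shape) < p_drop
    return np.where(drop, NULL_CLASS, labels)

rng = np.random.default_rng(0)
labels = rng.integers(0, NUM_CLASSES, size=10_000)
train_labels = drop_labels(labels, p_drop=0.10, rng=rng)
drop_rate = float(np.mean(train_labels == NULL_CLASS))  # close to 0.10
```

Because the mask is drawn per sample, a single batch mixes conditional and unconditional examples, so one network learns both predictions.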

3.3 Sample Quality

Figure 1: Generated samples across different guidance scales (w). Higher guidance produces more class-coherent images.

Sample quality was assessed qualitatively; formal metrics (FID, IS) were not computed for this analysis, as the focus was on understanding internal mechanisms rather than optimizing generation quality.


4. Analysis: Where Does Conditioning Act?

4.1 Experimental Setup

To isolate the effect of class conditioning, we compared activations between conditional and unconditional forward passes while controlling for all other variables:

  • Same input noise: Fixed random seed (42) ensures identical input tensor
  • Same timestep: $t = 500$ (mid-diffusion, where both noise and structure are present)
  • Conditional pass: Target class embedding (e.g., "cat", class index 3)
  • Unconditional pass: Null class embedding (the dropout token used during CFG training)

We registered forward hooks on all major layers (conv_in, enc1-3, down1-2, bottleneck, up1-3, dec1-2, conv_out) to capture intermediate activations. For each layer, we computed channel-wise cosine similarity between conditional and unconditional activations to quantify how much each channel changes due to conditioning.
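Once the hooks have captured a pair of activations for a layer, the per-channel comparison can be sketched as below. This is a NumPy illustration rather than the repository's code: the `(C, H, W)` layout, the function name, and the definition of dissimilarity as one minus cosine similarity are assumptions, and the activations are synthetic.

```python
import numpy as np

def channel_cosine_similarity(act_cond, act_uncond, eps=1e-8):
    """Cosine similarity per channel for two (C, H, W) activation maps."""
    c = act_cond.shape[0]
    a = act_cond.reshape(c, -1)    # flatten the spatial dimensions
    b = act_uncond.reshape(c, -1)
    dot = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dot / np.maximum(norms, eps)

rng = np.random.default_rng(42)
act_uncond = rng.standard_normal((8, 4, 4))  # synthetic "unconditional" activations
act_cond = act_uncond.copy()
act_cond[5] = -act_uncond[5]        # channel 5: direction flipped
act_cond[2] = 3.0 * act_uncond[2]   # channel 2: rescaled only

sim = channel_cosine_similarity(act_cond, act_uncond)
dissimilarity = 1.0 - sim                          # assumed definition
affected = np.flatnonzero(dissimilarity > 0.10)    # channels above the 10% threshold
fraction_affected = affected.size / sim.size
```

Note that cosine similarity ignores pure rescaling: channel 2 above is three times larger in the conditional pass but still counts as unchanged, so this metric captures changes in spatial pattern rather than magnitude.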

4.2 Channel-wise Cosine Similarity

Figure 2: Channel dissimilarity across layers. Darker red indicates channels heavily affected by conditioning. The bottleneck region shows the strongest conditioning effect.

Most of the dissimilarity is concentrated in enc3, bottleneck, and up2. This is consistent with our hypothesis that most CFG activity occurs in this region of the U-Net. It is further supported by the per-layer peaks, which also occur in the bottleneck region.

Key Finding: Conditioning primarily affects layers in the bottleneck region (enc3, bottleneck, up2), with 14% of channels showing >10% dissimilarity. Early encoder and late decoder layers remain nearly identical between conditional and unconditional forward passes.

4.3 Most Affected Channels

The visualization below shows that the most affected channels are concentrated in the bottleneck region of the U-Net.

Figure 3: The 20 most conditioning-sensitive channels across the network, ranked by dissimilarity.
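A ranking like this can be produced by pooling the per-channel dissimilarities from every layer and sorting them. The sketch below is illustrative only: `top_k_channels` is a hypothetical helper and the numbers are made up to exercise the code, not measurements from the model.

```python
import numpy as np

def top_k_channels(dissim_by_layer, k):
    """Return the k (layer, channel, dissimilarity) triples with the largest dissimilarity."""
    entries = [
        (layer, ch, float(d))
        for layer, values in dissim_by_layer.items()
        for ch, d in enumerate(values)
    ]
    entries.sort(key=lambda e: e[2], reverse=True)
    return entries[:k]

# Made-up dissimilarities, for illustration only
dissim_by_layer = {
    "enc1": np.array([0.01, 0.02]),
    "bottleneck": np.array([0.30, 0.05, 0.22]),
    "dec1": np.array([0.03]),
}
top = top_k_channels(dissim_by_layer, k=2)
```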


5. Ablation Study: Selective Layer Conditioning

5.1 Methodology

To validate our activation analysis, we performed an ablation study where class conditioning was selectively disabled in different layer groups:

  1. Full Conditioning (baseline)
  2. No Early Encoder (enc1, enc2 use null class embedding)
  3. No Late Decoder (dec2, dec1 use null class embedding)
  4. No Bottleneck Region (enc3, bottleneck, up2 use null class embedding)
  5. No Conditioning (all layers use null class embedding)
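One way to express these five conditions is a lookup that decides, per layer, whether the real class index or the null index is embedded. This is a sketch under assumptions: the dictionary names, the helper `class_index_for_layer`, and the use of index 10 for the null token are hypothetical, and layers outside the three named groups are assumed to keep their conditioning except in the "no conditioning" case.

```python
LAYER_GROUPS = {
    "early_encoder": ("enc1", "enc2"),
    "bottleneck_region": ("enc3", "bottleneck", "up2"),
    "late_decoder": ("dec2", "dec1"),
}

# Which layer groups receive the null embedding under each ablation
ABLATIONS = {
    "full": (),
    "no_early_encoder": ("early_encoder",),
    "no_late_decoder": ("late_decoder",),
    "no_bottleneck_region": ("bottleneck_region",),
    "no_conditioning": None,  # None means every layer uses the null embedding
}

def class_index_for_layer(layer, ablation, class_idx, null_idx):
    """Pick the class index whose embedding a given layer should receive."""
    groups = ABLATIONS[ablation]
    if groups is None:
        return null_idx
    ablated = {name for group in groups for name in LAYER_GROUPS[group]}
    return null_idx if layer in ablated else class_idx

CAT, NULL = 3, 10  # "cat" is class index 3; the null index is an assumption
enc3_idx = class_index_for_layer("enc3", "no_bottleneck_region", CAT, NULL)
enc1_idx = class_index_for_layer("enc1", "no_bottleneck_region", CAT, NULL)
```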

5.2 Results

Figure 4: Generated samples under different ablation conditions for the "cat" class.

Figure 5: Quantified ablation impact (L2 distance from baseline) across multiple classes.

Key Finding: Removing conditioning from the bottleneck region causes a substantially larger deviation from baseline (L2 = 0.131) than removing it from early (L2 = 0.049) or late (L2 = 0.060) layers, roughly two to three times as large. This supports the conclusion that class information is primarily injected and utilized in the bottleneck region.

Note that the visual differences can be subtle at 32×32 resolution; the L2 distances make the gap between ablation conditions easier to see.
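A distance of this kind can be computed by generating each ablated batch from the same starting noise as the baseline and comparing pixel values. The write-up does not state how the L2 distance is normalized, so the per-sample root-mean-square form below, the `(N, C, H, W)` layout, and the name `l2_from_baseline` are all assumptions; the reported numbers may use a different convention.

```python
import numpy as np

def l2_from_baseline(samples, baseline):
    """Mean over the batch of the per-sample RMS pixel difference (assumed normalization)."""
    diff = (samples - baseline).reshape(samples.shape[0], -1)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))

rng = np.random.default_rng(0)
baseline = rng.uniform(-1.0, 1.0, size=(4, 3, 32, 32))  # stand-in for fully conditioned samples
shifted = baseline + 0.05                                # stand-in for an ablated batch

d_same = l2_from_baseline(baseline, baseline)
d_shift = l2_from_baseline(shifted, baseline)
```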


6. Bottleneck Representations

6.1 t-SNE Visualization

t-SNE computes pairwise similarities between points in high-dimensional space, and then creates a low-dimensional map that keeps similar points close together.

Figure 6: t-SNE projection of bottleneck features colored by class. The dashed line shows a linear SVM decision boundary separating animals from vehicles.

  • Silhouette Score: 0.657 (indicates strong clustering quality; range -1 to 1, higher is better)
  • Animals vs Vehicles SVM Accuracy: 98.0% (indicates excellent semantic separation)

Key Finding: The bottleneck learns class-separable representations, with clear clustering by class and semantic grouping (animals vs vehicles).
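For reference, the silhouette score compares each point's mean distance to its own class with its mean distance to the nearest other class. The sketch below is a plain NumPy version using Euclidean distance on synthetic 2-D points. It is not the repository's code, and the write-up does not say whether the score was computed on the raw bottleneck features or on the t-SNE projection.

```python
import numpy as np

def silhouette_score(features, labels):
    """Mean silhouette coefficient with Euclidean distance.

    Assumes at least two classes and at least two points per class.
    """
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    classes = np.unique(labels)
    scores = np.empty(len(features))
    for i in range(len(features)):
        same = labels == labels[i]
        # a: mean distance to the other members of the same class
        a = dists[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other class
        b = min(dists[i, labels == c].mean() for c in classes if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

rng = np.random.default_rng(0)
points = np.concatenate([
    rng.normal(0.0, 0.1, size=(50, 2)),  # tight synthetic cluster around (0, 0)
    rng.normal(5.0, 0.1, size=(50, 2)),  # tight synthetic cluster around (5, 5)
])
labels = np.repeat([0, 1], 50)

score_separated = silhouette_score(points, labels)
score_shuffled = silhouette_score(points, rng.permutation(labels))
```

Well-separated clusters score close to 1, while labels unrelated to the geometry score near 0, which gives a reference scale for the reported 0.657.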


7. Denoising Dynamics

7.1 Activation Evolution Across Timesteps

We would expect the activations to look different depending on the timestep. At higher timesteps the input is closer to random noise, so the activations are more evenly distributed across pixels and look noise-like. At lower timesteps the activations are more spatially structured, as the figure shows.

Figure 7: Activation statistics across timesteps, showing how spatial structure emerges during denoising.


8. Discussion

8.1 Summary of Findings

  1. Conditioning is localized: Class conditioning primarily affects the bottleneck region (enc3, bottleneck, up2), not the full network.

  2. Sparse channel utilization: Only a subset of channels (~14%) show significant sensitivity to conditioning (>10% dissimilarity).

  3. Ablation confirmation: Removing conditioning from early or late layers changes the output far less (L2 = 0.049 and 0.060) than removing it from the bottleneck region (L2 = 0.131).

  4. Semantic organization: Bottleneck representations are class-separable and capture semantic categories.

8.2 Implications

  • Architectural efficiency: Class conditioning could potentially be removed from early encoder and late decoder layers with little change in output, reducing parameters and computation. Confirming that quality is preserved would require formal metrics such as FID, which were not computed here.

  • Understanding CFG: These results provide mechanistic insight into how CFG steers generation: by modulating the compressed semantic representation in the bottleneck rather than low-level features.

8.3 Limitations and Future Work

  • Analysis conducted on CIFAR-10 (32×32); findings should be validated on higher-resolution models
  • Single architecture tested; results may vary with different U-Net configurations
  • Future work: test on text-conditioned models (Stable Diffusion), larger datasets

9. Conclusion

This work provides empirical evidence that classifier-free guidance in diffusion models operates primarily through the bottleneck region of the U-Net architecture. Through activation analysis and ablation studies, we demonstrate that early encoder and late decoder layers show minimal sensitivity to class conditioning, while the compressed bottleneck representation is where semantic information is injected and utilized. These findings contribute to our understanding of how diffusion models incorporate conditioning and suggest potential avenues for architectural optimization.


References

  1. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS.

  2. Ho, J., & Salimans, T. (2022). Classifier-Free Diffusion Guidance. arXiv preprint.

  3. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI.


Appendix

A. Additional Visualizations

(Additional figures available in the code repository)

B. Code Availability

All code for this analysis is available at: [GitHub Repository Link]
