Author: Rohan Mehta
Date: February 2026
Code: GitHub Link
This work presents an investigation into the internal mechanisms of classifier-free guidance (CFG) in denoising diffusion probabilistic models (DDPMs). Through activation analysis, channel-wise comparisons, and ablation studies, we identify that class conditioning primarily affects the bottleneck region of the U-Net architecture, with early encoder and late decoder layers showing minimal sensitivity to conditioning. This finding suggests potential architectural simplifications and provides insight into how diffusion models incorporate semantic information.
Diffusion models are probabilistic generative models that generate data by learning to reverse a gradual noising process. They work by iteratively denoising random noise until they arrive at something that resembles their training data. Unlike autoregressive Transformer models, they do not generate output token by token: the whole sample is refined in parallel at each denoising step, although sampling still requires many such steps.
Classifier-free guidance (CFG) lets ML practitioners steer the output of a diffusion model without training a separate classifier. Unlike two-model approaches (a GAN's generator and discriminator, or classifier guidance with its separately trained classifier), CFG conditions the diffusion model itself on learned class embeddings.
The focus of this investigation is to identify where the conditioning really acts. Does the conditioning act across the entire model, or is it specialized in a certain region, implying most of the class-specific features are controlled in only a handful of layers?
The forward process starts with clean data and iteratively adds Gaussian noise to the image. We can use the reparameterization trick to jump to a specific timestep in the future rather than iteratively stepping up to that noise level. Once we arrive at the last timestep, the output data will be complete Gaussian noise. This procedure is fixed and has no learnable parameters.
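The single-step jump described above can be sketched in a few lines. This is a minimal illustration, not the author's code: the linear beta schedule (1e-4 to 0.02) and the names `make_alpha_bars` and `q_sample` are assumptions, since the source states only that 1000 timesteps are used.

```python
import numpy as np

def make_alpha_bars(num_timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_timesteps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0 in one step, without iterating over 0..t."""
    noise = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    return x_t, noise

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))   # stand-in for a CIFAR-10 image
x_mid, _ = q_sample(x0, 499, alpha_bars, rng)    # 500th step (0-based index 499)
x_last, _ = q_sample(x0, 999, alpha_bars, rng)   # final step: almost pure noise
```

Because `alpha_bars` shrinks toward zero as `t` grows, the signal term fades and the noise term dominates, which is why the last timestep is effectively Gaussian noise.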
But how do we generate usable data? The reverse process! The model learns to iteratively denoise the image from pure Gaussian noise at the final timestep back toward clean data.
These denoising steps aren't trying to arrive at a specific image but rather iteratively move toward a region of the data manifold where the original training data is more likely to exist. This is done by estimating the score function (the gradient of the log probability density with respect to the data), which points in the direction where data is more likely to occur.
The issue with the plain DDPM procedure is that we have no control over what it outputs. The model points you toward a region of the data manifold where the training data is denser, but what if that region is closer to cats when you want dogs? The simplest way to gain control is to add embeddings for the output classes in your data, but how do you integrate them into your model architecture?
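To make the link between denoising and the score concrete, the standard DDPM relation can be written out. The symbols here are assumptions borrowed from the usual DDPM notation and do not appear in the text above: $\bar{\alpha}_t$ is the cumulative noise-schedule product, $\epsilon$ the noise added in the forward process, and $\epsilon_\theta$ the network's noise prediction.

$$
\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1 - \bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}
\quad\Longrightarrow\quad
s_\theta(x_t, t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}
$$

Under this view, a network trained to predict the added noise is, up to a timestep-dependent scale, also estimating the score.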
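During training, we randomly drop the class label with some probability (e.g., 10%), replacing it with a null token. This trains a single model to produce both conditional and unconditional predictions. At inference time, we compute both and combine them:
$$\hat{\epsilon} = \epsilon_\theta(x_t, t, \varnothing) + w \left( \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing) \right)$$
where $\epsilon_\theta$ is the model's noise prediction, $c$ is the class label, $\varnothing$ is the null token, and $w$ is the guidance scale.
This works because the difference between the conditional and unconditional predictions isolates the direction attributable to the class, so scaling it by $w > 1$ pushes samples further toward that class.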
| Component | Details |
|---|---|
| Base channels | 64 → 128 → 256 |
| Time embedding | Sinusoidal + MLP (256-dim) |
| Class embedding | Learned (256-dim) + null token for CFG |
| Normalization | GroupNorm |
| Resolution | 32×32 (CIFAR-10) |

| Parameter | Value |
|---|---|
| Dataset | CIFAR-10 |
| Epochs | 1000 |
| Batch size | 128 |
| Learning rate | 2.5e-4 |
| EMA decay | 0.9999 |
| Timesteps | 1000 |
| Label dropout | 10% |
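The label dropout row above can be illustrated with a short sketch. It assumes the null token is stored as one extra embedding index after the ten CIFAR-10 classes; `NULL_CLASS` and `drop_labels` are hypothetical names, not taken from the repository.

```python
import numpy as np

NUM_CLASSES = 10           # CIFAR-10
NULL_CLASS = NUM_CLASSES   # assumption: extra index reserved for the null token

def drop_labels(labels, drop_prob, rng):
    """Replace each label with NULL_CLASS independently with probability drop_prob."""
    mask = rng.random(labels.shape) < drop_prob
    return np.where(mask, NULL_CLASS, labels)

rng = np.random.default_rng(0)
labels = rng.integers(0, NUM_CLASSES, size=128)            # one batch of 128 labels
train_labels = drop_labels(labels, drop_prob=0.10, rng=rng)
```

Examples whose label was replaced train the unconditional prediction, while the rest train the conditional one, so a single network learns both.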
Figure 1: Generated samples across different guidance scales (w). Higher guidance produces more class-coherent images.
Sample quality was assessed qualitatively; formal metrics (FID, IS) were not computed for this analysis, as the focus was on understanding internal mechanisms rather than optimizing generation quality.
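The guidance scale $w$ varied in Figure 1 enters sampling through the combination of the conditional and unconditional noise predictions. The sketch below assumes the convention in which $w = 0$ gives the unconditional prediction and $w = 1$ the conditional one; Ho and Salimans write the same idea as $(1 + w)\,\epsilon_c - w\,\epsilon_u$, which shifts the scale by one. The function name `cfg_combine` and the toy arrays are illustrative only.

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((3, 32, 32))  # stand-in for the null-token prediction
eps_cond = rng.standard_normal((3, 32, 32))    # stand-in for the class-conditioned prediction

# Sweep a few guidance scales, as in Figure 1 (the values are arbitrary examples).
guided = {w: cfg_combine(eps_cond, eps_uncond, w) for w in (0.0, 1.0, 3.0)}
```

Values of `w` above 1 extrapolate past the conditional prediction, which is what makes samples more class-coherent at higher guidance.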
To isolate the effect of class conditioning, we compared activations between conditional and unconditional forward passes while controlling for all other variables:
- Same input noise: Fixed random seed (42) ensures identical input tensor
- Same timestep: $t = 500$ (mid-diffusion, where both noise and structure are present)
- Conditional pass: Target class embedding (e.g., "cat", class index 3)
- Unconditional pass: Null class embedding (the dropout token used during CFG training)
We registered forward hooks on all major layers (conv_in, enc1-3, down1-2, bottleneck, up1-3, dec1-2, conv_out) to capture intermediate activations. For each layer, we computed channel-wise cosine similarity between conditional and unconditional activations to quantify how much each channel changes due to conditioning.
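The hook-and-compare procedure described above can be sketched with PyTorch as follows. `TinyCondNet` is a hypothetical two-layer stand-in for the U-Net (it omits the timestep input), the null token is assumed to live at embedding index 10, and dissimilarity is assumed to mean one minus cosine similarity; none of these details are confirmed by the source.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCondNet(nn.Module):
    """Hypothetical stand-in for the U-Net: class embedding added after conv_in."""
    def __init__(self, num_classes=10, channels=8):
        super().__init__()
        self.conv_in = nn.Conv2d(3, channels, 3, padding=1)
        self.class_emb = nn.Embedding(num_classes + 1, channels)  # last index = null token
        self.bottleneck = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, y):
        h = self.conv_in(x)
        h = h + self.class_emb(y)[:, :, None, None]
        return self.bottleneck(h)

def capture_activations(model, layer_names, *inputs):
    """Run one forward pass and return {layer_name: output} via forward hooks."""
    acts, handles = {}, []
    modules = dict(model.named_modules())
    for name in layer_names:
        def hook(_module, _inputs, output, name=name):
            acts[name] = output.detach()
        handles.append(modules[name].register_forward_hook(hook))
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        for handle in handles:
            handle.remove()  # always detach hooks so later passes are unaffected
    return acts

def channel_dissimilarity(a, b):
    """1 - cosine similarity per channel for activations shaped (B, C, H, W)."""
    a, b = a.flatten(2), b.flatten(2)  # (B, C, H*W)
    return (1.0 - F.cosine_similarity(a, b, dim=2)).mean(dim=0)  # (C,)

torch.manual_seed(42)
model = TinyCondNet().eval()
x = torch.randn(1, 3, 32, 32)  # same input noise for both passes
layers = ["conv_in", "bottleneck"]
cond = capture_activations(model, layers, x, torch.tensor([3]))     # "cat"
uncond = capture_activations(model, layers, x, torch.tensor([10]))  # null token
dissim = {name: channel_dissimilarity(cond[name], uncond[name]) for name in layers}
frac_sensitive = {name: (d > 0.10).float().mean().item() for name, d in dissim.items()}
```

In this toy model `conv_in` never sees the class embedding, so its dissimilarity is zero, while `bottleneck` differs between the two passes. The same comparison applied per layer to the real network yields the kind of map shown in Figure 2.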
Figure 2: Channel dissimilarity across layers. Darker red indicates channels heavily affected by conditioning. The bottleneck region shows the strongest conditioning effect.
The majority of the dissimilarity is concentrated in enc3, bottleneck, and up2. This is consistent with our hypothesis that most conditioning activity occurs in this region of the U-Net. The per-layer peak dissimilarities, which also occur in the bottleneck region, support the same conclusion.
Key Finding: Conditioning primarily affects layers in the bottleneck region (enc3, bottleneck, up2, dec2), with 14% of channels showing >10% dissimilarity. Early encoder and late decoder layers remain nearly identical between conditional and unconditional forward passes.
From the visualization below, you can see that the changes are occurring in the bottleneck region of the U-Net model.
Figure 3: The 20 most conditioning-sensitive channels across the network, ranked by dissimilarity.
To validate our activation analysis, we performed an ablation study where class conditioning was selectively disabled in different layer groups:
- Full Conditioning (baseline)
- No Early Encoder (enc1, enc2 use null class embedding)
- No Late Decoder (dec2, dec1 use null class embedding)
- No Bottleneck Region (enc3, bottleneck, up2 use null class embedding)
- No Conditioning (all layers use null class embedding)
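One way to organize the five ablation settings above is a per-layer plan that records which embedding each layer receives. This is only a sketch of the bookkeeping: the group and function names are hypothetical, and it covers just the layers named in the list, not how the repository actually routes embeddings.

```python
LAYER_GROUPS = {
    "early_encoder": ["enc1", "enc2"],
    "bottleneck_region": ["enc3", "bottleneck", "up2"],
    "late_decoder": ["dec2", "dec1"],
}
ALL_LAYERS = [name for group in LAYER_GROUPS.values() for name in group]

def conditioning_plan(ablated_groups):
    """Map each layer to 'class' (target embedding) or 'null' (null embedding)."""
    ablated = {name for group in ablated_groups for name in LAYER_GROUPS[group]}
    return {name: ("null" if name in ablated else "class") for name in ALL_LAYERS}

ABLATIONS = {
    "full_conditioning": [],
    "no_early_encoder": ["early_encoder"],
    "no_late_decoder": ["late_decoder"],
    "no_bottleneck_region": ["bottleneck_region"],
    "no_conditioning": list(LAYER_GROUPS),
}
plans = {name: conditioning_plan(groups) for name, groups in ABLATIONS.items()}
```

Each plan can then drive a forward pass in which a layer marked `"null"` is given the null class embedding instead of the target class embedding.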
Figure 4: Generated samples under different ablation conditions for the "cat" class.
Figure 5: Quantified ablation impact (L2 distance from baseline) across multiple classes.
Key Finding: Removing conditioning from the bottleneck region causes substantially larger deviations from baseline (L2 = 0.131) than removing conditioning from early (L2 = 0.049) or late (L2 = 0.060) layers. This supports the conclusion that class information is primarily injected and utilized in the bottleneck region.
One thing to note is that the visual differences can be subtle at 32×32 resolution, but the quantitative analysis clearly shows the difference between ablation conditions.
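The distance-from-baseline measurement can be sketched as below. The source does not say how the L2 distance is normalized, so this version assumes a per-element root-mean-square difference averaged over samples; `l2_from_baseline` and the synthetic arrays are illustrative only.

```python
import numpy as np

def l2_from_baseline(samples, baseline):
    """Per-sample RMS difference from the baseline, averaged over the batch."""
    diff = (samples - baseline).reshape(len(samples), -1)
    return float(np.sqrt((diff ** 2).mean(axis=1)).mean())

rng = np.random.default_rng(0)
baseline = rng.uniform(-1.0, 1.0, size=(16, 3, 32, 32))  # stand-in for fully conditioned samples
perturbation = rng.standard_normal(baseline.shape)

# Two synthetic "ablations" that deviate from the baseline by different amounts.
small = l2_from_baseline(baseline + 0.05 * perturbation, baseline)
large = l2_from_baseline(baseline + 0.15 * perturbation, baseline)
```

For the comparison to be meaningful, the baseline and ablated samples need to share the same initial noise and sampling seed, so that the only difference between them is where conditioning was removed.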
t-SNE computes pairwise similarities between points in high-dimensional space, and then creates a low-dimensional map that keeps similar points close together.
Figure 6: t-SNE projection of bottleneck features colored by class. The dashed line shows a linear SVM decision boundary separating animals from vehicles.
- Silhouette Score: 0.657 (indicates strong clustering quality; range -1 to 1, higher is better)
- Animals vs Vehicles SVM Accuracy: 98.0% (indicates excellent semantic separation)
Key Finding: The bottleneck learns class-separable representations, with clear clustering by class and semantic grouping (animals vs vehicles).
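To make the silhouette score's range of -1 to 1 concrete, here is a minimal NumPy sketch of the metric using Euclidean distance. The function name `mean_silhouette` and the toy points are illustrative; the source does not say which implementation was used or whether the score was computed on raw bottleneck features or on the t-SNE projection.

```python
import numpy as np

def mean_silhouette(features, labels):
    """Mean silhouette coefficient with Euclidean distance (needs >= 2 classes)."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    classes = np.unique(labels)
    scores = np.zeros(len(features))
    for i in range(len(features)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton clusters score 0 by convention
        a = dists[i, same].sum() / (n_same - 1)  # mean distance to own cluster (self excluded)
        b = min(dists[i, labels == c].mean() for c in classes if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Two tight, well-separated toy clusters.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = mean_silhouette(points, np.array([0, 0, 1, 1]))  # labels match the clusters
bad = mean_silhouette(points, np.array([0, 1, 0, 1]))   # labels cut across the clusters
```

Labels that match the geometry give a score near 1, while labels that cut across the clusters give a negative score, so a value of 0.657 sits well toward the well-separated end of the range.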
We expect the activations to look visually different depending on the timestep. At later (higher) timesteps the input is closer to random noise, so the per-pixel activations are more evenly distributed and less structured. At lower timesteps the input retains more image structure, and the activations are correspondingly sharper and more spatially localized.
Figure 7: Activation statistics across timesteps, showing how spatial structure emerges during denoising.
- Conditioning is localized: Class conditioning primarily affects the bottleneck region (enc3, bottleneck, up2), not the full network.
- Sparse channel utilization: Only a subset of channels (~14%) show significant sensitivity to conditioning (>10% dissimilarity).
- Ablation confirmation: Removing conditioning from early or late layers shifts outputs far less (L2 = 0.049 and 0.060) than removing it from the bottleneck region (L2 = 0.131).
- Semantic organization: Bottleneck representations are class-separable and capture semantic categories.
- Architectural efficiency: Class conditioning could potentially be removed from early encoder and late decoder layers with little loss in quality, reducing parameters and computation.
- Understanding CFG: These results provide mechanistic insight into how CFG steers generation: by modulating the compressed semantic representation in the bottleneck rather than low-level features.
- Analysis conducted on CIFAR-10 (32×32); findings should be validated on higher-resolution models
- Single architecture tested; results may vary with different U-Net configurations
- Future work: test on text-conditioned models (Stable Diffusion), larger datasets
This work provides empirical evidence that classifier-free guidance in diffusion models operates primarily through the bottleneck region of the U-Net architecture. Through activation analysis and ablation studies, we demonstrate that early encoder and late decoder layers show minimal sensitivity to class conditioning, while the compressed bottleneck representation is where semantic information is injected and utilized. These findings contribute to our understanding of how diffusion models incorporate conditioning and suggest potential avenues for architectural optimization.
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS.
- Ho, J., & Salimans, T. (2022). Classifier-Free Diffusion Guidance. arXiv preprint.
- Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI.
(Additional figures available in the code repository)
All code for this analysis is available at: [GitHub Repository Link]