# Diffusion "Self-guidance"

https://dave.ml/selfguidance/

**Use-case:** A method to move or resize objects, replace objects with new ones, and change scene backgrounds in diffusion-generated images

**Overview:**

- Uses the attention maps and activations to steer the image generation
  - both are from the attention layers of the diffusion model<sup>1</sup>
  - these allow us to figure out objects, object position/size...
- Using the above we figure out a value to add to the "noise" generated at each step of de-noising<sup>2</sup>
  - since the noise is subtracted from the noisy image to produce a clearer image<sup>3</sup>
  - this extra value effectively changes what the image produced will look like
- Math on the "self-guidance" <sup>4</sup>
  - Object position: Computed as center of mass of relevant attention channels
  - Object size: Spatial sum of thresholded attention channel
  - Object shape: Thresholded attention map itself
  - Object appearance: Combination of thresholded attention and spatial activation maps
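
A minimal sketch of the position, size and shape properties listed above, assuming a single 2D attention map stored as plain Python lists. The function names, the toy map and the hard threshold of 0.5 are illustrative assumptions, not the paper's code; a real implementation would need a differentiable (soft) threshold so gradients can flow back to the latents.

```python
def centroid(attn):
    """Object position: center of mass (row, col) of a 2D attention map."""
    total = sum(sum(row) for row in attn)
    r = sum(i * v for i, row in enumerate(attn) for v in row) / total
    c = sum(j * v for row in attn for j, v in enumerate(row)) / total
    return r, c


def shape_mask(attn, threshold=0.5):
    """Object shape: the thresholded attention map itself (1 = object)."""
    return [[1 if v > threshold else 0 for v in row] for row in attn]


def size(attn, threshold=0.5):
    """Object size: spatial sum of the thresholded map, as a fraction of the area."""
    mask = shape_mask(attn, threshold)
    return sum(map(sum, mask)) / (len(attn) * len(attn[0]))


# Toy 4x4 attention map for one prompt token (assumed values).
attn = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.8, 0.8, 0.0],
    [0.0, 0.8, 0.8, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

print(centroid(attn))    # the blob sits in the middle of the map
print(shape_mask(attn))  # 1s mark the 2x2 blob
print(size(attn))        # 4 of 16 pixels are above the threshold
```

Moving an object then amounts to asking for a different centroid, and resizing it to asking for a different size value.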

<details>
  <summary>Questions</summary>

- Everything on the third bullet above "object position...", no idea what any of those things mean. Add notes? **medium- priority**
- What's an attention map? **medium- priority**
  - Perhaps do notes on the recurrent models chapter up to transformers to solidify the attention concept
  - Then get this attention map info if still needed

</details>

<details>
  <summary>Context/Superscripts</summary>

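1. The original DDPM paper (Ho et al.) only has a few self-attention blocks in its U-Net and no text conditioning, but today's (2024) text-to-image models also use cross-attention layers that tie image regions to prompt tokens. Ex. Stable Diffusion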
2. If I forget this, just skim the diffusion folder notes
3. Not really "clearer" as in there's a final form for the noisy image that's the same all the time. I just mean getting closer to the formed image by removing a bit more noise.
4. I honestly have no idea what these mean, e.g. the "center of mass" of attention channels. I'm considering coming back to this after getting a firmer grasp on attention, then attention maps, and then seeing if I get closer to grokking it. I need a better grasp of attention/recurrent NN fundamentals to get further into AI research regardless of this paper

</details>

## Math

### Typical diffusion (de-noise)

Formula

$$
\hat{\epsilon}_t = (1 + s)\epsilon_{\theta}(z_t; t, y) - s\epsilon_{\theta}(z_t; t, \emptyset)
$$

Overly simplistic pseudo-math:

$$
NoiseToRemove = (NoiseToRemove | TextPrompt) - (NoiseToRemove | NoTextPrompt) 
$$

- $\hat{\epsilon}_t$ is noise to be removed at step t
- $\epsilon_{\theta}$ is the function to predict the noise
  - $\epsilon_{\theta}(z_t; t, y)$ predicts the noise given the conditioning info (from text prompt)
  - $\epsilon_{\theta}(z_t; t, \emptyset)$ predicts the noise given no conditioning
- $z_t$ is the current noisy image
- y is the conditioning (like a text prompt)
- $\emptyset$ is no conditioning (gen diffusion as if no text prompt is there<sup>1</sup>)
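
A small sketch of the formula above, assuming the two noise predictions are already computed and flattened into plain Python lists (the function name is made up for illustration). Expanding the formula gives $(1 + s)a - sb = a + s(a - b)$: the prompt-conditioned prediction plus $s$ times the difference between the conditioned and unconditioned predictions, so a larger $s$ pushes harder in the direction the prompt adds.

```python
def classifier_free_guidance(eps_cond, eps_uncond, s):
    """Combine conditioned and unconditioned noise predictions.

    eps_cond   -- prediction given the text prompt, eps_theta(z_t; t, y)
    eps_uncond -- prediction given no conditioning, eps_theta(z_t; t, null)
    s          -- guidance scale; s = 0 returns eps_cond unchanged
    """
    return [(1 + s) * c - s * u for c, u in zip(eps_cond, eps_uncond)]


eps_cond = [1.0, 2.0]
eps_uncond = [0.5, 2.0]

print(classifier_free_guidance(eps_cond, eps_uncond, s=0.0))  # same as eps_cond
print(classifier_free_guidance(eps_cond, eps_uncond, s=2.0))  # only the first entry moves
```

Only the first entry changes with $s$, because that is the only place where the prompt changes the prediction.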

### ...with self-guidance (de-noise)

Formula:

$$
\hat{\epsilon}_t = (1 + s)\epsilon_{\theta}(z_t; t, y) - s\epsilon_{\theta}(z_t; t, \emptyset) + v\sigma_t * \nabla_{z_t} g(z_t; t, y)
$$

Overly simplistic pseudo-math:

$$
NoiseToRemove = ...OriginalStuff... + NewStuff
$$

$$
NewStuff = GuidanceWeight \times NoiseLevelAtStepT \times GradientOfEditObjective
$$

- $v\sigma_t * \nabla_{z_t} g(z_t; t, y)$ is the only thing that's different (added)
  - calculating this value for each step allows for _self-guidance_ (changing object size, object position, background etc)
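
A rough sketch of how the added term combines with the classifier-free-guidance output, using plain Python lists. The toy energy $g(z) = \frac{1}{2}\sum (z - target)^2$ and both function names are assumptions for illustration only; in the actual method $g$ is defined on attention maps and activations, and its gradient would come from backpropagation through the model rather than a closed form.

```python
def grad_toy_energy(z, target):
    """Gradient of the toy energy g(z) = 0.5 * sum((z - target)^2) w.r.t. z."""
    return [zi - ti for zi, ti in zip(z, target)]


def self_guided_noise(eps_cfg, grad_g, v, sigma_t):
    """Add the self-guidance term v * sigma_t * grad_g to the guided noise.

    eps_cfg -- noise estimate after classifier-free guidance
    grad_g  -- gradient of g with respect to the noisy latent z_t
    v       -- self-guidance weight
    sigma_t -- noise level at step t
    """
    return [e + v * sigma_t * g for e, g in zip(eps_cfg, grad_g)]


z_t = [1.0, 3.0]
grad = grad_toy_energy(z_t, target=[0.0, 1.0])
print(self_guided_noise([0.0, 0.0], grad, v=2.0, sigma_t=0.5))
```

Since the estimated noise is subtracted from $z_t$, adding the gradient of $g$ to it nudges $z_t$ toward lower values of $g$.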

<details>
  <summary>Questions</summary>

- Explain all the variables in $v\sigma_t * \nabla_{z_t} g(z_t; t, y)$ and what they mean/do **medium- priority**
- Why is the first equation noise w/ text conditioning minus noise w/o text conditioning? **low+ priority**
  - How does subtracting noise w/o text conditioning help with finding the correct noise amount to remove at that step?
- In the first equation, why are the scalars (1+s) and s applied? **low- priority**
- In the first equation, what exactly does the math function/equation for $\epsilon_{\theta}$ actually look like? **low+ priority**

</details>

<details>
  <summary>Context/Superscripts</summary>

1. Though it doesn't quite mean an "empty" string. Perhaps digging into this will unveil what it is mathematically (low priority)
</details>

## Code

Core changes:
- Use Stable Diffusion XL: a new class `SelfGuidanceStableDiffusionXLPipeline` inherits from the `StableDiffusionXLPipeline` class
- Override the `__call__` method of `StableDiffusionXLPipeline` in the new class

### Diagram

```mermaid
graph TD
    A[Initialize Pipeline] --> B[Encode Prompt]
    B --> C[Prepare Latents]
    subgraph SDXL.__call__["SDXLPipeline.__call__()"]
        C --> D[Start Denoising Loop]
        D --> E[Predict Noise]
        E --> F[Apply Classifier-Free Guidance]
        F --> G{Self-Guidance Active?}
        G -->|No| H[Update Latents]
        G -->|Yes| I[Compute & Apply Self-Guidance]
        I --> H
        H --> J{More Steps?}
        J -->|Yes| E
        J -->|No| K[Generate Image]
    end
    K --> L[Post-process & Return Image]

    %% Linus-style comments positioned near relevant components
    %% N1[Initialize: Set up your shit properly or get out.]
    N2["The orange parts are only in SelfGuidanceSDXLPipeline.__call__ and not in SDXLPipeline.__call__"]
    %% N3[Self-Guidance: Optional. Use it right or don't use it at all.]
    N4["'Prepare latents' is where we set up the initial noise, e.g. generate the noisy image, that we're going to sculpt into something resembling an image."]
    N5["'Classifier-free guidance' is the standard guidance of SDXLPipeline.__call__ that guides the de-noising to resemble the text prompt"]
    
    %% Positioning with dotted lines
    %% A -.- N1
    G -.- N2
    %% G -.- N3
    C -.- N4
    F -.- N5
    
    classDef selfGuided fill:#ffe6cc,stroke:#333,stroke-width:2px;
    class G,I selfGuided;
    classDef comment fill:#ffe6cc,stroke:#333,stroke-width:1px;
    class N2,N4,N5 comment;
    %% linkStyle 12,13,14 stroke-width:2px,stroke-dasharray: 5,5;
```

**SDXLPipeline.__call__(...)**
- Only used during inference (hint: you can tell because it only contains de-noising!)
- `__call__` runs the de-noising for **all** the timesteps to generate the final image (latent)
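
The loop in the diagram can be sketched as a toy in plain Python. Everything here is a stand-in and an assumption: `toy_predict_noise` replaces the U-Net, the targets are made-up numbers, the guidance energy is a simple squared distance, and the update is a bare subtraction, whereas a real pipeline delegates the latent update to a noise scheduler.

```python
def toy_predict_noise(z, target):
    """Hypothetical stand-in for the U-Net: anything away from `target` counts as noise."""
    return [zi - ti for zi, ti in zip(z, target)]


def denoise(z, prompt_target, steps=50, s=1.0, step_size=0.1,
            guide_target=None, v=1.0):
    """Toy version of the de-noising loop; `z` is a flat list of floats."""
    unconditional_target = [0.0] * len(z)
    for i in range(steps):
        sigma_t = 1.0 - i / steps  # toy noise level, shrinks every step

        # Predict noise with and without the prompt.
        eps_cond = toy_predict_noise(z, prompt_target)
        eps_uncond = toy_predict_noise(z, unconditional_target)

        # Apply classifier-free guidance.
        eps_hat = [(1 + s) * c - s * u for c, u in zip(eps_cond, eps_uncond)]

        # Self-guidance active? Add v * sigma_t * grad g, with the toy energy
        # g(z) = 0.5 * sum((z - guide_target)^2).
        if guide_target is not None:
            grad_g = [zi - gi for zi, gi in zip(z, guide_target)]
            eps_hat = [e + v * sigma_t * g for e, g in zip(eps_hat, grad_g)]

        # Update latents: remove a bit of the estimated noise.
        z = [zi - step_size * e for zi, e in zip(z, eps_hat)]
    return z


plain = denoise([5.0], prompt_target=[1.0])
guided = denoise([5.0], prompt_target=[1.0], guide_target=[0.0])
print(plain, guided)  # the guided run ends closer to guide_target
```

In this toy, self-guidance runs once per step whenever a guidance target is supplied; whether the real pipeline restricts it to a subset of steps is not something these notes establish.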

**Update Latents**
- The latent space contains every possible latent, from pure noise to fully formed imagery; each update moves the current latent one de-noising step away from noise and toward an image
  
<details>
  <summary>Questions</summary>

- when is self-guidance active or not active?
  - do you run self-guidance multiple times during a single timestep?
- explain the steps that are happening in all the orange stuff (self-guidance: how does it do what we need? how do the inputs to `__call__` guide it to do what we need, e.g. move an object, change the background, etc?)

</details>