<a href="https://colab.research.google.com/github/samiha-mahin/An-Image-Processing-Repo/blob/main/Spatial_Attention_In_CNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Spatial Attention in CNNs**



### 🔹 **What Is Spatial Attention in CNNs?**

In CNNs, feature maps contain a lot of information, but not every **region (spatial location)** is equally important.

* Some areas of the image matter more for prediction (like the object region),
* Other areas may be background noise.

**Spatial Attention** helps the model **focus on "where" the important features are located** in the image.



### 🔹 **How It Works**

1. Take the feature maps from a CNN layer (say size $H \times W \times C$).

   * $H, W$ = height & width of the feature map
   * $C$ = number of channels

2. **Compress along channels**

   * Apply **max pooling** and **average pooling** along the channel axis → this gives 2 feature maps of size $H \times W \times 1$.
   * Why?

     * Max pooling highlights the strongest features at each location.
     * Average pooling gives general context.

3. **Combine**

   * Concatenate these 2 maps into an $H \times W \times 2$ tensor → apply a convolution (usually $7 \times 7$) with a single output channel → then a **sigmoid**.
   * This produces a **spatial attention map** of size $H \times W \times 1$.

4. **Re-weight feature maps**

   * Multiply this attention map element-wise with the original feature map (the same spatial weight is applied to all $C$ channels) → the CNN now emphasizes the important spatial regions and suppresses the rest.
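
The four steps above can be sketched in plain NumPy. This is a minimal illustration, not a reference implementation: the helper names `conv2d_same` and `spatial_attention` are made up for this example, the kernel is random (in a real network it would be learned), and the layout is channels-last to match the $H \times W \times C$ convention used here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv2d_same(x, kernel):
    """Zero-padded 'same' cross-correlation.

    x: (H, W, C_in), kernel: (k, k, C_in) with odd k. Returns (H, W, 1).
    """
    pad = kernel.shape[0] // 2
    padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    # One k x k window per spatial location: shape (H, W, C_in, k, k)
    windows = sliding_window_view(padded, kernel.shape[:2], axis=(0, 1))
    return np.einsum("hwcij,ijc->hw", windows, kernel)[..., None]


def spatial_attention(feature_map, kernel):
    # Step 2: compress along the channel axis
    avg_map = feature_map.mean(axis=-1, keepdims=True)      # (H, W, 1)
    max_map = feature_map.max(axis=-1, keepdims=True)       # (H, W, 1)
    # Step 3: concatenate, convolve, squash to (0, 1)
    stacked = np.concatenate([avg_map, max_map], axis=-1)   # (H, W, 2)
    attention = sigmoid(conv2d_same(stacked, kernel))       # (H, W, 1)
    # Step 4: re-weight every channel with the same spatial map
    return attention * feature_map, attention


rng = np.random.default_rng(0)
F = rng.standard_normal((8, 8, 16))             # H=8, W=8, C=16
kernel = rng.standard_normal((7, 7, 2)) * 0.1   # random stand-in for learned weights
refined, attention = spatial_attention(F, kernel)
print(attention.shape, refined.shape)           # (8, 8, 1) (8, 8, 16)
```

The attention map has a single channel, so the multiplication in step 4 relies on broadcasting to scale all 16 channels at a given location by the same weight.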



### 🔹 **Formula**

If $F \in \mathbb{R}^{H \times W \times C}$ is the input feature map:

$$
M_s(F) = \sigma\left( f^{7 \times 7}\left( [\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)] \right) \right)
$$

where:

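* $\sigma$ = sigmoid
* $f^{7 \times 7}$ = convolution with a $7 \times 7$ kernel
* $[\,\cdot\,;\,\cdot\,]$ = concatenation of the channel-wise average-pooled and max-pooled maps
* $M_s(F)$ = spatial attention map of size $H \times W \times 1$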

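Then the output is the input re-weighted by the attention map, where $\otimes$ denotes element-wise multiplication with $M_s(F)$ broadcast across the $C$ channels: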

$$
F' = M_s(F) \otimes F
$$
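
A tiny worked example can make the re-weighting step concrete. The attention values below are chosen by hand purely for illustration (they are not produced by a convolution), and the names `F`, `M`, and `F_refined` are just labels for this sketch.

```python
import numpy as np

# Feature map with H=2, W=2, C=3 (channels-last).
# F[0, 0] = [0, 1, 2], F[0, 1] = [3, 4, 5], F[1, 0] = [6, 7, 8], F[1, 1] = [9, 10, 11]
F = np.arange(12, dtype=float).reshape(2, 2, 3)

# Hypothetical attention map of shape (2, 2, 1): one weight per location.
M = np.array([[[1.0], [0.5]],
              [[0.0], [0.25]]])

# Broadcasting repeats each spatial weight across the 3 channels.
F_refined = M * F

print(F_refined[0, 0])  # weight 1.0: values 0, 1, 2 pass through unchanged
print(F_refined[0, 1])  # weight 0.5: values 3, 4, 5 become 1.5, 2.0, 2.5
print(F_refined[1, 0])  # weight 0.0: this location is fully suppressed
```

Every channel at a given location is scaled by the same number, which is what makes this attention "spatial" rather than channel-wise.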



### 🔹 **Intuition**

Think of it like telling the CNN:
👉 "Don’t waste time looking at the sky or background pixels; **focus on the object area** where the useful information is!"

---



# **Difference Between Spatial Attention (as in CBAM) and Squeeze-and-Excitation (SE) Blocks**



## 🔹 **1. SENet (Squeeze-and-Excitation Attention)**

* **Focuses on channels (“what” features are important).**
* Workflow:

  1. **Squeeze**: Apply global average pooling to the feature map → this collapses the spatial dimensions into a vector of size $1 \times 1 \times C$.
  2. **Excitation**: Pass this vector through 2 fully connected (FC) layers (a ReLU between them, a sigmoid at the end) → this outputs one weight in $(0, 1)$ for each channel.
  3. **Reweight**: Multiply these weights with the original feature maps channel-wise.
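
The squeeze, excitation, and reweight steps can be sketched in NumPy as follows. This is a minimal illustration under a few assumptions that the text above does not spell out: the first FC layer shrinks the channel vector by a reduction ratio `r` (set to 4 here), a ReLU sits between the two FC layers, biases are omitted for brevity, and the weights are random stand-ins for learned parameters. The function name `se_block` is made up for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def se_block(feature_map, w1, w2):
    """feature_map: (H, W, C); w1: (C, C // r); w2: (C // r, C)."""
    # Squeeze: global average pooling over the spatial dimensions
    squeezed = feature_map.mean(axis=(0, 1), keepdims=True)   # (1, 1, C)
    # Excitation: FC -> ReLU -> FC -> sigmoid
    hidden = np.maximum(squeezed @ w1, 0.0)                   # (1, 1, C // r)
    weights = sigmoid(hidden @ w2)                            # (1, 1, C)
    # Reweight: each channel is scaled by its own weight at every location
    return feature_map * weights, weights


rng = np.random.default_rng(0)
H, W, C, r = 8, 8, 16, 4
F = rng.standard_normal((H, W, C))
w1 = rng.standard_normal((C, C // r)) * 0.1
w2 = rng.standard_normal((C // r, C)) * 0.1
out, weights = se_block(F, w1, w2)
print(weights.shape, out.shape)   # (1, 1, 16) (8, 8, 16)
```

Here the weights have shape $1 \times 1 \times C$, so broadcasting applies one scalar per channel across the whole $H \times W$ grid.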

✨ **Key Idea**:
SE tells the model **which feature maps (channels)** are important (e.g., texture vs color vs shape).
It answers **“what” to focus on**.


## 🔹 **2. Spatial Attention (e.g., in CBAM)**

* **Focuses on spatial locations (“where” is important).**
* Workflow:

  1. Apply **average pooling** + **max pooling** along channels → get 2 spatial maps of size $H \times W \times 1$.
  2. Concatenate them → pass through convolution + sigmoid → output an attention map of size $H \times W \times 1$.
  3. Multiply this map with the original feature map at every spatial location.

✨ **Key Idea**:
Spatial attention tells the model **which regions in the image** are important (e.g., object vs background).
It answers **“where” to focus**.



## 🔹 **Comparison Table**

| Feature                | SE Block (Squeeze-and-Excitation) | Spatial Attention (CBAM style) |
| ---------------------- | --------------------------------- | ------------------------------ |
| **Focus**              | Channel-wise (what features)      | Spatial (where in the image)   |
| **Mechanism**          | Global avg pooling + FC layers    | Pooling across channels + Conv |
| **Attention map size** | $1 \times 1 \times C$             | $H \times W \times 1$          |
| **Attention type**     | “What is important?”              | “Where is it important?”       |
| **Granularity**        | Global per-channel weights        | Local pixel/region weights     |



## 🔹 **Together**

* **SE block = channel attention only.**
* **Spatial attention = location attention only.**
* **CBAM = both (channel + spatial).**
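  That’s why CBAM is often described as a **generalization of SE**: its channel-attention module pools with both average and max (SE uses average only), and a spatial-attention module is added on top.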
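
To show how the two kinds of attention fit together, here is a minimal NumPy sketch of a CBAM-style block. Several details are assumptions rather than something stated in these notes: channel attention is applied first and spatial attention second, the channel branch sums a shared two-layer MLP over average-pooled and max-pooled descriptors, biases are omitted, and all weights are random stand-ins for learned parameters. The function names are made up for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x, w1, w2):
    """Returns per-channel weights of shape (1, 1, C)."""
    avg_desc = x.mean(axis=(0, 1), keepdims=True)   # (1, 1, C)
    max_desc = x.max(axis=(0, 1), keepdims=True)    # (1, 1, C)

    def shared_mlp(d):
        # Same weights for both descriptors: FC -> ReLU -> FC
        return np.maximum(d @ w1, 0.0) @ w2

    return sigmoid(shared_mlp(avg_desc) + shared_mlp(max_desc))


def spatial_attention(x, kernel):
    """Returns per-location weights of shape (H, W, 1)."""
    stacked = np.concatenate(
        [x.mean(axis=-1, keepdims=True), x.max(axis=-1, keepdims=True)], axis=-1
    )                                                # (H, W, 2)
    pad = kernel.shape[0] // 2
    padded = np.pad(stacked, ((pad, pad), (pad, pad), (0, 0)))
    # One k x k window per location: shape (H, W, 2, k, k)
    windows = sliding_window_view(padded, kernel.shape[:2], axis=(0, 1))
    logits = np.einsum("hwcij,ijc->hw", windows, kernel)
    return sigmoid(logits)[..., None]


def cbam(x, w1, w2, kernel):
    x = channel_attention(x, w1, w2) * x   # "what": re-weight channels
    x = spatial_attention(x, kernel) * x   # "where": re-weight locations
    return x


rng = np.random.default_rng(0)
F = rng.standard_normal((8, 8, 16))
w1 = rng.standard_normal((16, 4)) * 0.1
w2 = rng.standard_normal((4, 16)) * 0.1
kernel = rng.standard_normal((7, 7, 2)) * 0.1
print(cbam(F, w1, w2, kernel).shape)   # (8, 8, 16)
```

The output keeps the input shape, so a block like this can be placed after a convolutional layer without changing the rest of the network.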


