# Deeper Dive into Grad-CAM

We break down each step of the Grad-CAM map and
give both the intuition and a thorough explanation of what this mapping does.

Recall that in the Introduction to Grad-CAM notebook we defined the function

$$L^{c}_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_{k}\alpha_{k}^{c}\mathbf{A}^{k}\right)$$

This is the ReLU of a linear combination of our $K$ feature maps, each weighted by its importance $\alpha_{k}^{c}$ for class $c$. We then use bilinear interpolation to upsample this heatmap so that it matches the size of the original image.

Now let's break down the first step of the Grad-CAM mapping function.

---

## Computing channel importance weights:

To understand the importance weights $\alpha_{k}^{c}$, we must first understand the components of their defining equation:

$$ \alpha_{k}^{c} = \frac{1}{Z}\sum_{i}\sum_{j} \frac{\partial y^c}{\partial A_{ij}^{k}}$$

Here $y^c$ is the score for class $c$, $A_{ij}^{k}$ is the value of the $k$-th feature map at spatial location $(i,j)$, and $Z$ is the number of spatial locations in that feature map.

When an image passes through a convolutional layer, the layer applies multiple filters (kernels). Each filter detects a different feature; in our MRI example these features could be edges, textures, tumor boundaries, and so on. Each filter produces a 2D activation map, which is called a channel. If we use 512 filters, there will be 512 channels.

The $k$-th feature map $\mathbf{A}^{k}$ is the output of the $k$-th filter. So $\mathbf{A}^{k}$ is a 2D spatial grid showing how strongly filter $k$ fired at each spatial location.

This idea is illustrated in Figure 1.

<figure  style = "text-align: center;">
  <img src="../Youssef O/photos/ConvolutionalLayerFeatureMap.png" alt="Convolutional Layer Activation Map" width="400">
  <figcaption>Figure 1: Convolutional Layer Activation Map.</figcaption>
</figure>



Figure 1 starts with an input image of dimension $5 \times 5$ with $3$ channels. A convolutional layer with 2 filters (or kernels) of spatial dimension $3\times 3$ is applied to it; each filter spans all 3 input channels and sums their contributions, so the layer outputs 2 feature maps, each of dimension $3\times 3$. Therefore $K = 2$, which is the number of filters and, equivalently, the number of output channels or feature maps. Each output feature map is a 2D spatial grid showing how strongly filter $k$ fired at each spatial location.
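
The shapes in this example can be checked with a small sketch. This is a naive NumPy convolution written only for illustration, assuming stride 1 and no padding; the function name `conv2d_valid` and the random input values are placeholders, not part of the original notebook.

```python
import numpy as np

def conv2d_valid(image, filters):
    """Naive cross-correlation with stride 1 and no padding.

    image:   array of shape (C, H, W)
    filters: array of shape (K, C, kh, kw)
    returns: feature maps of shape (K, H - kh + 1, W - kw + 1)
    """
    C, H, W = image.shape
    K, Cf, kh, kw = filters.shape
    assert C == Cf, "each filter needs one kernel slice per input channel"
    out = np.zeros((K, H - kh + 1, W - kw + 1))
    for k in range(K):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                patch = image[:, i:i + kh, j:j + kw]
                # Sum over all input channels and all kernel positions.
                out[k, i, j] = np.sum(patch * filters[k])
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(3, 5, 5))       # 3-channel 5x5 input (placeholder values)
filters = rng.normal(size=(2, 3, 3, 3))  # K = 2 filters, each 3x3 over 3 channels
feature_maps = conv2d_valid(image, filters)
print(feature_maps.shape)  # (2, 3, 3)
```

Each of the two slices `feature_maps[k]` plays the role of $\mathbf{A}^{k}$ in the formulas above.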

The question that Grad-CAM asks is: which of these channels contributed most to the target class, and where in the image did those channels activate?

The final layer of a CNN is typically a fully connected (dense) layer that outputs a logit value for each possible class. This logit is viewed as the score $y^c$ attributed to class $c$, and it represents the confidence the network has that an input image belongs to that class.

So naturally, we look at how this confidence changes with respect to each feature map, to understand which region of an image the model places its confidence in. This leads to our formula for the importance weight $\alpha_{k}^{c}$: for each pixel in the 2D activation map, we calculate the gradient of the class score with respect to that pixel, and then sum over all the pixels. We then divide by the number of pixels $Z$, so that we obtain the average gradient over the whole feature map.
<figure  style = "text-align: center;">
  <img src="../Youssef O/photos/FeatureMapPixels.png" alt="Class Activation"  width ="400">
  <figcaption>Figure 2: Per Pixel Class Activation.</figcaption>
</figure>
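
The averaging step in the importance-weight formula can be sketched as follows. This is a minimal illustration that assumes the gradients $\partial y^c / \partial A_{ij}^{k}$ have already been computed (for example by a framework's automatic differentiation); the helper name `channel_importance` and the gradient values are hypothetical.

```python
import numpy as np

def channel_importance(grads):
    """Average the gradients of the class score over the spatial grid.

    grads: array of shape (K, h, w) holding dy^c / dA^k_ij
    returns: array of shape (K,), one weight alpha_k^c per channel
    """
    K, h, w = grads.shape
    Z = h * w  # number of spatial locations in one feature map
    return grads.sum(axis=(1, 2)) / Z

# Hypothetical gradients for K = 2 feature maps of size 2x2.
grads = np.array([
    [[0.2, 0.4],
     [0.0, 0.2]],
    [[-0.1, -0.3],
     [0.1, -0.1]],
])
alphas = channel_importance(grads)
print(alphas)  # approximately [0.2, -0.1]
```

In this toy case the first channel has a positive weight, so increasing its activations raises the class score on average, while the second channel has a negative weight.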


Note:
1. We use the score (logit) instead of the probability, because the softmax probability of one class depends on the scores of all the other classes.
2. Even if the model predicted "No tumor", we can still compute the Grad-CAM for the "Tumor" class.
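
The first note can be made concrete with a small sketch. It assumes a two-class model whose probabilities come from a softmax over the logits; the logit values are made up for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical logits in the order [no tumor, tumor].
logits_a = np.array([1.0, 2.0])
logits_b = np.array([3.0, 2.0])  # only the "no tumor" logit has changed

p_tumor_a = softmax(logits_a)[1]
p_tumor_b = softmax(logits_b)[1]
print(round(p_tumor_a, 3), round(p_tumor_b, 3))  # 0.731 0.269
```

The "tumor" logit is 2.0 in both cases, yet its probability drops because the other class's score changed. Differentiating the logit isolates the evidence for the class of interest.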

---

## Linear combination of feature maps:

Now that we have calculated the importance of each feature map, we take a linear combination of the feature maps.
<figure  style = "text-align: center;">
  <img src="../Youssef O/photos/LinearCombination.png" alt="Linear Combination of Feature Maps"  width ="400">
  <figcaption>Figure 3: Linear Combination of Feature Maps.</figcaption>
</figure>

Each scalar $\alpha_{k}^{c}$ tells you how important that entire map is to the class. We combine all the feature maps because each one highlights different spatial features of the input: some maps activate when they detect tumors (so those maps correspond to tumor features), some respond to edges, some may respond to texture, and so on. Combining all these feature maps, weighted by their importance, gives us a spatial heatmap showing where the channels important for class $c$ were active and how strongly they were active.
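
As a rough sketch, the weighted sum $\sum_k \alpha_k^c \mathbf{A}^k$ is a single contraction over the channel axis. The function name `combine_feature_maps` and all numbers here are illustrative assumptions.

```python
import numpy as np

def combine_feature_maps(alphas, feature_maps):
    """Weighted sum over channels: sum_k alpha_k * A^k.

    alphas:       array of shape (K,)
    feature_maps: array of shape (K, h, w)
    returns:      array of shape (h, w)
    """
    # Contract the channel axis of both arrays.
    return np.tensordot(alphas, feature_maps, axes=1)

alphas = np.array([0.2, -0.1])  # hypothetical importance weights
feature_maps = np.array([
    [[1.0, 0.0],
     [2.0, 1.0]],
    [[0.0, 4.0],
     [1.0, 0.0]],
])
combined = combine_feature_maps(alphas, feature_maps)
print(combined)  # approximately [[0.2, -0.4], [0.3, 0.2]]
```

The negative entry appears where the negatively weighted channel fired strongly.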

Then we apply the ReLU activation function,

$$ \mathrm{ReLU}(x) = \max(x,0)$$

which keeps only the positively important features and filters out features that are negatively important for class $c$. This gives us a heatmap with the same spatial dimensions as the feature maps of the last convolutional layer,

$$L_{\text{Grad-CAM}}^{c} = \mathrm{ReLU}\left(\sum_{k}\alpha_{k}^{c}\mathbf{A}^{k}\right)$$

specifying where the features that count positively towards class $c$ are located, together with a comparative scale of how strong their activation is.

We then upsample the heatmap so that it has the same dimensions as the input image. This allows us to overlay the heatmap on the image to see what the model deemed important when it made its prediction.
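
A minimal sketch of the ReLU step on a toy $2 \times 2$ combined map. The input values are hypothetical, and the final rescaling to $[0, 1]$ is an optional display step assumed here for plotting, not part of the Grad-CAM formula itself.

```python
import numpy as np

def relu(x):
    """Element-wise ReLU(x) = max(x, 0)."""
    return np.maximum(x, 0.0)

# Hypothetical weighted sum of feature maps on a 2x2 grid.
combined = np.array([[0.2, -0.4],
                     [0.3,  0.2]])

heatmap = relu(combined)
print(heatmap)  # the -0.4 entry becomes 0.0; positive entries are unchanged

# Optional display step (an assumption): rescale to [0, 1] for a colour map.
peak = heatmap.max()
heatmap_scaled = heatmap / peak if peak > 0 else heatmap
```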

<figure  style = "text-align: center;">
  <img src="../Youssef O/photos/ReLU+Upsampling.png" alt="ReLU + Upsampling"  width ="400">
  <figcaption>Figure 4: ReLU Activation function and Upsampling of Combined Feature Maps.</figcaption>
</figure>

---

## Upsampling using bilinear interpolation:

In most computer vision tasks, interpolation is used to resize an image. We look at one of the most common interpolation methods, bilinear interpolation, to understand how our heatmap $L_{\text{Grad-CAM}}^{c}$ can be resized. Bilinear interpolation is used because it outputs a smooth resized image and avoids blocky, 'checkerboard'-style images.

 The intuitive idea is that for an input image $I \in \mathbb{R}^{H \times W}$ and a raw Grad-CAM heatmap $L \in \mathbb{R}^{h \times w}$ the upsampling function is a mapping:
$$ U(L) = \tilde{L} \in \mathbb{R}^{H \times W}$$
where $U$ is typically bilinear interpolation.

### Bilinear interpolation

To compute the value at an output pixel $(x,y)$, we take a weighted average of the four pixels closest to it in the lower-resolution heatmap $L_{\text{Grad-CAM}}^{c}$. Let the corresponding coordinates $(u,v)$ in the low-resolution heatmap be:

$$u = x \cdot \frac{h-1}{H-1} ,  \ \ \ \ \ \ v = y \cdot \frac {w-1}{W-1}$$
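These low-resolution coordinates are sometimes called fractional coordinates, because $u$ and $v$ are generally not integers, so $(u,v)$ usually falls between the pixels of the low-resolution grid.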
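
The coordinate mapping can be tried on a small case. This sketch assumes $H, W > 1$ and uses a hypothetical helper name, `source_coords`; it maps an output pixel of a $5 \times 5$ image back onto a $3 \times 3$ heatmap.

```python
def source_coords(x, y, H, W, h, w):
    """Map output pixel (x, y) of an H x W image to coordinates (u, v)
    in an h x w heatmap, so that the corner pixels line up."""
    u = x * (h - 1) / (H - 1)
    v = y * (w - 1) / (W - 1)
    return u, v

print(source_coords(0, 0, H=5, W=5, h=3, w=3))  # (0.0, 0.0)
print(source_coords(1, 3, H=5, W=5, h=3, w=3))  # (0.5, 1.5)
print(source_coords(4, 4, H=5, W=5, h=3, w=3))  # (2.0, 2.0)
```

The corners of the output land exactly on the corners of the heatmap, while interior pixels such as $(1, 3)$ land between grid points.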

We define $i = \lfloor u \rfloor$ and $j = \lfloor v \rfloor$, so that $(i,j)$ is the grid point immediately above and to the left of $(u,v)$; here the first index runs down the rows and the second runs across the columns, matching $L \in \mathbb{R}^{h \times w}$. We also let the interpolation weights be:

$$\alpha = u - i,  \ \ \ \ \ \ \beta = v - j$$

These weights are known as the fractional offsets, and they tell you how far inside the square formed by the four neighbouring grid points you are from the top-left corner $(i,j)$. The term $\alpha$ describes how far down you are (from 0 to 1) relative to $(i,j)$, and the term $\beta$ describes how far right you are (from 0 to 1) relative to $(i,j)$.

Therefore, the bilinear interpolation formula is:
$$\tilde{L}_{x,y} = (1 - \alpha)(1-\beta)L_{i,j} + \alpha(1-\beta)L_{i+1,j} + (1-\alpha)\beta L_{i,j+1} + \alpha\beta L_{i+1,j+1}$$

Intuitively, the equation tells us to take a weighted combination of the four nearest grid points, with each point weighted by how close we are to it. Interpolating twice in this way (once along each axis) makes the output vary smoothly between grid points.
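
Putting the pieces together, here is a direct (and deliberately slow) sketch of the formula above in NumPy. It assumes $h, w \geq 2$ and $H, W \geq 2$. The clamping of $i$ and $j$ is an added detail so that $i+1$ and $j+1$ stay inside the grid on the last row and column.

```python
import math
import numpy as np

def bilinear_upsample(L, H, W):
    """Upsample an h x w heatmap L to H x W with bilinear interpolation."""
    h, w = L.shape
    out = np.zeros((H, W))
    for x in range(H):
        for y in range(W):
            # Fractional coordinates in the low-resolution grid.
            u = x * (h - 1) / (H - 1)
            v = y * (w - 1) / (W - 1)
            # Top-left neighbour, clamped so i + 1 and j + 1 are valid indices.
            i = min(math.floor(u), h - 2)
            j = min(math.floor(v), w - 2)
            a = u - i  # offset along the first index
            b = v - j  # offset along the second index
            out[x, y] = ((1 - a) * (1 - b) * L[i, j]
                         + a * (1 - b) * L[i + 1, j]
                         + (1 - a) * b * L[i, j + 1]
                         + a * b * L[i + 1, j + 1])
    return out

L = np.array([[0.0, 1.0],
              [2.0, 3.0]])
print(bilinear_upsample(L, 3, 3))
# [[0.  0.5 1. ]
#  [1.  1.5 2. ]
#  [2.  2.5 3. ]]
```

The four corners keep their original values and the centre pixel is the average of all four. In PyTorch, `torch.nn.functional.interpolate` with `mode="bilinear"` and `align_corners=True` uses the same corner-aligned coordinate mapping, so it should agree with this sketch.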

<figure  style = "text-align: center;">
  <img src="../Youssef O/photos/bilinearInterpolation.png" alt="Bilinear Interpolation diagram"  width ="400">
  <figcaption>Figure 5: Bilinear Interpolation Diagram.</figcaption>
</figure>

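Bilinear interpolation usually gives smoother resized images than simpler methods such as nearest-neighbour interpolation, which copies the closest pixel and produces blocky output. Figure 6 compares it with other upsampling functions.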


<figure  style = "text-align: center;">
  <img src="../Youssef O/photos/Comparison-of-upsamling-functions.png" alt="Comparison of Upsampling functions"  width ="400">
  <figcaption>Figure 6: Bilinear Interpolation compared to other upsampling functions.</figcaption>
</figure>
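
The difference in smoothness can also be seen numerically. The sketch below compares bilinear interpolation with nearest-neighbour upsampling, which is assumed here to be one of the alternatives being compared. The bilinear version is written as two passes of 1D linear interpolation, one along each axis, and both function names are illustrative.

```python
import numpy as np

def nearest_upsample(L, H, W):
    """Copy the value of the closest low-resolution pixel."""
    h, w = L.shape
    rows = np.rint(np.linspace(0, h - 1, H)).astype(int)
    cols = np.rint(np.linspace(0, w - 1, W)).astype(int)
    return L[np.ix_(rows, cols)]

def bilinear_upsample(L, H, W):
    """Bilinear interpolation done as two 1D linear interpolations."""
    h, w = L.shape
    u = np.linspace(0, h - 1, H)  # fractional row coordinates
    v = np.linspace(0, w - 1, W)  # fractional column coordinates
    # First interpolate each low-resolution row along the columns: shape (h, W).
    along_cols = np.stack([np.interp(v, np.arange(w), L[r]) for r in range(h)])
    # Then interpolate each resulting column along the rows: shape (H, W).
    return np.stack(
        [np.interp(u, np.arange(h), along_cols[:, c]) for c in range(W)], axis=1
    )

L = np.array([[0.0, 1.0],
              [2.0, 3.0]])
nearest = nearest_upsample(L, 4, 4)
bilinear = bilinear_upsample(L, 4, 4)
print(nearest)                # four constant 2x2 blocks
print(np.round(bilinear, 2))  # values change gradually between the corners
```

The nearest-neighbour output only ever contains the four original values, arranged in blocks, whereas the bilinear output fills in intermediate values between them.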


# Using Grad-CAM in Python:
We will complement this tutorial on Grad-CAM with some Python code and visualisations, using a specific MRI image of a tumor.

To do this, we first load a toy example of our CNN: the pre-trained ResNet18 model.