
**SmoothGrad**

Vanilla Gradient saliency maps are noisy and unstable, which motivates exploring other methods. SmoothGrad, introduced by Smilkov et al. (2017) [4], is a pixel attribution method designed to reduce that noise. Rather than computing a single gradient $\nabla_x f_c(x)$ per image $x$, it adds Gaussian noise to copies of $x$ and then averages their Vanilla Gradients, in essence 'smoothing' out the high-frequency fluctuations of the derivative. [2]

Given an input image

$$x \in [0,1]^{C \times H \times W},$$

SmoothGrad creates an ensemble of noisy samples

$$x_i = x + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2 I), \quad i = 1, \dots, N.$$

There are two parameters: $N$, the number of samples in the ensemble, and $\sigma$, the noise level. The noise draws, and hence the samples $x_i$, are independent and identically distributed.

Smilkov et al. suggest a noise level of around 10-20% (as a proportion of the dynamic range of the input features) and a sample size of $N \approx 50$, as returns diminish past that point.
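The sampling step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper name `make_noisy_samples`, the random stand-in image, and the choice of a 15% noise level are not from the notebook, and the noisy samples are left unclipped, so they may fall slightly outside $[0,1]$.

```python
import numpy as np

def make_noisy_samples(x, n_samples=50, noise_level=0.15, rng=None):
    """Return (samples, sigma), where samples has shape (n_samples, *x.shape)."""
    rng = np.random.default_rng() if rng is None else rng
    # sigma is the noise level expressed as a fraction of the input's dynamic range
    sigma = noise_level * (x.max() - x.min())
    # one independent Gaussian noise tensor per sample
    eps = rng.normal(loc=0.0, scale=sigma, size=(n_samples, *x.shape))
    return x[None, ...] + eps, sigma

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(3, 8, 8))  # hypothetical C x H x W image
samples, sigma = make_noisy_samples(x, n_samples=50, noise_level=0.15, rng=rng)
print(samples.shape, float(sigma))
```

Expressing `sigma` relative to `x.max() - x.min()` keeps the noise level comparable across images with different contrast.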

Building on Vanilla Gradient, SmoothGrad computes the vanilla gradient of the predicted class score at each sample $x_i$:

$$G_i = \nabla_x f_c(x_i) \in \mathbb{R}^{C \times H \times W},$$

then averages their absolute values:

$$S_{\text{SG}}(x) = \frac{1}{N}\sum_{i=1}^N \vert G_i \vert.$$

As before, we then aggregate across channels and normalise to $[0,1]$:

$$S(i,j)=\max_{k \in \{1, \dots, C\}}\vert S_{\text{SG}}(x)_{k,i,j}\vert,$$

$$\hat{S} = \frac{S - \min(S)}{\max(S) - \min(S)},$$

where $k$ indexes the channel and $(i,j)$ the pixel position.
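The steps above (per-sample gradient, mean of absolute values, channel-wise maximum, min-max normalisation) can be put together as a minimal sketch. To keep it self-contained it uses a hypothetical one-layer ReLU scorer with an analytic gradient in place of a real network; `toy_class_score_grad`, `smoothgrad_saliency`, and all weights are illustrative assumptions rather than the notebook's model.

```python
import numpy as np

def toy_class_score_grad(x, weights, bias, readout):
    """Gradient w.r.t. x of the toy score f_c(x) = readout . relu(weights @ x_flat + bias)."""
    pre = weights @ x.ravel() + bias
    active = (pre > 0).astype(x.dtype)           # ReLU derivative: 1 if active, else 0
    return (weights.T @ (readout * active)).reshape(x.shape)

def smoothgrad_saliency(x, grad_fn, n_samples=50, noise_level=0.15, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    sigma = noise_level * (x.max() - x.min())
    acc = np.zeros_like(x)
    for _ in range(n_samples):
        x_i = x + rng.normal(0.0, sigma, size=x.shape)
        acc += np.abs(grad_fn(x_i))              # accumulate |G_i|
    s_sg = acc / n_samples                       # shape (C, H, W)
    s = s_sg.max(axis=0)                         # max over channels -> (H, W)
    span = s.max() - s.min()
    # guard against a constant map to avoid dividing by zero
    return (s - s.min()) / span if span > 0 else np.zeros_like(s)

rng = np.random.default_rng(0)
C, H, W = 3, 8, 8
x = rng.uniform(size=(C, H, W))                  # hypothetical input image
weights = rng.normal(size=(16, C * H * W))
bias = rng.normal(size=16)
readout = rng.normal(size=16)

saliency = smoothgrad_saliency(
    x, lambda z: toy_class_score_grad(z, weights, bias, readout), rng=rng
)
print(saliency.shape, saliency.min(), saliency.max())
```

With a real model, `grad_fn` would be replaced by whatever routine returns $\nabla_x f_c$ for the predicted class; the rest of the pipeline stays the same.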

**Advantages of SmoothGrad over Vanilla Grad**

1. Removes high-frequency noise by aggregating across noisy samples.

2. Reinforces consistent structures across samples.

3. Stabilises around ReLU boundaries, producing smoother saliency maps.

**How does SmoothGrad stabilise around ReLU boundaries?**

SmoothGrad averages many Vanilla Gradients computed at slightly perturbed inputs $x_i$. When the pre-activation of a ReLU unit is close to $0$ at the original input $x$, the noise pushes it above zero for some samples (the unit is 'active', with local derivative $1$) and below zero for others (the unit is 'inactive', with local derivative $0$).

Therefore, when we average across these gradients, the unit's contribution is scaled by the fraction of samples in which it is active, a value in $[0,1]$, rather than by a binary on/off decision. In essence this smooths out gradient fluctuations and leads to a smoother, more stable saliency map. [3]
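A one-unit toy case makes this concrete. The sketch below assumes the simplest possible setting, in which the noise is added directly to a scalar pre-activation $z$ of a single ReLU; the function names and the values of $z$ and $\sigma$ are illustrative. In this setting the averaged derivative should approach the Gaussian CDF $\Phi(z/\sigma)$, a smooth function of $z$, whereas the vanilla derivative jumps from $0$ to $1$ at $z = 0$.

```python
import math
import numpy as np

def vanilla_relu_grad(z):
    """Derivative of relu(z) at z: a hard 0/1 decision."""
    return float(z > 0)

def smoothed_relu_grad(z, sigma, n_samples, rng):
    """Average ReLU derivative over noisy copies z + eps, eps ~ N(0, sigma^2)."""
    noisy = z + rng.normal(0.0, sigma, size=n_samples)
    return float(np.mean(noisy > 0))             # fraction of 'active' samples

def gaussian_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

rng = np.random.default_rng(0)
sigma = 0.1
for z in (-0.05, -0.01, 0.01, 0.05):
    estimate = smoothed_relu_grad(z, sigma, 20000, rng)
    print(z, vanilla_relu_grad(z), estimate, gaussian_cdf(z / sigma))
```

The vanilla column flips between `0.0` and `1.0` as $z$ crosses zero, while the smoothed column changes gradually and tracks the CDF value.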

**Mathematical Proof**[1]

The main benefit of SmoothGrad is its ability to reduce noise in our saliency maps. Here we will explore mathematically why this is true.

1. We use a Taylor series expansion to show that SmoothGrad's averaging produces a mean gradient approximately equal to the true gradient:

Let $g(x) = f_c(x)$ be the logit of predicted class $c$.

Consider a second-order Taylor expansion of $g$ at a sample $x_i = x + \epsilon$ around $x$:

$$g(x + \epsilon) \approx g(x) + \nabla g(x)^\top \epsilon + \frac{1}{2}\epsilon^\top H_g(x)\epsilon,$$

where $H_g(x)$ is the Hessian of $g$ at $x$.

Differentiating with respect to $x$ and keeping terms up to first order in $\epsilon$ gives

$$\nabla_x g(x+\epsilon) \approx \nabla_x g(x)+H_g(x)\epsilon.$$

Here $\nabla_x g(x)$ is the true gradient and $H_g(x)\epsilon$ is a perturbation introduced by the noise $\epsilon$.

Now we take the expectation, noting that $\mathbb{E}[\epsilon] = 0$ since $\epsilon$ is normally distributed with mean $0$, and that $\nabla_x g(x)$ and $H_g(x)$ do not depend on $\epsilon$:

$$\mathbb{E}[\nabla_x g(x+\epsilon)] \approx \mathbb{E}[\nabla_x g(x)+ H_g(x)\epsilon]$$

$$= \nabla_x g(x)+ H_g(x)\mathbb{E}[\epsilon]$$

$$= \nabla_x g(x).$$

Therefore, to this order of approximation, averaging the gradients over an ensemble of Gaussian-perturbed samples does not bias the estimate away from the true gradient.
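The argument can be checked numerically on a function for which the second-order expansion is exact. The sketch below assumes a hypothetical quadratic $g(x) = a^\top x + \frac{1}{2}x^\top H x$ with randomly drawn $a$ and symmetric $H$, so that $\nabla_x g(x+\epsilon) = \nabla_x g(x) + H\epsilon$ holds exactly; the dimensions, $\sigma$, and sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
a = rng.normal(size=d)
M = rng.normal(size=(d, d))
H = 0.5 * (M + M.T)                      # symmetric Hessian of the toy quadratic

def grad_g(x):
    """Gradient of g(x) = a.x + 0.5 * x^T H x."""
    return a + H @ x

x = rng.normal(size=d)
sigma = 0.2
eps = rng.normal(0.0, sigma, size=(100_000, d))

# Row i holds grad_g(x + eps_i) = a + H (x + eps_i)
noisy_grads = a + (x + eps) @ H.T
mean_grad = noisy_grads.mean(axis=0)

max_abs_error = float(np.max(np.abs(mean_grad - grad_g(x))))
print(max_abs_error)
```

The printed error shrinks towards zero as the number of noise draws grows, since the only difference between the mean and the true gradient is the sample mean of $H\epsilon_i$.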

2. SmoothGrad reduces the sample variance.

Our averaged gradient is:

$$\bar{G}_N = \frac{1}{N} \sum_{i=1}^{N} G_i$$

Since the noise draws $\epsilon_i$ are i.i.d., the gradients $G_i$ are i.i.d. as well, so

$$\text{Var}[\bar{G}_N]= \frac{1}{N^2} \sum_{i=1}^{N}\text{Var}[G_i]=\frac{1}{N}\text{Var}[G_1].$$

Therefore, the variance of the gradient estimate is inversely proportional to the number of samples $N$: doubling $N$ halves the variance.

Overall, SmoothGrad's averaged gradient stays approximately centred on the true gradient used by Vanilla Grad, while having lower variance and therefore less noise.
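The $1/N$ scaling of the variance can also be illustrated by simulation. This sketch assumes the linearised model from step 1, where the noise in one gradient component is $h^\top \epsilon_i$ for a row $h$ of the Hessian; the vector `h`, the value of `sigma`, and the number of trials are made up for the example. Under this model $\text{Var}[G_1] = \sigma^2 \lVert h \rVert^2$ for that component.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.2
h = np.array([1.0, -2.0, 0.5])           # one row of a hypothetical Hessian

def averaged_gradient_noise(n_samples, n_trials, rng):
    """Noise term of one averaged gradient component, for n_trials independent ensembles."""
    eps = rng.normal(0.0, sigma, size=(n_trials, n_samples, h.size))
    per_sample = eps @ h                  # shape (n_trials, n_samples)
    return per_sample.mean(axis=1)        # average over the N samples of each ensemble

single_var = sigma**2 * np.sum(h**2)      # Var[G_1] for this component

for n in (1, 10, 100):
    empirical = averaged_gradient_noise(n, 5000, rng).var()
    print(n, empirical, single_var / n)
```

For each $N$, the empirical variance across ensembles should be close to `single_var / n`, up to sampling error from the finite number of trials.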


**References**

[1] https://www.emergentmind.com/topics/smoothgrad-technique

[2] https://users.cs.fiu.edu/~sjha/class2023/Lecture4/Slides/2017SmoothGrad.pdf

[3] https://christophm.github.io/interpretable-ml-book/pixel-attribution.html

[4] https://arxiv.org/pdf/1706.03825



