**SmoothGrad**

Vanilla Gradient saliency maps are often noisy and unstable, which motivates exploring other methods. SmoothGrad, introduced by Smilkov et al. (2017), is a pixel attribution method designed to reduce that noise. Rather than computing a single $G(x)$ per image $x$, it adds Gaussian noise to copies of $x$ and averages the resulting gradient maps, in essence 'smoothing' out the high-frequency fluctuations of the derivative.

Given an input image

$x \in [0,1]^{C \times H \times W}$,

SmoothGrad creates an ensemble of noisy samples

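$x_i = x + \epsilon_i$,  where   $\epsilon_i \sim \mathcal{N}(0, \sigma^2 I)$ and $i = 1, \dots, N$.

The method has two parameters: $N$, the number of samples in the ensemble, and $\sigma$, the noise level. The noise draws $\epsilon_i$ are independent and identically distributed (i.i.d.), so the samples $x_i$ are i.i.d. given $x$.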

The parameter values suggested by Smilkov et al. are a noise level of around 10-20% (with $\sigma$ expressed as a proportion of the dynamic range of the input features, $x_{\max} - x_{\min}$) and a sample size of $N \approx 50$, beyond which the improvement shows diminishing returns.
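As a minimal sketch of the sampling step, assume the image has been flattened to a plain Python list and that $\sigma$ is set as a fraction of the dynamic range $\max(x) - \min(x)$. The function name `make_noisy_samples`, its arguments and the example input are illustrative choices, not taken from Smilkov et al.

```python
import random


def make_noisy_samples(x, n_samples=50, noise_level=0.15, seed=0):
    """Return (samples, sigma) for a flat input x.

    noise_level is the fraction of the dynamic range (max - min) of x
    that is used as the Gaussian standard deviation sigma.
    """
    rng = random.Random(seed)
    sigma = noise_level * (max(x) - min(x))
    # One independent Gaussian draw per feature and per sample.
    samples = [[v + rng.gauss(0.0, sigma) for v in x] for _ in range(n_samples)]
    return samples, sigma


x = [0.0, 0.25, 0.5, 1.0]  # hypothetical flattened image with values in [0, 1]
samples, sigma = make_noisy_samples(x)
```

Note that the noisy samples are not clipped here, so individual values can fall slightly outside $[0,1]$.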

Building on Vanilla Gradient, SmoothGrad computes the gradient of the predicted-class score $f_c$ at each sample $x_i$:

$G_i = \frac{\partial f_c(x')}{\partial x'} \bigg|_{x'=x_i} \in \mathbb{R}^{C \times H \times W}$,

then averages their absolute values:

$S_{\text{SG}}(x) = \frac{1}{N}\sum_{i=1}^N \vert G_i \vert$.

As before, we then aggregate across channels and normalise to $[0,1]$:

$S_{\text{SG}}(i,j)=\max_{c \in \{1, \dots, C\}}\vert S_{\text{SG}}(x)_{c,i,j}\vert$,

$\hat{S}_{\text{SG}} = \frac{S_{\text{SG}} - \min(S_{\text{SG}})}{\max(S_{\text{SG}}) - \min(S_{\text{SG}})}$.
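The steps above can be sketched end to end in plain Python. This is an illustrative sketch rather than a reference implementation: `smoothgrad_map` and `toy_grad` are hypothetical names, the image is a nested list of shape $C \times H \times W$, and `toy_grad` is the analytic gradient of a made-up score function that stands in for backpropagation through a real network.

```python
import math
import random


def toy_grad(x):
    """Gradient of the made-up score f(x) = sum over (c, h, w) of (c + 1) * sin(5 * x[c][h][w]).

    Hypothetical stand-in for the backward pass of a real model.
    """
    return [[[(c + 1) * 5.0 * math.cos(5.0 * v) for v in row] for row in channel]
            for c, channel in enumerate(x)]


def smoothgrad_map(x, grad_fn, n_samples=50, sigma=0.15, seed=0):
    """Return an H x W saliency map in [0, 1]; sigma is given in input units."""
    rng = random.Random(seed)
    C, H, W = len(x), len(x[0]), len(x[0][0])
    mean_abs = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for _ in range(n_samples):
        # Noisy copy x_i = x + eps_i with i.i.d. Gaussian noise per entry.
        noisy = [[[v + rng.gauss(0.0, sigma) for v in row] for row in channel]
                 for channel in x]
        grad = grad_fn(noisy)
        # Running average of the absolute gradients.
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    mean_abs[c][h][w] += abs(grad[c][h][w]) / n_samples
    # Aggregate across channels with a per-pixel maximum.
    saliency = [[max(mean_abs[c][h][w] for c in range(C)) for w in range(W)]
                for h in range(H)]
    # Min-max normalise to [0, 1]; a constant map is returned as all zeros.
    lo = min(min(row) for row in saliency)
    hi = max(max(row) for row in saliency)
    if hi == lo:
        return [[0.0] * W for _ in range(H)]
    return [[(v - lo) / (hi - lo) for v in row] for row in saliency]


rng = random.Random(1)
image = [[[rng.random() for _ in range(4)] for _ in range(4)] for _ in range(3)]
saliency = smoothgrad_map(image, toy_grad)
```

With a real model, `grad_fn` would be replaced by whatever routine returns the input gradient of the predicted-class score.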

**Advantages of SmoothGrad over Vanilla Grad**

1. Removes high-frequency noise: aggregating over samples averages out rapidly fluctuating gradients.

2. Reinforces structures that are consistent across samples.

3. Stabilises attributions around ReLU boundaries: noise places samples on both sides of an activation boundary, so the averaged gradient has fewer abrupt jumps.
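The third point can be checked on a single ReLU unit. The sketch below, with hypothetical helper names, averages the ReLU derivative (a step function) over Gaussian-perturbed inputs and compares the result with the Gaussian CDF $\Phi(z/\sigma)$, which is the exact expectation of a step function under this noise. The values of `sigma` and `n_samples` are arbitrary choices for illustration.

```python
import math
import random


def relu_grad(z):
    """Derivative of ReLU: 0 for z <= 0, 1 for z > 0 (a hard step)."""
    return 1.0 if z > 0 else 0.0


def smoothed_relu_grad(z, sigma=0.15, n_samples=5000, seed=0):
    """Monte Carlo average of the ReLU derivative over Gaussian-perturbed inputs."""
    rng = random.Random(seed)
    return sum(relu_grad(z + rng.gauss(0.0, sigma)) for _ in range(n_samples)) / n_samples


def normal_cdf(t):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


for z in (-0.1, 0.0, 0.1):
    # The raw derivative jumps from 0 to 1 at z = 0; the averaged one
    # changes gradually and tracks normal_cdf(z / sigma).
    print(z, relu_grad(z), smoothed_relu_grad(z), normal_cdf(z / 0.15))
```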

**Mathematical Proof**

The main benefit of SmoothGrad is its ability to reduce noise. Here we will explore mathematically why this is true.

First, we use a Taylor series expansion to show that SmoothGrad's averaging produces a gradient approximately equal to the true gradient.

Let $g(x) = f_c(x)$ be the logit of predicted class $c$.

Consider a second-order Taylor expansion of $g$ around $x$, evaluated at the sample $x_i = x + \epsilon$:

$g(x + \epsilon) \approx g(x) + \nabla g(x)^\top \epsilon + \frac{1}{2}\epsilon^\top H_g(x)\epsilon$,

where $H_g(x)$ is the Hessian.

If we take gradients with respect to $x$ and neglect the term involving third derivatives of $g$ (the derivative of $H_g$),

$\nabla_x g(x+\epsilon) \approx \nabla g(x)+H_g(x)\epsilon.$

$\nabla_x g(x)$ is the true gradient and $H_g(x)\epsilon$ is a noise term.

Now we average over our $N$ realisations of $x_i$ and take the expected value:

$\mathbb{E}[\nabla_x g(x+\epsilon)] \approx \mathbb{E}[\nabla g(x)+ H_g(x)\epsilon]$

$= \nabla g(x)+ H_g(x)\mathbb{E}[\epsilon]$

$= \nabla g(x)$

since $\mathbb{E}[ϵ] = 0$.

Therefore, to this order of approximation, we have shown that taking the average over an ensemble of samples injected with zero-mean Gaussian noise does not distort the true gradient.
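For a quadratic $g$ the second-order expansion is exact, so this claim can be checked numerically. The sketch below uses a hypothetical two-dimensional quadratic $g(x) = \frac{1}{2}x^\top A x + b^\top x$, for which $\nabla g(x) = Ax + b$ and $H_g = A$. The values of `A`, `b`, `x` and `sigma` are arbitrary choices for illustration.

```python
import random

A = [[2.0, 0.5], [0.5, 1.0]]  # symmetric Hessian of the toy quadratic (assumed values)
b = [1.0, -1.0]


def grad_g(x):
    """Gradient of g(x) = 0.5 * x^T A x + b^T x, which is A x + b."""
    return [sum(A[r][k] * x[k] for k in range(2)) + b[r] for r in range(2)]


rng = random.Random(0)
x = [0.3, -0.7]
sigma = 0.2
n = 20000

mean_grad = [0.0, 0.0]
for _ in range(n):
    noisy = [v + rng.gauss(0.0, sigma) for v in x]
    g = grad_g(noisy)
    # Running average of the gradient at the noisy points.
    mean_grad = [m + gi / n for m, gi in zip(mean_grad, g)]

true_grad = grad_g(x)
print(mean_grad, true_grad)  # the two should agree up to Monte Carlo error
```

For a real network the agreement is only approximate, since higher-order terms of the expansion are no longer zero.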

Secondly, we can show that averaging reduces the variance of the estimate. Write $\bar{G}_N = \frac{1}{N}\sum_{i=1}^N G_i$ for the sample mean. Because the noise draws are i.i.d., the $G_i$ are i.i.d. as well, and independence gives $\text{Var}[\bar{G}_N]=\frac{1}{N}\text{Var}[G_i]$.

Therefore, the variance of the gradient estimate is inversely proportional to the number of samples $N$, and its standard deviation falls as $1/\sqrt{N}$.
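The scaling of the variance with $N$ can be checked empirically. The sketch below uses a hypothetical one-dimensional score $g(x) = \sin(3x)$, estimates the variance of the averaged gradient over many repeated ensembles, and compares $N = 10$ with $N = 40$; the ratio of the two variances should be close to 4. All names and constants are illustrative assumptions.

```python
import math
import random
import statistics


def noisy_grad(x, sigma, rng):
    """Gradient of the toy score g(x) = sin(3x), evaluated at a noisy input."""
    return 3.0 * math.cos(3.0 * (x + rng.gauss(0.0, sigma)))


def variance_of_mean(n_samples, trials=4000, x=0.4, sigma=0.15, seed=0):
    """Empirical variance of the N-sample averaged gradient over repeated ensembles."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        total = sum(noisy_grad(x, sigma, rng) for _ in range(n_samples))
        means.append(total / n_samples)
    return statistics.pvariance(means)


var_10 = variance_of_mean(10)
var_40 = variance_of_mean(40, seed=1)
ratio = var_10 / var_40  # expected to be close to 40 / 10 = 4
print(var_10, var_40, ratio)
```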

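Overall, under this local approximation, SmoothGrad recovers the gradient values of Vanilla Grad in expectation, with lower variance and therefore less noise. Note that the derivation is for the signed gradients $G_i$, whereas $S_{\text{SG}}$ averages their absolute values.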
