# **3 Gradient computations in Neural Networks (Ed Tam)**

$$\mathbf{h}_1 = \sigma\left( \mathbf{W}_1^\top \mathbf{x} + \mathbf{b}_1 \right),$$
$$\mathbf{h}_2 = \sigma\left( \mathbf{W}_2^\top \mathbf{h}_1 + \mathbf{b}_2 \right),$$
$$f(\mathbf{x}) = \mathbf{h}_3 =  \sigma\left( \mathbf{W}_3^\top \mathbf{h}_2 + b_3 \right),$$

$$L=-\sum_{i=1}^{N}\left[y_i\log(f(\mathbf{x}_i)) + (1-y_i)\log(1-f(\mathbf{x}_i))\right],$$

$$\mathbf{z}_l := \mathbf{W}_l^{\top} \mathbf{h}_{l-1}+\mathbf{b}_l.$$
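The model and loss above can be sketched in NumPy as follows. This is a minimal sketch rather than a reference implementation: the function names `sigmoid`, `forward` and `loss`, and the shape conventions ($\mathbf{W}_1$ of shape $(D, H)$, $\mathbf{W}_2$ of shape $(H, H)$, $\mathbf{W}_3$ stored as a vector of length $H$) are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2, W3, b3):
    """Forward pass for one input x of shape (D,).

    Assumed shapes: W1 (D, H), W2 (H, H), W3 (H,), b1 and b2 (H,), b3 scalar.
    Returns the activations h1, h2 and the scalar output h3 = f(x).
    """
    h1 = sigmoid(W1.T @ x + b1)
    h2 = sigmoid(W2.T @ h1 + b2)
    h3 = sigmoid(W3 @ h2 + b3)  # W3 is 1-D, so W3 @ h2 equals W3^T h2
    return h1, h2, h3

def loss(X, y, params):
    """Binary cross-entropy L, summed over the rows of X (shape (N, D))."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        f = forward(x_i, *params)[2]
        total -= y_i * np.log(f) + (1.0 - y_i) * np.log(1.0 - f)
    return total
```

With all weights and biases set to zero, every activation equals $\sigma(0)=0.5$, so the loss reduces to $N\log 2$, which gives a quick sanity check.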

**3.1 Gradient Evaluation**

To obtain $\frac{\partial L}{\partial \mathbf{W}_1}$, write $L=-\sum_{i=1}^{N}L_i$ with $L_i=y_i\log(f(\mathbf{x}_i)) + (1-y_i)\log(1-f(\mathbf{x}_i))$, and start with
 $\frac{\partial L_i}{\partial \mathbf{W}_1}$:

\begin{eqnarray}
\frac{\partial L_i}{\partial \mathbf{W}_1}=\frac{\partial L_i}{\partial {h}_3}\frac{\partial {h}_3}{\partial \mathbf{W}_1},
\end{eqnarray}


\begin{eqnarray}
\frac{\partial L_i}{\partial {h}_3}=\frac{\partial L_i}{\partial f}=\frac{\partial}{\partial f}\left(y_i\log(f) + (1-y_i)\log(1-f)\right)=\\
y_i\dfrac{1}{f}+(1-y_i)\dfrac{-1}{1-f}=\dfrac{y_i-y_if-f+y_if}{f(1-f)}=\dfrac{y_i-f}{f(1-f)}.
\end{eqnarray}

Now let's simplify $\frac{\partial {h}_3}{\partial \mathbf{W}_1}$:

\begin{eqnarray}
\frac{\partial {h}_3}{\partial \mathbf{W}_1}=\frac{\partial {h}_3}{\partial z_3}\frac{\partial z_3}{\partial \mathbf{h_2}}\frac{\partial\mathbf{h_2}}{\partial\mathbf{W}_1},
\end{eqnarray}

We know that

$$\frac{\partial {h}_3}{\partial z_3}=\sigma'(z_3)=\sigma(z_3)(1-\sigma(z_3))=h_3(1-h_3).$$
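The sigmoid identity used here can be checked numerically. The snippet below is a small sketch (the names `sigmoid_prime`, `numeric` and `max_error` are illustrative) that compares $\sigma(z)(1-\sigma(z))$ with a central finite difference on a few sample points.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative via the identity sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Central finite difference as an independent estimate of the derivative.
z = np.linspace(-4.0, 4.0, 9)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
max_error = np.max(np.abs(numeric - sigmoid_prime(z)))
```

The two estimates should agree up to finite-difference and rounding error.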

Also,

$$\frac{\partial z_3}{\partial \mathbf{h_2}}=\mathbf{W}_{3}^T.$$

The next step is to calculate $\frac{\partial \mathbf{h}_2}{\partial\mathbf{W}_1}$:

$$\frac{\partial\mathbf{h}_2}{\partial\mathbf{W}_1}=\frac{\partial \mathbf{h}_2}{\partial \mathbf{z}_2}\frac{\partial \mathbf{z}_2}{\partial \mathbf{h_1}}\frac{\partial\mathbf{h_1}}{\partial\mathbf{W}_1},$$

where

$$\frac{\partial \mathbf{h}_2}{\partial \mathbf{z}_2}=\mathbf{h}_2\odot(\mathbf{1}-\mathbf{h}_2).$$

Here $\odot$ is the Hadamard (element-wise) product and $\mathbf{1}$ is the vector of ones; strictly, the Jacobian is the diagonal matrix with these entries on its diagonal.

For $\frac{\partial \mathbf{z}_2}{\partial \mathbf{h}_1}$ we have:

$$\frac{\partial \mathbf{z}_2}{\partial \mathbf{h_1}}=\mathbf{W}_{2}^{T}.$$

The last derivative we are missing is $\frac{\partial\mathbf{h_1}}{\partial\mathbf{W}_1}$ which is

$$\frac{\partial\mathbf{h_1}}{\partial\mathbf{W}_1}=\frac{\partial\mathbf{h_1}}{\partial\mathbf{z}_1}\frac{\partial\mathbf{z}_1}{\partial \mathbf{W}_1}=\mathbf{h}_1\odot(\mathbf{1}-\mathbf{h}_1)\frac{\partial \mathbf{z}_1}{\partial \mathbf{W}_1},$$

and the derivative $\frac{\partial \mathbf{z}_1}{\partial \mathbf{W}_1}$ is

$$\frac{\partial \mathbf{z}_1}{\partial \mathbf{W}_1}=\mathbf{x}^T.$$

Combining everything, we have:

\begin{eqnarray}
\frac{\partial L_i}{\partial \mathbf{W}_1}=\dfrac{y_i-f(\mathbf{x}_i)}{f(\mathbf{x}_i)(1-f(\mathbf{x}_i))}h_3(1-h_3)\mathbf{W}_{3}^T\times\\
\times \mathbf{h}_2\odot(\mathbf{1}-\mathbf{h}_2)\mathbf{W}_{2}^{T}\mathbf{h}_1\odot(\mathbf{1}-\mathbf{h}_1)\mathbf{x}_i^T.
\end{eqnarray}

Since $f(\mathbf{x}_i)=h_3$, the factor $h_3(1-h_3)$ cancels the denominator and the expression simplifies as follows:

\begin{eqnarray}
\frac{\partial L_i}{\partial \mathbf{W}_1}=(y_i-f(\mathbf{x}_i))\mathbf{W}_{3}^T\mathbf{h}_2\odot(\mathbf{1}-\mathbf{h}_2)\mathbf{W}_{2}^{T}\mathbf{h}_1\odot(\mathbf{1}-\mathbf{h}_1)\mathbf{x}_i^T
\end{eqnarray}

Summing over all $i=1,\dots,N$ and using $L=-\sum_{i=1}^{N}L_i$, we have:

\begin{eqnarray}
\frac{\partial L}{\partial \mathbf{W}_1}=-\sum_{i=1}^N (y_i-f(\mathbf{x}_i))\mathbf{W}_{3}^T\mathbf{h}_2\odot(\mathbf{1}-\mathbf{h}_2)\mathbf{W}_{2}^{T}\mathbf{h}_1\odot(\mathbf{1}-\mathbf{h}_1)\mathbf{x}_i^T.
\end{eqnarray}
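The chain of factors above can be evaluated from the output back to the input and compared with finite differences. The sketch below does this for a single sample; it is an illustration under stated assumptions, not a definitive implementation. It assumes column vectors, $\mathbf{W}_1$ of shape $(D, H)$, $\mathbf{W}_3$ stored as a vector of length $H$, and a gradient laid out with the same shape as $\mathbf{W}_1$ (so the last factor becomes an outer product of $\mathbf{x}_i$ with the back-propagated vector). The names `sample_loss`, `grad_W1`, `d1`, `d2`, `d3` are hypothetical. The code differentiates $-L_i$, i.e. the $i$-th term of $\frac{\partial L}{\partial \mathbf{W}_1}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2, W3, b3):
    h1 = sigmoid(W1.T @ x + b1)
    h2 = sigmoid(W2.T @ h1 + b2)
    h3 = sigmoid(W3 @ h2 + b3)
    return h1, h2, h3

def sample_loss(x, y, params):
    """Per-sample negative log-likelihood, i.e. -L_i in the text."""
    f = forward(x, *params)[2]
    return -(y * np.log(f) + (1.0 - y) * np.log(1.0 - f))

def grad_W1(x, y, params):
    """Gradient of -L_i with respect to W1, laid out like W1 (shape (D, H))."""
    W1, b1, W2, b2, W3, b3 = params
    h1, h2, h3 = forward(x, *params)
    d3 = -(y - h3)                     # derivative w.r.t. z_3 (a scalar)
    d2 = (W3 * d3) * h2 * (1.0 - h2)   # back through W_3, then sigma'(z_2)
    d1 = (W2 @ d2) * h1 * (1.0 - h1)   # back through W_2, then sigma'(z_1)
    return np.outer(x, d1)             # z_1 = W_1^T x, so the gradient is x d1^T

# Compare against central finite differences on random parameters.
rng = np.random.default_rng(0)
D, H = 3, 4
params = [rng.normal(size=(D, H)), rng.normal(size=H),
          rng.normal(size=(H, H)), rng.normal(size=H),
          rng.normal(size=H), 0.1]
x, y = rng.normal(size=D), 1.0

analytic = grad_W1(x, y, params)
numeric = np.zeros((D, H))
eps = 1e-6
W1 = params[0]
for d in range(D):
    for j in range(H):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[d, j] += eps
        Wm[d, j] -= eps
        numeric[d, j] = (sample_loss(x, y, [Wp] + params[1:])
                         - sample_loss(x, y, [Wm] + params[1:])) / (2.0 * eps)
max_error = np.max(np.abs(analytic - numeric))
```

Summing `grad_W1` over all samples would give the full $\frac{\partial L}{\partial \mathbf{W}_1}$ under these conventions.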

I initially carried out the same derivation in terms of explicit summations over all the nodes, which gives:

\begin{eqnarray}
\frac{\partial L}{\partial \mathbf{W}_1}=-\sum_{i=1}^N(y_i-f(\mathbf{x}_i))\sum_{l=1}^{H}W_{3,l}^Th_2^l(1-h_2^l)\sum_{m=1}^{H}W_{2,m}^{lT}h_1^m(1-h_1^m)\mathbf{x}_i^T.
\end{eqnarray}

**3.2 Gradient Evaluation**

We need to compute $\delta_l^i:= \frac{\partial L_i}{\partial \mathbf{z}_l}$, knowing
$\delta_{l+1}^i$, $\mathbf{W}_{l+1}$, and $\mathbf{h}_l$.

Using our previous derivation, we can write

\begin{eqnarray}
\delta_l^i=\frac{\partial L_i}{\partial \mathbf{z}_l}=\dfrac{\partial L_i}{\partial \mathbf{z}_{l+1}}  \dfrac{\partial \mathbf{z}_{l+1}}{\partial \mathbf{h}_l}   \dfrac{\partial \mathbf{h}_l}{\partial \mathbf{z}_l}  =  \delta_{l+1}^i  \dfrac{\partial \mathbf{z}_{l+1}}{\partial \mathbf{h}_l}   \dfrac{\partial \mathbf{h}_l}{\partial \mathbf{z}_l} =\\
=  \delta_{l+1}^i  \mathbf{W}^T_{l+1}   \dfrac{\partial \mathbf{h}_l}{\partial \mathbf{z}_l}  =\delta_{l+1}^i  \mathbf{W}^T_{l+1}  \mathbf{h}_l\odot(\mathbf{1}-\mathbf{h}_l).
\end{eqnarray}

Thus,

$$\delta_l^i=\delta_{l+1}^i  \mathbf{W}^T_{l+1}  \mathbf{h}_l\odot(\mathbf{1}-\mathbf{h}_l).$$
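One backward step of this recursion can be sketched as a single function. The sketch assumes column vectors, so the product with $\mathbf{W}_{l+1}$ is written as `W_next @ delta_next` and the sigmoid factor is applied element-wise; the function name `delta_prev` and the argument names are hypothetical.

```python
import numpy as np

def delta_prev(delta_next, W_next, h):
    """One backward step: delta_l from delta_{l+1}, W_{l+1} and h_l.

    Assumed shapes: delta_next (H_next,), W_next (H, H_next), h (H,).
    """
    return (W_next @ delta_next) * h * (1.0 - h)

# Small worked example with hand-checkable numbers.
W_next = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
delta_next = np.array([1.0, -1.0])
h = np.array([0.5, 0.5])
delta = delta_prev(delta_next, W_next, h)  # -> [-0.25, -0.25]
```

Here `W_next @ delta_next` is `[-1, -1]` and $h(1-h)=0.25$ for both units, which gives the value shown. Applying the function repeatedly from the output layer down to layer 1 yields all $\delta_l^i$ in a single backward sweep.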