## Backprop Workbook 01

**For these questions, assume that an $x$ input has 1024 dimensions, that the first hidden layer should have $512$ units, a second layer has $256$ units, and that there are $10$ classes to choose from at the end.**

**Cell to run for LaTeX commands**

\\[
\newcommand{\fpartial}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\grad}[1]{\nabla #1}
\newcommand{\softmax}[0]{\text{SOFTMAX}}
\\]

### Backpropagation to $z^{(3)}$

**What is $\fpartial{}{x} \log x$?**

$\frac{1}{x}$

**What is the derivative $\fpartial{}{x} g(f(x))$ in terms of $\fpartial{}{x} f(x)$ and $\fpartial{}{x} g(x)$? Recall that just like you can use the same variable name in two functions and they have nothing to do with each other, the $x$ in the $\fpartial{}{x} f(x)$ isn't necessarily the same as the $x$ in $\fpartial{}{x} g(x)$.**

Chain rule.

\\[
\fpartial{}{x} g\left(f\left(x\right)\right)
=
\left(\fpartial{}{y} g\left(y\right)\right)
    \left(f\left(x\right)\right)
\left(\fpartial{}{x} f\left(x\right)\right)
\\]
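
A quick numerical sanity check of the chain rule can help here. The sketch below picks $g = \log$ and a made-up inner function $f(x) = x^2 + 1$ (both are illustrative choices, not part of the workbook) and compares the chain-rule value against a central finite difference.

```python
import math

def f(x):
    # Hypothetical inner function, chosen only for illustration.
    return x * x + 1.0

def f_prime(x):
    return 2.0 * x

def g(y):
    # Outer function: the log from the first question.
    return math.log(y)

def g_prime(y):
    return 1.0 / y

def chain_rule_derivative(x):
    # The outer derivative is evaluated at f(x), not at x.
    return g_prime(f(x)) * f_prime(x)

def central_difference(fn, x, eps=1e-6):
    # Numerical estimate of d fn / d x.
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)

x0 = 1.5
analytic = chain_rule_derivative(x0)
numeric = central_difference(lambda x: g(f(x)), x0)
print(analytic, numeric)
```

The two printed numbers should agree to several decimal places.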

**We want to calculate $\fpartial{}{z^{(3)}_i} CE_\text{vector}(h^{(3)}, y)$. Recall that $h^{(3)} = \text{SOFTMAX}\left(z^{(3)}\right)$. So before anything, let's write $CE_\text{vector}(h^{(3)}, y)$ in terms of the $z^{(3)}$ by expanding the dot-product of $h^{(3)}$ and substituting the formula in terms of $z^{(3)}$.**

\\[
\begin{align}
    CE\left(h^{(3)}, y\right)
&=
    -\log\left(h^{(3)} \cdot y\right)
\\
&=
    -\log \sum_{i = 0}^{9} h^{(3)}_i y_i
\\
&=
    -\log \sum_{i = 0}^{9}
        y_i
        \frac{
            \exp\left(z^{(3)}_i\right)
        }{
            \sum_{j=0}^9
            \exp\left(z^{(3)}_j\right)
        }
\end{align}
\\]


**Next let the scalar value $y^*$ be the correct class. That is $y_i = 1$ exactly when $i = y^*$; else $y_i$ is zero.**

**Notice then that all the terms of the sum don't matter except for $i = y^*$. So simplify the above formula**

\\[
\begin{align}
    CE\left(h^{(3)}, y\right)
&=
    -\log \sum_{i = 0}^{9}
        y_i
        \frac{
            \exp\left(z^{(3)}_i\right)
        }{
            \sum_{j=0}^9
            \exp\left(z^{(3)}_j\right)
        }
\\
&=
    -\log\left(
        y_{y^*}
        \frac{
            \exp\left(z^{(3)}_{y^*}\right)
        }{
            \sum_{j=0}^9
            \exp\left(z^{(3)}_j\right)
        }
    \right)
\\
&=
    -\log\left(
        \frac{
            \exp\left(z^{(3)}_{y^*}\right)
        }{
            \sum_{j=0}^9
            \exp\left(z^{(3)}_j\right)
        }
    \right)
\end{align}
\\]

Here we replace $y_{y^*}$ with 1 since this is the definition of the one-hot encoding.

From here out, let's use $S := \sum_{j=0}^9 \exp\left(z^{(3)}_j\right)$ because I'm lazy.

**Let's calculate $\fpartial{}{z^{(3)}_i} CE_\text{vector}(h^{(3)}, y)$ for $i \ne y^*$. First, it will help to simplify a log of a fraction into a difference of logs.**

\\[
\begin{align}
    CE\left(h^{(3)}, y\right)
&=
    -\log\left(
        \frac{
            \exp\left(z^{(3)}_{y^*}\right)
        }{
            \sum_{j=0}^9
            \exp\left(z^{(3)}_j\right)
        }
    \right)
\\
&=
    -\log \left(\exp\left(z^{(3)}_{y^*}\right)\right)
    +
    \log \left(\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)\right)
\\
&=
    -z^{(3)}_{y^*}
    +
    \log \left(\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)\right)
\end{align}
\\]
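
To see that the rewriting did not change the value, here is a minimal sketch that evaluates the loss both ways: as $-\log$ of the full one-hot sum, and as $-z^{(3)}_{y^*}$ plus the log of the sum of exponentials. The logit values and the choice of correct class are made up for illustration.

```python
import math

def softmax(z):
    total = sum(math.exp(v) for v in z)
    return [math.exp(v) / total for v in z]

def ce_full_sum(z, y):
    # -log( sum_i y_i * softmax(z)_i )
    h = softmax(z)
    return -math.log(sum(h_i * y_i for h_i, y_i in zip(h, y)))

def ce_log_difference(z, y_star):
    # -z_{y*} + log( sum_j exp(z_j) )
    return -z[y_star] + math.log(sum(math.exp(v) for v in z))

# Hypothetical logits for the 10 classes, and a hypothetical correct class.
z = [0.5, -1.2, 2.0, 0.1, 0.0, -0.3, 1.1, 0.7, -2.0, 0.4]
y_star = 2
y = [1.0 if i == y_star else 0.0 for i in range(len(z))]

loss_a = ce_full_sum(z, y)
loss_b = ce_log_difference(z, y_star)
print(loss_a, loss_b)
```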

**Next, what is the derivative of the first term? Why?**

The first term is just $-z^{(3)}_{y^*}$, so if we differentiate with respect to some $z^{(3)}_i$ where $i$ is *not* the correct class, the derivative is zero.

Basically, a change to the other $z^{(3)}$ values doesn't change the numerator.


**Using the rule that the derivative of $\log a$ wrt $a$ is $1/a$, and also the chain rule, do the first step of the derivative of the second term wrt $z^{(3)}_i$.**

\\[
\begin{align}
    \fpartial{}{z^{(3)}_i}
        \log \left(\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)\right)
&=
    \frac{1}{\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)}
    \fpartial{}{z^{(3)}_i}
        \left(\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)\right)
\end{align}
\\]

**Next, use the rule that the derivative of $e^a$ wrt $a$ is also $e^a$.**

\\[
\begin{align}
    \frac{1}{\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)}
    \fpartial{}{z^{(3)}_i}
        \left(\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)\right)
&=
    \frac{1}{\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)}
    \exp\left(z^{(3)}_i\right)
\end{align}
\\]

**Finally, use the definition of $h^{(3)}_i = \text{SOFTMAX}\left(z^{(3)}\right)_i$ to simplify this.**

\\[
\begin{align}
    \frac{1}{\sum_{j=0}^9 \exp\left(z^{(3)}_j\right)}
    \exp\left(z^{(3)}_i\right)
&=
    h^{(3)}_i
\end{align}
\\]
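
The three steps above say that the partial derivative of the log-sum-exp term with respect to $z^{(3)}_i$ is exactly the softmax output $h^{(3)}_i$. A minimal sketch that checks this with finite differences, using made-up logits:

```python
import math

def log_sum_exp(z):
    return math.log(sum(math.exp(v) for v in z))

def softmax(z):
    total = sum(math.exp(v) for v in z)
    return [math.exp(v) / total for v in z]

def numeric_partial(fn, z, i, eps=1e-6):
    # Central difference in coordinate i only.
    z_plus = list(z)
    z_plus[i] += eps
    z_minus = list(z)
    z_minus[i] -= eps
    return (fn(z_plus) - fn(z_minus)) / (2.0 * eps)

# Hypothetical logits for the 10 classes.
z = [0.5, -1.2, 2.0, 0.1, 0.0, -0.3, 1.1, 0.7, -2.0, 0.4]
h = softmax(z)
numeric = [numeric_partial(log_sum_exp, z, i) for i in range(len(z))]
max_gap = max(abs(a - b) for a, b in zip(h, numeric))
print(max_gap)
```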


**Now, if $i = y^*$, what is the derivative of the first term of the difference of logs?**

It is $-1.0$.

**Does the formula for the derivative of the second difference term change?**

No.

**Thus, what is $\fpartial{}{z^{(3)}_i} CE_\text{vector}(h^{(3)}, y)$ when $i=y^*$?**

$-1.0 + h^{(3)}_i$.

**The gradient $\grad_{z^{(3)}} CE_\text{vector}(h^{(3)}, y)$ is a vector. What are its entries?**

\\[
\left(\grad_{z^{(3)}} CE_\text{vector}(h^{(3)}, y)\right)_i
=
\fpartial{}{z^{(3)}_i} CE_\text{vector}(h^{(3)}, y)
\\]

**What is the vectorized formula for $\grad_{z^{(3)}} CE_\text{vector}(h^{(3)}, y)$?**

\\[
\grad_{z^{(3)}} CE_\text{vector}(h^{(3)}, y)
=
h^{(3)} - y
\\]

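Here $y$ is the one-hot encoding. This covers both cases: for $i \ne y^*$ we have $y_i = 0$, so the entry is $h^{(3)}_i$; for $i = y^*$ we have $y_i = 1$, so the entry is $h^{(3)}_i - 1$.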
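
As a check on the vectorized formula, the sketch below compares $h^{(3)} - y$ against a finite-difference gradient of the full loss. The logits and the correct class are made up, and the function names are hypothetical helpers rather than anything defined in the workbook.

```python
import math

def softmax(z):
    total = sum(math.exp(v) for v in z)
    return [math.exp(v) / total for v in z]

def cross_entropy(z, y):
    # CE(h, y) = -log(h . y), with h = softmax(z)
    h = softmax(z)
    return -math.log(sum(h_i * y_i for h_i, y_i in zip(h, y)))

def analytic_gradient(z, y):
    # The claimed gradient: h - y.
    return [h_i - y_i for h_i, y_i in zip(softmax(z), y)]

def numeric_gradient(z, y, eps=1e-6):
    grad = []
    for i in range(len(z)):
        z_plus = list(z)
        z_plus[i] += eps
        z_minus = list(z)
        z_minus[i] -= eps
        diff = cross_entropy(z_plus, y) - cross_entropy(z_minus, y)
        grad.append(diff / (2.0 * eps))
    return grad

# Hypothetical logits and correct class.
z = [0.5, -1.2, 2.0, 0.1, 0.0, -0.3, 1.1, 0.7, -2.0, 0.4]
y_star = 6
y = [1.0 if i == y_star else 0.0 for i in range(len(z))]

gap = max(abs(a - n) for a, n in zip(analytic_gradient(z, y), numeric_gradient(z, y)))
print(gap)
```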

**Let's do an intuition check. What entries of the gradient are positive? Which are negative? Why?**

Every entry except entry $y^*$ is positive, since it equals $h^{(3)}_i > 0$. Increasing those log odds decreases the probability of the right answer, which increases the loss.

The entry at $y^*$ is negative, since it equals $h^{(3)}_{y^*} - 1 < 0$. Increasing $z^{(3)}_{y^*}$ increases the probability of the correct answer, so it reduces the loss.
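
The sign pattern is easy to confirm on a small example. This sketch (made-up logits, hypothetical correct class) builds $h^{(3)} - y$ and lists which entries are negative and which are positive.

```python
import math

# Hypothetical logits and correct class.
z = [0.5, -1.2, 2.0, 0.1, 0.0, -0.3, 1.1, 0.7, -2.0, 0.4]
y_star = 3

total = sum(math.exp(v) for v in z)
h = [math.exp(v) / total for v in z]
grad = [h_i - (1.0 if i == y_star else 0.0) for i, h_i in enumerate(h)]

negative_entries = [i for i, g in enumerate(grad) if g < 0]
positive_entries = [i for i, g in enumerate(grad) if g > 0]
print(negative_entries, positive_entries)
```

Only the correct class shows up in the negative list. The entries also sum to zero, because the softmax outputs sum to one and the one-hot vector removes exactly one.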

### Backprop to $W^{(3)}$


**To update the weights $W^{(3)}$ we must calculate $\grad_{W^{(3)}} CE_\text{vector}(h^{(3)}, y)$. What is the shape of this "2 dimensional gradient"?**

$(256, 10)$.

**What does each entry $\left(\grad_{W^{(3)}} CE_\text{vector}(h^{(3)}, y)\right)_{i, j}$ mean?**

It is how much the loss changes if we change $W^{(3)}_{i, j}$ a little.

**Reminder: which value in the second hidden layer does $W^{(3)}_{i, j}$ connect to which pre-activation in the third layer?**

$h^{(2)}_i$ to $z^{(3)}_j$.

**If, for some $j$, $\fpartial{}{z^{(3)}_j} CE_\text{vector}(h^{(3)}, y)$ is zero, what is $\left(\grad_{W^{(3)}} CE_\text{vector}(h^{(3)}, y)\right)_{i, j}$ for every $i$? Why?**

It must be zero. Changing any $W^{(3)}_{i, j}$ may change $z^{(3)}_j$, but we know that changing $z^{(3)}_j$ has no impact on the loss.

**If, for some $i$, $h^{(2)}_i$ is zero, what is $\left(\grad_{W^{(3)}} CE_\text{vector}(h^{(3)}, y)\right)_{i, j}$ for every $j$? Why?**

It must be zero. $z^{(3)}_j$ is a weighted sum of the $h^{(2)}_i$, so if $h^{(2)}_i$ is zero, changing the associated weight $W^{(3)}_{i, j}$ doesn't change $z^{(3)}_j$, and so it doesn't change the loss either.

**Given the above, what two "forces" does $\left(\grad_{W^{(3)}} CE_\text{vector}(h^{(3)}, y)\right)_{i, j}$ need to combine?**

First, the amount that changing $W^{(3)}_{i, j}$ changes $z^{(3)}_j$.

Second, the amount that changing $z^{(3)}_j$ changes the loss.

**How do the two forces combine for $\left(\grad_{W^{(3)}} CE_\text{vector}(h^{(3)}, y)\right)_{i, j}$? Remember that $CE$ is a function of $z^{(3)}_j$ and $z^{(3)}_j$ is a function of $W^{(3)}_{i, j}$. Something something chain rule? First calculate $\fpartial{}{W^{(3)}_{i, j}} z^{(3)}_j$.**

\\[
\begin{align}
    \fpartial{}{W^{(3)}_{i, j}} z^{(3)}_j
&=
    \fpartial{}{W^{(3)}_{i, j}} \sum_{k=0}^{255} h^{(2)}_k W^{(3)}_{k, j}
\\
&=
    h^{(2)}_i
\end{align}
\\]
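
Here is a small numerical check of that derivative. To keep it readable, the sketch uses a toy layer with 4 hidden units and 3 outputs instead of 256 and 10, with made-up values and no bias term (both are assumptions for illustration).

```python
def pre_activation(h2, W, j):
    # z_j = sum_k h2[k] * W[k][j]
    return sum(h2[k] * W[k][j] for k in range(len(h2)))

# Toy values: 4 hidden units, 3 outputs. Note that h2[1] is zero.
h2 = [0.3, 0.0, 1.2, -0.7]
W = [
    [0.1, -0.2, 0.4],
    [0.5, 0.3, -0.1],
    [-0.6, 0.2, 0.0],
    [0.9, -0.4, 0.7],
]

def numeric_partial(i, j, target_j, eps=1e-6):
    # Central difference of z_{target_j} with respect to W[i][j].
    W_plus = [row[:] for row in W]
    W_plus[i][j] += eps
    W_minus = [row[:] for row in W]
    W_minus[i][j] -= eps
    diff = pre_activation(h2, W_plus, target_j) - pre_activation(h2, W_minus, target_j)
    return diff / (2.0 * eps)

same_column = numeric_partial(2, 1, 1)    # should match h2[2]
zero_input = numeric_partial(1, 1, 1)     # h2[1] is zero, so no effect
other_column = numeric_partial(2, 1, 0)   # W[2][1] does not feed z_0
print(same_column, zero_input, other_column)
```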


1. Does it matter what $j$ is?
2. Let's just calculate the vector for $W^{(3)}_{:, j}$ for any $j$.
3. What is the dimension of that vector?
4. Something something outer product.
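
One way these prompts can be followed through: combining the two forces by the chain rule gives entry $(i, j)$ as $h^{(2)}_i \left(h^{(3)}_j - y_j\right)$, which is the outer product of $h^{(2)}$ with $h^{(3)} - y$. The sketch below checks that against finite differences on a toy layer (4 hidden units and 3 classes instead of 256 and 10, made-up values, no bias term; all of these are assumptions for illustration).

```python
import math

def softmax(z):
    total = sum(math.exp(v) for v in z)
    return [math.exp(v) / total for v in z]

def logits(W, h2):
    # z_j = sum_k h2[k] * W[k][j]
    return [sum(h2[k] * W[k][j] for k in range(len(h2))) for j in range(len(W[0]))]

def loss(W, h2, y_star):
    # Cross entropy in its simplified form: -log softmax(z)_{y*}
    return -math.log(softmax(logits(W, h2))[y_star])

def outer_product_gradient(W, h2, y_star):
    h3 = softmax(logits(W, h2))
    delta = [h3[j] - (1.0 if j == y_star else 0.0) for j in range(len(h3))]
    # Entry (i, j) is h2[i] * delta[j].
    return [[h2[i] * delta[j] for j in range(len(delta))] for i in range(len(h2))]

def numeric_gradient(W, h2, y_star, eps=1e-6):
    grad = []
    for i in range(len(W)):
        row = []
        for j in range(len(W[0])):
            W_plus = [r[:] for r in W]
            W_plus[i][j] += eps
            W_minus = [r[:] for r in W]
            W_minus[i][j] -= eps
            row.append((loss(W_plus, h2, y_star) - loss(W_minus, h2, y_star)) / (2.0 * eps))
        grad.append(row)
    return grad

# Toy values: 4 hidden units, 3 classes. Note that h2[1] is zero.
h2 = [0.3, 0.0, 1.2, -0.7]
W = [
    [0.1, -0.2, 0.4],
    [0.5, 0.3, -0.1],
    [-0.6, 0.2, 0.0],
    [0.9, -0.4, 0.7],
]
y_star = 2

grad = outer_product_gradient(W, h2, y_star)
approx = numeric_gradient(W, h2, y_star)
gap = max(abs(grad[i][j] - approx[i][j]) for i in range(4) for j in range(3))
print(gap)
```

The gradient has the same shape as the weight matrix (here 4 by 3, and 256 by 10 in the workbook's setup), and the row belonging to the zero hidden unit is all zeros, matching the earlier answer.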