## Backprop Workbook 02: Backprop to Hidden Layer

**For these questions, assume that an $x$ input has 1024 dimensions, that the first hidden layer should have $512$ units, a second layer has $256$ units, and that there are $10$ classes to choose from at the end.**

**Cell to run for LaTeX commands**

\\[
\newcommand{\fpartial}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\grad}[1]{\nabla #1}
\newcommand{\softmax}[0]{\text{SOFTMAX}}
\\]

## Backprop Further Through Gradient Chains

We've now calculated gradients for the final layer's weights and biases. Next we want the corresponding gradients for the second layer: $\grad_{W^{(2)}} CE(h^{(3)}, y)$ and $\grad_{b^{(2)}} CE(h^{(3)}, y)$.

How do changes to these weights and biases affect the loss?

1. Changes to $W^{(2)}$ and $b^{(2)}$ change the $z^{(2)}$ values.
2. Changes to the $z^{(2)}$ values change the $h^{(2)}$ values.
3. Changes to the $h^{(2)}$ values change the $z^{(3)}$ values.
4. And we already know how changes to $z^{(3)}$ values change the loss.

In this section we will work our way backward: 3, 2, 1. Blast off!
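To make the plan concrete, here is a minimal NumPy sketch of the forward pass with the layer sizes stated at the top of the workbook. It assumes $z = hW + b$ with row-vector activations, a sigmoid on both hidden layers (the workbook only states this for layer two), and a softmax output. The random parameters and the names `W1`, `b1`, and so on are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the row max avoids overflow and does not change the result.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Layer sizes from the workbook: 1024 -> 512 -> 256 -> 10.
# Random parameters are placeholders, not trained values.
x = rng.normal(size=(1, 1024))
W1, b1 = 0.01 * rng.normal(size=(1024, 512)), np.zeros((1, 512))
W2, b2 = 0.01 * rng.normal(size=(512, 256)), np.zeros((1, 256))
W3, b3 = 0.01 * rng.normal(size=(256, 10)), np.zeros((1, 10))

h1 = sigmoid(x @ W1 + b1)   # assumed sigmoid on layer one
z2 = h1 @ W2 + b2           # step 1: W2 and b2 determine z2
h2 = sigmoid(z2)            # step 2: z2 determines h2
z3 = h2 @ W3 + b3           # step 3: h2 determines z3
h3 = softmax(z3)            # step 4: z3 determines h3, and hence the loss

print(z2.shape, h2.shape, z3.shape, h3.shape)
```

Backprop walks these same four lines in reverse order.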

**1. Write $\grad_{b^{(2)}} CE(h^{(3)}, y)$ as a chain of four gradients. Take inspiration from the above plan.**

\\[
\grad_{b^{(2)}} CE(h^{(3)}, y)
=
\left(
    \grad_{b^{(2)}} z^{(2)}
\right)
\left(
    \grad_{z^{(2)}} h^{(2)}
\right)
\left(
    \grad_{h^{(2)}} z^{(3)}
\right)
\left(
    \grad_{z^{(3)}} CE(h^{(3)}, y)
\right)
\\]

## Gradient Shapes

**1. Gradients like $\grad_{z^{(3)}} CE(h^{(3)}, y)$ are familiar because there are many inputs, but only one output. What is the length of this vector? Why?**

It is length ten because there are ten $z^{(3)}$ values and thus ten partial derivatives.

**Note:** A gradient like $\grad_{h^{(2)}} z^{(3)}$ feels weird because there are many inputs *and* many outputs. We know it must collect terms like $\fpartial{z^{(3)}_j}{h^{(2)}_i}$, but what is the shape of that matrix? How do we organize the terms?

Let's go back to simple gradients. Let's say I have a function $f(x)$ that takes in a vector $x$ and outputs a scalar value. I want to be able to write:

\\[
\left(\Delta x\right) \left(\grad_x f(x)\right)
=
\Delta f(x)
\\]

Here I want to do the matrix multiply of $\Delta x$ with $\grad_x f(x)$.
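A small numerical sketch of this idea, using a hypothetical function $f(x) = \sum_i x_i^2$ that is not part of the workbook. The relation holds to first order, so the check uses a small $\Delta x$.

```python
import numpy as np

def f(x):
    # Hypothetical scalar function of a row vector: sum of squares.
    return float(np.sum(x ** 2))

x = np.array([[1.0, 2.0, 3.0]])        # row vector, shape (1, 3)
grad = (2 * x).T                        # gradient as a column, shape (3, 1)
dx = np.array([[1e-6, -2e-6, 3e-6]])    # small change, shape (1, 3)

predicted = (dx @ grad).item()          # (1, 3) @ (3, 1) -> scalar
actual = f(x + dx) - f(x)

print(predicted, actual)
```

The two numbers agree up to terms of order $\|\Delta x\|^2$, which is why the gradient has to be a column when $\Delta x$ is a row.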


**2. Given our convention that a vector like $h^{(2)}$ should be interpreted as a row-vector with shape $(1, 256)$, what do we want the shape of $\grad_{h^{(2)}} z^{(3)}_j$ to be so that the matrix product:**

\\[
\left(\Delta h^{(2)}\right)
\left(\grad_{h^{(2)}} z^{(3)}_j\right)
=
\Delta z^{(3)}_j
\\]

**works out? What kind of vector is that?**

The shape should be $(256, 1)$. Column vector.

**3. Let's generalize to $\grad_{h^{(2)}} z^{(3)}$. I still want:**

\\[
\left(\Delta h^{(2)}\right)
\left(\grad_{h^{(2)}} z^{(3)}\right)
=
\Delta z^{(3)}
\\]

**What is the desired shape? Give me a formula for $\left(\grad_{h^{(2)}} z^{(3)}\right)_{i, j}$.**

We want the gradient to have 256 rows and 10 columns.

We want

\\[
\left(\grad_{h^{(2)}} z^{(3)}\right)_{i, j}
=
\fpartial{z^{(3)}_j}{h^{(2)}_i}
\\]

## Calculating $\grad_{h^{(2)}} z^{(3)}$ and $\grad_{h^{(2)}} CE(h^{(3)}, y)$

**1. In our plan we know the last gradient $\grad_{z^{(3)}} CE(h^{(3)}, y)$, so let's work backward and start with $\grad_{h^{(2)}} z^{(3)}$. Let's focus on just a single partial: $\fpartial{}{h^{(2)}_i} z^{(3)}_j$. Use the formula for $z^{(3)}_j$ to calculate this.**

\\[
\fpartial{}{h^{(2)}_i} z^{(3)}_j
=
\fpartial{}{h^{(2)}_i} \left(
    \sum_{k = 0}^{255} h^{(2)}_k W^{(3)}_{k, j} + b^{(3)}_j
\right)
=
W^{(3)}_{i, j}
\\]

**2. Why does $W^{(3)}_{i, j}$ feel like the right answer?**

Because it is the weight that connects $h^{(2)}_i$ to $z^{(3)}_j$. Any change in $h^{(2)}_i$ will be "magnified" by $W^{(3)}_{i, j}$.

**3. Using this result, and our definition above for $\left(\grad_{h^{(2)}} z^{(3)}\right)_{i, j}$, give an equation for $\grad_{h^{(2)}} z^{(3)}$.**


\\[
\grad_{h^{(2)}} z^{(3)}
=
W^{(3)}
\\]
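As a sanity check, the sketch below builds the matrix of partials $\fpartial{z^{(3)}_j}{h^{(2)}_i}$ by finite differences and compares it with $W^{(3)}$. It assumes $z^{(3)} = h^{(2)} W^{(3)} + b^{(3)}$ with row vectors; the random values and the helper `z3_of` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
h2 = rng.normal(size=(1, 256))
W3 = rng.normal(size=(256, 10))
b3 = rng.normal(size=(1, 10))

def z3_of(h):
    # Assumed layer-three pre-activation: z3 = h2 W3 + b3.
    return h @ W3 + b3

eps = 1e-6
J = np.zeros((256, 10))
for i in range(256):
    dh = np.zeros((1, 256))
    dh[0, i] = eps
    # Row i holds the partials of every z3_j with respect to h2_i.
    J[i, :] = (z3_of(h2 + dh) - z3_of(h2)) / eps

print(J.shape, np.allclose(J, W3, atol=1e-5))
```

Because $z^{(3)}$ is linear in $h^{(2)}$, the finite-difference estimate matches $W^{(3)}$ up to rounding error.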

**4. Great. Let's break this down to understand better. Give a formula for $\grad_{h^{(2)}} z^{(3)}_j$ in terms of $W^{(3)}$. What is the shape of this? Why does this formula make sense?**

\\[
\grad_{h^{(2)}} z^{(3)}_j = W^{(3)}_{:, j}
\\]

This is a vector with shape $(256, 1)$. It makes sense because all 256 units of $h^{(2)}$ are connected to the $z^{(3)}_j$ value. This column consists of exactly the weights used to calculate $z^{(3)}_j$ and scale the values in $h^{(2)}$.

**5. Using the above formula, consider a change $\Delta h^{(2)}$ to the 256 dimensions of $h^{(2)}$. Use the gradient for $z^{(3)}_j$ to calculate the change in $z^{(3)}_j$. Break it down to the summation level even. Do this both in terms of partials $\fpartial{z^{(3)}_j}{h^{(2)}_i}$ and $W^{(3)}$. Give an explanation for the formulae.**

\\[
\begin{align}
    \Delta z^{(3)}_j
    &=
    \left(
        \Delta h^{(2)}
    \right)
    \left(
        \grad_{h^{(2)}} z^{(3)}_j
    \right)
    =
    \sum_{i = 0}^{255}
    \Delta h^{(2)}_i \fpartial{z^{(3)}_j}{h^{(2)}_i}
\\
    &=
    \left(
        \Delta h^{(2)}
    \right)
    \left(
        W^{(3)}_{:, j}
    \right)
    =
    \sum_{i = 0}^{255}
    \Delta h^{(2)}_i W^{(3)}_{i, j}
\end{align}
\\]

Basically each change to $h^{(2)}$ has its own impact on the $z^{(3)}_j$ value. We need to evaluate those impacts and sum them up.
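The equivalence between the matrix product and the explicit summation can be sketched as follows. The weights, the change `dh2`, and the choice `j = 4` are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
W3 = rng.normal(size=(256, 10))
dh2 = 1e-3 * rng.normal(size=(1, 256))   # a change to all 256 units of h2
j = 4                                    # arbitrary output unit

column = W3[:, [j]]                      # gradient of z3_j, shape (256, 1)
as_matmul = (dh2 @ column).item()        # (1, 256) @ (256, 1) -> scalar
as_sum = sum(dh2[0, i] * W3[i, j] for i in range(256))

print(column.shape, as_matmul, as_sum)
```

Indexing with `[j]` rather than `j` keeps the column two-dimensional, so its shape matches the $(256, 1)$ convention used here.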

**6. Give a formula for $\grad_{h^{(2)}_i} z^{(3)}$ in terms of $W^{(3)}$. What is the shape of this? Column or row vector? Why? Why does this formula make sense?**

\\[
\grad_{h^{(2)}_i} z^{(3)} = W^{(3)}_{i, :}
\\]

This is a row vector of shape $(1, 10)$. This way when multiplied by a scalar change $\Delta h^{(2)}_i$ you get a proper $\Delta z^{(3)}$ row vector.

It makes sense because the unit $h^{(2)}_i$ is connected to all $10$ units of $z^{(3)}$.

**7. Using the above formula, consider a scalar change $\Delta h^{(2)}_i$. Calculate the change in $z^{(3)}$. Do this both in terms of partials $\fpartial{z^{(3)}_j}{h^{(2)}_i}$ and $W^{(3)}$. Give an explanation for the formulas.**

\\[
\begin{align}
    \Delta z^{(3)}
    &=
    \Delta h^{(2)}_i \grad_{h^{(2)}_i} z^{(3)}
    =
    \left(
        \Delta h^{(2)}_i
        \fpartial{z^{(3)}_0}{h^{(2)}_i}
        ,
        \Delta h^{(2)}_i
        \fpartial{z^{(3)}_1}{h^{(2)}_i}
        ,
        \ldots
        ,
        \Delta h^{(2)}_i
        \fpartial{z^{(3)}_9}{h^{(2)}_i}
    \right)
\\
    &=
    \Delta h^{(2)}_i W^{(3)}_{i, :}
    =
    \left(
        \Delta h^{(2)}_i
        W^{(3)}_{i, 0}
        ,
        \Delta h^{(2)}_i
        W^{(3)}_{i, 1}
        ,
        \ldots
        ,
        \Delta h^{(2)}_i
        W^{(3)}_{i, 9}
    \right)
\end{align}
\\]
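A brief sketch of the same calculation in NumPy: scaling row $i$ of $W^{(3)}$ by a scalar change gives the same $\Delta z^{(3)}$ as a full row vector $\Delta h^{(2)}$ that is zero everywhere except at position $i$. The values and the index `i = 7` are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
W3 = rng.normal(size=(256, 10))
i = 7               # arbitrary hidden unit
dh2_i = 1e-3        # scalar change to h2_i

row = W3[[i], :]                 # gradient of z3 w.r.t. h2_i, shape (1, 10)
dz3 = dh2_i * row                # change in all ten z3 values

# Same change expressed as a full row vector with one non-zero entry.
dh2 = np.zeros((1, 256))
dh2[0, i] = dh2_i
dz3_full = dh2 @ W3

print(dz3.shape, np.allclose(dz3, dz3_full))
```

Each of the ten $z^{(3)}_j$ values moves by the scalar change times the weight connecting it to $h^{(2)}_i$.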

**8. The chain rule says:**

\\[
\grad_{h^{(2)}} CE(h^{(3)}, y)
=
\left(
    \grad_{h^{(2)}} z^{(3)}
\right)
\left(
    \grad_{z^{(3)}} CE(h^{(3)}, y)
\right)
\\]

**Tell me the shapes of the terms in the product. Tell me about the final shape.**

The shapes are $(256, 10)$ and $(10,1)$. The product is a vector $(256, 1)$.

**9. Why does it make sense that the final shape of the gradient is $(256, 1)$?**


There are $256$ units in layer two, but we're assessing their impact on a single scalar value: the loss.

**10. To calculate the matrix product, we take the dot product of rows of $\grad_{h^{(2)}} z^{(3)}$ with $\grad_{z^{(3)}} CE(h^{(3)}, y)$. This dot product is $\grad_{h^{(2)}} CE(h^{(3)}, y)_i$.**

**Write a formula with a summation for this for row $i$. Do this both in terms of partials and in terms of $W^{(3)}$.**

\\[
\begin{align}
    \grad_{h^{(2)}} CE(h^{(3)}, y)_i
    &=
    \left(
        \grad_{h^{(2)}} z^{(3)}
    \right)_{i, :}
    \left(
        \grad_{z^{(3)}} CE(h^{(3)}, y)
    \right)
    =
    \sum_{j = 0}^{9}
    \left(
        \fpartial{z^{(3)}_j}{h^{(2)}_i}
    \right)
    \left(
        \fpartial{CE(h^{(3)}, y)}{z^{(3)}_j}
    \right)
\\
    &=
    W^{(3)}_{i, :}
    \left(
        \grad_{z^{(3)}} CE(h^{(3)}, y)
    \right)
    =
    \sum_{j = 0}^{9}
    W^{(3)}_{i, j}
    \left(
        \fpartial{CE(h^{(3)}, y)}{z^{(3)}_j}
    \right)
\end{align}
\\]

**11. Give me a story in words for this formula.**


A change in $h^{(2)}_i$ affects all of the $z^{(3)}_j$ values via the weights in row $W^{(3)}_{i, :}$. And a change in a $z^{(3)}_j$ value causes a change in $CE(h^{(3)}, y)$.

The amount of change for $CE(h^{(3)}, y)$ "via" the value $z^{(3)}_j$ is equal to the product of the two partial derivatives.

The total change in cross-entropy comes from summing up over all the "routes" via which $h^{(2)}_i$ can impact the cross-entropy.
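The "sum over routes" story can be sketched directly. Here `dCE_dz3` is a placeholder column of random numbers standing in for $\grad_{z^{(3)}} CE(h^{(3)}, y)$, and `i = 3` is an arbitrary hidden unit.

```python
import numpy as np

rng = np.random.default_rng(3)
W3 = rng.normal(size=(256, 10))
dCE_dz3 = rng.normal(size=(10, 1))   # placeholder for the gradient w.r.t. z3
i = 3                                # arbitrary hidden unit

# One contribution per route h2_i -> z3_j -> CE.
routes = [W3[i, j] * dCE_dz3[j, 0] for j in range(10)]
by_sum = sum(routes)

# The same number is entry i of the matrix product.
grad_h2 = W3 @ dCE_dz3               # (256, 10) @ (10, 1) -> (256, 1)
by_matmul = grad_h2[i, 0]

print(grad_h2.shape, by_sum, by_matmul)
```

The matrix product performs this ten-term sum for all 256 hidden units at once.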

**12. Calculate $\grad_{h^{(2)}} CE(h^{(3)}, y)$ by using the formulae we've found for $\grad_{h^{(2)}} z^{(3)}$ and $\grad_{z^{(3)}} CE(h^{(3)}, y)$.**

\\[
\begin{align}
    \grad_{h^{(2)}} CE(h^{(3)}, y)
    &=
    \left(
        \grad_{h^{(2)}} z^{(3)}
    \right)
    \left(
        \grad_{z^{(3)}} CE(h^{(3)}, y)
    \right)
\\
    &=
    W^{(3)}
    (h^{(3)} - y)^\top
\end{align}
\\]
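The result can be checked numerically. The sketch below assumes a softmax output with cross-entropy loss, so that the gradient with respect to $z^{(3)}$ is $h^{(3)} - y$, and it transposes that row vector into the $(10, 1)$ column from question 8. The parameters, the one-hot label, and the helper `loss` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W3 = 0.1 * rng.normal(size=(256, 10))
b3 = np.zeros((1, 10))
h2 = rng.uniform(size=(1, 256))
y = np.zeros((1, 10))
y[0, 3] = 1.0                        # hypothetical one-hot label

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss(h):
    # Cross-entropy of the softmax output, as a function of h2 only.
    h3 = softmax(h @ W3 + b3)
    return float(-np.sum(y * np.log(h3)))

h3 = softmax(h2 @ W3 + b3)
grad_h2 = W3 @ (h3 - y).T            # (256, 10) @ (10, 1) -> (256, 1)

# Central finite differences, one hidden unit at a time.
eps = 1e-6
numeric = np.zeros((256, 1))
for i in range(256):
    dh = np.zeros((1, 256))
    dh[0, i] = eps
    numeric[i, 0] = (loss(h2 + dh) - loss(h2 - dh)) / (2 * eps)

print(grad_h2.shape, np.allclose(grad_h2, numeric, atol=1e-6))
```

A check like this is a cheap way to catch a transposed matrix or a swapped index before moving further back through the network.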

## Calculating $\grad_{z^{(2)}} h^{(2)}$ and $\grad_{z^{(2)}} CE(h^{(3)}, y)$

The chain rule tells us:

\\[
\grad_{z^{(2)}} CE(h^{(3)}, y)
=
\left(\grad_{z^{(2)}} h^{(2)}\right)
\left(\grad_{h^{(2)}} CE(h^{(3)}, y)\right)
\\]

So what we really need in order to backprop the loss to $z^{(2)}$ is $\grad_{z^{(2)}} h^{(2)}$.

**1. What are the shapes of $\grad_{z^{(2)}} h^{(2)}$ and $\grad_{h^{(2)}} CE(h^{(3)}, y)$?**


They are $(256, 256)$ and $(256, 1)$.

**2. $\grad_{z^{(2)}} h^{(2)}$ is a matrix which records how a change to any $z^{(2)}_i$ can change any $h^{(2)}_j$. Why does this seem excessive? How many and which $h^{(2)}_j$ values can a change to $z^{(2)}_i$ change? Why? Consider the formula for $h^{(2)}_j$ in terms of $z^{(2)}$...**

We know:

\\[
h^{(2)}_j = \sigma\left(z^{(2)}_j\right)
\\]

Therefore, the only $z^{(2)}$ value that can affect $h^{(2)}_j$ is $z^{(2)}_j$. The other $z^{(2)}$ values have no impact at all on $h^{(2)}_j$.

Our gradient matrix $\grad_{z^{(2)}} h^{(2)}$ will hold almost all zeros!

**3. What entries of the gradient matrix $\grad_{z^{(2)}} h^{(2)}$ will be zero, and which can be non-zero? What do we call this kind of matrix? What are the values of these entries in terms of partials?**

Only at positions $(i, j)$ where $i = j$ can the matrix be non-zero. All other entries must be zero.

We have:

\\[
\left(
     \grad_{z^{(2)}} h^{(2)}
\right)_{i, i}
=
\fpartial{h^{(2)}_i}{z^{(2)}_i}
\\]

This is called a *diagonal* matrix.
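The diagonal structure can be seen numerically. This sketch estimates the full $(256, 256)$ matrix of partials by central finite differences for an elementwise sigmoid, using random placeholder values for $z^{(2)}$.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z2 = rng.normal(size=(1, 256))

eps = 1e-6
J = np.zeros((256, 256))
for i in range(256):
    dz = np.zeros((1, 256))
    dz[0, i] = eps
    # Row i holds the partials of every h2_j with respect to z2_i.
    J[i, :] = (sigmoid(z2 + dz) - sigmoid(z2 - dz)) / (2 * eps)

off_diagonal = J - np.diag(np.diag(J))
print(np.allclose(off_diagonal, 0.0), bool(np.all(np.diag(J) > 0)))
```

Only 256 of the 65,536 entries are non-zero, so storing the whole matrix is wasteful.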

**4. We must now know how to calculate the partial $\fpartial{\sigma(z)}{z}$. The first step is to learn a formula for $1 - \sigma(z)$. Calculate this by expanding the definition of $\sigma$ and simplifying. Hint: remember both formulas for $\sigma(z)$...**

\\[
1 - \sigma(z)
=
1 - \frac{e^z}{1 + e^z}
=
\frac{1 + e^z}{1 + e^z} - \frac{e^z}{1+e^z}
=
\frac{1}{1 + e^z}
\\]

That's almost the same as the "other" formula for $\sigma(z)$, namely $\frac{1}{1 + e^{-z}}$, except that the exponent is $z$ rather than $-z$. So this is actually the formula for $\sigma(-z)$.

Thus we have 

\\[
1 - \sigma(z)
=
\sigma(-z)
\\]
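For reference, a short sketch of why the two formulas for $\sigma(z)$ mentioned in the hint agree: multiply the numerator and denominator by $e^{-z}$.

\\[
\sigma(z)
=
\frac{e^z}{1 + e^z}
=
\frac{e^z \cdot e^{-z}}{\left(1 + e^z\right) e^{-z}}
=
\frac{1}{1 + e^{-z}}
\\]

Replacing $z$ with $-z$ in the last form gives $\sigma(-z) = \frac{1}{1 + e^{z}}$, which is the expression found for $1 - \sigma(z)$.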

**5. Next, let's take the derivative of $\sigma(z)$ with respect to $z$. Remember that we can write $\sigma(z) = \left(1 + e^{-z}\right)^{-1}$. This lets us use the power rule and chain rule together.**

\\[
\begin{align}
\fpartial{\sigma(z)}{z}
&=
\fpartial{}{z}
\left(1 + e^{-z}\right)^{-1}
\\
&=
-1
\left(1 + e^{-z}\right)^{-2}
\fpartial{}{z}
e^{-z}
\\
&=
-1
\left(1 + e^{-z}\right)^{-2}
\left(
    -e^{-z}
\right)
\\
&=
\frac{e^{-z}}{\left(1 + e^{-z}\right)^2}
\\
&=
\frac{1}{\left(1 + e^{-z}\right)}
\frac{e^{-z}}{\left(1 + e^{-z}\right)}
\\
&=
\sigma(z)
\frac{1}{\left(1 + e^{z}\right)}
\\
&=
\sigma(z)
\sigma(-z)
\\
&=
\sigma(z)
\left(1 - \sigma(z)\right)
\end{align}
\\]
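A quick numerical sketch comparing the closed form $\sigma(z)\left(1 - \sigma(z)\right)$ with a central finite-difference estimate of the derivative. The grid of test points is arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 21)       # arbitrary test points
eps = 1e-6

numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1.0 - sigmoid(z))

print(np.allclose(numeric, analytic, atol=1e-7))
```

The closed form is convenient in backprop because it reuses $h = \sigma(z)$ from the forward pass, so no extra exponentials are needed.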

**6. Using this formula for $\fpartial{\sigma(z_i)}{z_i}$.**

**TODO**: broadcast for matrix multiply.
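One possible reading of the TODO above, offered as a sketch only: because $\grad_{z^{(2)}} h^{(2)}$ is diagonal, multiplying by it is the same as an elementwise product with the diagonal entries, which NumPy handles through broadcasting. The values of `z2` and `grad_h2` below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z2 = rng.normal(size=(1, 256))
grad_h2 = rng.normal(size=(256, 1))     # placeholder for the gradient w.r.t. h2

h2 = sigmoid(z2)
diag_entries = (h2 * (1.0 - h2)).ravel()   # the 256 diagonal partials

# Explicit diagonal matrix: (256, 256) @ (256, 1) -> (256, 1).
via_matmul = np.diag(diag_entries) @ grad_h2

# Elementwise product: (256, 1) * (256, 1) -> (256, 1), no big matrix needed.
via_elementwise = diag_entries[:, None] * grad_h2

print(via_elementwise.shape, np.allclose(via_matmul, via_elementwise))
```

The elementwise version touches 256 numbers instead of 65,536, which is why the diagonal matrix is rarely materialized in practice.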