**For these questions, assume that an $x$ input has 1024 dimensions, that the first hidden layer should have $512$ units, a second layer has $256$ units, and that there are $10$ classes to choose from at the end.**

\\[
\newcommand{\fpartial}[2]{\frac{\partial #1}{\partial #2}}
\\]

**What do the rows of $X$ represent? What do the columns of $X$ represent? What is the shape of $X$?**

Rows represent each image in the batch.

Columns represent each pixel in the image; a column of values is the same pixel's value in each of the images.

The shape of $X$ is $(bs, 1024)$, where $bs$ is the batch size.

**You have a first matrix of weights $W^{(1)}$ and a vector of biases $b^{(1)}$. What are the shapes of $W$ and $b$? Why have I written these superscripts?**

The shape of $W^{(1)}$ is: $(1024, 512)$. The shape of $b^{(1)}$ is $(1, 512)$.

This is the first set of weights and biases used to calculate the pre-activations for the first hidden layer. There will be more weights and biases.

**What is the formula to calculate the hidden pre-activation values $Z^{(1)}$? What is the dimensionality of $Z^{(1)}$?**

Formula is:

\\[
Z^{(1)} = X W^{(1)} + b^{(1)}
\\]

The dimensionality of $Z^{(1)}$ is $(bs, 512)$.
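A quick shape check can make this concrete. The sketch below assumes NumPy, a hypothetical batch size `bs = 32`, and random values standing in for real inputs and weights; only the shapes matter here.

```python
import numpy as np

rng = np.random.default_rng(0)

bs = 32  # hypothetical batch size, chosen only for illustration
X = rng.normal(size=(bs, 1024))    # one row per example, one column per input dimension
W1 = rng.normal(size=(1024, 512))  # stands in for W^(1)
b1 = np.zeros((1, 512))            # stands in for b^(1)

# b1 has shape (1, 512) and is broadcast across all bs rows of X @ W1.
Z1 = X @ W1 + b1

print(Z1.shape)  # (32, 512)
```

Storing the bias as a $(1, 512)$ row lets broadcasting add the same bias to every example in the batch.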

**What is the nice formula for the $\sigma$ function? Why is this nice? (Hint: what is the formula to convert odds of two outcomes to probability?)**

$\sigma(z) = \frac{e^z}{1 + e^z}$.

If the odds are $odds:1$, then the formula for $odds$ to $p$ is $p = \frac{odds}{1 + odds}$.

**Let's say $f(odds) = \frac{odds}{1 + odds}$. $f$ converts an odds to a probability. Can you write sigmoid in terms of $f$?**

$\sigma(z) = f(e^z)$

**If we usually interpret the input of $f$ as an odds, then if we try to interpret $e^z$ as an odds, what does that suggest we interpret $z$ as?**

We interpret $z$ as the log of an odds.
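The relationship between odds, log-odds, and sigmoid can be checked with a few lines of standard-library Python. This is a minimal sketch; the function names `f` and `sigmoid` mirror the notation above, and the odds of 3:1 are just an example value.

```python
import math

def f(odds):
    """Convert odds (odds : 1) to a probability."""
    return odds / (1.0 + odds)

def sigmoid(z):
    """Sigmoid written as f applied to e^z, so z plays the role of a log-odds."""
    return f(math.exp(z))

# Odds of 3:1 correspond to a probability of 0.75.
p = f(3.0)

# Feeding the log of those odds into sigmoid recovers the same probability
# (up to floating point rounding).
p_from_log_odds = sigmoid(math.log(3.0))

# A log-odds of zero means odds of 1:1, which is a probability of one half.
p_even = sigmoid(0.0)
```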

**What z value has $\sigma(z) = 0.50$ (50% probability) equivalent to an odds of $1.0$?**

$z = 0.0$.

**When is the probability $<0.5$, when is the probability $>0.5$?**

When $z$ is negative the probability will be less than half, when $z$ is positive probability will be greater than half.

**$\sigma(z)$ isn't necessarily a probability. It can be the "percent activated."**

**What is the problem numerically with this formula?**

When $z$ is really large then the floating point representation of $e^z$ can overflow and be $\infty$.

That's a problem because both the numerator and denominator will be $\infty$ which means their ratio is not a number. We want it to be: $1.0$.

**Is there a problem for very negative $z$s with this formula?**

No. $e^z$ will round to $0.0$, so the formula gives zero divided by one, which is zero. That is the correct answer.

**How do we fix this problem? What's the better formula?**

$\sigma(z) = \frac{1}{e^{-z} + 1}$. For very positive $z$, $e^{-z}$ rounds to $0.0$ and the result is $1.0$. For very negative $z$, $e^{-z}$ overflows to $\infty$, and one divided by $\infty$ is $0.0$, which is still correct.
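The two formulas can be compared directly. The sketch below assumes NumPy with float64 values, where overflow yields `inf` instead of raising an error; the names `sigmoid_naive` and `sigmoid_better` are made up for this example.

```python
import numpy as np

def sigmoid_naive(z):
    """e^z / (1 + e^z): breaks for large positive z."""
    ez = np.exp(z)
    return ez / (1.0 + ez)

def sigmoid_better(z):
    """1 / (e^{-z} + 1): the rewritten form."""
    return 1.0 / (np.exp(-z) + 1.0)

z = np.array([-1000.0, 0.0, 1000.0])

# Silence the overflow/invalid warnings NumPy would otherwise emit.
with np.errstate(over="ignore", invalid="ignore"):
    naive = sigmoid_naive(z)    # last entry is inf / inf, which is nan
    better = sigmoid_better(z)  # all three entries are finite and correct

print(naive)
print(better)
```

Outside the `np.errstate` block NumPy would still warn about the overflow in `np.exp`, even though the final value is right.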


**How do I calculate the activation values $H^{(1)}$ from $Z^{(1)}$?**

\\[
H^{(1)} = \sigma(Z^{(1)})
\\]

**Why do we want to use this $\sigma$ function?**

Additional hidden layers add no value if using only pure linear functions. The result would still be linear and representable by a single layer. Extra layers are a waste.
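To see the collapse concretely, here is a small NumPy sketch. The sizes (4 examples, 6 inputs, 5 and 3 units) are hypothetical and chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(4, 6))   # small hypothetical sizes, for illustration only
W1 = rng.normal(size=(6, 5))
b1 = rng.normal(size=(1, 5))
W2 = rng.normal(size=(5, 3))
b2 = rng.normal(size=(1, 3))

# Two linear layers with no activation function in between.
two_layers = (X @ W1 + b1) @ W2 + b2

# The same map expressed as a single linear layer.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = X @ W + b

print(np.allclose(two_layers, one_layer))  # True
```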

Composing a series of nonlinear functions one after the other *can* result in functions that could not be represented with a single layer. That is, more layers can result in more sophisticated or complex functions. Or more expressive power.

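Any continuous function on a bounded domain can be approximated arbitrarily well by a neural network with enough hidden units and a nonlinear activation such as sigmoid or relu. This means neural networks are *universal function approximators*.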

**What is another name for the function $\sigma$ when used to calculate hidden activations?**

Activation function.

**What do I call the linear transformation of the $X$ values before being input into the activation function? What symbols do I use?**

We wrote it as $Z^{(1)}$ (or $z^{(1)}$ for a single input), and its values are called pre-activations.

**How do we notate the $i$th row of $W^{(1)}$? The $j$th column?**

$W^{(1)}_{i, :}$ and $W^{(1)}_{:, j}$.

**What is the formula for calculating a specific pre-activation $z^{(1)}_j$ for a single input $x$?**

\\[
\begin{align}
z^{(1)}_j &= x \cdot W^{(1)}_{:, j} + b^{(1)}_j
\\
z^{(1)}_j &= \left(
    \sum_{i = 0}^{1023} x_i W^{(1)}_{i, j}
\right) + b^{(1)}_j
\end{align}
\\]
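Both forms of the formula can be checked against each other. This sketch assumes NumPy and random stand-in values; the unit index `j = 7` is arbitrary, and the bias is stored as a flat vector here for easier indexing.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=1024)           # a single input
W1 = rng.normal(size=(1024, 512))   # stands in for W^(1)
b1 = rng.normal(size=512)           # stands in for b^(1), stored as a flat vector

j = 7  # arbitrary hidden unit, chosen only for illustration

# Dot product of x with the j-th column of W1, plus the j-th bias.
z_j_dot = x @ W1[:, j] + b1[j]

# The same value written as an explicit sum over the 1024 input dimensions.
z_j_sum = sum(x[i] * W1[i, j] for i in range(1024)) + b1[j]

print(np.isclose(z_j_dot, z_j_sum))  # True
```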

**What does a column of $W^{(1)}_{:, j}$ represent? What does a row of $W^{(1)}_{i, :}$ represent?**

The column consists of weights for each input dimension $x_i$ used to calculate the preactivation $z^{(1)}_j$. They are the weights for the $j$th hidden unit.

The row consists of the weights for a single input dimension $x_i$ used to compute each of the hidden pre-activations $z^{(1)}_j$.

**What are the dimensions of $W^{(2)}$ and $b^{(2)}$?**

$(512, 256)$ and $(1, 256)$

**What is the dimension of the first-layer activations $H^{(1)}$?**

$(bs, 512)$

**What are the formulas for $Z^{(2)}$ and $H^{(2)}$?**

\\[
\begin{align}
Z^{(2)} &= H^{(1)} W^{(2)} + b^{(2)}
\\
H^{(2)} &= \sigma(Z^{(2)})
\end{align}
\\]

**What does the row $W^{(2)}_{i, :}$ represent? What does the column $W^{(2)}_{:, j}$ represent?**

The $j$th column of $W^{(2)}$ represents the weights for each output of the first hidden layer used to compute the $j$th pre-activation of the second hidden layer.

The $i$th row of $W^{(2)}$ is all of the weights for the $i$th activation of the hidden layer $H^{(1)}$ used to compute all the hidden pre-activations $z^{(2)}_j$ for all $j$.

**What are the dimensions of $W^{(3)}, b^{(3)}$? What is the shape of $Z^{(3)}$?**

$(256, 10)$ and $(1, 10)$. $(bs, 10)$.


**What is the formula for calculating $Z^{(3)}$? What is the formula for calculating $H^{(3)}$ (careful)?**

\\[
\begin{align}
Z^{(3)} &= H^{(2)} W^{(3)} + b^{(3)}
\\
H^{(3)} &= softmax(Z^{(3)})
\end{align}
\\]
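Putting the three layers together, a forward pass might look like the sketch below. It assumes NumPy, a hypothetical batch size of 8, and small random weights (the `0.01` scale is arbitrary). The max-subtraction inside `softmax` is a common stability trick added for this example, not something taken from these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (np.exp(-z) + 1.0)

def softmax(z):
    # Subtracting the row maximum is a standard stability trick (an addition
    # made for this sketch); it does not change the result.
    shifted = z - z.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
bs = 8  # hypothetical batch size

X = rng.normal(size=(bs, 1024))
W1, b1 = rng.normal(size=(1024, 512)) * 0.01, np.zeros((1, 512))
W2, b2 = rng.normal(size=(512, 256)) * 0.01, np.zeros((1, 256))
W3, b3 = rng.normal(size=(256, 10)) * 0.01, np.zeros((1, 10))

H1 = sigmoid(X @ W1 + b1)   # shape (bs, 512)
H2 = sigmoid(H1 @ W2 + b2)  # shape (bs, 256)
H3 = softmax(H2 @ W3 + b3)  # shape (bs, 10)

print(H1.shape, H2.shape, H3.shape)
print(np.allclose(H3.sum(axis=1), 1.0))  # True: each row sums to one
```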

**What properties do we want out of the last hidden layer (aka, the output layer)?**

The output values must be between zero and one because otherwise they're not valid as probabilities at all.

The probabilities should sum to one so that they form a probability *distribution.*

For an input $x$, we want $h^{(3)}$ to be all zeros except at position $y$ which is the correct class. That value should be ideally 1.0.

**What is the shape of $y$ for a single example? What is the range of $y$ values?**

Shape of $y$ is `()` or just a scalar. The range is zero to nine.

**What is the shape of $y$ for a batch of examples (I'm using the same symbol)? What is the range of $y$ values?**

Shape of $y$ is $(bs,)$ a vector of length $bs$. And the range is zero to nine.

**What is the matrix $Y$? What is the range of $Y$ values? What is the shape of $Y$?**

Shape is $(bs, 10)$ and each row is a one-hot encoding of $y_i$. Every value is either zero or one.
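One common way to build $Y$ from $y$ is to index an identity matrix. This is a sketch assuming NumPy and a hypothetical batch of four labels.

```python
import numpy as np

y = np.array([3, 0, 9, 3])  # hypothetical labels for a batch of 4

# Row k of the 10x10 identity matrix is the one-hot encoding of class k,
# so indexing the identity with y builds Y one row per example.
Y = np.eye(10)[y]

print(Y.shape)            # (4, 10)
print(Y[0])               # a single 1.0 at position 3, zeros elsewhere
print(Y.argmax(axis=1))   # recovers the original labels
```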


Forget about your $z$s and $h$s for a second.

**If the probability we assign to the correct answer is $p$, then what is the ideal value of $p$?**

1.0 or 100%.

**What is the ideal value of $\log p$?**

0.0

**What is the worst value of $p$? And $\log p$?**

0.0 and $-\infty$.

**If larger values of $p$ are better, then what values of $\log p$ are better?**

Larger ones, because $\log$ is monotonically increasing.

**What are the properties of a loss function?**

A loss function should be non-negative, and it should be zero when the prediction is perfect.

The worse the prediction the greater the loss function.

**Can we use $\log p$ by itself as a loss function? Why? What do we have to do to use $\log p$ as a loss function?**

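No. $\log p$ is negative whenever $p < 1$, and greater values of $\log p$ are better, whereas a loss should be non-negative and grow as predictions get worse.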

We use $-\log p$ as the loss function.
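A few sample values show that $-\log p$ behaves the way a loss should. The probabilities below are arbitrary examples, and `loss` is a name made up for this sketch.

```python
import math

def loss(p):
    """Negative log of the probability assigned to the correct answer."""
    return -math.log(p)

ps = [1.0, 0.9, 0.5, 0.1, 0.001]  # from perfect to very bad
losses = [loss(p) for p in ps]

for p, value in zip(ps, losses):
    print(p, value)
```

The loss is zero at $p = 1$ and grows without bound as $p$ approaches zero.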

**Is there a deeper reason for using $-\log p$ rather than some other random function like $2^{-\log p} - 1$?**

Is it simpler? Maybe. There's a deep reason related to maximum likelihood that we won't talk about tonight.

**What do we call this loss function?**

Cross entropy.

**If I give you $h^{(3)}$ and a one-hot encoding $y$ for a single example, what is the formula for the cross-entropy loss?**

\\[
\begin{align}
CE(h^{(3)}, y) &= -\log\left(
    h^{(3)} \cdot y
\right)
\\
&= -\log\left(
    \sum_{i = 0}^{9} h^{(3)}_i y_i
\right)
\end{align}
\\]

**Can you write the mean cross entropy for a batch using rows of the matrix $H^{(3)}$ and rows of the matrix $Y$ and doing a summation?**

\\[
\begin{align}
CE(H^{(3)}, Y) = \frac{1}{bs} \sum_{i = 0}^{bs - 1} CE\left(
    H^{(3)}_{i, :},
    Y_{i, :}
\right)
\end{align}
\\]

TODO: ask how to do this with just vectors first.

TODO: how do we get the dot product of the rows of $H^{(3)}$ and the rows of $Y$ using coordinate wise multiplication and summing of rows of a matrix to produce a vector.

Make more clear: no "real" matrix operations or transposes or whatever.

\\[
\text{np.sum}(H^{(3)} * Y, axis = 1)
\\]

This is a vector of length $bs$. And the values are dot products of corresponding rows, which is in fact the probability assigned to the correct answer.

Next step is: take negative log of the vector. This produces a new vector of length $bs$.

Last step is:

\\[
\text{np.sum}(-\log \text{np.sum}(H^{(3)} * Y, axis = 1), axis = 0) / bs
\\]
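The steps above can be sketched in NumPy as follows. The probabilities are hypothetical, and the example uses 4 classes instead of 10 only to keep the matrices readable.

```python
import numpy as np

# Hypothetical softmax outputs for a batch of 3 examples and 4 classes.
H3 = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.80, 0.10],
])
y = np.array([0, 3, 2])   # correct classes
Y = np.eye(4)[y]          # one-hot rows
bs = H3.shape[0]

# Elementwise product, then sum across each row: the dot product of
# corresponding rows, i.e. the probability assigned to the correct class.
p_correct = np.sum(H3 * Y, axis=1)

# Negative log of each entry, then the mean over the batch.
mean_ce = np.sum(-np.log(p_correct), axis=0) / bs

# The same quantity computed one row at a time, for comparison.
mean_ce_loop = sum(-np.log(H3[i] @ Y[i]) for i in range(bs)) / bs

print(p_correct)
print(mean_ce, mean_ce_loop)
```

Note that `*` here is elementwise multiplication, not matrix multiplication, so no transposes are needed.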


### Backpropagation

**What is $\fpartial{}{x} \log x$?**

$\frac{1}{x}$

**What is the derivative $\fpartial{}{x} g(f(x))$ in terms of $\fpartial{}{x} f(x)$ and $\fpartial{}{x} g(x)$? Recall that $x$ and $y$ are just variable names and are otherwise meaningless. Just like you can use the same variable name in two functions and they have nothing to do with each other, the $x$ in the $\fpartial{}{x} f(x)$ isn't necessarily the same as the $x$ in $\fpartial{}{x} g(x)$.**

Chain rule.

\\[
\fpartial{}{x} g\left(f\left(x\right)\right)
=
\left(\fpartial{}{y} g\left(y\right)\right)
    \left(f\left(x\right)\right)
\left(\fpartial{}{x} f\left(x\right)\right)
\\]
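The chain rule can be sanity-checked numerically. In this sketch $g = \log$ and $f(x) = x^2 + 1$ are arbitrary example functions, and the analytic derivative is compared against a central finite difference.

```python
import math

def f(x):
    return x * x + 1.0

def df(x):
    return 2.0 * x

g = math.log  # g(y) = log y

def dg(y):
    return 1.0 / y  # derivative of log

def chain_rule(x):
    # Derivative of g evaluated at f(x), times the derivative of f at x.
    return dg(f(x)) * df(x)

def finite_difference(x, eps=1e-6):
    # Central difference approximation of d/dx g(f(x)).
    return (g(f(x + eps)) - g(f(x - eps))) / (2.0 * eps)

x = 1.5
analytic = chain_rule(x)
numeric = finite_difference(x)

print(analytic, numeric)  # the two values should agree to several digits
```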

**Calculate $\fpartial{}{z^{(3)}_i} CE(h^{(3)}, y)$. Recall that $h^{(3)} = \text{SOFTMAX}\left(z^{(3)}\right)$. So first write CE in terms of the $z^{(3)}$ and then differentiate.**

\\[
\begin{align}
    CE\left(h^{(3)}, y\right)
&=
    -\log\left( h^{(3)} \cdot y \right)
\\
&=
    -\log \sum_{i = 0}^{9} h^{(3)}_i y_i
\\
&=
    -\log \sum_{i = 0}^{9}
        y_i
        \frac{
            \exp\left(z^{(3)}_i\right)
        }{
            \sum_{j=0}^9
            \exp\left(z^{(3)}_j\right)
        }
\end{align}
\\]

Next let $c$ be the correct class. Notice that every term for a wrong class is multiplied by $y_i = 0$ and drops out of the sum, so we have:

\\[
\begin{align}
    CE\left(h^{(3)}, y\right)
&=
    -\log \sum_{i = 0}^{9}
        y_i
        \frac{
            \exp\left(z^{(3)}_i\right)
        }{
            \sum_{j=0}^9
            \exp\left(z^{(3)}_j\right)
        }
\\
&=
    -\log\left(
        y_c
        \frac{
            \exp\left(z^{(3)}_c\right)
        }{
            \sum_{j=0}^9
            \exp\left(z^{(3)}_j\right)
        }
    \right)
\\
&=
    -\log\left(
        \frac{
            \exp\left(z^{(3)}_c\right)
        }{
            \sum_{j=0}^9
            \exp\left(z^{(3)}_j\right)
        }
    \right)
\end{align}
\\]

Here we replace $y_c = 1$ since this is the definition of the one-hot encoding.
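A small numeric check confirms that the full sum and the collapsed expression agree. The values of `z` and the correct class `c` below are hypothetical.

```python
import math

z = [2.0, -1.0, 0.5, 0.0, 1.0, -2.0, 0.3, 0.7, -0.5, 1.5]  # hypothetical z^(3)
c = 4                                             # hypothetical correct class
y = [1.0 if i == c else 0.0 for i in range(10)]   # one-hot encoding of c

total = sum(math.exp(z_j) for z_j in z)           # softmax denominator
h = [math.exp(z_i) / total for z_i in z]          # h^(3) = softmax(z^(3))

# Full form: sum over all ten classes, weighted by the one-hot vector.
ce_full = -math.log(sum(h_i * y_i for h_i, y_i in zip(h, y)))

# Collapsed form: only the term for the correct class survives.
ce_collapsed = -math.log(math.exp(z[c]) / total)

print(ce_full, ce_collapsed)
```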

Let's use $S := \sum_{j=0}^{9} \exp\left(z^{(3)}_j\right)$ because I'm lazy.