# Neural Network Mathematical Derivation

In this discussion, we walk through the key principles behind neural networks. We illustrate the algorithm conceptually, solidify that understanding with a mathematical derivation, and finish with a coded example. 

### Neural network feed-forward

The diagram below illustrates the basic architecture of a neural network. We have a set of inputs denoted by _x_. Each input is multiplied by its corresponding weight _w_, and the weighted inputs are summed to form the activation of each unit in the hidden layer _Z_. The hidden layer then transforms these activations using a nonlinear 'transfer function.' Subsequently, the hidden units are passed on to the output layer via the weights _v_. Lastly, the output, consisting of K classes for a given observation, is normalized so that the output for a given class represents the probability of that class for the given sample.

<img src="extras/NN.png" width="500" height="500" />

In this example of forward propagation, we move from the inputs to each hidden unit by taking the sigmoid of the activation, that is, the weighted sum of the inputs plus a bias. 

\begin{align}
\ z_j = sigmoid\left(\sum_i w_{ij}x_i+b_j\right)
\end{align}

In the case of a binary output (two classes in the target variable), we could then also apply the sigmoid as we move from the hidden layer to the output.

\begin{align}
\ p(y\mid x) = sigmoid\left(\sum_j(v_jz_j)+c\right)
\end{align}

If we put the two parts together, we can represent the formula using matrix notation.

\begin{align}
\ y = \sigma(\sigma(XW+b)V+c)
\end{align}

where:
- σ(XW+b) = Z
- D = # of input features
- M = # of hidden units
- K = # of classes
- N = # of samples
- shape of X = N x D
- shape of W = D x M
- XW results in an N x M matrix
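
The matrix form above can be sketched in NumPy to make the shapes concrete. This is a minimal illustration for the binary case rather than a reference implementation: the sizes, the random data, and the names `sigmoid` and `forward_binary` are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(a):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-a))

def forward_binary(X, W, b, V, c):
    """Compute Z = sigmoid(XW + b) and y = sigmoid(ZV + c)."""
    Z = sigmoid(X @ W + b)   # N x M hidden-layer values
    y = sigmoid(Z @ V + c)   # N x 1 probability of the positive class
    return Z, y

# Illustrative sizes (assumed): N samples, D input features, M hidden units
N, D, M = 5, 3, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, M))
b = np.zeros(M)
V = rng.normal(size=(M, 1))
c = 0.0

Z, y = forward_binary(X, W, b, V, c)
print(Z.shape, y.shape)  # (5, 4) (5, 1)
```

The bias `b` has length M and is broadcast across the N rows of `XW`, which is why it can be added to an N x M matrix directly.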

### Softmax for K classes

In the above example, we dealt with a binary outcome. However, when dealing with more than two classes, we must employ the softmax function:

\begin{align}
\large p(y=k\mid x) = \frac{e^{a_k}}{\sum_je^{a_j}}
\end{align}

where, in our case,

\begin{align}
\ a_k = \sum_mV_{mk}Z_m
\end{align}
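
As a rough sketch, the softmax can be written in a few lines of NumPy. Subtracting the row maximum before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged because the same factor cancels in the numerator and the denominator. The function name and the example values are assumptions.

```python
import numpy as np

def softmax(A):
    """Row-wise softmax of an N x K matrix of activations."""
    # Shifting each row by its maximum avoids overflow in np.exp
    A_shifted = A - A.max(axis=1, keepdims=True)
    expA = np.exp(A_shifted)
    return expA / expA.sum(axis=1, keepdims=True)

A = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1000.0, 1000.0]])  # second row would overflow without the shift
P = softmax(A)
print(P)
print(P.sum(axis=1))  # each row sums to 1
```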

### Training a neural network through backpropagation

Having discussed the feed-forward mechanism of a neural network, we now get to the interesting part. To update our weights using gradient descent, we have to move backwards through the network, one layer at a time. In our case of a single hidden layer, we first update the hidden-layer-to-output weights, and then update the input-to-hidden-layer weights.

Our objective function can be expressed as:

\begin{align}
\large P(T \mid X,W,V) = \prod_{n=1}^N\prod_{k=1}^K (y_k^n)^{t_k^n}
\end{align}

where T represents our targets and X, W, and V are the components of our single-hidden-layer network. As we have discussed, our output is determined by taking the softmax of the activation:

\begin{align}
\ y_k = \text{softmax}(a_k)
\end{align}

where the activation function is:

\begin{align}
\ a_k = \sum_mV_{mk}Z_m
\end{align}

In this example we take the log of this likelihood. The resulting log-likelihood is the function we are trying to maximize.

\begin{align}
\ J = \sum_n\sum_k t_k^n \log y_k^n
\end{align}
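
For a concrete feel of the objective, here is a small sketch that evaluates _J_ for two samples and three classes. The targets and predicted probabilities are made-up numbers, and `log_likelihood` is a hypothetical helper name.

```python
import numpy as np

def log_likelihood(T, Y):
    """J = sum_n sum_k t_k^n * log(y_k^n) for one-hot targets T and predictions Y."""
    return np.sum(T * np.log(Y))

# Made-up example: each row is one sample, each column one class
T = np.array([[1, 0, 0],
              [0, 1, 0]])
Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])

J = log_likelihood(T, Y)
print(J)
```

Because the targets are one-hot, only the probability assigned to the true class of each sample contributes, so here _J_ equals log(0.7) + log(0.8).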

In the first phase, we want to maximize the objective function with respect to the weights _V_, which results in the following use of the chain rule. The sum over _k'_ appears because every output depends on the activation of class _k_ through the softmax.

\begin{align}
\ \frac{\partial J}{\partial V_{mk}} = \sum_n\sum_{k'}\frac{\partial J}{\partial y_{k'}^n} \frac{\partial y_{k'}^n}{\partial a_k} \frac{\partial a_k}{\partial V_{mk}}
\end{align}

Let us first take the derivative of _J_ with respect to _y_. For a single sample _n_ and class _k'_, this is:

\begin{align}
\frac{\partial J}{\partial y_{k'}^n} = t_{k'}^n\frac{1}{y_{k'}^n}
\end{align}

Now, given that _y_ is

\begin{align}
\large y_k = \frac{e^{a_k}}{\sum_je^{a_j}}
\end{align}

This means that the predicted probability of a given class depends on the activations of all the classes, due to the summation in the denominator. This complicates the derivative of _y_, because we must calculate it for two scenarios: one where _k=k'_ and one where _k≠k'_. The full derivation is shown in the appendix of this document. What we get is: 

\begin{align}
\large \frac{\partial y_{k'}}{\partial a_k} = \left\{ {y_{k'}(1-y_{k'}) \text{  if  } k=k'}\atop {-y_{k'}y_k \text{  if  } k\neq k'}\right.
\end{align}

Using the Kronecker delta:

\begin{align}
\large \delta_{ij} = \left\{1 \text{  if  } i=j\atop {0 \text{  if  } i\neq j}\right.
\end{align}

We get:

\begin{align}
\frac{\partial y_{k'}}{\partial a_k} = y_{k'}(\delta_{kk'}-y_{k})
\end{align}
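
The Kronecker-delta form of the softmax derivative can be sanity-checked numerically. The sketch below builds the full matrix of derivatives for one sample and compares it with central finite differences; the function names and the example activations are assumptions for illustration.

```python
import numpy as np

def softmax_vec(a):
    """Softmax of a single activation vector of length K."""
    e = np.exp(a - a.max())
    return e / e.sum()

def softmax_jacobian(a):
    """jac[k', k] = d y_{k'} / d a_k = y_{k'} * (delta_{kk'} - y_k)."""
    y = softmax_vec(a)
    return np.diag(y) - np.outer(y, y)

def numerical_jacobian(a, eps=1e-6):
    """Central finite-difference estimate of the same matrix."""
    K = a.size
    jac = np.zeros((K, K))
    for k in range(K):
        step = np.zeros(K)
        step[k] = eps
        jac[:, k] = (softmax_vec(a + step) - softmax_vec(a - step)) / (2 * eps)
    return jac

a = np.array([0.5, -1.0, 2.0])
max_diff = np.max(np.abs(softmax_jacobian(a) - numerical_jacobian(a)))
print(max_diff)  # should be very close to zero
```

The diagonal entries correspond to the _k=k'_ case and the off-diagonal entries to the _k≠k'_ case.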

The derivative of our activation function with respect to our weight is:

\begin{align}
\large \frac{\partial a_k}{\partial V_{mk}} = z_m^n
\end{align}

Putting our three derivatives together yields the equation below for the gradient of the weights _V_. The full derivation is shown in the appendix.

\begin{align}
\frac{\partial J}{\partial V_{mk}} = \sum_n(t_k^n-y_k^n)z_m^n
\end{align}
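
In matrix form, the sum over _n_ in this gradient is a single matrix product between the transposed hidden-layer values and the output error. The sketch below is one possible NumPy rendering under assumed sizes and random data, with a finite-difference check on one weight; all names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(A):
    expA = np.exp(A - A.max(axis=1, keepdims=True))
    return expA / expA.sum(axis=1, keepdims=True)

def forward(X, W, b, V, c):
    Z = sigmoid(X @ W + b)      # N x M
    Y = softmax(Z @ V + c)      # N x K
    return Z, Y

def log_likelihood(T, Y):
    return np.sum(T * np.log(Y))

def grad_V(T, Y, Z):
    """dJ/dV[m, k] = sum_n (t_k^n - y_k^n) * z_m^n, written as Z^T (T - Y)."""
    return Z.T @ (T - Y)

# Assumed sizes and random data, for illustration only
rng = np.random.default_rng(1)
N, D, M, K = 6, 3, 4, 3
X = rng.normal(size=(N, D))
T = np.eye(K)[rng.integers(0, K, size=N)]   # one-hot targets
W, b = rng.normal(size=(D, M)), np.zeros(M)
V, c = rng.normal(size=(M, K)), np.zeros(K)

Z, Y = forward(X, W, b, V, c)
analytic = grad_V(T, Y, Z)

# Central finite difference on a single entry of V
eps = 1e-6
m, k = 2, 1
V_plus, V_minus = V.copy(), V.copy()
V_plus[m, k] += eps
V_minus[m, k] -= eps
numeric = (log_likelihood(T, forward(X, W, b, V_plus, c)[1])
           - log_likelihood(T, forward(X, W, b, V_minus, c)[1])) / (2 * eps)
print(analytic[m, k], numeric)  # the two values should agree closely
```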

Next, we proceed by calculating the derivative of the objective function with respect to the input weights _W_, which again requires the chain rule. To avoid a clash with the output activation _a_, we write _α_ (indexed by hidden unit _m_) for the hidden-layer activation, i.e. the weighted sum of the inputs before the sigmoid is applied.

\begin{align}
\frac{\partial J}{\partial W_{dm}} = \sum_n\frac{\partial J}{\partial z_m^n} \frac{\partial z_m^n}{\partial \alpha_m^n} \frac{\partial \alpha_m^n}{\partial W_{dm}}
\end{align}

The derivative of the objective function with respect to _z_ again relies on the softmax derivative. Because each hidden unit feeds every output class, we sum over _k_:

\begin{align}
\frac{\partial J}{\partial z_m^n} = \sum_k(t_k^n-y_k^n)V_{mk}
\end{align}

Because we apply the sigmoid to the hidden activation, we know that its derivative is (see previous post for details):

\begin{align}
\frac{\partial z_m^n}{\partial \alpha_m^n} = z_m^n(1-z_m^n)
\end{align}

The derivative of the hidden activation with respect to the weight is simply _x_.

\begin{align}
\frac{\partial \alpha_m^n}{\partial W_{dm}} = x_d^n
\end{align}

Putting it all together (note that _k_ is not on the left side; this is due to the 'total derivative' rule):

\begin{align}
\frac{\partial J}{\partial W_{dm}} = \sum_n\sum_k(t_k^n-y_k^n)V_{mk}z_m^n(1-z_m^n)x_d^n
\end{align}

One interesting observation regarding the gradient of the weights above is the summation over _k_, which illustrates that if we want to update our weights, we need to take into account the fact that a given weight affects all classes. Thus the error associated with the prediction of each class _backpropagates_ and is used to update the weights.
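
The double sum over _n_ and _k_ can also be written with matrix products: multiplying the output error by the transpose of _V_ performs the sum over _k_, and multiplying by the transpose of _X_ performs the sum over _n_. The following is a sketch under assumed sizes and random data, with a finite-difference check on one weight; the names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(A):
    expA = np.exp(A - A.max(axis=1, keepdims=True))
    return expA / expA.sum(axis=1, keepdims=True)

def forward(X, W, b, V, c):
    Z = sigmoid(X @ W + b)
    Y = softmax(Z @ V + c)
    return Z, Y

def log_likelihood(T, Y):
    return np.sum(T * np.log(Y))

def grad_W(X, T, Y, Z, V):
    """dJ/dW[d, m] = sum_n sum_k (t_k^n - y_k^n) V_mk z_m^n (1 - z_m^n) x_d^n."""
    # (T - Y) @ V.T sums the output errors over k for each hidden unit m
    hidden_error = ((T - Y) @ V.T) * Z * (1 - Z)   # N x M
    return X.T @ hidden_error                      # D x M, sums over n

# Assumed sizes and random data, for illustration only
rng = np.random.default_rng(2)
N, D, M, K = 6, 3, 4, 3
X = rng.normal(size=(N, D))
T = np.eye(K)[rng.integers(0, K, size=N)]
W, b = rng.normal(size=(D, M)), np.zeros(M)
V, c = rng.normal(size=(M, K)), np.zeros(K)

Z, Y = forward(X, W, b, V, c)
analytic = grad_W(X, T, Y, Z, V)

# Central finite difference on a single entry of W
eps = 1e-6
d, m = 1, 3
W_plus, W_minus = W.copy(), W.copy()
W_plus[d, m] += eps
W_minus[d, m] -= eps
numeric = (log_likelihood(T, forward(X, W_plus, b, V, c)[1])
           - log_likelihood(T, forward(X, W_minus, b, V, c)[1])) / (2 * eps)
print(analytic[d, m], numeric)  # the two values should agree closely
```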

### Recursiveness in deeper networks

Let us briefly look at backpropagation in deeper networks. Here we have three hidden layers, and now we need to figure out how to optimize four different sets of weights.

<img src="extras/deeper_network.png" width="300" height="300" />

Starting from the output and working our way backward as we did before, first we calculate the derivative with respect to _W3_. For the sake of simplicity, we will assume that this is a one-sample case, thus yielding:

\begin{align}
\frac{\partial J}{\partial W^3_{sk}} = (t_k^n-y_k^n)z_s^3
\end{align}

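Then, the gradient for the weights that feed into the last hidden layer has the same form as the one we derived for the input weights of the single-hidden-layer network, with a sum over the classes _k_:

\begin{align}
\frac{\partial J}{\partial W^2_{rs}} = \sum_k(t_k^n-y_k^n)W_{sk}^3z_s^3(1-z_s^3)z_r^2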
\end{align}

Next, the derivative of the objective function with respect to _W1_ is:

\begin{align}
\frac{\partial J}{\partial W^1_{qr}} = \sum_k\sum_s(t_k^n-y_k^n)W_{sk}^3z_s^3(1-z_s^3)W_{rs}^2z_r^2(1-z_r^2)z_q^1
\end{align}

As we can see, a clear pattern emerges as we backpropagate: the gradient for each layer reuses the error term already computed for the layer after it, multiplied by that layer's weights and the sigmoid derivative. 
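
This recursive pattern lends itself to a loop that carries an error term backward through the layers. The sketch below assumes sigmoid hidden layers, a softmax output, and no bias terms (a simplification made for brevity); the layer sizes, the data, and the function names are all assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(A):
    expA = np.exp(A - A.max(axis=1, keepdims=True))
    return expA / expA.sum(axis=1, keepdims=True)

def forward(X, weights):
    """Return the layer inputs [X, z1, ..., zL] and the softmax output Y."""
    activations = [X]
    for W in weights[:-1]:
        activations.append(sigmoid(activations[-1] @ W))
    Y = softmax(activations[-1] @ weights[-1])
    return activations, Y

def backward(T, Y, activations, weights):
    """Gradients of the log-likelihood for every weight matrix."""
    grads = [None] * len(weights)
    delta = T - Y                               # error at the output layer
    for layer in reversed(range(len(weights))):
        layer_input = activations[layer]
        grads[layer] = layer_input.T @ delta
        if layer > 0:
            # Reuse delta: push it back through the weights, then apply
            # the sigmoid derivative z * (1 - z) of the layer below
            delta = (delta @ weights[layer].T) * layer_input * (1 - layer_input)
    return grads

rng = np.random.default_rng(3)
sizes = [4, 5, 5, 5, 3]   # inputs, three hidden layers, classes (assumed)
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
X = rng.normal(size=(8, sizes[0]))
T = np.eye(sizes[-1])[rng.integers(0, sizes[-1], size=8)]

activations, Y = forward(X, weights)
grads = backward(T, Y, activations, weights)
print([g.shape for g in grads])
```

Each pass through the loop does the same two things, which is the repetition visible in the formulas: compute the gradient from the current error term, then transform the error term for the next layer down.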

### Coding example
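
What follows is a minimal sketch of the derivation in code, not a tuned implementation. It trains a single-hidden-layer network with sigmoid hidden units and a softmax output by gradient ascent on the log-likelihood. The toy data (three Gaussian clouds in two dimensions), the layer sizes, the learning rate, and the number of epochs are all assumptions chosen for illustration. The bias updates are not derived in the text; they are assumed here to follow the same pattern as the weight gradients, with the input fixed at 1.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(A):
    expA = np.exp(A - A.max(axis=1, keepdims=True))
    return expA / expA.sum(axis=1, keepdims=True)

rng = np.random.default_rng(42)

# Toy data (assumed): three well-separated 2-D Gaussian clouds
centers = np.array([[0.0, -2.0], [2.0, 2.0], [-2.0, 2.0]])
n_per_class, D, M, K = 100, 2, 5, 3
X = np.vstack([rng.normal(loc=mu, scale=0.5, size=(n_per_class, D))
               for mu in centers])
labels = np.repeat(np.arange(K), n_per_class)
T = np.eye(K)[labels]                       # one-hot targets, N x K

# Randomly initialised weights, zero biases
W, b = rng.normal(size=(D, M)), np.zeros(M)
V, c = rng.normal(size=(M, K)), np.zeros(K)

learning_rate = 1e-3
log_likelihoods = []
for epoch in range(1000):
    # Feed-forward
    Z = sigmoid(X @ W + b)
    Y = softmax(Z @ V + c)
    log_likelihoods.append(np.sum(T * np.log(Y)))

    # Backpropagation: output error first, then the hidden-layer error
    output_error = T - Y                               # N x K
    hidden_error = (output_error @ V.T) * Z * (1 - Z)  # N x M

    # Gradient ascent: we maximise J, so we step along the gradient
    V += learning_rate * (Z.T @ output_error)
    c += learning_rate * output_error.sum(axis=0)
    W += learning_rate * (X.T @ hidden_error)
    b += learning_rate * hidden_error.sum(axis=0)

accuracy = np.mean(Y.argmax(axis=1) == labels)
print(f"log-likelihood: {log_likelihoods[0]:.2f} -> {log_likelihoods[-1]:.2f}")
print(f"training accuracy: {accuracy:.2f}")
```

Note that the hidden-layer error is computed before _V_ is updated, so both gradients are evaluated at the same set of weights.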

### Appendix

#### Softmax derivative

Given

\begin{align}
\large y_k = \frac{e^{a_k}}{\sum_je^{a_j}}
\end{align}

In the case where _k=k'_, we first must apply the product rule:

\begin{align}
\large \frac{dy_{k'}}{da_k} = \frac{de^{a_k}}{da_k}\frac{1}{\sum_{j=1}^Ke^{a_j}}+\frac{d\left[\sum_{j=1}^Ke^{a_j}\right]^{-1}}{da_k}e^{a_k}
\end{align}

\begin{align}
\large \frac{dy_{k'}}{da_k} = \frac{e^{a_k}}{\sum_{j=1}^Ke^{a_j}}-\left[\sum_{j=1}^Ke^{a_j}\right]^{-2}e^{a_k}e^{a_k}
\end{align}

\begin{align}
\large\frac{dy_k}{da_k} = y_k-y_k^2 = y_k(1-y_k)
\end{align}

Where _k≠k'_:

\begin{align}
\large\frac{dy_k}{da_{k'}} = e^{a_k}\frac{d\left[\sum_{j=1}^Ke^{a_j}\right]^{-1}}{da_{k'}} =  e^{a_k}(-1) \left[\sum_{j=1}^Ke^{a_j}\right]^{-2}e^{a_{k'}}
\end{align}

\begin{align}
\large\frac{dy_k}{da_{k'}} = -\frac{e^{a_k}}{\left[\sum_{j=1}^Ke^{a_j}\right]}\frac{e^{a_{k'}}}{\left[\sum_{j=1}^Ke^{a_j}\right]}
\end{align}

\begin{align}
\large = -y_ky_{k'}
\end{align}

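We can combine the two answers using the delta function. This gives us two equivalent possibilities, but only one is useful: the second, because its leading factor cancels the 1/_y_ term in the derivative of _J_.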

\begin{align}
\large\frac{dy_k}{da_{k'}} = y_k(\delta_{kk'}-y_{k'})
\end{align}

\begin{align}
\large\frac{dy_k}{da_{k'}} = y_{k'}(\delta_{kk'}-y_{k})
\end{align}

We now combine the latter formula with our derivative of _J_ with respect to _y_:

\begin{align}
\large\sum_{k'}^Kt_{k'}^n\frac{1}{y_{k'}^n}y_{k'}(\delta_{kk'}-y_{k})
\end{align}

\begin{align}
\large=\sum_{k'}^Kt_{k'}^n(\delta_{kk'}-y_{k})
\end{align}

\begin{align}
\large=\sum_{k'}^Kt_{k'}^n\delta_{kk'}-\sum_{k'}^Kt_{k'}^ny_{k}
\end{align}

Looking at the first term above, we recall that delta is equal to 1 only when _k=k'_ and zero otherwise, which allows us to get rid of the delta and the summation term, giving us:

\begin{align}
\large t_k^n-\sum_{k'}^Kt_{k'}^ny_{k}
\end{align}

Note also that _k'_ has become _k_ in the first term (by definition). We then move the _y_ to the front, as it doesn't depend on the summation.

\begin{align}
\large t_k^n-y_{k}\sum_{k'}^Kt_{k'}^n
\end{align}

And lastly, because the targets are one-hot encoded, the target is equal to 1 for exactly one class and zero for all other classes, so the sum over _k'_ equals 1. That reduces our equation to:

\begin{align}
\large t_k^n-y_k^n
\end{align}
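
As a quick numerical check of this result, the sketch below compares _t − y_ with a finite-difference derivative of the single-sample log-likelihood with respect to each activation. The activation values, the target, and the helper names are assumptions made for the example.

```python
import numpy as np

def softmax_vec(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def sample_log_likelihood(a, t):
    """Log-likelihood of one sample with activations a and one-hot target t."""
    return np.sum(t * np.log(softmax_vec(a)))

a = np.array([0.2, -0.4, 1.5])   # assumed output activations
t = np.array([0.0, 1.0, 0.0])    # one-hot target: class 1 is the true class

analytic = t - softmax_vec(a)

# Central finite differences, one activation at a time
eps = 1e-6
numeric = np.array([
    (sample_log_likelihood(a + eps * np.eye(3)[k], t)
     - sample_log_likelihood(a - eps * np.eye(3)[k], t)) / (2 * eps)
    for k in range(3)
])
print(analytic)
print(numeric)  # should match the analytic values closely
```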