<h1>Forward propagation</h1>
<p>We have a fully-connected network with <font color='red'>$L$</font> layers.
<br>The activations of the nodes in layer $(l)$ are stored in an activations column-vector <font color='red'>$a^{(l)}$</font>, where the superscript index denotes the layer. 
<br>The connections from the nodes in layer $(l-1)$ to the nodes in layer $(l)$ are stored in a weight matrix <font color='red'>$W^{(l)}$</font>, 
<br>and the biases of the nodes are stored in a bias column-vector <font color='red'>$b^{(l)}$</font>.

<p>For a simple forward pass we have:
    
> <font color='red'>$$a^{(0)} = x$$</font>
> <p><font color='red'>$$a^{(l)} = \sigma\left(W^{(l)}a^{(l-1)} + b^{(l)}\right)$$</font>

<p>We introduce a new vector <font color='red'>$z^{(l)} = W^{(l)}a^{(l-1)} + b^{(l)}$</font>,
<br>which is the activation before the component-wise activation function is applied, so that <font color='red'>$a^{(l)} = \sigma\left(z^{(l)}\right)$</font>. 
    <br>We call each component of this vector the <b>“input sum”</b> of a node.
<img src="images/backprop/matrix_multiplection.png" align="center" />
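
<p>The forward pass described above can be sketched in NumPy as follows. This is only a minimal illustration, not a reference implementation: the logistic sigmoid as $\sigma$, the layer sizes, the random initialisation and the helper names <code>sigmoid</code> and <code>forward</code> are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied component-wise (assumed choice of sigma)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, sigma=sigmoid):
    """Return the input sums z^(1..L) and the activations a^(0..L)."""
    a = x
    zs, activations = [], [a]      # a^(0) = x
    for W, b in zip(weights, biases):
        z = W @ a + b              # input sum: z^(l) = W^(l) a^(l-1) + b^(l)
        a = sigma(z)               # activation: a^(l) = sigma(z^(l))
        zs.append(z)
        activations.append(a)
    return zs, activations

# Hypothetical network: 3 inputs, 4 hidden nodes, 2 outputs (so L = 2).
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]

x = rng.standard_normal((3, 1))    # input column-vector
zs, activations = forward(x, weights, biases)
print(activations[-1].shape)       # shape of the output activation a^(L)
```

<p>Each weight matrix has one row per node in layer $(l)$ and one column per node in layer $(l-1)$, so that the product $W^{(l)}a^{(l-1)}$ is again a column-vector.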

<p>The whole network is shown below, from the input vector $x$ to the output activation vector $a^{(L)}$. 
<br>The connections leading into a specific node are shown in color for two of the layers:
<img src="images/backprop/fully_connected.png" align="center" />

<h2>Notation</h2>
<p>
<font color='red'>$L$</font> - number of layers in the network

> <br>Layers are indexed <font color='red'>$l=1,2,...,L-1,L$</font>
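> <br>Nodes in a given layer <font color='red'>$l$</font> are indexed: 
    <font color='red'>$j=0,1,2,...,n-1$</font>
> <br>Nodes in layer <font color='red'>$l-1$</font> are indexed: 
    <font color='red'>$k=0,1,2,...,n-1$</font>
> <br>(here <font color='red'>$n$</font> stands for the number of nodes in the layer being indexed, which may differ between layers)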
<p>
<font color='red'>$y_j$</font> - the desired value of node <font color='red'>$j$</font> in the output layer
    <font color='red'>$L$</font> for a single (specific) training example.
<p>
<font color='red'>$C$</font> - the <b>cost</b> (also called the loss or error) function of the network for a specific example.
    
> <br>e.g. the sum of squared errors, where $\hat{y}_j = a_j^{(L)}$ is the network's output for node $j$: 
    $$C = \sum_{j=0}^{n-1} \left(\hat{y}_j - y_j\right)^2$$
<p>
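<font color='red'>$w_{jk}^{(l)}$</font> - the weight of the connection from node <font color='red'>$k$</font> 
    in layer <font color='red'>$l-1$</font> to node <font color='red'>$j$</font> in layer 
    <font color='red'>$l$</font>, i.e. the entry in row <font color='red'>$j$</font> and column <font color='red'>$k$</font> of <font color='red'>$W^{(l)}$</font>.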
<p>
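<font color='red'>$w_j^{(l)}$</font> - <b>weights vector</b> of node <font color='red'>$j$</font> 
    in layer <font color='red'>$l$</font>, holding the weights of all connections leading into that node (row <font color='red'>$j$</font> of <font color='red'>$W^{(l)}$</font>).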

<p>
<font color='red'>$z_j^{(l)}$</font> - <b>input sum</b> for node <font color='red'>$j$</font> in layer <font color='red'>$l$</font> 
<font color='red'>$$z_j^{(l)} = \sum_{k=0}^{n-1} \left(w_{jk}^{(l)} a_k^{(l-1)}\right) + b_j^{(l)}$$</font>

<p>
<font color='red'>$\sigma^{(l)}$</font> - the <b>activation function</b> used for layer 
    <font color='red'>$l$</font>.

<p>
<font color='red'>$a_j^{(l)}$</font> - the <b>activation output</b> of node 
    <font color='red'>$j$</font> in layer <font color='red'>$l$</font>.
    <font color='red'>$$a_j^{(l)} = \sigma\left(z_j^{(l)}\right)$$</font>
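
<p>To make the per-node notation concrete, the sketch below evaluates $z_j^{(l)}$ and $a_j^{(l)}$ node by node, compares the result with the matrix form $W^{(l)}a^{(l-1)} + b^{(l)}$, and then computes the sum of squared errors. The layer sizes, the random values, the target vector <code>y</code> and the use of the logistic sigmoid are assumptions made purely for illustration, and plain 1-D arrays stand in for the column-vectors.

```python
import numpy as np

def sigmoid(z):
    """Logistic function (assumed choice of sigma)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_prev, n_cur = 3, 2                       # hypothetical sizes of layers l-1 and l
W = rng.standard_normal((n_cur, n_prev))   # W[j, k] plays the role of w_jk^(l)
b = rng.standard_normal(n_cur)             # b[j] plays the role of b_j^(l)
a_prev = rng.standard_normal(n_prev)       # a_prev[k] plays the role of a_k^(l-1)

# Per-node definition: z_j = sum_k w_jk * a_k + b_j
z_loop = np.array([
    sum(W[j, k] * a_prev[k] for k in range(n_prev)) + b[j]
    for j in range(n_cur)
])
a_loop = sigmoid(z_loop)                   # a_j = sigma(z_j)

# Matrix form of the same computation: z = W a + b
z_mat = W @ a_prev + b
print(np.allclose(z_loop, z_mat))          # both forms should agree

# Sum of squared errors, treating this layer as the output layer
y = np.array([0.0, 1.0])                   # made-up desired values y_j
C = np.sum((a_loop - y) ** 2)
print(C)
```

<p>The loop is only there to mirror the index notation; in practice the matrix form is the one to use.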


<h1>Deriving the error</h1>
<p>In the figure below, we zoom in on three adjacent layers anywhere in the network.
<br>The index letters for the nodes in layers <font color='red'>$(l-1)$, $(l)$</font> and <font color='red'>$(l+1)$</font> are <font color='red'>$j$, $k$</font> and <font color='red'>$m$</font> respectively. Note that, unlike in the Notation section, $k$ now indexes layer $(l)$ and $j$ indexes layer $(l-1)$.
<img src="images/backprop/layers_jkm.png" align="center" />

<p>An error function $C$ is defined using one example from our training data, and its derivative is calculated with respect to a single weight $w_{kj}^{(l)}$ in layer $(l)$, which connects node $j$ in layer $(l-1)$ to node $k$ in layer $(l)$.

<p>Using the chain rule we get:
<p><font color='red'>$$
    \frac{\partial C}{\partial w_{kj}^{(l)}} = 
    \frac{\partial C}{\partial z_k^{(l)}} \frac{\partial z_k^{(l)}}{\partial w_{kj}^{(l)}} = 
    \frac{\partial C}{\partial a_k^{(l)}} \frac{\partial a_k^{(l)}}{\partial z_k^{(l)}} \frac{\partial z_k^{(l)}}{\partial w_{kj}^{(l)}} = 
    $$</font>

<p>Using the chain rule again, we expand the factor $\frac{\partial C}{\partial a_k^{(l)}}$.
<br>Notice that the contributions from all neurons in layer $(l+1)$ (indexed by $m$) have to be summed: each of their values depends on $a_k^{(l)}$, and therefore on the weight we are differentiating with respect to, and each of them affects the final error.
<p><font color='red'>$$
     = \sum_{m} \left( \frac{\partial C}{\partial z_m^{(l+1)}} \frac{\partial z_m^{(l+1)}}{\partial a_k^{(l)}} \right) \frac{\partial a_k^{(l)}}{\partial z_k^{(l)}} \frac{\partial z_k^{(l)}}{\partial w_{kj}^{(l)}}   $$</font>
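
<p>The chain-rule expression can be checked numerically on a tiny example. The sketch below rests on several assumptions that are not part of the derivation itself: the logistic sigmoid as activation, the sum of squared errors as cost, layer $(l+1)$ being the output layer, and made-up layer sizes and values. Under these assumptions each factor has a simple closed form, and the product is compared with a central finite-difference estimate of the same derivative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(2)
a_prev = rng.standard_normal(3)        # a^(l-1), indexed by j
W1 = rng.standard_normal((4, 3))       # layer l weights, W1[k, j] ~ w_kj^(l)
b1 = rng.standard_normal(4)
W2 = rng.standard_normal((2, 4))       # layer l+1 weights, W2[m, k] ~ w_mk^(l+1)
b2 = rng.standard_normal(2)
y = np.array([0.0, 1.0])               # made-up desired outputs

def cost(W1_):
    """Sum of squared errors as a function of the layer-l weights."""
    a1 = sigmoid(W1_ @ a_prev + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return np.sum((a2 - y) ** 2)

k, j = 1, 2                            # the single weight w_kj^(l) under study

# Forward pass, keeping the input sums
z1 = W1 @ a_prev + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)

# Factors of the chain rule
dC_dz2 = 2.0 * (a2 - y) * sigmoid_prime(z2)   # dC/dz_m^(l+1), one entry per m
dz2_da1 = W2[:, k]                            # dz_m^(l+1)/da_k^(l), one entry per m
da1_dz1 = sigmoid_prime(z1[k])                # da_k^(l)/dz_k^(l)
dz1_dw = a_prev[j]                            # dz_k^(l)/dw_kj^(l)
analytic = np.sum(dC_dz2 * dz2_da1) * da1_dz1 * dz1_dw

# Central finite difference on the same weight
eps = 1e-6
W_plus, W_minus = W1.copy(), W1.copy()
W_plus[k, j] += eps
W_minus[k, j] -= eps
numeric = (cost(W_plus) - cost(W_minus)) / (2.0 * eps)

print(analytic, numeric)               # the two values should agree closely
```

<p>The sum over $m$ appears in the code as <code>np.sum(dC_dz2 * dz2_da1)</code>: every node in layer $(l+1)$ contributes one term.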
