# Cheat sheet on Neural Networks 1  

This article is a cheat sheet on neural networks. It is based on the following course and article:
- the excellent [Machine Learning course](https://www.coursera.org/learn/machine-learning) on Coursera from Professor Andrew Ng, Stanford,
- the very good [article from Michael Nielsen](http://neuralnetworksanddeeplearning.com/chap2.html), explaining the backpropagation algorithm.


## Why are neural networks powerful?

It is proven mathematically that:  

> Suppose we’re given a [continuous] function f(x) which we’d like to compute to within some desired accuracy ϵ>0. The guarantee is that by using enough hidden neurons we can always find a neural network whose output g(x) satisfies:
|g(x)−f(x)|<ϵ, for all inputs x.  

_Michael Nielsen — From the following [article](http://neuralnetworksanddeeplearning.com/chap4.html)_

##  Conventions  
Let’s define a neural network with the following convention:

L = total number of layers in the network.  
$s_l$ = number of units (not counting bias unit) in layer l.  
K = number of units in output layer ( = $s_L$ ).  

<img src="images/Neural_Network_definition.png" />

We define the weight matrix θ of layer l, which maps the activations of layer (l-1) and its bias unit to layer l, as follows:

$$
\theta^{(l)} \in \mathbb{R}^{s_l \times (s_{l-1}+1)}
$$

$$
\theta^{(l)} = 
\begin{bmatrix}
    [ \theta^{(l)}_1 ]^T \\
    [ \theta^{(l)}_2 ]^T \\
    \vdots \\
    [ \theta^{(l)}_{s_{l}} ]^T
\end{bmatrix} =
\begin{bmatrix}
    \theta_{1,0} & \dots & \theta_{1,j} & \dots  & \theta_{1,s_{l-1}} \\
    \vdots       &       & \vdots       &        & \vdots \\
    \theta_{i,0} & \dots & \theta_{i,j} & \dots  & \theta_{i,s_{l-1}} \\
    \vdots       &       & \vdots       &        & \vdots \\
    \theta_{s_l,0} & \dots & \theta_{s_l,j} & \dots  & \theta_{s_l,s_{l-1}} \\
\end{bmatrix}
$$

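Hence, with g the activation function and $a^{(l-1)}$ extended with the bias unit $a^{(l-1)}_0 = 1$, we have the following relation:
$$z^{(l)} = \theta^{(l)}.a^{(l-1)}, \qquad a^{(l)} = g(z^{(l)})$$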
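As a minimal sketch of this relation, the function below computes the activations of one layer from those of the previous layer. The sigmoid activation, the function names and the example weights are assumptions made for illustration; they do not come from the sources.

```python
import math

def sigmoid(z):
    """Logistic activation, one common choice for g (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward_layer(theta, a_prev):
    """Compute a^(l) = g(theta^(l) . a^(l-1)) for one layer.

    theta  : s_l rows, each with s_(l-1) + 1 weights (column 0 is the bias weight)
    a_prev : the s_(l-1) activations of layer l-1, without the bias unit
    """
    a_with_bias = [1.0] + list(a_prev)  # prepend the bias unit a_0 = 1
    z = [sum(w * a for w, a in zip(row, a_with_bias)) for row in theta]
    return [sigmoid(z_k) for z_k in z]

# Hypothetical layer with 2 inputs and 3 units: theta is 3 x (2 + 1)
theta_example = [
    [0.0, 1.0, -1.0],
    [0.5, 0.0, 0.0],
    [-1.0, 2.0, 1.0],
]
a_out = forward_layer(theta_example, [1.0, 1.0])
```

Calling `forward_layer` once per layer, feeding each output into the next call, gives the output of the whole network.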


## The cost function of a Neural Network

The training set is defined by: $ \{ (x^{(1)},y^{(1)}), \dots, (x^{(m)},y^{(m)}) \} $

x and y are vectors, with respectively the same dimensions as the input and output layers of the neural network.  

The cost function of a neural network is the following:


$$
J(\theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[ cost( a^{(L)}_k, y^{(i)}_k) \right] + \frac{\lambda}{2m}\sum_{l=2}^{L} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l-1}} ( \theta_{i,j}^{(l)})^2
$$

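$a^{(L)}_k$ is the k-th output of the neural network for the input $x^{(i)}$, and depends on the weights θ of the neural network. For the logistic cost used in the course, $cost(a, y) = y \log(a) + (1-y) \log(1-a)$, which explains the leading minus sign.  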
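The cost function can be sketched in a few lines of Python. This sketch assumes the logistic (cross-entropy) cost from the course, that is, a bracketed term equal to $y \log(a) + (1-y) \log(1-a)$, and it leaves the bias weights out of the penalty. The function name and the data layout are hypothetical.

```python
import math

def cost_function(outputs, targets, thetas, lam):
    """Regularised cross-entropy cost J(theta).

    outputs : m network outputs a^(L), each a list of K values in (0, 1)
    targets : m target vectors y, each a list of K values in {0, 1}
    thetas  : weight matrices as lists of rows; column 0 holds the bias weights
    lam     : regularisation parameter lambda
    """
    m = len(outputs)
    data_term = 0.0
    for a, y in zip(outputs, targets):
        for a_k, y_k in zip(a, y):
            data_term += y_k * math.log(a_k) + (1 - y_k) * math.log(1 - a_k)
    # Bias weights (column 0 of each matrix) are excluded from the penalty
    reg_term = sum(w ** 2 for theta in thetas for row in theta for w in row[1:])
    return -data_term / m + lam / (2 * m) * reg_term

# One example, one output unit, a single 1 x 2 weight matrix: J = log(2) + 2
J_example = cost_function([[0.5]], [[1]], [[[0.0, 2.0]]], lam=1.0)
```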

The objective is now to train the neural network, that is, to find the weights θ that minimise the cost function J(θ).

## Mathematical reminder: the chain rule

Let’s define the functions f, g and h as follows:

$$ f:\mathbb{R}^n \rightarrow \mathbb{R}  $$

$$ g:\mathbb{R}^p \rightarrow \mathbb{R}^n $$

$$ h = f \circ g $$

The partial derivatives of h are given by the chain rule:

$$
\forall i \in \{ 1, \dots , p \}, \quad
\frac{\partial h}{\partial x_i} = 
\sum_{k=1}^{n} \frac{\partial f}{\partial g_k} \frac{\partial g_k}{\partial x_i}
$$

(See the following [online course](https://ocw.mit.edu/courses/mathematics/18-02sc-multivariable-calculus-fall-2010/2.-partial-derivatives/) on partial derivatives from MIT.)
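The chain rule can be checked numerically on a small example. The functions below are arbitrary choices made for illustration (n = p = 2): the sketch compares the chain-rule sum with a central finite difference.

```python
def g(x1, x2):
    """g: R^2 -> R^2, an arbitrary example."""
    return (x1 + x2, x1 * x2)

def f(u, v):
    """f: R^2 -> R, an arbitrary example."""
    return u * v

def h(x1, x2):
    """h = f o g."""
    return f(*g(x1, x2))

def dh_dx1_chain(x1, x2):
    """Partial derivative of h with respect to x1, via the chain rule."""
    u, v = g(x1, x2)
    df_du, df_dv = v, u        # partial derivatives of f
    du_dx1, dv_dx1 = 1.0, x2   # partial derivatives of g_1 and g_2 w.r.t. x1
    return df_du * du_dx1 + df_dv * dv_dx1

def dh_dx1_numeric(x1, x2, eps=1e-6):
    """Central finite-difference approximation of the same derivative."""
    return (h(x1 + eps, x2) - h(x1 - eps, x2)) / (2 * eps)

chain_value = dh_dx1_chain(2.0, 3.0)
numeric_value = dh_dx1_numeric(2.0, 3.0)
```

Both values agree up to floating-point error, and they match the closed form $2 x_1 x_2 + x_2^2$ obtained by expanding h by hand.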


## The backpropagation algorithm

We use __gradient descent__ to find the weights θ that minimise J: $ \min\limits_{\theta} J(\theta)$

Gradient descent requires computing:

$$ \frac{\partial J(\theta)}{\partial \theta^{(l)}_{i,j}} $$
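To see why this derivative is needed, here is a minimal sketch of gradient descent on a one-dimensional toy cost. The learning rate `alpha`, the number of steps and the toy cost $J(\theta) = (\theta - 3)^2$ are assumptions made for illustration, not part of the sources.

```python
def gradient_descent(grad, theta0, alpha=0.1, steps=100):
    """Repeat the update theta := theta - alpha * dJ/dtheta."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

# Toy cost J(theta) = (theta - 3)^2, whose derivative is 2 * (theta - 3)
theta_min = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0)
```

For a neural network the same update is applied to every weight $\theta^{(l)}_{i,j}$, which is why each partial derivative has to be computed.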

___In the following parts, we consider only the first part of J(θ) (as if the regularisation term λ=0). The partial derivative of the second term of J(θ) is easy to compute.___
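For reference, here is a sketch of that second derivative, assuming (as in the cost function above) that the bias weights $\theta^{(l)}_{i,0}$ are not regularised. The primed indices are only summation variables:

$$
\frac{\partial}{\partial \theta^{(l)}_{i,j}} \left[ \frac{\lambda}{2m} \sum_{l'} \sum_{i'} \sum_{j' \geq 1} ( \theta^{(l')}_{i',j'} )^2 \right]
=
\frac{\lambda}{m} \theta^{(l)}_{i,j} \quad \text{for } j \geq 1
$$

The derivative is 0 for j = 0, since the bias weights do not appear in the sum.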


### Definition of ẟ

Let’s define $\delta$. When $\delta^{(l)}_i$, the error of unit i in layer l, is multiplied by $a^{(l-1)}_j$, the output of unit j in layer (l-1), we obtain the partial derivative of the cost function with respect to $\theta^{(l)}_{i,j}$.

Let’s use the chain rule and expand this derivative through z:

$$ 
\frac{\partial J(\theta)}{\partial \theta^{(l)}_{i,j}} 
=
\sum_{k=1}^{s_l} \frac{\partial J(\theta)}{\partial z^{(l)}_k} \frac{\partial z^{(l)}_k}{\partial \theta^{(l)}_{i,j}}
$$

(Recall that J depends on θ only through z.)

As: 
$$z^{(l)}_k = [ \theta^{(l)}_k ]^T . a^{(l-1)} = \sum_{p=0}^{s_{l-1}} \theta^{(l)}_{k,p} \times a^{(l-1)}_p$$

For k ≠ i, no term of the sum contains $\theta^{(l)}_{i,j}$, so:

$$\frac{\partial z^{(l)}_k}{\partial \theta^{(l)}_{i,j}} = 0$$

For k = i, only the term p = j of the sum contributes, so:

$$\frac{\partial z^{(l)}_i}{\partial \theta^{(l)}_{i,j}} = a^{(l-1)}_j$$

Hence only the term k = i remains in the chain rule sum.

We define the __error 𝛿__ of unit k in layer l:

$$ \delta^{(l)}_k = \frac{\partial J(\theta)}{\partial z^{(l)}_k} $$

So we have:

$$ 
\frac{\partial J(\theta)}{\partial \theta^{(l)}_{i,j}} 
=
\delta^{(l)}_i . a^{(l-1)}_j
$$
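This relation can be checked numerically on a single layer. The sketch below uses a toy cost $J = \frac{1}{2} \sum_k (z_k - y_k)^2$, chosen only because its $\delta_k = z_k - y_k$ is easy to write down. The cost, the weights and the helper names are all hypothetical.

```python
def z_values(theta, a_prev):
    """z_k = sum_p theta[k][p] * a_prev[p], where a_prev[0] = 1 is the bias unit."""
    return [sum(w * a for w, a in zip(row, a_prev)) for row in theta]

def toy_cost(theta, a_prev, y):
    """Toy cost J = 1/2 * sum_k (z_k - y_k)^2 (an assumption for this check)."""
    z = z_values(theta, a_prev)
    return 0.5 * sum((z_k - y_k) ** 2 for z_k, y_k in zip(z, y))

theta = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
a_prev = [1.0, 2.0, -1.0]  # bias unit followed by two activations
y = [0.5, -1.0]

# delta_k = dJ/dz_k for the toy cost
delta = [z_k - y_k for z_k, y_k in zip(z_values(theta, a_prev), y)]

# Analytic gradient: dJ/dtheta[i][j] = delta[i] * a_prev[j]
analytic = [[d * a for a in a_prev] for d in delta]

def numeric_grad(i, j, eps=1e-6):
    """Central finite difference of the toy cost with respect to theta[i][j]."""
    plus = [row[:] for row in theta]
    minus = [row[:] for row in theta]
    plus[i][j] += eps
    minus[i][j] -= eps
    return (toy_cost(plus, a_prev, y) - toy_cost(minus, a_prev, y)) / (2 * eps)
```

Comparing `analytic[i][j]` with `numeric_grad(i, j)` for every pair (i, j) is a simple form of gradient checking.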

### Value of ẟ for the layer L

Now let’s find 𝛿 for the output layer (layer L):

