# Cheat sheet on Neural Networks 2

This cheat sheet is an article on Neural Networks and follows [the first cheat sheet on Neural Networks](https://blog.innovea.tech/https-medium-com-sebastien-attia-cheat-sheet-on-neural-networks-1-30735616584a). Its purpose is to bring to life the equations described there.


## Definition of the cost and activation functions


### The cost function

Remember that the cost function $J(\theta)$ is defined as follows:

$$
J(\theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[ cost( a^{(L)}_k, y^{(i)}_k) \right] + \frac{\lambda}{2m}\sum_{l=2}^{L} \sum_{j=1}^{s_l} \sum_{i=1}^{s_{l+1}} ( \theta_{i,j}^{(l)})^2
$$

The per-output term $cost( a^{(L)}_k, y^{(i)}_k)$ can be defined as in linear regression or as in logistic regression:


| Regression type | $cost( a^{(L)}_k, y^{(i)}_k)$ | $\frac{\partial cost( a^{(L)}_k, y^{(i)}_k)}{\partial a^{(L)}_k}$ |
|:---------------------:|:-------------------------------:|:------------------------------------------------:|
| Linear regression   | $\frac{1}{2}(a^{(L)}_k-y^{(i)}_k)^2$ | $(a^{(L)}_k-y^{(i)}_k)$   |
| Logistic regression | $y^{(i)}_k \log(a^{(L)}_k) + (1 - y^{(i)}_k).\log(1 - a^{(L)}_k)$ | $\frac{y^{(i)}_k}{a^{(L)}_k} - \frac{1 - y^{(i)}_k}{1 - a^{(L)}_k}$ |
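
As a quick sanity check, the two rows of the table can be written as plain Python functions and each derivative compared with a central finite difference. This is only a sketch: the function names are illustrative and not taken from the previous article, and the logistic term is kept exactly as written in the table, without the leading minus sign that sits in front of the sum in $J(\theta)$.

```python
import math

def squared_error(a, y):
    """Linear-regression term: 1/2 (a - y)^2."""
    return 0.5 * (a - y) ** 2

def squared_error_grad(a, y):
    """Derivative of the linear-regression term with respect to a."""
    return a - y

def log_likelihood(a, y):
    """Logistic-regression term as written in the table (no leading minus)."""
    return y * math.log(a) + (1 - y) * math.log(1 - a)

def log_likelihood_grad(a, y):
    """Derivative of the logistic-regression term with respect to a."""
    return y / a - (1 - y) / (1 - a)

def central_difference(f, a, y, eps=1e-6):
    """Numerical derivative of f with respect to its first argument."""
    return (f(a + eps, y) - f(a - eps, y)) / (2 * eps)

a, y = 0.7, 1.0  # arbitrary activation in (0, 1) and target
print(squared_error_grad(a, y), central_difference(squared_error, a, y))
print(log_likelihood_grad(a, y), central_difference(log_likelihood, a, y))
```

On each printed line the analytic and numerical values should agree to several decimal places.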


### The activation functions

We consider the sigmoid and the hyperbolic tangent:


| Activation function | $g(z)$ | $g'(z)$ |
|---------------------|:------:|:-------:|
| Sigmoid             | $\frac{1}{1+e^{-z}}$ | $g(z).(1-g(z))$ |
| Hyperbolic tangent  | $\frac{e^z-e^{-z}}{e^{z} + e^{-z}}$  | $1 - (g(z))^2$ |
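
The derivatives in this table can be checked numerically in the same way. The sketch below uses only the standard `math` module; the helper names are illustrative, and `math.tanh` stands in for the hyperbolic tangent formula of the table.

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """g'(z) = g(z) (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

def tanh_prime(z):
    """g'(z) = 1 - g(z)^2 for the hyperbolic tangent."""
    return 1.0 - math.tanh(z) ** 2

def central_difference(f, z, eps=1e-6):
    """Numerical derivative of a scalar function f at z."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

for z in (-2.0, 0.0, 1.5):
    print(z, sigmoid_prime(z), central_difference(sigmoid, z))
    print(z, tanh_prime(z), central_difference(math.tanh, z))
```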


## Bringing the equations to life

For the concrete implementation of the backpropagation algorithm, we will choose:
- the cost function of the __logistic regression__ and
- the __sigmoid__ for the activation function


### Equations 

Let's express the __[third equation](https://blog.innovea.tech/https-medium-com-sebastien-attia-cheat-sheet-on-neural-networks-1-30735616584a#mjx-eq-3)__ of the previous article ["Cheat sheet on Neural Networks 1"](https://blog.innovea.tech/https-medium-com-sebastien-attia-cheat-sheet-on-neural-networks-1-30735616584a).

$$
\begin{align}
\delta^{(L)} & = \nabla_{a^{(L)}} J(\theta) \odot g'(z^{(L)})\\
\delta^{(L)}_p & = \frac{\partial J(\theta)}{\partial a^{(L)}_p} . g'(z^{(L)}_p),\ \forall p \in \{1, ..., s_L\}\\
\delta^{(L)}_p & = \frac{\partial \left( - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[ cost( a^{(L)}_k, y^{(i)}_k) \right] + \frac{\lambda}{2m}\sum_{l=2}^{L} \sum_{j=1}^{s_l} \sum_{i=1}^{s_{l+1}} ( \theta_{i,j}^{(l)})^2 \right)}{\partial a^{(L)}_p} . g'(z^{(L)}_p)
\end{align}
$$

The second term is 0, because the regularization term does not depend on $a^{(L)}_p$.  
In the first term, only the $k=p$ summand has a nonzero derivative, so the expression reduces to:

$$
\begin{align}
\delta^{(L)}_p & = - \frac{1}{m} \frac{\partial \sum_{i=1}^m \left[ cost( a^{(L)}_p, y^{(i)}_p) \right]} {\partial a^{(L)}_p} . g'(z^{(L)}_p)\\
& = - \frac{1}{m} \sum_{i=1}^m \frac{\partial \left[ cost( a^{(L)}_p, y^{(i)}_p) \right]} {\partial a^{(L)}_p} . g'(z^{(L)}_p)\\
& = - \frac{1}{m} \sum_{i=1}^m \left( \frac{y^{(i)}_p}{a^{(L)}_p} - \frac{1 - y^{(i)}_p}{1 - a^{(L)}_p} \right) . g(z^{(L)}_p).(1-g(z^{(L)}_p))\\
& = - \frac{1}{m} \sum_{i=1}^m \left( \frac{y^{(i)}_p}{a^{(L)}_p} - \frac{1 - y^{(i)}_p}{1 - a^{(L)}_p} \right) . a^{(L)}_p.(1-a^{(L)}_p)\\
& = - \frac{1}{m} \sum_{i=1}^m \left( y^{(i)}_p - a^{(L)}_p \right)\\
\delta^{(L)} & = - \frac{1}{m} \sum_{i=1}^m \left( y^{(i)} - a^{(L)} \right)
\end{align}
$$
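
The simplification in the last steps can be verified numerically for a single example ($m = 1$). The sketch below compares the expression before simplification with the final form $-(y - a) = a - y$; the function names are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output_delta_unsimplified(z, y):
    """-(d cost / d a) . g'(z) for one example, before simplification."""
    a = sigmoid(z)
    dcost_da = y / a - (1 - y) / (1 - a)
    return -dcost_da * a * (1 - a)

def output_delta_simplified(z, y):
    """Simplified form for one example: -(y - a) = a - y."""
    return sigmoid(z) - y

for z in (-3.0, -0.5, 0.0, 2.0):
    for y in (0.0, 1.0):
        # The two printed values should be equal up to rounding error.
        print(z, y, output_delta_unsimplified(z, y), output_delta_simplified(z, y))
```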

The __[second equation](https://blog.innovea.tech/https-medium-com-sebastien-attia-cheat-sheet-on-neural-networks-1-30735616584a#mjx-eq-2)__ becomes:

$$
\begin{align}
\delta^{(l)} & = [(\theta^{(l+1)})^T . \delta^{(l+1)}] \odot g'(z^{(l)})\\
\delta^{(l)} & = [(\theta^{(l+1)})^T . \delta^{(l+1)}] \odot a^{(l)} \odot (1-a^{(l)})
\end{align}
$$
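
A minimal sketch of this hidden-layer equation, using plain Python lists so that it has no dependencies. It assumes that $\theta^{(l+1)}$ is stored as a list of $s_{l+1}$ rows of length $s_l$ with no bias column; the function name and the sample numbers are made up for illustration.

```python
def hidden_delta(theta_next, delta_next, a):
    """delta^(l) = [(theta^(l+1))^T . delta^(l+1)] (*) a^(l) (*) (1 - a^(l)).

    theta_next: s_{l+1} rows of length s_l (assumption: no bias column)
    delta_next: delta^(l+1), length s_{l+1}
    a:          sigmoid activations a^(l), length s_l
    """
    # Transposed matrix-vector product: component j sums over the rows i.
    back = [
        sum(theta_next[i][j] * delta_next[i] for i in range(len(delta_next)))
        for j in range(len(a))
    ]
    # Element-wise product with the sigmoid derivative a (1 - a).
    return [b * a_j * (1.0 - a_j) for b, a_j in zip(back, a)]

theta_next = [[0.1, -0.2, 0.3],
              [0.4, 0.5, -0.6]]
delta_next = [0.5, -1.0]
a = [0.5, 0.5, 0.5]
print(hidden_delta(theta_next, delta_next, a))
```

With these numbers the back-propagated vector is approximately $(-0.35, -0.6, 0.75)$, and each component is then scaled by $a(1-a) = 0.25$.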


### Implementation of the backpropagation algorithm

Training with the backpropagation algorithm comes in several flavours:
- the batch gradient descent,
- the stochastic gradient descent,
- the mini-batch gradient descent.

We first define the mini-batch gradient descent.  


#### The mini-batch gradient descent

We define the variable b, the mini-batch size.

1. Randomly initialize the weights of the Neural Network,
2. Split the training set $\{(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})\}$ into mini-batches of size b,
3. For each mini-batch:  
    0. initialize the accumulator $\Delta^{(l)}$ to 0, $\forall l \in \{2, ..., L\}$,
    1. for each input of the mini-batch:  
        1. perform the forward propagation, by applying equation (0) recursively to the input vector,  
        2. recursively, for each $l \in \{L, L-1, ..., 2\}$:
            1. compute the value of $\delta^{(l)}$: when l=L, use equation (3), else use equation (2),
            2. compute $\Delta^{(l)} := \Delta^{(l)} + \nabla_{\theta^{(l)}} J(\theta)$ by applying equation (1),
    2. compute $\Delta^{(l)} := \Delta^{(l)} / (size\ of\ mini-batch),\ \forall l \in \{2, ..., L\}$ (the last batch may have a size ≠ b),
    3. optionally: compute an approximation of $\nabla_{\theta^{(l)}} J(\theta)$ (__gradient checking__) and compare the value to $\Delta^{(l)}$,
    4. recompute the weights $\theta^{(l)}$ of the Neural Network by applying: $\theta^{(l)} := \theta^{(l)} - \alpha. \Delta^{(l)},\ \forall l \in \{2, ..., L\}$.
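
The steps above can be sketched in dependency-free Python as follows. This is an illustration under several assumptions that are not stated in this article: there are no bias units (a constant input of 1 plays that role in the toy data), there is no regularization ($\lambda = 0$), equation (1) is taken to be the outer product of $\delta^{(l)}$ with the activations of the previous layer, and the whole procedure is repeated for a number of `epochs`. All function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    """Forward propagation: return the activations of every layer, input included."""
    activations = [x]
    for theta in thetas:
        prev = activations[-1]
        activations.append([sigmoid(sum(w * p for w, p in zip(row, prev))) for row in theta])
    return activations

def backward(thetas, activations, y):
    """Backpropagation for one example: one gradient matrix per weight matrix."""
    delta = [a - t for a, t in zip(activations[-1], y)]  # output layer: a - y
    grads = [None] * len(thetas)
    for k in range(len(thetas) - 1, -1, -1):
        a_prev = activations[k]
        # Assumed form of equation (1): outer product of delta and previous activations.
        grads[k] = [[d * p for p in a_prev] for d in delta]
        if k > 0:  # hidden layer: back-propagate delta through thetas[k]
            back = [sum(thetas[k][i][j] * delta[i] for i in range(len(delta)))
                    for j in range(len(a_prev))]
            delta = [b * p * (1.0 - p) for b, p in zip(back, a_prev)]
    return grads

def cost(thetas, data):
    """Unregularized logistic cost, averaged over the data set."""
    total = 0.0
    for x, y in data:
        out = forward(thetas, x)[-1]
        total -= sum(t * math.log(a) + (1 - t) * math.log(1 - a) for a, t in zip(out, y))
    return total / len(data)

def train(thetas, data, b, alpha, epochs):
    """Mini-batch gradient descent; updates thetas in place and returns it."""
    for _ in range(epochs):
        for start in range(0, len(data), b):
            batch = data[start:start + b]  # the last batch may be smaller than b
            acc = [[[0.0] * len(row) for row in theta] for theta in thetas]  # accumulators
            for x, y in batch:
                grads = backward(thetas, forward(thetas, x), y)
                for acc_l, grad_l in zip(acc, grads):
                    for acc_row, grad_row in zip(acc_l, grad_l):
                        for j, g in enumerate(grad_row):
                            acc_row[j] += g
            # Average the accumulated gradients and update the weights.
            for theta, acc_l in zip(thetas, acc):
                for theta_row, acc_row in zip(theta, acc_l):
                    for j, g in enumerate(acc_row):
                        theta_row[j] -= alpha * g / len(batch)
    return thetas

def init_weights(sizes, rng):
    """Random weights in [-0.5, 0.5] for consecutive layer sizes."""
    return [[[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]

# Toy data: logical OR, with a constant 1 as first input to play the role of a bias.
data = [([1.0, 0.0, 0.0], [0.0]), ([1.0, 0.0, 1.0], [1.0]),
        ([1.0, 1.0, 0.0], [1.0]), ([1.0, 1.0, 1.0], [1.0])]
thetas = init_weights([3, 3, 1], random.Random(0))
print("cost before:", cost(thetas, data))
train(thetas, data, b=2, alpha=0.5, epochs=2000)
print("cost after: ", cost(thetas, data))
```

Plain lists keep the sketch self-contained; a matrix library would make the accumulation and update loops much shorter. The optional gradient-checking step is not shown: it would compare the averaged result of `backward` with a finite difference of `cost`.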


#### The batch gradient descent
... is a mini-batch gradient descent with b = m.

#### The stochastic gradient descent
... is a mini-batch gradient descent with b = 1.
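
The relation between the three flavours comes down to how the training set is split. In the sketch below, the helper `mini_batches` is a hypothetical name and integers stand in for the $(x^{(i)}, y^{(i)})$ pairs.

```python
def mini_batches(training_set, b):
    """Split the training set into consecutive mini-batches of size b."""
    return [training_set[i:i + b] for i in range(0, len(training_set), b)]

training_set = list(range(10))  # stand-ins for (x, y) pairs; m = 10
m = len(training_set)

print(len(mini_batches(training_set, m)))  # batch gradient descent: 1 batch
print(len(mini_batches(training_set, 1)))  # stochastic gradient descent: m batches
print([len(batch) for batch in mini_batches(training_set, 4)])  # [4, 4, 2]
```

The last line shows the case mentioned in the algorithm, where the final mini-batch is smaller than b.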


## Implementation of the Neural Network in Python



<div style="text-align: right"> To Victor, Oscar and all those who will follow </div>