# Back-Propagation

![Imgur](https://i.imgur.com/fWYI0TY.jpg)

## Overview

In this post, we will build a neural network with three layers:


*   Input layer with two input neurons
*   One hidden layer with two neurons (activation: ReLU; the hidden values in this example are positive, so ReLU leaves them unchanged)
*   Output layer with a single neuron

![Imgur](https://i.imgur.com/JcN0Qzc.jpg)

## Weights, weights, weights

Neural network training is about finding weights that minimize prediction error. We usually start training with a set of randomly generated weights. Then, backpropagation is used to update the weights so that the network maps inputs to the correct outputs.

Our initial weights will be as follows: w1 = 0.12, w2 = 0.33, w3 = 0.15, w4 = 0.09, w5 = 0.11 and w6 = 0.12.

![Imgur](https://i.imgur.com/WryGHvt.jpg)

## Dataset

Our dataset has one sample with two inputs and one output.

![Imgur](https://i.imgur.com/PMor3Se.jpg)

Our single sample is as follows: inputs = [2, 3] and output = [1].

![Imgur](https://i.imgur.com/WTFEUDY.jpg)

## Forward Pass

We will use the given weights and inputs to predict the output. The inputs are multiplied by the weights, and the results are then passed forward to the next layer.

![Imgur](https://i.imgur.com/Ml6nF58.jpg)

\begin{equation*}
\begin{bmatrix}\color{gold}{2} & \color{gold}{3} \end{bmatrix} . \begin{bmatrix}\color{royalblue}{0.12} & \color{orange}{0.15} \\ \color{royalblue}{0.33} & \color{orange}{0.09} \end{bmatrix} = \begin{bmatrix}\color{deepskyblue}{1.23} & \color{deepskyblue}{0.57} \end{bmatrix} . \begin{bmatrix}\color{darkgray}{0.11} \\ \color{darkgray}{0.12} \end{bmatrix} = \begin{bmatrix}\color{lightgreen}{0.2037} \end{bmatrix}
\end{equation*}

<center>$\color{gold}2*\color{royalblue}{0.12} + \color{gold}3*\color{royalblue}{0.33} = \color{deepskyblue}{1.23}$</center>

<center>$\color{gold}2*\color{orange}{0.15} + \color{gold}3*\color{orange}{0.09} = \color{deepskyblue}{0.57}$</center>

<center> $\color{deepskyblue}{1.23}*\color{darkgray}{0.11} + \color{deepskyblue}{0.57}*\color{darkgray}{0.12} = \color{lightgreen}{0.2037}$</center>
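The arithmetic above can be reproduced with a few lines of plain Python. This is a minimal sketch, not a reference implementation: the names `forward`, `w_hidden` and `w_output` are chosen here for illustration, and ReLU is left out because both hidden values are positive in this example.

```python
# Forward pass for the 2-2-1 network, using the weights and sample from the text.
inputs = [2.0, 3.0]            # i1, i2
w_hidden = [[0.12, 0.15],      # row for i1: w1, w3
            [0.33, 0.09]]      # row for i2: w2, w4
w_output = [0.11, 0.12]        # w5, w6


def forward(inputs, w_hidden, w_output):
    """Return (hidden values, prediction) for a single sample."""
    # h_j = i1 * W[0][j] + i2 * W[1][j]  (row vector times matrix)
    hidden = [sum(i * row[j] for i, row in zip(inputs, w_hidden))
              for j in range(len(w_output))]
    # prediction = h1 * w5 + h2 * w6
    prediction = sum(h * w for h, w in zip(hidden, w_output))
    return hidden, prediction


hidden, prediction = forward(inputs, w_hidden, w_output)
print([round(h, 4) for h in hidden], round(prediction, 4))
# [1.23, 0.57] 0.2037
```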

## Calculating Error

Now, it’s time to find out how our network performed by calculating the difference between the actual output and the predicted one. It’s clear that our network output, or **prediction**, is not even close to the **actual output**. We can calculate the difference, or the error, as follows.


![Imgur](https://i.imgur.com/vZ2wzbY.jpg)
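As a small numeric sketch, the error for our sample can be computed directly. It assumes the halved squared error, ½(prediction − actual)², which is the form used in the derivation later in this post; the function name `squared_error` is only illustrative.

```python
def squared_error(prediction, actual):
    """Halved squared error: 1/2 * (prediction - actual)^2."""
    return 0.5 * (prediction - actual) ** 2


prediction = 0.2037   # output of the forward pass
actual = 1.0          # target output of our single sample

error = squared_error(prediction, actual)
print(round(error, 4))
# 0.317
```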

## Reducing Error

The main goal of training is to reduce the **error**, that is, the difference between the **prediction** and the **actual output**. Since the **actual output** is constant, the only way to reduce the error is to change the **prediction** value. The question now is, how do we change the **prediction** value?

By decomposing the **prediction** into its basic elements, we find that the **weights** are the variables that affect its value. In other words, to change the **prediction** value, we need to change the **weights**.


![Imgur](https://i.imgur.com/4JOCvy5.jpg)


The question now is **how do we change (update) the weights so that the error is reduced?**

The answer is **Backpropagation!**

## Backpropagation

**Backpropagation**, short for “backward propagation of errors”, is a mechanism used to update the weights using [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent). It calculates the gradient of the error function with respect to the neural network’s weights. The calculation proceeds backwards through the network.

**Gradient descent** is an iterative optimization algorithm for finding the minimum of a function; in our case we want to minimize the error function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.

![Imgur](https://i.imgur.com/LfQEuDo.jpg)


\begin{equation*}
^*{W}_X = {W}_X - \color{gold}a (\frac{\partial{Error}}  {\partial{W_X}})
\end{equation*}
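To see the update rule in isolation, here is a minimal sketch of gradient descent on a one-variable toy function. The function (w − 3)², the starting point, the learning rate and the step count are all made up for illustration and have nothing to do with our network.

```python
def gradient_descent(grad, w, learning_rate, steps):
    """Repeatedly apply w = w - learning_rate * grad(w)."""
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


# Toy error function (w - 3)^2, whose derivative is 2 * (w - 3).
# Its minimum is at w = 3.
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0),
                           w=0.0, learning_rate=0.1, steps=100)
print(w_final)  # very close to 3
```

Each step moves `w` against the sign of the derivative, so it slides downhill toward the minimum; a smaller learning rate takes smaller steps.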




For example, to update w6, we take the current w6 and subtract the partial derivative of the **error** function with respect to w6. In practice, we first multiply that derivative by a small positive number, which controls the size of the step so that the update does not overshoot the minimum of the error function; this number is called the **learning rate** (written $a$ in the formulas).


\begin{equation*}
^*{W}_6 = {W}_6 - \color{gold}a (\frac{\partial{Error}}  {\partial{W_6}})
\end{equation*}


The derivative of the error function is evaluated by applying the chain rule as follows:

$\frac{\partial{Error}}  {\partial{W_6}} = \frac{\partial{Error}}  {\partial{prediction}} * \frac{\partial{prediction}}  {\partial{W_6}}$  &emsp;&emsp;&emsp; Chain Rule

$\frac{\partial{Error}}  {\partial{W_6}} = \frac{\partial\frac{1}{2}{(prediction-actual)}^2}{\partial{prediction}}  * \frac{\partial\left[(\color{gold}{{i}_1}\color{royalblue}{{w}_1} + \color{gold}{{i}_2}\color{royalblue}{{w}_2})\color{darkgray}{{w}_5} + (\color{gold}{{i}_1}\color{orange}{{w}_3} + \color{gold}{{i}_2}\color{orange}{{w}_4})\color{darkgray}{{w}_6}\right]} {\partial{W_6}}$

$\frac{\partial{Error}}  {\partial{W_6}} = 2 * \frac{1}{2}(prediction - actual)\frac{{\partial(prediction-actual)}}{\partial{prediction}}  * (\color{gold}{{i}_1}\color{orange}{{w}_3} + \color{gold}{{i}_2}\color{orange}{{w}_4}) $

$\frac{\partial{Error}}  {\partial{W_6}} = (prediction - actual) * (\color{deepskyblue}{{h}_2})$

$\frac{\partial{Error}}  {\partial{W_6}} =\Delta\color{deepskyblue}{{h}_2}$  &emsp;&emsp;&emsp; where $\Delta = prediction - actual$


So, to update w6, we can apply the following formula:

\begin{equation*}
^*{W}_6 = {W}_6 - \color{gold}a \Delta \color{deepskyblue}{{h}_2}
\end{equation*}


Similarly, we can derive the update formula for w5 and any other weights existing between the output and the hidden layer.

\begin{equation*}
^*{W}_5 = {W}_5 - \color{gold}a \Delta\color{deepskyblue}{{h}_1}
\end{equation*}
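
As a quick worked check, we can plug in the forward-pass values from above: $\Delta = prediction - actual = 0.2037 - 1 = -0.7963$, $h_1 = 1.23$ and $h_2 = 0.57$. Both gradients come out negative, so subtracting them increases w5 and w6, which pushes the prediction up toward the actual output.

\begin{equation*}
\frac{\partial{Error}}{\partial{W_6}} = \Delta {h}_2 = -0.7963 \times 0.57 \approx -0.4539, \qquad \frac{\partial{Error}}{\partial{W_5}} = \Delta {h}_1 = -0.7963 \times 1.23 \approx -0.9794
\end{equation*}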


However, when moving backward to update w1, w2, w3 and w4, which sit between the input and the hidden layer, the chain rule needs one more factor. The partial derivative of the error function with respect to w1, for example, is as follows.


$\frac{\partial{Error}}  {\partial{W_1}} = \frac{\partial{Error}}  {\partial{prediction}} * \frac{\partial{prediction}}  {\partial{h_1}} * \frac{\partial{h}_1}  {\partial{W_1}}$  &emsp;&emsp;&emsp; Chain Rule

$\frac{\partial{Error}}  {\partial{W_1}} = \frac{\partial\frac{1}{2}{(prediction-actual)}^2}{\partial{prediction}}  * \frac{\partial\left[(\color{deepskyblue}{{h}_1})\color{darkgray}{{w}_5}+(\color{deepskyblue}{{h}_2})\color{darkgray}{{w}_6}\right]} {\partial{h_1}} * \frac{\partial\left[(\color{gold}{{i}_1})\color{royalblue}{{w}_1}+(\color{gold}{{i}_2})\color{royalblue}{{w}_2}\right]} {\partial{w_1}} $

$\frac{\partial{Error}}  {\partial{W_1}} = 2 * \frac{1}{2}(prediction - actual)\frac{{\partial(prediction-actual)}}{\partial{prediction}}  * (\color{darkgray}{{w}_5}) * (\color{gold}{{i}_1}) $

$\frac{\partial{Error}}  {\partial{W_1}} = (prediction - actual) * (\color{darkgray}{{w}_5}\color{gold}{{i}_1}) $

$\frac{\partial{Error}}  {\partial{W_1}} =\Delta\color{darkgray}{{w}_5}\color{gold}{{i}_1}$



We can find the update formula for the remaining weights w2, w3 and w4 in the same way.

In summary, the update formulas for all weights are as follows:

\begin{equation*}
\begin{aligned}
^*\color{darkgray}{{w}_6} &= \color{darkgray}{{w}_6} - \color{brown}a ( \color{deepskyblue}{h_2}.\Delta) \\
^*\color{darkgray}{{w}_5} &= \color{darkgray}{{w}_5} - \color{brown}a ( \color{deepskyblue}{h_1}.\Delta) \\
^*\color{orange}{w_4} &= \color{orange}{w_4} - \color{brown}a ( \color{gold}{i_2}.\Delta\color{darkgray}{w_6} ) \\
^*\color{orange}{w_3} &= \color{orange}{w_3} - \color{brown}a ( \color{gold}{i_1}.\Delta\color{darkgray}{w_6} ) \\
^*\color{royalblue}{w_2} &= \color{royalblue}{w_2} - \color{brown}a ( \color{gold}{i_2}.\Delta\color{darkgray}{w_5} ) \\
^*\color{royalblue}{w_1} &= \color{royalblue}{w_1} - \color{brown}a ( \color{gold}{i_1}.\Delta\color{darkgray}{w_5} )
\end{aligned}
\end{equation*}
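
One way to sanity-check the six gradient terms above is to compare them with finite-difference estimates of the error. The sketch below does that for our sample; the helper names (`predict`, `error`, `analytic_gradients`, `numeric_gradients`) are hypothetical, the error is assumed to be ½(prediction − actual)² as in the derivation, and ReLU is omitted because the hidden values are positive here.

```python
def predict(w, inputs):
    """Return (h1, h2, prediction) for weights w = [w1, ..., w6]."""
    w1, w2, w3, w4, w5, w6 = w
    i1, i2 = inputs
    h1 = i1 * w1 + i2 * w2
    h2 = i1 * w3 + i2 * w4
    return h1, h2, h1 * w5 + h2 * w6


def error(w, inputs, actual):
    return 0.5 * (predict(w, inputs)[2] - actual) ** 2


def analytic_gradients(w, inputs, actual):
    """Gradients from the derived formulas, ordered w1..w6."""
    w1, w2, w3, w4, w5, w6 = w
    i1, i2 = inputs
    h1, h2, prediction = predict(w, inputs)
    delta = prediction - actual
    return [delta * w5 * i1, delta * w5 * i2,   # w1, w2
            delta * w6 * i1, delta * w6 * i2,   # w3, w4
            delta * h1, delta * h2]             # w5, w6


def numeric_gradients(w, inputs, actual, eps=1e-6):
    """Central finite differences: nudge one weight at a time."""
    grads = []
    for k in range(len(w)):
        up, down = list(w), list(w)
        up[k] += eps
        down[k] -= eps
        grads.append((error(up, inputs, actual)
                      - error(down, inputs, actual)) / (2 * eps))
    return grads


weights = [0.12, 0.33, 0.15, 0.09, 0.11, 0.12]
inputs, actual = [2.0, 3.0], 1.0
analytic = analytic_gradients(weights, inputs, actual)
numeric = numeric_gradients(weights, inputs, actual)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff)  # should be tiny if the formulas are right
```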

We can rewrite the update formulas in matrix form as follows:


\begin{equation*}
\begin{bmatrix}\color{darkgray}{w_5} \\ \color{darkgray}{w_6} \end{bmatrix} =  \begin{bmatrix}\color{darkgray}{w_5} \\ \color{darkgray}{w_6} \end{bmatrix} - \color{brown}a \Delta\begin{bmatrix}\color{deepskyblue}{h_1} \\ \color{deepskyblue}{h_2} \end{bmatrix} = \begin{bmatrix}\color{darkgray}{w_5} \\ \color{darkgray}{w_6} \end{bmatrix} - \begin{bmatrix}\color{brown}a\Delta\color{deepskyblue}{h_1} \\ \color{brown}a\Delta\color{deepskyblue}{h_2} \end{bmatrix} 
\end{equation*}

\begin{equation*}
\begin{bmatrix}\color{royalblue}{w_1} & \color{orange}{w_3}  \\ \color{royalblue}{w_2} & \color{orange}{w_4}\end{bmatrix} =  \begin{bmatrix}\color{royalblue}{w_1} & \color{orange}{w_3}  \\ \color{royalblue}{w_2} & \color{orange}{w_4}\end{bmatrix} - \color{brown}a \Delta\begin{bmatrix}\color{gold}{i_1} \\ \color{gold}{i_2} \end{bmatrix} . \begin{bmatrix}\color{darkgray}{w_5} & \color{darkgray}{w_6} \end{bmatrix} =\begin{bmatrix}\color{royalblue}{w_1} & \color{orange}{w_3}  \\ \color{royalblue}{w_2} & \color{orange}{w_4}\end{bmatrix} - \begin{bmatrix}\color{brown}a\color{gold}{i_1}\Delta\color{darkgray}{w_5} & \color{brown}a\color{gold}{i_1}\Delta\color{darkgray}{w_6}  \\ \color{brown}a\color{gold}{i_2}\Delta\color{darkgray}{w_5} & \color{brown}a\color{gold}{i_2}\Delta\color{darkgray}{w_6} \end{bmatrix} 
\end{equation*}

## Backward Pass

Using the derived formulas, we can find the new weights.


The **learning rate** is a hyperparameter, which means that we have to choose its value ourselves; it is not learned during training.


$\Delta = 0.2037-1 = -0.7963$  &emsp;&emsp;&emsp; Delta = prediction - actual


$a=0.05$  &emsp;&emsp;&emsp; **Learning Rate**, a value we pick by hand



$ \begin{bmatrix}\color{darkgray}{w_5} \\ \color{darkgray}{w_6} \end{bmatrix} =  \begin{bmatrix}\color{darkgray}{0.11} \\ \color{darkgray}{0.12} \end{bmatrix} - \color{brown}{0.05} (-0.7963)\begin{bmatrix}\color{deepskyblue}{1.23} \\ \color{deepskyblue}{ 0.57} \end{bmatrix} = \begin{bmatrix}\color{darkgray}{0.11} \\ \color{darkgray}{0.12} \end{bmatrix} - \begin{bmatrix}-0.0490 \\ -0.0227 \end{bmatrix} = \begin{bmatrix}\color{darkgray}{0.16} \\ \color{darkgray}{0.14} \end{bmatrix}$ 

$
 \begin{bmatrix}\color{royalblue}{w_1} & \color{orange}{w_3}  \\ \color{royalblue}{w_2} & \color{orange}{w_4}\end{bmatrix} = \begin{bmatrix}\color{royalblue}{0.12} & \color{orange}{0.15} \\ \color{royalblue}{0.33} & \color{orange}{0.09}\end{bmatrix} - \color{brown}{0.05}(-0.7963)\begin{bmatrix}\color{gold}2 \\ \color{gold}3 \end{bmatrix} . \begin{bmatrix}\color{darkgray}{0.11} & \color{darkgray}{0.12} \end{bmatrix} = \begin{bmatrix}\color{royalblue}{0.12} & \color{orange}{0.15} \\ \color{royalblue}{0.33} & \color{orange}{0.09}\end{bmatrix}  - \begin{bmatrix}-0.009 & -0.010  \\ -0.013 & -0.014 \end{bmatrix} = \begin{bmatrix}\color{royalblue}{0.13} & \color{orange}{0.16} \\ \color{royalblue}{0.34} & \color{orange}{0.10}\end{bmatrix}
$
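
The same backward pass can be sketched in plain Python to double-check the numbers. The variable names are illustrative only; note that the hidden-layer update uses the old values of w5 and w6, as in the matrix formula.

```python
learning_rate = 0.05
inputs = [2.0, 3.0]                        # i1, i2
hidden = [1.23, 0.57]                      # h1, h2 from the forward pass
w_hidden = [[0.12, 0.15], [0.33, 0.09]]    # [[w1, w3], [w2, w4]]
w_output = [0.11, 0.12]                    # [w5, w6]
delta = 0.2037 - 1.0                       # prediction - actual

# Output layer: w5 -= a * delta * h1, w6 -= a * delta * h2
new_w_output = [w - learning_rate * delta * h
                for w, h in zip(w_output, hidden)]

# Hidden layer: each entry loses a * delta * input * (old) output weight,
# i.e. the outer product of [i1, i2] and [w5, w6] scaled by a * delta.
new_w_hidden = [[w - learning_rate * delta * i * w_out
                 for w, w_out in zip(row, w_output)]
                for i, row in zip(inputs, w_hidden)]

print([round(w, 4) for w in new_w_output])
print([[round(w, 4) for w in row] for row in new_w_hidden])
```

Rounded to two decimals, the printed values agree with the updated weights shown in the matrices above.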

Now, using the new **weights**, we will repeat the forward pass:


![Imgur](https://i.imgur.com/I86uHwE.jpg)


\begin{equation*}
\begin{bmatrix}\color{gold}{2} & \color{gold}{3} \end{bmatrix} . \begin{bmatrix}\color{royalblue}{0.13} & \color{orange}{0.16} \\ \color{royalblue}{0.34} & \color{orange}{0.1} \end{bmatrix} = \begin{bmatrix}\color{deepskyblue}{1.29} & \color{deepskyblue}{0.63} \end{bmatrix} . \begin{bmatrix}\color{darkgray}{0.16} \\ \color{darkgray}{0.14} \end{bmatrix} = \begin{bmatrix}\color{lightgreen}{0.29} \end{bmatrix}
\end{equation*}



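The new **prediction**, 0.29, is a little closer to the **actual output** than the previous one, 0.2037. (The hidden values 1.29 and 0.63 are computed from the unrounded updated weights; the two-decimal weights shown would give 1.28 and 0.62.) We can repeat the same process of backward and forward passes until the **error** is close or equal to zero.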
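
Repeating the forward and backward passes can be sketched as a short loop. This is an illustration under a few assumptions rather than a definitive training routine: the function names, the cap of 1000 steps and the stopping threshold of 1e-10 are made up here, the error is ½(prediction − actual)², and ReLU is omitted because the hidden values stay positive for this sample.

```python
def forward(inputs, w_hidden, w_output):
    """Return (hidden values, prediction) for a single sample."""
    hidden = [sum(i * row[j] for i, row in zip(inputs, w_hidden))
              for j in range(len(w_output))]
    prediction = sum(h * w for h, w in zip(hidden, w_output))
    return hidden, prediction


def train_step(inputs, actual, w_hidden, w_output, learning_rate):
    """One forward pass followed by one weight update."""
    hidden, prediction = forward(inputs, w_hidden, w_output)
    delta = prediction - actual
    new_output = [w - learning_rate * delta * h
                  for w, h in zip(w_output, hidden)]
    # The hidden-layer update uses the old output weights.
    new_hidden = [[w - learning_rate * delta * i * w_out
                   for w, w_out in zip(row, w_output)]
                  for i, row in zip(inputs, w_hidden)]
    return new_hidden, new_output


inputs, actual = [2.0, 3.0], 1.0
w_hidden = [[0.12, 0.15], [0.33, 0.09]]    # [[w1, w3], [w2, w4]]
w_output = [0.11, 0.12]                    # [w5, w6]

predictions = []
for step in range(1000):                   # hypothetical step cap
    _, prediction = forward(inputs, w_hidden, w_output)
    predictions.append(prediction)
    if 0.5 * (prediction - actual) ** 2 < 1e-10:   # hypothetical threshold
        break
    w_hidden, w_output = train_step(inputs, actual, w_hidden, w_output, 0.05)

print(len(predictions), round(predictions[-1], 4))
```

The first two entries of `predictions` correspond to the two forward passes worked out by hand in this post.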