# Back Propagation Neural Network

As briefly mentioned in the previous section, training a neural network consists of two main parts: the **Forward Pass** and the **Backward Pass**. In the figure below, the blue arrow is the **Forward Pass** and the red arrow is the **Backward Pass**.

<img src="img/nn_training.png">

In supervised learning, the training data consists of inputs and outputs (targets). In the **Forward Pass**, the input is propagated through the network to the output layer, and the predicted output is compared to the target using a function commonly called the **Loss Function**.

What is the loss function for? Put simply, the **Loss Function** measures how well the neural network predicts the targets: the smaller the loss, the closer the prediction is to the target.

$Loss = (Target - Prediction)^2$

There are various loss functions. The most common choice for regression is Squared Error (L2 Loss), and the most common choice for classification is Cross Entropy.
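As a rough illustration of the two losses named above, here is a minimal sketch in plain Python. The function names `squared_error` and `cross_entropy` are hypothetical, and the cross-entropy form shown (a one-hot target against predicted class probabilities) is an assumption, since the text does not spell out its formula.

```python
import math

def squared_error(target, prediction):
    """Squared error (L2 loss) for a single regression target."""
    return (target - prediction) ** 2

def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross entropy between a one-hot target and predicted class probabilities.

    eps avoids log(0); it is an illustrative safeguard, not part of the definition.
    """
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))

l2 = squared_error(3.0, 2.5)                  # regression example
ce = cross_entropy([0.0, 1.0], [0.2, 0.8])    # two-class example, true class is index 1
print(l2)
print(round(ce, 4))
```

Both values shrink toward zero as the prediction approaches the target, which is the property the training loop relies on.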

# Implementing the Backward Pass (Back-Propagation)

In short, this process adjusts each weight and bias based on the error obtained in the forward pass. The stages of backprop are as follows:

- Calculate the gradient of the loss function with respect to every parameter by taking partial derivatives. The **Chain Rule** makes this possible layer by layer. For those who are still unsure what a gradient is, the illustration below may help.

<img src="img/gradient_descent.gif">

- Update all parameters (weights and biases) using **Stochastic Gradient Descent (SGD)**: subtract from each parameter a fraction of its gradient, where the fraction is set by the ***learning rate***.

<img src="img/nn_bp.png">

The neural network above consists of **2 hidden layers**. The first hidden layer uses **ReLU**, the second hidden layer uses **Sigmoid**, and the output layer uses a linear activation function. The biases exist but are not drawn in the diagram.

There are **4 weights** and **4 biases** between the input layer and the **first hidden layer**, **8 weights** and **2 biases** between the **first and second hidden layers**, and **2 weights** and **1 bias** between the second hidden layer and the output layer. In total there are **21 parameters** to update.

The goal here is to predict a value from the following input and target output.
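The parameter count above can be checked with a few lines of Python. This is a minimal sketch: the layer sizes (1 input, 4 ReLU units, 2 sigmoid units, 1 linear output) are read off the description in the text, and `count_parameters` is a hypothetical helper name.

```python
# Layer sizes taken from the text: 1 input, 4 ReLU units, 2 sigmoid units, 1 linear output.
layer_sizes = [1, 4, 2, 1]

def count_parameters(sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per receiving node
    return total

total_params = count_parameters(layer_sizes)
print(total_params)  # 21
```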

$input = [2.0]$

$output = [3.0]$

The initial weights and biases are chosen so that the calculations are easy to follow by hand.

- **Weight**

$W_{ij} = \begin{bmatrix}w_{ij_{1}} & w_{ij_{2}} & w_{ij_{3}} & w_{ij_{4}} \end{bmatrix} = \begin{bmatrix}0.25 & 0.5 & 0.75 & 1.0 \end{bmatrix}$

$W_{jk} = \begin{bmatrix}
w_{j_{1}k_{1}} & w_{j_{1}k_{2}} \\
w_{j_{2}k_{1}} & w_{j_{2}k_{2}} \\
w_{j_{3}k_{1}} & w_{j_{3}k_{2}} \\
w_{j_{4}k_{1}} & w_{j_{4}k_{2}}
\end{bmatrix} = \begin{bmatrix}
1.0 & 0 \\
0.75 & 0.25 \\
0.5 & 0.5 \\
0.25 & 0.75
\end{bmatrix}$

$W_{ko} = \begin{bmatrix}w_{k_{1}o} \\ w_{k_{2}o} \end{bmatrix} = \begin{bmatrix}1.0 \\ 0.5 \end{bmatrix}$

- **Bias**

$b_{ij} = \begin{bmatrix}b_{ij_{1}} & b_{ij_{2}} & b_{ij_{3}} & b_{ij_{4}} \end{bmatrix} = \begin{bmatrix} 1.0 & 1.0 & 1.0 & 1.0 \end{bmatrix}$

$b_{jk} = \begin{bmatrix}b_{jk_{1}} & b_{jk_{2}} \end{bmatrix} = \begin{bmatrix} 1.0 & 1.0 \end{bmatrix}$

$b_{o} = \begin{bmatrix} 1.0 \end{bmatrix}$

## Input to Hidden Layer 1 (Forward Pass)

<img src="img/nn_input_hidden_layer_1.png">

Here the input is forwarded to hidden layer 1: it is multiplied by the weights (a dot product) and the biases are added.

$\begin{bmatrix}j1_{in} & j2_{in} & j3_{in} & j4_{in} \end{bmatrix} = \begin{bmatrix}input \end{bmatrix} \times \begin{bmatrix}W_{ij_{1}} & W_{ij_{2}} & W_{ij_{3}} & W_{ij_{4}} \end{bmatrix} + \begin{bmatrix}b_{ij_{1}} & b_{ij_{2}} & b_{ij_{3}} & b_{ij_{4}} \end{bmatrix}$

$\begin{bmatrix}j1_{in} & j2_{in} & j3_{in} & j4_{in} \end{bmatrix} = \begin{bmatrix}2.0 \end{bmatrix} \times \begin{bmatrix}0.25 & 0.5 & 0.75 & 1.0 \end{bmatrix} + \begin{bmatrix}1.0 & 1.0 & 1.0 & 1.0 \end{bmatrix}$

$\begin{bmatrix}j1_{in} & j2_{in} & j3_{in} & j4_{in} \end{bmatrix} = \begin{bmatrix}0.5 & 1.0 & 1.5 & 2.0 \end{bmatrix} + \begin{bmatrix}1.0 & 1.0 & 1.0 & 1.0 \end{bmatrix}$

$\begin{bmatrix}j1_{in} & j2_{in} & j3_{in} & j4_{in} \end{bmatrix} = \begin{bmatrix}1.5 & 2.0 & 2.5 & 3.0 \end{bmatrix}$

The values above are the inputs to each node in hidden layer 1. Each value then passes through the activation function. Hidden layer 1 uses **ReLU** $\rightarrow f(x) = max(0, x)$, so its output is:

$ReLU\left(\begin{bmatrix}j1_{in} & j2_{in} & j3_{in} & j4_{in} \end{bmatrix}\right) = \begin{bmatrix}max(0, j1_{in}) & max(0, j2_{in}) & max(0, j3_{in}) & max(0, j4_{in}) \end{bmatrix}$

$ReLU\left(\begin{bmatrix}j1_{in} & j2_{in} & j3_{in} & j4_{in} \end{bmatrix}\right) = \begin{bmatrix}max(0, 1.5) & max(0, 2.0) & max(0, 2.5) & max(0, 3.0) \end{bmatrix}$

$ReLU\left(\begin{bmatrix}j1_{in} & j2_{in} & j3_{in} & j4_{in} \end{bmatrix}\right) = \begin{bmatrix}1.5 & 2.0 & 2.5 & 3.0 \end{bmatrix}$

$\begin{bmatrix}j1_{out} & j2_{out} & j3_{out} & j4_{out} \end{bmatrix} = \begin{bmatrix}1.5 & 2.0 & 2.5 & 3.0 \end{bmatrix}$
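The forward pass into hidden layer 1 can be reproduced in plain Python. This is a minimal sketch using the input, weights, and biases given above; the variable names (`w_ij`, `b_ij`, `j_in`, `j_out`) are chosen here to mirror the notation and are otherwise arbitrary.

```python
x = 2.0                          # the single input value
w_ij = [0.25, 0.5, 0.75, 1.0]    # weights from the input to the four ReLU nodes
b_ij = [1.0, 1.0, 1.0, 1.0]      # one bias per ReLU node

def relu(value):
    """ReLU activation: max(0, x)."""
    return max(0.0, value)

# Weighted input of each node, then the activation.
j_in = [x * w + b for w, b in zip(w_ij, b_ij)]
j_out = [relu(v) for v in j_in]

print(j_in)   # [1.5, 2.0, 2.5, 3.0]
print(j_out)  # [1.5, 2.0, 2.5, 3.0]
```

Because every weighted input is positive, ReLU passes the values through unchanged in this example.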

## Hidden Layer 1 to Hidden Layer 2 (Forward Pass)

<img src="img/nn_hidden_layer_1_hidden_layer_2.png">

Just like the **Forward Pass** on the previous layer, the output of each neuron in the **ReLU** layer will flow to all neurons in the **Sigmoid** layer.

$\begin{bmatrix} k_{1in} & k_{2in}\end{bmatrix} = \begin{bmatrix}j1_{out} & j2_{out} & j3_{out} & j4_{out} \end{bmatrix} \times \begin{bmatrix} 
w_{j_{1}k_{1}} & w_{j_{1}k_{2}} \\
w_{j_{2}k_{1}} & w_{j_{2}k_{2}} \\
w_{j_{3}k_{1}} & w_{j_{3}k_{2}} \\
w_{j_{4}k_{1}} & w_{j_{4}k_{2}}
\end{bmatrix} + \begin{bmatrix}b_{jk_{1}} & b_{jk_{2}}\end{bmatrix}$

$\begin{bmatrix} k_{1in} & k_{2in}\end{bmatrix} = \begin{bmatrix}1.5 & 2 & 2.5 & 3.0 \end{bmatrix} \times \begin{bmatrix} 
1.0 & 0 \\
0.75 & 0.25 \\
0.5 & 0.5 \\
0.25 & 0.75
\end{bmatrix} + \begin{bmatrix}1.0 & 1.0 \end{bmatrix}$

$\begin{bmatrix} k_{1in} & k_{2in}\end{bmatrix} = \begin{bmatrix} 5.0 & 4.0 \end{bmatrix} + \begin{bmatrix} 1.0 & 1.0 \end{bmatrix}$

$\begin{bmatrix} k_{1in} & k_{2in}\end{bmatrix} = \begin{bmatrix} 6.0 & 5.0 \end{bmatrix}$

After activation function (**Sigmoid**):

$Sigmoid \rightarrow f(x) = \frac{1}{1 + e^{-x}}$

$Sigmoid \left(\begin{bmatrix}k_{1in} & k_{2in} \end{bmatrix}\right) = \begin{bmatrix}\frac{1}{1 + e^{-k_{1in}}} & \frac{1}{1 + e^{-k_{2in}}} \end{bmatrix}$

$Sigmoid \left(\begin{bmatrix}k_{1in} & k_{2in} \end{bmatrix}\right) = \begin{bmatrix}\frac{1}{1 + e^{-6}} & \frac{1}{1 + e^{-5}} \end{bmatrix}$

$Sigmoid \left(\begin{bmatrix}k_{1in} & k_{2in} \end{bmatrix}\right) = \begin{bmatrix}0.9975 & 0.9933 \end{bmatrix}$

$\begin{bmatrix}k_{1out} & k_{2out} \end{bmatrix} = \begin{bmatrix}0.9975 & 0.9933 \end{bmatrix}$
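The step from hidden layer 1 to hidden layer 2 can be checked the same way. This is a minimal sketch in plain Python: the 4x2 weight matrix is stored as a list of rows, and the names `w_jk`, `b_jk`, `k_in`, and `k_out` are chosen here to mirror the notation.

```python
import math

j_out = [1.5, 2.0, 2.5, 3.0]     # output of hidden layer 1 (ReLU)
w_jk = [[1.0, 0.0],
        [0.75, 0.25],
        [0.5, 0.5],
        [0.25, 0.75]]            # row = ReLU node, column = sigmoid node
b_jk = [1.0, 1.0]

def sigmoid(value):
    """Logistic sigmoid: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-value))

# Row vector times matrix, plus bias, for each of the two sigmoid nodes.
k_in = [sum(j_out[r] * w_jk[r][c] for r in range(4)) + b_jk[c] for c in range(2)]
k_out = [sigmoid(v) for v in k_in]

print(k_in)                            # [6.0, 5.0]
print([round(v, 4) for v in k_out])    # [0.9975, 0.9933]
```

Both sigmoid outputs sit very close to 1, which is the saturated region of the function. That becomes important in the backward pass.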

## Hidden Layer 2 to Output (Forward Pass)

<img src="img/nn_hidden_layer_2_output.png">

Just like the **Forward Pass** on the previous layer, the output of each neuron in the **Sigmoid** Layer will flow to the neurons in the **Linear** layer (Output).

$\begin{bmatrix}o_{in}\end{bmatrix} = \begin{bmatrix}k_{1out} & k_{2out}\end{bmatrix} \times \begin{bmatrix}
w_{k_{1}o} \\ 
w_{k_{2}o} 
\end{bmatrix} + \begin{bmatrix}b_{o}\end{bmatrix}$

$\begin{bmatrix}o_{in}\end{bmatrix} = \begin{bmatrix}0.9975 & 0.9933 \end{bmatrix} \times \begin{bmatrix}
1.0 \\ 
0.5 
\end{bmatrix} + \begin{bmatrix}1.0\end{bmatrix}$

$\begin{bmatrix}o_{in}\end{bmatrix} = \begin{bmatrix}1.494 \end{bmatrix} + \begin{bmatrix}1.0 \end{bmatrix}$

$\begin{bmatrix}o_{in}\end{bmatrix} = \begin{bmatrix}2.494 \end{bmatrix}$

After activation function (**Linear**)

$Linear \rightarrow f(x) = x$

$f\left(\begin{bmatrix}o_{in} \end{bmatrix}\right) = \begin{bmatrix}2.494 \end{bmatrix}$

$\begin{bmatrix}o_{out} \end{bmatrix} = \begin{bmatrix}2.494 \end{bmatrix}$
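The output layer can be verified with a short sketch as well. It recomputes the sigmoid outputs from the weighted inputs 6 and 5 found above, then applies the output weights and bias; the variable names are illustrative.

```python
import math

# Sigmoid outputs of hidden layer 2, recomputed from k_in = [6.0, 5.0].
k_out = [1.0 / (1.0 + math.exp(-6.0)), 1.0 / (1.0 + math.exp(-5.0))]
w_ko = [1.0, 0.5]   # weights from the two sigmoid nodes to the output node
b_o = 1.0           # output bias

o_in = sum(k * w for k, w in zip(k_out, w_ko)) + b_o
o_out = o_in        # linear activation: f(x) = x

print(round(o_out, 3))  # 2.494
```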

The predicted output has now been obtained. The next step is to compute the _loss_ using the **squared error** (L2 Loss). A factor of $\frac{1}{2}$ is included here because it simplifies the derivative in the backward pass.

$Loss = \frac{1}{2}(Prediction - Target)^2$

$Loss = \frac{1}{2}(o_{out} - output)^2$

$Loss = \frac{1}{2}(2.494 - 3)^2$

$Loss = \frac{1}{2}(-0.506)^2$

$Loss = \frac{1}{2}(0.256)$

$Loss = 0.128$
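The loss computation above is small enough to check directly. A minimal sketch, using the rounded prediction 2.494 from the forward pass; `half_squared_error` is a hypothetical name for the loss used in this walkthrough.

```python
def half_squared_error(prediction, target):
    """Loss = 1/2 * (prediction - target)^2."""
    return 0.5 * (prediction - target) ** 2

loss = half_squared_error(2.494, 3.0)
print(round(loss, 3))  # 0.128
```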

## Activation Function Derivatives

Before discussing the **Backward Pass**, it helps to look at the derivatives of the activation functions that will be used.

- **ReLU** Derivatives

$y = max(0, x)$

$\frac{\partial y}{\partial x} = \begin{cases}1 & x > 0\\0 & x \leq 0\end{cases}$

- **Sigmoid** Derivatives

$y = \frac{1}{1 + e^{-x}}$

$\frac{\partial y}{\partial x} = \frac{1}{1 + e^{-x}} \times \left(1 - \frac{1}{1 + e^{-x}}\right)$

- **Linear** Derivatives

$y = x$

$\frac{\partial y}{\partial x} = 1$
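One way to gain confidence in the three derivatives above is to compare each against a numerical (finite-difference) estimate. The sketch below does this in plain Python; `numeric_grad` and the step size `h` are illustrative choices, not part of the original walkthrough.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu_grad(x):
    """Derivative of max(0, x): 1 for x > 0, otherwise 0."""
    return 1.0 if x > 0 else 0.0

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def linear_grad(x):
    """Derivative of f(x) = x."""
    return 1.0

def numeric_grad(f, x, h=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Absolute difference between the analytic and the numerical derivative.
errors = {
    "relu": abs(relu_grad(1.5) - numeric_grad(lambda v: max(0.0, v), 1.5)),
    "sigmoid": abs(sigmoid_grad(6.0) - numeric_grad(sigmoid, 6.0)),
    "linear": abs(linear_grad(2.494) - numeric_grad(lambda v: v, 2.494)),
}
print(errors)
```

All three differences should be tiny. The comparison is made away from $x = 0$, where ReLU is not differentiable.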

## Output to Hidden Layer 2 (Backward Pass)

<img src="img/bp_output_hidden_layer_2.png">

Much like the **Forward Pass**, in the **Backward Pass** the loss flows back through all nodes in the hidden layers so that the _gradients_ can be found and the parameters updated. For example, to update the $w_{k_{1}o}$ parameter, the **Chain Rule** below can be used.

$\frac{\partial Loss}{\partial w_{k_{1}o}} = \frac{\partial Loss}{\partial o_{out}} \times \frac{\partial o_{out}}{\partial o_{in}} \times \frac{\partial o_{in}}{\partial w_{k_{1}o}}$

First, look at how much the ***Loss*** changes with respect to the output. This is the ***partial derivative*** of the ***loss function*** with respect to the ***output***, also called the ***gradient*** of the loss with respect to the output. In the equation below the ***Loss*** is multiplied by $\frac{1}{2}$ so that the factor of 2 produced by differentiating the square cancels, leaving the derivative equal to exactly 1 times the error.

- **Weight**

$Loss = \frac{1}{2} \left(output - o_{out} \right)^2$

$\frac{\partial Loss}{\partial o_{out}} = \frac{\partial \left(\frac{1}{2} \left(output - o_{out} \right)^2 \right)}{\partial o_{out}}$

$\frac{\partial Loss}{\partial o_{out}} = -1 \times 2 \times \frac{1}{2} \left(output - o_{out} \right)$

$\frac{\partial Loss}{\partial o_{out}} = o_{out} - output$

$\frac{\partial Loss}{\partial o_{out}} = 2.494 - 3$

$\frac{\partial Loss}{\partial o_{out}} = -0.506$

- **Bias**

The bias $b_{o}$ affects the loss only through $o_{out}$, so the same gradient of the loss with respect to $o_{out}$ is reused.

$Loss = \frac{1}{2} \left(output - o_{out} \right)^2$

$\frac{\partial Loss}{\partial o_{out}} = \frac{\partial \left(\frac{1}{2} \left(output - o_{out} \right)^2 \right)}{\partial o_{out}}$

$\frac{\partial Loss}{\partial o_{out}} = -1 \times 2 \times \frac{1}{2} \left(output - o_{out} \right)$

$\frac{\partial Loss}{\partial o_{out}} = -0.506$

Next we will look for the **gradient from $o_{out}$ to $o_{in}$**. Because the activation function used is Linear, its derivatives are very easy to find.

$o_{out} = o_{in}$

$\frac{\partial o_{out}}{\partial o_{in}} = \frac{\partial (o_{in})}{\partial o_{in}}$

$\frac{\partial o_{out}}{\partial o_{in}} = 1$

After that, find the **gradient of $o_{in}$ with respect to $w_{k_{1}o}$, $w_{k_{2}o}$, and the bias $b_{o}$**. Consider the equation below:

- **Weight**

$o_{in} = w_{k_{1}o}k_{1out} + w_{k_{2}o}k_{2out} + b_{o}$

$\frac{\partial\ o_{in}}{\partial\ w_{k_{1}o}} = \frac{\partial \left(w_{k_{1}o}k_{1out} + w_{k_{2}o}k_{2out} + b_{o} \right)}{\partial\ w_{k_{1}o}}$

$\begin{bmatrix} 
\frac{\partial o_{in}}{\partial w_{k_{1}o}} \\ 
\frac{\partial o_{in}}{\partial w_{k_{2}o}}
\end{bmatrix} =
\begin{bmatrix}
k_{1out} \\ 
k_{2out}
\end{bmatrix}  =
\begin{bmatrix}
0.9975 \\ 
0.9933
\end{bmatrix}$

- **Bias**

The **bias** enters $o_{in}$ as a plain additive term, just as in the **Forward Pass**, so its derivative is 1.

$\begin{bmatrix} 
\frac{\partial o_{in}}{\partial b_{o}}
\end{bmatrix} = \begin{bmatrix} 
1.0
\end{bmatrix}$

Finally, the chain rule can be applied to find the **gradient of the loss with respect to the weights and bias**.

- **Gradient loss** toward **weight** in Hidden Layer 2

$\begin{bmatrix} 
\frac{\partial Loss}{\partial w_{k_{1}o}} \\ 
\frac{\partial Loss}{\partial w_{k_{2}o}}
\end{bmatrix} =
\begin{bmatrix}
\frac{\partial Loss}{\partial o_{out}} \times \frac{\partial o_{out}}{\partial o_{in}} \times \frac{\partial o_{in}}{\partial w_{k_{1}o}} \\ 
\frac{\partial Loss}{\partial o_{out}} \times \frac{\partial o_{out}}{\partial o_{in}} \times \frac{\partial o_{in}}{\partial w_{k_{2}o}}
\end{bmatrix}$

$\begin{bmatrix} 
\frac{\partial Loss}{\partial w_{k_{1}o}} \\ 
\frac{\partial Loss}{\partial w_{k_{2}o}}
\end{bmatrix} =
\begin{bmatrix}
-0.506 \times 1 \times 0.9975 \\ 
-0.506 \times 1 \times 0.9933
\end{bmatrix}$

$\begin{bmatrix} 
\frac{\partial Loss}{\partial w_{k_{1}o}} \\ 
\frac{\partial Loss}{\partial w_{k_{2}o}}
\end{bmatrix} =
\begin{bmatrix}
-0.50474 \\ 
-0.50261
\end{bmatrix}$

- **Gradient loss** for **bias** ($b_{o}$) in Hidden Layer 2

$\begin{bmatrix} 
\frac{\partial Loss}{\partial b_{o}}
\end{bmatrix} =
\begin{bmatrix}
\frac{\partial Loss}{\partial o_{out}} \times \frac{\partial o_{out}}{\partial o_{in}} \times \frac{\partial o_{in}}{\partial b_{o}}
\end{bmatrix}$

$\begin{bmatrix} 
\frac{\partial Loss}{\partial b_{o}}
\end{bmatrix} = \left[-0.506 \times 1 \times 1.0 \right]$

$\begin{bmatrix} 
\frac{\partial Loss}{\partial b_{o}}
\end{bmatrix} = \left[ -0.506\right]$
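The output-layer gradients can be reproduced with a short sketch in plain Python. It recomputes the forward pass at full precision, so the printed values agree with the hand calculation to about three decimals rather than digit for digit. The variable names are illustrative.

```python
import math

target = 3.0
k_out = [1.0 / (1.0 + math.exp(-6.0)), 1.0 / (1.0 + math.exp(-5.0))]
w_ko = [1.0, 0.5]
b_o = 1.0

# Forward pass through the linear output node.
o_in = sum(k * w for k, w in zip(k_out, w_ko)) + b_o
o_out = o_in

# Chain rule factors.
d_loss_d_o_out = o_out - target   # derivative of 1/2 * (target - o_out)^2
d_o_out_d_o_in = 1.0              # linear activation
d_o_in_d_w = k_out                # o_in is linear in each weight
d_o_in_d_b = 1.0                  # the bias is a plain additive term

grad_w_ko = [d_loss_d_o_out * d_o_out_d_o_in * d for d in d_o_in_d_w]
grad_b_o = d_loss_d_o_out * d_o_out_d_o_in * d_o_in_d_b

print([round(g, 4) for g in grad_w_ko])
print(round(grad_b_o, 4))
```

The small differences from the hand-computed values come from rounding $o_{out}$ to 2.494 before differentiating.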

### Stochastic Gradient Descent (SGD) Update for Output to Hidden Layer 2

**SGD** is an algorithm used to update the parameters, in this case the weights and biases. The algorithm is quite simple: each parameter is reduced by "a portion" of its gradient.

The size of this portion is set by a hyper-parameter called the learning rate ($\alpha$). This example uses a learning rate of 0.25, even though in practice a learning rate of 0.25 is rarely ideal. (Setting *hyper-parameters* will be discussed later.)

- SGD update for **Weight** in Hidden Layer 2

$w'_{k_{1}o} = w_{k_{1}o} - \alpha \left(\frac{\partial Loss}{\partial w_{k_{1}o}} \right) = 1 - 0.25(-0.50474) = 1.1262$

$w'_{k_{2}o} = w_{k_{2}o} - \alpha \left(\frac{\partial Loss}{\partial w_{k_{2}o}} \right) = 0.5 - 0.25(-0.50261) \approx 0.6256$

- SGD update for **Bias** in Hidden Layer 2

$b'_{o} = b_{o} - \alpha \left(\frac{\partial Loss}{\partial b_{o}} \right) = 1 - 0.25(-0.506) = 1.1265$

New parameters after update:

$w_{ko} = 
\begin{bmatrix} 
w_{k_{1}o} \\ 
w_{k_{2}o}
\end{bmatrix} = 
\begin{bmatrix} 
1.1262 \\ 
0.6256
\end{bmatrix}$

$b_{o} = \left[ 1.1265\right]$
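The update rule itself is a one-liner. The sketch below applies it to the output-layer parameters, taking the rounded gradients from the hand calculation above as given; `sgd_step` is a hypothetical helper name.

```python
def sgd_step(param, grad, learning_rate):
    """One SGD update: move the parameter against its gradient."""
    return param - learning_rate * grad

alpha = 0.25                        # learning rate used in this walkthrough
w_ko = [1.0, 0.5]                   # initial output weights
grad_w_ko = [-0.50474, -0.50261]    # rounded gradients from the hand calculation
b_o, grad_b_o = 1.0, -0.506

new_w_ko = [sgd_step(w, g, alpha) for w, g in zip(w_ko, grad_w_ko)]
new_b_o = sgd_step(b_o, grad_b_o, alpha)

print([round(w, 4) for w in new_w_ko])
print(round(new_b_o, 4))
```

Because the gradients are negative, subtracting them increases each parameter, which pushes the prediction up toward the target of 3.0.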

## Hidden Layer 2 to Hidden Layer 1 (Backward Pass)

<img src="img/bp_hidden_layer_2_hidden_layer_1.png">

Repeat the steps from the **Backward Pass** of the previous layer. This step needs more care, because it is somewhat more involved than the previous one.

$\frac{\partial Loss}{\partial w_{j_{1}k_{1}}} = \frac{\partial Loss}{\partial k_{1out}} \times \frac{\partial k_{1out}}{\partial k_{1in}} \times \frac{\partial k_{1in}}{\partial w_{j_{1}k_{1}}}$

To find the gradient of the loss with respect to $w_{j_{1}k_{1}}$, the chain rule is used again. First, find the gradient of the loss with respect to $k_{1out}$. Since $o_{in} = w_{k_{1}o}k_{1out} + w_{k_{2}o}k_{2out} + b_{o}$, the derivative of $o_{in}$ with respect to $k_{1out}$ is simply $w_{k_{1}o}$.

$\frac{\partial Loss}{\partial k_{1out}} = \frac{\partial Loss}{\partial o_{out}} \times \frac{\partial o_{out}}{\partial o_{in}} \times \frac{\partial o_{in}}{\partial k_{1out}}$

$...$

The gradients must be computed with the weights that were used in the forward pass, so the initial $w_{ko}$ is used here, not the updated one.

$w_{ko} = 
\begin{bmatrix} 
w_{k_{1}o} \\
w_{k_{2}o}
\end{bmatrix} = 
\begin{bmatrix} 
1.0 \\
0.5
\end{bmatrix}, \quad the\ initial\ weight$

$...$

$\frac{\partial Loss}{\partial k_{1out}} = -0.506 \times 1 \times w_{k_{1}o}$

$\frac{\partial Loss}{\partial k_{1out}} = -0.506 \times 1 \times 1.0$

$...$

$\frac{\partial Loss}{\partial k_{2out}} = -0.506 \times 1 \times w_{k_{2}o}$

$\frac{\partial Loss}{\partial k_{2out}} = -0.506 \times 1 \times 0.5$

$...$

$\begin{bmatrix} 
\frac{\partial Loss}{\partial k_{1out}} & \frac{\partial Loss}{\partial k_{2out}}
\end{bmatrix} = \begin{bmatrix} -0.506 & -0.253 \end{bmatrix}$

Then find the gradient of $k_{1out}$ with respect to $k_{1in}$. This uses the derivative of the **sigmoid**, which was already derived earlier.

$k_{1out} = \frac{1}{1 + e^{-k_{1in}}}$

$\frac{\partial k_{1out}}{\partial k_{1in}} = \frac{\partial \left( \frac{1}{1 + e^{-k_{1in}}} \right)}{\partial k_{1in}}$

$\frac{\partial k_{1out}}{\partial k_{1in}} = \frac{1}{1 + e^{-k_{1in}}} \times \left(1 - \frac{1}{1 + e^{-k_{1in}}} \right)$

$\frac{\partial k_{1out}}{\partial k_{1in}} = \frac{1}{1 + e^{-6}} \times \left(1 - \frac{1}{1 + e^{-6}} \right)$

$...$

$k_{2out} = \frac{1}{1 + e^{-k_{2in}}}$

$\frac{\partial k_{2out}}{\partial k_{2in}} = \frac{\partial \left( \frac{1}{1 + e^{-k_{2in}}} \right)}{\partial k_{2in}}$

$\frac{\partial k_{2out}}{\partial k_{2in}} = \frac{1}{1 + e^{-k_{2in}}} \times \left(1 - \frac{1}{1 + e^{-k_{2in}}} \right)$

$\frac{\partial k_{2out}}{\partial k_{2in}} = \frac{1}{1 + e^{-5}} \times \left(1 - \frac{1}{1 + e^{-5}} \right)$

$...$

$\begin{bmatrix}
\frac{\partial k_{1out}}{\partial k_{1in}} \\
\frac{\partial k_{2out}}{\partial k_{2in}}
\end{bmatrix} = 
\begin{bmatrix}
0.00249 \\
0.00665
\end{bmatrix}$

Next, find the gradient of $k_{1in}$ with respect to $w_{j_{1}k_{1}}$.

$k_{1in} = w_{j_{1}k_{1}}j_{1out} + w_{j_{2}k_{1}}j_{2out} + w_{j_{3}k_{1}}j_{3out} + w_{j_{4}k_{1}}j_{4out} + b_{jk_{1}}$

$\frac{\partial k_{1in}}{\partial w_{j_{1}k_{1}}} = \frac{\partial \left(w_{j_{1}k_{1}}j_{1out} + w_{j_{2}k_{1}}j_{2out} + w_{j_{3}k_{1}}j_{3out} + w_{j_{4}k_{1}}j_{4out} + b_{jk_{1}} \right)}{\partial w_{j_{1}k_{1}}}$

$\begin{bmatrix} \frac{\partial k_{1in}}{\partial w_{j_{1}k_{1}}} & \frac{\partial k_{1in}}{\partial w_{j_{2}k_{1}}} & \frac{\partial k_{1in}}{\partial w_{j_{3}k_{1}}} & \frac{\partial k_{1in}}{\partial w_{j_{4}k_{1}}} \end{bmatrix} = \begin{bmatrix} j_{1out} & j_{2out} & j_{3out} & j_{4out} \end{bmatrix}$

$\begin{bmatrix} \frac{\partial k_{1in}}{\partial w_{j_{1}k_{1}}} & \frac{\partial k_{1in}}{\partial w_{j_{2}k_{1}}} & \frac{\partial k_{1in}}{\partial w_{j_{3}k_{1}}} & \frac{\partial k_{1in}}{\partial w_{j_{4}k_{1}}} \end{bmatrix} = \begin{bmatrix} 1.5 & 2.0 & 2.5 & 3.0 \end{bmatrix}$

$...$

$k_{2in} = w_{j_{1}k_{2}}j_{1out} + w_{j_{2}k_{2}}j_{2out} + w_{j_{3}k_{2}}j_{3out} + w_{j_{4}k_{2}}j_{4out} + b_{jk_{2}}$

$\frac{\partial k_{2in}}{\partial w_{j_{1}k_{2}}} = \frac{\partial \left(w_{j_{1}k_{2}}j_{1out} + w_{j_{2}k_{2}}j_{2out} + w_{j_{3}k_{2}}j_{3out} + w_{j_{4}k_{2}}j_{4out} + b_{jk_{2}} \right)}{\partial w_{j_{1}k_{2}}}$

$\begin{bmatrix} \frac{\partial k_{2in}}{\partial w_{j_{1}k_{2}}} & \frac{\partial k_{2in}}{\partial w_{j_{2}k_{2}}} & \frac{\partial k_{2in}}{\partial w_{j_{3}k_{2}}} & \frac{\partial k_{2in}}{\partial w_{j_{4}k_{2}}} \end{bmatrix} = \begin{bmatrix} j_{1out} & j_{2out} & j_{3out} & j_{4out} \end{bmatrix}$

$\begin{bmatrix} \frac{\partial k_{2in}}{\partial w_{j_{1}k_{2}}} & \frac{\partial k_{2in}}{\partial w_{j_{2}k_{2}}} & \frac{\partial k_{2in}}{\partial w_{j_{3}k_{2}}} & \frac{\partial k_{2in}}{\partial w_{j_{4}k_{2}}} \end{bmatrix} = \begin{bmatrix} 1.5 & 2.0 & 2.5 & 3.0 \end{bmatrix}$

Now calculate the gradient of the loss with respect to $w_{j_{1}k_{1}}$ by applying the **chain rule** as in the previous section.

$\frac{\partial Loss}{\partial w_{j_{1}k_{1}}} = \frac{\partial Loss}{\partial k_{1out}} \times \frac{\partial k_{1out}}{\partial k_{1in}} \times \frac{\partial k_{1in}}{\partial w_{j_{1}k_{1}}}$

$\frac{\partial Loss}{\partial w_{j_{1}k_{1}}} = -0.506 \times \color{red}{0.00249} \times 1.5$

$\frac{\partial Loss}{\partial w_{j_{1}k_{1}}} = -0.00189$

The gradient has been found. Notice the number in red: the gradient of the sigmoid is quite small at $0.00249$, and after applying the **Chain Rule** the result is even smaller in magnitude at $-0.00189$.

This phenomenon is called ***Vanishing Gradient*** and is one reason why sigmoid is rarely used in hidden layers anymore.

Applying the same calculation to every parameter gives all the gradients needed for the update. The values below are rounded to five decimals; intermediate products are carried at full precision, so recomputing from the displayed values can differ in the last digit.

Because these gradients are very small, the updates are small too, and they shrink further once multiplied by the learning rate. The closer a layer is to the input, the more such factors its gradient passes through, so training those layers takes longer.

$\begin{bmatrix} 
\frac{\partial Loss}{\partial w_{j_{1}k_{1}}} & \frac{\partial Loss}{\partial w_{j_{1}k_{2}}} \\
\frac{\partial Loss}{\partial w_{j_{2}k_{1}}} & \frac{\partial Loss}{\partial w_{j_{2}k_{2}}} \\
\frac{\partial Loss}{\partial w_{j_{3}k_{1}}} & \frac{\partial Loss}{\partial w_{j_{3}k_{2}}} \\
\frac{\partial Loss}{\partial w_{j_{4}k_{1}}} & \frac{\partial Loss}{\partial w_{j_{4}k_{2}}}
\end{bmatrix} \approx 
\begin{bmatrix} 
-0.00189 & -0.00252 \\
-0.00252 & -0.00336 \\
-0.00315 & -0.00421 \\
-0.00378 & -0.00505
\end{bmatrix}$

$\begin{bmatrix} 
\frac{\partial Loss}{\partial b_{jk_{1}}} & \frac{\partial Loss}{\partial b_{jk_{2}}} 
\end{bmatrix} \approx 
\begin{bmatrix}
-0.00126 & -0.00168
\end{bmatrix}$

### SGD Update for Hidden Layer 2 to Hidden Layer 1

Once the gradients are known, the new **weight** and **bias** values are easy to find. The learning rate is still 0.25. As described above, the changes in **weight** and **bias** are very small.

$\begin{bmatrix}
w'_{j_{1}k_{1}} & w'_{j_{1}k_{2}} \\
w'_{j_{2}k_{1}} & w'_{j_{2}k_{2}} \\
w'_{j_{3}k_{1}} & w'_{j_{3}k_{2}} \\
w'_{j_{4}k_{1}} & w'_{j_{4}k_{2}}
\end{bmatrix} = 
\begin{bmatrix}
w_{j_{1}k_{1}} - \alpha \left(\frac{\partial Loss}{\partial w_{j_{1}k_{1}}} \right) &
w_{j_{1}k_{2}} - \alpha \left(\frac{\partial Loss}{\partial w_{j_{1}k_{2}}} \right) \\
w_{j_{2}k_{1}} - \alpha \left(\frac{\partial Loss}{\partial w_{j_{2}k_{1}}} \right) &
w_{j_{2}k_{2}} - \alpha \left(\frac{\partial Loss}{\partial w_{j_{2}k_{2}}} \right) \\
w_{j_{3}k_{1}} - \alpha \left(\frac{\partial Loss}{\partial w_{j_{3}k_{1}}} \right) &
w_{j_{3}k_{2}} - \alpha \left(\frac{\partial Loss}{\partial w_{j_{3}k_{2}}} \right) \\
w_{j_{4}k_{1}} - \alpha \left(\frac{\partial Loss}{\partial w_{j_{4}k_{1}}} \right) &
w_{j_{4}k_{2}} - \alpha \left(\frac{\partial Loss}{\partial w_{j_{4}k_{2}}} \right) \\
\end{bmatrix} \approx 
\begin{bmatrix}
1.00047 & 0.00063 \\
0.75063 & 0.25084 \\
0.50079 & 0.50105 \\
0.25094 & 0.75126
\end{bmatrix}$

$\begin{bmatrix} 
b'_{jk_{1}} & b'_{jk_{2}}
\end{bmatrix} = 
\begin{bmatrix} 
b_{jk_{1}} - \alpha \left(\frac{\partial Loss}{\partial b_{jk_{1}}} \right) &
b_{jk_{2}} - \alpha \left(\frac{\partial Loss}{\partial b_{jk_{2}}} \right)
\end{bmatrix} \approx 
\begin{bmatrix} 
1.00031 & 1.00042
\end{bmatrix}$

## Hidden Layer 1 to Input Layer (Backward Pass)

<img src="img/bp_output_hidden_layer_1_input_layer.png">

Now apply the same steps once more to update the **weight** and **bias** between the input layer and hidden layer 1.

$\frac{\partial Loss}{\partial w_{ij_{1}}} = \frac{\partial Loss}{\partial j_{1out}} \times \frac{\partial j_{1out}}{\partial j_{1in}} \times \frac{\partial j_{1in}}{\partial w_{ij_{1}}}$

First, find the gradient of the loss with respect to $j_{1out}$. This is more involved than the calculation for $k_{1out}$, because $j_{1out}$ feeds both $k_{1}$ and $k_{2}$. The loss depends on $j_{1out}$ through two paths, so the gradient is the sum of the contributions from each path.

$\frac{\partial Loss}{\partial j_{1out}} = \frac{\partial Loss}{\partial k_{1out}} \times \frac{\partial k_{1out}}{\partial k_{1in}} \times \frac{\partial k_{1in}}{\partial j_{1out}} + \frac{\partial Loss}{\partial k_{2out}} \times \frac{\partial k_{2out}}{\partial k_{2in}} \times \frac{\partial k_{2in}}{\partial j_{1out}}$

$...$

$\frac{\partial Loss}{\partial k_{1out}} \times \frac{\partial k_{1out}}{\partial k_{1in}} = -0.506 \times 0.00249 = -0.00126$

$\frac{\partial Loss}{\partial k_{2out}} \times \frac{\partial k_{2out}}{\partial k_{2in}} = -0.253 \times 0.00665 = -0.00168$

$\frac{\partial k_{1in}}{\partial j_{1out}} = w_{j_{1}k_{1}} = 1.0$

$\frac{\partial k_{2in}}{\partial j_{1out}} = w_{j_{1}k_{2}} = 0$

$...$

$\frac{\partial Loss}{\partial j_{1out}} = \left(-0.00126 \times 1.0 \right) + \left(-0.00168 \times 0 \right)$

$\frac{\partial Loss}{\partial j_{1out}} = -0.00126$

Continue with the gradient of $j_{1out}$ with respect to $j_{1in}$.

$j_{1out} = max(0, j_{1in})$

$j_{1out} = max(0, 1.5)$

$\frac{\partial j_{1out}}{\partial j_{1in}} = \frac{\partial (ReLU)}{\partial j_{1in}} = \begin{cases}1 & j_{1in} > 0\\0 & j_{1in} \leq 0\end{cases}$

$\frac{\partial j_{1out}}{\partial j_{1in}} = 1$

The next step is to find the gradient of $j_{1in}$ with respect to $w_{ij_{1}}$.

$j_{1in} = w_{ij_{1}}i + b_{ij_{1}}$

$\frac{\partial j_{1in}}{\partial w_{ij_{1}}} = \frac{\partial \left(w_{ij_{1}}i + b_{ij_{1}} \right)}{\partial w_{ij_{1}}}$

$\frac{\partial j_{1in}}{\partial w_{ij_{1}}} = i$

$\frac{\partial j_{1in}}{\partial w_{ij_{1}}} = 2.0$

Finally, the gradient of the loss with respect to $w_{ij_{1}}$ can be calculated by applying the **chain rule**.

$\frac{\partial Loss}{\partial w_{ij_{1}}} = \frac{\partial Loss}{\partial j_{1out}} \times \frac{\partial j_{1out}}{\partial j_{1in}} \times \frac{\partial j_{1in}}{\partial w_{ij_{1}}}$

$\frac{\partial Loss}{\partial w_{ij_{1}}} = -0.00126 \times 1 \times 2.0$

$\frac{\partial Loss}{\partial w_{ij_{1}}} = -0.00252$

Applying the same calculation to the remaining parameters gives all the gradients needed for the update. The bias gradients equal $\frac{\partial Loss}{\partial j_{out}} \times \frac{\partial j_{out}}{\partial j_{in}}$, because the derivative of $j_{in}$ with respect to its bias is 1.

$\begin{bmatrix} 
\frac{\partial Loss}{\partial w_{ij_{1}}} & 
\frac{\partial Loss}{\partial w_{ij_{2}}} &
\frac{\partial Loss}{\partial w_{ij_{3}}} &
\frac{\partial Loss}{\partial w_{ij_{4}}}
\end{bmatrix} \approx 
\begin{bmatrix} 
-0.00252 & 
-0.00273 &
-0.00294 &
-0.00315
\end{bmatrix}$

$\begin{bmatrix} 
\frac{\partial Loss}{\partial b_{ij_{1}}} & 
\frac{\partial Loss}{\partial b_{ij_{2}}} &
\frac{\partial Loss}{\partial b_{ij_{3}}} &
\frac{\partial Loss}{\partial b_{ij_{4}}}
\end{bmatrix} \approx 
\begin{bmatrix} 
-0.00126 & 
-0.00137 &
-0.00147 &
-0.00158
\end{bmatrix}$

### SGD Update for Hidden Layer 1 to Input Layer

$\begin{bmatrix} 
w'_{ij_{1}} & 
w'_{ij_{2}} &
w'_{ij_{3}} &
w'_{ij_{4}}
\end{bmatrix} = 
\begin{bmatrix} 
w_{ij_{1}} - \alpha \left(\frac{\partial Loss}{\partial w_{ij_{1}}} \right) & 
w_{ij_{2}} - \alpha \left(\frac{\partial Loss}{\partial w_{ij_{2}}} \right) &
w_{ij_{3}} - \alpha \left(\frac{\partial Loss}{\partial w_{ij_{3}}} \right) &
w_{ij_{4}} - \alpha \left(\frac{\partial Loss}{\partial w_{ij_{4}}} \right)
\end{bmatrix}$

$\begin{bmatrix} 
w'_{ij_{1}} & 
w'_{ij_{2}} &
w'_{ij_{3}} &
w'_{ij_{4}}
\end{bmatrix} \approx 
\begin{bmatrix} 
0.25063 & 
0.50068 &
0.75074 &
1.00079
\end{bmatrix}$

$\begin{bmatrix} 
b'_{ij_{1}} & 
b'_{ij_{2}} &
b'_{ij_{3}} &
b'_{ij_{4}}
\end{bmatrix} = 
\begin{bmatrix} 
b_{ij_{1}} - \alpha \left(\frac{\partial Loss}{\partial b_{ij_{1}}} \right) & 
b_{ij_{2}} - \alpha \left(\frac{\partial Loss}{\partial b_{ij_{2}}} \right) &
b_{ij_{3}} - \alpha \left(\frac{\partial Loss}{\partial b_{ij_{3}}} \right) &
b_{ij_{4}} - \alpha \left(\frac{\partial Loss}{\partial b_{ij_{4}}} \right)
\end{bmatrix}$

$\begin{bmatrix} 
b'_{ij_{1}} & 
b'_{ij_{2}} &
b'_{ij_{3}} &
b'_{ij_{4}}
\end{bmatrix} \approx 
\begin{bmatrix} 
1.00031 & 
1.00034 &
1.00037 &
1.00039
\end{bmatrix}$

That completes one update. All the new parameters have been found. This process (**Forward Pass** and **Backward Pass**) is repeated until the loss is as small as possible.

## Old Parameter vs New Parameter

Old parameters:

- **Weight**

$W_{ij} = \begin{bmatrix}w_{ij_{1}} & w_{ij_{2}} & w_{ij_{3}} & w_{ij_{4}} \end{bmatrix} = \begin{bmatrix}0.25 & 0.5 & 0.75 & 1.0 \end{bmatrix}$

$W_{jk} = \begin{bmatrix}
w_{j_{1}k_{1}} & w_{j_{1}k_{2}} \\
w_{j_{2}k_{1}} & w_{j_{2}k_{2}} \\
w_{j_{3}k_{1}} & w_{j_{3}k_{2}} \\
w_{j_{4}k_{1}} & w_{j_{4}k_{2}}
\end{bmatrix} = \begin{bmatrix}
1.0 & 0 \\
0.75 & 0.25 \\
0.5 & 0.5 \\
0.25 & 0.75
\end{bmatrix}$

$W_{ko} = \begin{bmatrix}w_{k_{1}o} \\ w_{k_{2}o} \end{bmatrix} = \begin{bmatrix}1.0 \\ 0.5 \end{bmatrix}$

- **Bias**

$b_{ij} = \begin{bmatrix}b_{ij_{1}} & b_{ij_{2}} & b_{ij_{3}} & b_{ij_{4}} \end{bmatrix} = \begin{bmatrix} 1.0 & 1.0 & 1.0 & 1.0 \end{bmatrix}$

$b_{jk} = \begin{bmatrix}b_{jk_{1}} & b_{jk_{2}} \end{bmatrix} = \begin{bmatrix} 1.0 & 1.0 \end{bmatrix}$

$b_{o} = \begin{bmatrix} 1.0 \end{bmatrix}$

$...$

New parameters:

- **Weight**

$W'_{ij} = \begin{bmatrix}w'_{ij_{1}} & w'_{ij_{2}} & w'_{ij_{3}} & w'_{ij_{4}} \end{bmatrix} \approx \begin{bmatrix}0.25063 & 0.50068 & 0.75074 & 1.00079 \end{bmatrix}$

$W'_{jk} = \begin{bmatrix}
w'_{j_{1}k_{1}} & w'_{j_{1}k_{2}} \\
w'_{j_{2}k_{1}} & w'_{j_{2}k_{2}} \\
w'_{j_{3}k_{1}} & w'_{j_{3}k_{2}} \\
w'_{j_{4}k_{1}} & w'_{j_{4}k_{2}}
\end{bmatrix} \approx \begin{bmatrix}
1.00047 & 0.00063 \\
0.75063 & 0.25084 \\
0.50079 & 0.50105 \\
0.25094 & 0.75126
\end{bmatrix}$

$W'_{ko} = \begin{bmatrix}w'_{k_{1}o} \\ w'_{k_{2}o} \end{bmatrix} = \begin{bmatrix}1.1262 \\ 0.6256 \end{bmatrix}$

- **Bias**

$b'_{ij} = \begin{bmatrix}b'_{ij_{1}} & b'_{ij_{2}} & b'_{ij_{3}} & b'_{ij_{4}} \end{bmatrix} \approx \begin{bmatrix} 1.00031 & 1.00034 & 1.00037 & 1.00039 \end{bmatrix}$

$b'_{jk} = \begin{bmatrix}b'_{jk_{1}} & b'_{jk_{2}} \end{bmatrix} \approx \begin{bmatrix} 1.00031 & 1.00042 \end{bmatrix}$

$b'_{o} = \begin{bmatrix} 1.1265 \end{bmatrix}$

The example above used only one data point per **Forward** and **Backward Pass**. In general, there are 3 types of **Gradient Descent**: the **SGD** used above, **Batch Gradient Descent**, and **Mini-batch Gradient Descent**.

In **Batch Gradient Descent (BGD)**, the model is updated only after all the training data has been propagated. Mini-batch sits between SGD and BGD.

**Mini-batch gradient descent** runs the **Forward** and **Backward Pass** on a small group of training data. For example, it performs one update for every 32 or 64 samples, and the error is the mean over that group.
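To make the difference between the three variants concrete, here is a minimal sketch of mini-batch gradient descent in plain Python. For brevity it trains a single linear unit ($w x + b$) instead of the network above, and the data set, batch size, and learning rate are made-up illustrative values. Setting `batch_size=1` gives SGD, and setting it to the size of the data set gives batch gradient descent.

```python
def predict(w, b, x):
    """A single linear unit, used here only to keep the example short."""
    return w * x + b

def batch_gradient(w, b, xs, ys):
    """Mean gradient of 1/2 * (prediction - y)^2 over one group of samples."""
    n = len(xs)
    errors = [predict(w, b, x) - y for x, y in zip(xs, ys)]
    grad_w = sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = sum(errors) / n
    return grad_w, grad_b

def minibatch_sgd(xs, ys, batch_size, learning_rate, epochs):
    """Update the parameters once per mini-batch, for a number of passes over the data."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            grad_w, grad_b = batch_gradient(w, b, batch_x, batch_y)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (an assumption made for this illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = minibatch_sgd(xs, ys, batch_size=2, learning_rate=0.05, epochs=2000)
print(round(w, 3), round(b, 3))
```

Averaging the gradient over the group is what "the error is the mean over that group" amounts to in practice: each update follows the average direction suggested by the samples in the batch.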