# Introduction
Consider the following Neural Network,

<img src = "../artifacts/neural_networks_36.png" alt = "drawing" width = "500">

The computation graph for the above looks as follows,

<img src = "../artifacts/neural_networks_37.png" alt = "drawing" width = "500">

# Forward Propagation
In forward propagation, the propagation is from left to right. The following is done during a forward pass (forward propagation),
- Calculate the value of $z_i$.
- Apply activation function on top of it.
- Then pass it to the Neuron in front of it.
- Ultimately, the probabilities are obtained.
- Then these probabilities are used to calculate the loss. Since it is a multi-class classification problem, the loss function used is categorical cross entropy.

The final objective is to compute $z^2$.
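The steps listed above can be sketched as one function. This is a minimal illustration, not the notebook's own code: the name `forward` and the `*_demo` variables are hypothetical, and random numbers stand in for the spiral data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only to make this sketch reproducible

# Hypothetical stand-in for the input data: 6 samples, 2 features
x_demo = rng.normal(size=(6, 2))

d, h, n = 2, 4, 3  # inputs, hidden neurons, classes
w1_demo = 0.01 * rng.normal(size=(d, h))
b1_demo = np.zeros((1, h))
w2_demo = 0.01 * rng.normal(size=(h, n))
b2_demo = np.zeros((1, n))

def forward(x, w1, b1, w2, b2):
    """One forward pass: linear -> ReLU -> linear -> softmax."""
    z1 = np.dot(x, w1) + b1      # pre-activation of the hidden layer
    a1 = np.maximum(0, z1)       # ReLU
    z2 = np.dot(a1, w2) + b2     # pre-activation of the output layer
    z2_exp = np.exp(z2)
    probs = z2_exp / np.sum(z2_exp, axis=1, keepdims=True)  # softmax
    return probs

probs_demo = forward(x_demo, w1_demo, b1_demo, w2_demo, b2_demo)
print(probs_demo.shape)        # one row of class probabilities per sample
print(probs_demo.sum(axis=1))  # each row sums to 1 (up to floating point)
```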

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("spiral.csv")
df.head()

Unnamed: 0,x1,x2,y
0,0.0,0.0,0
1,-0.00065,0.01008,0
2,0.009809,0.017661,0
3,0.007487,0.029364,0
4,-2.7e-05,0.040404,0


In [3]:
# separating features and labels
x = df.drop(columns = ["y"])
y = df["y"]
x.shape, y.shape

((300, 2), (300,))

In [4]:
# initialize the parameters at random
d = 2 # dimensions or number of inputs
n = 3 # number of classes or number of neurons in the output layer
h = 4 # number of neurons in the hidden layer

In [5]:
# input layer to the hidden layer
# weight and bias of layer 1
w1 = 0.01 * np.random.randn(d, h)
b1 = np.zeros((1, h))
w1.shape, b1.shape

((2, 4), (1, 4))

### Calculating $z^1$
Each row of $x$ is multiplied with each column of $w^1$, and the bias $b^1$ is added to the result.

In [6]:
# z1 = np.dot(x, w) + b
z1 = np.dot(x, w1) + b1
z1.shape

(300, 4)

### Calculating $a^1$
The ReLU function is applied to $z^1$.

In [7]:
# ReLU activation function
a1 = np.maximum(0, z1)
a1.shape

(300, 4)

In [8]:
# hidden layer to the output layer
# weight and bias of layer 2
w2 = 0.01 * np.random.randn(h, n)
b2 = np.zeros((1, n))
w2.shape, b2.shape

((4, 3), (1, 3))

### Calculating $z^2$
In order to calculate $z^2$, $a^1$ is multiplied with $w^2$ and the bias $b^2$ is added to the result.

In [9]:
z2 = np.dot(a1, w2) + b2
z2.shape

(300, 3)

### Calculating $a^2$

In [10]:
# apply the softmax function to compute a2
z2_exp = np.exp(z2)
a2 = z2_exp/ np.sum(z2_exp, axis = 1, keepdims = True)
probs = a2
probs.shape

(300, 3)

<img src = "../artifacts/neural_networks_38.png" alt = "drawing" width = "500">
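The softmax above exponentiates $z^2$ directly, which is fine for the small scores produced here. For large scores `np.exp` can overflow, so a common variant subtracts the row maximum first. The sketch below is an assumption about how that could look (the name `softmax_stable` is hypothetical); it is not part of the original notebook.

```python
import numpy as np

def softmax_stable(z):
    """Row-wise softmax with the row maximum subtracted before exponentiating."""
    z_shifted = z - np.max(z, axis=1, keepdims=True)  # largest entry per row becomes 0
    z_exp = np.exp(z_shifted)
    return z_exp / np.sum(z_exp, axis=1, keepdims=True)

z_small = np.array([[1.0, 2.0, 3.0]])
z_large = z_small + 1000.0  # np.exp(1003.0) on its own would overflow to inf

print(softmax_stable(z_small))
print(softmax_stable(z_large))  # same probabilities as for z_small
```

Subtracting a constant from every score in a row leaves the softmax output unchanged, because the factor cancels between numerator and denominator.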

# Loss Calculation
### Will the loss function change?
No.
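As a reminder of what that loss looks like in code, here is a small sketch of categorical cross entropy. The probabilities and labels are made-up values for illustration, and all names ending in `_demo` are hypothetical.

```python
import numpy as np

# Hypothetical probabilities for 3 samples and 3 classes (each row sums to 1)
probs_demo = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.3, 0.3, 0.4]])
y_demo = np.array([0, 1, 2])  # true class index of each sample
m_demo = y_demo.shape[0]

# probability that the model assigned to the correct class of each sample
correct_probs = probs_demo[np.arange(m_demo), y_demo]

# categorical cross entropy, averaged over the samples
loss_demo = -np.mean(np.log(correct_probs))
print(loss_demo)
```

Only the probability of the true class enters the loss, which is why the extra hidden layer does not change the loss function itself.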

# Backward Propagation
### Will the gradient calculation change in case of n layer Neural Network?
No. But, there is an additional requirement to back propagate the gradients for one additional layer.

In [11]:
# number of data points (training samples)
m = y.shape[0]
m

300

### Calculating $dz^2$

<img src = "../artifacts/neural_networks_39.png" alt = "drawing" width = "500">

$dz^2 = \frac{\partial L}{\partial z^2}$

So,

$\frac{\partial L}{\partial z^2} = \frac{\partial L}{\partial a^2} * \frac{\partial a^2}{\partial z^2}$

Here, $a^2$ is the output probabilities.

Replace $a^2$ with $p$, $\frac{\partial L}{\partial z^2} = \frac{\partial L}{\partial p} * \frac{\partial p}{\partial z^2}$

The above equation is similar to what was calculated previously, i.e., derivative of loss with respect to $z$.

$dz = \frac{\partial J}{\partial p} * \frac{\partial p}{\partial z}$.

The derivative came out to be, $dz = (p_i - I(i = y))$

Hence, $dz^2 = (p_i - I(i = y))$.

In [12]:
# copy so that probs itself is not modified in place
dz2 = probs.copy()
dz2[range(m), y] -= 1

The shape of $dz^2$ is the same as the shape of probabilities, `(m, n)` (i.e., in this case `(300, 3)`).
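The result $dz^2 = p_i - I(i = y)$ can be sanity-checked numerically. The sketch below compares it with a central-difference estimate for a single sample. The scores, the label and the helper names (`softmax_row`, `loss_single`) are assumptions made for this illustration.

```python
import numpy as np

def softmax_row(z):
    """Softmax of a single score vector."""
    z_exp = np.exp(z - np.max(z))
    return z_exp / np.sum(z_exp)

def loss_single(z, label):
    """Cross entropy of one sample given its scores z and true class index."""
    return -np.log(softmax_row(z)[label])

z_demo = np.array([0.5, -1.0, 2.0])  # hypothetical scores of one sample
label_demo = 1

# analytic gradient: p_i - I(i = y)
analytic = softmax_row(z_demo)
analytic[label_demo] -= 1

# numerical gradient by central differences
eps = 1e-6
numeric = np.zeros_like(z_demo)
for i in range(z_demo.size):
    step = np.zeros_like(z_demo)
    step[i] = eps
    numeric[i] = (loss_single(z_demo + step, label_demo)
                  - loss_single(z_demo - step, label_demo)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be very small
```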

### Calculating $dw^2$ and $db^2$
The gradients $dw^2$ and $db^2$ are calculated in the same way as $dw$ and $db$ were in the softmax classifier.

<img src = "../artifacts/neural_networks_40.png" alt = "drawing" width = "500">

$dw^2 = \frac{\partial L}{\partial w^2} = \frac{\partial L}{\partial a^2} * \frac{\partial a^2}{\partial z^2} * \frac{\partial z^2}{\partial w^2}$.

$dw^2 = dz^2 * \frac{\partial z^2}{\partial w^2}$.

Here, $z^2 = w^{2^T} * a^1 + b^2$.

So, $\frac{\partial z^2}{\partial w^2} = a^1$

$dw^2 = \frac{\partial L}{\partial w^2} = dz^2 * a^1$

The shape of $dz^2$ = `(300, 3)` and the shape of $a^1$ = `(300, 4)`.

$dw^2$ will be used to update $w^2$. Therefore, the shape of $dw^2$ should be the same as $w^2$, i.e., `(4, 3)`.

Hence, $dz^2$ and $a^1$ should be multiplied such that the resulting matrix has the shape `(4, 3)`.

Therefore, the transpose of $a^1$ is multiplied with $dz^2$, $a^{1^T} * dz^2$.

<img src = "../artifacts/neural_networks_41.png" alt = "drawing" width = "500">

In [13]:
dw2 = np.dot(a1.T, dz2)/ m
dw2.shape

(4, 3)

The division by `m` is needed because gradient descent uses all the data points to calculate the update for $w$, so the summed gradient is averaged by dividing by the total number of data points.
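To see what the matrix product is doing, the sketch below compares `np.dot(a1.T, dz2) / m` with an explicit loop that averages the per-sample outer products. Random arrays stand in for $a^1$ and $dz^2$, and all `*_demo` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m_demo, h_demo, n_demo = 5, 4, 3
a1_demo = rng.normal(size=(m_demo, h_demo))   # stand-in for a1
dz2_demo = rng.normal(size=(m_demo, n_demo))  # stand-in for dz2

# vectorised form, as used in the cell above
dw2_matrix = np.dot(a1_demo.T, dz2_demo) / m_demo

# the same quantity, one sample at a time
dw2_loop = np.zeros((h_demo, n_demo))
for i in range(m_demo):
    dw2_loop += np.outer(a1_demo[i], dz2_demo[i])  # gradient contributed by sample i
dw2_loop /= m_demo

print(dw2_matrix.shape)
print(np.allclose(dw2_matrix, dw2_loop))
```

The transpose is what makes the product sum over the samples, so each data point contributes one `(4, 3)` outer product to the average.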

<img src = "../artifacts/neural_networks_42.png" alt = "drawing" width = "500">

Why is there a need to divide by `m`? The goal is to update the weights and biases, which is done by,
1. Calculating the derivatives $dw^2$, $db^2$, $dw^1$, $db^1$.
2. Updating the weights, $w^1 = w^1 - \eta * dw^1 *\frac{1}{m}$ (in the code, the $\frac{1}{m}$ factor is already folded into `dw1`, so it is applied only once).

$db^2$ can also be calculated in a similar way.

<img src = "../artifacts/neural_networks_43.png" alt = "drawing" width = "500">

$db^2 = \frac{\partial L}{\partial b^2} = \frac{\partial L}{\partial a^2} * \frac{\partial a^2}{\partial z^2} * \frac{\partial z^2}{\partial b^2}$

Now, $\frac{\partial z^2}{\partial b^2} = \frac{\partial (w^2 * a^1 + b^2)}{\partial b^2} = 1$.

$db^2 = \frac{\partial L}{\partial b^2} = \frac{\partial L}{\partial a^2} * \frac{\partial a^2}{\partial z^2} * 1 = dz^2$.

$db^2$ will be used to update $b^2$. Therefore, the shape of $db^2$ will be the same as $b^2$, i.e., `(1, 3)`.

But the shape of $dz^2$ is `(300, 3)`. Since gradient descent and not stochastic gradient descent is being performed, the derivatives have to be summed up across the rows and then averaged before they are used for the update.

<img src = "../artifacts/neural_networks_44.png" alt = "drawing" width = "500">

In [14]:
db2 = np.sum(dz2, axis = 0, keepdims = True)/ m
db2.shape

(1, 3)
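A tiny made-up example shows the effect of `axis = 0` and `keepdims = True`. The values in `dz2_demo` are arbitrary and only serve as an illustration.

```python
import numpy as np

# Hypothetical dz2 for 4 samples and 3 classes
dz2_demo = np.array([[ 0.2, -0.5,  0.3],
                     [-0.6,  0.1,  0.5],
                     [ 0.1,  0.1, -0.2],
                     [ 0.3, -0.7,  0.4]])
m_demo = dz2_demo.shape[0]

# sum down the rows (over the samples), keep the result two-dimensional
db2_demo = np.sum(dz2_demo, axis=0, keepdims=True) / m_demo
print(db2_demo.shape)  # (1, 3), the same shape as b2
print(db2_demo)

# without keepdims the result is one-dimensional
print(np.sum(dz2_demo, axis=0).shape)  # (3,)
```

Keeping the `(1, 3)` shape means the gradient lines up with `b2` directly when the update is applied.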

### Calculating $da^1$
<img src = "../artifacts/neural_networks_45.png" alt = "drawing" width = "500">

$da^1 = \frac{\partial L}{\partial a^1} = \frac{\partial L}{\partial a^2} * \frac{\partial a^2}{\partial z^2} * \frac{\partial z^2}{\partial a^1}$

Since, $\frac{\partial L}{\partial a^2} * \frac{\partial a^2}{\partial z^2} = dz^2$.

Now,

$\frac{\partial z^2}{\partial a^1} = \frac{\partial (w^2 * a^1 + b^2)}{\partial a^1} = w^2$.

$da^1 = \frac{\partial L}{\partial a^2} * \frac{\partial a^2}{\partial z^2} * w^2 = dz^2 * w^2$.

The shape of $da^1$ will be same as $a^1$, i.e., `(300, 4)`.

<img src = "../artifacts/neural_networks_46.png" alt = "drawing" width = "500">

The shape of $dz^2$ = `(300, 3)` and the shape of $w^2$ = `(4, 3)`.

$da^1$ is the gradient with respect to $a^1$. Therefore, the shape of $da^1$ should be the same as $a^1$, i.e., `(300, 4)`.

Hence, $dz^2$ and $w^2$ should be multiplied such that the resulting matrix has the shape `(300, 4)`.

Therefore, $dz^2$ is multiplied with the transpose of $w^2$, $da^1 = dz^2 * w^{2^T}$.

In [15]:
da1 = np.dot(dz2, w2.T)
da1.shape

(300, 4)
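The same product can be written element-wise as a check on the shapes. This sketch assumes the row-wise convention used in the code cells, $z^2 = a^1 w^2 + b^2$, with $i$ indexing samples, $k$ hidden neurons and $j$ classes (the index names are introduced here only for illustration):

$$z^2_{ij} = \sum_{k} a^1_{ik} \, w^2_{kj} + b^2_j \quad\Rightarrow\quad \frac{\partial z^2_{ij}}{\partial a^1_{ik}} = w^2_{kj}$$

$$da^1_{ik} = \sum_{j} dz^2_{ij} \, w^2_{kj} = \left(dz^2 \, w^{2^T}\right)_{ik}$$

The sum runs over the 3 classes, so a `(300, 3)` matrix times a `(3, 4)` matrix gives `(300, 4)`, which is what `np.dot(dz2, w2.T)` computes.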

### Calculating $dz^1$
To calculate $dz^1$, the gradient has to be passed backwards through the ReLU layer.

<img src = "../artifacts/neural_networks_47.png" alt = "drawing" width = "500">

$\frac{\partial L}{\partial z^1} = \frac{\partial L}{\partial a^2} * \frac{\partial a^2}{\partial z^2} * \frac{\partial z^2}{\partial a^1} * \frac{\partial a^1}{\partial z^1}$

It is known that, $\frac{\partial L}{\partial a^2} * \frac{\partial a^2}{\partial z^2} * \frac{\partial z^2}{\partial a^1} = da^1$

$\frac{\partial a^1}{\partial z^1}$ has to be calculated.

<img src = "../artifacts/neural_networks_48.png" alt = "drawing" width = "500">

<img src = "../artifacts/neural_networks_49.png" alt = "drawing" width = "500">

In [16]:
da1[z1 <= 0] = 0
dz1 = da1
dz1.shape

(300, 4)

Why is $da^1$ being directly updated without creating a copy of it?
- The purpose of calculating $da^1$ and $dz^1$ is to ultimately calculate $dw^1$ and $db^1$. Both of them are only intermediate values.
- $da^1$ is not used anywhere else except for the calculation of $dz^1$, so overwriting it loses nothing.
- After the assignment, `dz1` and `da1` refer to the same array, so no second `(300, 4)` array has to be allocated.

This also means that the intermediate output values from the forward pass have to be saved.
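If $da^1$ did need to be kept, the same ReLU backward step could be written without modifying it, by multiplying with a 0/1 mask built from the saved $z^1$. This is an alternative sketch with made-up values, not the notebook's approach. All `*_demo` names and `relu_grad` are hypothetical.

```python
import numpy as np

z1_demo = np.array([[-1.0, 0.5],
                    [ 2.0, 0.0]])    # stand-in for z1 saved in the forward pass
da1_demo = np.array([[ 0.3, -0.4],
                     [ 0.7,  0.9]])  # stand-in for da1

# derivative of ReLU: 1 where z1 > 0, otherwise 0 (z1 <= 0 is zeroed, as in the cell above)
relu_grad = (z1_demo > 0).astype(da1_demo.dtype)

dz1_demo = da1_demo * relu_grad  # new array, da1_demo is left untouched
print(dz1_demo)
print(da1_demo)
```

The mask depends on $z^1$, which is why the forward-pass value has to be available during back propagation.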

### Calculating $dw^1$ and $db^1$

<img src = "../artifacts/neural_networks_50.png" alt = "drawing" width = "500">

<img src = "../artifacts/neural_networks_51.png" alt = "drawing" width = "500">

$\frac{\partial L}{\partial w^1} = \frac{\partial L}{\partial a^2} * \frac{\partial a^2}{\partial z^2} * \frac{\partial z^2}{\partial a^1} * \frac{\partial a^1}{\partial z^1} * \frac{\partial z^1}{\partial w^1}$.

It is known that, $\frac{\partial L}{\partial a^2} * \frac{\partial a^2}{\partial z^2} * \frac{\partial z^2}{\partial a^1} * \frac{\partial a^1}{\partial z^1} = dz^1$.

$\frac{\partial z^1}{\partial w^1}$ has to be calculated.

$\frac{\partial z^1}{\partial w^1} = \frac{\partial (w^1 * x + b^1)}{\partial w^1} = x$

$\frac{\partial L}{\partial w^1} = dz^1 * x$.

$db^1$ can similarly be calculated as,

<img src = "../artifacts/neural_networks_52.png" alt = "drawing" width = "500">

$\frac{\partial L}{\partial b^1} = \frac{\partial L}{\partial a^2} * \frac{\partial a^2}{\partial z^2} * \frac{\partial z^2}{\partial a^1} * \frac{\partial a^1}{\partial z^1} * \frac{\partial z^1}{\partial b^1}$.

It is known that, $\frac{\partial L}{\partial a^2} * \frac{\partial a^2}{\partial z^2} * \frac{\partial z^2}{\partial a^1} * \frac{\partial a^1}{\partial z^1} = dz^1$.

$\frac{\partial z^1}{\partial b^1}$ has to be calculated.

$\frac{\partial z^1}{\partial b^1} = \frac{\partial (w^1 * x + b^1)}{\partial b^1} = 1$

Therefore, $\frac{\partial L}{\partial b^1} = dz^1 * 1$.

In [17]:
dw1 = np.dot(x.T, dz1)/ m
db1 = np.sum(dz1, axis = 0, keepdims = True)/ m
dw1.shape, db1.shape

((2, 4), (1, 4))

Now that the gradients have been found, the weights and biases can be updated as,

In [18]:
lr = 1e-0

In [19]:
# update the parameters
w1 += -lr * dw1
b1 += -lr * db1
w2 += -lr * dw2
b2 += -lr * db2
w1, b1, w2, b2

(array([[ 0.01956925, -0.01055101, -0.00950208,  0.01987932],
        [-0.00500924,  0.00625357,  0.00573331,  0.00569931]]),
 array([[-5.63074880e-04, -1.56068427e-03, -1.78393075e-05,
         -4.18104901e-05]]),
 array([[-0.01591582, -0.00317806, -0.02072609],
        [ 0.01133266, -0.00710962, -0.00109631],
        [ 0.00953313,  0.01570165, -0.00829044],
        [ 0.00407602,  0.02038706,  0.00403669]]),
 array([[-1.30325831e-06, -2.09805232e-05,  2.22837815e-05]]))

The forward pass, backward pass and parameter update are repeated until convergence, i.e., until the loss stops decreasing.
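Putting the cells together, the repeated update can be sketched as a loop. This is a minimal illustration under stated assumptions: three Gaussian blobs replace `spiral.csv`, the learning rate `0.1` and the 1000 iterations are arbitrary choices for this toy data, and the variable names ending in `_toy` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: three Gaussian blobs with 50 points each
centers = np.array([[0.0, 2.0], [-2.0, -1.0], [2.0, -1.0]])
x_toy = np.vstack([c + 0.5 * rng.normal(size=(50, 2)) for c in centers])
y_toy = np.repeat(np.arange(3), 50)
m_toy = y_toy.shape[0]

d, h, n = 2, 4, 3
w1 = 0.01 * rng.normal(size=(d, h))
b1 = np.zeros((1, h))
w2 = 0.01 * rng.normal(size=(h, n))
b2 = np.zeros((1, n))

lr = 0.1  # assumed learning rate for this toy problem
losses = []
for epoch in range(1000):
    # forward propagation
    z1 = np.dot(x_toy, w1) + b1
    a1 = np.maximum(0, z1)
    z2 = np.dot(a1, w2) + b2
    z2_exp = np.exp(z2 - np.max(z2, axis=1, keepdims=True))
    probs = z2_exp / np.sum(z2_exp, axis=1, keepdims=True)
    losses.append(-np.mean(np.log(probs[np.arange(m_toy), y_toy])))

    # backward propagation
    dz2 = probs.copy()
    dz2[np.arange(m_toy), y_toy] -= 1
    dw2 = np.dot(a1.T, dz2) / m_toy
    db2 = np.sum(dz2, axis=0, keepdims=True) / m_toy
    da1 = np.dot(dz2, w2.T)
    dz1 = da1 * (z1 > 0)
    dw1 = np.dot(x_toy.T, dz1) / m_toy
    db1 = np.sum(dz1, axis=0, keepdims=True) / m_toy

    # parameter update
    w1 -= lr * dw1
    b1 -= lr * db1
    w2 -= lr * dw2
    b2 -= lr * db2

print(losses[0], losses[-1])  # loss at the first and at the last iteration
```

In practice the loop would stop once the loss changes by less than some tolerance instead of running for a fixed number of iterations.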

### Summary of the entire process
A single gradient descent step for the weight update looks as follows,

<img src = "../artifacts/neural_networks_53.png" alt = "drawing" width = 500>

The derivatives are as follows,

<img src = "../artifacts/neural_networks_54.png" alt = "drawing" width = 500>

Notice that,
- $dz^2$ is used for the calculation of $dw^2$, $db^2$ and $da^1$.
- Similarly, $da^1$ is used for the calculation of $dz^1$.
- And, $dz^1$ is used for the calculation of $dw^1$ and $db^1$.

In order to not calculate the values of deeper derivatives, i.e., $da^1$, $dz^1$ over and over again, the derivatives of deeper layers are calculated and stored. The stored values can be used to calculate the derivative of the shallow layers. This is called memoization, it is also used in dynamic programming.

The following is a simplified flowchart of a single update cycle,

<img src = "../artifacts/neural_networks_55.png" alt = "drawing" width = 500>

During forward propagation,
- The values of $z^j$, $w^j$, $b^j$ are stored in order to use them during back propagation.
- For example, $da^1$ used $w^2$ for its calculation.
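One way to organise this storing is to let the forward pass return a cache that the backward pass reads from. The sketch below is an assumed structure, not the notebook's code. The function names `forward` and `backward`, the `cache` dictionary and its keys are all hypothetical.

```python
import numpy as np

def forward(x, params):
    """Forward pass that also returns the values needed by the backward pass."""
    w1, b1, w2, b2 = params
    z1 = np.dot(x, w1) + b1
    a1 = np.maximum(0, z1)
    z2 = np.dot(a1, w2) + b2
    z2_exp = np.exp(z2 - np.max(z2, axis=1, keepdims=True))
    probs = z2_exp / np.sum(z2_exp, axis=1, keepdims=True)
    cache = {"x": x, "z1": z1, "a1": a1, "probs": probs}
    return probs, cache

def backward(y, params, cache):
    """Backward pass that reuses the cached forward values."""
    w1, b1, w2, b2 = params
    m = y.shape[0]
    dz2 = cache["probs"].copy()
    dz2[np.arange(m), y] -= 1
    dw2 = np.dot(cache["a1"].T, dz2) / m   # needs a1 from the forward pass
    db2 = np.sum(dz2, axis=0, keepdims=True) / m
    da1 = np.dot(dz2, w2.T)                # needs w2
    dz1 = da1 * (cache["z1"] > 0)          # needs z1 from the forward pass
    dw1 = np.dot(cache["x"].T, dz1) / m
    db1 = np.sum(dz1, axis=0, keepdims=True) / m
    return dw1, db1, dw2, db2

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(5, 2))       # made-up inputs
y_demo = np.array([0, 1, 2, 1, 0])     # made-up labels
params_demo = (0.01 * rng.normal(size=(2, 4)), np.zeros((1, 4)),
               0.01 * rng.normal(size=(4, 3)), np.zeros((1, 3)))

probs_demo, cache_demo = forward(x_demo, params_demo)
grads_demo = backward(y_demo, params_demo, cache_demo)
print([g.shape for g in grads_demo])  # each gradient has the shape of its parameter
```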