<a href="https://colab.research.google.com/github/saranshikens/Basic-ML/blob/main/Neural_Network_from_Scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**IMPLEMENTING A SIMPLE NEURAL NETWORK USING ONLY NUMPY**

**DERIVING THE BACKWARD PROPAGATION GRADIENTS**  
To optimise the weights and biases, we use gradient descent. Hence, we first obtain the expressions for the gradient of the Loss Function with respect to various parameters.  
Before that, we will derive some useful results obtained by calculus on matrices.

**SOME IDENTITIES**

**Matrix times column vector with respect to the column vector**  
Let $z = Wx$, where $W \in \mathbb{R}^{n \times m}$, and $x \in \mathbb{R}^{m \times 1}$. So, $z \in \mathbb{R}^{n \times 1}$.   
We need to calculate $\dfrac{\partial{z}}{\partial{x}}$.  
Observe that $z_{i} = \displaystyle \sum_{j=1}^{m} W_{ij} x_{j}$.  
$\left(\dfrac{\partial{z}}{\partial{x}}\right)_{ik} = \dfrac{\partial{z_i}}{\partial{x_k}} = \displaystyle \sum_{j=1}^{m} W_{ij}\dfrac{\partial{x_j}}{\partial{x_k}}$, where $\dfrac{\partial{x_j}}{\partial{x_k}} = \begin{cases}
1, & \text{if } j=k\\  
0, & \text{otherwise}
\end{cases}$  
Only the $j=k$ term of the sum survives, so $\left(\dfrac{\partial{z}}{\partial{x}}\right)_{ik} = W_{ik}$. In other words, $\dfrac{\partial{z}}{\partial{x}} = W$.
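
As a quick sanity check, this identity can be verified numerically with central finite differences. The sketch below is illustrative only: the helper `numerical_jacobian`, the sizes, and the random seed are our own choices and are not used anywhere else in this notebook.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the Jacobian of f at x with central differences.

    Entry (i, k) approximates d f(x)_i / d x_k.
    """
    x = x.astype(float)
    z = f(x)
    J = np.zeros((z.size, x.size))
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        J[:, k] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
n, m = 3, 4
W = rng.normal(size=(n, m))
x = rng.normal(size=m)

# For z = Wx, the numerical Jacobian should agree with W itself.
J = numerical_jacobian(lambda v: W @ v, x)
print(np.allclose(J, W, atol=1e-6))
```

Because $z = Wx$ is linear in $x$, the finite-difference estimate should match $W$ up to floating-point rounding.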

**Row vector times matrix with respect to the row vector**  
Let $z = xW$, where $x \in \mathbb{R}^{1 \times n}$ and $W \in \mathbb{R}^{n \times m}$, so $z \in \mathbb{R}^{1 \times m}$. A similar methodology as above yields $\dfrac{\partial{z}}{\partial{x}} = W^{T}$.

**An element wise function applied to a vector**  
Let $z = f(x)$, where $x \in \mathbb{R}^{m \times 1}$, and $f:\mathbb{R} \to \mathbb{R}$. So, $z \in \mathbb{R}^{m \times 1}$.  
Observe that $z_{i} = f(x_i)$.  
$\left(\dfrac{\partial{z}}{\partial{x}}\right)_{ik} = \dfrac{\partial{z_i}}{\partial{x_k}} = \dfrac{\partial}{\partial{x_k}} f(x_i) = \begin{cases}
f'(x_i), & \text{if } i=k\\  
0, & \text{otherwise}
\end{cases}$  
So, $\dfrac{\partial{z}}{\partial{x}} = \text{diag}(f'(x))$.
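
The diagonal structure can also be checked numerically. The following is a minimal sketch that assumes $f = \tanh$ purely for illustration (any differentiable elementwise function would do); the variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=5)

f = np.tanh                                   # example elementwise function
f_prime = lambda v: 1.0 - np.tanh(v) ** 2     # its known derivative

# Build the Jacobian column by column with central differences.
eps = 1e-6
J = np.zeros((x.size, x.size))
for k in range(x.size):
    step = np.zeros_like(x)
    step[k] = eps
    J[:, k] = (f(x + step) - f(x - step)) / (2 * eps)

# Off-diagonal entries vanish; the diagonal holds f'(x_i).
print(np.allclose(J, np.diag(f_prime(x)), atol=1e-6))
```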

**Matrix times column vector with respect to the matrix**  
Let $z = Wx$, where $W \in \mathbb{R}^{n \times m}$, and $x \in \mathbb{R}^{m \times 1}$. So, again $z \in \mathbb{R}^{n \times 1}$. Let $L: \mathbb{R}^{n \times m} \to \mathbb{R}$ be a scalar function of $W$ that depends on $W$ only through $z$.  
For the purpose of this notebook, we shall compute $\dfrac{\partial{L}}{\partial{W}} = \dfrac{\partial{L}}{\partial{z}} \dfrac{\partial{z}}{\partial{W}} = \delta \dfrac{\partial{z}}{\partial{W}}$, where $\delta = \dfrac{\partial{L}}{\partial{z}} \in \mathbb{R}^{1 \times n}$ is a row vector.  
Let us analyze $\dfrac{\partial{z}}{\partial{W}}$ more closely. $\left(\dfrac{\partial{z}}{\partial{W}}\right)_{ij} = \dfrac{\partial{z}}{\partial{W_{ij}}}$, which is itself a column vector in $\mathbb{R}^{n \times 1}$.  
So $\dfrac{\partial{z}}{\partial{W}}$ is actually an $n \times m \times n$ tensor.  
Now, $z_k = \displaystyle \sum_{l=1}^m W_{kl}x_{l}$, so $\dfrac{\partial{z_k}}{\partial{W_{ij}}} = \displaystyle \sum_{l=1}^m x_l \dfrac{\partial{W_{kl}}}{\partial{W_{ij}}} = \begin{cases}
x_j, & \text{if } k=i\\  
0, & \text{otherwise}
\end{cases}$, since $\dfrac{\partial{W_{kl}}}{\partial{W_{ij}}}$ is 1 only when $k=i$ and $l=j$.  
So, $\dfrac{\partial{z}}{\partial{W_{ij}}}$ is a column vector whose $i^{th}$ element is $x_j$ and whose remaining elements are 0.  
$\dfrac{\partial{L}}{\partial{W_{ij}}} = \dfrac{\partial{L}}{\partial{z}} \dfrac{\partial{z}}{\partial{W_{ij}}} = \delta \dfrac{\partial{z}}{\partial{W_{ij}}} = \delta_i x_j$.  
Hence, $\dfrac{\partial{L}}{\partial{W}} = \delta^T x^T$.
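
The result $\dfrac{\partial{L}}{\partial{W_{ij}}} = \delta_i x_j$ says that the gradient with respect to $W$ is the outer product of $\delta$ and $x$. A small numerical check is sketched below. It assumes the toy loss $L = \frac{1}{2}\lVert z \rVert^2$, chosen only because its $\delta$ is simply $z$; the loss and all names are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 4
W = rng.normal(size=(n, m))
x = rng.normal(size=m)

def loss(W_):
    # Toy scalar loss that depends on W only through z = W x.
    z = W_ @ x
    return 0.5 * np.sum(z ** 2)

delta = W @ x                    # dL/dz for this particular loss
analytic = np.outer(delta, x)    # entry (i, j) is delta_i * x_j

# Perturb one weight at a time and compare with the analytic gradient.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        step = np.zeros_like(W)
        step[i, j] = eps
        numeric[i, j] = (loss(W + step) - loss(W - step)) / (2 * eps)

print(np.allclose(numeric, analytic, atol=1e-6))
```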

**Row vector times matrix with respect to the matrix**  
Let $z = xW$ and $L: \mathbb{R}^{n \times m} \to \mathbb{R}$, where $x \in \mathbb{R}^{1 \times n}$ and $W \in \mathbb{R}^{n \times m}$, so that $z \in \mathbb{R}^{1 \times m}$ and $\delta = \dfrac{\partial{L}}{\partial{z}} \in \mathbb{R}^{1 \times m}$.  
A similar methodology as above will show that $\dfrac{\partial{L}}{\partial{W}} = x^T \delta$.
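
The two row-vector identities can be checked in the same way. The sketch below takes $x$ as a $1 \times n$ row vector and $W$ as an $n \times m$ matrix so that the product $xW$ is defined, and again assumes the toy loss $L = \frac{1}{2}\lVert z \rVert^2$ (so $\delta = z$); these choices are ours, made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 4, 3
x = rng.normal(size=(1, n))    # row vector
W = rng.normal(size=(n, m))
eps = 1e-6

# Jacobian of z = xW with respect to x: entry (i, k) is dz_i / dx_k.
J = np.zeros((m, n))
for k in range(n):
    step = np.zeros_like(x)
    step[0, k] = eps
    J[:, k] = (((x + step) @ W) - ((x - step) @ W)).ravel() / (2 * eps)
print(np.allclose(J, W.T, atol=1e-6))

# Gradient of the toy loss with respect to W.
def loss(W_):
    return 0.5 * np.sum((x @ W_) ** 2)

delta = x @ W                  # dL/dz, shape (1, m)
analytic = x.T @ delta         # shape (n, m)
numeric = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        step = np.zeros_like(W)
        step[i, j] = eps
        numeric[i, j] = (loss(W + step) - loss(W - step)) / (2 * eps)
print(np.allclose(numeric, analytic, atol=1e-6))
```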

**COMPUTING THE NEURAL NETWORK GRADIENTS**

Let the following be the expressions for the outputs at each hidden layer, the final output, and the mean loss.  
$X_1 = W_1X + B_1$  
$A = \text{ReLU}(X_1)$  
$Z = W_2A + B_2$  
$\hat{y} = \text{softmax}(Z)$  
$L = -\displaystyle \sum_{i=1}^m y_i \log(\hat{y_i})$, where $\hat{y_i} = \dfrac{e^{z_i}}{\sum_{j=1}^{m} e^{z_j}}$, $z_i$ is the $i^{th}$ element of $Z$, $y$ is the one-hot encoded label, and $m$ here denotes the number of output classes.  
Since $\sum_{i=1}^{m} y_i = 1$, this simplifies to  
$L = - \displaystyle \sum_{i=1}^{m}y_iz_i + \log\sum_{j=1}^{m} e^{z_j}$  
$\dfrac{\partial{L}}{\partial{z_j}} = -{y_j} + \hat{y_j}$, i.e. $\dfrac{\partial{L}}{\partial{Z}} = \hat{y} - y$.
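
The expression $\dfrac{\partial{L}}{\partial{z_j}} = \hat{y_j} - y_j$ can be confirmed numerically for a single sample. The sketch below is illustrative: the helper functions, the number of classes, and the chosen label are our own assumptions.

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(4)
num_classes = 5
z = rng.normal(size=num_classes)
y = np.zeros(num_classes)
y[2] = 1.0                        # one-hot label (class 2, chosen arbitrarily)

analytic = softmax(z) - y         # y_hat - y

# Central finite differences on each logit.
eps = 1e-6
numeric = np.zeros_like(z)
for j in range(num_classes):
    step = np.zeros_like(z)
    step[j] = eps
    numeric[j] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * eps)

print(np.allclose(numeric, analytic, atol=1e-6))
```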

$\dfrac{\partial{L}}{\partial{W_2}} = \dfrac{\partial{L}}{\partial{Z}}\dfrac{\partial{Z}}{\partial{W_2}} = \left(\dfrac{\partial{L}}{\partial{Z}}\right)^T\cdot A^T$  
$\dfrac{\partial{L}}{\partial{B_2}} = \dfrac{\partial{L}}{\partial{Z}}\dfrac{\partial{Z}}{\partial{B_2}} = \dfrac{\partial{L}}{\partial{Z}}$  
$\dfrac{\partial{L}}{\partial{X_1}} = \dfrac{\partial{L}}{\partial{Z}}\dfrac{\partial{Z}}{\partial{A}}\dfrac{\partial{A}}{\partial{X_1}} = \dfrac{\partial{L}}{\partial{Z}}\cdot W_2 \cdot \text{diag}(\text{ReLU}'(X_1))$, i.e. the row vector $\dfrac{\partial{L}}{\partial{Z}}\cdot W_2$ multiplied elementwise by $\text{ReLU}'(X_1)$  
$\dfrac{\partial{L}}{\partial{W_1}} = \dfrac{\partial{L}}{\partial{Z}}\dfrac{\partial{Z}}{\partial{A}}\dfrac{\partial{A}}{\partial{X_1}}\dfrac{\partial{X_1}}{\partial{W_1}} = \left(\dfrac{\partial{L}}{\partial{X_1}}\right)^T\cdot X^T$  
$\dfrac{\partial{L}}{\partial{B_1}} = \dfrac{\partial{L}}{\partial{Z}}\dfrac{\partial{Z}}{\partial{A}}\dfrac{\partial{A}}{\partial{X_1}}\dfrac{\partial{X_1}}{\partial{B_1}} = \dfrac{\partial{L}}{\partial{X_1}}$  
When each sample is stored as a column, the gradients are kept as column vectors (the transposes of the row vectors above), so the product $\dfrac{\partial{L}}{\partial{Z}}\cdot W_2$ corresponds to $W_2^T \left(\dfrac{\partial{L}}{\partial{Z}}\right)^T$, and the bias gradients are summed over the samples.  
In the code, we divide the gradients by the number of samples, since we consider the mean of the gradients over various inputs.  
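
To see these formulas working together on a batch, the following sketch compares them against finite differences of the mean loss. It is a standalone illustration under our own assumptions: small random layer sizes, one sample per column, and hypothetical helper names (`forward`, `mean_loss`, `backward`). In this column layout the product with $W_2$ appears as `W2.T @ dZ`, and each bias gradient is summed over the samples.

```python
import numpy as np

def forward(params, X):
    W1, b1, W2, b2 = params
    X1 = W1 @ X + b1
    A = np.maximum(0, X1)
    Z = W2 @ A + b2
    Z = Z - Z.max(axis=0, keepdims=True)            # stabilise softmax
    Y_hat = np.exp(Z) / np.exp(Z).sum(axis=0, keepdims=True)
    return X1, A, Y_hat

def mean_loss(params, X, Y):
    _, _, Y_hat = forward(params, X)
    return -np.sum(Y * np.log(Y_hat)) / X.shape[1]

def backward(params, X, Y):
    W1, b1, W2, b2 = params
    n_samples = X.shape[1]
    X1, A, Y_hat = forward(params, X)
    dZ = Y_hat - Y
    dW2 = dZ @ A.T / n_samples
    db2 = dZ.sum(axis=1, keepdims=True) / n_samples
    dX1 = (W2.T @ dZ) * (X1 > 0)
    dW1 = dX1 @ X.T / n_samples
    db1 = dX1.sum(axis=1, keepdims=True) / n_samples
    return [dW1, db1, dW2, db2]

rng = np.random.default_rng(5)
n_in, n_hidden, n_out, n_samples = 6, 4, 3, 8
params = [rng.normal(size=(n_hidden, n_in)), rng.normal(size=(n_hidden, 1)),
          rng.normal(size=(n_out, n_hidden)), rng.normal(size=(n_out, 1))]
X = rng.normal(size=(n_in, n_samples))
Y = np.eye(n_out)[:, rng.integers(0, n_out, size=n_samples)]  # one-hot columns

grads = backward(params, X, Y)

# Perturb every parameter in turn and record the largest disagreement.
eps = 1e-6
max_err = 0.0
for p, g in zip(params, grads):
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + eps
        loss_plus = mean_loss(params, X, Y)
        p[idx] = old - eps
        loss_minus = mean_loss(params, X, Y)
        p[idx] = old
        max_err = max(max_err, abs((loss_plus - loss_minus) / (2 * eps) - g[idx]))

print(f"largest gradient mismatch: {max_err:.2e}")
```

A mismatch far below the size of the gradients themselves suggests the analytic expressions and the averaging over samples are consistent with the loss.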

In [None]:
import numpy as np
import pandas as pd

In [None]:
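path = "/content/sample_data/mnist_train_small.csv"
# Note: the file has no header row, so read_csv consumes the first sample as
# column names (see the "6, 0, 0.1, ..." header below). Passing header=None
# would keep that sample as data.
data = pd.read_csv(path)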

In [None]:
data.head()

Unnamed: 0,6,0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,...,0.581,0.582,0.583,0.584,0.585,0.586,0.587,0.588,0.589,0.590
0,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
data = np.array(data)
m, n = data.shape
data.shape

(19999, 785)

In [None]:
data_valid = data[0:1000].T
y_valid = data_valid[0]
X_valid = data_valid[1:n]
X_valid = X_valid/255

data_train = data[1000:m].T
y_train = data_train[0]
X_train = data_train[1:n]
X_train = X_train/255

In [None]:
class Neural_Network:
    def __init__(self, lr, n_iter):
        self.lr = lr
        self.n_iter = n_iter
        self.initialize()

    def initialize(self):
        self.W1 = np.random.rand(10,784) - 0.5
        self.b1 = np.random.rand(10,1) - 0.5
        self.W2 = np.random.rand(10,10) - 0.5
        self.b2 = np.random.rand(10,1) - 0.5

    def ReLU(self, X):
        return np.maximum(0,X)

    def ReLU_deriv(self, X):
        return X>=0

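    def softmax(self, X):
        # Subtract the column-wise max for numerical stability, then
        # normalise each column (one column per sample).
        exp_X = np.exp(X - np.max(X, axis=0, keepdims=True))
        return exp_X / np.sum(exp_X, axis=0, keepdims=True)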

    def forward_prop(self, X):
        X1 = np.dot(self.W1, X) + self.b1
        X1_activated = self.ReLU(X1)

        X2 = np.dot(self.W2, X1_activated) + self.b2
        output = self.softmax(X2)

        return X1, X1_activated, X2, output

    def one_hot_encoding(self, y):
        y_encoded = np.zeros((y.size, y.max()+1))
        y_encoded[np.arange(y.size), y] = 1
        y_encoded = y_encoded.T
        return y_encoded

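    def backward_prop(self, X, y, X1, X1_activated, output):
        n_samples = X.shape[1]  # size of this batch, not the global m
        y_encoded = self.one_hot_encoding(y)
        dX2 = output - y_encoded
        dW2 = 1/n_samples * dX2.dot(X1_activated.T)
        # Sum over samples only, so each unit keeps its own bias gradient
        db2 = 1/n_samples * np.sum(dX2, axis=1, keepdims=True)
        # Backpropagate through the second layer using the transpose of W2
        dX1 = np.dot(self.W2.T, dX2) * self.ReLU_deriv(X1)
        dW1 = 1/n_samples * np.dot(dX1, X.T)
        db1 = 1/n_samples * np.sum(dX1, axis=1, keepdims=True)
        return dW1, db1, dW2, db2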

    def update(self, dW1, db1, dW2, db2):
        self.W1 = self.W1 - self.lr*dW1
        self.b1 = self.b1 - self.lr*db1
        self.W2 = self.W2 - self.lr*dW2
        self.b2 = self.b2 - self.lr*db2

    def predict(self, X):
        _, _, _, output = self.forward_prop(X)
        return np.argmax(output, 0)

    def accuracy(self, predictions, y):
        return (predictions == y).sum()/y.size

    def gradient_descent(self, X, y):
        for i in range(self.n_iter):
            X1, X1_activated, X2, output = self.forward_prop(X)
            dW1, db1, dW2, db2 = self.backward_prop(X, y, X1, X1_activated, output)
            self.update(dW1, db1, dW2, db2)
            if i%10==0:
                predictions = self.predict(X)
                print(f"Iteration {i}: Accuracy = {self.accuracy(predictions, y):.4f}")

In [None]:
model = Neural_Network(0.1, 500)
model.gradient_descent(X_train, y_train)

Iteration 0: Accuracy = 0.1201
Iteration 10: Accuracy = 0.1326
Iteration 20: Accuracy = 0.1623
Iteration 30: Accuracy = 0.2665
Iteration 40: Accuracy = 0.3903
Iteration 50: Accuracy = 0.4701
Iteration 60: Accuracy = 0.5184
Iteration 70: Accuracy = 0.5515
Iteration 80: Accuracy = 0.5768
Iteration 90: Accuracy = 0.5969
Iteration 100: Accuracy = 0.6132
Iteration 110: Accuracy = 0.6274
Iteration 120: Accuracy = 0.6413
Iteration 130: Accuracy = 0.6555
Iteration 140: Accuracy = 0.6691
Iteration 150: Accuracy = 0.6813
Iteration 160: Accuracy = 0.6938
Iteration 170: Accuracy = 0.7066
Iteration 180: Accuracy = 0.7171
Iteration 190: Accuracy = 0.7275
Iteration 200: Accuracy = 0.7376
Iteration 210: Accuracy = 0.7449
Iteration 220: Accuracy = 0.7518
Iteration 230: Accuracy = 0.7580
Iteration 240: Accuracy = 0.7645
Iteration 250: Accuracy = 0.7691
Iteration 260: Accuracy = 0.7744
Iteration 270: Accuracy = 0.7794
Iteration 280: Accuracy = 0.7847
Iteration 290: Accuracy = 0.7887
Iteration 300: Accura

In [None]:
predictions_valid = model.predict(X_valid)
accuracy_valid = model.accuracy(predictions_valid, y_valid)
print(f"Validation Accuracy: {accuracy_valid:.4f}")

Validation Accuracy: 0.8110


In [None]:
testing_data = pd.read_csv("/content/sample_data/mnist_test.csv")

In [None]:
testing_data = np.array(testing_data)

m_test, n_test = testing_data.shape

y_test = testing_data[:, 0]
X_test = testing_data[:, 1:n_test]

X_test = X_test / 255.0

In [None]:
predictions_test = model.predict(X_test.T)
accuracy_test = model.accuracy(predictions_test, y_test)
print(f"Testing Accuracy: {accuracy_test:.4f}")

Testing Accuracy: 0.8322
