# **Neural Network from Scratch**

---

The goal is to implement a complete neural network from scratch with only dense and activation layers. I will be training the network on the MNIST dataset.

## **Table of Contents**

1. [The Plan](#1.-The-Plan)
2. [Creating the base layer](#2.-Creating-the-base-layer)
3. [Creating the dense layer](#3.-Creating-the-dense-layer)
4. [Creating the activation layer](#4.-Creating-the-activation-layer)
5. [Implementing activation functions](#5.-Implementing-activation-functions)
6. [Implementing loss functions](#6.-Implementing-loss-functions)
7. [Solving MNIST](#7.-Solving-MNIST)  
   7.1 [Creating the network](#7.1-Creating-the-network)  
   7.2 [Testing the network](#7.2-Testing-the-network)


## **1. The Plan**

---

1. Create a base class for a layer with only `forward` and `backward` methods, both left empty.
2. Create a class for a dense layer that inherits from the layer class.
3. Create a layer with an activation function.
4. Implement activation functions and loss functions.
5. Solve MNIST.


First, I will import the necessary libraries.


In [273]:
import numpy as np
from keras.datasets import mnist
from keras.utils import to_categorical

## **2. Creating the base layer**

[Table of Contents](#Table-of-Contents)

---


In [274]:
class Layer:
    def __init__(self):
        self.input = None
        self.output = None
    
    def forward(self, input):
        pass
    
    def backward(self, output_gradient, learning_rate):
        pass

Here we create a base class for the layers. This class will have an input and output variable that will be used to store the input and output of the layer.

The `forward` and `backward` methods are empty and will be implemented in the child classes.
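To make that contract concrete, here is a minimal sketch of a child class. The `Identity` layer is hypothetical and is not part of the network built later; it only illustrates which methods a subclass is expected to override and what each one returns.

```python
import numpy as np


class Layer:
    def __init__(self):
        self.input = None
        self.output = None

    def forward(self, input):
        pass

    def backward(self, output_gradient, learning_rate):
        pass


class Identity(Layer):
    """Hypothetical layer that passes data through unchanged."""

    def forward(self, input):
        # Store the input so backward() could use it if needed.
        self.input = input
        self.output = input
        return self.output

    def backward(self, output_gradient, learning_rate):
        # No parameters to update; the gradient flows back unchanged.
        return output_gradient


x = np.array([[1.0], [2.0]])
layer = Identity()
y = layer.forward(x)
grad = layer.backward(np.ones_like(x), learning_rate=0.1)
```

Every real layer follows the same pattern: `forward` returns the layer output, and `backward` returns the gradient with respect to the layer input.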


## **3. Creating the dense layer**

[Table of Contents](#Table-of-Contents)

---


In [275]:
class Dense(Layer):
    def __init__(self, input_size, output_size):
        self.weights = np.random.randn(output_size, input_size)
        self.bias = np.random.randn(output_size, 1)
    
    def forward(self, input):
        self.input = input
        return np.dot(self.weights, self.input) + self.bias

    def backward(self, output_gradient, learning_rate):
        weights_gradient = np.dot(output_gradient, self.input.T)
        input_gradient = np.dot(self.weights.T, output_gradient)
        self.weights -= learning_rate * weights_gradient
        self.bias -= learning_rate * output_gradient
        return input_gradient

Now we can create a dense layer that inherits from the layer class. The dense layer stores a `weights` matrix of shape `(output_size, input_size)` and a `bias` column vector of shape `(output_size, 1)`, both initialized from a standard normal distribution.

The forward function will take an input and return the output of the layer. The input is a column vector of shape `(input_size, 1)`; the output is calculated by multiplying the weight matrix by the input and adding the bias.

```python
def forward(self, input):
    self.input = input
    return np.dot(self.weights, self.input) + self.bias
```

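The backward function takes the gradient of the loss with respect to the layer's output and returns the gradient with respect to its input, which is the dot product of the transposed weights and the output gradient. The weights gradient is the dot product of the output gradient and the transposed input, and the bias gradient is the output gradient itself. Both parameters are then updated by one gradient descent step scaled by the learning rate.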

```python
def backward(self, output_gradient, learning_rate):
    weights_gradient = np.dot(output_gradient, self.input.T)
    input_gradient = np.dot(self.weights.T, output_gradient)
    self.weights -= learning_rate * weights_gradient
    self.bias -= learning_rate * output_gradient
    return input_gradient
```
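One way to gain confidence in the backward pass is a finite-difference check. The sketch below restates the `Dense` maths without the `Layer` base class so it runs on its own, and uses a toy loss `E = sum(output)` that I chose only for this check. Passing `learning_rate=0.0` is a trick to get the input gradient without changing the parameters.

```python
import numpy as np


class Dense:
    # Same maths as the Dense layer above, without the Layer base class.
    def __init__(self, input_size, output_size):
        self.weights = np.random.randn(output_size, input_size)
        self.bias = np.random.randn(output_size, 1)

    def forward(self, input):
        self.input = input
        return np.dot(self.weights, self.input) + self.bias

    def backward(self, output_gradient, learning_rate):
        weights_gradient = np.dot(output_gradient, self.input.T)
        input_gradient = np.dot(self.weights.T, output_gradient)
        self.weights -= learning_rate * weights_gradient
        self.bias -= learning_rate * output_gradient
        return input_gradient


np.random.seed(0)
layer = Dense(3, 2)
x = np.random.randn(3, 1)   # column vector, shape (3, 1)
out = layer.forward(x)      # shape (2, 1)

# Toy scalar loss E = sum(out), so dE/d(out) is a vector of ones.
output_gradient = np.ones_like(out)

# Central finite-difference estimate of dE/dx.
eps = 1e-6
numeric = np.zeros_like(x)
for i in range(x.shape[0]):
    step = np.zeros_like(x)
    step[i] = eps
    plus = np.sum(layer.forward(x + step))
    minus = np.sum(layer.forward(x - step))
    numeric[i] = (plus - minus) / (2 * eps)

# learning_rate=0.0 leaves the parameters untouched while still returning dE/dx.
layer.forward(x)
analytic = layer.backward(output_gradient, learning_rate=0.0)

print(np.allclose(analytic, numeric, atol=1e-5))
```

If the two gradients agree, the expression `np.dot(self.weights.T, output_gradient)` is consistent with the forward pass.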


## **4. Creating the activation layer**

[Table of Contents](#Table-of-Contents)

---


In [276]:
class Activation(Layer):
    def __init__(self, activation, activation_prime):
        self.activation = activation
        self.activation_prime = activation_prime

    def forward(self, input):
        self.input = input
        return self.activation(self.input)

    def backward(self, output_gradient, learning_rate):
        return np.multiply(output_gradient, self.activation_prime(self.input))

The next step is to create an activation layer that inherits from the layer class. The activation layer stores two functions, `activation` and `activation_prime`: the activation function and its derivative.

> The derivative tells us the rate of change of the activation function at a certain point. This is used to calculate the gradient of the input.

The forward function will take an input and return the output of the layer. The output is calculated by applying the activation function to the input.

```python
def forward(self, input):
    self.input = input
    return self.activation(input)
```

The backward function will take a gradient and return the gradient of the input. The gradient of the input is calculated by multiplying the output gradient element-wise by the derivative of the activation function evaluated at the stored input.

```python
def backward(self, output_gradient, learning_rate):
    return np.multiply(output_gradient, self.activation_prime(self.input))
```
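As a small worked example of that element-wise product, the sketch below uses tanh as the activation and compares its derivative against a central finite difference. The input values and the `upstream` gradient are made up for illustration.

```python
import numpy as np


def tanh(x):
    return np.tanh(x)


def tanh_prime(x):
    return 1 - np.tanh(x) ** 2


x = np.array([[-1.0], [0.0], [2.0]])
upstream = np.array([[0.5], [-1.0], [2.0]])  # stand-in for output_gradient

# What Activation.backward computes: element-wise product, same shape as x.
input_gradient = np.multiply(upstream, tanh_prime(x))

# Central finite difference of tanh as an independent check of tanh_prime.
eps = 1e-6
numeric_prime = (tanh(x + eps) - tanh(x - eps)) / (2 * eps)

print(np.allclose(tanh_prime(x), numeric_prime, atol=1e-6))
```

Because the activation acts on each element independently, the gradient keeps the shape of the input and no matrix product is needed.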


## **5. Implementing activation functions**

[Table of Contents](#Table-of-Contents)

---


In [277]:
class Tanh(Activation):
    def __init__(self):
        def tanh(x):
            return np.tanh(x)

        def tanh_prime(x):
            return 1 - np.tanh(x) ** 2

        super().__init__(tanh, tanh_prime)

class Softmax(Layer):
    def forward(self, input):
        tmp = np.exp(input)
        self.output = tmp / np.sum(tmp)
        return self.output
    
    def backward(self, output_gradient, learning_rate):
        n = np.size(self.output)
        return np.dot((np.identity(n) - self.output.T) * self.output, output_gradient)

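Above are the implementations of the tanh and softmax activation functions. The tanh function is used as the activation function for the hidden layers and the softmax function is used as the activation function for the output layer. Softmax inherits from `Layer` directly rather than from `Activation` because each of its outputs depends on every input, so its backward pass multiplies by the full Jacobian instead of an element-wise derivative.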
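The `Softmax.forward` method exponentiates the raw input, which can overflow for large values. A common remedy, sketched here as an optional variant rather than what the notebook uses, is to subtract the maximum before exponentiating. The names `softmax_naive` and `softmax_stable` are mine. The same sketch also builds the Jacobian the way `Softmax.backward` does, to show it has the entries `s_i * (delta_ij - s_j)`.

```python
import numpy as np


def softmax_naive(x):
    tmp = np.exp(x)
    return tmp / np.sum(tmp)


def softmax_stable(x):
    # Subtracting the maximum leaves the result unchanged mathematically
    # but keeps np.exp from overflowing on large inputs.
    tmp = np.exp(x - np.max(x))
    return tmp / np.sum(tmp)


x = np.array([[1.0], [2.0], [3.0]])
s = softmax_stable(x)

# Jacobian as built in Softmax.backward: J[i, j] = s_i * (delta_ij - s_j).
n = np.size(s)
jacobian = (np.identity(n) - s.T) * s

# Inputs this large would overflow in softmax_naive.
big = np.array([[1000.0], [1001.0]])
s_big = softmax_stable(big)
```

The Jacobian is symmetric, which is why `backward` can multiply by it directly without transposing.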


## **6. Implementing loss functions**

[Table of Contents](#Table-of-Contents)

---


In [278]:
def mse(y_true, y_pred):
    return np.mean(np.power(y_true - y_pred, 2))

def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / np.size(y_true)

Here I have implemented the mean squared error loss function. The loss function takes the true and predicted values and returns the mean squared error. The derivative of the loss function takes the true and predicted values and returns the gradient of the loss function.
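A small worked example makes the two functions easier to follow. The target and prediction below are made-up values chosen so the arithmetic can be checked by hand.

```python
import numpy as np


def mse(y_true, y_pred):
    return np.mean(np.power(y_true - y_pred, 2))


def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / np.size(y_true)


y_true = np.array([[1.0], [0.0]])
y_pred = np.array([[0.5], [0.5]])

loss = mse(y_true, y_pred)        # mean of [0.25, 0.25] -> 0.25
grad = mse_prime(y_true, y_pred)  # 2 * [-0.5, 0.5] / 2 -> [[-0.5], [0.5]]
```

The gradient is negative where the prediction is too low and positive where it is too high, so a gradient descent step moves each prediction toward its target.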


## **7. Solving MNIST**

[Table of Contents](#Table-of-Contents)

---


#### **7.1 Creating the network**


In [279]:
def predict(network, input):
    output = input
    for layer in network:
        output = layer.forward(output)
    return output


def train(network, loss, loss_prime, x_train, y_train, epochs=1000, learning_rate=0.01, verbose=True):
    for e in range(epochs):
        error = 0
        for x, y in zip(x_train, y_train):
            # Forward.
            output = predict(network, x)

            # Error.
            error += loss(y, output)

            # Backward.
            grad = loss_prime(y, output)
            for layer in reversed(network):
                grad = layer.backward(grad, learning_rate)

        error /= len(x_train)
        if verbose:
            print(f"{e + 1}/{epochs}, error={error}")

First, I have implemented the `predict` and `train` functions. `predict` passes an input through every layer of the network in order and returns the final output. `train` runs for the given number of epochs; for each sample it performs a forward pass, accumulates the loss, and propagates the loss gradient backward through the layers in reverse order, updating the parameters after every sample. If `verbose` is set, it prints the mean error at the end of each epoch.
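To see the training loop at work without MNIST, here is a minimal sketch on a single made-up sample with a one-layer network. `train_with_history` is a hypothetical variant of `train` that returns the per-epoch mean error instead of printing it; the `Dense` class and loss functions are restated so the block runs on its own.

```python
import numpy as np


class Dense:
    def __init__(self, input_size, output_size):
        self.weights = np.random.randn(output_size, input_size)
        self.bias = np.random.randn(output_size, 1)

    def forward(self, input):
        self.input = input
        return np.dot(self.weights, self.input) + self.bias

    def backward(self, output_gradient, learning_rate):
        weights_gradient = np.dot(output_gradient, self.input.T)
        input_gradient = np.dot(self.weights.T, output_gradient)
        self.weights -= learning_rate * weights_gradient
        self.bias -= learning_rate * output_gradient
        return input_gradient


def mse(y_true, y_pred):
    return np.mean(np.power(y_true - y_pred, 2))


def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / np.size(y_true)


def train_with_history(network, loss, loss_prime, x_train, y_train,
                       epochs, learning_rate):
    """Hypothetical variant of train() that returns the per-epoch mean error."""
    history = []
    for _ in range(epochs):
        error = 0.0
        for x, y in zip(x_train, y_train):
            # Forward pass through every layer.
            output = x
            for layer in network:
                output = layer.forward(output)
            error += loss(y, output)
            # Backward pass in reverse order, updating parameters as we go.
            grad = loss_prime(y, output)
            for layer in reversed(network):
                grad = layer.backward(grad, learning_rate)
        history.append(error / len(x_train))
    return history


np.random.seed(0)
# One training sample: 3 inputs, 2 targets, both as column vectors.
x_train = np.array([[[1.0], [2.0], [3.0]]])  # shape (1, 3, 1)
y_train = np.array([[[1.0], [0.0]]])         # shape (1, 2, 1)

network = [Dense(3, 2)]
history = train_with_history(network, mse, mse_prime, x_train, y_train,
                             epochs=20, learning_rate=0.01)
print(history[0], history[-1])
```

For this linear toy problem the error shrinks at every epoch, which is a quick sanity check that the forward and backward passes fit together before moving on to the larger network.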


#### **7.2 Testing the network**


Here I have tested the network on the MNIST dataset. The network was trained on 1000 samples and tested on 20 samples. The output of the network was compared to the true values and printed to the console. The accuracy of the network is shown below.

The `preprocess_data` function is used to preprocess the input and output data. It reshapes the input data `x` from 28×28 images into column vectors of shape (784, 1), so each image matches the input layout the `Dense` layer expects. The pixel intensities are then normalized by dividing by 255 to bring them into the range [0, 1]. The `to_categorical` function converts the labels `y` into one-hot vectors, which are then reshaped into column vectors of shape (10, 1). Finally, the function returns the first `limit` images and labels.
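The shapes involved can be checked on synthetic data without downloading MNIST. In this sketch, random arrays stand in for the images and labels, and `np.eye` indexing stands in for `to_categorical`; the function name `preprocess_synthetic` is hypothetical.

```python
import numpy as np


def preprocess_synthetic(x, y, limit, num_classes=10):
    """Sketch of the same steps, with np.eye standing in for to_categorical."""
    # Flatten each 28x28 image into a (784, 1) column vector.
    x = x.reshape(x.shape[0], 28 * 28, 1)
    # Scale pixel intensities from [0, 255] to [0, 1].
    x = x.astype("float32") / 255
    # One-hot encode the labels, then make each one a (10, 1) column vector.
    y = np.eye(num_classes)[y]
    y = y.reshape(y.shape[0], num_classes, 1)
    return x[:limit], y[:limit]


rng = np.random.default_rng(0)
# Random stand-ins for MNIST images and labels.
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
fake_labels = np.array([3, 0, 9, 1, 3])

x, y = preprocess_synthetic(fake_images, fake_labels, limit=4)
print(x.shape, y.shape)
```

Iterating over the first axis of `x` and `y` then yields one column vector per sample, which is the form the `train` loop consumes.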


In [280]:
def preprocess_data(x, y, limit):
    x = x.reshape(x.shape[0], 28 * 28, 1)
    x = x.astype("float32") / 255
    y = to_categorical(y)
    y = y.reshape(y.shape[0], 10, 1)
    return x[:limit], y[:limit]


# load MNIST from server
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, y_train = preprocess_data(x_train, y_train, 1000)
x_test, y_test = preprocess_data(x_test, y_test, 20)

network = [
    Dense(28 * 28, 40),
    Tanh(),
    Dense(40, 10),
    Softmax()
]

train(network, mse, mse_prime, x_train, y_train, epochs=100, learning_rate=0.1)

# Loop over the test set, and print the prediction and the true value. Zip is used to iterate over two lists at the same time.
for x, y in zip(x_test, y_test):
    output = predict(network, x)
    print('pred:', np.argmax(output), '\ttrue:', np.argmax(y))
    

1/100, error=0.14131456196100928
2/100, error=0.1281179909229619
3/100, error=0.11820417521838031
4/100, error=0.11070249217766505
5/100, error=0.10487852202268258
6/100, error=0.10063131666280356
7/100, error=0.09626106259373281
8/100, error=0.09140230224424824
9/100, error=0.08741980521998745
10/100, error=0.08377758956217897
11/100, error=0.08043449369946196
12/100, error=0.07753578591514661
13/100, error=0.07494677168794342
14/100, error=0.07141283999120455
15/100, error=0.06816689964832788
16/100, error=0.06502347167172204
17/100, error=0.06153873985104281
18/100, error=0.05701315785895945
19/100, error=0.05292273729074436
20/100, error=0.04951750740229359
21/100, error=0.04681684754605347
22/100, error=0.044698000657800406
23/100, error=0.04292580928427874
24/100, error=0.041183768194564076
25/100, error=0.03961781683957099
26/100, error=0.03815696846172812
27/100, error=0.03680524693473054
28/100, error=0.03557785159153293
29/100, error=0.034517921432361175
30/100, error=0.03359

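The accuracy of the network is 65%, which is not very good. The shallow architecture, with a single hidden layer, is one likely cause. The setup also limits the result: the network was trained on only 1000 samples, and with a 20-sample test set each prediction shifts the accuracy by 5 percentage points, so the estimate is noisy.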


In [282]:
correct = 0
for x, y in zip(x_test, y_test):
    output = predict(network, x)
    if np.argmax(output) == np.argmax(y):
        correct += 1
print(f"Accuracy: {correct} / {len(x_test)} = {correct / len(x_test) * 100:.2f}%")

Accuracy: 13 / 20 = 65.00%
