
MNIST Neural Network with NumPy

A Neural Network, trained on the MNIST database, with the task of classifying handwritten digits. Made without the use of ML libraries such as TensorFlow or PyTorch, using only NumPy. Currently runs at 98.4% accuracy!

Contents

→ MNIST Database
→ Overview on How the Network Learns
→ Network Structure
→ Activation Functions
→ One-Hot Encoding
→ Loss Function
→ Mini-Batch Gradient Descent
→ Gradient Descent with Momentum
→ Back Propagation
→ He Initialisation
→ Analysing Results

MNIST Database

The MNIST database is a large collection of 70,000 images of handwritten digits.

Each image is greyscale, 28 × 28 pixels in size, and contains a single handwritten digit.

MNIST Example Images

For the purpose of this network, 60,000 images are reserved for training and the remaining 10,000 are used for testing the network on unseen images.

Overview on How the Network Learns

  1. Initialisation: The network's weights and biases are initialised.
  2. Forward Propagation: The input data is passed through the network and predictions are obtained.
  3. Computing Loss: The predictions are passed through the loss function to measure the accuracy of the network.
  4. Backward Propagation: Using partial derivatives, compute the gradient of the loss function with respect to each weight and bias in the network.
  5. Update Parameters: Update the weights and biases in the direction opposite to the gradient, moving towards a point of minimum loss.
  6. Repeat steps 2-5 (a minimal sketch of this loop is given below).
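
The sketch below shows this loop in NumPy, assuming hypothetical helper methods (`forward`, `compute_loss`, `backward`, `update_parameters`) rather than the exact names used in this repository:

```python
import numpy as np

# A rough sketch of the training loop; the method names on `network`
# are placeholders, not the repository's actual API.
def train(network, X_train, Y_train, epochs, batch_size, learning_rate):
    m = X_train.shape[1]                                   # number of training examples
    for epoch in range(epochs):
        permutation = np.random.permutation(m)             # shuffle each epoch
        for start in range(0, m, batch_size):
            idx = permutation[start:start + batch_size]
            X_batch, Y_batch = X_train[:, idx], Y_train[idx]

            predictions = network.forward(X_batch)               # 2. forward propagation
            loss = network.compute_loss(predictions, Y_batch)    # 3. compute loss
            network.backward(Y_batch)                            # 4. backward propagation
            network.update_parameters(learning_rate)             # 5. update parameters
```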

Network Structure

This image is a simplified version of the network's architecture. It contains an input layer, hidden layers and an output layer. One important thing to note is that the output layer contains 10 nodes, corresponding to the 10 digits (0-9) that the network is attempting to classify.

The ReLU functions in the hidden layers introduce non-linearity and a softmax function is applied to the output to convert the raw output (logits) to probabilities, which sum to 1.

In reality, the input layer contains 784 (28 × 28) input features and there are many more neurons in the hidden layers; however, the general architecture is the same between the neural network in this image and the one in the code.

Neural Network Architecture Image

Activation Functions

ReLU

$$ ReLU(z) = \begin{cases} 0 &\text{if } z \le 0 \\ z &\text{if } z > 0 \\ \end{cases} $$

Softmax

$$ Softmax(z_i) = \frac{e^{z_i}}{\sum_k e^{z_k}} $$
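
Both functions can be written directly in NumPy. The sketch below is an assumption of how they might look (the max-subtraction in the softmax is a standard numerical-stability trick and may differ from the repository's implementation):

```python
import numpy as np

def relu(z):
    # Element-wise ReLU: 0 for negative inputs, identity otherwise.
    return np.maximum(0, z)

def softmax(z):
    # Subtracting the column-wise maximum prevents exp() from overflowing
    # and does not change the result.
    shifted = z - np.max(z, axis=0, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)

logits = np.array([[2.0], [1.0], [0.1]])        # raw outputs for one example
print(relu(np.array([-1.0, 0.5])))              # [0.  0.5]
print(softmax(logits).sum(axis=0))              # each column sums to 1
```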

One-Hot Encoding

When representing the categorical output data $y$ we can use one-hot encoding. This involves representing the data as a binary vector.

For example, we can encode the number $3$ as

$$ [0,0,0,1,0,0,0,0,0,0] $$

Here, each number (0-9) is represented by setting its respective index (starting from 0) to 1 and all others to 0. This works particularly well when paired with the softmax output and the categorical cross-entropy loss function.
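
A minimal NumPy sketch of this encoding (the repository defines a similar `one_hot` helper, but its exact signature is assumed here):

```python
import numpy as np

def one_hot(Y, num_classes=10):
    # Y holds integer labels with shape (m,).
    # Returns a (num_classes, m) matrix with a 1 at each label's index.
    encoded = np.zeros((num_classes, Y.size))
    encoded[Y, np.arange(Y.size)] = 1
    return encoded

print(one_hot(np.array([3])).ravel())
# [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```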

Loss Function

Categorical Cross-Entropy Loss

$$ L = -\sum_{i=1}^H y_i \cdot \log \hat{y_i} $$

Where $H$ is the number of categories (10 in this case), $y_i$ is the true one-hot label and $\hat{y_i}$ is the predicted probability for category $i$.
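
A hedged NumPy sketch of this loss, averaged over a mini-batch (the small epsilon guarding against log(0) is an assumption, not necessarily something the repository uses):

```python
import numpy as np

def categorical_cross_entropy(Y_true, Y_pred, eps=1e-12):
    # Y_true (one-hot) and Y_pred (softmax probabilities) both have shape (num_classes, m).
    # Sum over the classes for each example, then average over the mini-batch.
    per_example_loss = -np.sum(Y_true * np.log(Y_pred + eps), axis=0)
    return np.mean(per_example_loss)

Y_true = np.array([[0.0], [1.0], [0.0]])     # one-hot label for class 1
Y_pred = np.array([[0.1], [0.8], [0.1]])     # predicted probabilities
print(categorical_cross_entropy(Y_true, Y_pred))   # ~0.223
```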

Mini-Batch Gradient Descent

Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches. The weights and biases are updated after each mini-batch, unlike batch gradient descent, where the entire training set is processed as a single unit before every update.
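
A small sketch of how the training set could be split into shuffled mini-batches (variable names and shapes are illustrative assumptions; the repository may organise this differently):

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size):
    # X has shape (num_features, m); Y has shape (m,).
    m = X.shape[1]
    permutation = np.random.permutation(m)      # shuffle examples each epoch
    for start in range(0, m, batch_size):
        idx = permutation[start:start + batch_size]
        yield X[:, idx], Y[idx]

X = np.random.rand(784, 1024)                   # a small dummy dataset
Y = np.random.randint(0, 10, size=1024)
for X_batch, Y_batch in iterate_minibatches(X, Y, batch_size=128):
    pass                                        # parameters would be updated once per batch here
```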

Gradient Descent with Momentum

Gradient Descent with Momentum is an optimisation technique that allows the network to converge faster to the optimal solution. It works by calculating an exponentially weighted average of the gradients; this averaged gradient is then used to update the weights and biases.

Velocity Parameter Update

$$ \begin{align*} V_{\delta w} &:= \beta \cdot V_{\delta w} + (1-\beta)\cdot \delta w\\ V_{\delta b} &:= \beta \cdot V_{\delta b} + (1-\beta)\cdot \delta b \end{align*} $$

Weight and Bias Update

$$ \begin{align*} w &:= w-\alpha V_{\delta w}\\ b &:= b-\alpha V_{\delta b} \end{align*} $$
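
A NumPy sketch of these two update steps for a single layer (the values of $\beta$ and $\alpha$ below are typical choices and an assumption, not necessarily the repository's settings):

```python
import numpy as np

def momentum_update(W, b, dW, db, V_dW, V_db, alpha=0.5, beta=0.9):
    # Velocity update: exponentially weighted average of the gradients.
    V_dW = beta * V_dW + (1 - beta) * dW
    V_db = beta * V_db + (1 - beta) * db
    # Weight and bias update using the averaged gradients.
    W = W - alpha * V_dW
    b = b - alpha * V_db
    return W, b, V_dW, V_db

W, b = np.random.randn(64, 784) * 0.01, np.zeros((64, 1))
V_dW, V_db = np.zeros_like(W), np.zeros_like(b)                 # velocities start at zero
dW, db = np.random.randn(*W.shape), np.random.randn(*b.shape)   # dummy gradients
W, b, V_dW, V_db = momentum_update(W, b, dW, db, V_dW, V_db)
```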

Back Propagation

Deriving $\left(\frac{\delta L}{\delta z_k}\right)$

By the previously stated definition of Categorical Cross Entropy Loss

$$ L = -\sum_{i=1} y_i \log(o_i) $$

where $o_i$ is the output of the softmax function, given by

$$ o_i = \frac{e^{z_i}}{\sum_{j=1} e^{z_j}} $$

Now, using partial derivatives

$$ \begin{align*} \frac{\delta L}{\delta z_k} &= \frac{\delta(-\sum_{i=1}y_i\log(o_i))}{\delta(z_k)}\\ &= -\sum_{i=1} \frac{\delta(y_i \log(o_i))}{\delta(z_k)}\\ &= -\sum_{i=1} y_i\cdot \frac{\delta(\log(o_i))}{\delta(z_k)}\hspace{10pt} \text{Since } y_i \text{ is independent of } z_k \end{align*} $$

By the chain rule we find that

$$ \frac{\delta{\log(o_i)}}{\delta{z_k}} = \frac{\delta(\log(o_i))}{\delta(o_i)}\cdot \frac{\delta(o_i)}{\delta(z_k)} $$

Substituting this back in

$$ \frac{\delta L}{\delta z_k} = -\sum_{i=1} \left[ y_i \cdot \frac{\delta(\log(o_i))}{\delta(o_i)}\cdot \frac{\delta(o_i)}{\delta(z_k)}\right] $$

Since $\frac{\delta(\log(o_i))}{\delta(o_i)} = \frac{1}{o_i}$

$$ \frac{\delta L}{\delta z_k} = -\sum_{i=1} \left[ \frac{y_i}{o_i} \cdot \frac{\delta(o_i)}{\delta(z_k)}\right] $$

$\frac{\delta o_i}{\delta z_k}$ is the derivative of the softmax output with respect to the input $z_k$, which is given by

$$ \frac{\delta o_i}{\delta z_k} = \begin{cases} o_k(1-o_k) &\text{if } i = k \\ -o_i \cdot o_k &\text{if } i \ne k \\ \end{cases} $$

In order to use this definition within our equation for $\frac{\delta L}{\delta z_k}$ we must split the sum into the $i \ne k$ terms and the $i = k$ term

$$ \begin{align*} \frac{\delta L}{\delta z_k} &= -\left[ \sum_{i\ne k}\left(\frac{y_i}{o_i}\cdot (-o_i\cdot o_k) \right) + \frac{y_k}{o_k}\cdot o_k(1-o_k)\right]\\ &= -\left[ \sum_{i\ne k}\left(-y_i\cdot o_k \right) + y_k(1-o_k)\right]\\ &= -\left[-o_k\sum_{i\ne k}y_i + y_k(1-o_k)\right] \end{align*} $$

Since $y_i$ is one-hot encoded, we know that

$$ \sum_{i=1}y_i = 1 \hspace{10pt} \text{ and } \hspace{10pt} \sum_{i\ne k}y_i = 1 - y_k $$

Therefore

$$ \begin{align*} \frac{\delta L}{\delta z_k} &= -\left[-o_k(1-y_k) + y_k(1-o_k)\right]\\ &= -\left[-o_k + o_k\cdot y_k + y_k -o_k\cdot y_k\right]\\ &= o_k - y_k \end{align*} $$

Which is implemented in the code as

self.layers[-1].dZ = self.layers[-1].A - one_hot(Y)

Deriving $\left(\frac{\delta L}{\delta W_k} \right)$

By the chain rule

$$ \frac{\delta L}{\delta W_k} = \frac{\delta L}{\delta z_k} \cdot \frac{\delta z_k}{\delta W_k} $$

Since $z_k = A_{k-1}\cdot W_k + b_k$

$$ \begin{align*} \frac{\delta z_k}{\delta W_k} &= \frac{\delta}{\delta W_k}(A_{k-1}\cdot W_k + b_k)\\ &= A_{k-1} \end{align*} $$

We average this gradient over the $m$ training examples in the mini-batch (the transpose on $A_{k-1}$ accounts for the matrix shapes), giving the resulting equation

$$ \frac{\delta L}{\delta W_k} = \frac{1}{m} \cdot \frac{\delta L}{\delta z_k} \cdot A_{k-1}^T $$

Which is implemented in the code as

self.layers[-1].dW = 1/m * self.layers[-1].dZ.dot(self.layers[-2].A.T)

Deriving $\left(\frac{\delta L}{\delta b_k}\right)$

By the chain rule

$$ \frac{\delta L}{\delta b_k} = \frac{\delta L}{\delta z_k} \cdot \frac{\delta z_k}{\delta b_k} $$

Since $z_k = A_{k-1}\cdot W_k + b_k$

$$ \begin{align*} \frac{\delta L}{\delta b_k} &= \frac{\delta L}{\delta z_k} \cdot \frac{\delta}{\delta b_k}(A_{k-1}\cdot W_k + b_k)\\ &= \frac{\delta L}{\delta z_k} \cdot 1\\ &= \frac{\delta L}{\delta z_k} \end{align*} $$

Therefore $\frac{\delta L}{\delta b_k}=\frac{\delta L}{\delta z_k}$ for each training example. To get the gradient used in the update we average over the $m$ examples in the mini-batch

$$ \frac{\delta L}{\delta b_k}=\frac{1}{m} \cdot \sum_{i=1}^m \frac{\delta L}{\delta z_k^{(i)}} $$

where $z_k^{(i)}$ denotes the pre-activation of layer $k$ for the $i$-th example.

Which is implemented in the code as

self.layers[-1].db = 1/m * np.sum(self.layers[-1].dZ, 1).reshape(-1, 1)
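
Putting the three results above together, a self-contained sketch of the output-layer gradients using plain NumPy arrays (rather than the repository's layer objects, and with dummy values) might look like:

```python
import numpy as np

def one_hot(Y, num_classes=10):
    # Same idea as the one_hot helper sketched earlier.
    encoded = np.zeros((num_classes, Y.size))
    encoded[Y, np.arange(Y.size)] = 1
    return encoded

m = 32                                    # mini-batch size
A_prev = np.random.rand(64, m)            # activations from the previous layer
A_out = np.random.rand(10, m)             # stand-in for the softmax output
A_out /= A_out.sum(axis=0, keepdims=True)
Y = np.random.randint(0, 10, size=m)      # integer class labels

dZ = A_out - one_hot(Y)                               # dL/dz_k = o_k - y_k
dW = (1 / m) * dZ @ A_prev.T                          # dL/dW_k, averaged over the batch
db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)      # dL/db_k, averaged over the batch
```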

He Initialisation

Weights cannot all be initialised to 0, since $a = \overline{w}X + b = b$ when $\overline{w} = 0$. Every neuron would then output the same value regardless of its input and receive identical gradients (symmetry), so no neuron could learn anything different from the others.

For this model, He initialisation is used and is defined as follows:

$$ W \sim N(0, \sqrt{\frac{2}{n_{in}}}) $$

This denotes a normal distribution with mean $0$ and standard deviation $\sqrt{\frac{2}{n_{in}}}$ where $n_{in}$ represents the number of input units to the layer.

Using He initialisation reduces the chances of gradients 'vanishing' or 'exploding' during backpropagation and also leads to faster convergence. It is particularly suited for neural networks utilising the ReLU activation function.
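
A minimal sketch of He initialisation for one layer (the layer sizes here are examples only):

```python
import numpy as np

def he_init(n_in, n_out):
    # Weights drawn from N(0, sqrt(2 / n_in)); biases start at zero.
    W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)
    b = np.zeros((n_out, 1))
    return W, b

W1, b1 = he_init(n_in=784, n_out=200)
print(W1.std())        # roughly sqrt(2/784) ≈ 0.05
```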

Analysing Results

Network Performance

| Epochs | Mini-Batch Size | Neurons (Layer 1/Layer 2/.../Layer n) | Learning Rate | Momentum Applied | Accuracy (%) |
| --- | --- | --- | --- | --- | --- |
| 20 | 128 | 200/100/25/10 | 0.5 | True | 98.42 |
| 20 | 128 | 100/50/10 | 0.5 | True | 98.14 |
| 20 | 128 | 100/50/10 | 0.5 | False | 97.51 |
| 20 | 60,000 | 100/50/10 | 0.5 | True | 84.13 |
| 20 | 60,000 | 100/50/10 | 0.5 | False | 65.16 |

Note: A mini-batch size of 60,000 means the entire training set is used for each update, i.e. batch gradient descent is being used rather than mini-batch gradient descent.

The Effect of Momentum on Network Performance

Momentum Comparison Graphs
