A neural network, trained on the MNIST database, that classifies handwritten digits. Built without ML libraries such as TensorFlow or PyTorch, using only NumPy. Currently achieves 98% accuracy!
- MNIST Database
- Overview of How the Network Learns
- Network Structure
- Activation Functions
- One-Hot Encoding
- Loss Function
- Mini-Batch Gradient Descent
- Gradient Descent with Momentum
- Back Propagation
- He Initialisation
- Analysing Results
The MNIST database is a large collection of 70,000 images of handwritten digits.
Each image is black and white, 28x28 pixels in size, and contains a single handwritten digit.
For the purpose of this network, 60,000 images are reserved for training and the remaining 10,000 are used for testing the network on unseen images.
- Initialisation: The network's weights and biases are initialised.
- Forward Propagation: The input data is passed through the network and predictions are obtained.
- Computing Loss: The predictions are passed through the loss function to measure the accuracy of the network.
- Backward Propagation: Using partial derivatives, compute the gradient of the loss function with respect to each weight and bias in the network.
- Update Parameters: Update the weights and biases in the direction opposite the gradient, moving towards a point of minimum loss.
- Repeat steps 2-5.
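The steps above can be sketched as a minimal training loop on toy data, using a single softmax layer. All names, shapes and hyperparameters here are illustrative, not the repository's actual API:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((784, 100))      # 100 fake "images" as columns
Y = rng.integers(0, 10, 100)             # 100 fake labels
one_hot_Y = np.eye(10)[Y].T              # (10, 100) one-hot targets

# Step 1: initialise weights and biases
W = rng.standard_normal((10, 784)) * 0.01
b = np.zeros((10, 1))

losses = []
for epoch in range(20):
    # Step 2: forward propagation (softmax output)
    Z = W @ X + b
    expZ = np.exp(Z - Z.max(axis=0, keepdims=True))
    A = expZ / expZ.sum(axis=0, keepdims=True)
    # Step 3: compute categorical cross-entropy loss
    losses.append(-np.mean(np.sum(one_hot_Y * np.log(A + 1e-9), axis=0)))
    # Step 4: backward propagation
    m = X.shape[1]
    dZ = A - one_hot_Y
    dW = dZ @ X.T / m
    db = dZ.sum(axis=1, keepdims=True) / m
    # Step 5: update parameters, stepping against the gradient direction
    W -= 0.1 * dW
    b -= 0.1 * db
```

Each pass through the loop repeats steps 2-5; the loss shrinks as the parameters move against the gradient.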
This image is a simplified version of the network's architecture. It contains an input layer, hidden layers and an output layer. One important thing to note is that the output layer contains 10 nodes, corresponding to the 10 digits (0-9) that the network is attempting to classify.
The ReLU functions in the hidden layers introduce non-linearity and a softmax function is applied to the output to convert the raw output (logits) to probabilities, which sum to 1.
In reality, the input layer contains 784 (28*28) input features and there are many more neurons in the hidden layers; however, the general architecture is the same between the neural network in this image and the one in the code.
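The two activation functions can be sketched in NumPy as follows (function names are illustrative; the repository's own implementations may differ in detail):

```python
import numpy as np

def relu(Z):
    # ReLU: zero out negative values, introducing non-linearity
    return np.maximum(0, Z)

def softmax(Z):
    # Subtract the column-wise max before exponentiating, for numerical stability
    expZ = np.exp(Z - Z.max(axis=0, keepdims=True))
    # Normalise so each column (one sample's logits) sums to 1
    return expZ / expZ.sum(axis=0, keepdims=True)
```

Each column of `softmax(Z)` is a valid probability distribution: all entries are positive and sum to 1.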
When representing the categorical output data (the digit labels 0-9), one-hot encoding is used.
For example, we can encode the number 3 as $[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]$.
Here, each number (0-9) can be represented by simply setting its respective index (starting from 0) to 1 and all others to 0. This works particularly well when paired with the softmax and categorical cross-entropy loss functions.
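An encoder matching this scheme can be sketched in NumPy (the real code's `one_hot(Y)` helper may differ in details such as orientation):

```python
import numpy as np

def one_hot(Y, num_classes=10):
    # Y is a 1-D array of integer labels; each label selects a row of the identity matrix
    encoded = np.eye(num_classes)[Y]
    # Transpose so each column is one sample's encoding (shape: num_classes x m)
    return encoded.T

# one_hot(np.array([3]))[:, 0] is 1 at index 3 and 0 everywhere else
```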
The network uses Categorical Cross-Entropy Loss, defined as

$$L = -\sum_{i=1}^{C} y_i \ln(a_i)$$

where $y_i$ is the one-hot encoded true label, $a_i$ is the predicted probability for class $i$, and $C = 10$ is the number of classes.
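The categorical cross-entropy loss can be sketched in NumPy as below; the batch averaging and the small epsilon guard against $\ln(0)$ are assumptions, not necessarily the repository's exact code:

```python
import numpy as np

def categorical_cross_entropy(A, Y_one_hot, eps=1e-9):
    # A: predicted probabilities (classes x m); Y_one_hot: true labels (classes x m)
    # Sum -y_i * ln(a_i) for each sample, then average over the m samples
    m = A.shape[1]
    return -np.sum(Y_one_hot * np.log(A + eps)) / m
```

For a uniform prediction over 10 classes the loss is $\ln(10) \approx 2.30$, the natural baseline for an untrained classifier.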
Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches. When data is processed in these small batches, the weights and biases are updated once per mini-batch, unlike batch gradient descent, where the training set is processed as a whole.
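The batching itself can be sketched as shuffling the training set and slicing it into fixed-size chunks (a generic sketch, not the repository's exact loop):

```python
import numpy as np

def mini_batches(X, Y, batch_size, rng):
    # X: (features x m), Y: (m,) labels; shuffle the columns, then yield slices
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[perm]
    for start in range(0, m, batch_size):
        yield X_shuf[:, start:start + batch_size], Y_shuf[start:start + batch_size]
```

The parameters are then updated after each yielded batch rather than once per full pass over the data.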
Gradient Descent with Momentum is an optimisation technique that allows the network to converge faster to the optimal solution. It works by calculating an exponentially weighted average of the gradients and using this averaged gradient to update the weights and biases.

Velocity Parameter Update

$$v_{dW} = \beta v_{dW} + (1 - \beta) \, dW \qquad v_{db} = \beta v_{db} + (1 - \beta) \, db$$

Weight and Bias Update

$$W = W - \alpha v_{dW} \qquad b = b - \alpha v_{db}$$
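A sketch of the momentum update for a single weight matrix, with $\beta$ as the momentum coefficient and $\alpha$ as the learning rate (variable names and defaults are illustrative):

```python
import numpy as np

def momentum_update(W, dW, v_dW, alpha=0.5, beta=0.9):
    # Velocity: exponentially weighted average of the gradients
    v_dW = beta * v_dW + (1 - beta) * dW
    # Step against the averaged gradient rather than the raw gradient
    W = W - alpha * v_dW
    return W, v_dW
```

The same update is applied to the biases with their own velocity term.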
By the previously stated definition of Categorical Cross-Entropy Loss,

$$L = -\sum_{i} y_i \ln(a_i)$$

where $a_i$ is the softmax output

$$a_i = \frac{e^{z_i}}{\sum_{k} e^{z_k}}$$

Now, using partial derivatives,

$$\frac{\partial L}{\partial z_j} = -\sum_{i} y_i \frac{\partial \ln(a_i)}{\partial z_j}$$

By the chain rule we find that

$$\frac{\partial \ln(a_i)}{\partial z_j} = \frac{1}{a_i} \frac{\partial a_i}{\partial z_j}$$

Substituting this back in,

$$\frac{\partial L}{\partial z_j} = -\sum_{i} \frac{y_i}{a_i} \frac{\partial a_i}{\partial z_j}$$

Since the derivative of the softmax is

$$\frac{\partial a_i}{\partial z_j} = a_i (\delta_{ij} - a_j)$$

where $\delta_{ij} = 1$ if $i = j$ and $0$ otherwise, we can use this definition within our equation for $\frac{\partial L}{\partial z_j}$:

$$\frac{\partial L}{\partial z_j} = -\sum_{i} y_i (\delta_{ij} - a_j) = -y_j + a_j \sum_{i} y_i$$

Since $y$ is one-hot encoded, $\sum_{i} y_i = 1$. Therefore,

$$\frac{\partial L}{\partial z_j} = a_j - y_j \quad \Longrightarrow \quad dZ = A - Y$$
Which is implemented in the code as

```python
self.layers[-1].dZ = self.layers[-1].A - one_hot(Y)
```
By the chain rule,

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial W}$$

Since $Z = W A_{prev} + b$, we have $\frac{\partial Z}{\partial W} = A_{prev}^T$, so

$$dW = dZ \cdot A_{prev}^T$$

We must consider all of the gradients in the layer, so we divide by the total number ($m$) of training examples:

$$dW = \frac{1}{m} \, dZ \cdot A_{prev}^T$$
Which is implemented in the code as

```python
self.layers[-1].dW = 1/m * self.layers[-1].dZ.dot(self.layers[-2].A.T)
```
By the chain rule,

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial b}$$

Since $Z = W A_{prev} + b$, we have $\frac{\partial Z}{\partial b} = 1$. Therefore, averaging over the $m$ training examples,

$$db = \frac{1}{m} \sum_{i=1}^{m} dZ^{(i)}$$
Which is implemented in the code as

```python
self.layers[-1].db = 1/m * np.sum(self.layers[-1].dZ, 1).reshape(-1, 1)
```
Weights cannot be initialised to 0, since every neuron in a layer would then produce the same output and receive the same gradient update, leaving the network unable to break symmetry and learn distinct features.
For this model, He initialisation is used and is defined as follows:

$$W \sim \mathcal{N}\!\left(0, \frac{2}{n_{in}}\right)$$

This denotes a normal distribution with mean $0$ and variance $\frac{2}{n_{in}}$, where $n_{in}$ is the number of inputs to the layer.
Using He initialisation reduces the chances of gradients 'vanishing' or 'exploding' during backpropagation and also leads to faster convergence. It is particularly suited for neural networks utilising the ReLU activation function.
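He initialisation for one layer can be sketched as follows (function and variable names are illustrative):

```python
import numpy as np

def he_init(n_in, n_out, rng):
    # Weights drawn from N(0, 2/n_in): scale standard normals by sqrt(2/n_in)
    W = rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)
    # Biases can safely start at zero; symmetry is already broken by W
    b = np.zeros((n_out, 1))
    return W, b
```

Scaling the variance by the layer's fan-in keeps the magnitude of activations roughly constant from layer to layer under ReLU, which is what prevents gradients from vanishing or exploding.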
Epochs | Mini-Batch Size | Neurons (Layer 1/Layer 2/.../Layer n) | Learning Rate | Momentum Applied (True/False) | Accuracy (%) |
---|---|---|---|---|---|
20 | 128 | 200/100/25/10 | 0.5 | True | 98.42 |
20 | 128 | 100/50/10 | 0.5 | True | 98.14 |
20 | 128 | 100/50/10 | 0.5 | False | 97.51 |
20 | 60,000 | 100/50/10 | 0.5 | True | 84.13 |
20 | 60,000 | 100/50/10 | 0.5 | False | 65.16 |
Note: A mini-batch size of 60,000 means the entire training set forms a single batch, i.e. batch gradient descent is being used instead of mini-batch gradient descent.