# Multiclass Classification with the MNIST dataset

The MNIST dataset contains 28x28 images of handwritten digits from 0 to 9. The training and test sets contain 60,000 and 10,000 samples respectively. It is a convenient dataset for experimenting with learning techniques because the data has already been preprocessed and formatted and can be used as is. The datasets can be downloaded [here](http://yann.lecun.com/exdb/mnist/).

#### Loading required libraries

In [1]:
from NeuralNet import NeuralNet, get_normalisation_constants
import numpy as np
from sklearn.preprocessing import LabelBinarizer
from mnist import MNIST

#### Loading the dataset (as described [here](https://github.com/sorki/python-mnist))

In [2]:
mndata = MNIST('.')
images, labels = mndata.load_training()
images_test, labels_test = mndata.load_testing()
x_train = np.array(images).T
y_train = np.array(labels)
x_test = np.array(images_test).T
y_test = np.array(labels_test)

Let us take a look at the labels we have.

In [3]:
y_train[:10]

array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4], dtype=uint8)

Since our model expects binarized (one-hot) labels, we use scikit-learn's LabelBinarizer class to transform them.

In [4]:
lb = LabelBinarizer()
y_train = lb.fit_transform(y_train).T
# Reuse the encoding fitted on the training labels for the test labels.
y_test = lb.transform(y_test).T
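
To make the resulting layout concrete, the following sketch reproduces one-hot encoding of digit labels with plain NumPy. It only illustrates the shape and content of the encoded matrix and is not the internals of `LabelBinarizer`; the helper name `one_hot_columns` is hypothetical.

```python
import numpy as np

def one_hot_columns(labels, num_classes=10):
    """Return a (num_classes, n_samples) one-hot matrix, one column per example."""
    labels = np.asarray(labels)
    encoded = np.zeros((num_classes, labels.size), dtype=int)
    # Set a single 1 per column, at the row given by that example's label.
    encoded[labels, np.arange(labels.size)] = 1
    return encoded

sample = one_hot_columns([5, 0, 4])
print(sample.shape)   # (10, 3)
print(sample[:, 0])   # [0 0 0 0 0 1 0 0 0 0]
```

Each column contains exactly one 1, in the row that corresponds to the digit.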

#### Checking the dimensions of the data
Since our neural network implementation expects one example per column, x_train should have a shape of (784, 60000) and the binarized y_train should have a shape of (10, 60000).

In [5]:
x_train.shape

(784, 60000)

In [6]:
y_train.shape

(10, 60000)

## Building our neural network
To help gradient descent converge, we scale the pixel values to the range [0, 1] by dividing by the maximum pixel value, 255.

In [7]:
x_train = x_train / 255
x_test = x_test / 255
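
As a quick illustration of what this scaling does, the sketch below applies it to a small made-up array of pixel intensities. The values are hypothetical; only the division by 255 comes from the notebook.

```python
import numpy as np

# Hypothetical pixel intensities in the raw 0-255 range.
pixels = np.array([[0, 51], [102, 255]])
scaled = pixels / 255  # true division yields floats in [0, 1]

print(scaled.min(), scaled.max())  # 0.0 1.0
```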

#### Neural network created with batch gradient descent
We will first create a network using batch gradient descent. For this model, we use the tanh activation function and initialize the weights using the Xavier approach. The network has 2 hidden layers with 30 units each, followed by an output layer with 10 units. We train the model for 500 epochs with a learning rate of 1 (the alpha argument). We call this network nn_bgd. The following code takes about 5 minutes to run.

In [8]:
np.random.seed(0)
nn_bgd = NeuralNet()
nn_bgd.initializer_xavier(784, [30, 30, 10])
nn_bgd.gd_batch(x_train, y_train, epochs=500, alpha=1, activation='tanh')

KeyboardInterrupt: 
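
For readers unfamiliar with Xavier initialization, here is a minimal sketch of one common variant, which draws Gaussian weights with variance 2 / (fan_in + fan_out). The notebook does not show how `initializer_xavier` is implemented, so the function `xavier_init` below is hypothetical and may differ from it in scaling and in how biases are handled.

```python
import numpy as np

def xavier_init(n_inputs, layer_sizes, seed=0):
    """Sketch: Gaussian weights with variance 2 / (fan_in + fan_out), zero biases."""
    rng = np.random.default_rng(seed)
    params = []
    fan_in = n_inputs
    for fan_out in layer_sizes:
        std = np.sqrt(2.0 / (fan_in + fan_out))
        W = rng.normal(0.0, std, size=(fan_out, fan_in))  # one row per unit
        b = np.zeros((fan_out, 1))
        params.append((W, b))
        fan_in = fan_out  # this layer's outputs feed the next layer
    return params

params = xavier_init(784, [30, 30, 10])
print([W.shape for W, _ in params])  # [(30, 784), (30, 30), (10, 30)]
```

Scaling the spread of the weights by the layer widths keeps the pre-activations in a range where tanh is not saturated at the start of training.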

The cost decreases steadily during training, which indicates that learning is progressing. We now get the predictions on the training set and calculate the training error.

In [None]:
p, err = nn_bgd.predict(x_train, y_train)
print(err)

Now we calculate the test error.

In [None]:
p, err = nn_bgd.predict(x_test, y_test)
print(err)

We had a training error of 3.07% and a test error of 3.93%. Since the gap is under one percentage point, there is little evidence of overfitting, so we do not consider regularization techniques here.
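
The error rates reported in this notebook are fractions of misclassified examples. A minimal sketch of that calculation follows, assuming class scores and one-hot labels are stored with one example per column. The helper `error_rate` and the toy data are hypothetical; this is not the `predict` method of `NeuralNet`, whose internals are not shown.

```python
import numpy as np

def error_rate(scores, one_hot_labels):
    """Fraction of columns whose highest-scoring class differs from the true class."""
    predicted = np.argmax(scores, axis=0)
    actual = np.argmax(one_hot_labels, axis=0)
    return float(np.mean(predicted != actual))

# Hypothetical scores for 4 examples and 3 classes (one column per example).
scores = np.array([[0.8, 0.1, 0.2, 0.3],
                   [0.1, 0.7, 0.5, 0.3],
                   [0.1, 0.2, 0.3, 0.4]])
truth = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 1, 1]])
print(error_rate(scores, truth))  # 0.25
```

Only the third example is misclassified, so the error rate is 1 in 4.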

#### Neural network created with mini-batch gradient descent
We will now create a network using mini-batch gradient descent. In this case, we use the ReLU activation function and the Adam optimizer for the weight updates. To initialize the weights, we use the He method.
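
As background on the optimizer, the sketch below shows a single Adam update for one parameter array, using the commonly cited default hyperparameters. It is an illustration under those assumptions, not the update used inside `NeuralNet`; the function `adam_step` and the toy objective are hypothetical.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter array; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy check: minimise f(w) = w^2, whose gradient is 2w, starting from w = 1.0.
w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * w, m, v, t, alpha=0.05)
print(abs(w[0]) < 0.1)
```

Because the update divides by the root of the squared-gradient average, the size of each step depends on the learning rate rather than on the raw gradient magnitude.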

In [16]:
np.random.seed(0)
nn_mbgd = NeuralNet()
nn_mbgd.initializer_he(784, [30, 30, 10])
nn_mbgd.gd_mini_batch(x_train, y_train, epochs=200, alpha=.1, activation='relu', mini_batch_size=256)

The cost after epoch 50 is 0.046104916388539044.
The cost after epoch 100 is 0.013927177522875324.
The cost after epoch 150 is 0.004509681891616496.
The cost after epoch 200 is 0.0015621479698567143.
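
To illustrate what one epoch of mini-batch training iterates over, here is a sketch that shuffles the examples and splits them into batches of a given size. It assumes examples are stored in columns and that the final, smaller batch is kept; `gd_mini_batch` may handle both differently. The generator `iter_mini_batches` and the zero-filled arrays are hypothetical stand-ins.

```python
import numpy as np

def iter_mini_batches(x, y, batch_size, rng):
    """Yield shuffled (x_batch, y_batch) pairs; examples are stored in columns."""
    n_samples = x.shape[1]
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]
        yield x[:, idx], y[:, idx]

rng = np.random.default_rng(0)
x_demo = np.zeros((784, 1000))  # hypothetical stand-in for x_train
y_demo = np.zeros((10, 1000))   # hypothetical stand-in for y_train
sizes = [xb.shape[1] for xb, _ in iter_mini_batches(x_demo, y_demo, 256, rng)]
print(sizes)  # [256, 256, 256, 232]
```

Under the same assumptions, 60,000 training examples with a batch size of 256 give 235 weight updates per epoch, compared with a single update per epoch for batch gradient descent.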


We calculate the training error rate.

In [17]:
p, err = nn_mbgd.predict(x_train, y_train)
print(err)

0.00075


Now we calculate the test error rate.

In [18]:
p, err = nn_mbgd.predict(x_test, y_test)
print(err)

0.0364


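With mini-batch gradient descent we obtained error rates of 0.075% on the training data and 3.64% on the test data. The test error is slightly lower than the 3.93% obtained with batch gradient descent, but the much larger gap between training and test error suggests that this model overfits the training set more.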