
## Importing key libraries and reading data

In [1]:
import pandas as pd
import numpy as np

np.random.seed(1212)

import keras
from keras.models import Model
from keras.layers import Input, Dense, Dropout
from keras import optimizers

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
df_train = pd.read_csv('/content/drive/MyDrive/DOCUMENT/Dataset/MNIST/train.csv')
df_test = pd.read_csv('/content/drive/MyDrive/DOCUMENT/Dataset/MNIST/test.csv')

In [8]:
df_train.head() # 784 features, 1 label

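|  | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |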


## Splitting into training and validation dataset

In [9]:
df_features = df_train.iloc[:, 1:785]
df_label = df_train.iloc[:, 0]

X_test = df_test.iloc[:, 0:784]

print(X_test.shape)

(28000, 784)


In [51]:
from sklearn.model_selection import train_test_split
X_train, X_cv, y_train, y_cv = train_test_split(df_features, df_label, 
                                                test_size = 0.2,
                                                random_state = 1212)

X_train = X_train.values.reshape(33600, 784)  # (33600, 784)
X_cv = X_cv.values.reshape(8400, 784)  # (8400, 784)

# X_test is still a DataFrame here, so convert it to a NumPy array before reshaping
X_test = X_test.values.reshape(28000, 784)
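
The row counts hard-coded in the `reshape` calls follow from the split fraction. Here is a minimal pure-Python sketch, assuming the training file holds 42,000 rows (33,600 + 8,400, as implied by the shapes above). `split_indices` is a hypothetical helper, not part of scikit-learn, and it does not reproduce the shuffle order of `train_test_split`.

```python
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle row indices and split them into training and validation parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    n_val = int(round(n_rows * test_size))     # number of held-out rows
    return indices[n_val:], indices[:n_val]

# 42000 rows is an assumption derived from 33600 + 8400
train_idx, val_idx = split_indices(42000, 0.2, seed=1212)
print(len(train_idx), len(val_idx))  # 33600 8400
```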

## Data cleaning, normalization and selection

In [52]:
print((min(X_train[1]), max(X_train[1])))

(0, 255)


As the pixel intensities currently range from 0 to 255, we normalize the features to the range [0, 1] by dividing by 255, using broadcasting. In addition, we convert our labels from a class vector to a binary one-hot encoded matrix.

In [53]:
from keras.utils import to_categorical
# Feature normalization: scale pixel intensities from [0, 255] to [0, 1]
X_train = X_train.astype('float32'); X_cv = X_cv.astype('float32'); X_test = X_test.astype('float32')
X_train /= 255; X_cv /= 255; X_test /= 255

# Convert labels to one-hot encoding
num_digits = 10
y_train = to_categorical(y_train, num_digits)
y_cv = to_categorical(y_cv, num_digits)

In [54]:
# Printing 2 examples of labels after conversion
print(y_train[0]) # 2
print(y_train[3]) # 7

[0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
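
To make the printed vectors concrete, here is a small pure-Python sketch of what one-hot encoding does to a single label. `one_hot` is a hypothetical helper written for illustration; it is not the Keras implementation.

```python
def one_hot(label, num_classes):
    """Return a list of zeros with a single 1.0 at position `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector

print(one_hot(2, 10))  # 1.0 at index 2, matching the first printed label above
print(one_hot(7, 10))  # 1.0 at index 7, matching the second printed label above
```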


## Model Fitting

We proceed by fitting several simple neural network models using Keras (with TensorFlow as our backend) and comparing their validation accuracy. The model that performs best on the validation set will be used as the model of choice for the competition.

Model 1: Simple neural network with 4 hidden layers (300, 100, 100, 200)

In our first model, we use the Keras library to train a neural network with ReLU as the activation function in the hidden layers. To determine which class to output, we rely on the softmax function, which turns the output layer's scores into class probabilities.
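
As a rough illustration of the two activation functions named above, the following pure-Python sketch applies ReLU to a few values and softmax to a small score vector. The input numbers are made up for the example, and the functions are simplified stand-ins for the Keras layers' internals.

```python
import math

def relu(values):
    """Replace negative values with zero, leave the rest unchanged."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn arbitrary scores into probabilities that sum to 1."""
    shift = max(values)                      # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

hidden = relu([-1.5, 0.0, 2.0])
probs = softmax([1.0, 2.0, 3.0])
predicted_class = probs.index(max(probs))    # class with the highest probability
print(hidden, probs, predicted_class)
```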

In [55]:
# Input Parameters
n_input = 784 # number of features
n_hidden_1 = 300
n_hidden_2 = 100
n_hidden_3 = 100
n_hidden_4 = 200
num_digits = 10

In [56]:
Inp = Input(shape=(784,))
x = Dense(n_hidden_1, activation='relu', name = "Hidden_Layer_1")(Inp)
x = Dense(n_hidden_2, activation='relu', name = "Hidden_Layer_2")(x)
x = Dense(n_hidden_3, activation='relu', name = "Hidden_Layer_3")(x)
x = Dense(n_hidden_4, activation='relu', name = "Hidden_Layer_4")(x)
output = Dense(num_digits, activation='softmax', name = "Output_Layer")(x)

In [57]:
# Our model has 6 layers: 1 input layer, 4 hidden layers and 1 output layer
model = Model(Inp, output)
model.summary() # We have 297,910 parameters to estimate

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 784)]             0         
                                                                 
 Hidden_Layer_1 (Dense)      (None, 300)               235500    
                                                                 
 Hidden_Layer_2 (Dense)      (None, 100)               30100     
                                                                 
 Hidden_Layer_3 (Dense)      (None, 100)               10100     
                                                                 
 Hidden_Layer_4 (Dense)      (None, 200)               20200     
                                                                 
 Output_Layer (Dense)        (None, 10)                2010      
                                                                 
Total params: 297,910
Trainable params: 297,910
Non-trainable params: 0
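
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic, where `dense_params` is a hypothetical helper name:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return (n_inputs + 1) * n_units

# Input size followed by the hidden and output layer sizes of Model 1
layer_sizes = [784, 300, 100, 100, 200, 10]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer)  # [235500, 30100, 10100, 20200, 2010]
print(total)      # 297910
```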

In [58]:
# Insert Hyperparameters
learning_rate = 0.1
training_epochs = 20
batch_size = 100
sgd = optimizers.SGD(learning_rate=learning_rate)


In [59]:
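# We rely on plain vanilla stochastic gradient descent as our optimizing methodology.
# Note: the string 'sgd' selects SGD with Keras's default settings, so the `sgd`
# object configured above is not used; pass optimizer=sgd to apply learning_rate.
model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])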
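
For intuition about the `categorical_crossentropy` loss, here is a simplified pure-Python sketch for a single example. The `eps` guard against `log(0)` and the example probabilities are assumptions made for illustration; Keras additionally averages the loss over each batch.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

target = [0.0, 0.0, 1.0]                                   # true class is index 2
confident = categorical_crossentropy(target, [0.05, 0.05, 0.90])
unsure = categorical_crossentropy(target, [0.30, 0.30, 0.40])
print(confident, unsure)  # the confident, correct prediction has the lower loss
```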

In [60]:
history1 = model.fit(X_train, y_train,
                     batch_size = batch_size,
                     epochs = training_epochs,
                     verbose = 2,
                     validation_data=(X_cv, y_cv))

Epoch 1/20
336/336 - 2s - loss: 1.8543 - accuracy: 0.5151 - val_loss: 0.9947 - val_accuracy: 0.7780 - 2s/epoch - 7ms/step
Epoch 2/20
336/336 - 2s - loss: 0.6319 - accuracy: 0.8347 - val_loss: 0.4562 - val_accuracy: 0.8705 - 2s/epoch - 5ms/step
Epoch 3/20
336/336 - 2s - loss: 0.4038 - accuracy: 0.8849 - val_loss: 0.3592 - val_accuracy: 0.8988 - 2s/epoch - 5ms/step
Epoch 4/20
336/336 - 2s - loss: 0.3355 - accuracy: 0.9027 - val_loss: 0.3194 - val_accuracy: 0.9101 - 2s/epoch - 5ms/step
Epoch 5/20
336/336 - 2s - loss: 0.2977 - accuracy: 0.9121 - val_loss: 0.2848 - val_accuracy: 0.9163 - 2s/epoch - 5ms/step
Epoch 6/20
336/336 - 2s - loss: 0.2701 - accuracy: 0.9204 - val_loss: 0.2649 - val_accuracy: 0.9239 - 2s/epoch - 5ms/step
Epoch 7/20
336/336 - 2s - loss: 0.2480 - accuracy: 0.9286 - val_loss: 0.2446 - val_accuracy: 0.9300 - 2s/epoch - 5ms/step
Epoch 8/20
336/336 - 2s - loss: 0.2286 - accuracy: 0.9342 - val_loss: 0.2296 - val_accuracy: 0.9336 - 2s/epoch - 5ms/step
Epoch 9/20
336/336 - 2s 

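Model 1 is therefore a neural network with 4 hidden layers, trained with:

1. 20 training epochs
2. A training batch size of 100
3. Hidden layers set as (300, 100, 100, 200)
4. A learning rate of 0.1 (configured on the `sgd` object; because `compile` receives the string `'sgd'`, Keras's default SGD learning rate is what was actually applied)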
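
The `336/336` prefix in the training log is the number of mini-batches per epoch, which follows from the training set size and the batch size. A quick check of that arithmetic:

```python
import math

n_train = 33600        # training rows after the 80/20 split
batch_size = 100
training_epochs = 20

steps_per_epoch = math.ceil(n_train / batch_size)   # mini-batches per epoch
total_updates = steps_per_epoch * training_epochs   # weight updates over the full run
print(steps_per_epoch, total_updates)  # 336 6720
```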


In [61]:
Inp = Input(shape=(784,))
x = Dense(n_hidden_1, activation='relu', name = "Hidden_Layer_1")(Inp)
x = Dense(n_hidden_2, activation='relu', name = "Hidden_Layer_2")(x)
x = Dense(n_hidden_3, activation='relu', name = "Hidden_Layer_3")(x)
x = Dense(n_hidden_4, activation='relu', name = "Hidden_Layer_4")(x)
output = Dense(num_digits, activation='softmax', name = "Output_Layer")(x)

# We rely on Adam as our optimizing methodology
adam = optimizers.Adam(learning_rate=learning_rate)
model2 = Model(Inp, output)

# Note: the string 'adam' selects Adam with Keras's default settings, so the `adam`
# object configured above is not used; pass optimizer=adam to apply learning_rate.
model2.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


In [62]:
history2 = model2.fit(X_train, y_train,
                      batch_size = batch_size,
                      epochs = training_epochs,
                      verbose = 2,
                      validation_data=(X_cv, y_cv))

Epoch 1/20
336/336 - 2s - loss: 0.3292 - accuracy: 0.9018 - val_loss: 0.1517 - val_accuracy: 0.9543 - 2s/epoch - 7ms/step
Epoch 2/20
336/336 - 2s - loss: 0.1217 - accuracy: 0.9632 - val_loss: 0.1274 - val_accuracy: 0.9605 - 2s/epoch - 5ms/step
Epoch 3/20
336/336 - 2s - loss: 0.0806 - accuracy: 0.9756 - val_loss: 0.1145 - val_accuracy: 0.9673 - 2s/epoch - 5ms/step
Epoch 4/20
336/336 - 2s - loss: 0.0577 - accuracy: 0.9824 - val_loss: 0.0944 - val_accuracy: 0.9717 - 2s/epoch - 5ms/step
Epoch 5/20
336/336 - 2s - loss: 0.0456 - accuracy: 0.9855 - val_loss: 0.1063 - val_accuracy: 0.9727 - 2s/epoch - 6ms/step
Epoch 6/20
336/336 - 2s - loss: 0.0392 - accuracy: 0.9875 - val_loss: 0.1130 - val_accuracy: 0.9675 - 2s/epoch - 5ms/step
Epoch 7/20
336/336 - 2s - loss: 0.0305 - accuracy: 0.9899 - val_loss: 0.0999 - val_accuracy: 0.9743 - 2s/epoch - 5ms/step
Epoch 8/20
336/336 - 2s - loss: 0.0224 - accuracy: 0.9927 - val_loss: 0.1200 - val_accuracy: 0.9718 - 2s/epoch - 6ms/step
Epoch 9/20
336/336 - 3s 

In [64]:
Inp = Input(shape=(784,))
x = Dense(n_hidden_1, activation='relu', name = "Hidden_Layer_1")(Inp)
x = Dense(n_hidden_2, activation='relu', name = "Hidden_Layer_2")(x)
x = Dense(n_hidden_3, activation='relu', name = "Hidden_Layer_3")(x)
x = Dense(n_hidden_4, activation='relu', name = "Hidden_Layer_4")(x)
output = Dense(num_digits, activation='softmax', name = "Output_Layer")(x)

learning_rate = 0.01
adam = optimizers.Adam(learning_rate=learning_rate)
model2a = Model(Inp, output)

# The string 'adam' uses Keras's default Adam settings; pass optimizer=adam to apply learning_rate.
model2a.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


In [65]:
history2a = model2a.fit(X_train, y_train,
                        batch_size = batch_size,
                        epochs = training_epochs,
                        verbose = 2,
                        validation_data=(X_cv, y_cv))

Epoch 1/20
336/336 - 2s - loss: 0.3382 - accuracy: 0.9008 - val_loss: 0.1498 - val_accuracy: 0.9545 - 2s/epoch - 7ms/step
Epoch 2/20
336/336 - 2s - loss: 0.1224 - accuracy: 0.9620 - val_loss: 0.1145 - val_accuracy: 0.9655 - 2s/epoch - 5ms/step
Epoch 3/20
336/336 - 2s - loss: 0.0859 - accuracy: 0.9730 - val_loss: 0.0963 - val_accuracy: 0.9713 - 2s/epoch - 5ms/step
Epoch 4/20
336/336 - 2s - loss: 0.0568 - accuracy: 0.9819 - val_loss: 0.1003 - val_accuracy: 0.9705 - 2s/epoch - 6ms/step
Epoch 5/20
336/336 - 2s - loss: 0.0413 - accuracy: 0.9869 - val_loss: 0.0949 - val_accuracy: 0.9736 - 2s/epoch - 6ms/step
Epoch 6/20
336/336 - 2s - loss: 0.0367 - accuracy: 0.9877 - val_loss: 0.1035 - val_accuracy: 0.9727 - 2s/epoch - 5ms/step
Epoch 7/20
336/336 - 2s - loss: 0.0322 - accuracy: 0.9893 - val_loss: 0.1161 - val_accuracy: 0.9714 - 2s/epoch - 5ms/step
Epoch 8/20
336/336 - 2s - loss: 0.0246 - accuracy: 0.9923 - val_loss: 0.1245 - val_accuracy: 0.9683 - 2s/epoch - 5ms/step
Epoch 9/20
336/336 - 2s 

Model 2B

In [67]:
Inp = Input(shape=(784,))
x = Dense(n_hidden_1, activation='relu', name = "Hidden_Layer_1")(Inp)
x = Dense(n_hidden_2, activation='relu', name = "Hidden_Layer_2")(x)
x = Dense(n_hidden_3, activation='relu', name = "Hidden_Layer_3")(x)
x = Dense(n_hidden_4, activation='relu', name = "Hidden_Layer_4")(x)
output = Dense(num_digits, activation='softmax', name = "Output_Layer")(x)

learning_rate = 0.5
adam = optimizers.Adam(learning_rate=learning_rate)
model2b = Model(Inp, output)

# The string 'adam' uses Keras's default Adam settings; pass optimizer=adam to apply learning_rate.
model2b.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


In [68]:
history2b = model2b.fit(X_train, y_train,
                        batch_size = batch_size,
                        epochs = training_epochs,
                        validation_data=(X_cv, y_cv))



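Validation accuracy for the three learning rates 0.01, 0.1 and 0.5 is around 98%, 97% and 98% respectively. Note that each `compile` call above passes the string `'adam'` rather than the configured `adam` object, so all three runs actually trained with Adam's default learning rate, and their near-identical accuracy is not evidence about the learning rate itself. Since no run shows a considerable gain, we stick with a learning rate of 0.01.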
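
How strongly a learning rate affects training depends on the optimizer and the loss surface. As a toy illustration only (plain gradient descent on f(w) = w², not Adam and not the MNIST network), the sketch below runs 20 update steps with each of the three rates considered here. The function name and starting point are assumptions for the example.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimise f(w) = w**2 with fixed-step gradient descent."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w              # derivative of w**2
        w -= learning_rate * gradient
    return w

for rate in (0.01, 0.1, 0.5):
    # Distance from the minimum at w = 0 after 20 steps
    print(rate, abs(gradient_descent(rate)))
```

On this toy problem a larger step reaches the minimum faster, but that does not carry over automatically to a deep network, where a step that is too large can make training diverge.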

We proceed to fit a neural network with 5 hidden layers, with the number of units in the hidden layers set to (300, 100, 100, 100, 200) respectively. To keep the two models comparable, we again use 20 training epochs and a training batch size of 100.

In [69]:
# Input Parameters
n_input = 784 # number of features
n_hidden_1 = 300
n_hidden_2 = 100
n_hidden_3 = 100
n_hidden_4 = 100
n_hidden_5 = 200
num_digits = 10

In [70]:
Inp = Input(shape=(784,))
x = Dense(n_hidden_1, activation='relu', name = "Hidden_Layer_1")(Inp)
x = Dense(n_hidden_2, activation='relu', name = "Hidden_Layer_2")(x)
x = Dense(n_hidden_3, activation='relu', name = "Hidden_Layer_3")(x)
x = Dense(n_hidden_4, activation='relu', name = "Hidden_Layer_4")(x)
x = Dense(n_hidden_5, activation='relu', name = "Hidden_Layer_5")(x)
output = Dense(num_digits, activation='softmax', name = "Output_Layer")(x)

In [71]:
# Our model has 7 layers: 1 input layer, 5 hidden layers and 1 output layer
model3 = Model(Inp, output)
model3.summary() # We have 308,010 parameters to estimate

Model: "model_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_10 (InputLayer)       [(None, 784)]             0         
                                                                 
 Hidden_Layer_1 (Dense)      (None, 300)               235500    
                                                                 
 Hidden_Layer_2 (Dense)      (None, 100)               30100     
                                                                 
 Hidden_Layer_3 (Dense)      (None, 100)               10100     
                                                                 
 Hidden_Layer_4 (Dense)      (None, 100)               10100     
                                                                 
 Hidden_Layer_5 (Dense)      (None, 200)               20200     
                                                                 
 Output_Layer (Dense)        (None, 10)                2010

In [73]:
# We rely on Adam as our optimizing methodology
adam = optimizers.Adam(learning_rate=0.01)

# The string 'adam' uses Keras's default Adam settings; pass optimizer=adam to apply the rate above.
model3.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


In [74]:
history3 = model3.fit(X_train, y_train,
                      batch_size = batch_size,
                      epochs = training_epochs,
                      validation_data=(X_cv, y_cv))



Compared to our previous 4-hidden-layer model, adding a fifth hidden layer did not significantly improve accuracy. However, the extra layer adds computational cost: the parameter count rises from 297,910 to 308,010. Given that the benefit of an additional layer is low relative to its cost, we stick with the 4-hidden-layer neural network.

We now add dropout (with a dropout rate of 0.3) to our second model to reduce overfitting.
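
As a sketch of the idea behind dropout, the pure-Python function below zeroes each activation with probability `rate` during training and rescales the survivors by 1/(1 - rate), while leaving inputs unchanged at inference time. The helper name, seed and input values are assumptions for illustration; this is not the Keras implementation.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout: zero each value with probability `rate` while training."""
    if not training:
        return list(values)                  # inference: pass values through unchanged
    keep_scale = 1.0 / (1.0 - rate)          # rescale survivors to preserve the expected sum
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(1212)
activations = [1.0] * 10
print(dropout(activations, 0.3, training=True, rng=rng))   # some zeros, the rest scaled up
print(dropout(activations, 0.3, training=False, rng=rng))  # unchanged
```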

In [75]:
# Input Parameters
n_input = 784 # number of features
n_hidden_1 = 300
n_hidden_2 = 100
n_hidden_3 = 100
n_hidden_4 = 200
num_digits = 10

In [76]:
Inp = Input(shape=(784,))
x = Dense(n_hidden_1, activation='relu', name = "Hidden_Layer_1")(Inp)
x = Dropout(0.3)(x)
x = Dense(n_hidden_2, activation='relu', name = "Hidden_Layer_2")(x)
x = Dropout(0.3)(x)
x = Dense(n_hidden_3, activation='relu', name = "Hidden_Layer_3")(x)
x = Dropout(0.3)(x)
x = Dense(n_hidden_4, activation='relu', name = "Hidden_Layer_4")(x)
output = Dense(num_digits, activation='softmax', name = "Output_Layer")(x)

In [77]:
# Our model has 6 layers (plus 3 dropout layers): 1 input layer, 4 hidden layers and 1 output layer
model4 = Model(Inp, output)
model4.summary() # We have 297,910 parameters to estimate

Model: "model_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_11 (InputLayer)       [(None, 784)]             0         
                                                                 
 Hidden_Layer_1 (Dense)      (None, 300)               235500    
                                                                 
 dropout (Dropout)           (None, 300)               0         
                                                                 
 Hidden_Layer_2 (Dense)      (None, 100)               30100     
                                                                 
 dropout_1 (Dropout)         (None, 100)               0         
                                                                 
 Hidden_Layer_3 (Dense)      (None, 100)               10100     
                                                                 
 dropout_2 (Dropout)         (None, 100)               0   

In [78]:
model4.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [79]:
history = model4.fit(X_train, y_train,
                    batch_size = batch_size,
                    epochs = training_epochs,
                    validation_data=(X_cv, y_cv))



With a validation accuracy close to 98%, we use this model to predict labels for the test set.

In [80]:
test_pred = pd.DataFrame(model4.predict(X_test, batch_size=200))
test_pred = pd.DataFrame(test_pred.idxmax(axis = 1))
test_pred.index.name = 'ImageId'
test_pred = test_pred.rename(columns = {0: 'Label'}).reset_index()
test_pred['ImageId'] = test_pred['ImageId'] + 1

test_pred.head()

|  | ImageId | Label |
|---|---|---|
| 0 | 1 | 8 |
| 1 | 2 | 8 |
| 2 | 3 | 8 |
| 3 | 4 | 8 |
| 4 | 5 | 8 |


In [81]:
test_pred.to_csv('mnist_submission.csv', index = False)
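
The prediction step above reduces each row of class probabilities to the index of its largest entry and numbers the images from 1. A minimal pure-Python sketch of that mapping, with a hypothetical helper name and made-up probabilities:

```python
def to_submission_rows(probabilities):
    """Map each row of class probabilities to an (ImageId, Label) pair."""
    rows = []
    for image_id, row in enumerate(probabilities, start=1):   # ImageId is 1-based
        label = max(range(len(row)), key=row.__getitem__)     # index of the largest probability
        rows.append((image_id, label))
    return rows

example = [
    [0.1, 0.7, 0.2],   # most likely class: 1
    [0.8, 0.1, 0.1],   # most likely class: 0
]
print(to_submission_rows(example))  # [(1, 1), (2, 0)]
```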

Using this model, we achieve a score of 0.976, which places us in the top 55th percentile!