In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Assignment 3 Part 1

## Objective 0: Data Input, Verification, and Transformation

First I will import relevant packages for checking and changing the working directory, creating dataframes, splitting the data into train/validation/test splits, and for analyzing the data.

In [105]:
import numpy as np
import pandas as pd
import os
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
from numpy import asarray, save, load
import matplotlib.pyplot as plt

Checking the working directory:

In [3]:
os.getcwd()

'C:\\Users\\Wyatt\\Downloads'

The data for this analysis is contained in a folder in this directory. I will change the directory to that folder and check the available data:

In [4]:
os.chdir('C:\\Users\\Wyatt\\Downloads\\Assignment 3')
os.getcwd()
os.listdir('Pickles')

'C:\\Users\\Wyatt\\Downloads\\Assignment 3'

['assign-3-part-1-test.pickle', 'assign-3-part-1-train.pickle']

Reading these pickles and assigning them to the relevant names:

In [5]:
training_data = pd.read_pickle('C:\\Users\\Wyatt\\Downloads\\Assignment 3\\Pickles\\assign-3-part-1-train.pickle')
test_data = pd.read_pickle('C:\\Users\\Wyatt\\Downloads\\Assignment 3\\Pickles\\assign-3-part-1-test.pickle')

Checking the shape of the data and ensuring that the data falls within appropriate values:

In [6]:
training_data.shape
test_data.shape
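# Count training labels outside the digit range [0, 9].
out_of_range = (training_data['label'] > 9) | (training_data['label'] < 0)
print(out_of_range.sum(), 'Values in the training data fall outside the range [0,9]')
# The test labels are expected to be entirely NaN, so count any that are not.
print(test_data['label'].notna().sum(), 'Values in the test data column titled label are not NaN')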

(40320, 786)

(1680, 786)

0 Values in the training data fall outside the range [0,9]
0 Values in the test data column titled label are not NaN


Ensuring that the pixels in the images fall between 0 and 255:

In [7]:
# Keep only the pixel columns by dropping the label and ID columns.
training_pixels = training_data.drop(columns=['label', 'imageID'])
test_pixels = test_data.drop(columns=['label', 'imageID'])
# Count individual pixel values outside the valid range [0, 255].
print((training_pixels < 0).sum().sum(), 'values are less than 0 pixels in the training set')
print((training_pixels > 255).sum().sum(), 'values exceed 255 pixels in the training set')
print((test_pixels < 0).sum().sum(), 'values are less than 0 pixels in the test set')
print((test_pixels > 255).sum().sum(), 'values exceed 255 pixels in the test set')

0 values are less than 0 pixels in the training set
0 values exceed 255 pixels in the training set
0 values are less than 0 pixels in the test set
0 values exceed 255 pixels in the test set
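
The range checks are meant to look only at pixel columns, so both `label` and `imageID` have to be excluded. As a small aside in plain Python (the four-name column list is a hypothetical stand-in for the real 786 columns), the sketch below shows why excluding several names needs a membership test: the expression `'label' or 'imageID'` evaluates to `'label'` alone.

```python
# Hypothetical, shortened column list standing in for the real DataFrame columns.
columns = ["pixel0", "pixel1", "label", "imageID"]

# `or` returns its first truthy operand, so this is just the string 'label'.
excluded_by_or = ("label" or "imageID")

# Comparing against that single string leaves imageID in the selection.
kept_with_or = [c for c in columns if c != excluded_by_or]

# A membership test excludes both non-pixel columns.
non_pixel = {"label", "imageID"}
pixel_columns = [c for c in columns if c not in non_pixel]

print(excluded_by_or)
print(kept_with_or)
print(pixel_columns)
```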


Rescaling the pixels from [0,255] to [0,1]:

In [8]:
# Drop the non-pixel columns, then scale pixel intensities from [0, 255] to [0, 1].
rescaled_training_data = training_data.drop(columns=['imageID', 'label']) / 255.0

rescaled_test_data = test_data.drop(columns=['imageID', 'label']) / 255.0

Splitting the training data into train, validation, and test splits (60%, 20%, and 20% of the rows):

In [9]:
X = rescaled_training_data
y = training_data['label'].copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)
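
The two calls to `train_test_split` nest: the second `test_size=0.25` applies to the 80% left after the first split, not to the full data. The arithmetic can be sketched as follows, using integer division as a simplification of scikit-learn's own rounding (the sizes divide evenly here, so the result is the same):

```python
n_total = 40320                      # rows in the training pickle
n_test = n_total * 20 // 100         # first split: 20% held out for testing
n_remaining = n_total - n_test
n_val = n_remaining * 25 // 100      # second split: 25% of the remainder for validation
n_train = n_remaining - n_val

print(n_train, n_val, n_test)
print(n_train / n_total, n_val / n_total, n_test / n_total)
```

The training size agrees with the `X_train.shape` of (24192, 784) reported in the next section.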

## Part 1 Objective 1: Train and Evaluate Four Versions of a MLP That Predicts Digit Labels

Checking the data shape for the train dataset:

In [10]:
X_train.shape

(24192, 784)

Importing time and creating a custom callback that records how long each epoch takes:

In [11]:
import time
class TimeHistory(keras.callbacks.Callback):
    """Record the wall-clock duration of each training epoch."""
    def on_train_begin(self, logs=None):
        # Start a fresh list for every call to fit().
        self.times = []

    def on_epoch_begin(self, epoch, logs=None):
        self.epoch_time_start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        self.times.append(time.time() - self.epoch_time_start)

In [12]:
time_callback = TimeHistory()
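
The timing pattern behind the callback can be tried without Keras. The sketch below is a framework-free illustration only: `EpochTimer` is a hypothetical stand-in that mirrors the begin/end hooks, and the loop body is a placeholder for one epoch of training.

```python
import time

class EpochTimer:
    """Hypothetical stand-in mirroring the callback's begin/end hooks."""
    def __init__(self):
        self.times = []
        self._start = None

    def on_epoch_begin(self):
        # Remember when the epoch started.
        self._start = time.perf_counter()

    def on_epoch_end(self):
        # Store the elapsed time for the epoch that just finished.
        self.times.append(time.perf_counter() - self._start)

timer = EpochTimer()
for epoch in range(3):
    timer.on_epoch_begin()
    sum(i * i for i in range(10_000))  # placeholder for one epoch of training
    timer.on_epoch_end()

print(len(timer.times))
```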

Choosing the model architectures. I will run four separate models:

Model 1: Two hidden layers (300 and 100 neurons, relu activation) and a 10-neuron output layer with softmax activation
Model 2: Three hidden layers (500, 250, 100 neurons, relu activation) and a 10-neuron output layer with softmax activation
Model 3: Four hidden layers (500, 250, 100, 50 neurons, relu activation) and a 10-neuron output layer with softmax activation
Model 4: Five hidden layers (500, 250, 100, 50, 25 neurons, relu activation) and a 10-neuron output layer with softmax activation

All models will be compiled with sparse categorical crossentropy loss and the sgd optimizer, with accuracy as the reported metric.
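
To get a feel for how the four architectures compare in size, the number of trainable parameters can be worked out by hand: a dense layer with `n_in` inputs and `n_out` neurons has `n_in * n_out` weights plus `n_out` biases. The helper `dense_param_count` below is a hypothetical name introduced for this sketch.

```python
def dense_param_count(layer_sizes):
    """Total weights plus biases for consecutive fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Layer widths from the 784 inputs through to the 10-neuron output layer.
architectures = {
    "Model 1": [784, 300, 100, 10],
    "Model 2": [784, 500, 250, 100, 10],
    "Model 3": [784, 500, 250, 100, 50, 10],
    "Model 4": [784, 500, 250, 100, 50, 25, 10],
}
param_counts = {name: dense_param_count(sizes)
                for name, sizes in architectures.items()}
for name, count in param_counts.items():
    print(name, count)
```

Most parameters sit in the first two layers, so models 3 and 4 add only about 1% more parameters than model 2.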

In [13]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[784]))
model.add(keras.layers.Dense(300, activation="relu"))
model.add(keras.layers.Dense(100, activation="relu"))
model.add(keras.layers.Dense(10, activation="softmax"))

In [14]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
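
For a single image, sparse categorical crossentropy is the negative log of the probability the model assigns to the true integer label. The sketch below is a simplified per-sample version with made-up probabilities. Keras additionally averages over the batch and clips probabilities, which is omitted here.

```python
import math

def sparse_categorical_crossentropy(true_label, probabilities):
    """Loss for one sample: negative log of the probability given to the true class."""
    return -math.log(probabilities[true_label])

# Hypothetical softmax output over three classes (the notebook's models output ten).
probs = [0.1, 0.7, 0.2]
loss_correct = sparse_categorical_crossentropy(1, probs)   # true class received 0.7
loss_wrong = sparse_categorical_crossentropy(0, probs)     # true class received 0.1
print(loss_correct, loss_wrong)
```

The loss is small when the true class receives high probability and grows as that probability shrinks.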

Running model 1 and saving to history, times for epochs saved to times:

In [None]:
history = model.fit(X_train, y_train, epochs=30, validation_data=(X_val, y_val),callbacks=[time_callback])
times = time_callback.times

Running model 2 and saving to history2, times for epochs saved to times2:

In [None]:
model2 = keras.models.Sequential()
model2.add(keras.layers.Flatten(input_shape=[784]))
model2.add(keras.layers.Dense(500, activation="relu"))
model2.add(keras.layers.Dense(250, activation="relu"))
model2.add(keras.layers.Dense(100, activation="relu"))
model2.add(keras.layers.Dense(10, activation="softmax"))
model2.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
history2 = model2.fit(X_train, y_train, epochs=30, validation_data=(X_val, y_val),callbacks=[time_callback])
times2 = time_callback.times

Running model 3 and saving to history3, times for epochs saved to times3:

In [None]:
model3 = keras.models.Sequential()
model3.add(keras.layers.Flatten(input_shape=[784]))
model3.add(keras.layers.Dense(500, activation="relu"))
model3.add(keras.layers.Dense(250, activation="relu"))
model3.add(keras.layers.Dense(100, activation="relu"))
model3.add(keras.layers.Dense(50, activation="relu"))
model3.add(keras.layers.Dense(10, activation="softmax"))
model3.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
history3 = model3.fit(X_train, y_train, epochs=30, validation_data=(X_val, y_val),callbacks=[time_callback])
times3 = time_callback.times

Running model 4 and saving to history4, times for epochs saved to times4:

In [None]:
model4 = keras.models.Sequential()
model4.add(keras.layers.Flatten(input_shape=[784]))
model4.add(keras.layers.Dense(500, activation="relu"))
model4.add(keras.layers.Dense(250, activation="relu"))
model4.add(keras.layers.Dense(100, activation="relu"))
model4.add(keras.layers.Dense(50, activation="relu"))
model4.add(keras.layers.Dense(25, activation="relu"))
model4.add(keras.layers.Dense(10, activation="softmax"))
model4.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
history4 = model4.fit(X_train, y_train, epochs=30, validation_data=(X_val, y_val),callbacks=[time_callback])
times4 = time_callback.times

In [161]:
series = [history.history["accuracy"], history2.history["accuracy"],
          history3.history["accuracy"], history4.history["accuracy"],
          history.history["val_accuracy"], history2.history["val_accuracy"],
          history3.history["val_accuracy"], history4.history["val_accuracy"],
          times, times2, times3, times4]
# Each series holds 30 per-epoch values; transpose so rows are epochs and columns are series.
frame = np.array(series).T
frame_df = pd.DataFrame(frame)
frame_df.columns = ("Acc1","Acc2","Acc3","Acc4","ValAcc1","ValAcc2","ValAcc3","ValAcc4","Time 1","Time 2","Time 3","Time 4")
frame_df.index = range(1, 31)
frame_df

Epoch,Acc1,Acc2,Acc3,Acc4
1,0.792369,0.782655,0.750744,0.680101
2,0.898024,0.904018,0.905093,0.89852
3,0.913773,0.9208,0.925761,0.926009
4,0.923652,0.933986,0.937417,0.94151
5,0.930638,0.942006,0.947669,0.950893
6,0.937459,0.949239,0.955398,0.960235
7,0.942956,0.95515,0.962219,0.967221
8,0.947214,0.960317,0.966063,0.971602
9,0.950893,0.964782,0.970445,0.977059
10,0.955729,0.968047,0.974909,0.980985
11,0.957465,0.971313,0.978299,0.984086
12,0.960235,0.97371,0.981109,0.986648
13,0.963376,0.97648,0.982556,0.989625
14,0.966187,0.979291,0.986235,0.991774
15,0.968378,0.98173,0.987806,0.994172
16,0.969866,0.98206,0.989335,0.995164
17,0.971644,0.984912,0.991526,0.99661
18,0.973049,0.986318,0.992642,0.99723
19,0.974868,0.987889,0.994296,0.997851
20,0.976811,0.988715,0.995453,0.998636
21,0.978133,0.989583,0.996197,0.999132
22,0.979084,0.990782,0.996734,0.999752
23,0.980655,0.992436,0.997561,0.999835
24,0.982391,0.993262,0.998347,0.999917
25,0.983176,0.994254,0.998801,0.999959
26,0.98392,0.994668,0.99876,0.999959
27,0.984954,0.995494,0.999297,1.0
28,0.985904,0.996321,0.999504,1.0
29,0.986814,0.996858,0.999711,1.0
30,0.988054,0.997313,0.999711,1.0

(The ValAcc1-4 and Time 1-4 columns are not shown.)
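
A per-epoch table needs one row per epoch and one column per series. As a small sketch in plain Python with made-up numbers, transposing a list of series gives that layout. Flattening the values and re-cutting them into rows, which is what a reshape to the target shape does, instead places consecutive epochs of one series side by side.

```python
# Hypothetical toy data: three series, each with four per-epoch values.
series = [
    [11, 12, 13, 14],   # series 1, epochs 1-4
    [21, 22, 23, 24],   # series 2, epochs 1-4
    [31, 32, 33, 34],   # series 3, epochs 1-4
]

# Transpose: row i collects the i-th value of every series (one row per epoch).
transposed = [list(row) for row in zip(*series)]

# Reshape-style re-chunking: flatten, then cut into rows of three.
flat = [value for s in series for value in s]
rechunked = [flat[i:i + 3] for i in range(0, len(flat), 3)]

print(transposed[0])
print(rechunked[0])
```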


All models showed a dramatic reduction in loss over the epochs; models 3 and 4 had the highest training and validation accuracies.

Models 2 and 3 had identical validation accuracy after 30 epochs, but model 3 took less time to run and had better training accuracy.
Model 4 took slightly longer to run than model 3 but had identical training accuracy and was slightly less overfit to the training data.

Judging by loss over epochs and by accuracy on the training and validation splits, models 3 and 4 appear to be the two best models.

I will compare model 3 and 4 by using the test split from earlier before applying the better model to the test data loaded from the test pickle.

In [111]:
model3_evaluated = model3.evaluate(X_test, y_test)
model4_evaluated = model4.evaluate(X_test, y_test)

model3_evaluated[1]
model4_evaluated[1]



0.9666418433189392

0.9673858880996704

Model 4 scores slightly higher on accuracy than model 3 on the held-out test split. I will use this model to predict labels for the data from the test pickle.
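
To put the two accuracies in perspective, they can be converted back into counts of correctly classified images. The sketch assumes the held-out split contains 8064 rows (20% of the 40320 training rows); that size is derived from the split settings rather than printed in the notebook.

```python
n_test_split = 8064  # assumed: 20% of the 40320 training rows

accuracy_model3 = 0.9666418433189392
accuracy_model4 = 0.9673858880996704

# Accuracy is correct / total, so multiplying back recovers the counts.
correct_model3 = round(accuracy_model3 * n_test_split)
correct_model4 = round(accuracy_model4 * n_test_split)

print(correct_model3, correct_model4, correct_model4 - correct_model3)
```

Under that assumption the gap between the two models is six images, so the ranking rests on a small margin.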

## Predict Test Data Labels, Prepare a File for Kaggle

Predicting classes of the rescaled test data, building a table with columns Class and imageID, and exporting it as a CSV file to the directory stored in Excelpath.

In [85]:
X_new_test = rescaled_test_data
# predict() returns one row of class probabilities per image; argmax picks the most likely digit.
y_test_pred = np.argmax(model4.predict(X_new_test), axis=-1)
test_df = pd.DataFrame(y_test_pred, columns=["Class"])
# Use a plain array so the IDs are assigned by position, not by index alignment.
test_df["imageID"] = test_data["imageID"].to_numpy()
Excel_directory = 'Excel'
Excel_parent_dir = "C:\\Users\\Wyatt\\Downloads\\Assignment 3\\"
Excelpath = os.path.join(Excel_parent_dir, Excel_directory)
# Create the output directory; do nothing if it already exists.
os.makedirs(Excelpath, exist_ok=True)
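
The softmax output layer produces ten probabilities per image, and the predicted class is the index of the largest one. A minimal sketch with made-up probabilities follows; `predicted_class` is a hypothetical helper, not part of Keras.

```python
def predicted_class(probabilities):
    """Index of the largest probability, i.e. the most likely class."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])

# Hypothetical softmax outputs for two images over the ten digit classes.
outputs = [
    [0.01, 0.02, 0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01],
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.60, 0.03, 0.02],
]
labels = [predicted_class(p) for p in outputs]
print(labels)
```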

In [88]:
os.getcwd()
os.chdir("C:\\Users\\Wyatt\\Downloads\\Assignment 3\\Excel")
os.getcwd()

'C:\\Users\\Wyatt\\Downloads\\Assignment 3\\Excel'

'C:\\Users\\Wyatt\\Downloads\\Assignment 3\\Excel'

In [98]:
test_df.to_csv('test_df.csv', index = False)
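
The resulting file has a header row followed by one `Class,imageID` row per test image. The layout can be sketched with the standard library alone. The predictions and IDs below are invented, and an in-memory buffer stands in for the file on disk.

```python
import csv
import io

# Hypothetical predictions and image IDs; the real ones come from the model and the test pickle.
predicted = [2, 7, 0]
image_ids = [101, 102, 103]

buffer = io.StringIO()  # in-memory stand-in for the output file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Class", "imageID"])          # header row
writer.writerows(zip(predicted, image_ids))    # one row per image

csv_text = buffer.getvalue()
print(csv_text)
```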