# Introduction to CNN with MNIST

This notebook will look at three strategies for solving the MNIST Digit Recognizer computer vision competition. 

*Note*: The top scores in this competition are achieved by downloading the full MNIST dataset from an external source. Training on the full dataset means that your model will have seen every image in the competition test set. The submissions scoring 100% accuracy take each image from the test set and simply look up the most similar image in the full dataset. Neither of these approaches would be possible in real life. The maximum 'non-cheating' score in this competition seems to be around 0.997.

The three strategies explored in this notebook are:

1. A relatively simple LeNet CNN architecture
2. LeNet CNN architecture with data augmentation and learning rate annealing
3. A more complex CNN architecture with data augmentation, learning rate annealing, and k-fold cross validation

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
%matplotlib inline

np.random.seed(2)

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import confusion_matrix
import itertools

from keras.utils import to_categorical # convert to one-hot-encoding (keras.utils.np_utils is gone in recent Keras)
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D, MaxPool2D, BatchNormalization, Activation
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau  # same package as the other Keras imports
from keras.regularizers import L1L2
dense_regularizer = L1L2(l2=0.0001)

import os
os.chdir('/kaggle/working')

sns.set(style='white', context='notebook', palette='deep')

# Load and Process Data

In [None]:
# Load the data
train = pd.read_csv("/kaggle/input/digit-recognizer/train.csv")
test = pd.read_csv("/kaggle/input/digit-recognizer/test.csv")

In [None]:
# Create Y df (labels) and drop the labels from X
Y = train["label"]
X = train.drop(labels = ["label"],axis = 1) 

Do a quick check of how many of each label we have in the training set. An unbalanced dataset could cause problems with our model's prediction ability.
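
As a rough illustration of what "balanced" means in numbers, the sketch below computes per-class counts, shares, and the ratio between the largest and smallest class with NumPy. The label array is made up for the example (it is not the competition data), and `class_balance` is a hypothetical helper name.

```python
import numpy as np

def class_balance(labels, num_classes=10):
    """Return per-class counts, per-class shares, and the max/min count ratio."""
    counts = np.bincount(labels, minlength=num_classes)
    shares = counts / counts.sum()
    ratio = counts.max() / counts.min()  # 1.0 means perfectly balanced
    return counts, shares, ratio

# Toy labels (illustrative only): digits 0-9 with digit 1 over-represented
toy_labels = np.array([0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1])
counts, shares, ratio = class_balance(toy_labels)
print(counts)
print(ratio)
```

A ratio close to 1 suggests that re-sampling is unnecessary. How large a ratio is tolerable depends on the task.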

In [None]:
# Check the distribution of labels in the training set
g = sns.countplot(x=Y, color='darkviolet')  # pass the data by keyword; recent seaborn versions require it
Y.value_counts()

We have a relatively balanced training set so there's no need to re-sample.

**Normalization** - Each value in the dataset is a greyscale intensity between 0 and 255. It's best to normalize the data so that each value lies between 0 and 1 before applying any models.
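
A tiny sketch of the scaling on made-up pixel values (not real MNIST data), mainly to show that dividing by `255.0` also converts the integer pixels to floats:

```python
import numpy as np

# Toy row of raw pixel intensities (illustrative values only)
raw = np.array([0, 64, 128, 255], dtype=np.uint8)

scaled = raw / 255.0  # true division promotes uint8 to float64

print(scaled.dtype)
print(scaled.min(), scaled.max())
```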

In [None]:
# Normalize the data for greyscale pixels
X = X / 255.0
test = test / 255.0

**Reshaping** - The raw dataset has 784 pixel-value columns per image. We need to change the shape of each image from a flat 1 x 784 row into a 28 x 28 x 1 array, where height and width are 28 pixels and the depth (number of channels) is 1.
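
To make the layout concrete, here is a small sketch on toy data. It assumes the 784 columns store each image row by row, so that flat pixel `k` belongs to row `k // 28` and column `k % 28`.

```python
import numpy as np

# Two toy "images", each flattened to 784 values (illustrative data only)
flat = np.arange(2 * 784).reshape(2, 784)

# -1 lets NumPy infer the number of images from the total size
images = flat.reshape(-1, 28, 28, 1)

print(images.shape)  # (2, 28, 28, 1)

# Row-major order: flat pixel k lands at row k // 28, column k % 28
k = 57
print(images[0, k // 28, k % 28, 0] == flat[0, k])  # True
```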

In [None]:
# Reshape images
X = X.values.reshape(-1,28,28,1)
test = test.values.reshape(-1,28,28,1)

**One hot encode** the label values (digits from 0-9).
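
For intuition, a NumPy-only sketch of the encoding follows. `one_hot` is a hypothetical helper written for this illustration and is not the Keras function; it only mimics the shape and content of the result.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) float array with a single 1 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

encoded = one_hot([3, 0, 9])
print(encoded[0])
print(np.argmax(encoded, axis=1))  # argmax recovers the original labels
```

Taking `np.argmax` along the class axis undoes the encoding, which is the same trick used later to turn predicted probabilities back into digits.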

In [None]:
# Encode labels to one hot vectors
Y = to_categorical(Y, num_classes = 10)

# Visualizing the Data

Plot the first 9 images in the test set.

In [None]:
for i in range(9):
    plt.subplot(3, 3, i + 1)
    img = plt.imshow(test[i][:, :, 0], cmap=plt.get_cmap('Purples'))
    img.axes.get_xaxis().set_visible(False)
    img.axes.get_yaxis().set_visible(False)

# Create train and validation sets

In [None]:
# Set the random seed
random_seed = 2

# Split the train and the validation set for the fitting
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size = 0.1, random_state=random_seed)

# Strategy 1 - Simple CNN Architecture

Let's start with a relatively simple CNN architecture: `[[Conv2D->relu]*2 -> MaxPool2D -> Dropout]*2 -> Flatten -> Dense -> Dropout -> Out`
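
As a back-of-the-envelope check on the size of this architecture, the sketch below tracks the feature-map size and counts trainable parameters in plain Python. The layer sizes (32 and 64 filters, 5x5 and 3x3 kernels, a 256-unit dense layer) are taken from the model definition in this section; `conv_params` and `dense_params` are hypothetical helper names.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a 2-D convolution."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units

size, channels, total = 28, 1, 0

# (kernel size, filters) per conv layer; 'same' padding keeps height and width
for block in [[(5, 32), (5, 32)], [(3, 64), (3, 64)]]:
    for kernel, filters in block:
        total += conv_params(kernel, channels, filters)
        channels = filters
    size //= 2  # 2x2 max pooling halves height and width

flat = size * size * channels  # values entering the first Dense layer
total += dense_params(flat, 256) + dense_params(256, 10)

print(size, channels, flat)  # 7 64 3136
print(total)                 # 887530
```

Most of the parameters sit in the first dense layer (3136 x 256 weights), which is one reason dropout is applied around it.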

In [None]:
# Set the CNN model 

model1 = Sequential()

model1.add(Conv2D(filters = 32, kernel_size = (5,5), padding = 'Same', activation ='relu', input_shape = (28,28,1)))
model1.add(Conv2D(filters = 32, kernel_size = (5,5), padding = 'Same', activation ='relu'))
model1.add(MaxPool2D(pool_size=(2,2)))
model1.add(Dropout(0.25))

model1.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'Same', activation ='relu'))
model1.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'Same', activation ='relu'))
model1.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model1.add(Dropout(0.25))

model1.add(Flatten())
model1.add(Dense(256, activation = "relu"))
model1.add(Dropout(0.5))
model1.add(Dense(10, activation = "softmax"))

# Compile the model
model1.compile(optimizer = 'adam' , loss = "categorical_crossentropy", metrics=["accuracy"])

In [None]:
# define number of epochs and batch size
epochs = 5
batch_size = 86

# fit the model
model1.fit(
    X_train, 
    Y_train,
    batch_size=batch_size, 
    epochs=epochs,
    validation_data=(X_val, Y_val),
    verbose=2
)

Trained for 30 epochs (the cell above sets `epochs = 5`, so raise it to 30 to match), this strategy achieves a validation score of 0.9926 and a leaderboard score of 0.9939 - already pretty good!

In [None]:
# predict results
results = model1.predict(test)

# select the index with the maximum probability
results = np.argmax(results,axis = 1)
results = pd.Series(results,name="Label")

# create the submission file
submission1 = pd.concat([pd.Series(range(1,28001),name = "ImageId"),results],axis = 1)

# save submission file in output folder
submission1.to_csv("sub1.csv",index=False)

# Strategy 2 - Add data augmentation and a learning rate annealer

Next we'll use the same model, but we'll augment the training set by applying random rotations of up to 10 degrees, random zooms of up to 10%, and random vertical and horizontal shifts of up to 10%.
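
To give a feel for what a 10% shift does to a 28 x 28 image, here is a simplified NumPy sketch. It is not how Keras implements augmentation: it assumes whole-pixel shifts and fills vacated pixels with 0, whereas the Keras generator draws fractional shifts and interpolates. `shift_image` and `random_shift` are hypothetical helpers.

```python
import numpy as np

def shift_image(img, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels, filling vacated pixels with 0."""
    h, w = img.shape
    out = np.zeros_like(img)
    src_y = slice(max(0, -dy), min(h, h - dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_y = slice(max(0, dy), min(h, h + dy))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

def random_shift(img, rng, max_fraction=0.1):
    """Shift by a random whole number of pixels, up to max_fraction of each side."""
    h, w = img.shape
    max_dy, max_dx = int(h * max_fraction), int(w * max_fraction)
    dy = int(rng.integers(-max_dy, max_dy + 1))
    dx = int(rng.integers(-max_dx, max_dx + 1))
    return shift_image(img, dy, dx)

# Toy image with a single bright pixel (illustrative only)
img = np.zeros((28, 28))
img[5, 5] = 1.0

moved = shift_image(img, dy=2, dx=-1)
print(np.argwhere(moved == 1.0))  # the pixel is now at row 7, column 4

augmented = random_shift(img, np.random.default_rng(0))
print(augmented.sum())
```

With `max_fraction=0.1` the shift is at most 2 whole pixels in each direction, so digits stay well inside the frame.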

We'll also add a learning rate annealer that halves the learning rate whenever the validation loss hasn't improved for 3 consecutive epochs, down to a minimum of 0.00001.
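
The reduce-on-plateau rule can be sketched in a few lines of plain Python. This is a simplified illustration rather than the Keras implementation: it ignores the callback's cooldown and minimum-improvement threshold, and both the function name `simulate_reduce_on_plateau` and the loss values are made up.

```python
def simulate_reduce_on_plateau(val_losses, lr=0.001, factor=0.5,
                               patience=3, min_lr=1e-5):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0          # epochs since the last improvement
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # never go below min_lr
                wait = 0
        history.append(lr)
    return history

# Made-up validation losses: improvement stalls after the second epoch
losses = [0.50, 0.40, 0.41, 0.42, 0.43, 0.39, 0.40]
history = simulate_reduce_on_plateau(losses)
print(history)
```

In this toy run the rate stays at 0.001 for the first four epochs and is halved to 0.0005 after the third consecutive epoch without improvement.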

In [None]:
# recompile model1 with a new optimizer
# (compiling does not reset the weights, so training continues from Strategy 1)
model1.compile(optimizer = 'RMSprop' , loss = "categorical_crossentropy", metrics=["accuracy"])

In [None]:
# Set a learning rate annealer
learning_rate_reduction = ReduceLROnPlateau(monitor='val_loss', 
                                            patience=3, 
                                            verbose=1, 
                                            factor=0.5, 
                                            min_lr=0.00001)

In [None]:
datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=10,  # randomly rotate images in the range (degrees, 0 to 180)
        zoom_range = 0.1, # Randomly zoom image 
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=False,  # randomly flip images
        vertical_flip=False)  # randomly flip images


datagen.fit(X_train)  # only needed for featurewise statistics or ZCA whitening, which are disabled here

In [None]:
# define number of epochs and batch size
epochs = 5
batch_size = 86

# fit_generator is deprecated; in TensorFlow 2.1+ Model.fit accepts generators directly
history = model1.fit(datagen.flow(X_train, Y_train, batch_size=batch_size),
                     epochs=epochs, validation_data=(X_val, Y_val),
                     verbose=2, steps_per_epoch=X_train.shape[0] // batch_size,
                     callbacks=[learning_rate_reduction])

Trained for 30 epochs (the cell above sets `epochs = 5`), strategy 2 improves the validation score to 0.9962, but the leaderboard score drops slightly to 0.99378, which suggests we've over-fitted this time.

In [None]:
# predict results
results = model1.predict(test)

# select the index with the maximum probability
results = np.argmax(results,axis = 1)
results = pd.Series(results,name="Label")

# create the submission file
submission2 = pd.concat([pd.Series(range(1,28001),name = "ImageId"),results],axis = 1)

# save submission file in output folder
submission2.to_csv("sub2.csv",index=False)

# Strategy 3 - Build a more complex CNN architecture and train using a k-fold cross validation scheme

Lastly, we'll build a more complicated CNN model and, instead of using a simple train / validation split, we'll use a k-fold strategy that lets every member of the training set spend some time in the validation set.
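
The idea that every sample takes exactly one turn in the validation set can be sketched with NumPy alone. This is a plain (non-stratified) split written for illustration. It does not preserve class proportions per fold as a stratified split does, and `k_fold_indices` is a hypothetical helper.

```python
import numpy as np

def k_fold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is in exactly one val fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

# Count how often each of 25 toy samples is used for validation
val_counts = np.zeros(25, dtype=int)
for train_idx, val_idx in k_fold_indices(n_samples=25, n_splits=5):
    assert np.intersect1d(train_idx, val_idx).size == 0  # folds never overlap
    val_counts[val_idx] += 1

print(val_counts)  # every entry is 1
```

For a clean per-fold validation score, the usual approach is to build a fresh model for each fold, so that the validation fold has never been used to update that model's weights.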

In [None]:
# Set the CNN model 2

def Model_2(x=None):
    # we initialize the model
    model = Sequential()

    # Conv Block 1
    model.add(Conv2D(64, (5, 5), input_shape=(28,28,1),  padding='same', kernel_regularizer=dense_regularizer,kernel_initializer="he_normal"))
    model.add(BatchNormalization())
    model.add(Activation('elu'))
    model.add(Conv2D(64, (5, 5),   padding='same', kernel_regularizer=dense_regularizer,kernel_initializer="he_normal"))
    model.add(BatchNormalization())
    model.add(Activation('elu'))
    model.add(Conv2D(64, (5, 5),  padding='same', kernel_regularizer=dense_regularizer,kernel_initializer="he_normal"))
    model.add(BatchNormalization())
    model.add(Activation('elu'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(Dropout(0.3))

    # Conv Block 2
    model.add(Conv2D(128, (3, 3),  padding='same', kernel_regularizer=dense_regularizer,kernel_initializer="he_normal"))
    model.add(BatchNormalization())
    model.add(Activation('elu'))
    model.add(Conv2D(128, (3, 3),  padding='same', kernel_regularizer=dense_regularizer,kernel_initializer="he_normal"))
    model.add(BatchNormalization())
    model.add(Activation('elu'))
    model.add(Conv2D(128, (3, 3),  padding='same', kernel_regularizer=dense_regularizer,kernel_initializer="he_normal"))
    model.add(BatchNormalization())
    model.add(Activation('elu'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(Dropout(0.2))

    # Conv Block 3
    model.add(Conv2D(256, (3, 3),  padding='same', kernel_regularizer=dense_regularizer,kernel_initializer="he_normal"))
    model.add(BatchNormalization())
    model.add(Activation('elu'))
    model.add(Conv2D(256, (3, 3),  padding='same', kernel_regularizer=dense_regularizer,kernel_initializer="he_normal"))
    model.add(BatchNormalization())
    model.add(Activation('elu'))
    model.add(Conv2D(256, (3, 3),  padding='same', kernel_regularizer=dense_regularizer,kernel_initializer="he_normal"))
    model.add(BatchNormalization())
    model.add(Activation('elu'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(Dropout(0.2))

    # Conv Block 4
    model.add(Conv2D(512, (3, 3),  padding='same', kernel_regularizer=dense_regularizer,kernel_initializer="he_normal"))
    model.add(BatchNormalization())
    model.add(Activation('elu'))
    model.add(Conv2D(512, (3, 3),  padding='same', kernel_regularizer=dense_regularizer,kernel_initializer="he_normal"))
    model.add(BatchNormalization())
    model.add(Activation('elu'))
    model.add(MaxPooling2D(pool_size=(3, 3), strides=(3, 3)))

    # FC layers
    model.add(Flatten())
    model.add(Dense(10, activation='softmax', kernel_regularizer=dense_regularizer,kernel_initializer="he_normal"))

    return model

model2 = Model_2()

In [None]:
# Define the optimizer for model 2
optimizer = RMSprop(learning_rate=0.001, rho=0.9, epsilon=1e-08)  # `lr` and `decay` are deprecated argument names

# compile model 2
model2.compile(optimizer = optimizer , loss = "sparse_categorical_crossentropy", metrics=["accuracy"])

In [None]:
# define new Y df for use with the new model
Y = train["label"]
Y = np.array(Y)

In [None]:
# Stratified K-Fold
k_fold = StratifiedKFold(n_splits=10, random_state=12, shuffle=True)

batch_size = 86
epochs = 10

for k_train_index, k_test_index in k_fold.split(X, Y):
    # Note: model2 is not re-initialized between folds, so each fold continues
    # training the same weights and later validation folds have already been seen
    model2.fit(datagen.flow(X[k_train_index], Y[k_train_index], batch_size=batch_size),
               epochs=epochs, validation_data=(X[k_test_index], Y[k_test_index]),
               verbose=2, steps_per_epoch=len(k_train_index) // batch_size,
               callbacks=[learning_rate_reduction])

In [None]:
# Evaluated on the full training set, so this is not a held-out validation score
train_loss, train_acc = model2.evaluate(X, Y)
train_acc

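Training 10 folds for 10 epochs each, strategy 3 reaches an accuracy of 0.9998. Note that this figure is measured on the full training set, every image of which the model has trained on in at least one fold, so it is not a held-out validation score. The leaderboard score is 0.99596.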

In [None]:
# predict results
results = model2.predict(test)

# select the index with the maximum probability
results = np.argmax(results,axis = 1)
results = pd.Series(results,name="Label")

# create the submission file
submission3 = pd.concat([pd.Series(range(1,28001),name = "ImageId"),results],axis = 1)

# save submission file in output folder
submission3.to_csv("sub3.csv",index=False)

# Evaluate the model

In [None]:
# Plot confusion matrix

Y = train["label"]
Y = to_categorical(Y, num_classes = 10)

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.BuPu):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # Normalize first so the colours and the cell text show the same values
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Predict class probabilities for the full training set
Y_pred = model2.predict(X)
# Convert predicted probabilities to class indices
Y_pred_classes = np.argmax(Y_pred,axis = 1) 
# Convert the one-hot true labels to class indices
Y_true = np.argmax(Y,axis = 1) 
# compute the confusion matrix
confusion_mtx = confusion_matrix(Y_true, Y_pred_classes) 
# plot the confusion matrix
plot_confusion_matrix(confusion_mtx, classes = range(10))

In [None]:
# Display some error results 

# Errors are difference between predicted labels and true labels
errors = (Y_pred_classes - Y_true != 0)

Y_pred_classes_errors = Y_pred_classes[errors]
Y_pred_errors = Y_pred[errors]
Y_true_errors = Y_true[errors]
X_val_errors = X[errors]

def display_errors(errors_index,img_errors,pred_errors, obs_errors):
    """ This function shows 6 images with their predicted and real labels"""
    n = 0
    nrows = 2
    ncols = 3
    fig, ax = plt.subplots(nrows,ncols,sharex=True,sharey=True)
    for row in range(nrows):
        for col in range(ncols):
            error = errors_index[n]
            ax[row,col].imshow((img_errors[error]).reshape((28,28)), cmap=plt.get_cmap('Purples'))
            ax[row,col].set_title("Predicted label :{}\nTrue label :{}".format(pred_errors[error],obs_errors[error]))
            n += 1

# Probabilities of the wrong predicted numbers
Y_pred_errors_prob = np.max(Y_pred_errors,axis = 1)

# Predicted probabilities of the true values in the error set
true_prob_errors = Y_pred_errors[np.arange(len(Y_true_errors)), Y_true_errors]

# Difference between the probability of the predicted label and the true label
delta_pred_true_errors = Y_pred_errors_prob - true_prob_errors

# Indices of the errors, sorted by ascending delta
sorted_delta_errors = np.argsort(delta_pred_true_errors)

# Top 6 errors (largest deltas)
most_important_errors = sorted_delta_errors[-6:]

# Show the top 6 errors
display_errors(most_important_errors, X_val_errors, Y_pred_classes_errors, Y_true_errors)

These are 6 errors our model makes when it's asked to predict over the whole training dataset. A few are mistakes that most humans would likely make too, and a couple look like incorrectly labelled examples in the training set. It's usually unrealistic to expect a machine learning model to eliminate errors like these, so I think we can be pretty confident that we have a very robust model.
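
The ranking used to pick these errors (confidence in the wrong prediction minus the probability assigned to the true class) can be sketched on toy data. The probabilities and labels below are made up for illustration and are not model output.

```python
import numpy as np

# Toy softmax outputs for 4 samples and 3 classes (made-up numbers)
probs = np.array([
    [0.70, 0.20, 0.10],   # true class 0: correct
    [0.10, 0.60, 0.30],   # true class 2: wrong, gap 0.30
    [0.05, 0.05, 0.90],   # true class 0: wrong, gap 0.85
    [0.45, 0.55, 0.00],   # true class 0: wrong, gap 0.10
])
true = np.array([0, 2, 0, 0])

pred = probs.argmax(axis=1)
wrong = np.flatnonzero(pred != true)  # indices of misclassified samples

# Probability of the predicted class minus probability of the true class
gap = probs[wrong, pred[wrong]] - probs[wrong, true[wrong]]

# Largest gap first: the most confidently wrong predictions
worst_first = wrong[np.argsort(gap)[::-1]]
print(worst_first)  # [2 1 3]
```

A large gap means the model was confidently wrong, which is where mislabelled or genuinely ambiguous images tend to show up.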