# Digit Recognizer using the MNIST dataset

This notebook identifies digits from the MNIST ("Modified National Institute of Standards and Technology") dataset, using a training set of 42,000 labeled images of handwritten digits.  
It is divided into 3 parts, each of which introduces a more complex model. The models are compared with respect to their running time, accuracy and number of trainable parameters. 
At the end, image augmentation is applied to the model with the best score.  


# Reading the data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

train_set = pd.read_csv('../input/digit-recognizer/train.csv')
test_set = pd.read_csv('../input/digit-recognizer/test.csv')

Let's take a look at the data 

In [None]:
train_set.head()

In [None]:
print(f'train shape is: {train_set.shape}')
print(f'test shape is: {test_set.shape}')

## Preparing the data
The training and test data consist of 42,000 and 28,000 images respectively, each of size 28 x 28, i.e. 784 pixels per image. 

The first column in the train data is the label.

We will separate it into X and y variables:

In [None]:
X = train_set.drop('label', axis=1)
labels = train_set['label']

Now we will convert the labels into a `One-Hot` representation:

In [None]:
y = pd.get_dummies(labels, dtype='float32')  # explicit float dtype; newer pandas versions default to bool

In [None]:
print(y.iloc[:6,:])
print(f"y shape:{y.shape}")
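
As a rough illustration of what the one-hot representation looks like, the sketch below builds the same kind of matrix with plain NumPy for a few made-up labels and recovers the labels again with `argmax`. The `one_hot` helper and the sample labels are assumptions made for this example; they are not part of the notebook's pipeline.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[labels]

sample_labels = [1, 0, 4, 7]        # made-up labels, for illustration only
encoded = one_hot(sample_labels)
decoded = encoded.argmax(axis=1)    # position of the 1 in each row

print(encoded.shape)
print(decoded)
```

The same `argmax` step is what later turns the softmax outputs of a model back into digit labels.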

### Normalizing the data
We will normalize the data by scaling all pixel values from the range [0, 255] to the range [0, 1]:


In [None]:
X = X / 255.0
test_set = test_set / 255.0

### Reshaping the data
Since we are planning to use convolutional NNs, we need to convert each image from a flat vector of 784 values into a 28 x 28 array with a single (grayscale) channel.

In [None]:
X = X.values.reshape(-1,28,28,1)
test_set = test_set.values.reshape(-1,28,28,1)

In [None]:
print(f"X:{X.shape} y:{y.shape}")
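
To make the reshape step more concrete, here is a small sketch on fake data (the `flat` array is invented for the example). It shows how `-1` lets NumPy infer the number of images and where a given flat pixel index ends up in the 2D layout.

```python
import numpy as np

# Two fake "images", each stored as a flat row of 784 values.
flat = np.arange(2 * 784, dtype=np.float32).reshape(2, 784)

# -1 lets NumPy infer the number of images; the trailing 1 is the channel axis.
images = flat.reshape(-1, 28, 28, 1)

print(images.shape)
# Flat pixel k of an image lands at row k // 28, column k % 28.
print(images[0, 1, 2, 0] == flat[0, 1 * 28 + 2])
```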

We will split the original training set into train and validation sets using the [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from scikit-learn:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

Since we are going to use different models, we would like to define a few parameters that will be consistent across all of them.

In [None]:
batch_size = 32
epochs = 50

# Building a simple Neural Network (NN) using Keras

As a naive first approach, we will build a simple classifier NN made only of fully-connected (`Dense`) layers.

To avoid [overfitting](https://elitedatascience.com/overfitting-in-machine-learning), we also add a `Dropout` layer (which randomly zeroes 30% of the outputs of the first `Dense` layer during training).

We use a `Flatten` layer to convert the data from its 2D image layout into a 1D vector before the final `Dense` layers.

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

simple_NN = keras.Sequential([
    layers.Dense(100, activation='relu', input_shape=(28,28,1)),
    layers.Dropout(0.3),
    layers.Flatten(),
    layers.Dense(units=100, activation='relu'),
    layers.Dense(units=10, activation='softmax')
])



In [None]:
simple_NN.summary()

Note that using the fully-connected architecture, we end up with **7.8M parameters** to train.
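
The parameter count can be checked by hand. The sketch below is back-of-the-envelope arithmetic, assuming the layer sizes of the model defined above; `dense_params` is a helper introduced only for this example.

```python
def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully-connected layer."""
    return inputs * units + units

# Dense applied directly to a (28, 28, 1) input acts on the last axis only,
# so it sees 1 input feature and produces a (28, 28, 100) tensor.
first_dense = dense_params(1, 100)

# Flatten turns (28, 28, 100) into a vector of 78,400 values.
flattened = 28 * 28 * 100
second_dense = dense_params(flattened, 100)
output_dense = dense_params(100, 10)

total = first_dense + second_dense + output_dense
print(first_dense, second_dense, output_dense, total)
```

Almost all of the parameters sit in the `Dense` layer that follows `Flatten`, because every one of the 78,400 flattened values is connected to every one of its 100 units.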

In [None]:

simple_NN.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),  # set the learning rate explicitly
    loss='categorical_crossentropy',
    metrics=['accuracy']
)


## Early Stopping 
Before training the network, we define an early stopping criterion, to avoid redundant epochs once the model has already converged.


In [None]:
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
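
The idea behind the `patience` setting can be sketched in a few lines of plain Python. This is a simplified illustration, not the Keras implementation: it ignores options such as `min_delta`, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) as 0-based epoch indices.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training ends at the last epoch.
    return len(val_losses) - 1, best_epoch

losses = [0.30, 0.20, 0.15, 0.16, 0.17, 0.155, 0.14]  # made-up values
stop_epoch, best_epoch = early_stopping_epoch(losses)
print(stop_epoch, best_epoch)
```

In this example training stops at epoch 5 and never reaches the better value at epoch 6. With `restore_best_weights=True`, the weights from the best epoch (epoch 2 here) are the ones kept.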


## Reduce Learning Rate On Plateau
We define a [`ReduceLROnPlateau`](https://keras.io/api/callbacks/reduce_lr_on_plateau/) callback to reduce the learning rate when the metric we chose (`val_loss`) has stopped improving.

In [None]:
lrr = ReduceLROnPlateau(monitor='val_loss', patience=3, verbose=1, factor=0.5, min_lr=0.00001)
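
The following sketch simulates, in plain Python, how such a schedule could evolve for a made-up sequence of validation losses. It is a simplification under stated assumptions: it ignores the callback's `min_delta` and `cooldown` options, and `simulate_lr` is a hypothetical helper, not a Keras function.

```python
def simulate_lr(val_losses, lr=0.001, factor=0.5, patience=3, min_lr=1e-5):
    """Return the learning rate in effect after each epoch."""
    best_loss = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best_loss:
            best_loss, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # never drop below min_lr
                wait = 0
        history.append(lr)
    return history

losses = [0.30, 0.20, 0.21, 0.22, 0.23, 0.19, 0.20, 0.20, 0.20]  # made-up values
lr_history = simulate_lr(losses)
print(lr_history)
```

Note that `early_stopping` monitors the same metric with the same patience, so in practice training will often stop at about the time the first reduction would fire. A smaller `patience` for the learning-rate callback is one option if the reductions are meant to take effect.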

We can now fit our simple network to the data and examine the results.

In [None]:

history_simple_NN = simple_NN.fit(
    x=X_train,
    y=y_train,
    validation_data=(X_val, y_val),
    batch_size=batch_size,
    epochs=epochs,
    shuffle=True,
    verbose=2,
    callbacks=[early_stopping, lrr]
)


We can see that the model converges at a `val_accuracy` of about 0.96 (results may differ between runs, due to the random initialization of the parameters).

Let's also take a look at the `loss` and `accuracy` values at each epoch. However, since we will do the same for each model in the notebook, we will first create a helper function for it.

In [None]:
def plot_loss_and_accuracy(history):
    """Plot training vs. validation loss and accuracy from a Keras `History.history` dict."""
    history_df = pd.DataFrame(history)
    history_df[['loss', 'val_loss']].plot()
    history_df[['accuracy', 'val_accuracy']].plot()

In [None]:
plot_loss_and_accuracy(history_simple_NN.history)

We can see that both metrics converge much faster on the validation data than on the training data. This is usually a sign of overfitting, which may be the result of the high number of trainable parameters in this architecture.

To reduce this number, we will turn to a different architecture: the Convolutional Neural Network.

# Building a Convolutional Neural Network (CNN) using Keras
A CNN is an artificial neural network architecture that is most commonly used for analyzing images in computer vision tasks.
The basis of a CNN is its convolutional layers, which are able to detect patterns in the data.

For a quick intro to the theory behind CNNs, I recommend [this video](https://www.youtube.com/watch?v=nmnaO6esC7c&list=PLWKotBjTDoLj3rXBL-EIPRN9V3a9Cx07&index=1).

Notice that we are using 3 layers of convolution, which are considered the feature extraction layers, and a final `Dense` (fully-connected) layer, which acts as the classifier.
 

In [None]:

from tensorflow import keras  # use the Keras bundled with TensorFlow, matching the layers import
from tensorflow.keras import layers


CNN = keras.Sequential([
    layers.Conv2D(32, kernel_size=(3,3), activation='relu', padding='same', input_shape=(28,28,1)),
    layers.Conv2D(64, kernel_size=(3,3), activation='relu', padding='same'),
    layers.Conv2D(64, kernel_size=(3,3), activation='relu', padding='same'),
    layers.Flatten(),
    layers.Dense(10, activation='softmax')
])



In [None]:
CNN.summary()

Note that using the CNN architecture, we now have only ~560K parameters to train, compared to 7.8M parameters in the previous architecture. This is over **90% fewer** parameters to train!
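
As a sanity check, the count can be reproduced by hand. The sketch below assumes the layer sizes of the `CNN` model above; `conv2d_params` and `dense_params` are helpers introduced only for this example.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights of a square-kernel Conv2D layer, plus one bias per filter."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully-connected layer."""
    return inputs * units + units

conv_layers = [
    conv2d_params(3, 1, 32),
    conv2d_params(3, 32, 64),
    conv2d_params(3, 64, 64),
]
# padding='same' keeps the 28 x 28 size, so Flatten yields 28 * 28 * 64 values.
dense = dense_params(28 * 28 * 64, 10)
total = sum(conv_layers) + dense

print(conv_layers, dense, total)
```

The convolutional layers are cheap because each filter is reused at every image position. Most of the remaining parameters still come from the final `Dense` layer.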


In [None]:
CNN.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),  # set the learning rate explicitly
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

Again, we can fit our network to the data and examine the results.

In [None]:
history_CNN = CNN.fit(
    x=X_train, 
    y=y_train,
    validation_data=(X_val, y_val),
    batch_size=batch_size,
    epochs=epochs,
    shuffle=True,
    verbose=2,
    callbacks=[early_stopping, lrr]
)

We can see that the model converges at a `val_accuracy` of about 0.98.

Let's also take a look at the `loss` and `accuracy` values at each epoch:

In [None]:
plot_loss_and_accuracy(history_CNN.history)

These plots also illustrate overfitting, though weaker than before. There are a few possible solutions in this case; the simplest is to **decrease the complexity** of the model by removing layers (a shallower network) or reducing the number of neurons in each layer (a narrower network). However, our model is already fairly shallow and narrow, so there is not much to reduce. The solution, counterintuitively, is to add more layers, but of a special kind: 

1. [`Dropout`](https://keras.io/api/layers/regularization_layers/dropout/): This layer randomly sets a given fraction of its input units to 0 during training.
2. [`MaxPooling`](https://keras.io/api/layers/pooling_layers/max_pooling2d/): This layer slides a window over the input and keeps only the maximum value of each window, which downsamples the feature map.
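
To illustrate the pooling step, here is a minimal NumPy sketch of 2 x 2 max pooling on a single 2D array. It mimics non-overlapping windows without padding; `max_pool_2x2` and the sample array are assumptions made for this example, not the Keras implementation.

```python
import numpy as np

def max_pool_2x2(image):
    """Downsample a 2D array by taking the max of each non-overlapping 2x2 block."""
    h, w = image.shape
    # Drop a trailing row/column if the size is odd, so only full windows are used.
    image = image[: h - h % 2, : w - w % 2]
    blocks = image.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

image = np.array([
    [1, 2, 0, 1],
    [3, 4, 1, 0],
    [0, 0, 5, 6],
    [1, 0, 7, 8],
])
pooled = max_pool_2x2(image)
print(pooled)
```

Each pooling step halves the height and width, which is why it reduces both the number of values passed to later layers and the work done per epoch.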

Another solution is to add more data, by collecting new examples or by data augmentation, which we will discuss below.  

In [None]:
from tensorflow import keras  # use the Keras bundled with TensorFlow, matching the layers import
from tensorflow.keras import layers

CNN_2 = keras.Sequential([
    layers.Conv2D(32, kernel_size=(3,3), activation='relu', padding='same', input_shape=(28,28,1)),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(64, kernel_size=(3,3), activation='relu', padding='same'),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(64, kernel_size=(3,3), activation='relu', padding='same'),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dropout(0.4),
    layers.Dense(10, activation='softmax')
])

In [None]:
CNN_2.summary()

We now have only 61K parameters to train - another reduction of ~90% compared to our previous network.
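
The drop in parameters follows from the shrinking feature maps. The arithmetic below assumes the layer sizes of `CNN_2` above; `conv2d_params` is a helper introduced only for this example.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights of a square-kernel Conv2D layer, plus one bias per filter."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

size = 28
sizes = [size]
for _ in range(3):
    size //= 2          # a 2x2 pool with stride 2 halves each side, rounding down
    sizes.append(size)

conv_total = (
    conv2d_params(3, 1, 32)
    + conv2d_params(3, 32, 64)
    + conv2d_params(3, 64, 64)
)
flattened = size * size * 64        # values entering the final Dense layer
dense = flattened * 10 + 10
total = conv_total + dense

print(sizes, flattened, total)
```

The convolutional layers have exactly as many parameters as before. The saving comes entirely from the final `Dense` layer, which now receives 576 values instead of 50,176.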

In [None]:
CNN_2.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),  # set the learning rate on CNN_2 itself
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

In [None]:
history_CNN_2 = CNN_2.fit(
    x=X_train, 
    y=y_train,
    validation_data=(X_val, y_val),
    batch_size=batch_size,
    epochs=epochs,
    shuffle=True,
    verbose=2,
    callbacks=[early_stopping, lrr]
)

We can see that the model converges at a `val_accuracy` of about 0.992, which is an improvement over the previous scores.

Adding the different layers to deal with the overfitting has another advantage: it decreases the running time of each epoch from `~16 seconds` to `~4 seconds`.

Let's again take a look at the `loss` and `accuracy` values at each epoch:

In [None]:
plot_loss_and_accuracy(history_CNN_2.history)

We can see that the validation and training data seem to converge together this time, which implies that we have solved the main causes of overfitting.

# Image augmentation to increase the accuracy

To further enhance our model, we would like to add more data to train on. Since we do not have more data, we will use data augmentation to artificially create more data samples. 

To do that, we will use the Keras [`ImageDataGenerator`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator). This tool allows us to create new training images by manipulating the existing ones (scaling, rotating, flipping, etc.). However, not all manipulations make sense in the context of handwritten digits (e.g. flipping the digit 7 vertically does not produce a valid digit), so we will only use a small subset of the possible manipulations.

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    featurewise_center=False,
    featurewise_std_normalization=False,
    rotation_range=10,
    zoom_range=0.1,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=False,
    vertical_flip=False,
    validation_split=0.2
)

datagen.fit(X_train)
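
To give a feel for what a label-preserving manipulation does, the sketch below shifts a tiny made-up image horizontally with NumPy. This is only an illustration: `shift_columns` is a hypothetical helper, and the zero fill is a simplification, since `ImageDataGenerator` applies its own random transforms and fill mode.

```python
import numpy as np

def shift_columns(image, pixels):
    """Shift a 2D image horizontally, filling vacated columns with zeros.

    Positive `pixels` moves the content right, negative moves it left.
    """
    shifted = np.zeros_like(image)
    if pixels > 0:
        shifted[:, pixels:] = image[:, :-pixels]
    elif pixels < 0:
        shifted[:, :pixels] = image[:, -pixels:]
    else:
        shifted[:] = image
    return shifted

# A tiny 3 x 4 "image" with a single bright column, for illustration only.
image = np.array([
    [0, 9, 0, 0],
    [0, 9, 0, 0],
    [0, 9, 0, 0],
])
right = shift_columns(image, 1)
left = shift_columns(image, -1)
print(right)
print(left)

# width_shift_range=0.1 on a 28-pixel-wide image allows shifts of up to 2.8 pixels.
max_shift = 0.1 * 28
print(max_shift)
```

A shifted digit still shows the same digit, so the label stays valid while the network sees a slightly different input.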


In [None]:
data_size = len(X_train) 
steps_per_epoch = int(data_size / batch_size)
print(steps_per_epoch)
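
The expected value can be worked out from the numbers used earlier in the notebook (42,000 labeled images, a 20% validation split and a batch size of 32). The variable names below are chosen for this example only.

```python
total_images = 42_000
batch_size = 32

# 20% of the labeled data is held out for validation.
val_images = total_images * 20 // 100
train_images = total_images - val_images

steps = train_images // batch_size       # full batches per epoch
leftover = train_images % batch_size     # images that would not fill a batch

print(train_images, steps, leftover)
```

Here the training set divides evenly into batches, so one epoch of generated data covers as many images as one pass over the original training set.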

Now we can use the data generator as an input to our `fit` function, and re-train our CNN model with the generated data.

In [None]:
history_CNN_2_datagen = CNN_2.fit(
    datagen.flow(X_train,y_train,batch_size=batch_size),
    epochs=epochs,
    shuffle=True,
    validation_data=(X_val,y_val),
    verbose=2,
    callbacks=[early_stopping, lrr],
    steps_per_epoch=steps_per_epoch
)

In [None]:
plot_loss_and_accuracy(history_CNN_2_datagen.history)

# Confusion Matrix
A [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) is a tool for evaluating the predictions of a classifier. 
The horizontal axis shows the labels that the model predicted and the vertical axis shows the true labels. The number in each cell is the number of images with that combination of true and predicted label. The diagonal holds the images that the model classified correctly.
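
A confusion matrix is simple enough to compute by hand. The sketch below does so with NumPy on made-up labels for a 3-class problem; `confusion_counts` is a helper written for this example and follows the same convention (rows are true labels, columns are predicted labels).

```python
import numpy as np

def confusion_counts(y_true, y_pred, num_classes):
    """Count (true, predicted) label pairs: rows are true, columns are predicted."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 1, 2, 2, 1, 0]   # made-up labels, for illustration only
y_pred = [0, 2, 2, 2, 1, 0]
cm = confusion_counts(y_true, y_pred, num_classes=3)

# Correct predictions lie on the diagonal, so accuracy is trace / total.
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(accuracy)
```

The single off-diagonal entry in row 1, column 2 corresponds to the one image whose true label is 1 but which was predicted as 2.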

In [None]:
val_predictions = CNN_2.predict(X_val)
y_pred = val_predictions.argmax(axis=-1)

In [None]:
cm_plot_labels = list(range(10))
print(cm_plot_labels)

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_val.idxmax(axis=1), y_pred)

disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=cm_plot_labels)

disp = disp.plot(include_values=True,
                 cmap=plt.cm.Blues, ax=None, xticks_rotation='horizontal')

plt.show()

# Exploring the wrong predictions of the model 

In [None]:
import matplotlib.pyplot as plt

rows = 5
cols = 9

f = plt.figure(figsize=(2*cols,2*rows))
sub_plot = 1
# Recover the integer label of each validation image from its one-hot row
y_val_array = y_val.idxmax(axis=1).values


for i in range(X_val.shape[0]):
    if y_val_array[i] != y_pred[i]:
        f.add_subplot(rows,cols,sub_plot) 
        sub_plot += 1
        plt.imshow(X_val[i].reshape([28,28]),cmap="Blues")
        plt.axis("off")
        plt.title(f"True: {y_val_array[i]} Pred: {y_pred[i]}", y=-0.15, color="Red")
        if sub_plot > rows * cols:
            break
plt.savefig("error_plots.png")
plt.show()

# Writing the output to a file 

In [None]:
predictions = CNN_2.predict(test_set)
results = predictions.argmax(axis=-1)

In [None]:

result = pd.DataFrame({
    'ImageId': np.arange(1, len(results) + 1),  # 1-based ids, one per test image
    'Label': results
})
result.to_csv("output.csv", index=False)

