# Introduction

This is my first computer vision model. It is a custom CNN built by following a simple iterative process:
- get a baseline model quickly
- whenever the model suffers from bias (underfitting), either train it longer or make it more complex (with more parameters to train)
- whenever the model does not generalize well (overfitting), either add regularization (I used dropout) or use a bigger training set
- and iterate...
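
The decision rule behind this loop can be sketched as a tiny helper. This is only an illustration: the function name `next_step` and the thresholds `target_acc` and `max_gap` are hypothetical and are not values used anywhere in this notebook.

```python
def next_step(train_acc, val_acc, target_acc=0.995, max_gap=0.003):
    """Suggest the next action from training and validation accuracy.

    The thresholds are illustrative assumptions, not values tuned in this notebook.
    """
    if train_acc < target_acc:
        # The model does not even fit the training set well enough: bias problem
        return "bias: train longer or add capacity"
    if train_acc - val_acc > max_gap:
        # Good fit on the training set but a large gap on validation: variance problem
        return "variance: add regularization or more data"
    return "ok: stop or raise the target"


# Hypothetical accuracies, for illustration only
print(next_step(0.990, 0.989))
print(next_step(0.999, 0.992))
print(next_step(0.997, 0.996))
```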

This simple "recipe" worked well on this dataset:
- v1 = baseline model: **98.99%** accuracy on the test set, with some difficulty generalizing.
- v2 = v1 + dropout: **99.22%** on the test set, still with some difficulty generalizing.
- v3 = v2 + data augmentation: **99.4%** on the test set; a bias problem appeared with this bigger dataset.
- v4 = v3 + additional ConvNet blocks and FC layers: **99.45%** on the test set; the bias problem was fixed, but the model again had some difficulty generalizing.
- v5 = v4 + data augmentation: **99.61%** on the test set.

**At this level of performance, the model performs (almost) as well as a human being: the images it misclassifies are not clean and are hard for a human to recognize too.**

# Agenda

1. [EDA](#1)

2. [Dataset preparation with image data augmentation](#2)

3. [Model Creation & Training](#3)

    3.1. [Model creation](#3.1)

    3.2. [Model training with CV](#3.2) 

    3.3. [Model performance evaluation](#3.3)

4. [Error Analysis](#4)

    4.1. [Confusion matrix](#4.1)
    
    4.2. [Display a sample of images with bad predictions](#4.2)
    
    4.3. [Statistics about the "confidence score" for correct and bad predictions](#4.3)

5. [Submit Predictions](#5)


In [None]:
# Imports
import os, warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data sets
ds_train = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')
ds_test = pd.read_csv('/kaggle/input/digit-recognizer/test.csv')

# 1. EDA <a id='1'/>

A quick analysis of the data set shows:

- The training data set has 42000 images of 28x28 pixels.  

- The test data set has 28000 images of 28x28 pixels.

- The data set is pretty well balanced with a minimum of 3795 images representing a 5 and a maximum of 4684 images representing a 1.

- Some pixels are 0 for all images. E.g. the pixels in the 4 corners of the image are always 0.

- The data set was produced by different writers.

- Some handwritten digits are much cleaner and easier to recognize than others. E.g. the 9th image is a 5 but could also be a 9, and the 20th image is a 5 but could also be a 6.

In [None]:
ds_train.head()

In [None]:
ds_train.shape[0]

In [None]:
ds_train['label'].value_counts().sort_values()

In [None]:
# 0 where a pixel is 0 in every image, 1 where it is non-zero in at least one image
all_zeros_pixels = (ds_train.drop('label', axis=1) != 0).any().astype(int).values.reshape(28,28)
plt.imshow(all_zeros_pixels)
plt.title('PIXELS WITH 0 FOR ALL IMAGES')
plt.show()

In [None]:
fig=plt.figure(figsize=(20, 5))
columns = 10
rows = 2
for i in range(0, columns*rows):
    label = ds_train['label'].iloc[i]
    img = ds_train.drop('label',axis=1).iloc[i].values.reshape(28,28)
    fig.add_subplot(rows, columns, i+1)
    plt.title('LABEL = ' + str(label))
    plt.axis('off')
    plt.imshow(img)
plt.show()

# 2. Dataset preparation & Image data augmentation <a id='2'/>

In [None]:
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import confusion_matrix

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from kerastuner.tuners import RandomSearch
from tensorflow.keras.layers.experimental import preprocessing

# Reproducibility
# See https://keras.io/getting_started/faq/#how-can-i-obtain-reproducible-results-using-keras-during-development for details.
def set_seed(seed=42):
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
set_seed()

Let's:

- split the training set into X_train with the pixel columns and y_train with the corresponding labels

- change X_train format to the one expected by Keras. 

In [None]:
X_train = ds_train.copy()

# Creating the X_train dataframe with the pixel columns only and the y_train with the labels
y_train = X_train['label']
X_train.drop('label', axis=1, inplace=True)

# Reshaping X_train to Keras input format (n_samples, height, width, n_channels)
X_train = X_train.to_numpy().reshape(42000,28,28,1)

Let's define a function that generates the augmented samples and adds them to the original samples:

In [None]:
#############################################################################
# input :
# X = ndarray as expected by Keras (n_samples,height,width,n_channels)
# y = 1-d array (n_samples)
# output :
# X = ndarray as expected by Keras (10 x n_samples,height,width,n_channels)
# y = 1-d array (10 x n_samples)
#############################################################################
def data_augmentation(X, y):

    # Defining the data augmentations using Keras preprocessing layers 
    data_augmentation1 = keras.Sequential([
        preprocessing.RandomTranslation(height_factor=0.1, width_factor=0.1, fill_mode='constant'),
        preprocessing.RandomRotation(factor=0.1, fill_mode='constant')
    ])

    data_augmentation2 = keras.Sequential([
        preprocessing.RandomTranslation(height_factor=0.1, width_factor=0.1, fill_mode='constant'),
        preprocessing.RandomZoom(height_factor=0.15, width_factor=0.15, fill_mode='constant')
    ])
    
    data_augmentation3 = keras.Sequential([
        preprocessing.RandomTranslation(height_factor=0.1, width_factor=0.1, fill_mode='constant'),
        preprocessing.RandomZoom(height_factor=0.15, width_factor=0.15, fill_mode='constant'),
        preprocessing.RandomRotation(factor=0.1, fill_mode='constant')
    ])

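    # Generating the augmented samples
    # Random preprocessing layers are only active in training mode; passing
    # training=True makes that explicit when calling them outside of model.fit()
    X_new1_1 = data_augmentation1(X, training=True)
    X_new1_2 = data_augmentation1(X, training=True)
    X_new1_3 = data_augmentation1(X, training=True)
    X_new2_1 = data_augmentation2(X, training=True)
    X_new2_2 = data_augmentation2(X, training=True)
    X_new2_3 = data_augmentation2(X, training=True)
    X_new3_1 = data_augmentation3(X, training=True)
    X_new3_2 = data_augmentation3(X, training=True)
    X_new3_3 = data_augmentation3(X, training=True)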
    
    # Concatenating X with the augmented samples
    X = np.concatenate((X, X_new1_1, X_new1_2, X_new1_3, X_new2_1, X_new2_2, X_new2_3, X_new3_1, X_new3_2, X_new3_3)) 
    y = pd.concat([y, y.copy(), y.copy(), y.copy(), y.copy(), y.copy(), y.copy(), y.copy(), y.copy(), y.copy()], ignore_index=True) 
    
    return X, y

Let's check the data_augmentation function on 10 images. To do this, I'm going to:
- select the first 10 images of the original training set
- apply the data_augmentation function to these 10 images
- get 100 images as a result: the 10 original images + 90 augmented samples (9 per image)
- display the 10 original images together with their 90 corresponding augmented samples

In [None]:
X10 = X_train[0:10]
y10 = y_train[0:10]
X100, y100 = data_augmentation(X10, y10)

fig=plt.figure(figsize=(20,20))
pos = 1
for i in range(0, 10):
    for j in range(i+0, i+100, 10):
        fig.add_subplot(10, 10, pos)
        plt.imshow(tf.squeeze(X100[j]))
        plt.title('Label = ' + str(y100[j]))
        plt.axis('off')
        pos = pos + 1
plt.show()
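
The inner loop `range(i, i+100, 10)` works because `data_augmentation` concatenates whole copies of the batch one after another: with `n` input images, row `i + k*n` of the output is the `k`-th version of image `i` (`k = 0` being the original). Here is a minimal NumPy sketch of that layout, using integer ids instead of images. The `fake_augment` helper is hypothetical and only mimics the concatenation order, not the image transformations.

```python
import numpy as np

def fake_augment(ids, n_copies=10):
    # Mimics the concatenation order of data_augmentation: the original batch
    # followed by n_copies - 1 full copies of it (no image transformation here)
    return np.concatenate([ids] * n_copies)

n = 10
ids = np.arange(n)            # stands in for the 10 original images
out = fake_augment(ids)       # 100 entries

# All versions of image i sit at positions i, i + n, i + 2n, ...
i = 3
positions = list(range(i, i + 10 * n, n))
print(positions)
print(out[positions])
```

Each row of the 10x10 grid therefore shows one original digit followed by its 9 augmented versions.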

# 3. Model creation and training <a id='3' />

## 3.1 Model creation  <a id='3.1'/>

In [None]:
def build_model():
    
    inputs = tf.keras.Input(shape=(28, 28, 1))
    
    # Normalizing inputs
    x = inputs
    x = tf.keras.layers.BatchNormalization()(x)
    
    # ConvNet - this for loop creates 4 conv_blocks
    for i in range(4):
        x = tf.keras.layers.Convolution2D(256, kernel_size=(3, 3), padding="same")(x)
        x = tf.keras.layers.ReLU()(x)
        x = tf.keras.layers.MaxPool2D()(x)
        x = tf.keras.layers.Dropout(0.3)(x)
   
    # Head - this for loop creates 4 layers
    x = tf.keras.layers.Flatten()(x)
    for i in range(4):
        x = tf.keras.layers.Dense(256, activation="relu")(x)
        x = tf.keras.layers.Dropout(0.3)(x)
    
    # Output layer
    outputs = tf.keras.layers.Dense(10, activation="softmax")(x)

    # Returning a compiled model
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"]
    )
    return model
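
With a 28x28 input, four conv blocks shrink the feature map down to a single pixel: the `same`-padded convolutions keep the spatial size, and each `MaxPool2D` with default arguments (2x2 pool, stride 2) floor-divides it by 2. A quick check of the resulting sizes, as plain arithmetic (the `pooled_sizes` helper is hypothetical and does not need TensorFlow):

```python
def pooled_sizes(size=28, n_blocks=4, pool=2):
    """Spatial size after each conv block.

    The 'same'-padded convolution keeps the size; MaxPool2D with default
    arguments floor-divides it by 2.
    """
    sizes = []
    for _ in range(n_blocks):
        size = size // pool
        sizes.append(size)
    return sizes

sizes = pooled_sizes()
print(sizes)                      # [14, 7, 3, 1]

# Flatten() therefore receives a 1 x 1 x 256 tensor, i.e. 256 features
n_filters = 256
flat_features = sizes[-1] * sizes[-1] * n_filters
print(flat_features)              # 256
```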

## 3.2 Model training with CV <a id='3.2'/>

In [None]:
N_EPOCHS = 100

# I use cross-validation with N_SPLITS folds.
N_SPLITS = 5

# N_ITERATION lets me run fewer iterations to save time.
# If N_ITERATION < N_SPLITS, the loop stops after N_ITERATION trainings/evaluations.
N_ITERATION = 1

In [None]:
# Creating hist_df to store history objects for each training / split
hist_df = pd.DataFrame(columns=['iteration', 'history'])
iteration = 1
index = 0

# This boolean variable is used to save one model only. 
saved_model = False

# Reshaping X_train from Keras input format (42000, 28, 28, 1) to (n_samples, n_features) format (42000, 784) as expected by StratifiedKFold.split() function
X_train = X_train.reshape(42000, 784)

# Training and evaluating the model up to N_SPLITS times, each time with a different training/validation set
skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=42)
for train_index, val_index in skf.split(X_train, y_train): # yields the indices of the N_SPLITS train/validation splits
    
    # Getting the training set and validation set before data augmentation
    X_train_, X_val_ = X_train[train_index], X_train[val_index]
    X_train_ = X_train_.reshape(-1,28,28,1) # Reshaping X_train_ to Keras input format (33600 samples)
    X_val_ = X_val_.reshape(-1,28,28,1) # Reshaping X_val_ to Keras input format (8400 samples)
    y_train_, y_val_ = y_train[train_index], y_train[val_index]
    
    # Generating augmented samples for the training fold only, so the validation fold keeps original images
    X_train_, y_train_ = data_augmentation(X_train_, y_train_) # X_train_ is now (336000,28,28,1) and y_train_ is now (336000,)
    
    # Building the model
    model = build_model()
    
    # Training and evaluating each model for this split
    history = model.fit(x=X_train_, y=y_train_, validation_data=(X_val_, y_val_), epochs=N_EPOCHS, batch_size=64)
    
    # Saving the trained model as a saved model file -- only one model is saved
    if not saved_model:
        model.save('model')
        saved_model = True
    
    # Storing the history objects into a dataframe 
    hist_df.loc[index, 'iteration'] = iteration
    hist_df.loc[index, 'history'] = history
    
    if(iteration == N_ITERATION):
        break
        
    index = index + 1
    iteration = iteration + 1

## 3.3 Model performance evaluation <a id='3.3'/>

In [None]:
hist = []
for i in range(N_ITERATION):
    hist.append(pd.DataFrame(hist_df[hist_df['iteration']==(i+1)]['history'][i].history))
    if i==0:
        hist_full = hist[0]
    else:
        hist_full = pd.concat([hist_full, hist[i]])

# Dropping the 1st EPOCHS of each iteration because their losses are high and their accuracies are low 
hist_full.drop([0,1,2], inplace=True)  # 3 EPOCHS dropped / iteration     

# Displaying CV metrics
fig,axes=plt.subplots(1,2,figsize=(20,6))
sns.lineplot(data=hist_full[['loss','val_loss']], dashes=False, ax=axes[0])
axes[0].axhline(0.05, ls='--')
axes[0].axhline(0, ls='--')
sns.lineplot(data=hist_full[['accuracy', 'val_accuracy']], dashes=False, ax=axes[1])
axes[1].axhline(0.99, ls='--')
axes[1].axhline(0.995, ls='--')
axes[1].axhline(0.996, ls='--')
axes[1].axhline(0.997, ls='--')
axes[1].axhline(0.998, ls='--')
axes[1].axhline(1, ls='--')
plt.show()

# 4. Error Analysis <a id='4'/>

## 4.1. Confusion matrix <a id='4.1'/>

In [None]:
# Reloading the saved model -- this is the model trained in the 1st loop of the CV
model = keras.models.load_model('model')

# Retrieving the validation set -- corresponding to the 1st loop of the CV
skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=42) # same parameters as in training, so the 1st split is identical
for _, val_index in skf.split(X_train, y_train): # yields the indices of the N_SPLITS train/validation splits; only the 1st one is used
    X_val = X_train[val_index]
    y_val = y_train[val_index]
    break

X_val = X_val.reshape(8400,28,28,1)    
    
# Making predictions
scores = model.predict(X_val)
y_pred = np.argmax(scores, axis=1)

# Displaying the confusion matrix
plt.figure(figsize = (14,7))
sns.heatmap(confusion_matrix(y_val, y_pred), annot=True, fmt='d', cmap='YlOrBr')
plt.show()
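
The heatmap can be complemented by listing the largest off-diagonal cells, i.e. the pairs of digits that are confused most often. The sketch below runs on a small made-up matrix rather than on the notebook's results, and the helper name `most_confused` is hypothetical. It assumes the scikit-learn convention of rows for true labels and columns for predictions.

```python
import numpy as np

def most_confused(cm, top=3):
    """Return the `top` largest off-diagonal cells as (true_label, predicted_label, count)."""
    cm = np.asarray(cm).copy()
    np.fill_diagonal(cm, 0)                      # ignore correct predictions
    flat_order = np.argsort(cm, axis=None)[::-1][:top]
    rows, cols = np.unravel_index(flat_order, cm.shape)
    return [(int(r), int(c), int(cm[r, c])) for r, c in zip(rows, cols)]

# Toy 3-class matrix (rows = true labels, columns = predictions), made up for illustration
toy_cm = [[50, 2, 0],
          [1, 47, 5],
          [0, 3, 60]]
print(most_confused(toy_cm, top=2))
```

On the real data, the same call would be `most_confused(confusion_matrix(y_val, y_pred))`.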

## 4.2. Display a sample of images with bad predictions <a id='4.2'/>

Let's display misclassified images:

In [None]:
errors = pd.DataFrame({'y_val':y_val, 'prediction':y_pred})[y_val!=y_pred]

# Pixel columns only, computed once instead of at every loop iteration
pixels = ds_train.drop('label',axis=1)

fig=plt.figure(figsize=(20, 20))
columns = 4
rows = 5
# Loop on (at most) the first 20 X_val entries for which the prediction is not correct.
# errors keeps the index of y_val, i.e. the row positions in ds_train.
for j, i in enumerate(errors.index[:rows*columns]):
    label = ds_train['label'].iloc[i]
    predict = errors.loc[i, 'prediction']
    img = pixels.iloc[i].values.reshape(28,28)
    fig.add_subplot(rows, columns, j+1)
    plt.title('GROUND TRUTH = ' + str(label) + ' PREDICT = ' + str(predict))
    plt.imshow(img)
plt.show()

## 4.3. Statistics about the "confidence score" for correct and bad predictions <a id='4.3'/>

In [None]:
predictions = pd.DataFrame({'y_val':y_val, 'y_pred':y_pred, 'scores':np.max(scores, axis=1)})

fig,axes=plt.subplots(1,2,figsize=(20,4))
sns.histplot(predictions[predictions['y_val']==predictions['y_pred']]['scores'], kde=False, ax=axes[0]).set_title('Confidence score distribution for correct predictions')
sns.histplot(predictions[predictions['y_val']!=predictions['y_pred']]['scores'], kde=False, ax=axes[1]).set_title('Confidence score distribution for bad predictions')
plt.show()
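
The two histograms can be complemented by a numeric summary. Here the "confidence score" is the highest softmax probability of each prediction, as in the cell above. The sketch below uses made-up toy scores rather than the model's outputs, and the names `toy_scores` and `toy_true` are hypothetical.

```python
import numpy as np

# Toy softmax outputs for 4 samples and 3 classes, made up for illustration
toy_scores = np.array([[0.98, 0.01, 0.01],
                       [0.10, 0.85, 0.05],
                       [0.40, 0.35, 0.25],
                       [0.05, 0.45, 0.50]])
toy_true = np.array([0, 1, 1, 2])

pred = toy_scores.argmax(axis=1)        # predicted class
conf = toy_scores.max(axis=1)           # confidence score = highest softmax probability
correct = pred == toy_true

print("mean confidence, correct:", conf[correct].mean())
print("mean confidence, wrong:  ", conf[~correct].mean())
```

If wrong predictions tend to have clearly lower scores than correct ones, the score can be used to flag predictions that deserve a second look.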

# 5. Submit Predictions <a id='5'/>

In [None]:
#################################################
# Let's retrain our model with the whole dataset
#################################################

# Creating the X_train dataframe with the pixel columns only and the y_train with the labels
X_train = ds_train.copy()
y_train = X_train['label']
X_train.drop('label', axis=1, inplace=True)

# Reshaping X_train to Keras input format
X_train = X_train.to_numpy().reshape(42000,28,28,1)

# Generating augmented samples
X_train, y_train = data_augmentation(X_train, y_train) # X_train is now (420000,28,28,1) and y_train is now (420000,)

# Building the model
model = build_model()

# Training the model on the whole training set
model.fit(x=X_train, y=y_train, epochs=N_EPOCHS, batch_size=64)

#################################################
# Let's generate the predictions on the test set
#################################################

X_test = ds_test.copy()
X_test = X_test.to_numpy().reshape(28000,28,28,1)
predictions = model.predict(X_test)

##################################
# Let's submit the new predictions
##################################
output = pd.DataFrame({'ImageId': list(range(1, 28001)), 'Label': np.argmax(predictions, axis=1)})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")