# CNN + MNIST + Data Augmentation
The goal of this notebook is to classify handwritten digits with the best possible accuracy.
The input is a `(28,28)` grayscale "image".

This notebook combines several techniques to reach 99.9% accuracy:
* CNN
* A larger dataset (we add MNIST images)
* Data Augmentation

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
import warnings

warnings.filterwarnings("ignore")
plt.rcParams['figure.figsize'] = [20, 20]

# helper to show many images at once, for debugging
def show_images(images, labels, shape=(3,3)):
    fig, p = plt.subplots(shape[0], shape[1])
    i = 0
    for x in p:
        for ax in x:
            ax.imshow(images[i])
            # labels are one-hot encoded, so show the digit instead of the raw vector
            ax.set_title(np.argmax(labels[i]))
            i += 1

In [None]:
# load the train and test data (csv files)
# file structure: LABEL, PIXELS...
train = pd.read_csv("/kaggle/input/digit-recognizer/train.csv") 
# file structure: PIXELS...
test = pd.read_csv("/kaggle/input/digit-recognizer/test.csv")

# we reshape and normalize the data
train_image = np.array(train.drop(['label'], axis=1), dtype="float32") / 255
train_image = train_image.reshape(-1, 28, 28, 1)

# one-hot encode the labels: 1 => [0 1 0 0 0 0 0 0 0 0], 3 => [0 0 0 1 0 0 0 0 0 0]...
train_label = tf.keras.utils.to_categorical(train['label'])

test = np.array(test, dtype="float32") / 255
test = test.reshape(-1, 28, 28, 1)

show_images(train_image[:25], train_label[:25], shape=(5,5))
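
To make the preprocessing above concrete, here is a minimal sketch on made-up data. The names `fake_labels`, `fake_pixels`, `fake_images` and `fake_one_hot` are hypothetical stand-ins, and the one-hot step uses plain NumPy instead of `to_categorical`; for labels 0 to 9 it should give the same kind of result.

```python
import numpy as np

# Hypothetical stand-in for two rows of train.csv: a label plus 784 pixel values (0-255).
rng = np.random.default_rng(0)
fake_labels = np.array([1, 3])
fake_pixels = rng.integers(0, 256, size=(2, 784))

# Scale pixels to [0, 1] and restore the 28x28 layout with a single channel axis.
fake_images = (fake_pixels.astype("float32") / 255).reshape(-1, 28, 28, 1)

# One-hot encode the labels: row k of the 10x10 identity matrix encodes digit k.
fake_one_hot = np.eye(10, dtype="float32")[fake_labels]

print(fake_images.shape)  # (2, 28, 28, 1)
print(fake_one_hot[0])    # 1.0 at index 1, zeros elsewhere
```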

## Adding more data to the dataset
The dataset Kaggle provides is limited. MNIST is a database of handwritten digits that we can use to extend it.

In deep learning, more data usually gives a better model.

So the final dataset is the Kaggle data plus MNIST.

I concatenate the MNIST training and test splits into a single set because validation is handled during training, where a validation subset tracks how the model's accuracy evolves.

In [None]:
from tensorflow.keras.datasets import mnist
(image_train_mnist, label_train_mnist), (image_test_mnist, label_test_mnist) = mnist.load_data()
image_mnist = np.concatenate((image_train_mnist, image_test_mnist))
label_mnist = np.concatenate((label_train_mnist, label_test_mnist))
image_mnist = image_mnist.reshape(-1,28,28,1)
image_mnist = image_mnist.astype(np.float32) / 255
label_mnist = tf.keras.utils.to_categorical(label_mnist,num_classes=10)
images = np.concatenate((train_image, image_mnist))
labels = np.concatenate((train_label, label_mnist))

# final dataset shape
print("training image dataset shape:", images.shape)
print("training label dataset shape:", labels.shape)

show_images(images[:25], labels[:25], shape=(5,5))

## Data Augmentation
To provide more data during the training process, we are going to use Data Augmentation.

Data Augmentation is a fancy way of saying that we rotate, translate, zoom and shear each image to create new ones that are similar but still different.

Data Augmentation helps the model avoid overfitting.
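
As an illustration of the idea, here is a minimal NumPy sketch of one such transformation, a random translation. The helper `random_shift` and its parameters are hypothetical and only mimic the general principle; this is not how `ImageDataGenerator` is implemented, and it ignores rotation, zoom and shear.

```python
import numpy as np

def random_shift(image, max_fraction=0.2, rng=None):
    """Return a copy of a (H, W) image translated by a random offset, zero-filled."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape
    max_dy, max_dx = int(height * max_fraction), int(width * max_fraction)
    dy = int(rng.integers(-max_dy, max_dy + 1))
    dx = int(rng.integers(-max_dx, max_dx + 1))

    shifted = np.zeros_like(image)
    # Source and destination windows that overlap after moving by (dy, dx).
    src_rows = slice(max(0, -dy), min(height, height - dy))
    dst_rows = slice(max(0, dy), min(height, height + dy))
    src_cols = slice(max(0, -dx), min(width, width - dx))
    dst_cols = slice(max(0, dx), min(width, width + dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted

# A crude vertical bar standing in for a handwritten "1".
digit = np.zeros((28, 28), dtype="float32")
digit[10:18, 12:16] = 1.0

# Four variants of the same digit, each moved by a different random offset.
variants = [random_shift(digit, rng=np.random.default_rng(seed)) for seed in range(4)]
```

Each variant still shows the same digit, so it keeps the same label, but the pixels the network sees are different every time.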

In [None]:
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.20,
    shear_range=15,
    zoom_range=0.10,
    validation_split=0.25,
    horizontal_flip=False
)

# the train generator yields batches of images augmented with the rules defined in datagen
train_generator = datagen.flow(
    images,
    labels, 
    batch_size=256,
    subset='training',
)

# the validation generator yields batches from the validation split, augmented with the same datagen rules
validation_generator = datagen.flow(
    images,
    labels, 
    batch_size=64,
    subset='validation',
)

## The model
The model is a very simple CNN. Nothing fancy here.

We use the Adam optimizer with its default settings and categorical cross-entropy as the loss.

In [None]:
def create_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Reshape((28, 28, 1)),
        tf.keras.layers.Conv2D(filters=32, kernel_size=(5,5), activation="relu", padding="same", input_shape=(28,28,1)),
        tf.keras.layers.MaxPool2D((2,2)),
        tf.keras.layers.Conv2D(filters=64, kernel_size=(3,3), activation="relu", padding="same"),
        tf.keras.layers.Conv2D(filters=64, kernel_size=(3,3), activation="relu", padding="same"),
        tf.keras.layers.MaxPool2D((2,2)),
        tf.keras.layers.Conv2D(filters=128, kernel_size=(3,3), activation="relu", padding="same"),
        tf.keras.layers.Conv2D(filters=128, kernel_size=(3,3), activation="relu", padding="same"),
        tf.keras.layers.MaxPool2D((2,2)),

        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="sigmoid"),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Dense(512, activation="sigmoid"),
        tf.keras.layers.Dropout(0.25),
        tf.keras.layers.Dense(256, activation="sigmoid"),
        tf.keras.layers.Dropout(0.1),
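        # softmax outputs a probability distribution over the 10 digits, which is what categorical_crossentropy expects
        tf.keras.layers.Dense(10, activation="softmax")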
    ])

    model.compile(
        optimizer="adam", loss = 'categorical_crossentropy', metrics = ['accuracy']
    )

    return model

model = create_model()
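
To get a feel for the size of this network, the sketch below follows the layer stack by hand in plain Python. It tracks the feature map size and counts trainable parameters. The helper names are hypothetical, and the numbers assume the layers exactly as listed in `create_model` (`padding="same"` convolutions and 2x2 max pooling).

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square 2D convolution."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units

size, channels, total = 28, 1, 0
# (kernel size, filters, followed by 2x2 max pooling?)
conv_stack = [(5, 32, True), (3, 64, False), (3, 64, True), (3, 128, False), (3, 128, True)]
for kernel, filters, pooled in conv_stack:
    total += conv_params(kernel, channels, filters)
    channels = filters  # padding="same" keeps the spatial size unchanged
    if pooled:
        size //= 2      # 2x2 pooling halves it, rounding down: 28 -> 14 -> 7 -> 3

flattened = size * size * channels  # length of the vector produced by Flatten
inputs = flattened
for units in (512, 512, 256, 10):
    total += dense_params(inputs, units)
    inputs = units

print(size, channels, flattened, total)  # 3 128 1152 1264586
```

Most of the roughly 1.26 million parameters sit in the dense layers, with the first dense layer alone accounting for 590,336 of them.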

### ReduceLR and Checkpoint
ReduceLROnPlateau adjusts the learning rate of the Adam optimizer during training. If the val_loss metric stops improving for several epochs, the learning rate is decreased, so the model can keep making progress with smaller steps.

The checkpoint avoids losing the best model when it does not come from the last epoch.
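
The plateau rule can be sketched in a few lines of plain Python. This is a simplified model of the behaviour, not the Keras implementation: `simulate_reduce_lr` is a hypothetical helper, it ignores the `min_delta` and `cooldown` options of the real callback, and the starting rate of `1e-3` assumes Adam's default.

```python
def simulate_reduce_lr(val_losses, initial_lr=1e-3, factor=0.1, patience=5, min_lr=1e-6):
    """Return the learning rate in effect after each epoch under a plateau rule."""
    lr, best, wait, schedule = initial_lr, float("inf"), 0, []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0          # improvement: remember it and reset the counter
        else:
            wait += 1                     # no improvement this epoch
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # shrink the rate, but never below min_lr
                wait = 0
        schedule.append(lr)
    return schedule

# Made-up validation losses: two improving epochs, then five flat ones.
losses = [0.5, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]
schedule = simulate_reduce_lr(losses)

# The epoch a "save best only" checkpoint would keep: the first one with the lowest loss.
best_epoch = min(range(len(losses)), key=losses.__getitem__)
```

With these made-up losses the rate stays at its initial value for the first six epochs and drops by a factor of 10 on the seventh, while the checkpoint would hold the weights from the second epoch (index 1).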

In [None]:
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, min_lr=0.000001, verbose=1)
checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath='model.hdf5', monitor='val_loss', save_best_only=True, save_weights_only=True, verbose=1)

We train the model for 60 epochs. This number is not tuned; I am just trying different values at this point.

In [None]:
# model.fit accepts generators directly; fit_generator is deprecated in TensorFlow 2
history = model.fit(train_generator, epochs=60, validation_data=validation_generator, callbacks=[reduce_lr, checkpoint], verbose=1)

In [None]:
model.load_weights('model.hdf5')

history_frame = pd.DataFrame(history.history)
history_frame.loc[:, ['loss', 'val_loss']].plot()
history_frame.loc[:, ['accuracy', 'val_accuracy']].plot();

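# quick sanity check on the first 500 images; they come from the dataset used during training, not from a held-out test set
test_loss, test_acc = model.evaluate(images[:500], labels[:500], verbose=2)
print("model accuracy:", test_acc, ", model loss:", test_loss)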

In [None]:
# Code used to submit the model to the competition

# reuse the test array prepared earlier (already scaled to [0, 1] and reshaped to (N, 28, 28, 1))
res = np.argmax(model.predict(test), axis=1)
csv = pd.DataFrame({'ImageId': range(1, len(res) + 1), "Label": res})
csv.to_csv('submission.csv', index=False)
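
For readers unsure how predictions become submission rows, here is a small sketch on made-up scores. The array `fake_scores` is hypothetical and stands in for the output of `model.predict`; the real output has one row per test image.

```python
import numpy as np

# Hypothetical model output for three test images: one row of 10 class scores each.
fake_scores = np.zeros((3, 10), dtype="float32")
fake_scores[0, 7] = 0.9
fake_scores[1, 2] = 0.8
fake_scores[2, 0] = 0.6

predicted_digits = np.argmax(fake_scores, axis=1)    # index of the highest score per row
image_ids = np.arange(1, len(predicted_digits) + 1)  # ImageId starts at 1, not 0

for image_id, digit in zip(image_ids, predicted_digits):
    print(f"{image_id},{digit}")
```

The predicted digit is simply the position of the largest score in each row, which is why no decoding step is needed beyond `argmax`.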