## **Intro**
MNIST ("Modified National Institute of Standards and Technology") is the de facto “hello world” 
dataset of computer vision. It consists of tens of thousands of images of handwritten digits.
The goal is simple: we need to teach a machine to recognize them.

To address this classic problem we will go through two simple steps:
1. Briefly explore the MNIST handwritten digits dataset.
2. Build a model that can recognize the digits in the dataset.

Our model will be based on a so-called CNN, a Convolutional Neural Network [3].
What is it?
Basically, it is a neural network that uses convolutional layers,
which are based on the mathematical operation of convolution.
A convolutional layer consists of a set of filters, which are small 2D matrices of numbers.
We will also use Keras [4] with the TensorFlow [5] backend. Many helpful beginner tutorials can be found on the projects'
websites.
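
To make the idea of a filter more concrete, here is a minimal NumPy sketch of sliding a single 3x3 filter over a small image (stride 1, no padding). The image and the vertical-edge filter are made up for illustration, and the helper `conv2d_valid` is a hypothetical name. Keras performs the same kind of operation far more efficiently and learns the filter values during training.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical 5x5 "image": dark on the left, bright on the right
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hypothetical 3x3 filter that responds to vertical edges
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # a 5x5 input and a 3x3 filter give a 3x3 output
print(feature_map)        # large values where the dark-to-bright edge is
```

The output is large where the window covers the edge and zero where the window sees a uniform region, which is what "a filter detects a feature" means in practice.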

**Why use them?**  
Because they are designed specifically for computer vision problems such as digit or image recognition
and have become an industry standard.

**Thanks**  
Thanks to Victor Zhou, who explained CNNs in a simple and understandable way [1], and to Yassine Ghouzam, who made a fundamental tutorial on Kaggle [2].

In [None]:
# Import libraries and tools
# Data preprocessing and linear algebra
import pandas as pd
import numpy as np
np.random.seed(2)

# Visualisation
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
%matplotlib inline

# Tools for cross-validation, error calculation
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import itertools
from keras.utils import to_categorical

# Machine Learning
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau

First, let's outline our goal.  
"In this competition, your goal is to correctly identify digits from a dataset of 
tens of thousands of handwritten images"

## Data load ##

In [None]:
train = pd.read_csv('../input/digit-recognizer/train.csv')
test = pd.read_csv('../input/digit-recognizer/test.csv')

## Data exploration ##

In [None]:
train.info()

In [None]:
test.info()

In [None]:
# train.head()
# As we can see, our dataset consists of a label (the digit, 0-9) and the pixel values of the handwritten digit.
# So we can move on to forming the X_train and Y_train datasets, which will be used by the ML algorithm later.

In [None]:
# Form X_train, Y_train
# Put the digits (the true answers) in Y_train
Y_train = train['label']
# Drop the target variable to form X_train
X_train = train.drop(['label'], axis = 1)

In [None]:
# We can delete the train DataFrame to free some memory, since from now on we will use only X_train and Y_train.
del train

In [None]:
# Count how many examples of each digit we have in Y_train
Y_train.value_counts(ascending=False)

### Check missing data ###

In [None]:
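X_train.isnull().any().sum()  # number of columns with at least one missing value

In [None]:
test.isnull().any().sum()  # number of columns with at least one missing value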

We see that there is no missing data in either dataset. Good, let's move on.
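
A small caveat about this kind of check: `.any().sum()` counts the columns that contain at least one missing value, while `.any().count()` only returns the number of columns, whatever they contain. The sketch below illustrates the difference on a tiny made-up DataFrame (the column names and values are hypothetical, not taken from the competition data).

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with a single missing pixel value
df = pd.DataFrame({"pixel0": [0, 255, 12],
                   "pixel1": [7, np.nan, 3],
                   "pixel2": [1, 2, 3]})

cols_with_nan = int(df.isnull().any().sum())  # columns containing at least one NaN
total_nan = int(df.isnull().sum().sum())      # total number of missing cells
n_columns = int(df.isnull().any().count())    # just the number of columns

print(cols_with_nan, total_nan, n_columns)
```

A result of 0 for the first two numbers is what "no missing data" looks like.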

## Data preprocessing ##

### Normalization ###

In [None]:
# Let's normalize the image pixel values from [0, 255] to [-0.5, 0.5] 
# to make our network easier to train (smaller, centered values usually lead to better results).
X_train = (X_train / 255) - 0.5
test = (test / 255) - 0.5

### Reshape ###

In [None]:
# Reshape each image from a flat vector of 784 pixels to (28, 28, 1), because Keras expects a channel dimension.
# MNIST images are grayscale, so there is only one channel. RGB images have 3 channels,
# so for them we would reshape to 28x28x3 3D arrays instead.
X_train = X_train.values.reshape(-1,28,28,1)
test = test.values.reshape(-1,28,28,1)

In [None]:
print(X_train.shape)

In [None]:
print(test.shape)

### One-hot encoding ###

Keras expects the training targets to be 10-dimensional vectors, since there are 10 nodes in our Softmax 
output layer.  
Our training labels, on the other hand, are single integers representing the class of each image.
Keras has a 'to_categorical' function, which turns our array of class integers into an array of one-hot 
vectors instead.  
For example, 2 would become [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] etc.
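
As an illustration of what one-hot encoding does, here is a minimal NumPy sketch. It is not how Keras implements 'to_categorical'; the helper `one_hot` and the sample labels are made up for this example.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn an array of integer class labels into one-hot rows."""
    return np.eye(num_classes, dtype=int)[labels]

labels = np.array([2, 0, 9])  # made-up labels, for illustration only
encoded = one_hot(labels)
print(encoded[0])             # the row for the digit 2

# argmax goes back from one-hot rows to integer labels
decoded = np.argmax(encoded, axis=1)
print(decoded)
```

The `argmax` step is the same trick used later to turn predicted probabilities back into digits.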

In [None]:
Y_train = to_categorical(Y_train, num_classes = 10)

In [None]:
# Split X_train into train and validation datasets

To validate our model's results we use the classic approach: split our training data into 
two parts, a training subset and a validation subset. A reasonable choice is 90% for training and 10% for
validation. This proportion leaves enough data to train the model, and for validation
purposes we usually don't need more than 10%.
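
To see what a 90/10 hold-out split means in numbers, here is a small NumPy sketch. It only illustrates the idea and is not what 'train_test_split' does internally; the helper `holdout_split` is a hypothetical name, and the 42000 rows are assumed from the 37800 + 4200 sample counts shown in the training logs of this notebook.

```python
import numpy as np

def holdout_split(n_samples, val_fraction=0.1, seed=2):
    """Return shuffled (train_idx, val_idx) index arrays for a hold-out split."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_val = int(round(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]

# 42000 rows assumed (37800 training + 4200 validation samples)
train_idx, val_idx = holdout_split(42000)
print(len(train_idx), len(val_idx))
```

Shuffling before splitting matters: it keeps the digit classes mixed in both subsets.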

In [None]:
# Set random seed
random_seed = 2

In [None]:
# Split data
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.1, random_state=random_seed)

## Machine Learning ##

The simplest way to build a Keras model is the Sequential class, which represents a linear stack of layers (there is also the more flexible functional Model API, but for now we will not dive into it).
We’ll be using the Sequential model, so our CNN will be a linear stack of layers.

The Sequential constructor takes an array of Keras Layers.  
We’ll use 3 types of layers for our CNN: Convolutional, Max Pooling (MaxPool2D), and Softmax.
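
Of these layers, max pooling is the easiest to show by hand. The following NumPy sketch, with a made-up 4x4 feature map and a hypothetical helper `max_pool_2x2`, keeps the largest value in every non-overlapping 2x2 block, which is the behaviour of MaxPool2D with a pool size of 2.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the maximum of every non-overlapping 2x2 block."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2  # drop an odd trailing row or column
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

# Made-up 4x4 feature map
fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]])

pooled = max_pool_2x2(fm)
print(pooled)  # 2x2 result, one value per block
```

Each spatial dimension is halved, so later layers have less data to process while the strongest responses are kept.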

We will first build a very simple network and make predictions with it, and then build a second, more complex model in
order to improve our score (if needed) and evaluate how much gain the added complexity gives.

In [None]:
# # Before building the model we need to define its hyperparameters
# num_filters = 8 # let's use 8 filters
# filter_size = 3 # each filter is a 3x3 matrix
# pool_size = 2 # traverse the input image in 2x2 blocks

We need to give our model the ability to make predictions. 
Let's do it by using the de facto standard final layer for a multiclass classification problem: the Softmax layer,
which is a fully-connected (dense) layer that uses the Softmax function as its activation.
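
For intuition, the Softmax function itself can be sketched in a few lines of NumPy. The scores below are hypothetical raw outputs of the 10 output nodes, not values produced by our model.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical raw scores, one per digit 0-9
scores = np.array([1.0, 2.0, 0.5, 0.1, 0.0, 0.3, 4.0, 0.2, 0.1, 0.9])
probs = softmax(scores)
predicted_digit = int(np.argmax(probs))

print(probs.round(3))
print(predicted_digit)  # index of the largest probability
```

The digit with the highest probability is taken as the prediction of the network.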

In [None]:
# # Build the model
# model = Sequential([
#     Conv2D(num_filters, filter_size, input_shape=(28, 28, 1)), # input layer
#     MaxPool2D(pool_size=pool_size), # MaxPool2D is the name imported above
#     Flatten(),
#     Dense(10, activation='softmax'), # output softmax layer has 10 nodes
# ])

In [None]:
# # Compile the model
# # We choose 3 things: the optimizer, the loss function and a list of metrics
# model.compile(
#     optimizer='adam',
#     loss='categorical_crossentropy',
#     metrics=['accuracy'],
# )

**NB**  
Choosing these 3 factors is an empirical process, just like tuning the model's hyperparameters. Both
have many options and variants. For now let's use well-known ones.
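
To give a feeling for one of these choices, the categorical cross-entropy loss, here is a minimal NumPy sketch. It is an illustration under simplifying assumptions (3 classes instead of 10, made-up predictions) and not the Keras implementation.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

# Two samples, 3 classes for brevity (made-up values)
y_true = np.array([[0, 0, 1],
                   [1, 0, 0]])
confident = np.array([[0.05, 0.05, 0.90],
                      [0.80, 0.10, 0.10]])
unsure = np.array([[0.30, 0.30, 0.40],
                   [0.40, 0.30, 0.30]])

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

Both sets of predictions pick the right class, yet the confident one gets the lower loss. This is why the loss can keep falling even when accuracy no longer changes.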

In [None]:
# # # Train the model
# # # We decide 3 parameters: training data, number of epochs, batch size
# # model.fit(
# #     X_train,
# #     Y_train,
# #     epochs=3,
# #     #batch_size=32,
# # )
# Epoch 1/3
# 37800/37800 [==============================] - 24s 630us/step - loss: 0.4036 - accuracy: 0.8837
# Epoch 2/3
# 37800/37800 [==============================] - 15s 394us/step - loss: 0.2115 - accuracy: 0.9386
# Epoch 3/3
# 37800/37800 [==============================] - 15s 404us/step - loss: 0.1537 - accuracy: 0.9562

In [None]:
# # Evaluate the model
# model.evaluate(
#     X_val,
#     Y_val,
# )
# 4200/4200 [==============================] - 1s 243us/step
# [0.1418064293833006, 0.9576190710067749]

In [None]:
# Predict
# predictions = model.predict(X_train)

In [None]:
# # print(np.argmax(predictions, axis=1))
# [8 7 9 ... 2 9 4]

OK. Over 95% accuracy on both the training and the validation data is a very good result for such a simple network. We can see how powerful
CNNs can be. This is especially noticeable in real-world computer vision tasks, in which everything 
is more complicated.

We comment out the first model to keep it for reference and to avoid confusion while implementing
the second one.

Let's see what happens if we build a more complicated network structure.  
- How will it affect the score?  
- And what price will we pay for this improvement?

### A second more complex model ###

A typical CNN workflow starts with feature extraction and finishes with classification. 
Feature extraction is performed by alternating convolution layers with subsampling layers. 
Classification is performed with dense layers followed by a final softmax layer. 
For image classification, this architecture performs better than an entirely fully connected feed-forward neural network (although for the MNIST dataset, truth be told, the latter would also work fine since the data is simple).

Researchers have created many network architectures covering many real-world problems. 
Any of them could be used to address our classic "hello world" problem, but let the heavy artillery 
be used for heavy tasks.

A detailed description of the CNN layers, as well as methods for choosing the architecture, is given in [6].

For example, let our CNN architecture be like this:
In -> [[Conv2D (relu)] x 2 -> MaxPool2D -> Dropout] x 2 -> Flatten -> Dense (relu) -> Dropout -> Dense (softmax) -> Out

It has a fairly classical form.

In [None]:
# Initialize model
model = Sequential()

model.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', activation ='relu', input_shape = (28,28,1)))
model.add(Conv2D(filters = 32, kernel_size = (5,5),padding = 'Same', activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'Same', activation ='relu'))
model.add(Conv2D(filters = 64, kernel_size = (3,3),padding = 'Same', activation ='relu'))
model.add(MaxPool2D(pool_size=(2,2), strides=(2,2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(256, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(10, activation = "softmax"))

A few words about our architecture.
- Two Conv layers with 32 filters each, because according to [6] "32 maps in the first convolutional layer and 64 maps in the second convolutional layer is the best. Architectures with more maps only perform slightly better and are not worth the additional computation cost".
- One Pooling layer, which downsamples the feature maps and keeps the strongest responses.
- One Dropout layer, which randomly switches neurons off during training to reduce overfitting.
- The same block again, this time with 64 filters and smaller 3x3 kernels.
- A Flatten layer, which converts the final feature maps into a 1D vector that the dense layers can consume.
- A Dense layer with relu activation, which combines the extracted features, followed by one more Dropout layer.
- Finally a Dense softmax layer, which turns the result into 10 output probabilities.
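
To check how the tensor shapes change through this network, and what the added complexity costs in trainable parameters, the layer arithmetic can be sketched in plain Python. The helpers below are hypothetical and only cover the cases used in our model ('same' padding with stride 1, and 2x2 pooling). Calling `model.summary()` on the real model is the authoritative way to get these numbers.

```python
def conv_same(shape, filters, kernel):
    """Output shape and parameter count of a 'same'-padded, stride-1 Conv2D."""
    h, w, c = shape
    params = kernel * kernel * c * filters + filters  # weights + biases
    return (h, w, filters), params

def pool_2x2(shape):
    """Output shape of a 2x2 max pooling layer (no parameters)."""
    h, w, c = shape
    return (h // 2, w // 2, c)

def dense(n_in, n_out):
    """Parameter count of a fully-connected layer."""
    return n_in * n_out + n_out  # weights + biases

shape, total = (28, 28, 1), 0

# First block: two 5x5 convolutions with 32 filters, then pooling
for filters, kernel in [(32, 5), (32, 5)]:
    shape, params = conv_same(shape, filters, kernel)
    total += params
shape = pool_2x2(shape)

# Second block: two 3x3 convolutions with 64 filters, then pooling
for filters, kernel in [(64, 3), (64, 3)]:
    shape, params = conv_same(shape, filters, kernel)
    total += params
shape = pool_2x2(shape)

# Flatten, Dense(256), Dense(10); Dropout adds no parameters
flat = shape[0] * shape[1] * shape[2]
total += dense(flat, 256)
total += dense(256, 10)

print(shape, flat, total)
```

Most of the parameters sit in the first Dense layer, which connects all flattened features to 256 neurons.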

In [None]:
# Define the optimizer
# In our previous model we used the Adam optimizer. Now let's try another one, RMSprop, which is
# also a powerful adaptive optimizer. We keep the learning rate and rho at their default values.
optimizer = RMSprop(learning_rate=0.001, rho=0.9, epsilon=1e-08)

In [None]:
# Compile the model
model.compile(
    optimizer = optimizer , 
    loss = "categorical_crossentropy", 
    metrics=["accuracy"]
)

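To make the optimizer converge faster and closer to the global minimum of the loss function, we anneal the learning rate (LR). 
The LR is the step by which the optimizer walks through the 'loss landscape'. The higher the LR, the bigger the steps and the quicker the convergence, but with steps that are too big the optimizer can overshoot a minimum. So we start with a relatively high LR and halve it whenever the validation accuracy has not improved for 3 epochs.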
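
The annealing rule can be sketched in plain Python as follows. This is a simplified illustration of the reduce-on-plateau idea, not the Keras callback itself: it ignores options such as a cooldown period or a minimum improvement threshold, and the validation accuracies are made up.

```python
def reduce_on_plateau(val_accuracies, lr=0.001, factor=0.5, patience=3, min_lr=0.00001):
    """Return the learning rate in effect after each epoch (simplified sketch)."""
    best, wait, schedule = float("-inf"), 0, []
    for acc in val_accuracies:
        if acc > best:
            best, wait = acc, 0            # improvement: reset the counter
        else:
            wait += 1                      # no improvement this epoch
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        schedule.append(lr)
    return schedule

# Hypothetical validation accuracies: improvement stalls after epoch 2
val_acc = [0.980, 0.985, 0.984, 0.985, 0.983, 0.986, 0.986]
schedule = reduce_on_plateau(val_acc)
print(schedule)
```

After three epochs without a new best validation accuracy the LR is halved, and it never drops below `min_lr`.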

In [None]:
# Define an annealing method of the learning rate (LR)
learning_rate_reduction = ReduceLROnPlateau(monitor='val_accuracy', 
                                            patience=3, 
                                            verbose=1, 
                                            factor=0.5, 
                                            min_lr=0.00001
                                           )

### Data augmentation ###

Our goal is to avoid overfitting. We can artificially enlarge the amount of training data in order to cover cases where
a digit is written small, off-center or even rotated.

Approaches that alter the training data in ways that change the array representation while keeping 
the label the same are known as data augmentation techniques. 
Some popular augmentations are: grayscaling, horizontal flips, vertical flips, random crops, color jitter, translations and rotations.

Data augmentation may improve accuracy by up to 1-1.5 percentage points, which is a huge gain at this level.
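
As a minimal illustration of one such augmentation, a translation, here is a NumPy sketch that shifts an image by a few pixels while the label stays the same. The helper `shift_image`, the synthetic "digit" and the zero background are assumptions made for this example only.

```python
import numpy as np

def shift_image(image, dx, dy):
    """Shift a 2D image by (dx, dy) pixels, filling the uncovered area with zeros."""
    h, w = image.shape
    shifted = np.zeros_like(image)
    # Part of the source image that stays inside the frame after the shift
    src = image[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    shifted[max(0, dy):max(0, dy) + src.shape[0],
            max(0, dx):max(0, dx) + src.shape[1]] = src
    return shifted

# Synthetic 28x28 "digit": a bright vertical bar, labelled as the digit 1
image = np.zeros((28, 28))
image[10:18, 12:16] = 1.0
label = 1

# Move the bar 2 pixels to the right and 3 pixels up; the label is unchanged
augmented = shift_image(image, dx=2, dy=-3)
print(augmented[7:15, 14:18].sum(), label)
```

The pixel array changes but the image still shows the same digit, so the network sees a new training example at no labelling cost.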

In [None]:
# A. Fit model without data augmentation
history = model.fit(X_train, Y_train, batch_size = 128, epochs = 10, 
validation_data = (X_val, Y_val), verbose = 2)

We obtain 99.3% accuracy. Let's try to improve it a little by using augmentation.

In [None]:
# Data augmentation. We follow the approach from [2], but the settings are intuitive and can easily be modified.
augment = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=10,  # randomly rotate images in the range (degrees, 0 to 180)
        zoom_range = 0.1, # Randomly zoom image 
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=False,  # randomly flip images
        vertical_flip=False
        )

In [None]:
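# Compute the statistics needed by the generator (only required for the featurewise_* and zca_whitening options, all disabled here)
augment.fit(X_train)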

### Fit the model ###

In [None]:
# B. Fit the model using our augmentation (training continues from the weights obtained in step A)
history = model.fit_generator(augment.flow(X_train,Y_train, batch_size=128),
                              epochs = 10, validation_data = (X_val,Y_val),
                              verbose = 2, steps_per_epoch = X_train.shape[0] // 128,
                              callbacks=[learning_rate_reduction]
                             )

We obtain 99.52% accuracy. Good.

### Predict ###

In [None]:
predictions_complex_model = model.predict(X_train)

In [None]:
print(np.argmax(predictions_complex_model, axis=1))

### Model evaluation ###

In [None]:
# Loss and accuracy curves for training and validation 
fig = plt.figure()
plt.subplot(2,1,1)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='lower right')
plt.subplot(2,1,2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.tight_layout()

We see a few things:
1. The validation accuracy is greater than the training accuracy. This suggests that our model does not overfit 
the training set, which is good. Part of the gap comes from dropout and augmentation, which are applied only during training.
2. Making the model more complex increased accuracy from about 95% to above 99%. This is significant, so making the model architecture
more complex is reasonable.
3. Our accuracy and loss curves are not smooth. This is not ideal and suggests changing some blocks of the
network or experimenting with the parameters.

In [None]:
# Predict the class probabilities for the validation dataset
Y_pred = model.predict(X_val)
# Convert predicted probabilities to class indices
Y_pred_classes = np.argmax(Y_pred,axis = 1) 
# Convert one-hot validation labels to class indices
Y_true = np.argmax(Y_val,axis = 1) 
# Calculate the confusion matrix
conf_mat = confusion_matrix(Y_true, Y_pred_classes)

In [None]:
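# Plot confusion matrix
plt.figure(figsize=(16,10)) # create the figure before drawing on it
sns.set(font_scale=1.2) # for label size
sns.heatmap(conf_mat, annot=True, fmt='d', annot_kws={"size": 10}) # integer counts, font size 10
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()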

### Predict on test dataset ###

In [None]:
# Make the final prediction, showing our model the real test data for the first time
results = model.predict(test)
# Select the index with the maximum probability
results = np.argmax(results,axis = 1)
# Save the result as a pandas Series
results = pd.Series(results,name="Label")

### Save results to csv ###

In [None]:
submission = pd.concat([pd.Series(range(1, len(results) + 1), name = "ImageId"), results], axis = 1)
submission.to_csv("cnn_mnist_result.csv",index=False)

In [None]:
# Literature
# [1] https://victorzhou.com/blog/keras-cnn-tutorial/#the-full-code
# [2] https://www.kaggle.com/yassineghouzam/introduction-to-cnn-keras-0-997-top-6/notebook
# [3] https://en.wikipedia.org/wiki/Convolutional_neural_network
# [4] https://keras.io/
# [5] https://www.tensorflow.org/
# [6] https://www.kaggle.com/cdeotte/how-to-choose-cnn-architecture-mnist/notebook