<h1>Digit recognizer - Learning the basics of image processing</h1>

Image processing libraries that need to be learnt -
    1. Tensorflow
    2. Keras
    3. ...
    
This problem may serve as a good training ground for image processing when the images are stored as flattened pixel datasets. Real-life examples may involve more complexity, such as manual labelling or actual image files instead of datasets (e.g. the [Plant seedling problem](https://www.kaggle.com/c/plant-seedlings-classification)).


<h3>Reference</h3>

* Questions
    * How can we decide the optimal value of parameters and hyperparameters for a NN?
    * Does the shape of input data vary for different image classifiers? - keras : (numpx, numpx, 3); traditional : (numpx x numpx x 3, 1)

<h2>Content</h2>

*[Source](https://www.kaggle.com/yassineghouzam/introduction-to-cnn-keras-0-997-top-6)*
* Introduction
* Data preparation
    * Load data
    * Check for null and missing values
    * Normalization
    * Reshape
    * Label encoding
    * Split training and validation set
* CNN
    * Define the model
    * Set the optimizer and annealer
    * Data augmentation
* Model evaluation
    * Train and validation curves
    * Confusion matrix
* Prediction and submission

<h3>Introduction</h3>
* 5-layer Sequential Convolutional Neural Network
* Built using the Keras API (TensorFlow backend)

In [None]:
#import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
%matplotlib inline

np.random.seed(2)

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import itertools

from keras.utils.np_utils import to_categorical # for one-hot encoding
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau

sns.set(style = 'white', context = 'notebook', palette = 'deep')

<h3>Data preparation</h3>

In [None]:
# load data
train = pd.read_csv('../input/train.csv')
test  = pd.read_csv('../input/test.csv')

In [None]:
Y_train = train['label']

# Drop 'label' column
X_train = train.drop(labels = ['label'], axis = 1)

free_space = 1
if free_space:
    del train
    
g = sns.countplot(x = Y_train) # pass the data by keyword; newer seaborn versions no longer accept it positionally

Y_train.value_counts()

The number of training examples is roughly consistent across all labels.

In [None]:
# check for missing values

X_train.isnull().any().describe()

No missing values are present in the data.

In [None]:
# normalization

# We scale the grayscale pixel values to reduce the effect of illumination differences
# Also, a CNN converges faster on [0, 1] data than on [0, 255] data

X_train = X_train / 255.0
test    = test / 255.0

In [None]:
# reshape
X_train = X_train.values.reshape(-1, 28, 28, 1) # don't understand this code fully - why -1?
test = test.values.reshape(-1, 28, 28, 1)
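
On the `-1` question in the reshape cell: `-1` asks NumPy to work out that axis from the total number of elements, so the same line works for any number of images. A minimal sketch, assuming a hypothetical stand-in array `flat` of 5 flattened images in place of the real pixel table:

```python
import numpy as np

# Hypothetical stand-in for the flattened pixel table: 5 images, 784 pixels each.
flat = np.arange(5 * 784, dtype=np.float32).reshape(5, 784)

# -1 is inferred from the total size: 5 * 784 / (28 * 28 * 1) = 5 images.
images = flat.reshape(-1, 28, 28, 1)
print(images.shape)  # (5, 28, 28, 1)

# Row-major order is kept: the first 28 values of a flat row
# become the top pixel row of the corresponding image.
print(np.array_equal(images[0, 0, :, 0], flat[0, :28]))  # True
```

The trailing `1` is the channel axis (grayscale); an RGB image would use 3 there.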

In [None]:
# label encoding - encode labels into one hot vectors, eg. 2 = [0 0 1 0 0 0 0 0 0 0]
Y_train = to_categorical(Y_train, num_classes = 10)
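
To see what one-hot encoding produces without needing Keras, here is a small NumPy sketch. The helper `one_hot` is a hypothetical illustration, not the `to_categorical` implementation, though for integer labels it should give the same kind of array:

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[labels]

encoded = one_hot([2, 0, 9], num_classes=10)
print(encoded[0])  # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]

# argmax undoes the encoding, which is how class indices are recovered later on.
decoded = encoded.argmax(axis=1)
print(decoded)  # [2 0 9]
```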

In [None]:
# split train and validation sets
random_seed = 2
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size = 0.1, random_state = random_seed)

In [None]:
# view an example
g = plt.imshow(X_train[1][:,:,0])

<h3>CNN</h3>

In [None]:
# Set the CNN model
# Architecture used - In -> [[Conv2D->relu]*2 -> MaxPool2D -> Dropout]*2 -> Flatten -> Dense -> Dropout -> Out

model = Sequential()
model.add(Conv2D(filters = 32,# add comments later - what does it mean by applying filters
                 kernel_size = (5,5), 
                 padding = 'Same',
                activation = 'relu',
                input_shape = (28, 28, 1)))
model.add(Conv2D(filters = 32,
                 kernel_size = (5,5), 
                 padding = 'Same',# what is padding
                activation = 'relu'))# why do I not need to provide the input shape?
model.add(MaxPool2D(pool_size = (2,2)))
model.add(Dropout(0.25))

model.add(Conv2D(filters = 64,
                 kernel_size = (3,3), 
                 padding = 'Same',
                activation = 'relu'))
model.add(Conv2D(filters = 64,
                 kernel_size = (3,3), 
                 padding = 'Same',
                activation = 'relu'))
model.add(MaxPool2D(pool_size = (2,2),# how does the selection happen?
                   strides = (2,2)))# what are strides?
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(256, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation = 'softmax'))
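
Several of the questions in the model cell (padding, strides, pooling, input shape) come down to how the spatial size changes from layer to layer. The sketch below traces the 28 x 28 input through the architecture in plain Python. The helper names are hypothetical, and the formulas assume 'same' padding for the convolutions and no padding for the pooling layers (which, as far as I know, is the Keras default). Only the first layer needs `input_shape`, because every later layer can infer its input from the previous layer's output.

```python
import math

def conv_same(size, stride=1):
    """Spatial size after a 'same'-padded convolution: ceil(size / stride)."""
    return math.ceil(size / stride)

def pool_valid(size, pool=2, stride=2):
    """Spatial size after pooling without padding."""
    return (size - pool) // stride + 1

def conv_params(kernel, in_channels, filters):
    """Weights of a conv layer: one kernel per input/output channel pair, plus biases."""
    return kernel * kernel * in_channels * filters + filters

size = 28
size = conv_same(size)    # Conv2D 32, 5x5 -> 28 (padding keeps the size)
size = conv_same(size)    # Conv2D 32, 5x5 -> 28
size = pool_valid(size)   # MaxPool2D 2x2  -> 14
size = conv_same(size)    # Conv2D 64, 3x3 -> 14
size = conv_same(size)    # Conv2D 64, 3x3 -> 14
size = pool_valid(size)   # MaxPool2D 2x2, stride 2 -> 7
flat_features = size * size * 64
print(size, flat_features)  # 7 3136

def max_pool_2x2(grid):
    """Keep the largest value of each non-overlapping 2x2 window (stride 2)."""
    return [[max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
             for c in range(0, len(grid[0]) - 1, 2)]
            for r in range(0, len(grid) - 1, 2)]

pooled = max_pool_2x2([[1, 3, 2, 0],
                       [4, 2, 1, 1],
                       [0, 0, 5, 6],
                       [1, 2, 7, 8]])
print(pooled)  # [[4, 2], [2, 8]]
print(conv_params(5, 1, 32))  # 832 weights in the first Conv2D layer
```

The stride is how far the window moves between two positions; with a 2 x 2 pool and stride 2 the windows do not overlap, so each pooling layer halves the width and height.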

In [None]:
# set the optimizer and annealer
    # read up on categorical cross entropy - used to measure loss in multiclass classification

# define the optimizer
optimizer = RMSprop(lr = 0.001,
                    rho = 0.9,# what is rho?
                   epsilon = 1e-08,# ?
                   decay = 0.0)# ?

# compile the model
model.compile(optimizer = optimizer,
             loss = 'categorical_crossentropy',
             metrics=["accuracy"])
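
For the open questions on the loss and the optimizer arguments, the following is a minimal scalar sketch of the textbook formulas, not the Keras source. The function names are hypothetical. `rho` is the decay rate of the running average of squared gradients, and `epsilon` is a small constant that avoids division by zero; its exact placement can differ between implementations. As far as I understand the older Keras optimizers, `decay = 0.0` simply means the learning rate is not shrunk over iterations.

```python
import math

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Loss for one sample: -sum(t * log(p)) over classes; eps guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))

def rmsprop_step(weight, grad, avg_sq, lr=0.001, rho=0.9, epsilon=1e-8):
    """One RMSprop update for a single scalar weight."""
    # Running average of squared gradients; rho controls how quickly it forgets.
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    # Divide the step by the root of that average so large gradients are damped.
    weight = weight - lr * grad / (math.sqrt(avg_sq) + epsilon)
    return weight, avg_sq

# True class is index 2 and the model gives it probability 0.7.
loss = categorical_crossentropy([0, 0, 1], [0.1, 0.2, 0.7])
print(round(loss, 4))  # 0.3567, i.e. -log(0.7)

w, s = rmsprop_step(weight=1.0, grad=2.0, avg_sq=0.0)
print(round(s, 4), round(w, 6))  # 0.4 0.996838
```

Only the probability assigned to the true class contributes to the loss, which is why one-hot labels pair naturally with a softmax output.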

In [None]:
# set a learning rate annealer
# start with a high learning rate for faster convergence and reduce it by a predefined factor if the validation accuracy does not improve

learning_rate_reduction = ReduceLROnPlateau(monitor='val_acc',
                                           patience=3,
                                           verbose=1,
                                           factor=0.5,
                                           min_lr=0.00001)
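
The annealing rule can be simulated in plain Python to see when the learning rate actually drops. This is a simplified sketch of the idea only (it ignores options such as a cooldown period or a minimum improvement threshold), and the function name and the validation accuracies are made up for illustration:

```python
def simulate_plateau_schedule(val_acc_per_epoch, lr=0.001, patience=3,
                              factor=0.5, min_lr=0.00001):
    """Return the learning rate in effect after each epoch."""
    best = float('-inf')
    wait = 0
    schedule = []
    for acc in val_acc_per_epoch:
        if acc > best:
            # Improvement: remember it and reset the patience counter.
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink lr, but not below min_lr.
                lr = max(lr * factor, min_lr)
                wait = 0
        schedule.append(lr)
    return schedule

# Hypothetical validation accuracies: a plateau during epochs 3 to 5.
lrs = simulate_plateau_schedule([0.97, 0.98, 0.98, 0.979, 0.98, 0.985, 0.985])
print(lrs)  # [0.001, 0.001, 0.001, 0.001, 0.0005, 0.0005, 0.0005]
```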

In [None]:
epochs = 5 # turn to 30 for 0.9967 accuracy
batch_size = 86 # how to decide optimal batch size?

In [None]:
# data augmentation - read up

# Approaches that alter the training data without changing the labels, thus creating more data artificially, are called data augmentation techniques.
# Common methods include grayscaling, horizontal/vertical flips, random crops, color jitter, translations, rotations, etc.

datagen = ImageDataGenerator(featurewise_center = False,# set input mean to 0 over the dataset
                            samplewise_center = False,# set each sample mean to 0
                            featurewise_std_normalization = False, # divide inputs by std of the dataset
                            samplewise_std_normalization = False, # divide each input by its std
                            zca_whitening = False, # apply ZCA whitening
                            rotation_range = 10, # randomly rotate images in the range (degrees, 0 to 180)
                            zoom_range = 0.01,# randomly zoom image
                            width_shift_range = 0.01,# randomly shift images horizontally (fraction of total width)
                            height_shift_range = 0.01,# randomly shift images vertically (fraction of total height)
                            horizontal_flip = False,# randomly flip images horizontally
                            vertical_flip = False)# randomly flip images vertically
datagen.fit(X_train)# how much data is generated, 2 time, 3 times the original train data?
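
On the question of how much data is generated: an augmenting generator does not create a fixed 2x or 3x copy of the training set. It yields freshly transformed batches for as long as it is asked, so the amount of data seen is set by the number of steps and epochs. A minimal NumPy sketch of that idea, using a hypothetical `shifted_batches` generator with random pixel shifts only (this is not how `ImageDataGenerator` is implemented):

```python
import numpy as np

def shifted_batches(images, labels, batch_size, max_shift=1, seed=0):
    """Yield endless (images, labels) batches, each image shifted by a few random pixels."""
    rng = np.random.default_rng(seed)
    n = len(images)
    while True:
        order = rng.permutation(n)  # reshuffle on every pass over the data
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            batch = images[idx]  # fancy indexing already returns a copy
            for k in range(len(batch)):
                dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
                # np.roll wraps pixels around the border; real augmenters fill instead.
                batch[k] = np.roll(batch[k], shift=(dy, dx), axis=(0, 1))
            yield batch, labels[idx]

# Hypothetical data: 6 blank images with one bright pixel in the centre.
images = np.zeros((6, 28, 28, 1), dtype=np.float32)
images[:, 14, 14, 0] = 1.0
labels = np.arange(6)

gen = shifted_batches(images, labels, batch_size=4)
first_x, first_y = next(gen)
second_x, second_y = next(gen)
third_x, third_y = next(gen)  # the generator has started a new pass
print(first_x.shape, second_x.shape, third_x.shape)
```

The labels travel with the shuffled indices unchanged, which is the defining property of augmentation: the input changes, the label does not.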

In [None]:
# fit the model
history = model.fit_generator(datagen.flow(X_train, Y_train, batch_size=batch_size),
                             epochs = epochs,
                             validation_data = (X_val, Y_val),
                             verbose = 2,
                             steps_per_epoch = X_train.shape[0]  // batch_size,# am I passing the complete training data per epoch?
                             callbacks = [learning_rate_reduction])# how does callbacks work?
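
On whether the complete training data is passed per epoch: with integer division the answer is "almost". The arithmetic below assumes 37,800 training images, i.e. the 42,000 labelled rows of the Kaggle file minus the 10% validation split; substitute `X_train.shape[0]` for the real value.

```python
n_train = 37800   # assumed: 42,000 labelled rows minus the 10% validation split
batch_size = 86

steps_per_epoch = n_train // batch_size       # batches drawn per epoch
seen_per_epoch = steps_per_epoch * batch_size # images drawn if every batch is full
left_over = n_train - seen_per_epoch          # images not reached in that epoch

print(steps_per_epoch, seen_per_epoch, left_over)  # 439 37754 46
```

Since the generator reshuffles and re-augments the data, the few images missed in one epoch are not permanently excluded; they should show up in later batches.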

<h3>Evaluate the model</h3>

In [None]:
# training and validation curves - read up
# can I retrain a saved keras model? - will help training on higher epochs without runtime error
fig, ax = plt.subplots(2, 1)
ax[0].plot(history.history['loss'], color = 'b', label = 'Training loss')
ax[0].plot(history.history['val_loss'], color = 'r', label = 'Validation loss')
legend = ax[0].legend(loc = 'best', shadow = True)

ax[1].plot(history.history['acc'], color = 'b', label = 'Training accuracy')
ax[1].plot(history.history['val_acc'], color = 'r', label = 'Validation accuracy')
legend = ax[1].legend(loc = 'best', shadow = True)

In [None]:
# confusion matrix

def plot_confusion_matrix(cm, classes, normalize = False, title = 'Confusion matrix', cmap = plt.cm.Blues):
    '''
    This function plots the confusion matrix.
    Normalization can be applied by setting normalize = True
    '''
    # normalize first so that the image and the cell labels show the same values
    if normalize:
        cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
    
    plt.imshow(cm, interpolation='nearest', cmap = cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation = 45)
    plt.yticks(tick_marks, classes)
    
    thresh = cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                horizontalalignment='center',
                color='white' if cm[i, j] > thresh else 'black')
        
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    
# predict the class probabilities for the validation set
Y_pred = model.predict(X_val)
# convert predicted probabilities to class indices
Y_pred_classes = np.argmax(Y_pred, axis = 1)
# convert one hot encoded validation labels back to class indices
Y_true = np.argmax(Y_val, axis = 1)
# compute the confusion matrix
confusion_mtx = confusion_matrix(Y_true, Y_pred_classes)
# plot the confusion matrix
plot_confusion_matrix(confusion_mtx, classes = range(10))
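
Beyond the plot, the matrix can be read numerically, for example to get per-class recall and the most frequent mix-up. A small NumPy sketch on made-up labels for 3 classes; the helper `confusion_counts` is hypothetical and mirrors the usual convention (rows are true labels, columns are predictions):

```python
import numpy as np

def confusion_counts(y_true, y_pred, num_classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # handles repeated index pairs correctly
    return cm

# Made-up labels for illustration only.
y_true = np.array([0, 1, 2, 2, 1, 0, 2])
y_pred = np.array([0, 2, 2, 2, 1, 0, 1])

cm = confusion_counts(y_true, y_pred, num_classes=3)
print(cm.tolist())  # [[2, 0, 0], [0, 1, 1], [0, 1, 2]]

# Recall per class: correct predictions divided by the number of true examples.
per_class_recall = cm.diagonal() / cm.sum(axis=1)

# Most frequent error: the largest off-diagonal cell.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
worst_true, worst_pred = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(int(worst_true), int(worst_pred))  # 1 2
```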

In [None]:
# predict results
results = model.predict(test)

# select the index with maximum probability
results = np.argmax(results, axis = 1)
results = pd.Series(results, name = 'Label')

In [None]:
submission = pd.concat([pd.Series(range(1, len(results) + 1), name = 'ImageId'), results], axis = 1) # ImageId runs from 1 to the number of test rows (28000 here)
submission.to_csv('180304_digit_recognizer_mnist_v1.csv', index = False)

In [None]:
import os
os.listdir('../input/')