# About this kernel

Hello everyone. In this kernel I will share what I discovered while training on the Kannada-MNIST dataset, along with some tips for getting stable results with 0.99+ accuracy. <br />

<br />
This kernel covers: <br /> <br />
**1. The training results from before I reached 0.99 accuracy, and some thoughts on them** <br />
**2. Some useful techniques that I use heavily in other competitions and projects** <br />
**3. How I improved the kernel to get better results** <br />
**4. What else might help to improve on my current result** <br />

To begin, I will show the results of my previous training runs and what I learned from them.<br />
I had already trained 3 models on different splits of the data in Colab, for 30 epochs each, and saved their validation accuracy. <br />
The validation accuracy was saved every 6 epochs. <br />
I found some interesting patterns in those results and changed parts of the kernel based on them to get my current result. <br />
So after showing the results of the previous training, I will show the new results from the adjusted kernel. <br />
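
As a rough illustration of what "validation accuracy per class" means in the plots that follow, here is a minimal NumPy sketch. It assumes one-hot labels and rounded softmax outputs; the function name `per_class_accuracy` and the toy arrays are made up for this example and are not the saved results.

```python
import numpy as np

def per_class_accuracy(y_true, y_prob):
    """Fraction of the samples of each class that the model predicts correctly.

    y_true : one-hot labels, shape (n_samples, n_classes)
    y_prob : predicted probabilities, same shape
    """
    y_pred = np.round(y_prob)                 # 1 where the probability is above 0.5
    hits = np.sum(y_pred * y_true, axis=0)    # correct predictions per class
    totals = np.sum(y_true, axis=0)           # number of samples per class
    return hits / totals

# Toy data: 4 samples, 2 classes (hypothetical values)
y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y_prob = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.1, 0.9]])
acc = per_class_accuracy(y_true, y_prob)
print(acc)
```

With this definition, a class scores below 1.0 as soon as one of its own samples is missed, which is why a single hard class can sit under the baseline while the others look perfect.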

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# This CSV file stores all the validation results of my previous training.

old_result_df = pd.read_csv('../input/kmnist-data/Old_training_result.csv')
old_result_df.head(5)

Here are two helper functions to show the validation results.

In [None]:
def display_single_fold_val_accuracy(final_df, fold):
    epochs = [6,18,30]
    color = ['y','g','b']
    plt.figure(figsize=(12,6))
    for c, e in zip(color,epochs):
        # select the rows of this fold at epoch e
        df = final_df[(final_df['fold'] == fold) & (final_df['epoch'] == e)]
        label = 'fold:'+str(fold)+', epoch:' +str(e) 
        plt.plot(df['class'].values, df['valid_accuracy'].values, label = label, c =c)
    plt.plot(np.arange(0,10,1), np.ones(10), '^', label = 'Top accuracy', color='c')
    plt.plot(np.arange(0,10,1), np.ones(10)*0.995, '--', label = 'baseline', color='r')
    plt.xticks(np.arange(0,10,1))
    plt.title('Valid accuracy on fold'+str(fold)+' model')
    plt.legend()
    plt.show()

def display_top_accuracy_on_final_epoch(final_df):
    color = ['y','g','b']
    plt.figure(figsize=(12,6))
    for c,f in zip(color, [1,2,3]):
        df = final_df[(final_df['fold'] == f) & (final_df['epoch'] == 30)]
        label = 'fold:'+str(f)+', epoch:' +str(30) 
        plt.plot(df['class'].values, df['valid_accuracy'].values, label = label, c =c)
    plt.plot(np.arange(0,10,1), np.ones(10), '^', label = 'Top accuracy', color='c')
    plt.plot(np.arange(0,10,1), np.ones(10)*0.995, '--', label = 'baseline', color='r')
    plt.xticks(np.arange(0,10,1))
    plt.title('Valid accuracy on all models at epoch 30')
    plt.legend()
    plt.show()

## Results of my first model
The following plot shows the validation accuracy of my first model at epochs 6, 18 and 30. <br />
A validation accuracy of 0.995 is drawn as the baseline. <br />
From the plot, we can see that: <br /> <br />
**1. Model 1 performs worse on class 0 and class 6; both are below the baseline at epoch 30.** <br />
**2. Model 1 reaches almost 1.0 validation accuracy on class 5.** <br />
**3. Some classes had better validation accuracy at earlier epochs: class 1, class 6 and class 9.** <br />

In [None]:
display_single_fold_val_accuracy(old_result_df, 1)

## Results of my second model
The following plot shows the validation accuracy of my second model at epochs 6, 18 and 30. <br />
A validation accuracy of 0.995 is drawn as the baseline. <br />
From the plot, we can see that: <br /> <br />
**1. Model 2 also performs worse on class 0 and class 6; both are below the baseline at epoch 30.** <br />
**2. Model 2 reaches almost 1.0 validation accuracy on class 5 at epoch 6.** <br />
**3. Model 2 reaches almost 1.0 validation accuracy on class 9 at epoch 18.** <br />
**4. Again, some classes had better validation accuracy at earlier epochs: class 4, class 5, class 6 and class 9.** <br />

In [None]:
display_single_fold_val_accuracy(old_result_df, 2)

## Results of my third model
The following plot shows the validation accuracy of my third model at epochs 6, 18 and 30. <br />
A validation accuracy of 0.995 is drawn as the baseline. <br />
From the plot, we can see that: <br /> <br />
**1. Model 3 performs worse on class 0, class 6, class 7 and class 9; all of them are below the baseline at epoch 30.** <br />
**2. Model 3 reaches 1.0 validation accuracy on class 5 at epoch 18.** <br />
**3. Model 3 has 2 other classes that are very close to 1.0 validation accuracy: class 1 and class 8.** <br />
**4. Again, some classes had better validation accuracy at earlier epochs: class 0 and class 5.** <br />

In [None]:
display_single_fold_val_accuracy(old_result_df, 3)

## Results of all my models at the last epoch

The following plot shows the validation accuracy at epoch 30 for all of my trained models. <br /> 
A few things stand out: <br />

**1. Certain classes have worse validation accuracy in every model.** <br />
**2. Some classes get quite different results from different models.**<br />
**3. It is hard to tell which model performs best on average validation accuracy.**<br />

In [None]:
display_top_accuracy_on_final_epoch(old_result_df)

## Conclusion and what might be happening

From the results above, we can draw some simple conclusions:

**1. Some classes are harder to train than others.** <br />
**2. Models can perform differently because their training data is slightly different (different splits).** <br />
**3. Some classes got better validation accuracy with more epochs, and some did not.** <br />


Let's talk about the first and third conclusions above. Why does validation accuracy decrease as the number of epochs increases? Does this sound familiar to you? <br />
This is overfitting, right? From the training results, we can see that some classes are much harder to train than others. <br />
Easy classes quickly reach high validation accuracy at early epochs, while hard classes are still underfitting at that point. <br />
So when the model becomes too confident on the easy classes, it can make errors when classifying the hard ones. <br />
Since some of the Kannada digits look similar, the model might assign hard examples to easy classes. <br />
Thus some of the easy classes start to overfit, and their accuracy starts to decrease. <br />

How can we solve this problem? <br />

**1. Use a special loss function that keeps the model from becoming too confident on easy classes and reduces underfitting on hard classes.** <br />
**2. Use more data. This is the most obvious one.** <br />
**3. Ensembling more models might help.** <br />

## How I modified the kernel to get better results


Since I can't get more data, I tried options 1 and 3 to improve the kernel. <br />
And I did get much more stable results after applying options 1 and 3 to my kernel. <br />


For option 1, I tried the Symmetric Cross Entropy loss function, which is designed to be robust to noisy labels and to keep the model from becoming too confident on easy classes. Implementations of Symmetric Cross Entropy can easily be found on GitHub, or you can refer to the loss function section of this kernel. <br />
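
To make the idea concrete, here is a small NumPy sketch of the same formula as the TensorFlow loss defined in the loss function section (alpha * CE + beta * reverse CE), evaluated on a single sample. The function name `symmetric_cross_entropy_np` and the example probabilities are made up for illustration.

```python
import numpy as np

def symmetric_cross_entropy_np(y_true, y_pred, alpha=1.0, beta=1.0, eps=1e-7):
    """Return (total, ce, rce) for one sample given as 1-D arrays."""
    y_pred_c = np.clip(y_pred, eps, 1.0)   # avoid log(0) in the usual cross entropy
    y_true_c = np.clip(y_true, eps, 1.0)   # avoid log(0) in the reverse cross entropy
    ce = -np.sum(y_true * np.log(y_pred_c))
    rce = -np.sum(y_pred * np.log(y_true_c))
    return alpha * ce + beta * rce, ce, rce

y_true = np.array([0.0, 1.0, 0.0])
confident = np.array([0.01, 0.98, 0.01])   # hypothetical confident prediction
unsure = np.array([0.25, 0.50, 0.25])      # hypothetical uncertain prediction

total_c, ce_c, rce_c = symmetric_cross_entropy_np(y_true, confident)
total_u, ce_u, rce_u = symmetric_cross_entropy_np(y_true, unsure)
print('confident: ce=%.4f rce=%.4f' % (ce_c, rce_c))
print('unsure   : ce=%.4f rce=%.4f' % (ce_u, rce_u))
```

For a one-hot label, the reverse term works out to `-log(eps) * (1 - p_true)`, so it grows linearly with the probability mass placed on wrong classes instead of logarithmically like the usual cross entropy.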


For option 3, I increased the number of folds in KFold to train more models on different data splits. I will explain multi-fold training in the training section of this kernel.<br />


**The following part shows the validation accuracy after I adjusted the kernel**<br />

In [None]:
# This CSV file stores all the validation results of my current training.
new_result_df = pd.read_csv('../input/kmnist-data/New_training_result.csv')
new_result_df.head(5)

## Results of my new models

Many validation accuracies still remain at 1.0 at epoch 30. <br />
This suggests that the new loss function did keep the model from becoming too confident on easy classes. <br />
Some models still perform badly on certain classes, but we have now trained more models on more folds, <br />
so the models can balance each other out when their predictions are combined, which might help the score. <br />
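
A tiny sketch of how models can balance each other: averaging the softmax outputs of several fold models and taking the arg-max of the mean. The probabilities below are hypothetical and only show the mechanics, not the real predictions of this kernel.

```python
import numpy as np

# Hypothetical softmax outputs of 3 fold models for 2 test images and 3 classes
fold_preds = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],   # model 1
    [[0.4, 0.5, 0.1], [0.1, 0.7, 0.2]],   # model 2 (wrong on image 1 on its own)
    [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]],   # model 3 (wrong on image 2 on its own)
])

mean_probs = fold_preds.mean(axis=0)          # average over the model axis
ensemble_labels = mean_probs.argmax(axis=1)   # final class per image
print(mean_probs)
print(ensemble_labels)
```

Here model 2 alone would pick class 1 for the first image and model 3 alone would pick class 2 for the second, but the averaged probabilities follow the majority.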

In [None]:
plt.figure(figsize=(12,8))
for fold in range(1,9,1):
    df = new_result_df[new_result_df['fold'] == fold]
    plt.plot(df.Class.values-1, df.Valid_accuracy.values, label = 'fold_'+str(fold)+'_model_epoch_'+str(30))
plt.plot(np.arange(0,10,1), np.ones(10)*0.995, '--', label = 'baseline', c='r')
plt.plot(np.arange(0,10,1), np.ones(10), '^', label = 'Top accuracy', color='c')
plt.legend()
plt.xticks(np.arange(0,10,1))
plt.show()

# Let's begin!

This kernel uses a small convolutional neural network built with the tf.keras framework. <br />
Since this is a starter kernel, I will explain the details as much as possible. <br />
Although this is a starter kernel, I will use a slightly advanced architecture and some practical tips to improve performance. <br />

Here is a brief overview of the kernel first.<br />

## This kernel includes
**1. A very small portion of EDA** <br />
**2. The Albumentations augmentation library** <br />
**3. A Squeeze-and-Excitation neural network** <br />
**4. The Swish activation function** <br />
**5. The Symmetric Cross Entropy loss function** <br />
**6. Multi-fold cross-validation and model ensembles** <br />
**7. Test time augmentation** <br />

## Import the modules

In [None]:
import tensorflow as tf
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
import seaborn as sn
import albumentations as albu
from sklearn.model_selection import train_test_split, KFold
from tqdm import tqdm_notebook
import gc
import os
import warnings 
warnings.filterwarnings('ignore')
main_dir = '../input/Kannada-MNIST/'
tf.keras.__version__

# Data preparation/EDA/Visualization

**Note**

Here I read the CSV files with the pandas API. <br />
There are 3 datasets: <br />
1. train.csv -> training dataset
2. Dig-MNIST.csv -> validation dataset
3. test.csv -> testing dataset

I will merge the training dataset with the validation dataset first, then use the train_test_split or KFold function to produce the training and validation datasets. <br />
I'm not sure whether the original training dataset and the validation dataset differ, <br />
so I merge them together to make the data more random. <br />

In [None]:
pretrain_weights_path = [
    '../input/kmnist-data/0_model.hdf5',
    '../input/kmnist-data/1_model.hdf5',
    '../input/kmnist-data/2_model.hdf5',
    '../input/kmnist-data/3_model.hdf5',
    '../input/kmnist-data/4_model.hdf5',
    '../input/kmnist-data/5_model.hdf5',
    '../input/kmnist-data/6_model.hdf5',
    '../input/kmnist-data/7_model.hdf5',
]


uptrain = False
submit = True
num_classes = 10
num_features = (28,28,1)
batch_size = 1024
lr = 3e-4
epochs = 30

k_fold_split = 8

In [None]:
train_df = pd.read_csv(main_dir + 'train.csv')
valid_df = pd.read_csv(main_dir + 'Dig-MNIST.csv')
test_df = pd.read_csv(main_dir + 'test.csv')

#Check out some training data
train_df.head(5)

**Note**

From the data printed above, we can see that:
1. The first column is the label.
2. The other columns are the pixel values of a Kannada-MNIST image. Each row has 784 pixel columns (28 x 28).
<br />

The dataset has the same format as the regular MNIST dataset.

In [None]:
#Extract the label from training dataframe and discard the label column
train_label = train_df['label']
test_indices = test_df['id']

train_df = train_df.drop(['label'], axis = 1)
test_df = test_df.drop(['id'], axis = 1)

#Convert dataframe into numpy array 
train_x = train_df.values
train_y = train_label.values

test_x = test_df.values

print("shape of train_x :", train_x.shape)
print("shape of train_y :", train_y.shape)
print("shape of test_x :", test_x.shape)

**Note**

First we reshape each row of pixel columns into image dimensions, 28x28x1 (height x width x channel). <br />
We need this because a convolutional neural network expects image-shaped input. <br />
I also one-hot encode the original labels. <br />
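
The two steps can be sketched with plain NumPy on fake data. Note that `np.eye(num_classes)[labels]` is used here as a stand-in for `tf.keras.utils.to_categorical`, and the arrays `flat` and `labels` are invented for the example.

```python
import numpy as np

num_classes = 10
# Two fake flattened images (784 pixel values each) and their labels
flat = np.arange(2 * 784).reshape(2, 784) % 256
labels = np.array([3, 7])

images = flat.reshape(-1, 28, 28, 1)     # (n_samples, height, width, channel)
one_hot = np.eye(num_classes)[labels]    # row i of the identity matrix is the one-hot vector of i

print(images.shape)
print(one_hot.shape)
print(one_hot[0])
```

The reshape is row-major, so pixel column 28 of a flattened row becomes the first pixel of the second image row.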

In [None]:
train_x = train_x.reshape(-1,28,28,1)
# One-hot encode the original label
train_y = tf.keras.utils.to_categorical(train_y, num_classes)
test_x = test_x.reshape(-1,28,28,1)

In [None]:
#check some image data
temp_imgs = train_x[8:12]
temp_labels = train_y[8:12]

nrows = 2
ncols = 2
plt.figure(figsize=(6,6))

for idx, (img, label) in enumerate(zip(temp_imgs, temp_labels)):
    plt.subplot(nrows, ncols, idx+1)
    plt.imshow(np.squeeze(img,axis=2), cmap = 'gray')
    plt.title("label : " + str(np.argmax(label)))
    plt.axis('off')
plt.show()

In [None]:
#check the quantity of each labels
label_counts = train_label.value_counts().reset_index()
#rename the columns of label_counts
label_counts.columns = ['label', 'quantity']
#sort the value in label_counts on label column
label_counts = label_counts.sort_values('label')

plt.figure(figsize = (8,4))
plt.bar(label_counts['label'], label_counts['quantity'])
plt.xlabel('label')
plt.ylabel('quantity')
plt.title('label quantity in training dataset')
plt.show()


**Note**

Split the training and validation datasets with the train_test_split function. <br />
It is very important to keep your training and validation datasets separate. <br />
Your model might learn the patterns/features of the training data yet still perform badly on data it has not seen before.<br />
So we split the original training dataset into two parts for training and validation, which tells us how well the model handles unseen data. <br />
This is just one way to check whether your model is overfitting or underfitting. K-Fold cross-validation is another. <br />
<br />
<br />
The random state is also quite important here: fixing the random_state parameter makes the train/valid split the same every time, <br />
which makes it easier to tune your model. 

But I didn't use train_test_split in this kernel. Instead I use K-Fold to do the cross-validation (in the training section).
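
A quick way to see why a fixed seed matters: two permutations created with the same seed are identical, so a split derived from them is reproducible. This is only a sketch with NumPy's `RandomState`; the 80/20 cut and the sample count are arbitrary choices for the example.

```python
import numpy as np

n_samples = 10
perm_a = np.random.RandomState(2019).permutation(n_samples)
perm_b = np.random.RandomState(2019).permutation(n_samples)

# Same seed, same order: the split below is identical on every run
valid_idx = perm_a[:2]    # 20% for validation
train_idx = perm_a[2:]    # 80% for training

print(np.array_equal(perm_a, perm_b))
print(train_idx, valid_idx)
```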

In [None]:
#x_train, x_test, y_train, y_test = train_test_split(train_x, train_y, test_size = 0.2, random_state = 2019)

# Data generator and Data augmentation

**Note**

Here I use the Albumentations library for data augmentation. <br />
Albumentations is an excellent augmentation library, and I prefer it to ImageDataGenerator in Keras <br />
because it offers more augmentation options, like GridDistortion, RandomGamma, RandomContrast and so on. <br />
This lets us make the data more diverse, which can improve the generalization ability of our model. <br />
Generalization ability is how well the model handles data it has not seen before. This matters a lot for a machine learning model, since we want it not only to predict the training data correctly but also to predict well on real-world data. <br />


**Note**

I created two data generators, one for training and one for validation. No augmentation is applied to the validation dataset, since we want to observe performance on data that is closer to the original. There is, however, a technique called TTA (Test Time Augmentation), in which we apply augmentation to the test data. For instance, given a test image we would like to predict, we can augment it to make several versions of the original. Then we let the model predict all of these versions and average the results. This can make the prediction more stable, since the errors are averaged out. I will perform a simple version of TTA at the end when predicting the submission CSV file.

## Check some augmentation effect

In [None]:
def display_aug_effect(img, aug, repeat=3, aug_item = 'rotate'):
    '''
    img : input image for display
    aug : augmentation object to apply to the input image
    repeat : how many augmented versions to display
    aug_item : name of the augmentation, used in the subplot titles
    '''
    plt.figure(figsize=(int(4*(repeat+1)),4))
    plt.subplot(1,repeat+1, 1)
    plt.imshow(img, cmap='gray')
    plt.title('original image')
    
    for i in range(repeat):
        plt.subplot(1, repeat+1, i+2)
        temp_aug_img = aug(image = img.astype('uint8'))['image']
        plt.imshow(temp_aug_img, cmap='gray')
        plt.title(aug_item + ' : ' + str(i+1))
    
    plt.axis('off')
    plt.show()

**Shift Scale Rotate**

In [None]:
temp_aug = albu.ShiftScaleRotate(scale_limit=0.2, rotate_limit=20, shift_limit=0.15, p=1, border_mode=0)
display_aug_effect(np.squeeze(temp_imgs[0], axis=2), temp_aug, aug_item = 'ShiftScaleRotate')

**Grid Distortion**

In [None]:
temp_aug = albu.GridDistortion(p=1)
display_aug_effect(np.squeeze(temp_imgs[1], axis=2), temp_aug, aug_item = 'GridDistortion')

**Random Brightness/Random Gamma/Random Contrast**

In [None]:
temp_aug = albu.OneOf([ albu.RandomBrightness(limit=10), albu.RandomGamma(gamma_limit=(80, 120)), albu.RandomContrast(limit=1.5) ], p=1)
display_aug_effect(np.squeeze(temp_imgs[2], axis=2), temp_aug, aug_item = 'Gamma/Brightness/Contrast')

**Random Crop**

In [None]:
temp_aug = albu.RandomCrop(height=24,width=24,p=1)
display_aug_effect(np.squeeze(temp_imgs[3], axis=2), temp_aug, aug_item = 'RandomCrop')

In [None]:
class InputGenerator(tf.keras.utils.Sequence):
    
    def __init__(self,
                 x,
                 y=None,
                 aug=None,
                 batch_size=128,
                 training=True):
        
        self.x = x
        self.y = y
        self.aug = aug
        self.batch_size = batch_size
        self.training = training
        # use a NumPy array so the indices can be shuffled in place
        self.indices = np.arange(len(x))
    
    def __len__(self):
        return len(self.x) // self.batch_size
    
    def __getitem__(self,index):
        
        batch_indices = self.indices[index * self.batch_size : (index+1)*self.batch_size]
        batch_data = self.__get_batch_x(batch_indices)
        
        if self.training:
            batch_label = self.__get_batch_y(batch_indices)
            return batch_data, batch_label
        else:
            return batch_data
    
    # Keras calls on_epoch_end after every epoch (Sequence has no on_epoch_start hook)
    def on_epoch_end(self):
        
        if self.training:
            np.random.shuffle(self.indices)
            
    def __get_batch_x(self, batch_indices):
        
        batch_data = []
        for idx in batch_indices:
            cur_data = self.x[idx].astype('uint8')
            cur_data = self.aug(image = cur_data)['image']
            batch_data.append(cur_data)
            
        return np.stack(batch_data)/255.0
    
    def __get_batch_y(self, batch_indices):
        
        batch_label = []
        for idx in batch_indices:
            batch_label.append(self.y[idx])
            
        return np.stack(batch_label)


        
train_aug = albu.Compose([
                    albu.ShiftScaleRotate(scale_limit=0.2, rotate_limit=15.0, shift_limit=0.15, p=0.5, border_mode=0, value = 0)]
                    )

valid_aug = albu.Compose([])

# Build the model

**Brief summary of the model** <br />
<br />
**Model architecture** : 8 convolution layers with 1 squeeze-and-excitation block and 3 fully-connected layers. <br />
**Activation function** : The activation function in the hidden layers of the neural network is Swish, and the activation function in the output layer is softmax. <br />
**Optimizer** : RMSProp. <br />
**Loss function** : Symmetric categorical cross entropy <br />
<br />
<br />
**Note**

Here I would like to introduce two parts of the code block below. <br />
1. Swish activation
2. Squeeze-and-Excitation neural network

<br />
**Swish activation** <br />
<br />
Swish activation was proposed by Google Brain. It simply multiplies the input tensor by its sigmoid ( f(x) = x * sigmoid(x) ). The shape of Swish is quite similar to ReLU, but unlike ReLU it is not monotonic. ReLU sets the input to zero whenever its value is below zero, so the gradient there is zero and the parameters feeding that neuron are not updated; they just keep their current values. Swish does not simply zero out negative values, so those neurons can still receive gradient during training. Swish might not help much on a small architecture (like the one in this kernel), but it might be helpful on deeper neural network architectures.
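
The difference is easy to see numerically. The following NumPy sketch (the helper names `swish` and `relu` are local to this example) evaluates both functions on a few points:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))   # same as x * sigmoid(x)

def relu(x):
    return np.maximum(x, 0.0)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
r = relu(x)
s = swish(x)
print(r)
print(s)
```

ReLU maps both negative inputs to exactly zero, while Swish keeps small negative values, and swish(-1) is lower than swish(-5), which is the non-monotonic dip.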

<br />
**Squeeze and Excitation neural network** <br />
<br />
SE-Net is an architecture proposed by Jie Hu et al. It can be divided into two parts, squeeze and excitation. 
Put very simply, the squeeze step aggregates each feature map across its spatial dimensions into a channel descriptor, and the excitation step learns the importance of each channel, taking the dependencies between channels into account, during training. 
The learned weights are then multiplied with the corresponding channels of the input tensor to strengthen or weaken them. <br />
<br />
Original article : https://arxiv.org/abs/1709.01507
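
A forward pass of one SE block can be sketched in NumPy as follows. This is an illustration only: the weights are random stand-ins for learned parameters, biases are left out, and the hidden activation is Swish to match this kernel rather than the ReLU of the original paper.

```python
import numpy as np

def se_block(x, w1, w2):
    """x : (H, W, C) feature maps, w1 : (C, C // ratio), w2 : (C // ratio, C)."""
    squeezed = x.mean(axis=(0, 1))                    # squeeze: one value per channel
    hidden = squeezed @ w1                            # reduce the channel dimension
    hidden = hidden / (1.0 + np.exp(-hidden))         # Swish activation
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2)))    # excitation: one weight in (0, 1) per channel
    return x * weights, weights                       # broadcast the weights over H and W

rng = np.random.RandomState(0)
x = rng.rand(7, 7, 8)     # fake feature maps with 8 channels
w1 = rng.randn(8, 4)      # reduction ratio of 2
w2 = rng.randn(4, 8)
out, weights = se_block(x, w1, w2)
print(out.shape)
print(weights)
```

The output has the same shape as the input; each channel is simply rescaled by its own weight.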

In [None]:
def _swish(x):
    '''
    x : input tensor
    
    return : swish activated tensor
    '''
    return tf.keras.backend.sigmoid(x) * x

#helper function of Squeeze and Excitation block.
def _seblock(input_channels=32, se_ratio=2):
    '''
    input_channels : number of channels of the input tensor
    se_ratio : reduction ratio for the first fully-connected layer
    
    return : function that applies the whole SE block to an input tensor
    '''
    def f(input_x):
        
        reduced_channels = input_channels // se_ratio
        
        x = tf.keras.layers.GlobalAveragePooling2D()(input_x)
        x = tf.keras.layers.Dense(units=reduced_channels, kernel_initializer='he_normal')(x)
        x = tf.keras.layers.Activation(_swish)(x)
        x = tf.keras.layers.Dense(units=input_channels, kernel_initializer='he_normal', activation='sigmoid')(x)
        
        x = tf.keras.layers.multiply([x,input_x])
        
        return x
    return f

def _cn_bn_act(filters=64, kernel_size=(3,3), strides=(1,1)):
    '''
    filters : filter number of convolution layer
    kernel_size : filter/kernel size of convolution layer
    strides : stride size of convolution layer
    
    return : helper function for convolution -> batch normalization -> activation
    '''
    def f(input_x):
        
        x = tf.keras.layers.Conv2D(filters=filters, kernel_size=kernel_size, strides=strides, padding='same', kernel_initializer='he_normal')(input_x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation(_swish)(x)
        
        return x
    
    return f


def _dn_bn_act(units=128):
    '''
    units : units for fully-connected layer
    
    return : helper function for fully-connected -> batch normalization -> activation
    '''
    def f(input_x):
        
        x = tf.keras.layers.Dense(units=units, kernel_initializer='he_normal')(input_x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation(_swish)(x)
        return x
    
    return f

def build_model(input_shape = (28,28,1), classes = 10):
    '''
    input_shape : input dimension of single data
    classes : class number of label
    
    return : cnn model
    '''
    input_layer = tf.keras.layers.Input(shape = input_shape)
    
    x = _cn_bn_act(filters=64)(input_layer)
    #x = _seblock(input_channels=64)(x)
    x = _cn_bn_act(filters=64)(x)
    x = _cn_bn_act(filters=64)(x)
        
    x = tf.keras.layers.MaxPooling2D(pool_size=(2,2))(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    
    x = _cn_bn_act(filters=128)(x)
    #x = _seblock(input_channels=128)(x)
    x = _cn_bn_act(filters=128)(x)
    x = _cn_bn_act(filters=128)(x)
    
    x = tf.keras.layers.MaxPooling2D(pool_size=(2,2))(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    
    x = _cn_bn_act(filters=256)(x)    
    x = _seblock(input_channels=256)(x)
    x = _cn_bn_act(filters=256)(x)    
    
    x = tf.keras.layers.Dropout(0.2)(x)
    x = tf.keras.layers.Flatten()(x)
    
    x = _dn_bn_act(units=256)(x)
    x = _dn_bn_act(units=128)(x)
    output_layer = tf.keras.layers.Dense(units=classes, kernel_initializer='he_normal', activation = 'softmax')(x)
    
    model = tf.keras.models.Model(inputs=[input_layer], outputs=[output_layer])
    return model

**Note**

I like to add Precision and Recall to the metrics, since I want to monitor the trend of false positives and false negatives during training. <br />
You can always write your own metric/loss functions for model training, <br />
as long as the function takes y_true and y_pred parameters (remember that the y_true and y_pred passed to the function are tensors).
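
The arithmetic behind the two metrics can be checked on a toy batch with NumPy. This sketch follows the same recipe as the metrics in this kernel (flatten, round the probabilities, count); the function name `precision_recall` and the numbers are invented for the example.

```python
import numpy as np

def precision_recall(y_true, y_prob, epsilon=1e-7):
    y_true_f = y_true.flatten()
    y_pred_f = np.round(y_prob.flatten())      # probabilities above 0.5 count as positive
    tp = np.sum(y_true_f * y_pred_f)           # true positives
    fp = np.sum((1 - y_true_f) * y_pred_f)     # false positives
    fn = np.sum(y_true_f * (1 - y_pred_f))     # false negatives
    return tp / (tp + fp + epsilon), tp / (tp + fn + epsilon)

y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
# The second sample has no probability above 0.5, so it produces no positive at all
y_prob = np.array([[0.8, 0.1, 0.1], [0.4, 0.4, 0.2], [0.1, 0.2, 0.7]])

precision, recall = precision_recall(y_true, y_prob)
print(precision, recall)
```

Because the predictions are rounded, an uncertain sample lowers recall without touching precision, so the two curves can move apart during training.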

In [None]:
def Precision(y_true, y_pred, epsilon=1e-7):
    
    y_true_f = tf.keras.backend.flatten(y_true)
    y_pred_f = tf.keras.backend.flatten(y_pred)
    
    y_pred_f = tf.keras.backend.round(y_pred_f)
    
    TP = tf.keras.backend.sum(y_true_f * y_pred_f)
    FP = tf.keras.backend.sum((1-y_true_f) * y_pred_f)
    
    return TP/(TP+FP+epsilon)

def Recall(y_true, y_pred, epsilon=1e-7):
    
    y_true_f = tf.keras.backend.flatten(y_true)
    y_pred_f = tf.keras.backend.flatten(y_pred)
    
    y_pred_f = tf.keras.backend.round(y_pred_f)
    
    TP = tf.keras.backend.sum(y_true_f * y_pred_f)
    # ground-truth positives that were predicted as negative are false negatives
    FN = tf.keras.backend.sum(y_true_f * (1-y_pred_f))
    
    return TP/(TP+FN+epsilon)

In [None]:
def symmetric_cross_entropy(alpha=1.0, beta=1.0, epsilon=1e-7):
    def loss(y_true, y_pred):
        
        y_pred_ce = tf.clip_by_value(y_pred, epsilon, 1.0)
        y_true_rce = tf.clip_by_value(y_true, epsilon, 1.0)

        ce = alpha*tf.reduce_mean(-tf.reduce_sum(y_true * tf.math.log(y_pred_ce), axis = -1))
        rce = beta*tf.reduce_mean(-tf.reduce_sum(y_pred * tf.math.log(y_true_rce), axis = -1))
        
        return  ce + rce
    return loss

# Train the model

## Callbacks

**Note**

Callback functions are very important during training. Here I use the callbacks that Keras already provides, but you can write your own callback if you want. <br />
Custom callbacks in Keras : https://keras.io/callbacks/ <br />

<br />
**ReduceLROnPlateau** : This callback monitors a value you choose and reduces the learning rate when that value has not improved for a certain number of epochs. <br />
**ModelCheckpoint** : This callback also monitors a value you choose and saves the weights of your model whenever that value is better than ever before. If you don't have enough time to finish training the current model, you can save the weights and continue training later. <br />
**EarlyStopping** : This callback stops the training if the monitored value has not improved for a certain number of epochs. <br />
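
To show the plateau logic, here is a simplified pure-Python simulation. It is a sketch of the idea and not the Keras implementation: it ignores options such as cooldown and min_delta, and the validation losses are made up.

```python
def simulate_reduce_lr(val_losses, lr=3e-4, patience=3, factor=0.5, min_lr=1e-8):
    """Return the learning rate after each epoch's check."""
    best = float('inf')
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0           # improvement: reset the counter
        else:
            wait += 1                      # no improvement this epoch
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

losses = [1.0, 0.9, 0.95, 0.96, 0.97, 0.5]   # hypothetical val_loss per epoch
lrs = simulate_reduce_lr(losses)
print(lrs)
```

With patience=3, the learning rate is halved only after the third epoch in a row without a new best loss.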

# KFold Cross-Validation(CV)/Ensemble models

**Note**

KFold is a very common cross-validation method. It splits your original training data into several pieces (depending on the n_splits parameter), <br />
uses 1/n_splits of the data as validation data and the rest as training data. <br />
The validation part keeps changing until it has covered all of your original training data, for instance: <br />
<br />
original data = [1,2,3] <br />
n_splits = 3 <br />
<br />
**1/3 training:**<br />
validating data : [1] <br />
training data : [2,3] <br />
<br />
**2/3 training:** <br />
validating data : [2] <br />
training data : [1,3]<br />
<br />
**3/3 training:** <br />
validating data : [3] <br />
training data : [1,2] <br />
<br />
But we also know that it is not a good idea to mix training data with validation data. <br />
So we train a separate model on each split. <br />
After training, we have n_splits models, each trained on (1-1/n_splits) of the original training data and validated on the remaining 1/n_splits. <br />
So all of these models were trained on (partially) different datasets. <br /> 
Then we can use these models to predict the data we want to predict, which gives us n_splits different prediction results. <br />
In the end, we can blend these results by voting or averaging and get a much more stable result than with only one model. <br />
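
The [1,2,3] example above can be reproduced with a small hand-rolled splitter. This is a sketch of contiguous, unshuffled folds written in plain Python for illustration; the kernel itself uses `KFold` from scikit-learn, and `k_fold_indices` is a made-up helper name.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) lists for contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):    # spread the remainder over the first folds
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        stop = start + size
        valid = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, valid
        start = stop

data = [1, 2, 3]
folds = list(k_fold_indices(len(data), 3))
for train_idx, valid_idx in folds:
    print('validating data :', [data[i] for i in valid_idx],
          '| training data :', [data[i] for i in train_idx])
```

Every sample lands in exactly one validation fold, so each of the n_splits models is validated on data it never trained on.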

In [None]:
# Before training, shuffle the training dataset to make the order of the data random.
# The labels must be shuffled the same way so that each label still matches its image,
# so I use a NumPy permutation to create one set of new indices for both data and labels.
permutation = np.random.RandomState(2019).permutation(len(train_x))
train_x = train_x[permutation]
train_y = train_y[permutation]

In [None]:
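# The data was already shuffled with a fixed seed above, so KFold itself does not shuffle.
# random_state only has an effect when shuffle=True; newer scikit-learn versions raise an error otherwise.
kf = KFold(n_splits=k_fold_split)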

#list to save all the models we are going to train
model_members = []
check_points_path = []

for idx in range(k_fold_split):
    check_points_path.append(str(idx) + '_model.hdf5')
    
for model_index, (train_indices, valid_indices) in enumerate(kf.split(train_x)):
    
    #data generator for training
    train_aug_gen = InputGenerator(train_x[train_indices], train_y[train_indices], train_aug, batch_size)
    #data generator for validating
    valid_aug_gen = InputGenerator(train_x[valid_indices], train_y[valid_indices], valid_aug, batch_size)
    
    model = build_model()
    
    optimizer = tf.keras.optimizers.RMSprop(lr=lr)
    model.compile(optimizer=optimizer, metrics=['accuracy',Precision,Recall], loss=symmetric_cross_entropy())

    steps_per_epoch = len(train_x[train_indices]) // batch_size
    validation_steps = len(train_x[valid_indices]) // batch_size
    
    reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', patience=3, factor=0.5, min_lr=1e-8, mode='min', verbose=2)
    check_point = tf.keras.callbacks.ModelCheckpoint(monitor='val_loss', filepath = check_points_path[model_index], mode='min', save_best_only=True, verbose=2)
    early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)
    
    training_callbacks = [ check_point, reduce_lr ]
    
    if pretrain_weights_path is None or uptrain:
        
        if pretrain_weights_path is not None:
            print('*'*10,'Load ',model_index,'-fold pretrain weights','*'*10)
            model.load_weights(pretrain_weights_path[model_index])
            
        
        
        print('*'*30,'Training model ', model_index+1,'*'*30)
        print('Train on ', len(train_indices),' data')
        print('Valid on ', len(valid_indices),' data')

        history = model.fit_generator(generator=train_aug_gen,
                                      steps_per_epoch=steps_per_epoch,
                                      epochs=epochs,
                                      validation_data=valid_aug_gen,
                                      validation_steps=validation_steps,
                                      workers=-1,
                                      verbose=2,
                                      callbacks=training_callbacks)
        
        print('*'*30,'Validating model ', model_index+1,'*'*30)
        val_gen = InputGenerator(train_x[valid_indices], None, albu.Compose([]), 1, training=False)
        
        print('\n\n')
        preds = model.predict_generator(val_gen, workers=-1, verbose=1)
        truths = train_y[valid_indices]
        preds = np.round( np.array(preds) )
        truths = np.array(truths)
        valid_results = []
        for c in range(num_classes):
            valid_results.append((c+1, np.sum(preds[:,c] * truths[:,c])/ np.sum(truths[:,c])))
        
        valid_results_df = pd.DataFrame(valid_results, columns=['Class', 'Valid_accuracy'])
        valid_results_df.to_csv('new_ValidationResults'+str(model_index+1)+'.csv', index=False)
        print(valid_results_df)

        del history, val_gen, train_aug_gen, valid_aug_gen, preds, truths, valid_results_df
        gc.collect()
        
        
    else:
        print('Load the pretrain weights '+str(model_index+1))
        model.load_weights(pretrain_weights_path[model_index])
        
    model_members.append(model)
    del model
    print('\n')
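
The per-class validation score saved in the loop above is the share of each class's samples that the model labels correctly, i.e. the recall of that class. Here is a minimal NumPy sketch of that calculation on made-up toy arrays, not the kernel's data. `per_class_recall` is a hypothetical helper name. Note one assumption that differs from the loop above: the sketch takes the argmax instead of rounding at 0.5, so a sample whose highest probability is below 0.5 still gets a predicted class.

```python
import numpy as np

def per_class_recall(probs, truths_onehot):
    """Hypothetical helper: recall of every class from softmax outputs and one-hot labels."""
    num_classes = truths_onehot.shape[1]
    # One-hot encode the most probable class of each sample
    preds_onehot = np.eye(num_classes)[np.argmax(probs, axis=1)]
    # Correctly predicted samples of each class / all samples of that class
    hits = np.sum(preds_onehot * truths_onehot, axis=0)
    support = np.sum(truths_onehot, axis=0)
    return hits / support

# Toy example with 3 classes and 4 samples (values invented for illustration)
toy_probs = np.array([[0.8, 0.1, 0.1],
                      [0.2, 0.7, 0.1],
                      [0.4, 0.35, 0.25],
                      [0.1, 0.2, 0.7]])
toy_truths = np.eye(3)[[0, 1, 1, 2]]
toy_recall = per_class_recall(toy_probs, toy_truths)
print(toy_recall)
```

In the toy data the third sample belongs to class 1 but is predicted as class 0, so class 1 ends up with a recall of 0.5 while the other two classes reach 1.0.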

# Test Time Augmentation

**Note**

Here I again use the albumentations library for TTA, through a simple `tta_wrapper` class that wraps the model you just trained and uses it to perform TTA. <br />
The concept is simple: the model predicts on the original test images and on augmented copies of them. <br />
The model outputs one probability per class, already normalized by the softmax activation, <br />
so I simply add up the probabilities and average them. The final result is the average of the predictions over several different views of each test image. <br />
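
The averaging described above can be sketched in plain NumPy. This is only an illustration under stated assumptions: `fake_model_predict` is a hypothetical stand-in for the trained model, the weights and images are random, and a one-pixel shift stands in for the albumentations transform.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def fake_model_predict(images, weights):
    # Hypothetical stand-in for model.predict: a linear layer followed by softmax
    flat = images.reshape(len(images), -1)
    return softmax(flat @ weights)

def tta_average(images, weights, augment, repeats=1):
    # One prediction on the original images plus `repeats` predictions on augmented copies
    probs = fake_model_predict(images, weights)
    for _ in range(repeats):
        probs = probs + fake_model_predict(augment(images), weights)
    return probs / (repeats + 1)

rng = np.random.default_rng(0)
images = rng.random((5, 28, 28, 1))            # 5 made-up 28x28 grayscale images
weights = rng.normal(size=(28 * 28, 10))       # made-up weights for 10 classes
shift_right = lambda x: np.roll(x, 1, axis=2)  # toy augmentation: shift by one pixel

tta_probs = tta_average(images, weights, shift_right, repeats=1)
tta_labels = tta_probs.argmax(axis=1)
```

Because every individual prediction already sums to 1 per image, the average does too, so no extra normalization is needed before taking the `argmax`.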

In [None]:
tta_aug = albu.Compose([
                    albu.ShiftScaleRotate(scale_limit=0.1, rotate_limit=10, shift_limit=0.10, p=1.0, border_mode=0, value=0)
                    ])

In [None]:
class tta_wrapper():

    def __init__(self,
                 model,
                 normal_generator,
                 aug_generator,
                 repeats = 1
                 ):
        '''
        model : the model trained on the original data
        normal_generator : generator that yields the data without augmentation
        aug_generator : generator that yields the data with augmentation
        repeats : how many times the model predicts on the augmented data
        '''
        self.model = model
        self.normal_generator = normal_generator
        self.aug_generator = aug_generator
        self.repeats = repeats
    
        
    def predict(self):
        
        '''
        return : average of the predictions on the original test images and on their augmented versions
        '''

        # Use the generators stored on the instance, not same-named global variables
        batch_label = self.model.predict_generator( self.normal_generator,
                                                    workers=-1,
                                                    verbose=1)
        
        for idx in range(self.repeats):

            batch_label += self.model.predict_generator( self.aug_generator,
                                                         workers=-1,
                                                         verbose=1)
        batch_label /= (self.repeats+1)
        return batch_label

# Submission

In [None]:
if submit:
    predictions = np.zeros((len(test_x), num_classes))

    for model_index, model in enumerate(model_members):
        print(str(model_index+1) + '_model predicting')
    
        normal_generator = InputGenerator(test_x,
                                          aug = albu.Compose([]),
                                          training=False,
                                          batch_size=250)
        
        aug_generator = InputGenerator(x=test_x,
                                       aug = tta_aug,
                                       training=False,
                                       batch_size=250)
    
        tta_model = tta_wrapper(model=model,
                                normal_generator=normal_generator,
                                aug_generator = aug_generator,
                                repeats=1)
    
        predictions += tta_model.predict()

    predictions = predictions / k_fold_split
    predictions = np.argmax(predictions, axis=1)
    submission = pd.DataFrame({'id' : test_indices, 'label':predictions})
    submission.to_csv('submission.csv', index = False)
    # A bare expression inside an if block is not displayed by the notebook, so print it
    print(submission.head(5))

# Validation on real world data

In [None]:
valid_labels = valid_df['label']
valid_datas = valid_df.drop(['label'],axis=1).values
valid_datas = valid_datas.reshape(-1,28,28,1)
valid_predictions = np.zeros((len(valid_datas), num_classes))

for model_index, model in enumerate(model_members):
    print(str(model_index+1) + '_model predicting')
    
    normal_generator = InputGenerator(x=valid_datas,
                                      aug = albu.Compose([]),
                                      training=False,
                                      batch_size=160)
        
    aug_generator = InputGenerator(x=valid_datas,
                                   aug = tta_aug,
                                   training=False,
                                   batch_size=160)
    
    tta_model = tta_wrapper(model=model,
                            normal_generator=normal_generator,
                            aug_generator = aug_generator,
                            repeats=1)
    
    valid_predictions += tta_model.predict()
    
valid_predictions = valid_predictions/k_fold_split
valid_predictions = np.argmax(valid_predictions,axis=1)

In [None]:

print('Real world validation accuracy : {}'.format(accuracy_score(valid_labels, valid_predictions)))

In [None]:
print(classification_report(valid_labels, valid_predictions))

In [None]:
plt.figure(figsize=(12,12))
confusion_mat = confusion_matrix(valid_labels, valid_predictions)
# fmt='d' shows the counts as plain integers instead of scientific notation
sn.heatmap(confusion_mat, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Confusion matrix of Real World validation result')

The validation results above make it obvious that my models do not perform well on real-world data. <br />
We might need stronger data augmentation during training to improve the models' generalization. <br />
The results also tell us that my models are still overfitting to the ideal data. <br />
This suggests that the public test set contains only a small portion of real-world data, or even none. <br />
But there may be some real-world data in the private test set.<br />
So I think we might need to train the model on the public validation dataset as well,<br />
which may help you stay in a good position after the private score is released.
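
Training on the public validation dataset as well could be sketched as follows. The sketch assumes the training labels are one-hot encoded (as the validation code earlier suggests) while the labels of the public validation set are integers. The arrays are random placeholders with matching shapes, and all `toy_` and `merged_` names are hypothetical.

```python
import numpy as np

num_classes = 10
rng = np.random.default_rng(0)

# Placeholders standing in for the real arrays (shapes only, content is random)
toy_train_x = rng.random((8, 28, 28, 1))
toy_train_y = np.eye(num_classes)[rng.integers(0, num_classes, size=8)]
toy_valid_x = rng.random((4, 28, 28, 1))
toy_valid_labels = rng.integers(0, num_classes, size=4)

# One-hot encode the integer labels so both label arrays share one format
toy_valid_y = np.eye(num_classes)[toy_valid_labels]

# Stack the two sources into one training set
merged_x = np.concatenate([toy_train_x, toy_valid_x], axis=0)
merged_y = np.concatenate([toy_train_y, toy_valid_y], axis=0)

# Shuffle once so the two sources are not left as separate blocks
order = rng.permutation(len(merged_x))
merged_x, merged_y = merged_x[order], merged_y[order]
print(merged_x.shape, merged_y.shape)
```

Once the public validation set is used for training, it can no longer serve as an unbiased real-world check, so it may be worth holding part of it out.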

# What might help the kernel to further improve?

1. Transfer learning: train the model on another dataset (such as the original MNIST), then use the pretrained model to train on the Kannada MNIST dataset.
2. More folds
3. Try different loss functions
4. Dig into the dataset

**Thanks for reading!**