<h1><center><font size="6">Predict Chinese MNIST using Transfer Learning</font></center></h1>


# <a id='0'>Content</a>

- <a href='#1'>Introduction</a>  
 - Objectives
 - Data
- <a href='#2'>Prepare the data analysis</a>   
 - Load packages
 - Load the data
 - Image suites
- <a href='#3'>Data exploration</a>   
 - Check for missing data 
 - Explore image data
- <a href='#4'>Characters classification</a>       
 - Split the data
 - Build the model
 - Model evaluation
 - Prediction of test set
      * Prediction using last epoch model
      * Prediction using best model

# <a id='1'>Introduction</a>  


## Objectives

This Kernel has two objectives.

The first objective is to walk through the steps of a machine learning analysis.   

The second objective is to demonstrate how transfer learning can be used to train a model for image classification.

## Data

We will use a dataset of annotated images of handwritten Chinese numbers. The images were written by 100 volunteers; each volunteer provided 10 samples, and each sample contains a complete set of 15 Chinese characters for numbers (100 x 10 x 15 = 15,000 images in total).

The Chinese characters are the following:
* 零 - for 0  
* 一 - for 1
* 二 - for 2  
* 三 - for 3  
* 四 - for 4  
* 五 - for 5  
* 六 - for 6  
* 七 - for 7  
* 八 - for 8  
* 九 - for 9  
* 十 - for 10
* 百 - for 100
* 千 - for 1000
* 万 - for 10 thousand
* 亿 - for 100 million
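
As a quick reference, the list above can be written as a lookup table. This is only an illustrative sketch: the names `CHAR_TO_VALUE` and `EXPECTED_IMAGES` are hypothetical and are not used elsewhere in this Kernel.

```python
# Hypothetical lookup table built from the list above (not part of the dataset files)
CHAR_TO_VALUE = {
    "零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
    "五": 5, "六": 6, "七": 7, "八": 8, "九": 9,
    "十": 10, "百": 100, "千": 1_000, "万": 10_000, "亿": 100_000_000,
}

# 100 volunteers x 10 samples x 15 characters
EXPECTED_IMAGES = 100 * 10 * len(CHAR_TO_VALUE)
print(len(CHAR_TO_VALUE), EXPECTED_IMAGES)
```

The expected image count gives a simple sanity check against the number of rows in the dataset file.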



We start by preparing the analysis (loading the libraries and the data) and continue with an Exploratory Data Analysis (EDA).

We then prepare the image data for modeling. The dataset is split into training, validation and test sets. 

We train a model using TensorFlow through the Keras interface, with GPU acceleration, a learning rate that decays with the epoch number, early stopping, and checkpointing of the best model based on validation accuracy.

At the end, we use the best model to predict for the test set.

<a href="#0"><font size="1">Go to top</font></a>  

# <a id='2'>Prepare the data analysis</a>   


Before starting the analysis, we need to make a few preparations: load the packages, then load and inspect the data.



## Load packages

We load the packages used for the analysis.


In [None]:
import pandas as pd
import numpy as np
import sys
import os
import random
from pathlib import Path
import imageio
import skimage
import skimage.io
import skimage.transform
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import scipy
from sklearn.model_selection import train_test_split
from sklearn import metrics
# Use the public tensorflow.keras API consistently (the model below is built with tf.keras)
from tensorflow.keras import optimizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import model_to_dot, plot_model
from tensorflow.keras.layers import Dense, Flatten, GlobalAveragePooling2D
%matplotlib inline 
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
import tensorflow as tf

We also set a number of parameters for the data and model.

In [None]:
NO_EPOCHS = 10
NUM_CLASSES = 15
IMAGE_WIDTH = 64
IMAGE_HEIGHT = 64
IMAGE_CHANNELS = 1
IMAGE_PATH = '..//input//chinese-mnist//data//data//'
RESNET_WEIGHTS_PATH = '/kaggle/input/resnet50/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5'

In [None]:
RANDOM_STATE = 42
TEST_SIZE = 0.2
VAL_SIZE = 0.2

<a href="#0"><font size="1">Go to top</font></a>  


## Load the data  

Let's first see which data files we have in the root directory.

In [None]:
os.listdir("..//input//chinese-mnist")

There is a dataset file and a folder with images.  

Let's load the dataset file first.

In [None]:
data_df=pd.read_csv('..//input//chinese-mnist//chinese_mnist.csv')

Let's glimpse the data. First, let's check the number of columns and rows.

In [None]:
data_df.shape

There are 15000 rows and 5 columns. Let's look at the data.

In [None]:
data_df.sample(100).head()

The data contains the following values:  

* suite_id - each suite corresponds to a set of handwritten samples by one volunteer;  
* sample_id - each sample contains a complete set of 15 characters for Chinese numbers;
* code - a code assigned to each Chinese character, with values from 1 to 15;
* value - the actual numerical value associated with the Chinese character;  
* character - the Chinese character.  

We index the files in the dataset by forming a file name from suite_id, sample_id and code. The file name pattern is as follows:

> "input_{suite_id}_{sample_id}_{code}.jpg"

<a href="#0"><font size="1">Go to top</font></a>  

# <a id='3'>Data exploration</a>  



Let's start by checking whether there is missing data, unlabeled data or inconsistently labeled data. 


## Check for missing data 

Let's create a function that checks for missing data in the dataset.

In [None]:
def missing_data(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data(data_df)

There is no missing (null) data in the dataset. Some of the data labels might still be misspelled; we will check this when we analyze each data feature.
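
One way to look for inconsistent labels is to verify that every `code` is always paired with the same `value` and `character`. The sketch below does this on a few hand-written rows; the function name `inconsistent_codes` and the sample rows are assumptions made for illustration, not part of the dataset.

```python
from collections import defaultdict

def inconsistent_codes(rows):
    """Return the codes that appear with more than one (value, character) pair."""
    seen = defaultdict(set)
    for code, value, character in rows:
        seen[code].add((value, character))
    return sorted(code for code, pairs in seen.items() if len(pairs) > 1)

# Hypothetical rows as (code, value, character); the last row is deliberately mislabeled
rows = [(1, 0, "零"), (2, 1, "一"), (1, 0, "零"), (2, 2, "一")]
print(inconsistent_codes(rows))
```

With the real data, the rows could be taken from `data_df[['code', 'value', 'character']].itertuples(index=False)`; an empty result would mean the labels are consistent.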

<a href="#0"><font size="1">Go to top</font></a>  

## Explore image data  

Let's also check the image data. First, we check how many images are stored in the image folder.

In [None]:
image_files = list(os.listdir(IMAGE_PATH))
print("Number of image files: {}".format(len(image_files)))

Let's also check that each line in the dataset has a corresponding image in the image list.  
First, we will have to compose the name of the file from the indexes.

In [None]:
def create_file_name(x):
    # Access the row fields by column label rather than by position
    file_name = f"input_{x['suite_id']}_{x['sample_id']}_{x['code']}.jpg"
    return file_name

In [None]:
data_df["file"] = data_df.apply(create_file_name, axis=1)

In [None]:
file_names = list(data_df['file'])
print("Matching image names: {}".format(len(set(file_names).intersection(image_files))))
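
If the number of matching names were lower than the number of rows, a set difference would show which rows have no image. A minimal sketch on made-up file names (the helper `unmatched_files` is hypothetical):

```python
def unmatched_files(expected_names, available_names):
    """Return the expected file names that are not present on disk, sorted."""
    return sorted(set(expected_names) - set(available_names))

# Made-up names following the pattern input_{suite_id}_{sample_id}_{code}.jpg
expected = ["input_1_1_1.jpg", "input_1_1_2.jpg", "input_1_1_3.jpg"]
available = ["input_1_1_1.jpg", "input_1_1_3.jpg", "notes.txt"]
print(unmatched_files(expected, available))
```

In this Kernel the same call would take `file_names` and `image_files` as arguments.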

## Image suites 

Let's check the image suites. For this, we count the unique values of `suite_id` and list the unique values of `sample_id`.

In [None]:
print(f"Number of suites: {data_df.suite_id.nunique()}")
print(f"Samples: {data_df.sample_id.unique()}")

We have 100 suites, each with 10 samples.
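
These counts do not yet confirm that every sample holds all 15 characters. A sketch of such a check follows, using a small toy index instead of the real data; the function `incomplete_samples` and the toy rows are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

def incomplete_samples(rows, n_codes=15):
    """Return the (suite_id, sample_id) pairs that do not contain exactly n_codes distinct codes."""
    codes = defaultdict(set)
    for suite_id, sample_id, code in rows:
        codes[(suite_id, sample_id)].add(code)
    return sorted(key for key, found in codes.items() if len(found) != n_codes)

# Toy index: 2 suites x 3 samples x 15 codes, with one row removed on purpose
rows = list(product(range(1, 3), range(1, 4), range(1, 16)))
rows.remove((2, 3, 15))
print(incomplete_samples(rows))
```

On the real data, the rows could come from `data_df[['suite_id', 'sample_id', 'code']].itertuples(index=False)`.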

# <a id='4'>Characters classification</a>

Our objective is to use the images that we investigated until now to correctly identify the Chinese numbers (characters).   

We have a single dataset, which we will have to split into **train** and **test** sets. The **train** set will be used for training a model and the **test** set will be used for measuring the model accuracy on fresh data not used in training.



## Split the data

First, we split the whole dataset into train and test sets. We use **random_state** to ensure reproducibility of the results. We also use **stratify** to ensure that the train/validation/test sets are balanced with respect to the labels. 

The train-test split is **80%** for training set and **20%** for test set.


In [None]:
train_df, test_df = train_test_split(data_df, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=data_df["code"].values)

Next, we further split the **train** set into **train** and **validation** sets. The validation set lets us measure not only how well the model fits the training data but also how well it generalizes to data it has not seen, so that we can understand both the bias and the variance of the model.  

The train-validation split is **80%** for training set and **20%** for validation set.

In [None]:
train_df, val_df = train_test_split(train_df, test_size=VAL_SIZE, random_state=RANDOM_STATE, stratify=train_df["code"].values)

Let's check the shape of the three datasets.

In [None]:
print("Train set rows: {}".format(train_df.shape[0]))
print("Test  set rows: {}".format(test_df.shape[0]))
print("Val   set rows: {}".format(val_df.shape[0]))
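
The expected row counts can be worked out by hand from the two successive 80/20 splits. The sketch below is simplified arithmetic with a hypothetical helper `split_sizes`; `train_test_split` may round differently when the sizes do not divide evenly.

```python
def split_sizes(n_rows, test_size, val_size):
    """Return (train, validation, test) row counts for two successive splits."""
    n_test = round(n_rows * test_size)
    n_remaining = n_rows - n_test
    n_val = round(n_remaining * val_size)
    return n_remaining - n_val, n_val, n_test

print(split_sizes(15000, test_size=0.2, val_size=0.2))
```

Overall this corresponds to 64% of the rows for training, 16% for validation and 20% for testing.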

We are now ready to start building our first model.

## Build the model


The next step is to create a predictive model.  

Let's define a few auxiliary functions that we will need for creating our models.

The first is a function that reads an image from its file and resizes it to 64 x 64 pixels (`IMAGE_WIDTH` x `IMAGE_HEIGHT`). The 3 channels are added later, when the dataset is assembled.

In [None]:
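def read_image(file_name):
    # Read the grayscale image from disk
    image = skimage.io.imread(IMAGE_PATH + file_name)
    # resize expects the output shape as (rows, cols), i.e. (height, width)
    image = skimage.transform.resize(image, (IMAGE_HEIGHT, IMAGE_WIDTH), mode='reflect')
    return image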

A function to create the dummy variables corresponding to the categorical target variable.

In [None]:
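def categories_encoder(dataset, var='character'):
    # Stack the resized grayscale images into one array of shape (n, height, width)
    X = np.stack(dataset['file'].apply(read_image))
    # we just copy the B&W image 3 times to create the RGB equivalent
    X = np.repeat(X[..., np.newaxis], 3, -1)
    # One-hot encode the target; a float dtype keeps the labels numeric for Keras
    y = pd.get_dummies(dataset[var], drop_first=False, dtype=np.float32)
    return X, y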

Let's now populate the train, validation and test sets with the image data and create the dummy variables corresponding to the categorical target variable, in our case `character`.

In [None]:
X_train, y_train = categories_encoder(train_df)
X_val, y_val = categories_encoder(val_df)
X_test, y_test = categories_encoder(test_df)

In [None]:
print(f"train: {X_train.shape}, {y_train.shape}; valid: {X_val.shape}, {y_val.shape}; test: {X_test.shape}, {y_test.shape}")
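
To make the printed shapes easier to read, here is a small sketch of the two transformations on made-up arrays: repeating a grayscale image along a new last axis, and one-hot encoding class indices. The `np.eye` indexing is only a stand-in for `pd.get_dummies`, and all array contents are invented.

```python
import numpy as np

# Two made-up grayscale "images" of 64 x 64 pixels
gray = np.zeros((2, 64, 64), dtype=np.float32)

# Add a trailing axis and repeat it 3 times to mimic RGB channels
rgb = np.repeat(gray[..., np.newaxis], 3, axis=-1)

# One-hot encode made-up class indices for a 15-class problem
labels = np.array([0, 14])
one_hot = np.eye(15, dtype=np.float32)[labels]

print(rgb.shape, one_hot.shape)
```

The image array therefore has four axes (samples, height, width, channels) and the target array has one column per class.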

In [None]:
x1 = X_train[1234,:]
print(x1.shape)
plt.imshow(x1[:,:,0])
plt.show()
plt.imshow(x1[:,:,1])
plt.show()
plt.imshow(x1[:,:,2])
plt.show()

Now we are ready to start creating our model.  
We will use the <a href="https://keras.io/api/applications/resnet/">ResNet50</a> model from Keras library.
**ResNet** (short for Residual Network) is a classic neural network architecture used as a backbone for many computer vision tasks; **ResNet50** is its 50-layer variant. The ResNet architecture won the **ImageNet** challenge in **2015**. Its fundamental breakthrough was the use of residual (skip) connections, which made it possible to train extremely deep neural networks successfully.

In [None]:
model = Sequential()
model.add(tf.keras.applications.ResNet50(include_top=False, pooling='max', weights='imagenet', input_shape=(64,64,3)))
model.add(Dense(NUM_CLASSES, activation='softmax'))
# The ResNet-50 backbone is pretrained on ImageNet; freeze it so only the new Dense layer is trained
model.layers[0].trainable = False

In [None]:
model.summary()
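
Because the backbone is frozen, only the final `Dense` layer is trained. Assuming the pooled ResNet50 output has 2048 features, the number of trainable parameters can be worked out by hand:

```python
FEATURES = 2048      # assumed size of the pooled ResNet50 output
NUM_CLASSES = 15     # same value as in the parameters cell

# Dense layer: one weight per (feature, class) pair plus one bias per class
trainable_params = FEATURES * NUM_CLASSES + NUM_CLASSES
print(trainable_params)
```

This figure should correspond to the trainable parameter count reported in the model summary.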

We train for at most 100 epochs (the `epochs` value passed to `fit`); early stopping can end the training sooner.

We also use a learning rate scheduler, so that the learning rate depends on the epoch number: at epoch `x` it is `1e-2 * 0.99 ** (x + NO_EPOCHS)`, which means it decays by 1% per epoch. 
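
The schedule can be evaluated outside of Keras to see how slowly it decays. The sketch below reuses the formula from the `LearningRateScheduler` lambda in the next cell; the function name `learning_rate` is hypothetical.

```python
NO_EPOCHS = 10  # same value as in the parameters cell

def learning_rate(epoch):
    """Learning rate used by the scheduler at a given (0-based) epoch."""
    return 1e-2 * 0.99 ** (epoch + NO_EPOCHS)

for epoch in (0, 1, 10, 50, 99):
    print(epoch, round(learning_rate(epoch), 5))
```

Each epoch multiplies the previous rate by 0.99, and the `NO_EPOCHS` offset only shifts the starting value slightly below 0.01.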

After each epoch, the early-stopping callback checks the training loss (`monitor='loss'`) and stops the training if it has not improved for `PATIENCE` epochs (10 here). Separately, whenever the validation accuracy improves, the checkpoint callback saves the current weights to `best_model.h5`. We will later load these best weights and use them for prediction on the test set.
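
The patience rule can be illustrated in plain Python. This is a simplified sketch of the idea, not the Keras implementation; the function `stopping_epoch` and the loss values are made up.

```python
def stopping_epoch(losses, patience):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Made-up loss values: the best loss is at epoch 1, then no improvement
losses = [1.0, 0.9, 0.95, 0.96, 0.97]
print(stopping_epoch(losses, patience=2))
```

A larger patience tolerates longer plateaus before stopping, at the cost of more epochs spent without improvement.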

In [None]:
# Import the callbacks from tensorflow.keras, the same API used to build the model
from tensorflow.keras.callbacks import ModelCheckpoint, Callback, EarlyStopping, ReduceLROnPlateau, LearningRateScheduler
PATIENCE = 10
VERBOSE = 1
BATCH_SIZE = 128

annealer = LearningRateScheduler(lambda x: 1e-2 * 0.99 ** (x+NO_EPOCHS))
earlystopper = EarlyStopping(monitor='loss', patience=PATIENCE, verbose=VERBOSE)
checkpointer = ModelCheckpoint('best_model.h5',
                                monitor='val_accuracy',
                                verbose=VERBOSE,
                                save_best_only=True,
                                save_weights_only=True)

In [None]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
train_model  = model.fit(X_train, y_train,
                  batch_size=BATCH_SIZE,
                  epochs=100,
                  verbose=1,
                  validation_data=(X_val, y_val),
                  callbacks=[earlystopper, checkpointer, annealer])
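
The number of weight updates per epoch follows from the training set size and the batch size. A small sketch, assuming the training set holds 9600 rows (15000 rows after the two 80/20 splits):

```python
import math

TRAIN_ROWS = 9600   # assumed: 15000 rows after the two 80/20 splits
BATCH_SIZE = 128    # same value as in the callbacks cell

# The last batch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)
print(steps_per_epoch)
```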

<a href="#0"><font size="1">Go to top</font></a>  


## <a id='42'>Model evaluation</a> 


Let's start by plotting the accuracy and the loss for the train and validation sets. 
We define a function to visualize these values.

In [None]:
def create_trace(x,y,ylabel,color):
    trace = go.Scatter(
        x = x,y = y,
        name=ylabel,
        marker=dict(color=color),
        mode = "markers+lines",
        text=x
    )
    return trace
    
def plot_accuracy_and_loss(train_model):
    hist = train_model.history
    acc = hist['accuracy']
    val_acc = hist['val_accuracy']
    loss = hist['loss']
    val_loss = hist['val_loss']
    epochs = list(range(1,len(acc)+1))
    #define the traces
    trace_ta = create_trace(epochs,acc,"Training accuracy", "Green")
    trace_va = create_trace(epochs,val_acc,"Validation accuracy", "Red")
    trace_tl = create_trace(epochs,loss,"Training loss", "Blue")
    trace_vl = create_trace(epochs,val_loss,"Validation loss", "Magenta")
    fig = tools.make_subplots(rows=1,cols=2, subplot_titles=('Training and validation accuracy',
                                                             'Training and validation loss'))
    #add traces to the figure
    fig.append_trace(trace_ta,1,1)
    fig.append_trace(trace_va,1,1)
    fig.append_trace(trace_tl,1,2)
    fig.append_trace(trace_vl,1,2)
    #set the layout for the figure
    fig['layout']['xaxis'].update(title = 'Epoch')
    fig['layout']['xaxis2'].update(title = 'Epoch')
    fig['layout']['yaxis'].update(title = 'Accuracy', range=[0,1])
    fig['layout']['yaxis2'].update(title = 'Loss', range=[0,3])
    #plot
    iplot(fig, filename='accuracy-loss')

plot_accuracy_and_loss(train_model)
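
Besides reading the plot, the best epoch can be located directly in the history dictionary. The sketch below uses a made-up history with the same keys as `train_model.history`; the helper `best_epoch` is hypothetical.

```python
def best_epoch(history, metric="val_accuracy"):
    """Return (1-based epoch, value) where the metric is highest."""
    values = history[metric]
    index = max(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]

# Made-up history in the same shape as train_model.history
history = {"accuracy": [0.50, 0.70, 0.80, 0.85],
           "val_accuracy": [0.45, 0.66, 0.64, 0.65]}
print(best_epoch(history))
```

In this toy history the training accuracy keeps rising while the validation accuracy peaks early, which is the situation in which the checkpointed weights differ from the last-epoch weights.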

<a href="#0"><font size="1">Go to top</font></a>  


## <a id='43'>Prediction of test set</a> 


Let's continue by evaluating the **loss** and **accuracy** on the **test** set.

### Prediction using last epoch model

In [None]:
score = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Let's also check the precision, recall and F1 score per class on the test set.

In [None]:
def test_accuracy_report(model):
    predicted = model.predict(X_test)
    test_predicted = np.argmax(predicted, axis=1)
    test_truth = np.argmax(y_test.values, axis=1)
    print(metrics.classification_report(test_truth, test_predicted, target_names=y_test.columns)) 
    test_res = model.evaluate(X_test, y_test.values, verbose=0)
    print(f'Loss function: {test_res[0]}, accuracy: {test_res[1]}')

In [None]:
test_accuracy_report(model)
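
The report is built from arg-max predictions compared against the true classes. As an illustration of the per-class recall column, here is a plain-Python sketch on made-up softmax outputs; the function `per_class_recall` is hypothetical and is not how scikit-learn computes its report internally.

```python
from collections import Counter

def per_class_recall(probabilities, truth):
    """Fraction of samples of each true class whose arg-max prediction is correct."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
    totals = Counter(truth)
    correct = Counter(t for t, p in zip(truth, predicted) if t == p)
    return {label: correct[label] / totals[label] for label in sorted(totals)}

# Made-up softmax outputs for 3 classes and 4 samples
probabilities = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
truth = [0, 1, 1, 2]
print(per_class_recall(probabilities, truth))
```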

### Prediction using best model

In [None]:
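# model_optimal refers to the same model object; loading the checkpoint overwrites the last-epoch weights
model_optimal = model
model_optimal.load_weights('best_model.h5')
score = model_optimal.evaluate(X_test, y_test, verbose=0)
print(f'Best model test loss: {score[0]}, accuracy: {score[1]}')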

test_accuracy_report(model_optimal)