<h1><center><font size="6">Chinese MNIST Exploratory Data Analysis</font></center></h1>

<center><img src="https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F769452%2Ffae77a81c057fe419de60f5e2b20be25%2Fchinese_mnist_profile_small.png?generation=1596963542354014&alt=media"></img></center>
 
 

# <a id='0'>Content</a>

- <a href='#1'>Introduction</a>  
- <a href='#2'>Prepare the data analysis</a>  
 - <a href='#21'>Load packages</a>  
 - <a href='#22'>Load the data</a>  
 - <a href='#21'>Preprocessing data</a>  
- <a href='#3'>Data exploration</a>   
 - <a href='#31'>Check for missing data</a>  
 - <a href='#32'>Explore image data</a>  
 - <a href='#33'>Suites, samples, characters distribution</a>  
- <a href='#6'>Conclusions</a>      

# <a id='1'>Introduction</a>  


In this Kernel, we explore a dataset of annotated images of handwritten Chinese numbers. The images were collected from 100 volunteers; each volunteer provided 10 samples, and each sample is a complete set of the 15 Chinese characters for numbers.

The Chinese characters are the following:
* 零 - for 0  
* 一 - for 1
* 二 - for 2  
* 三 - for 3  
* 四 - for 4  
* 五 - for 5  
* 六 - for 6  
* 七 - for 7  
* 八 - for 8  
* 九 - for 9  
* 十 - for 10
* 百 - for 100
* 千 - for 1000
* 万 - for 10,000 (ten thousand)
* 亿 - for 100,000,000 (one hundred million)
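The list above can also be written as a small lookup table, which may be convenient later when converting a `character` label back into its numeric value. This is only an illustrative sketch: the names `CHAR_TO_VALUE` and `char_to_value` are hypothetical and are not part of the dataset.

```python
# Hypothetical lookup table transcribed from the list above
# (it is not shipped with the dataset).
CHAR_TO_VALUE = {
    "零": 0,
    "一": 1,
    "二": 2,
    "三": 3,
    "四": 4,
    "五": 5,
    "六": 6,
    "七": 7,
    "八": 8,
    "九": 9,
    "十": 10,
    "百": 100,
    "千": 1000,
    "万": 10_000,
    "亿": 100_000_000,
}


def char_to_value(character):
    """Return the numeric value of one of the 15 characters."""
    return CHAR_TO_VALUE[character]


print(len(CHAR_TO_VALUE), char_to_value("万"))
```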


The objective of the Kernel is to take us through the first steps of a machine learning analysis. We start by preparing the analysis (loading the libraries and the data) and continue with an Exploratory Data Analysis (EDA), where we highlight various data features and take some time to understand the data.

The first step is to prepare the data analysis.

<a href="#0"><font size="1">Go to top</font></a>  

# <a id='2'>Prepare the data analysis</a>   


Before starting the analysis, we need to make a few preparations: load the packages, then load and inspect the data.



## <a id='21'>Load packages</a>

We load the packages used for the analysis.


In [None]:
import pandas as pd
import numpy as np
import sys
import os
import cv2 as cv
import matplotlib.pyplot as plt
import seaborn as sns
import skimage
import skimage.io

We also set the image path.

In [None]:
IMAGE_PATH = '..//input//chinese-mnist//data//data//'

<a href="#0"><font size="1">Go to top</font></a>  


## <a id='22'>Load the data</a>  

Let's first see what data files we have in the root directory.

In [None]:
os.listdir("..//input//chinese-mnist")

There is a dataset file and a folder with images.  

Let's load the dataset file first.

In [None]:
data_df=pd.read_csv('..//input//chinese-mnist//chinese_mnist.csv')

Let's glimpse the data. First, let's check the number of columns and rows.

In [None]:
data_df.shape

There are 15000 rows and 5 columns. Let's look at the data.

In [None]:
data_df.sample(100).head()

The data contains the following values:  

* suite_id - each suite corresponds to a set of handwritten samples by one volunteer;  
* sample_id - each sample contains a complete set of 15 characters for Chinese numbers;
* code - for each Chinese character we are using a code, with values from 1 to 15;
* value - this is the actual numerical value associated with the Chinese character for number;  
* character - the Chinese character;  

We index the files in the dataset by forming a file name from suite_id, sample_id and code. The pattern for a file name is as follows:

> "input_{suite_id}_{sample_id}_{code}.jpg"

<a href="#0"><font size="1">Go to top</font></a>  

# <a id='3'>Data exploration</a>  



Let's start by checking if there are missing data, unlabeled data or data that is inconsistently labeled. 


## <a id='31'>Check for missing data</a>  

Let's create a function that checks for missing data in the dataset.

In [None]:
def missing_data(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data(data_df)

There is no missing (null) data in the dataset. Still, some of the data labels might be misspelled; we will check this when we analyze each data feature.
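One way to look for inconsistent labels is to verify that every `code` maps to exactly one `value` and one `character`. The following is a minimal sketch on a toy frame, assuming the column names described earlier; the helper name `inconsistent_labels` is hypothetical.

```python
import pandas as pd


def inconsistent_labels(df, key="code", labels=("value", "character")):
    """Return per-key unique counts for keys that map to more than one label."""
    counts = df.groupby(key)[list(labels)].nunique()
    return counts[(counts > 1).any(axis=1)]


# Toy data: code 2 is deliberately given two different characters.
toy = pd.DataFrame({
    "code": [1, 1, 2, 2],
    "value": [0, 0, 1, 1],
    "character": ["零", "零", "一", "二"],
})
print(inconsistent_labels(toy))
```

An empty result on the real `data_df` would suggest that the three label columns agree with each other.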

<a href="#0"><font size="1">Go to top</font></a>  

## <a id='32'>Explore image data</a>  

Let's also check the image data. First, we check how many images are stored in the image folder.

In [None]:
image_files = list(os.listdir(IMAGE_PATH))
print("Number of image files: {}".format(len(image_files)))

Let's also check that each line in the dataset has a corresponding image in the image list.  
First, we will have to compose the name of the file from the indexes.

In [None]:
def create_file_name(x):
    # Look the fields up by column name rather than by position
    file_name = f"input_{x['suite_id']}_{x['sample_id']}_{x['code']}.jpg"
    return file_name

In [None]:
data_df["file"] = data_df.apply(create_file_name, axis=1)

In [None]:
data_df.head()
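The naming pattern can also be exercised on its own, in both directions. The sketch below builds a name from the three indexes and parses it back; `build_file_name` and `parse_file_name` are hypothetical helpers that only assume the `input_{suite_id}_{sample_id}_{code}.jpg` pattern.

```python
import re

# Assumed pattern: input_{suite_id}_{sample_id}_{code}.jpg
FILE_NAME_RE = re.compile(r"input_(\d+)_(\d+)_(\d+)\.jpg")


def build_file_name(suite_id, sample_id, code):
    """Compose the image file name from the three indexes."""
    return f"input_{suite_id}_{sample_id}_{code}.jpg"


def parse_file_name(file_name):
    """Return (suite_id, sample_id, code) as ints, or None if the name does not match."""
    match = FILE_NAME_RE.fullmatch(file_name)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


name = build_file_name(1, 1, 10)
print(name, parse_file_name(name))
```

Parsing the names found on disk is one possible way to spot files that do not follow the pattern at all.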

In [None]:
file_names = list(data_df['file'])
print("Matching image names: {}".format(len(set(file_names).intersection(image_files))))
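The size of the intersection shows how many names match, but not which ones do not. A small sketch using set differences can list them explicitly; `compare_file_lists` is a hypothetical helper and the file names in the example are made up for illustration.

```python
def compare_file_lists(expected, present):
    """Return (missing, unlisted): names expected but absent, and names present but not expected."""
    expected, present = set(expected), set(present)
    missing = sorted(expected - present)
    unlisted = sorted(present - expected)
    return missing, unlisted


# Illustrative names only, not taken from the real folder.
missing, unlisted = compare_file_lists(
    ["input_1_1_1.jpg", "input_1_1_2.jpg"],
    ["input_1_1_2.jpg", "input_1_1_3.jpg"],
)
print(missing, unlisted)
```

With the notebook's variables, the call would be `compare_file_lists(file_names, image_files)`; two empty lists would mean every row has an image and every image has a row.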

Let's also check the image sizes.

In [None]:
def read_image_sizes(file_name):
    image = skimage.io.imread(IMAGE_PATH + file_name)
    return list(image.shape)

In [None]:
m = np.stack(data_df['file'].apply(read_image_sizes))
# image.shape is (rows, columns), i.e. (height, width), for a grayscale image
df = pd.DataFrame(m,columns=['h','w'])
data_df = pd.concat([data_df,df],axis=1, sort=False)

Let's check the distribution of image widths and heights.

In [None]:
print(f"Images widths #: {data_df.w.nunique()},  heights #: {data_df.h.nunique()}")
print(f"Images widths values: {data_df.w.unique()},  heights values: {data_df.h.unique()}")
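When reading sizes from an image array, it helps to remember that NumPy reports `shape` as (rows, columns), that is (height, width), with an optional third channel axis for color images. The sketch below uses synthetic arrays rather than the dataset images; `height_width` is a hypothetical helper.

```python
import numpy as np


def height_width(image):
    """Return (height, width) for a 2-D grayscale or 3-D color array."""
    height, width = image.shape[:2]
    return height, width


gray = np.zeros((48, 64), dtype=np.uint8)      # 48 rows, 64 columns
color = np.zeros((48, 64, 3), dtype=np.uint8)  # same size, 3 channels
print(height_width(gray), height_width(color))
```

Slicing with `[:2]` keeps the helper usable even if some images turn out to have a channel axis.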

Let's also glimpse the dataframe with the new columns.

In [None]:
data_df.head()

## <a id='33'>Suites, Samples, Characters distribution</a>  

Let's check the suites of the images. For this, we count the unique values of `suite_id`, together with the unique values of the other columns.

In [None]:
print(f"Number of suites: {data_df.suite_id.nunique()}")
print(f"Samples: {data_df.sample_id.nunique()}: {list(data_df.sample_id.unique())}")
print(f"Characters codes: {data_df.code.nunique()}: {list(data_df.code.unique())}")
print(f"Characters: {data_df.character.nunique()}: {list(data_df.character.unique())}")
print(f"Numbers: {data_df.value.nunique()}: {list(data_df.value.unique())}")

We have 100 suites, each with 10 samples.
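With 100 suites, 10 samples per suite and 15 characters per sample, we expect 100 × 10 × 15 = 15000 rows, which agrees with the shape seen earlier. A possible way to confirm that every (suite, sample) pair is complete is sketched below on a small synthetic frame; `incomplete_samples` is a hypothetical helper.

```python
import itertools

import pandas as pd


def incomplete_samples(df, expected_codes=15):
    """Return the (suite_id, sample_id) groups that do not have the expected number of codes."""
    sizes = df.groupby(["suite_id", "sample_id"])["code"].nunique()
    return sizes[sizes != expected_codes]


# Synthetic data: 2 suites x 3 samples x 15 codes.
rows = list(itertools.product(range(1, 3), range(1, 4), range(1, 16)))
toy = pd.DataFrame(rows, columns=["suite_id", "sample_id", "code"])
print(len(toy), len(incomplete_samples(toy)))
```

Running `incomplete_samples(data_df)` on the real data should return an empty result if all samples are complete.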

In [None]:
def plot_count(feature, title, df, size=1):
    f, ax = plt.subplots(1,1, figsize=(4*size,4))
    total = float(len(df))
    g = sns.countplot(x=df[feature], order = df[feature].value_counts().index[:20], palette='Set2', ax=ax)
    g.set_title("Number and percentage of {}".format(title))
    if(size > 2):
        plt.xticks(rotation=90, size=8)
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(100*height/total),
                ha="center") 
    plt.show()  

In [None]:
plot_count("code", "character code", data_df, size=3)

In [None]:
plot_count("value", "number value", data_df, size=3)

In [None]:
print("Frequency of each character:")
data_df.character.value_counts()
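The percentages annotated on the bars can also be computed without plotting. The sketch below reproduces the `100 * count / total` arithmetic with the standard library; `share_percent` is a hypothetical helper and the input is toy data.

```python
from collections import Counter


def share_percent(values):
    """Return a dict mapping each distinct value to its share of the total, in percent."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: 100 * count / total for value, count in counts.items()}


shares = share_percent(["a", "a", "b", "c"])
print(shares)
```

For a perfectly balanced set of 15 characters, each share would be 100 / 15, about 6.67%.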

In [None]:
def show_images(df, isTest=False):
    f, ax = plt.subplots(10,15, figsize=(15,10))
    for i in range(len(df)):
        dd = df.iloc[i]  # positional lookup, independent of the index labels
        image_name = dd['file']
        image_path = os.path.join(IMAGE_PATH, image_name)
        img_data = cv.imread(image_path)
        ax[i//15, i%15].imshow(img_data)
        ax[i//15, i%15].axis('off')
    plt.show()

We show here the samples drawn by volunteer number 1.

In [None]:
df = data_df.loc[data_df.suite_id==1].sort_values(by=["sample_id","value"]).reset_index()
show_images(df)

And here are the samples drawn by volunteer number 37.

In [None]:
df = data_df.loc[data_df.suite_id==37].sort_values(by=["sample_id","value"]).reset_index()
show_images(df)

For volunteer number 75:

In [None]:
df = data_df.loc[data_df.suite_id==75].sort_values(by=["sample_id","value"]).reset_index()
show_images(df)

Let's now look at a selection of writings for number 0.

In [None]:
df = data_df.loc[data_df.code==1].sample(150).reset_index()
show_images(df)

Let's now see a collection of writings for number 4.

In [None]:
df = data_df.loc[data_df.code==5].sample(150).reset_index()
show_images(df)

<a href="#0"><font size="1">Go to top</font></a>  

# <a id='6'>Conclusions</a>  

We analyzed the dataset, focusing on understanding the data distribution. In the next Notebooks, we will see how to use this data to train a model that classifies new images by character (number value, code or an equivalent associated label).


<a href="#0"><font size="1">Go to top</font></a>