# Classifying Chinese Digits 

> The MNIST dataset is one of the classics of machine learning. It is the gateway to deep learning for many, and it undoubtedly changed how people approach problems in computing. 
> 
> This notebook takes a look at a variation of the beloved MNIST dataset. Instead of Arabic numerals, we are classifying Chinese digits. 
> 
> The dataset covers characters for values from 0 up to 100 million. Interestingly, the Chinese counting system groups numbers a bit differently. Instead of saying "ten thousand", it uses a dedicated unit called **万** (wan). Likewise, a separate character, **亿** (yi), denotes one hundred million. 
> 
**I hope this notebook is helpful. Feel free to send me feedback and suggestions.**

![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2F9%2F9f%2FWang_Xianzi_Imitation_by_Tang_Dynasty.JPG%2F1920px-Wang_Xianzi_Imitation_by_Tang_Dynasty.JPG&f=1&nofb=1)

Image source: Wikipedia
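To make the counting units from the introduction concrete, here is a small lookup table of characters and their values. It is a sketch that assumes the dataset's classes are the fifteen standard characters listed below; the name `CHINESE_DIGITS` is made up for this illustration and is not used elsewhere in the notebook.

```python
# Hypothetical lookup table: each character and the value it denotes.
# Assumes the dataset uses these fifteen standard characters.
CHINESE_DIGITS = {
    "零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
    "五": 5, "六": 6, "七": 7, "八": 8, "九": 9,
    "十": 10, "百": 100, "千": 1_000,
    "万": 10_000,        # wan: ten thousand has its own unit
    "亿": 100_000_000,   # yi: one hundred million
}

for character, value in CHINESE_DIGITS.items():
    print(f"{character} = {value:,}")
```

Note the jump from 万 straight to 亿: no single character in this set stands for one hundred thousand, one million, or ten million.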

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import os 
import cv2
import re
from sklearn.model_selection import train_test_split

Read the CSV index file, which describes how the training images are laid out.

In [None]:
index_file = pd.read_csv('../input/chinese-mnist/chinese_mnist.csv')
index_file

The images are named after three columns of the index file: suite_id, sample_id, and code. For example:

suite_id: 1, sample_id: 3, code: 4 gives input_1_3_4.jpg  
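The naming rule can be sketched in both directions: building a file name from the three IDs, and recovering the IDs from a file name. The helper names `image_filename` and `parse_filename` are hypothetical and exist only for this illustration.

```python
import re


def image_filename(suite_id, sample_id, code):
    """Build the file name for one row of the index file."""
    return f"input_{suite_id}_{sample_id}_{code}.jpg"


def parse_filename(filename):
    """Recover (suite_id, sample_id, code) from a file name."""
    numbers = [int(n) for n in re.findall(r"[0-9]+", filename)]
    if len(numbers) != 3:
        raise ValueError(f"Unexpected file name: {filename}")
    return tuple(numbers)


print(image_filename(1, 3, 4))            # input_1_3_4.jpg
print(parse_filename("input_1_3_4.jpg"))  # (1, 3, 4)
```

Parsing with a regular expression that collects every run of digits is the same approach the loading cell uses.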

In [None]:
labels = sorted(index_file.value.unique())
labels

Load the image data, matching each file to its label using the naming rule from the CSV file 

In [None]:
data = []
width = 60
height = 60

for dirname, _, filenames in os.walk('/kaggle/input/chinese-mnist/data/data/'):
    for filename in filenames:
        # extract suite_id, sample_id, and code from a name like input_1_3_4.jpg
        comb = [int(i) for i in re.findall('[0-9]+', filename)]
        if len(comb) == 3:
            suite_id, sample_id, code = comb
            # select the matching row with one combined boolean mask
            row = index_file[(index_file.suite_id == suite_id)
                             & (index_file.sample_id == sample_id)
                             & (index_file.code == code)]
            label = labels.index(row.value.values[0])
            # read the image as grayscale and resize it
            image_data = cv2.imread(os.path.join(dirname, filename), cv2.IMREAD_GRAYSCALE)
            resized_data = cv2.resize(image_data, (width, height))
            data.append([resized_data, label])
        else:
            print('Incompatible file format:', filename)
            break
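The loading cell filters the whole index file once per image. One possible alternative is to build a dictionary keyed by (suite_id, sample_id, code) once and look each label up directly. The sketch below uses a few made-up rows as a stand-in for the index file; the values in `rows` are invented for illustration, and `label_lookup` and `label_for` are hypothetical names.

```python
# Made-up stand-in rows; the real ones would come from the index file.
rows = [
    {"suite_id": 1, "sample_id": 1, "code": 1, "value": 0},
    {"suite_id": 1, "sample_id": 1, "code": 2, "value": 1},
    {"suite_id": 1, "sample_id": 2, "code": 1, "value": 0},
    {"suite_id": 2, "sample_id": 1, "code": 3, "value": 2},
]

# Build the lookup once: (suite_id, sample_id, code) -> value.
label_lookup = {
    (r["suite_id"], r["sample_id"], r["code"]): r["value"] for r in rows
}

# Class indices follow the sorted unique values, as in the notebook.
labels = sorted(set(label_lookup.values()))


def label_for(suite_id, sample_id, code):
    """Return the class index for one image."""
    return labels.index(label_lookup[(suite_id, sample_id, code)])


print(label_for(2, 1, 3))
```

A dictionary lookup takes roughly constant time per image, whereas filtering the DataFrame scans every row each time.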

Visualize the Chinese digits 

In [None]:
plt.figure(figsize=(19, 16))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(data[i][0], cmap='gray')  # images are single-channel grayscale
    plt.xlabel(labels[data[i][1]])

Preparing our data for training 

In [None]:
X = [] 
y = [] 

for feature, label in data:
    X.append(feature)
    y.append(label)

X = np.array(X).reshape(-1, height, width, 1)  # cv2 images are (height, width)
y = np.array(y)
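The reshape adds a trailing channel axis, which is the layout a Keras Conv2D layer expects for grayscale input. A minimal sketch with blank stand-in images (not the real data) shows the resulting shape:

```python
import numpy as np

width, height = 60, 60

# Five blank stand-in images, shaped (height, width) like cv2 output.
images = [np.zeros((height, width), dtype=np.uint8) for _ in range(5)]

# Stack into one array and add a channel axis of size 1.
X = np.array(images).reshape(-1, height, width, 1)

print(X.shape)  # (5, 60, 60, 1)
```

The `-1` lets NumPy infer the number of images, so the same line works for any dataset size.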

Split into training and testing sets 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=69)
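To see roughly how many images each stage gets, the arithmetic can be sketched as below. It assumes a hypothetical total of 15,000 images and combines the 20% test split here with the 15% validation split passed to `model.fit` later in the notebook. The sizes are approximate, since the libraries apply their own rounding, and `split_sizes` is a made-up helper.

```python
def split_sizes(n_samples, test_size, validation_split):
    """Approximate sample counts for the test, validation, and fit sets."""
    n_test = round(n_samples * test_size)
    n_train = n_samples - n_test
    # The validation set is carved out of the training portion.
    n_val = round(n_train * validation_split)
    n_fit = n_train - n_val
    return n_test, n_val, n_fit


# 15000 is an assumed dataset size, used only for illustration.
n_test, n_val, n_fit = split_sizes(15000, test_size=0.20, validation_split=0.15)
print(n_test, n_val, n_fit)  # 3000 1800 10200
```

Under these assumptions, only about two thirds of the images are actually used to update the weights.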

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, MaxPooling2D, Conv2D, Activation, Flatten, Dropout, BatchNormalization
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers

Building a convolutional neural network 

In [None]:
model = Sequential()

model.add(Conv2D(256, (3, 3), padding='same', input_shape=X_train.shape[1:]))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization(axis=-1))  # normalize over channels (channels-last data)

model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization(axis=-1))  # normalize over channels (channels-last data)

model.add(Flatten())  # this converts our 3D feature maps to 1D feature vectors

model.add(Dense(32))
model.add(Activation('relu'))
model.add(Dropout(0.3))
model.add(BatchNormalization(axis=1))

model.add(Dense(len(labels)))
model.add(Activation('softmax'))

early_stop = tf.keras.callbacks.EarlyStopping(patience=3, monitor='val_loss', restore_best_weights=True)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(X_train, y_train, batch_size=70, epochs=100, validation_split=0.15, callbacks=[early_stop])
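To get a feel for where the model's weights sit, the parameter counts of the convolutional and dense layers can be worked out by hand. This sketch assumes 60x60 single-channel inputs and 15 output classes, and it leaves out the BatchNormalization layers; the helper names are hypothetical.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    """Weights plus one bias per unit for a Dense layer."""
    return (inputs + 1) * units


size = 60                             # assumed input height and width
conv1 = conv2d_params(3, 1, 256)      # padding='same' keeps 60x60
size //= 2                            # 2x2 max pooling -> 30x30
conv2 = conv2d_params(3, 256, 64)
size //= 2                            # 2x2 max pooling -> 15x15
flat = size * size * 64               # length of the flattened vector
dense1 = dense_params(flat, 32)
dense2 = dense_params(32, 15)         # 15 classes is an assumption

print(conv1, conv2, flat, dense1, dense2)
# 2560 147520 14400 460832 495
```

Most of the weights end up in the first Dense layer, because it connects every one of the 14,400 flattened features to each of its 32 units.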

Evaluating the model 

In [None]:
model.evaluate(X_test, y_test)
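With the settings used in `model.compile`, `model.evaluate` returns the loss followed by the accuracy. As a rough illustration of what those two numbers mean, the sketch below computes sparse categorical cross-entropy and accuracy by hand for three made-up softmax outputs over three classes; these are toy values, not results from the model.

```python
import math

# Made-up softmax outputs for three samples.
probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.5, 0.3, 0.2],
]
true_labels = [0, 1, 2]

# Loss: negative log of the probability given to the true class, averaged.
losses = [-math.log(p[t]) for p, t in zip(probabilities, true_labels)]
loss = sum(losses) / len(losses)

# Accuracy: share of samples whose most probable class is the true class.
predictions = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
correct = sum(pred == t for pred, t in zip(predictions, true_labels))
accuracy = correct / len(true_labels)

print(predictions)  # [0, 1, 0]
print(round(loss, 4), round(accuracy, 4))
```

The third sample is misclassified, and it also contributes the largest share of the loss because the true class received only 0.2 probability.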