In [1]:
import numpy as np
import pandas as pd
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/digit-recognizer/sample_submission.csv
/kaggle/input/digit-recognizer/train.csv
/kaggle/input/digit-recognizer/test.csv


For many people (including me), this is a first introduction to Computer Vision, so I will try to keep it simple. First I will build a baseline model with an ordinary classifier to serve as a benchmark. Then I will build a Convolutional Neural Network (CNN) and see how much better it performs than the benchmark.

In [2]:
import tensorflow as tf
import tensorflow_decision_forests as tfdf

from sklearn.preprocessing import MinMaxScaler

from keras.utils import to_categorical  # keras.utils.np_utils is not available in newer Keras releases
from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv2D, MaxPool2D, BatchNormalization, Dropout
from keras.optimizers import SGD

In [3]:
train_data = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')
test_data = pd.read_csv('/kaggle/input/digit-recognizer/test.csv')

In [4]:
train_data.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Baseline

For the baseline I will use the Gradient Boosted Trees model from TensorFlow Decision Forests. It does better than a Multi-layer Perceptron, which gives an accuracy of 0.95.

First, convert the train and test datasets to TensorFlow datasets, specifying the label column for train. Then initialize a Gradient Boosted Trees model with the `benchmark_rank1` hyperparameter template, fit it on train, and predict on test. The predictions come back as class probabilities, so convert them to hard predictions by taking the most probable class. No preprocessing is required: all features are numeric pixel intensities, there are no missing values, and tree-based models don't need feature scaling.
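Converting probabilities to hard predictions amounts to taking the index of the largest value in each row. Here is a minimal sketch of that step; the `probs` array is made up for illustration and is not output from the model.

```python
import numpy as np

# Hypothetical class probabilities for 3 images over the 10 digit classes.
probs = np.array([
    [0.01, 0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02, 0.01, 0.01],
    [0.80, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.04, 0.02, 0.02],
    [0.05, 0.05, 0.05, 0.05, 0.55, 0.05, 0.05, 0.05, 0.05, 0.05],
])

# argmax along axis=1 picks the most probable class for each row (image).
hard_predictions = np.argmax(probs, axis=1)
print(hard_predictions)
```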

In [5]:
# train = tfdf.keras.pd_dataframe_to_tf_dataset(train_data, label='label')
# test = tfdf.keras.pd_dataframe_to_tf_dataset(test_data)

In [6]:
# clf = tfdf.keras.GradientBoostedTreesModel(hyperparameter_template="benchmark_rank1")
# clf.fit(x=train)
# predictions = clf.predict(test)
# n_predictions = np.argmax(predictions, axis=1)

# score = 0.97046

It gives a score of 0.97046 which is a good benchmark. Let's build a CNN and see how much better it performs.

# Preprocessing

First, scale all features to the range 0 to 1, either with MinMaxScaler or, in this case, simply by dividing by 255. Neural networks are sensitive to the scale of their inputs, so this step matters.

Then reshape the data to (-1, 28, 28, 1), because each sample is an image 28 pixels high and 28 pixels wide with a single (grayscale) channel. The first dimension, -1, tells NumPy to infer the number of samples automatically. One-hot encode the labels using the to_categorical function.

In [7]:
x_train = train_data.drop(['label'], axis=1)
y_train = train_data['label']
x_test = test_data
scaler = MinMaxScaler()

In [8]:
y_train.value_counts()

1    4684
7    4401
3    4351
9    4188
2    4177
6    4137
0    4132
4    4072
8    4063
5    3795
Name: label, dtype: int64

Good news - all classes have roughly the same number of samples so we don't have to handle class imbalance.
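To put a number on "roughly the same", here is a small sketch that works from the counts printed above. The variable names are my own, and the ratio of the largest to the smallest class is just one possible way to summarize balance.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 4684, 7: 4401, 3: 4351, 9: 4188, 2: 4177,
          6: 4137, 0: 4132, 4: 4072, 8: 4063, 5: 3795}

total = sum(counts.values())
shares = {digit: n / total for digit, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

print(f"total samples: {total}")
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
for digit in sorted(shares):
    print(f"digit {digit}: {shares[digit]:.1%}")
```

Every class holds between about 9% and 11% of the samples, which is close enough to uniform for accuracy to be a meaningful metric.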

In [9]:
x_train = pd.DataFrame(scaler.fit_transform(x_train), columns=scaler.feature_names_in_)
x_test = pd.DataFrame(scaler.transform(x_test), columns=scaler.feature_names_in_)

x_train = x_train.values.reshape(-1,28,28,1)
x_test = x_test.values.reshape(-1,28,28,1)
y_train = to_categorical(y_train, num_classes=10)
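The same preprocessing steps can be sketched with plain NumPy on a tiny made-up batch. This is an illustration, not the notebook's exact pipeline: the random `pixels` and the `labels` are hypothetical, the division by 255 stands in for MinMaxScaler, and the identity-matrix trick stands in for to_categorical. Note that MinMaxScaler scales each pixel column by the minimum and maximum observed in the training data, so its output can differ slightly from a plain division by 255 for pixels that never reach 255.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the pixel columns: 4 flattened 28x28 images.
pixels = rng.integers(0, 256, size=(4, 784))
labels = np.array([3, 0, 9, 3])

# Scale to [0, 1] by dividing by the maximum 8-bit intensity.
scaled = pixels.astype("float32") / 255.0

# -1 lets NumPy infer the number of samples from the array size.
images = scaled.reshape(-1, 28, 28, 1)

# One-hot encode: row i of the 10x10 identity matrix encodes class i.
one_hot = np.eye(10, dtype="float32")[labels]

print(images.shape, one_hot.shape)
```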

# Model

Now we are ready to build the CNN model. Initialize a sequential model and add layers to it. The model has two components: the feature extraction front end comprised of convolutional and pooling layers, and the classifier backend that will make the prediction.

For the feature extraction front end, add convolution layers with an increasing number of filters (32, 64, 128) and a modest kernel size of (5, 5). Follow each convolution layer with batch normalization, which standardizes the layer outputs and thereby stabilizes and accelerates learning, and then with a pooling layer. The first convolution layer is the input layer, so the input data shape of (28, 28, 1) should be specified for it. The resulting feature maps can then be flattened to provide features to the classifier.

For the classifier back end, we know that there are 10 classes so the output layer must have 10 nodes and softmax activation in order to predict the probability distribution of an image belonging to each of the 10 classes. Between the feature extractor and the output layer, we can add a dense layer to interpret the features, in this case with 128 nodes.

Using He weight initialization and the ReLU activation function in all layers except the output layer is good practice.
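To see what this architecture implies for layer sizes, here is a sketch of the arithmetic in plain Python. It uses the standard parameter-count formulas and assumes 'same' padding for the convolutions and 2x2 max pooling, as in the model built in this notebook; the helper function names are my own. The first four counts (832, 128, 51264, 256) agree with the values in the model summary printed by the notebook.

```python
def conv2d_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def batchnorm_params(channels):
    """Scale, shift, moving mean and moving variance for each channel."""
    return 4 * channels


def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units


size, channels = 28, 1          # input images are 28x28 with 1 channel
param_counts = []               # (layer name, parameter count)

for filters in (32, 64, 128):
    param_counts.append((f"conv {filters}", conv2d_params(5, channels, filters)))
    param_counts.append((f"batchnorm {filters}", batchnorm_params(filters)))
    size //= 2                  # 'same' padding keeps the size; 2x2 pooling halves it
    channels = filters

flat_features = size * size * channels
param_counts.append(("dense 128", dense_params(flat_features, 128)))
param_counts.append(("batchnorm 128", batchnorm_params(128)))
param_counts.append(("dense 10", dense_params(128, 10)))

for name, count in param_counts:
    print(f"{name:>14}: {count}")
print(f"flattened features: {flat_features}")
print(f"total parameters: {sum(c for _, c in param_counts)}")
```

The spatial size shrinks from 28 to 14 to 7 to 3, so the classifier receives 3 x 3 x 128 flattened features.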

In [10]:
model = Sequential()

# Feature extraction block 1: 32 filters; the input shape is given only for the first layer
model.add(Conv2D(filters=32, kernel_size=(5,5), kernel_initializer='he_uniform', padding='same', activation='relu', input_shape=(28,28,1)))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2,2)))

# Feature extraction block 2: 64 filters
model.add(Conv2D(filters=64, kernel_size=(5,5), kernel_initializer='he_uniform', padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2,2)))

# Feature extraction block 3: 128 filters
model.add(Conv2D(filters=128, kernel_size=(5,5), kernel_initializer='he_uniform', padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2,2)))

# Classifier back end: flatten the feature maps and interpret them with a dense layer
model.add(Flatten())
model.add(Dense(128, kernel_initializer='he_uniform', activation='relu'))
model.add(BatchNormalization())

# Output layer: one node per digit class, softmax for a probability distribution
model.add(Dense(10, activation='softmax'))

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 28, 28, 32)        832       
                                                                 
 batch_normalization (BatchN  (None, 28, 28, 32)       128       
 ormalization)                                                   
                                                                 
 max_pooling2d (MaxPooling2D  (None, 14, 14, 32)       0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 14, 14, 64)        51264     
                                                                 
 batch_normalization_1 (Batc  (None, 14, 14, 64)       256       
 hNormalization)                                                 
                                                        

Before fitting the model on the train data, compile it with an appropriate optimizer, loss function and metric. I will use an SGD optimizer with a learning rate of 0.01 and momentum of 0.9, but you can also use the default Adam optimizer with `optimizer='adam'`. Categorical cross-entropy is a good loss function for this problem since it is a multi-class classification with one-hot encoded labels. Accuracy is a suitable metric since all classes have roughly the same number of samples.
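To make the loss concrete, here is a sketch of categorical cross-entropy for a single sample in plain Python. It uses three classes instead of ten for brevity, the example probabilities are made up, and the `eps` clipping is my own assumption to avoid taking log(0). It illustrates the formula rather than reproducing Keras's implementation.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy for one sample: -sum(t * log(p)) over the classes."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0, 0, 1]                 # one-hot label: the true class is 2
examples = {
    "confident and right": [0.05, 0.05, 0.90],
    "unsure": [0.30, 0.30, 0.40],
    "confident and wrong": [0.90, 0.05, 0.05],
}

losses = {name: categorical_crossentropy(y_true, y_pred)
          for name, y_pred in examples.items()}

for name, loss in losses.items():
    print(f"{name}: {loss:.4f}")
```

With a one-hot label, only the probability assigned to the true class contributes, so the loss is small for a confident correct prediction and grows quickly as that probability drops.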

In [11]:
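# SGD with momentum; 10% of the training data is held out for validation.
model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, validation_split=0.1, batch_size=50, epochs=50, verbose=2)

# Predict class probabilities, then take the most probable class for each image.
predictions = model.predict(x_test)
n_predictions = np.argmax(predictions, axis=1)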

Epoch 1/50
756/756 - 89s - loss: 0.0980 - accuracy: 0.9703 - val_loss: 0.0438 - val_accuracy: 0.9881 - 89s/epoch - 118ms/step
Epoch 2/50
756/756 - 88s - loss: 0.0305 - accuracy: 0.9913 - val_loss: 0.0303 - val_accuracy: 0.9905 - 88s/epoch - 117ms/step
Epoch 3/50
756/756 - 88s - loss: 0.0145 - accuracy: 0.9960 - val_loss: 0.0358 - val_accuracy: 0.9888 - 88s/epoch - 117ms/step
Epoch 4/50
756/756 - 88s - loss: 0.0078 - accuracy: 0.9983 - val_loss: 0.0303 - val_accuracy: 0.9898 - 88s/epoch - 116ms/step
Epoch 5/50
756/756 - 87s - loss: 0.0044 - accuracy: 0.9990 - val_loss: 0.0229 - val_accuracy: 0.9924 - 87s/epoch - 116ms/step
Epoch 6/50
756/756 - 87s - loss: 0.0026 - accuracy: 0.9997 - val_loss: 0.0246 - val_accuracy: 0.9919 - 87s/epoch - 115ms/step
Epoch 7/50
756/756 - 87s - loss: 0.0017 - accuracy: 0.9998 - val_loss: 0.0233 - val_accuracy: 0.9924 - 87s/epoch - 115ms/step
Epoch 8/50
756/756 - 86s - loss: 8.4187e-04 - accuracy: 1.0000 - val_loss: 0.0228 - val_accuracy: 0.9926 - 86s/epoch -
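The 756 steps per epoch in the log follow from the split and batch size. A quick sketch of the arithmetic, assuming the 42000 training rows implied by the class counts:

```python
import math

n_samples = 42000          # rows in train.csv (sum of the class counts)
validation_split = 0.1
batch_size = 50

# Keras keeps the first part of the data for training and holds out the rest.
n_train = math.floor(n_samples * (1 - validation_split))
n_val = n_samples - n_train
steps_per_epoch = math.ceil(n_train / batch_size)

print(f"training samples: {n_train}")
print(f"validation samples: {n_val}")
print(f"steps per epoch: {steps_per_epoch}")
```

Keras takes the validation samples from the end of the arrays before shuffling, so the reported val_accuracy is measured on the last 10% of the rows in train.csv.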

# Submission

Observe the sample submission and submit your predictions in the same format.

In [12]:
pd.read_csv('/kaggle/input/digit-recognizer/sample_submission.csv')

Unnamed: 0,ImageId,Label
0,1,0
1,2,0
2,3,0
3,4,0
4,5,0
...,...,...
27995,27996,0
27996,27997,0
27997,27998,0
27998,27999,0


In [13]:
# ImageId starts at 1 and runs to the number of test images (28000 here).
df = pd.DataFrame()
df['ImageId'] = range(1, len(n_predictions) + 1)
df['Label'] = n_predictions
df.to_csv('submission.csv', index=False)
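Before uploading, it can help to check that the file matches the sample submission format. Below is a minimal sketch using only the standard library. `validate_submission` is a hypothetical helper of my own, and the rules it checks (the two column names, consecutive ImageId values starting at 1, labels between 0 and 9) are inferred from the sample submission shown above rather than taken from official competition rules.

```python
import csv
import io


def validate_submission(csv_text, expected_rows):
    """Return True if the CSV text looks like a valid submission."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["ImageId", "Label"]:
        return False
    rows = list(reader)
    if len(rows) != expected_rows:
        return False
    for i, row in enumerate(rows, start=1):
        # ImageId must count up from 1; Label must be a digit class.
        if int(row["ImageId"]) != i or not 0 <= int(row["Label"]) <= 9:
            return False
    return True


good = "ImageId,Label\n1,2\n2,0\n3,9\n"
bad = "ImageId,Label\n1,2\n3,0\n"      # ImageId skips from 1 to 3

print(validate_submission(good, expected_rows=3))
print(validate_submission(bad, expected_rows=2))
```

For the real file, the text could be read with `open('submission.csv').read()` and checked against 28000 expected rows.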