# Introduction

For a change of pace, I'd like to try an unusual approach to this common challenge: seeing how methods normally reserved for tabular data problems perform on the dataset.

Note that all of these methods are likely to share one limitation: they treat every column (pixel) as an independent, equally placed feature. In image recognition it is often very helpful that pixels near a selected pixel carry more weight for it than pixels far away. Tabular methods have no such notion of locality, which puts them at a disadvantage here. Still, it is interesting to see how far we might get.
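
To make the lost locality concrete, here is a small sketch of how a 28x28 image ends up as 784 columns. It assumes row-major flattening (which is how the competition describes its `pixel0` to `pixel783` columns); the helper `pixel_to_column` is a made-up name for illustration.

```python
import numpy as np

SIDE = 28  # MNIST images are 28x28 pixels


def pixel_to_column(row, col, side=SIDE):
    """Column index of pixel (row, col) after row-major flattening (hypothetical helper)."""
    return row * side + col


# Toy "image" whose pixel values are just their own flattened index
image = np.arange(SIDE * SIDE).reshape(SIDE, SIDE)
flat = image.reshape(-1)

centre = pixel_to_column(14, 14)
right = pixel_to_column(14, 15)  # horizontal neighbour
below = pixel_to_column(15, 14)  # vertical neighbour

# The horizontal neighbour is the next column, the vertical one is 28 columns away
print(centre, right, below)
```

A tree-based model only sees the column indices, so nothing tells it that the column 28 positions away is a direct neighbour in the image.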

In [1]:
# Import Libraries

import numpy as np
import pandas as pd 
from PIL import Image
from matplotlib import pyplot as plt
import seaborn as sns
import os 

Let's look at the images

In [1]:
# load data
df = pd.read_csv(r"../input/digit-recognizer/train.csv",dtype = np.float32)

# Split data into features X and labels y
X = df.iloc[:,1:].values / 255 
y = df.iloc[:,0].values

fig,ax = plt.subplots(2,5)
for i in range(10):
    nparray = X[i].reshape(28,28)
    image = Image.fromarray(nparray * 255)
    ax[i%2][i//2].imshow(image)
fig.show()

Let's look at the distribution of classes in the dataset

In [1]:
sns.histplot(y)

Roughly equal, which is good. Interestingly, I had always assumed MNIST would be completely balanced.

# Create Decision Tree and Visualize Result

In [1]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

tree_set = df.copy()
target = tree_set.iloc[:,0]
tree_set_X = tree_set.iloc[:,1:] 

clf = DecisionTreeClassifier(max_depth=4)
clf.fit(tree_set_X, target)
clf.score(tree_set_X, target)

Even with a max depth of 4, I find this accuracy kind of crazy (keep in mind that it is measured on the training data itself). Let's look at the tree:

In [1]:
lst = tree_set_X.columns.tolist()
text_representation = tree.export_text(clf, feature_names=lst)
print(text_representation)

I'd say this indicates that a few central pixels can go a long way toward determining which digit an image shows.
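
One way to check which pixels a shallow tree relies on is to reshape its feature importances back into the image grid. The sketch below is only an illustration: it uses scikit-learn's bundled 8x8 digits dataset as a small stand-in for the 28x28 data, and the variable names are my own.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier

# Assumption: the 8x8 digits dataset stands in for the 28x28 Kaggle images
digits = load_digits()
X_small, y_small = digits.data, digits.target

small_tree = DecisionTreeClassifier(max_depth=4, random_state=0)
small_tree.fit(X_small, y_small)

# One importance value per pixel, laid out like the original image
importance_map = small_tree.feature_importances_.reshape(8, 8)

# A depth-4 tree has at most 15 split nodes, so at most 15 pixels can matter
used_pixels = int(np.count_nonzero(importance_map))
print(used_pixels)
```

Passing `importance_map` to `plt.imshow` shows where in the image the splits fall. On the notebook's own tree the same idea would be `clf.feature_importances_.reshape(28, 28)`.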

In [1]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

train_ratio = 0.90

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=1 - train_ratio, stratify = y, random_state = 0)

scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)

In [1]:
from sklearn.ensemble import ExtraTreesClassifier

clf = ExtraTreesClassifier(random_state=1)

clf.fit(X_train, y_train)
print(clf.score(X_val, y_val))

Now let's try it in binary

In [1]:
from sklearn.preprocessing import Binarizer

X_train_bin = Binarizer().fit_transform(X_train)
X_val_bin = Binarizer().fit_transform(X_val)

fig,ax = plt.subplots(2,5)
for i in range(10):
    nparray = X_train_bin[i].reshape(28,28)
    image = Image.fromarray(nparray * 255)
    ax[i%2][i//2].imshow(image)
    ax[i%2][i//2].title.set_text(round(y_train[i]))
fig.show()
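
To make the binarization step concrete: `Binarizer` defaults to `threshold=0.0`, and because the features were standardized first, a pixel becomes 1 only when it is brighter than that pixel's mean over the training set. A minimal sketch on made-up toy values (not the MNIST data):

```python
import numpy as np
from sklearn.preprocessing import Binarizer, StandardScaler

# Toy data: column 0 is always off, column 1 varies around a mean of 0.5
raw = np.array([[0.0, 0.25],
                [0.0, 0.75],
                [0.0, 0.50]])

scaled = StandardScaler().fit_transform(raw)

# Default threshold is 0.0: values strictly above 0 become 1, everything else 0
binary = Binarizer().transform(scaled)
print(binary)
```

Only the 0.75 entry ends up as 1. The constant column and the value sitting exactly at the mean both map to 0.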

In [1]:
et_clf = ExtraTreesClassifier(random_state=1)

et_clf.fit(X_train_bin, y_train)
print(et_clf.score(X_val_bin, y_val))

That's getting better in terms of accuracy. Let's see if we can find the images it gets wrong.

In [1]:
# Predict the whole validation set once, then keep the first 10 misclassified images
val_pred = et_clf.predict(X_val_bin)
bad_idx = np.flatnonzero(val_pred != y_val)[:10]

bad_image_list = X_val_bin[bad_idx]
bad_image_list_y = y_val[bad_idx]
bad_predict = val_pred[bad_idx].tolist()
    
fig,ax = plt.subplots(2,5)
for i in range(10):
    nparray = bad_image_list[i].reshape(28,28)
    image = Image.fromarray(nparray * 255)
    ax[i%2][i//2].imshow(image)
    ax[i%2][i//2].title.set_text("Real: " + str(round(bad_image_list_y[i])))
fig.show()

print("Bad predictions:")
print(bad_predict)

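Let's look at the confusion matrix for this model. The color scale is capped at 10 (`vmax=10`) so that the off-diagonal errors stay visible next to the much larger diagonal counts.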

In [1]:
from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(y_val, et_clf.predict(X_val_bin))
sns.heatmap(conf_mat, vmax=10, cmap="Blues")
plt.show()
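
The heatmap can be hard to read off exactly, so here is a small sketch that lists the most frequent mistakes directly. It works on any square confusion matrix; the helper name `top_confusions` and the toy matrix are invented for illustration.

```python
import numpy as np


def top_confusions(conf_mat, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count) tuples."""
    off_diag = conf_mat.copy()
    np.fill_diagonal(off_diag, 0)  # ignore correct predictions
    order = np.argsort(off_diag, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(order, off_diag.shape)
    return [(int(r), int(c), int(off_diag[r, c])) for r, c in zip(rows, cols)]


# Toy 3-class matrix: rows are true labels, columns are predictions
toy = np.array([[50, 2, 0],
                [1, 40, 7],
                [0, 3, 45]])

print(top_confusions(toy, k=2))  # [(1, 2, 7), (2, 1, 3)]
```

Calling `top_confusions(conf_mat)` on the matrix above would give the digit pairs the model mixes up most often.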

# Stacking Models

We can try stacking models for better results.

In [1]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(random_state=1)

rf_clf.fit(X_train_bin, y_train)
print(rf_clf.score(X_val_bin, y_val))

In [1]:
from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(random_state=1)

dt_clf.fit(X_train_bin, y_train)
print(dt_clf.score(X_val_bin, y_val))

In [1]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

estimators = [
     ('rf', RandomForestClassifier(random_state=1)),
     ('et', ExtraTreesClassifier(random_state=1))
]

stack_clf = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression()
)

stack_clf.fit(X_train_bin, y_train)
print(stack_clf.score(X_val_bin, y_val))
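
For intuition on what the stack does, here is a rough hand-rolled version. As I understand scikit-learn's defaults, `StackingClassifier` trains the final estimator on out-of-fold class probabilities from the base models; the sketch below mimics that with `cross_val_predict`. It uses the small 8x8 digits dataset and fewer trees as assumptions to keep it quick, so it is not a reproduction of the notebook's model.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Assumption: small stand-in dataset instead of the Kaggle data
X, y = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

base_models = [
    RandomForestClassifier(n_estimators=25, random_state=1),
    ExtraTreesClassifier(n_estimators=25, random_state=1),
]

# Out-of-fold class probabilities become the features of the final estimator
meta_train = np.hstack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")
    for m in base_models
])
final = LogisticRegression(max_iter=1000).fit(meta_train, y_tr)

# At prediction time the base models are refit on all of the training data
meta_val = np.hstack([
    m.fit(X_tr, y_tr).predict_proba(X_va) for m in base_models
])
val_accuracy = final.score(meta_val, y_va)
print(meta_train.shape, val_accuracy)
```

Each base model contributes 10 probability columns, so the logistic regression sees 20 features per image rather than the raw pixels.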

# Create Submission

In [1]:
test_df = pd.read_csv("../input/digit-recognizer/test.csv")
# Match the training preprocessing: scale pixels to [0, 1] before standardizing
test_X = scaler.transform(test_df.values / 255)
test_X = Binarizer().fit_transform(test_X)
predicted = stack_clf.predict(test_X)

Check output visually

In [1]:
fig,ax = plt.subplots(2,5)
for i in range(10):
    nparray = test_X[i].reshape(28,28)
    image = Image.fromarray(nparray * 255)
    ax[i%2][i//2].imshow(image)
    ax[i%2][i//2].title.set_text(round(predicted[i]))
fig.show()

In [1]:
submission = pd.read_csv("../input/digit-recognizer/sample_submission.csv")
print(submission.head())
submission["Label"] = np.rint(predicted)
submission = submission.astype(int)
submission.to_csv("submission.csv", index=False)
print(submission.head())