# Kaggle Digit Recognition

This is one of the problems proposed on Kaggle: handwritten digit recognition using the MNIST data. We use scikit-learn.

In [None]:
import pandas as pd
from time import perf_counter as clock  # time.clock was removed in Python 3.8

We begin by loading both the training and the testing data.

In [None]:
# Read training data
start = clock()

train_frame = pd.read_csv('data/train.csv')
label = train_frame['label'].values
train = train_frame.iloc[:,1:].values

print('Loaded {:d} train entries in {:.0f} seconds.'.format(len(train), clock() - start))

# Train on fewer entries
# label = label[0::10]
# train = train[0::10]

# Read test data 
start = clock()

test_frame = pd.read_csv('data/test.csv')
test = test_frame.values

print('Loaded {:d} test entries in {:.0f} seconds.'.format(len(test), clock() - start))
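
The split into labels and pixels above relies on the layout of `train.csv`: the first column holds the digit and the remaining columns hold pixel intensities. A minimal sketch of that split, using a made-up two-row CSV with only four pixel columns (the names `csv_text`, `demo_frame`, `demo_label` and `demo_pixels` are illustrative and not part of the notebook):

```python
import io

import pandas as pd

# Hypothetical miniature of train.csv: a 'label' column followed by pixel columns.
# Only 4 pixel columns are used here for brevity; a 28x28 image needs 784.
csv_text = "label,pixel0,pixel1,pixel2,pixel3\n7,0,128,255,0\n2,0,0,64,32\n"

demo_frame = pd.read_csv(io.StringIO(csv_text))
demo_label = demo_frame['label'].values       # first column: the digit
demo_pixels = demo_frame.iloc[:, 1:].values   # every other column: pixel intensities

print(demo_label)
print(demo_pixels.shape)
```

The same slicing with a step, such as `demo_pixels[0::10]`, keeps every tenth row, which is what the commented-out "train on fewer entries" lines do.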

To make sure that the data loaded correctly, we can visualize a random digit from the training set.

In [None]:
# Select a random entry

from random import randint  

i = randint(0,len(train)-1)
print("Displayed train entry {:d} labelled {:d}.".format(i, label[i]))

# Plot using matplotlib

import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline  
    
train_square = train.reshape(-1,28,28)
plt.imshow(train_square[i], cmap=cm.binary)
plt.axis('off')
plt.show()
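
The call `reshape(-1,28,28)` turns each flat row of 784 values back into a 28 by 28 grid. As a small sketch of how the indices map (assuming NumPy's default row-major order; `flat`, `square` and `k` are illustrative names):

```python
import numpy as np

# Stand-in for one flattened image: 784 values numbered 0..783.
flat = np.arange(28 * 28).reshape(1, -1)

# -1 lets NumPy infer the number of images from the total size.
square = flat.reshape(-1, 28, 28)

# In row-major order, flat pixel k lands at row k // 28, column k % 28.
k = 100
row, col = k // 28, k % 28
print(square.shape, row, col, square[0, row, col])
```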

## Select Classifier

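A first demonstration can be done quickly with a random forest or with extremely randomized trees. Each of the following cells reassigns `clf`, so run only the one for the classifier you want to use.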

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators = 100)

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
clf = ExtraTreesClassifier(n_estimators = 100)

Another option is to first reduce the dimensionality of the data with PCA and then feed the result to an SVC, chaining both steps in a pipeline.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVC

pca = PCA(n_components = 35, whiten = True)
clf = make_pipeline(pca, SVC())
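
When a pipeline is built with `make_pipeline`, each step gets a name derived from its lowercased class name, and the individual steps can be retrieved by that name later. A short sketch (`demo_clf` is an illustrative name so the notebook's `clf` is left untouched):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Same construction as above, under a different variable name.
demo_clf = make_pipeline(PCA(n_components=35, whiten=True), SVC())

# Step names are generated from the class names, lowercased.
step_names = list(demo_clf.named_steps)
print(step_names)

# A step can be looked up by its generated name.
pca_step = demo_clf.named_steps['pca']
print(pca_step.n_components)
```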

## Validate, Train, Extrapolate
We can use cross-validation to get an idea of how well the classifier generalizes.

In [None]:
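from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in 0.20
start = clock()

# cv=3 matches the 3-fold results reported below (newer versions default to 5 folds)
scores = cross_val_score(clf, train, label, cv = 3)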

print("Performed {:d}-fold cross validation in {:.0f} seconds with accuracy {:0.4f} +/- {:0.4f}.".format(
    len(scores), clock() - start, scores.mean(), scores.std()))

Results:
- Random Forest with 1000 estimators performed 3-fold cross validation in 655 seconds with accuracy 0.9645 +/- 0.0022.
- ExtraTrees performed 3-fold cross validation in 49 seconds with accuracy 0.9656 +/- 0.0006.
- PCA+SVM performed 3-fold cross validation in 132 seconds with accuracy 0.9777 +/- 0.0013.
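
The "accuracy +/- spread" figures above are the mean and standard deviation of the per-fold scores. The following sketch reproduces that computation on a stand-in dataset, scikit-learn's bundled 8x8 digits rather than the Kaggle files, so the numbers it prints are not comparable to the results listed here. The names `X_demo`, `y_demo`, `demo_clf` and `demo_scores`, and the choice of 20 estimators, are assumptions made for a fast illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data: the small 8x8 digits set that ships with scikit-learn.
X_demo, y_demo = load_digits(return_X_y=True)

# A small forest keeps the demonstration fast; random_state makes it repeatable.
demo_clf = ExtraTreesClassifier(n_estimators=20, random_state=0)

# cross_val_score returns one accuracy per fold.
demo_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=3)

print("{:d} folds: {:.4f} +/- {:.4f}".format(
    len(demo_scores), demo_scores.mean(), demo_scores.std()))
```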

We are now ready to fit the classifier to the training data, predict/extrapolate to the test data, and save the results.

In [None]:
# Fit training data

start = clock()
clf.fit(train, label)
print("Fitted training data in {:.0f} seconds.".format(clock() - start))

# Extrapolate to test data

start = clock()
predict = clf.predict(test)
print("Extrapolated to test data in {:.0f} seconds.".format(clock() - start))

# Save results

test_frame['ImageId'] = range(1,len(test)+1)
test_frame['Label'] = predict
test_frame.to_csv('predict.csv', columns = ['ImageId', 'Label'], index = False)

If we used PCA+SVC, PCA can tell us how much of the variance is explained.

In [None]:
pca_step = clf.named_steps['pca']  # make_pipeline names each step after its lowercased class name
variance = sum(pca_step.explained_variance_ratio_)
print("PCA uses {:d} components explaining {:.0%} of the variance.".format(pca_step.n_components_, variance))
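
For intuition, the explained variance ratio can be recomputed by hand from the singular values of the centered data. This is a sketch on synthetic data, not the digit images, and it illustrates the quantity rather than how scikit-learn computes it internally; `X`, `Xc`, `s` and `ratio` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 5 features with decreasing spread.
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

Xc = X - X.mean(axis=0)                    # PCA works on centered data
s = np.linalg.svd(Xc, compute_uv=False)    # singular values, largest first
ratio = s**2 / np.sum(s**2)                # share of total variance per component

# Keeping the first two components retains this fraction of the variance.
print("{:.0%}".format(ratio[:2].sum()))
```

Summing the first `n_components` entries of such a ratio vector is what the cell above does with `explained_variance_ratio_`.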

## Improvements
We can attempt to select SVC's parameters by optimizing with grid search.

In [None]:
# Transform data
from sklearn.decomposition import PCA

n_comp = 35
pca = PCA(n_components = n_comp, whiten = True)

start = clock()
pca.fit(train)
train_transformed = pca.transform(train)

print("Transformed data in {:.0f} seconds using {:d} components explaining {:.0%} of the variance.".format(
        clock() - start, n_comp, sum(pca.explained_variance_ratio_)))

# Select classifier
from sklearn.svm import SVC

algo = 'rbf'
tol = 0.01
clf = SVC(kernel = algo, tol = tol, shrinking = True)

# Search on fewer entries
label_few = label[0::10]
train_few = train_transformed[0::10]

# Parameter space to search for SVC
from numpy import logspace
params = [{'C': logspace(-1, 3), 'gamma': logspace(-4, -1)}]
    
# Run exhaustive grid search
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in 0.20
start = clock()

gs = GridSearchCV(estimator = clf, param_grid = params, n_jobs = 2)
gs.fit(train_few, label_few)

print("Parameters optimized to {} yielding {:.4f} in {:.0f} seconds.".format(
        gs.best_params_, gs.best_score_, clock() - start))

Here are some choices of parameters for SVC that seem to work reasonably well.
- C = 4.2919342601287758 and gamma = 0.028117686979742307 gives a cross-validated accuracy of 0.8857 in 6 seconds.
- C = 1.3894954943731375 and gamma = 0.042919342601287783 gives a cross-validated accuracy of 0.9502 in 27 seconds.
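
It helps to know how large the search space is: `logspace` produces 50 points by default, so the grid above contains 50 x 50 candidates, each fitted once per cross-validation fold. The sketch below counts them and shows a coarser grid as one possible way to cut the cost; the coarse grids `C_coarse` and `gamma_coarse` are an assumption for illustration, not something the notebook uses.

```python
import numpy as np
from numpy import logspace

# Grid as defined in the notebook: num defaults to 50 points per axis.
C_values = logspace(-1, 3)
gamma_values = logspace(-4, -1)
n_candidates = len(C_values) * len(gamma_values)
print(n_candidates)

# The reported C = 4.2919... is simply one point of that grid.
print(np.isclose(C_values[20], 4.2919342601287758))

# Hypothetical coarser grid: one point per decade.
C_coarse = logspace(-1, 3, num=5)
gamma_coarse = logspace(-4, -1, num=4)
print(len(C_coarse) * len(gamma_coarse))
```

A common pattern is to run the coarse grid first and then search a finer grid around the best coarse point.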

Scikit-learn also has neural networks available.

In [None]:
# MLPClassifier requires scikit-learn 0.18 or later
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(solver = 'lbfgs', alpha = 1e-5, hidden_layer_sizes = (5, 2), random_state = 1)