### Applying kNN and PCA to the [Kaggle Digit Recognizer](https://www.kaggle.com/c/digit-recognizer) Contest Using MNIST Data

Ben Van Dyke, January 2014

This IPython notebook shows my initial solution to the Kaggle Digit Recognizer contest. I use scikit-learn to reduce dimensionality with PCA, normalize the training and test data, cross-validate on the training data, and finally classify the test data. My submission scored 0.96786, better than the benchmark kNN score. That is strong performance considering the simplicity and readability of this implementation.

In [7]:
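from __future__ import print_function  # __future__ imports must come first
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20;
# model_selection provides the same cross_val_score
from sklearn import model_selection as cross_validation
%matplotlib inline
import matplotlib.pyplot as plt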

#### Preprocessing

In [6]:
# load the train and test data
train = np.loadtxt('train.csv',delimiter=',',skiprows=1)
test = np.loadtxt('test.csv',delimiter=',',skiprows=1)

In [None]:
train

In [None]:
np.shape(train)

In [None]:
# separate labels from training data
train_data = train[:,1:]
train_labels = train[:,0]

In [None]:
# select number of components to extract
pca = PCA(n_components=40)

In [None]:
# fit to the training data
pca.fit(train_data)

In [None]:
# determine amount of variance explained by components
np.sum(pca.explained_variance_ratio_)

In [None]:
# plot the explained variance
plt.plot(pca.explained_variance_ratio_)
plt.title('Variance Explained by Extracted Component')
plt.show()

With 40 components extracted, about 79% of the total variance in the dataset is explained. 
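
To make the explained-variance figure more concrete, here is a minimal NumPy-only sketch of how such a ratio can be computed and used to pick a component count. The data is synthetic (an assumption, not the Kaggle pixels), and the names `latent`, `mixing`, `target` and `n_needed` are hypothetical. The calculation is meant to mirror what `explained_variance_ratio_` reports, not to replace it.

```python
import numpy as np

# Synthetic stand-in for the pixel matrix (an assumption, not the Kaggle data)
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))    # 5 underlying factors
mixing = rng.normal(size=(5, 30))     # spread across 30 observed features
X = latent @ mixing + 0.1 * rng.normal(size=(500, 30))

# PCA via the singular values of the centered data
X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)

# Variance along each component, as a fraction of the total variance
component_variance = singular_values ** 2 / (len(X) - 1)
explained_ratio = component_variance / component_variance.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches the target
target = 0.79
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(n_needed, cumulative[n_needed - 1])
```

Plotting `cumulative` instead of the per-component ratio can make it easier to read off how many components are needed for a given share of the variance.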

In [None]:
# extract the features
train_ext = pca.fit_transform(train_data)
print(train_ext.shape)

Feature extraction reduces the training data from 784 pixel columns to 40 component columns.

In [None]:
# transform the test data using the existing parameters
test_ext = pca.transform(test)
print(test_ext.shape)

Because the nearest neighbors classifier relies on distances between points, the features need to be on a common scale, so the data is normalized.

In [None]:
min_max_scaler = MinMaxScaler()

In [None]:
# fit the scaler on the training data, then reuse its parameters for the test data
train_norm = min_max_scaler.fit_transform(train_ext)
test_norm = min_max_scaler.transform(test_ext)

In [None]:
test_norm
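
As a rough illustration of what min-max scaling does, the sketch below rescales two made-up feature columns by hand. The arrays and variable names are hypothetical. The minimum and range are learned from the training rows only and then applied to both sets, so the training and test features share one mapping.

```python
import numpy as np

# Hypothetical extracted features: two columns on very different scales
train_feats = np.array([[1.0, 200.0], [3.0, 400.0], [5.0, 1000.0]])
test_feats = np.array([[2.0, 600.0], [6.0, 100.0]])

# Learn the per-column minimum and range from the training data only
col_min = train_feats.min(axis=0)
col_range = train_feats.max(axis=0) - col_min

# Apply the same mapping to both sets
train_scaled = (train_feats - col_min) / col_range
test_scaled = (test_feats - col_min) / col_range

print(train_scaled)
print(test_scaled)
```

With this approach, test values can fall slightly outside the [0, 1] interval when they exceed the range seen in training. That is expected and keeps distances comparable between the two sets.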

#### Training

In [None]:
# fit the model to the training data using defaults
# n_neighbors = 5
knn = KNeighborsClassifier()
knn.fit(train_norm, train_labels)

In [None]:
cross_validation.cross_val_score(knn, train_norm, train_labels, cv=5)

Five-fold cross-validation estimates how the classifier is likely to perform on unobserved data drawn from the same population. In this case, the classifier performed well, with about 97% accuracy across the folds.
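
For readers curious about what a k-fold score involves, here is a simplified NumPy sketch that pairs a brute-force kNN vote with a manual five-fold split. The functions `knn_predict` and `k_fold_scores` are hypothetical helpers, and the two-cluster data is synthetic. The split here is a plain shuffled partition, which is not the same as the stratified folds scikit-learn uses for classifiers, so treat it as an illustration of the idea only.

```python
import numpy as np

def knn_predict(train_X, train_y, query_X, k=5):
    """Predict labels by majority vote among the k nearest training points."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train)
    diffs = query_X[:, None, :] - train_X[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = train_y[nearest]
    # Most common label in each row of votes
    return np.array([np.bincount(row).argmax() for row in votes])

def k_fold_scores(X, y, n_folds=5, k=5, seed=0):
    """Return the accuracy measured on each held-out fold."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    folds = np.array_split(order, n_folds)
    scores = []
    for i, held_out in enumerate(folds):
        # Fit on every fold except the one being held out
        fit_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = knn_predict(X[fit_idx], y[fit_idx], X[held_out], k=k)
        scores.append(float((pred == y[held_out]).mean()))
    return scores

# Synthetic two-class data (an assumption, not the digit data)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(6.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

scores = k_fold_scores(X, y)
print(scores)
```

Each entry in `scores` is the accuracy on one held-out fold, so the spread across entries gives a rough sense of how stable the estimate is.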

#### Predicting

In [None]:
# predict the test classes
pred = knn.predict(test_norm)

In [None]:
# write predictions to a file, one "index,label" row per test image
save = pred.round()
ind = np.arange(1, len(pred) + 1)  # 1-based row ids
new_save = np.column_stack((ind, save))
np.savetxt('knnpca.csv', new_save, delimiter=',', fmt='%0.0f')