We're going to create an entry for the [Digit Recognizer Kaggle Competition](https://www.kaggle.com/c/digit-recognizer).

The data CSV files (`train.csv` and `test.csv`) can be downloaded [here](https://www.kaggle.com/c/digit-recognizer/data).

Import the libraries we need:

In [None]:
import pandas as pd    # data formatting
import numpy as np     # numeric library
from sklearn.neighbors import KNeighborsClassifier  # machine learning
from sklearn.metrics import confusion_matrix
import random

In [None]:
%matplotlib inline
from matplotlib import pylab, pyplot  # plotting

Import the data from csv files into Pandas DataFrames. A DataFrame is similar to an Excel spreadsheet; it is made up of columns and rows.

In [None]:
train = pd.read_csv('train.csv', header=0)
test = pd.read_csv('test.csv', header=0)

Let's see what the data looks like:

In [None]:
train.shape

This DataFrame has 42,000 rows and 785 columns. Each row corresponds to one digit. Column 0 holds a label (0-9) saying which digit it is. Each of the other 784 columns holds the value of a single pixel.

Each 28x28 square image has been *unrolled* (or *reshaped*) into a single long 1x784 row. There are 42,000 handwritten digits represented here!
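
To make the unrolling concrete, here is a minimal sketch on a made-up 28x28 array (not the Kaggle data). It assumes the pixels are unrolled row by row, which is NumPy's default order; the variable names are only for illustration.

```python
import numpy as np

# A hypothetical 28x28 "image" whose pixel values are just 0..783,
# so each value records its own position in the unrolled row.
image = np.arange(28 * 28).reshape(28, 28)

# Unroll the square into a single row of 784 values (row by row).
row = image.reshape(784)

# The pixel at image row r, column c lands at position r * 28 + c.
r, c = 3, 5
print(row[r * 28 + c] == image[r, c])

# Reshaping the row back to 28x28 recovers the original square.
restored = row.reshape(28, 28)
print(np.array_equal(restored, image))
```

Reshaping back to 28x28 is the step we need whenever we want to display a row as an image.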

In [None]:
train.head()

In [None]:
train.iloc[41990:41999,0:10]

In [None]:
test.head()

In [None]:
train.describe()

Let's visualize a single row:

In [None]:
print("this digit is", train.iloc[25, 0])
# Columns 1..784 hold the pixels; convert to a NumPy array and restore the 28x28 square
digit = train.iloc[25, 1:].to_numpy().reshape(28, 28)
pylab.imshow(digit, cmap='gray')

Working with all 42,000 rows is slow, so let's take a random subset of 5,000. Let's also hold out part of that labeled data as our own test set, so that we can calculate our accuracy.

In [None]:
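rows = random.sample(list(train.index), 5000)  # 5,000 distinct row labels, in random order
train_small = train.loc[rows[:4000]]           # first 4,000 for training
test_small = train.loc[rows[4000:]]            # remaining 1,000 held out for testing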
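
Because the subset is drawn at random, the rows (and any row positions used later) change on every run. One possible way to make the split repeatable is to fix a seed. The sketch below does this with `DataFrame.sample` and scikit-learn's `train_test_split` on a small made-up DataFrame; `demo`, the column names, and the seed value 42 are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for `train`: 100 rows, a label column plus 4 pixel columns.
rng = np.random.RandomState(0)
demo = pd.DataFrame(rng.randint(0, 256, size=(100, 4)),
                    columns=['pixel0', 'pixel1', 'pixel2', 'pixel3'])
demo.insert(0, 'label', rng.randint(0, 10, size=100))

# Draw 50 rows, then split them 40/10 into training and held-out sets.
# random_state fixes the shuffle so the same rows are chosen on every run.
subset = demo.sample(n=50, random_state=42)
demo_train, demo_test = train_test_split(subset, test_size=10, random_state=42)

print(demo_train.shape, demo_test.shape)
```

The same pattern should scale to the real data by sampling 5,000 rows and holding out 1,000.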

In [None]:
train_small.head()

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)

In [None]:
knn.fit(train_small.iloc[:,1:], train_small.iloc[:,0])

In [None]:
predictions = knn.predict(test_small.iloc[:,1:])
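
As a rough sketch of what happens at prediction time: for each query row, a k-nearest-neighbors classifier finds the k closest training rows and takes a majority vote over their labels. The hand-written version below assumes Euclidean distance and unweighted votes (the `KNeighborsClassifier` defaults). It runs on a tiny made-up data set, and `knn_predict_one` is a hypothetical helper, not part of scikit-learn. Tie-breaking may differ from the library's.

```python
import numpy as np
from collections import Counter

def knn_predict_one(train_X, train_y, query, k=5):
    """Predict a label for `query` by majority vote among its k nearest rows."""
    # Euclidean distance from the query to every training row.
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Positions of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]

# Tiny made-up data set: two clusters in 2-D, labeled 0 and 1.
train_X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
train_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict_one(train_X, train_y, np.array([0.5, 0.5]), k=3))
print(knn_predict_one(train_X, train_y, np.array([5.5, 5.5]), k=3))
```

For the digits, each "point" is a 784-dimensional row of pixel values, but the idea is the same.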

In [None]:
test_labels = test_small.iloc[:,0].values

In [None]:
predictions[:10]

In [None]:
test_labels[:10]

In [None]:
accuracy = np.mean(predictions == test_labels)  # fraction of held-out digits predicted correctly
print(accuracy)
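
The accuracy is simply the fraction of predictions that match the true labels. As a quick check on made-up labels, the sketch below compares that calculation with scikit-learn's `accuracy_score`; the arrays `y_true` and `y_pred` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Five made-up labels; exactly one prediction (the fourth) is wrong.
y_true = np.array([3, 1, 4, 1, 5])
y_pred = np.array([3, 1, 4, 7, 5])

# Fraction of matching positions, computed two ways.
manual = np.mean(y_true == y_pred)
library = accuracy_score(y_true, y_pred)

print(manual, library)
```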

Which ones are we getting wrong?

In [None]:
# Positions (within test_small) where the prediction differs from the true label
wrong = np.where(predictions != test_labels)[0]

In [None]:
wrong

In [None]:
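# 920 comes from the `wrong` output of the original run; with a different random
# subset, substitute a position from your own `wrong` output.
print("predicted:", predictions[920], "answer:", test_labels[920])
digit = test_small.iloc[920, 1:].to_numpy().reshape(28, 28)
pylab.imshow(digit, cmap='gray')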

What are the nearest neighbors to that wrong one?

In [None]:
# kneighbors expects 2-D input, so select the row with a list ([920]) to keep a one-row DataFrame
dist, ind = knn.kneighbors(test_small.iloc[[920], 1:])
print(ind)
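
To show what `kneighbors` returns, here is a minimal sketch on a made-up one-dimensional data set. The names `X`, `y`, and `demo_knn` are assumptions for illustration. The method returns two arrays with one row per query point: the distances to the nearest neighbors, and the positions of those neighbors in the training data, closest first.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Six made-up training points on a line, in two groups.
X = np.array([[0.], [1.], [2.], [10.], [11.], [12.]])
y = np.array([0, 0, 0, 1, 1, 1])
demo_knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# The query must be 2-D: one row per point to look up.
dist, ind = demo_knn.kneighbors(np.array([[1.9]]))

print(ind)        # positions of the 3 nearest training rows
print(y[ind[0]])  # labels of those neighbors
```

Since the returned values are positions rather than index labels, they are the right thing to pass to `.iloc` on the training DataFrame.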

In [None]:
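# 3054 is one of the positions printed in `ind` for the original run; use a value from your own output
print("this is a", train_small.iloc[3054, 0])
digit = train_small.iloc[3054, 1:].to_numpy().reshape(28, 28)
pylab.imshow(digit, cmap='gray')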

Confusion Matrix: See how data is labeled and mislabeled, by category.

In [None]:
cm = confusion_matrix(test_labels, predictions)
pyplot.matshow(cm)
pyplot.title('Confusion matrix')
pyplot.colorbar()
pyplot.ylabel('True label')
pyplot.xlabel('Predicted label')
pyplot.show()
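
To read the plot, it helps to see the layout on a tiny made-up example. In scikit-learn's `confusion_matrix`, row i and column j count the samples whose true label is i and whose predicted label is j, so correct predictions sit on the diagonal. The labels below are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

# Six made-up samples with three classes (0, 1, 2).
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]

# Rows are true labels, columns are predicted labels.
cm_demo = confusion_matrix(y_true, y_pred)
print(cm_demo)
```

Here one true 0 was predicted as 1 (row 0, column 1) and one true 1 was predicted as 2 (row 1, column 2).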

Since we're usually right, the diagonal dominates the color scale and the mistakes all end up near the bottom of it, which makes them hard to tell apart. Let's ignore the cases where we are right by setting the diagonal to zero:

In [None]:
np.fill_diagonal(cm, 0)
pyplot.matshow(cm)
pyplot.title('Confusion matrix')
pyplot.colorbar()
pyplot.ylabel('True label')
pyplot.xlabel('Predicted label')
pyplot.show()
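
Besides plotting, the largest off-diagonal entry can be located directly. The sketch below uses a made-up 3x3 matrix; `cm_demo` and its counts are invented for illustration. Note that `np.fill_diagonal` modifies its argument in place, so the sketch works on a copy to keep the original counts.

```python
import numpy as np

# Made-up confusion matrix: rows are true labels, columns are predicted labels.
cm_demo = np.array([[50, 0, 1],
                    [2, 45, 7],
                    [0, 3, 40]])

# Zero the diagonal on a copy so only the mistakes remain.
errors = cm_demo.copy()
np.fill_diagonal(errors, 0)

# Convert the flat position of the largest entry back to (row, column).
true_label, predicted_label = np.unravel_index(np.argmax(errors), errors.shape)
print(true_label, predicted_label, errors[true_label, predicted_label])
```

In this made-up example the most common mistake is a true 1 predicted as 2.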

To enter the Kaggle competition, predict labels for the unlabeled test set and write them to a CSV file:

In [None]:
kaggle_predictions = knn.predict(test)

In [None]:
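# List ImageId first: on Python 3.7+ a dict keeps insertion order, which sets the CSV column order
kaggle_predictions_df = pd.DataFrame({'ImageId': range(1, 1 + len(kaggle_predictions)),
                                      'Label': kaggle_predictions})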

In [None]:
kaggle_predictions_df.to_csv("predictions.csv", index=False)
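
Before uploading, it can be worth checking what the file will look like. The sketch below builds a three-row submission from made-up predictions and writes it to an in-memory buffer instead of disk. It assumes the competition expects an `ImageId,Label` header with image IDs starting at 1; `fake_predictions` is invented for illustration.

```python
import io
import numpy as np
import pandas as pd

# Three made-up predictions standing in for the real model output.
fake_predictions = np.array([2, 0, 9])

submission = pd.DataFrame({'ImageId': range(1, 1 + len(fake_predictions)),
                           'Label': fake_predictions})

# Write to an in-memory buffer so we can inspect the exact CSV text.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the DataFrame's row index out of the file, so the CSV contains only the two named columns.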