# Recognize hand-written digits with Machine Learning

In this tutorial I'll demonstrate how to build a classifier for hand-written digits with the Random Forest algorithm, how to validate the model, and how to tune it for better performance.

To run this IPython notebook you'll need to install Anaconda, an open data science platform powered by Python, which can be downloaded from the link below:
https://www.continuum.io/downloads

The MNIST sample data can be downloaded from Kaggle, the largest data science community in the world, where you can also find numerous tutorials, starter scripts, instructions and much more.

https://www.kaggle.com/c/digit-recognizer

In [None]:
import pandas as pd # Dataframe
from sklearn.ensemble import RandomForestClassifier # Classification algorithm - random forest
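from sklearn import metrics
from sklearn import model_selection as grid_search # sklearn.grid_search was merged into sklearn.model_selection
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split # sklearn.cross_validation was merged into sklearn.model_selection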
import numpy as np
import math
import random as rd
import pylab as pl
import matplotlib.pyplot as plt
%matplotlib inline


# Load Data

In [None]:
mnist = pd.read_csv('train.csv')
print("Loading finished.")
print("Data size:", mnist.shape)

# Data Description

* The dataset (train.csv) contains 42,000 rows.
* Each row contains 785 integers.
    * The first integer is called label and gives the actual digit (0-9) shown in the image.
    * pixel0 to pixel783 are the grayscale values (0-255) of a 28*28 image, stored row by row.
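
To make the row layout concrete, here is a minimal sketch that builds a tiny synthetic frame with the same columns as train.csv and splits one row into its label and its 28*28 image. The frame `toy` and its random values are assumptions made for illustration; they are not the real data.

```python
import numpy as np
import pandas as pd

# Build a tiny synthetic frame with the same layout as train.csv:
# one 'label' column followed by pixel0..pixel783 (values 0-255).
rng = np.random.default_rng(0)
columns = ["label"] + ["pixel%d" % i for i in range(784)]
data = np.column_stack([
    rng.integers(0, 10, size=3),            # labels 0-9
    rng.integers(0, 256, size=(3, 784)),    # grayscale intensities
])
toy = pd.DataFrame(data, columns=columns)

# Split one row into its label and its 28x28 image.
row = toy.iloc[0]
label = int(row["label"])
image = row.drop("label").to_numpy().reshape(28, 28)

# pixel k sits at row k // 28, column k % 28 of the image.
k = 100
assert image[k // 28, k % 28] == row["pixel%d" % k]
print(label, image.shape)
```

The same `reshape(28, 28)` call is what turns each row of pixel values into a plottable image.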

In [None]:
mnist.head(5)

# Visualize the digits

Here we use matplotlib to plot the digits from their pixel values.

In [None]:
images = [ img.reshape(28, 28) for img in mnist.drop('label',axis=1).values]
images = np.array(images)
labels = mnist.label.values

plt.figure(figsize=(10,10), dpi=600)
for i in range(64):
    plt.subplot(8,8,(i+1))
    plt.subplots_adjust(left=None, bottom=None, right=None, top=1, wspace=None, hspace=None)
    plt.title("Label: %d" % (labels[i]))
    plt.axis("off")
    pl.imshow(images[i],cmap=pl.cm.gray_r)
pl.show()

# Let's take a closer look at the 12th digit:

In [None]:
i = 11
plt.figure(figsize=(5,5), dpi=28*28)
plt.title("Label: %d" % (labels[i]))
plt.imshow(images[i],cmap=pl.cm.gray_r)

# Data split
We'll split the data into two parts: training and test.
* Training data will be used to "train" the machine to learn how to recognize the digits - 32,000 records.
* Test data will be used to validate the accuracy of the model - 10,000 records.

In [None]:
train_x = mnist.drop('label',axis=1)[:32000].values
train_y = mnist.label[:32000]
test_x = mnist.drop('label',axis=1)[32000:].values
test_y = mnist.label[32000:]
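
The split above takes rows in file order. As an alternative sketch (not what this notebook does), `train_test_split` can shuffle the rows and keep the proportion of each digit equal in both parts. The arrays `X` and `y` below are synthetic stand-ins, scaled down to 420 rows so the example runs instantly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pixel matrix and labels (assumption: 420 rows
# instead of 42,000, with 42 examples of each digit stored in order).
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(420, 784))
y = np.repeat(np.arange(10), 42)

# Shuffle before splitting and keep class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=320, stratify=y, random_state=0)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```

Shuffling matters most when the file is sorted or grouped in some way; a fixed `random_state` keeps the split reproducible.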

# Model building

## Train the model

In [None]:
clf = RandomForestClassifier()
clf.fit(train_x,train_y)

## Make predictions

In [None]:
predictions = clf.predict(test_x)

## Validate results

In [None]:
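# scikit-learn metrics take the true labels first, then the predictions,
# so rows of the confusion matrix are actual digits and columns are predicted digits.
print("Confusion matrix:\n%s" % metrics.confusion_matrix(test_y, predictions))
print("Accuracy score: %f" % metrics.accuracy_score(test_y, predictions))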
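
Overall accuracy hides which digits are hard. As a minimal sketch, `classification_report` prints precision, recall and F1 for each digit. The example below uses scikit-learn's small built-in 8*8 digits set as a stand-in for MNIST (an assumption, so that it runs on its own); the variable names are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Assumption: scikit-learn's built-in 8x8 digits set stands in for MNIST.
digits = load_digits()
X, y = digits.data, digits.target
X_train, y_train = X[:1400], y[:1400]
X_test, y_test = X[1400:], y[1400:]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Both functions take the true labels first, then the predictions.
cm = confusion_matrix(y_test, y_pred)   # rows: actual, columns: predicted
print(cm)
print(classification_report(y_test, y_pred))
```

In this notebook the equivalent call would be `classification_report(test_y, predictions)`.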

## Visualize incorrectly predicted digits

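The cell below collects the test digits whose actual label is 7 but which the model predicted as 9, and plots up to ten of them. In the title above each image, the first digit is the actual number and the second is what the model predicted.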

In [None]:
incorrect_images = []
incorrect_labels = []
incorrect_predictions = []
incorrect_data = []

actual_label = 7
pred_label = 9

for (image, label, prediction) in zip(test_x, test_y, predictions):
    if label==actual_label and prediction==pred_label:
        incorrect_data.append(image)
        incorrect_images.append(image.reshape(28,28))
        incorrect_labels.append(label)
        incorrect_predictions.append(prediction)

incorrect_images = np.array(incorrect_images)

plt.figure(figsize=(20,10), dpi=600)
for i in range(min([len(incorrect_images),10])):
    plt.subplot(1,min([len(incorrect_images),10]),(i+1))
    plt.title("%d : %d" % (incorrect_labels[i], incorrect_predictions[i]))
    pl.imshow(incorrect_images[i],cmap=pl.cm.gray_r)
pl.show()
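
The pair (7, 9) in the cell above is chosen by hand. One possible way to pick the pair automatically is to look for the largest off-diagonal cell of the confusion matrix. The sketch below uses short hypothetical label arrays in place of test_y and predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, standing in for test_y and predictions.
y_true = np.array([7, 7, 7, 7, 9, 9, 4, 4, 4, 1])
y_pred = np.array([9, 9, 7, 1, 9, 4, 4, 9, 4, 1])

labels = np.unique(np.concatenate([y_true, y_pred]))
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Zero the diagonal so only the errors remain, then find the largest cell.
errors = cm.copy()
np.fill_diagonal(errors, 0)
row, col = np.unravel_index(np.argmax(errors), errors.shape)
actual_label, pred_label = labels[row], labels[col]
print("Most frequent error: actual %d predicted as %d (%d times)"
      % (actual_label, pred_label, errors[row, col]))
```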

# Tune the model
One algorithm may have many parameters, and leaving them untuned can affect the results significantly. Parameter tuning is one of the biggest challenges in practical machine learning. There are typically two approaches to tuning:
* Automated tuning, for instance grid search and Bayesian optimization
    * Requires relatively less knowledge and experience.
    * Time consuming.
* Manual tuning
    * Requires more knowledge and experience.
    * Time efficient.

Here we'll use grid search for automated parameter tuning. Grid search first creates a space of parameter combinations, then trains and validates the model for each combination, and finally picks the best-performing one.
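
The three steps of grid search can be written out by hand. This is a minimal sketch of the idea, not how `GridSearchCV` is implemented internally. It uses scikit-learn's small built-in digits set as a stand-in for MNIST and a deliberately tiny grid (both assumptions) so that it runs quickly.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, cross_val_score

X, y = load_digits(return_X_y=True)   # assumption: small stand-in for MNIST

# 1. Create the space of parameter combinations (2 x 2 = 4 here).
grid = ParameterGrid({"n_estimators": [10, 30], "max_depth": [5, 10]})

# 2. Train and validate a model for every combination.
results = []
for params in grid:
    clf = RandomForestClassifier(random_state=0, **params)
    score = cross_val_score(clf, X, y, cv=3).mean()
    results.append((score, params))

# 3. Keep the combination with the best mean cross-validated accuracy.
best_score, best_params = max(results, key=lambda r: r[0])
print("Best score: %0.3f" % best_score)
print("Best parameters:", best_params)
```

The cost grows with the product of the list lengths, which is why grid search is listed as time consuming.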

There are three parameters we will be tuning for Random Forest:

* n_estimators - The number of trees in the "forest". Generally, more trees give better performance, with diminishing returns and longer training time.
* criterion - The function used to measure the quality of a split in each decision tree.
* max_depth - The maximum depth of each tree. A larger number means more complex trees and a greater chance of overfitting.
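
To see how max_depth trades off underfitting against overfitting, one option is to compare training and test accuracy at several depths. The sketch below does this on scikit-learn's small built-in digits set (an assumption; the numbers will differ on MNIST). A large gap between the two scores is the usual sign of overfitting.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)   # assumption: small stand-in for MNIST
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Record (training accuracy, test accuracy) for each depth.
# max_depth=None lets every tree grow until its leaves are pure.
scores = {}
for depth in [2, 5, 10, None]:
    clf = RandomForestClassifier(n_estimators=30, max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    scores[depth] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print("max_depth=%s train=%.3f test=%.3f" % (depth, *scores[depth]))
```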

In [None]:
def search_model(train_x, train_y, est, param_grid, n_jobs, cv):
    # Evaluate every combination in param_grid with cross-validation.
    # Scoring defaults to the estimator's own score (accuracy for classifiers).
    model = grid_search.GridSearchCV(estimator=est,
                                     param_grid=param_grid,
                                     verbose=10,
                                     n_jobs=n_jobs,
                                     refit=True,  # retrain the best combination on all of train_x
                                     cv=cv)
    # Fit Grid Search Model
    model.fit(train_x, train_y)
    print("Best score: %0.3f" % model.best_score_)
    print("Best parameters set:", model.best_params_)
    return model

param_grid = {'n_estimators': [10, 50, 100],
              'criterion': ['gini', 'entropy'],
              'max_depth': [10, 20, 30]}
model = search_model(train_x,
                     train_y,
                     RandomForestClassifier(),
                     param_grid,
                     n_jobs=1,
                     cv=3)
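
Besides the best score and parameters, a fitted grid search keeps the results of every combination in its `cv_results_` attribute. The sketch below shows one way to view them as a table. It assumes a scikit-learn version where `GridSearchCV` lives in `sklearn.model_selection`, and uses the small built-in digits set with a reduced grid as a stand-in for the search above.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)   # assumption: small stand-in for MNIST

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={"n_estimators": [10, 30],
                                  "criterion": ["gini", "entropy"]},
                      cv=3)
search.fit(X, y)

# One row per parameter combination, best mean CV accuracy first.
table = (pd.DataFrame(search.cv_results_)
           [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
           .sort_values("rank_test_score"))
print(table.to_string(index=False))
```

In this notebook the same table could be built from `model.cv_results_`.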



# Validate the tuned model

In [None]:
tuned_predictions = model.predict(test_x)
print("Confusion matrix:\n%s" % metrics.confusion_matrix(test_y, tuned_predictions))
print("Accuracy score of tuned model: %f" % metrics.accuracy_score(test_y, tuned_predictions))
# Compare against the predictions of the untuned (default) model made earlier.
print("Accuracy score of default model: %f" % metrics.accuracy_score(test_y, predictions))

## What's improved?

We'll use the tuned model to predict the digits that the untuned model got wrong, then plot them to see what's improved and what's not.

In [None]:
tuned_val_predictions = model.predict(incorrect_data)
n = min(len(incorrect_images), 10)  # there may be fewer than 10 misclassified digits
plt.figure(figsize=(20,10), dpi=600)
for i in range(n):
    plt.subplot(1,n,(i+1))
    plt.title("Before: %d, after: %d" % (incorrect_predictions[i],tuned_val_predictions[i]))
    pl.imshow(incorrect_images[i],cmap=pl.cm.gray_r)
pl.show()