# Digit Recognition

In this project, I work with the MNIST dataset, a large collection of images in which each image represents a digit from `0` to `9`. The aim is to train a model on the data and classify the images correctly. As an extension, I use `GridSearchCV` to tune the model's parameters and extract the best combination.

## Import libraries and dataset

I import `numpy` and `pandas` to work with the data. Next, I use `matplotlib` to display images and draw visualizations. Finally, I import the necessary `sklearn` subpackages to train and evaluate the model.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cm import rainbow
%matplotlib inline

from sklearn.datasets import fetch_openml
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

We can read the dataset from the files already present in Kaggle.

In [None]:
train_data = pd.read_csv('../input/mnist_train.csv')
test_data = pd.read_csv('../input/mnist_test.csv')

In [None]:
print("Training data:")
print("Shape: {}".format(train_data.shape))
print("Total images: {}".format(train_data.shape[0]))

print("Testing data:")
print("Shape: {}".format(test_data.shape))
print("Total images: {}".format(test_data.shape[0]))

There are **60,000 training images** and **10,000 testing images**. The class of each image is stored in the column named `label`. I'll now separate the features and labels for both the training and testing data.

In [None]:
train_y = train_data['label']
train_X = train_data.drop(columns = ['label'])

test_y = test_data['label']
test_X = test_data.drop(columns = ['label'])
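Before moving on, it can help to confirm that the feature frame has the layout the later cells rely on. The following is a minimal sketch on a small synthetic frame, not the real data: the names `demo_data`, `demo_X`, `demo_y` and the `pixel{i}` column names are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_data: 5 random "images" plus a label column.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(5, 784))
demo_data = pd.DataFrame(pixels, columns=["pixel{}".format(i) for i in range(784)])
demo_data.insert(0, "label", rng.integers(0, 10, size=5))

# Same split as above: labels in one Series, pixels in one DataFrame.
demo_y = demo_data["label"]
demo_X = demo_data.drop(columns=["label"])

# Each row should hold 28 * 28 = 784 pixel intensities in the range 0-255.
assert demo_X.shape[1] == 28 * 28
assert demo_X.values.min() >= 0 and demo_X.values.max() <= 255
print(demo_X.shape, demo_y.shape)
```

The same two assertions could be run against `train_X` and `test_X` to catch a stray extra column before any reshaping to `28 x 28`.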

## Exploring data

Here, I'll set the testing dataset aside and work with the training data only. The test data will be used only for the final model evaluation.

### Analysing class distribution

First, I count the labels in the training set and check whether the classes contain a comparable number of images.

In [None]:
train_labels = train_y.value_counts()
plt.figure(figsize = (12, 8))
cmap = rainbow(np.linspace(0, 1, train_labels.shape[0]))
plt.bar(train_labels.index.values, train_labels, color = cmap)
plt.xticks(train_labels.index.values)
plt.xlabel('Digits')
plt.ylabel('Count of images')
plt.title('Count of images for each digit (0 - 9)')

While the number of images varies somewhat across classes, the variation is small enough to ignore, as it won't hugely impact the training of the model. We can proceed to analyse the data further.
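The bar chart gives a visual impression of the balance; one way to put a number on it is to compare class shares and the ratio between the largest and smallest class. This is a sketch on made-up labels (`demo_labels` is hypothetical), so the printed values say nothing about MNIST itself.

```python
import pandas as pd

# Hypothetical label column, used only to illustrate the calculation.
demo_labels = pd.Series([0] * 90 + [1] * 110 + [2] * 100)

counts = demo_labels.value_counts().sort_index()
shares = counts / counts.sum()                  # fraction of images per class
imbalance_ratio = counts.max() / counts.min()   # 1.0 means perfectly balanced

print(shares.round(3))
print("Largest / smallest class: {:.2f}".format(imbalance_ratio))
```

Applying the same lines to `train_y` would show how far the real classes are from a ratio of 1.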

### Viewing the training data images

I'll now randomly select 10 images from the training data and view them.

In [None]:
np.random.seed(0)
plt.figure(figsize = (20, 8))
for i in range(10):
    index = np.random.randint(train_X.shape[0])
    image_matrix = train_X.iloc[index].values.reshape(28, 28)
    plt.subplot(2, 5, i+1)
    plt.imshow(image_matrix, cmap=plt.cm.gray)

While it's easy to recognize the digit represented by each image, note that images of the same digit do not all look alike. The three images of `4` differ from one another in style as well as thickness. This makes the task challenging for a machine learning model. Let's see how Random Forest performs on the dataset.
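To compare several examples of one digit side by side rather than relying on a random draw, the rows can be filtered by label first. The sketch below uses synthetic stand-ins (`demo_X`, `demo_y` are hypothetical, with labels cycling through 0 to 9) and only prepares the `28 x 28` arrays; each could then be passed to `plt.imshow` as in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: 50 random "images" with labels cycling 0, 1, ..., 9.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.integers(0, 256, size=(50, 784)))
demo_y = pd.Series(np.arange(50) % 10)

digit = 4
# Positions of the first three rows labelled with the chosen digit.
positions = np.flatnonzero(demo_y.values == digit)[:3]

# Reshape each flat 784-value row into a 28 x 28 pixel grid.
images = [demo_X.iloc[p].values.reshape(28, 28) for p in positions]
print(positions, [img.shape for img in images])
```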

## Applying Machine Learning

After analysing the data, I'll now apply **Random Forest Classifier** using the default parameter values.

In [None]:
random_forest_classifier = RandomForestClassifier()
random_forest_classifier.fit(train_X, train_y)
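Two things make a default `RandomForestClassifier()` hard to reproduce: each fit is randomized, and the default number of trees depends on the scikit-learn version (10 before version 0.22, 100 from 0.22 onwards). The sketch below pins both explicitly. It uses scikit-learn's small `load_digits` dataset as a stand-in for MNIST, and the helper `fit_and_predict` is a hypothetical name introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# load_digits (8 x 8 images) is a small stand-in for MNIST here.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

def fit_and_predict(seed):
    # Fixing n_estimators and random_state makes the fit repeatable.
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    return model.predict(X_test)

first_run = fit_and_predict(seed=0)
second_run = fit_and_predict(seed=0)
print("Identical predictions:", np.array_equal(first_run, second_run))
```

Without a fixed `random_state`, repeated runs of the notebook can report slightly different accuracies.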

## Classifying and analysing

Now, I'll use the trained model to classify the testing images and analyse the results. First, I'll use the `predict` method to predict the results on the test data.

In [None]:
pred_y = random_forest_classifier.predict(test_X)

### Metrics

I'll use **accuracy_score** and **confusion_matrix** to analyse the classification done by the model.

In [None]:
print("Accuracy: {}%".format(accuracy_score(test_y, pred_y)*100))
print("Confusion Matrix:")
print("{}".format(confusion_matrix(test_y, pred_y)))

### Analysing

Taking a look at the metrics, the model has performed very well on the test data and has achieved an **accuracy of 94.42%**. The confusion matrix shows the same trend: the majority of images in each class are correctly identified, as can be seen from the large values along the diagonal.
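The diagonal can also be turned into a per-class score. In scikit-learn's `confusion_matrix`, rows are true classes and columns are predicted classes, so dividing the diagonal by the row sums gives the fraction of each digit that was recognized (its recall). The labels below are made up for three classes purely to illustrate the arithmetic.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions for three classes, for illustration only.
true_y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
demo_pred_y = np.array([0, 0, 1, 1, 1, 1, 2, 2, 0, 2])

cm = confusion_matrix(true_y, demo_pred_y)

# Rows are true classes, so diagonal / row sum is the per-class recall.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
# The sum of the diagonal over all entries equals accuracy_score.
overall_accuracy = cm.diagonal().sum() / cm.sum()

print(cm)
print("Recall per class:", per_class_recall.round(3))
print("Accuracy:", overall_accuracy)
```

Run on `confusion_matrix(test_y, pred_y)`, the same two lines would show which digits the model finds hardest.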

Let's also take a look at 10 examples from the test data and see how well our model predicts them.

In [None]:
np.random.seed(0)
plt.figure(figsize = (20, 8))
for i in range(10):
    index = np.random.randint(test_X.shape[0])
    image_matrix = test_X.iloc[index].values.reshape(28, 28)
    plt.subplot(2, 5, i+1)
    plt.imshow(image_matrix, cmap=plt.cm.gray)
    # Pass a one-row DataFrame so the column names match those seen during fit
    predicted = random_forest_classifier.predict(test_X.iloc[[index]])[0]
    plt.title("Model predicted number: {}".format(predicted))

It's amazing how the model is even able to predict the 9th image as `2`, which is difficult to recognize even for a human. However, it incorrectly predicts the 8th image as `3` while it is clearly a `5`.
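Rather than hoping a random sample contains a mistake, the misclassified images can be located directly by comparing the two label arrays. This is a sketch on short made-up arrays (`true_labels` and `predicted` are hypothetical); with the notebook's data the comparison would be between `test_y.values` and `pred_y`.

```python
import numpy as np

# Hypothetical true labels and predictions, for illustration only.
true_labels = np.array([5, 2, 7, 5, 9])
predicted = np.array([3, 2, 7, 5, 4])

# Positions where the prediction differs from the true label.
wrong = np.flatnonzero(true_labels != predicted)

for i in wrong:
    print("index {}: true {}, predicted {}".format(i, true_labels[i], predicted[i]))
```

The resulting positions can be fed to `test_X.iloc[...]` to plot only the images the model got wrong.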

# Parameter Tuning

In this part, I will use `GridSearchCV` to identify the best parameters for the estimator and then check whether the accuracy on the test data has improved.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 50, 100],
    'min_samples_split': [2, 4],
    'max_features': ['sqrt', 'log2']
}

grid = GridSearchCV(random_forest_classifier, param_grid = param_grid, cv = 5, verbose = 5, n_jobs = -1)
grid.fit(train_X, train_y)
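A grid search trains one model per parameter combination per fold, so it is worth estimating the cost before launching it on 60,000 images. The sketch below counts the fits for the grid above using `ParameterGrid`; the names `cv_folds`, `n_candidates` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 50, 100],
    'min_samples_split': [2, 4],
    'max_features': ['sqrt', 'log2']
}
cv_folds = 5

# Every combination of the listed values is one candidate model.
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * cv_folds

print("{} candidates x {} folds = {} fits".format(n_candidates, cv_folds, n_fits))
```

With the default `refit=True`, `GridSearchCV` trains one more model on the full training set using the winning combination.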

The fitted `grid` now holds the best parameter combination for the **Random Forest Classifier**, along with a model refit on the full training data using that combination. I will now save that model in the variable `best_estimator`.

In [None]:
best_estimator = grid.best_estimator_

As my model is now ready, I'll simply use it to predict the results on the test data and calculate both `accuracy_score` and `confusion_matrix`.

In [None]:
best_pred_y = best_estimator.predict(test_X)
print("Accuracy: {}%".format(accuracy_score(test_y, best_pred_y)*100))
print("Confusion Matrix:")
print("{}".format(confusion_matrix(test_y, best_pred_y)))

As we can see, just by selecting the best combination of parameters for the model, I was able to improve the accuracy of the **Random Forest Classifier** from **94.42%** to **97.08%**.
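To see which combination the search actually selected, the fitted object exposes `best_params_`, `best_score_` and the full `cv_results_` table. The sketch below shows how these can be inspected. It is deliberately tiny so that it runs quickly: it uses `load_digits` as a stand-in for MNIST and a reduced, hypothetical grid (`small_grid`), so its output does not correspond to the results reported above.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small stand-in dataset and grid so the search finishes in seconds.
X, y = load_digits(return_X_y=True)
small_grid = {'n_estimators': [10, 20], 'max_depth': [5, 10]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid=small_grid, cv=3)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy: {:.4f}".format(search.best_score_))

# One row per candidate, ordered from best to worst.
results = pd.DataFrame(search.cv_results_)
columns = ['params', 'mean_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score'))
```

Printing `grid.best_params_` in the notebook would show, for instance, whether the gain comes mainly from the larger number of trees or from the other settings.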