<a href="https://colab.research.google.com/github/stanleykywu/ds-intro/blob/main/Digits_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Digits Classifier Example Notebook

## Dependencies

Run the installation code block ONLY if you are running this notebook in Google Colab. Nothing will happen if you run it locally, but you shouldn't need to, since all packages should already be installed.

In [None]:
# The PyPI package is named scikit-learn; the "sklearn" alias is deprecated
%pip install scikit-learn matplotlib

### Import necessary functions

In [2]:
import matplotlib.pyplot as plt
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split

## Taking a look at our Image Dataset

In [None]:
digits = datasets.load_digits()

_, axes = plt.subplots(nrows=2, ncols=5, figsize=(20, 6))
# Show the first 10 digits with their labels in a 2x5 grid
for idx, ax in enumerate(axes.flat):
    ax.imshow(digits.images[idx], cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title("Digit: %i" % digits.target[idx])
    ax.axis('off')

## Data Cleaning

We have 1797 images, each an 8x8 matrix of pixel intensities representing a handwritten digit.

In [None]:
print(digits.images.shape)

2-D images are awkward to feed to a classifier: scikit-learn estimators expect a 2-D array of shape (n_samples, n_features). So we flatten each image from a 2-dimensional 8x8 matrix to a 1-dimensional vector with $8 \cdot 8 = 64$ values.

In [None]:
# Flatten each 8x8 image into a 64-element vector
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
print(data.shape)
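
To make the flattening step concrete, here is a minimal sketch of what `reshape` does to a single image. The pixel coordinates `row` and `col` are arbitrary values chosen for illustration; the point is that NumPy flattens in row-major order, so no information is lost and the 8x8 images can be recovered.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
images = digits.images                     # 3-D array: (n_samples, 8, 8)
flat = images.reshape((len(images), -1))   # 2-D array: (n_samples, 64)

# Row-major flattening: pixel (row, col) ends up at index row * 8 + col
row, col = 3, 5  # arbitrary pixel, chosen only for illustration
same_pixel = flat[0, row * 8 + col] == images[0, row, col]
print(same_pixel)

# Reshaping back to (n_samples, 8, 8) recovers the original images
restored = flat.reshape((-1, 8, 8))
print(np.array_equal(restored, images))
```

This is also why the evaluation plots later can call `.reshape((8, 8))` on a test vector to display it as an image again.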

We also have the corresponding labels for each image

In [None]:
labels = digits.target

# Print first 10 labels to see
print(labels[:10])

We split our data into training and test sets, with X holding the flattened input images and y holding the labels. We fit our model on the training set and see how well it does on the test set, i.e. data it has never seen before.

In [17]:
# Split data into 50% train and 50% test subsets
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.5, shuffle=True
)
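
Because the split is shuffled, every run produces different train and test sets, so the scores further down will vary slightly between runs. If you want repeatable numbers, one option is sketched below. The `random_state=0` seed and the `stratify=labels` argument are not part of the original notebook; they are assumptions added here for illustration. `stratify` keeps the proportion of each digit roughly equal in both halves.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
data = digits.images.reshape((len(digits.images), -1))
labels = digits.target

def split(seed):
    # Same 50/50 split as in the notebook, but seeded and stratified
    # (both additions are illustrative assumptions)
    return train_test_split(
        data, labels, test_size=0.5, shuffle=True,
        random_state=seed, stratify=labels,
    )

X_train, X_test, y_train, y_test = split(seed=0)
X_train_again, _, _, _ = split(seed=0)

# With a fixed seed, repeated calls return identical splits
print((X_train == X_train_again).all())
print(X_train.shape, X_test.shape)
```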

## Creating our Classifier

We use the built-in SVC (support vector classifier) from scikit-learn and fit it to the training data.

In [None]:
# Create a classifier: a support vector classifier
clf = svm.SVC(gamma=0.001)

# Learn the digits on the train subset
clf.fit(X_train, y_train)

We now use our trained classifier to generate label predictions for our withheld test data.

In [19]:
# Predict the value of the digit on the test subset
predicted = clf.predict(X_test)

## Model Evaluation

How well did we do?

In [None]:
_, axes = plt.subplots(nrows=2, ncols=5, figsize=(20, 6))
# Show the first 10 test images with their true and predicted labels
for idx, ax in enumerate(axes.flat):
    ax.imshow(X_test[idx].reshape((8, 8)), cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title("True: %i, Predicted: %i" % (y_test[idx], predicted[idx]))
    ax.axis('off')

Let's use scikit-learn's classification report to see how we did in more depth. It lists precision, recall, and F1-score for each digit.

In [None]:
print(
    f"Classification report for classifier {clf}:\n"
    f"{metrics.classification_report(y_test, predicted)}\n"
)
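
The report summarizes each class with a few numbers. To see which digits get mistaken for which, a confusion matrix is a useful complement. The sketch below is self-contained and repeats the notebook's pipeline; the `random_state=0` seed is an assumption added so the output is repeatable and is not part of the original cells.

```python
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
data = digits.images.reshape((len(digits.images), -1))
X_train, X_test, y_train, y_test = train_test_split(
    data, digits.target, test_size=0.5, shuffle=True,
    random_state=0,  # assumed seed, for repeatable output only
)

clf = svm.SVC(gamma=0.001).fit(X_train, y_train)
predicted = clf.predict(X_test)

# Rows are true digits, columns are predicted digits
cm = metrics.confusion_matrix(y_test, predicted)
print(cm)

# Overall accuracy is the fraction of predictions on the diagonal
accuracy = metrics.accuracy_score(y_test, predicted)
print(f"Accuracy: {accuracy:.3f}")

# Per-class recall: correct predictions divided by true count per digit
recall = cm.diagonal() / cm.sum(axis=1)
for digit, r in enumerate(recall):
    print(f"Digit {digit}: recall {r:.3f}")
```

Off-diagonal entries show the confusions: a nonzero value in row 8, column 1 would mean some true 8s were predicted as 1s. The per-class recall computed here should match the `recall` column of the classification report for the same split.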