1. Try to build a classifier for the MNIST dataset that achieves over 97% accuracy
on the test set. Hint: the KNeighborsClassifier works quite well for this task;
you just need to find good hyperparameter values (try a grid search on the
weights and n_neighbors hyperparameters).



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X, y = mnist

In [None]:
from sklearn.model_selection import train_test_split

# Fix the seed so the split (and every score measured on it) is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42
)

In [10]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn = KNeighborsClassifier(n_jobs=-1)
params = {
    'n_neighbors': np.arange(5, 30, 3)
}
grid = GridSearchCV(knn, params)
grid.fit(X_train, y_train)

In [13]:
knn = KNeighborsClassifier(**grid.best_params_, n_jobs=-1)
knn.fit(X_train, y_train)

In [14]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, knn.predict(X_test))

0.9683
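
The exercise hint suggests searching over both `weights` and `n_neighbors`, while the grid above varies only `n_neighbors`. A minimal sketch of the wider search follows. To keep it quick it assumes scikit-learn's bundled 8x8 `load_digits` data as a stand-in for MNIST, and the names `X_small`, `weighted_grid`, and the candidate values are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the small bundled digits dataset stands in for MNIST so the
# search finishes in seconds; swap in the MNIST training arrays for the real run.
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.2, stratify=y_small, random_state=42
)

# Search both hyperparameters named in the hint (candidate values are a guess).
param_grid = {
    "weights": ["uniform", "distance"],
    "n_neighbors": [3, 4, 5, 6],
}
weighted_grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
weighted_grid.fit(X_tr, y_tr)

print(weighted_grid.best_params_)
# GridSearchCV refits the best model on the full training data by default,
# so it can be scored on the held-out split directly.
print(weighted_grid.score(X_te, y_te))
```

On MNIST the same `param_grid` can be passed to `GridSearchCV` together with `X_train` and `y_train`; whether it clears 97% depends on the split and the values searched.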

2. Write a function that can shift an MNIST image in any direction (left, right, up,
or down) by one pixel. Then, for each image in the training set, create four shifted
copies (one per direction) and add them to the training set. Finally, train your
best model on this expanded training set and measure its accuracy on the test set.
You should observe that your model performs even better now! This technique of
artificially growing the training set is called data augmentation or training set
expansion.

In [None]:
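from scipy.ndimage import shift

# Augment the training split only. Shifting all of X would put shifted copies
# of the test images into the training set and inflate the test accuracy.
X_reshaped = X_train.reshape(-1, 28, 28)

X_augmented = list(X_reshaped)
y_augmented = list(y_train)

def shift_image(image, dx, dy):
    # Positive dx moves the digit right, positive dy moves it down;
    # pixels shifted in from outside the image are filled with 0.
    shifted_image = shift(image, [dy, dx], mode='constant', cval=0)
    return shifted_image

for image, label in zip(X_reshaped, y_train):
    # Shift up
    X_augmented.append(shift_image(image, 0, -1))
    y_augmented.append(label)
    # Shift down
    X_augmented.append(shift_image(image, 0, 1))
    y_augmented.append(label)
    # Shift left
    X_augmented.append(shift_image(image, -1, 0))
    y_augmented.append(label)
    # Shift right
    X_augmented.append(shift_image(image, 1, 0))
    y_augmented.append(label)

X_augmented = np.array(X_augmented)
y_augmented = np.array(y_augmented)

print(f"Training set size: {len(X_train)}")
print(f"Augmented training set size: {len(X_augmented)}")

Training set size: 56000
Augmented training set size: 280000


In [21]:
# Flatten each 28x28 image back into a 784-feature row
X_augmented = X_augmented.reshape(len(X_augmented), -1)

In [22]:
knn = KNeighborsClassifier(**grid.best_params_, n_jobs=-1)
knn.fit(X_augmented, y_augmented)

In [23]:
# The score recorded below was measured when the augmentation ran over all of X,
# so shifted copies of the test images were in the training set and the value is
# inflated by leakage. Rerun this cell for the leakage-free accuracy.
accuracy_score(y_test, knn.predict(X_test))

0.9973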
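
As an alternative to the per-image loop, the four shifts can be applied to a whole batch at once with NumPy slicing. This is only a sketch: `shift_batch` and `augment_with_shifts` are hypothetical helper names, and for whole-pixel shifts the zero-filled result should match `scipy.ndimage.shift` with `mode='constant'`.

```python
import numpy as np

def shift_batch(images, dx, dy):
    """Shift a batch of (n, height, width) images by whole pixels, zero-filling."""
    _, h, w = images.shape
    shifted = np.zeros_like(images)
    # Source and destination slices for a shift that does not wrap around.
    src_rows = slice(max(0, -dy), min(h, h - dy))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    shifted[:, dst_rows, dst_cols] = images[:, src_rows, src_cols]
    return shifted

def augment_with_shifts(X_flat, labels, side=28):
    """Return the originals plus one copy shifted up, down, left, and right."""
    images = X_flat.reshape(-1, side, side)
    directions = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # (dx, dy): up, down, left, right
    batches = [images] + [shift_batch(images, dx, dy) for dx, dy in directions]
    X_aug = np.concatenate(batches).reshape(-1, side * side)
    # Labels repeat once per batch, in the same order as the concatenation.
    y_aug = np.tile(labels, len(batches))
    return X_aug, y_aug

# Tiny demo: two 3x3 "images" instead of MNIST digits.
demo = np.arange(2 * 9).reshape(2, 9)
X_aug, y_aug = augment_with_shifts(demo, np.array(["a", "b"]), side=3)
print(X_aug.shape, y_aug.shape)  # (10, 9) (10,)
```

For the exercise, the call would be `augment_with_shifts(X_train, y_train)`, passing only the training split so that no shifted test image reaches the model.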

4. Build a spam classifier (a more challenging exercise): \
• Download examples of spam and ham from Apache SpamAssassin’s public
datasets. \
• Unzip the datasets and familiarize yourself with the data format. \
• Split the datasets into a training set and a test set. \
• Write a data preparation pipeline to convert each email into a feature vector. \
Your preparation pipeline should transform an email into a (sparse) vector that
indicates the presence or absence of each possible word. For example, if all
emails only ever contain four words, “Hello,” “how,” “are,” “you,” then the email
“Hello you Hello Hello you” would be converted into a vector [1, 0, 0, 1]
(meaning [“Hello” is present, “how” is absent, “are” is absent, “you” is
present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of
each word. \
You may want to add hyperparameters to your preparation pipeline to control
whether or not to strip off email headers, convert each email to lowercase,
remove punctuation, replace all URLs with “URL,” replace all numbers with
“NUMBER,” or even perform stemming (i.e., trim off word endings; there are
Python libraries available to do this).
Finally, try out several classifiers and see if you can build a great spam classifier,
with both high recall and high precision.
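
The preparation step and the two metrics named above can be sketched with the standard library alone. Everything here is an illustrative assumption rather than a reference solution: the function names, the regular expressions, and the dense list output (a real pipeline would more likely build a `scipy.sparse` matrix). Header stripping and stemming are left out.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")

def email_to_words(text, lowercase=True, replace_urls=True,
                   replace_numbers=True, remove_punctuation=True):
    """Turn raw email text into a list of tokens; each flag is a hyperparameter."""
    if lowercase:
        text = text.lower()
    if replace_urls:
        text = URL_RE.sub(" URL ", text)
    if replace_numbers:
        text = NUMBER_RE.sub(" NUMBER ", text)
    if remove_punctuation:
        text = re.sub(r"[^\w\s]", " ", text)
    return text.split()

def build_vocabulary(emails, **options):
    """Map every word seen in the training emails to a column index."""
    words = set()
    for text in emails:
        words.update(email_to_words(text, **options))
    return {word: index for index, word in enumerate(sorted(words))}

def vectorize(text, vocabulary, binary=True, **options):
    """Presence/absence vector by default, occurrence counts if binary=False."""
    word_counts = Counter(email_to_words(text, **options))
    vector = [0] * len(vocabulary)
    for word, count in word_counts.items():
        if word in vocabulary:  # words unseen during training are ignored
            vector[vocabulary[word]] = 1 if binary else count
    return vector

def precision_recall(y_true, y_pred, positive="spam"):
    """Precision and recall for the positive class, computed from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# The four-word example from the exercise text.
vocab = {"hello": 0, "how": 1, "are": 2, "you": 3}
print(vectorize("Hello you Hello Hello you", vocab))                # [1, 0, 0, 1]
print(vectorize("Hello you Hello Hello you", vocab, binary=False))  # [3, 0, 0, 2]
```

In practice the vocabulary would come from `build_vocabulary` applied to the training emails only, so that no word statistics from the test set leak into the features.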