## Setup

In [1]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_openml
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

from tqdm import tqdm
from scipy.ndimage import shift

In [2]:
mnist = fetch_openml("mnist_784", as_frame=False)
print(mnist.DESCR)


**Author**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges  
**Source**: [MNIST Website](http://yann.lecun.com/exdb/mnist/) - Date unknown  
**Please cite**:  

The MNIST database of handwritten digits with 784 features, raw data available at: http://yann.lecun.com/exdb/mnist/. It can be split in a training set of the first 60,000 examples, and a test set of 10,000 examples  

It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image b

In [3]:
X, y = mnist.data, mnist.target

In [4]:
X_train, X_test, y_train, y_test = X[:60_000], X[60_000:], y[:60_000], y[60_000:]

In [5]:
print(f"X_train.shape: {X_train.shape}")
print(f"X_test.shape:  {X_test.shape}")
print(f"y_train.shape: {y_train.shape}")
print(f"y_test.shape:  {y_test.shape}")

X_train.shape: (60000, 784)
X_test.shape:  (10000, 784)
y_train.shape: (60000,)
y_test.shape:  (10000,)


## Question 1

Try to build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set. Hint: the `KNeighborsClassifier` works quite well for this task; you just need to find good hyperparameter values (try a grid search on the `weights` and `n_neighbors` hyperparameters).
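To make the `weights` hyperparameter concrete, here is a minimal sketch of how a k-nearest-neighbors vote is tallied under `"uniform"` and `"distance"` weighting. The helper `knn_vote` and the neighbor labels and distances are hypothetical and chosen for illustration only; this is not scikit-learn's implementation, although `weights="distance"` does weight each neighbor by the inverse of its distance.

```python
import numpy as np

def knn_vote(labels, distances, weights="uniform"):
    """Tally per-class votes among the k nearest neighbors (illustrative helper)."""
    labels = np.asarray(labels)
    distances = np.asarray(distances, dtype=float)
    if weights == "uniform":
        w = np.ones_like(distances)   # every neighbor counts equally
    else:
        w = 1.0 / distances           # closer neighbors count more (assumes no zero distance)
    classes = np.unique(labels)
    return {str(c): float(w[labels == c].sum()) for c in classes}

# Hypothetical neighborhood with k=4: two neighbors vote "7", two vote "1".
labels = ["7", "1", "1", "7"]
distances = [0.5, 2.0, 2.1, 2.2]

uniform_votes = knn_vote(labels, distances, weights="uniform")
distance_votes = knn_vote(labels, distances, weights="distance")
print(uniform_votes)
print(distance_votes)
```

With uniform weights the two classes tie, while distance weighting lets the much closer "7" neighbor decide. This may be one reason an even `n_neighbors` pairs well with `weights="distance"`, though that is a guess rather than something the grid search proves.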

In [11]:
param_grid = [
    {"weights": ["uniform", "distance"], "n_neighbors": [3, 4, 5, 6]}
]

knn_clf = KNeighborsClassifier()
grid_search = GridSearchCV(knn_clf, param_grid, scoring="accuracy", cv=3, verbose=3)
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV 1/3] END ....n_neighbors=3, weights=uniform;, score=0.969 total time=  16.8s
[CV 2/3] END ....n_neighbors=3, weights=uniform;, score=0.968 total time=  16.6s
[CV 3/3] END ....n_neighbors=3, weights=uniform;, score=0.968 total time=  16.2s
[CV 1/3] END ...n_neighbors=3, weights=distance;, score=0.970 total time=  16.3s
[CV 2/3] END ...n_neighbors=3, weights=distance;, score=0.969 total time=  16.4s
[CV 3/3] END ...n_neighbors=3, weights=distance;, score=0.969 total time=  16.1s
[CV 1/3] END ....n_neighbors=4, weights=uniform;, score=0.966 total time=  16.4s
[CV 2/3] END ....n_neighbors=4, weights=uniform;, score=0.966 total time=  17.3s
[CV 3/3] END ....n_neighbors=4, weights=uniform;, score=0.967 total time=  19.9s
[CV 1/3] END ...n_neighbors=4, weights=distance;, score=0.971 total time=  17.0s
[CV 2/3] END ...n_neighbors=4, weights=distance;, score=0.970 total time=  17.7s
[CV 3/3] END ...n_neighbors=4, weights=distance;,

In [12]:
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best estimator:  {grid_search.best_estimator_}")
print(f"Best score:      {grid_search.best_score_}")

Best parameters: {'n_neighbors': 4, 'weights': 'distance'}
Best estimator:  KNeighborsClassifier(n_neighbors=4, weights='distance')
Best score:      0.9703500000000002


In [17]:
best_knn_clf = grid_search.best_estimator_
cross_val_score(best_knn_clf, X_train, y_train, cv=3, scoring="accuracy")

array([0.9709 , 0.9698 , 0.97035])

In [18]:
best_accuracy = best_knn_clf.score(X_test, y_test)
best_accuracy

0.9714

## Question 2

Write a function that can shift an MNIST image in any direction (left, right, up, or down) by one pixel. Then, for each image in the training set, create four shifted copies (one per direction) and add them to the training set. Finally, train your best model on this expanded training set and measure its accuracy on the test set. You should observe that your model performs even better now! This technique of artificially growing the training set is called data augmentation or training set expansion.

In [6]:
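def shift_image(image, dx, dy):
    """Shift a flattened 28x28 image by dx pixels horizontally and dy pixels vertically."""
    image = image.reshape((28, 28))
    # scipy.ndimage.shift takes offsets in axis order: rows (y) first, then columns (x).
    shifted_image = shift(image, [dy, dx], cval=0, mode="constant")
    return shifted_image.reshape([-1])

def augment(X, y):
    # Start from the original samples; use the y argument rather than the global y_train.
    X_augmented = [image for image in X]
    y_augmented = [label for label in y]
    
    # Add one shifted copy per direction: left, right, up, down.
    for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        for image, label in tqdm(zip(X, y), total=len(X)):
            X_augmented.append(shift_image(image, dx, dy))
            y_augmented.append(label)

    X_augmented = np.array(X_augmented)
    y_augmented = np.array(y_augmented)
    
    return X_augmented, y_augmented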
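Because `scipy.ndimage.shift` takes its offsets in array-axis order, it is worth checking which way an image actually moves. The sketch below uses a hypothetical 7x7 image with a single bright pixel (not MNIST data) and a small helper, `peak`, introduced here only for the check.

```python
import numpy as np
from scipy.ndimage import shift

# Hypothetical 7x7 test image with one bright pixel at row 3, column 3.
image = np.zeros((7, 7))
image[3, 3] = 1.0

# The first offset applies to axis 0 (rows), the second to axis 1 (columns).
moved_down = shift(image, [1, 0], cval=0, mode="constant")
moved_right = shift(image, [0, 1], cval=0, mode="constant")

def peak(a):
    """Return the (row, column) position of the brightest pixel."""
    row, col = np.unravel_index(np.argmax(a), a.shape)
    return int(row), int(col)

print(peak(image), peak(moved_down), peak(moved_right))
```

The bright pixel moves from `(3, 3)` to `(4, 3)` for the offset `[1, 0]` and to `(3, 4)` for `[0, 1]`, so a positive first component shifts the image down and a positive second component shifts it right.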

In [7]:
X_train_augmented, y_train_augmented = augment(X_train, y_train)

100%|██████████| 60000/60000 [00:08<00:00, 7174.48it/s]
100%|██████████| 60000/60000 [00:08<00:00, 7259.97it/s]
100%|██████████| 60000/60000 [00:08<00:00, 7258.43it/s]
100%|██████████| 60000/60000 [00:08<00:00, 7272.91it/s]
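The per-image loop takes roughly eight seconds per direction here. For whole-pixel shifts with zero fill, the same kind of expansion can be sketched with array operations on the whole batch. This is an alternative under stated assumptions, not the approach used in this notebook: `augment_by_slicing` is a hypothetical helper, and it relies on `np.roll` followed by blanking the row or column that wrapped around.

```python
import numpy as np

def augment_by_slicing(X, y, side=28):
    """Return X plus four one-pixel-shifted copies, with matching labels."""
    images = X.reshape(-1, side, side)
    copies = [images]
    # (axis, step): axis 1 is rows, axis 2 is columns; step +1 moves toward higher indices.
    for axis, step in ((1, 1), (1, -1), (2, 1), (2, -1)):
        shifted = np.roll(images, step, axis=axis)  # np.roll returns a new array
        # np.roll wraps around, so zero out the row/column that wrapped.
        index = [slice(None)] * 3
        index[axis] = 0 if step == 1 else -1
        shifted[tuple(index)] = 0
        copies.append(shifted)
    X_aug = np.concatenate(copies).reshape(-1, side * side)
    y_aug = np.tile(y, len(copies))  # labels repeat once per copy, in the same order
    return X_aug, y_aug

# Hypothetical stand-in data: three blank 28x28 images with made-up labels.
X_demo = np.zeros((3, 784))
y_demo = np.array(["5", "0", "4"])
X_aug, y_aug = augment_by_slicing(X_demo, y_demo)
print(X_aug.shape, y_aug.shape)
```

The result has five times as many rows as the input (the originals plus four shifted copies), so the 60,000 training images would grow to 300,000.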


In [8]:
rng = np.random.default_rng(seed=42)
shuffle_idx = rng.permutation(len(X_train_augmented))

X_train_augmented = X_train_augmented[shuffle_idx]
y_train_augmented = y_train_augmented[shuffle_idx]

In [9]:
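# Hard-coded from the grid search result in Question 1.
# The commented lines read the same values from grid_search instead:
# best_weights = grid_search.best_params_["weights"]
# best_n_neighbors = grid_search.best_params_["n_neighbors"]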

best_weights = "distance"
best_n_neighbors = 4

knn_clf = KNeighborsClassifier(weights=best_weights, n_neighbors=best_n_neighbors)
knn_clf.fit(X_train_augmented, y_train_augmented)
knn_clf.score(X_test, y_test)

0.9763
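To put the two test accuracies reported in this notebook (0.9714 before augmentation, 0.9763 after) in perspective, it helps to compare error rates rather than accuracies. The variable names below are introduced only for this calculation.

```python
# Test accuracies reported above, before and after augmentation.
baseline_accuracy = 0.9714
augmented_accuracy = 0.9763

baseline_error = 1 - baseline_accuracy
augmented_error = 1 - augmented_accuracy

# Fraction of the original test errors removed by augmentation.
relative_reduction = (baseline_error - augmented_error) / baseline_error

print(f"Error rate: {baseline_error:.2%} -> {augmented_error:.2%}")
print(f"Relative error reduction: {relative_reduction:.1%}")
```

The error rate drops from 2.86% to 2.37%, which is a relative reduction of about 17%.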

## Question 3

Tackle the Titanic dataset. A great place to start is on Kaggle.