# MNIST and KNN

This notebook downloads the MNIST dataset and fits a k-nearest neighbors (KNN) classifier, first on the entire training set and then on random samples of it. PyTorch (via torchvision) is used to download the data for convenience.

MNIST data link and other KNN results: http://yann.lecun.com/exdb/mnist/

In [52]:
import torch
import torchvision
import numpy as np
import pandas as pd

### Download data

Download the data with torchvision (for convenience), then convert it to NumPy arrays for use with scikit-learn's KNN classifier below.

In [5]:
train_loader = torch.utils.data.DataLoader(
  torchvision.datasets.MNIST('./files/', train=True, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                 (0.1307,), (0.3081,))
                             ])),
  batch_size=60000, shuffle=True)

test_loader = torch.utils.data.DataLoader(
  torchvision.datasets.MNIST('./files/', train=False, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                 (0.1307,), (0.3081,))
                             ])),
  batch_size=10000, shuffle=True)

In [6]:
# Each loader's batch size equals the full split, so the first batch holds every example
test_data, test_targets = next(iter(test_loader))
train_data, train_targets = next(iter(train_loader))

# Flatten each 1x28x28 image tensor into a 784-element feature vector
train_X = train_data.numpy().reshape([60000, 28*28])
train_y = train_targets.numpy()

test_X = test_data.numpy().reshape([10000, 28*28])
test_y = test_targets.numpy()

In [41]:
print("Train X: ", train_X.shape)
print("Train y: ", train_y.shape)
print("Test X: ", test_X.shape)
print("Test y: ", test_y.shape)

Train X:  (60000, 784)
Train y:  (60000,)
Test X:  (10000, 784)
Test y:  (10000,)


### Model

Fit a scikit-learn `KNeighborsClassifier` on the full training set and measure its accuracy on the test set.

In [43]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [44]:
model = KNeighborsClassifier(n_neighbors=10)
model.fit(train_X, train_y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='uniform')

In [46]:
out = model.predict(test_X)

score = accuracy_score(test_y, out)

In [47]:
score

0.9665
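
The classifier's printed parameters (`metric='minkowski'`, `p=2`, `weights='uniform'`) correspond to Euclidean distance with an unweighted majority vote. As a rough illustration of that idea only, here is a brute-force sketch in NumPy on toy 2-D data. The function name `knn_predict` and the toy arrays are hypothetical, and this is not how scikit-learn implements the search internally. The broadcasted distance matrix is fine for a handful of points but would need far too much memory for the full MNIST arrays.

```python
import numpy as np

def knn_predict(train_X, train_y, query_X, k=3):
    """Brute-force KNN sketch: Euclidean distance, uniform majority vote."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train)
    diffs = query_X[:, None, :] - train_X[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k closest training points for each query
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote over the neighbours' labels
    # (in this sketch, ties go to the smallest label)
    votes = train_y[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy data (hypothetical): two well-separated clusters with labels 0 and 1
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])
queries = np.array([[0.05, 0.05], [5.0, 5.1]])

toy_pred = knn_predict(toy_X, toy_y, queries, k=3)
print(toy_pred)
```

Each query point takes the label of the cluster it sits in, because all three of its nearest neighbours come from that cluster.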

### Model with samples

In [59]:
perc_val = .1  # fraction of the training set to sample (0.1 = 10%)
num_iter = 10  # number of independent random samples to draw
num_pts = int(train_X.shape[0] * perc_val)

scores = []

for i in range(num_iter):
    
    # sample data
    sample_idx = np.random.choice(train_X.shape[0], num_pts, replace=False)
    x_sample = train_X[sample_idx]
    y_sample = train_y[sample_idx]
    
    # model sample
    model = KNeighborsClassifier(n_neighbors=10)
    model.fit(x_sample, y_sample)
    
    # test
    out = model.predict(test_X)
    score = accuracy_score(test_y, out)
    
    print(f"Model {i} has accuracy of {score}")
    scores.append(score)

    
print(f"Average score for models trained on {perc_val * 100}% of data: {np.mean(scores)}")

Model 0 has accuracy of 0.9308
Model 1 has accuracy of 0.9303
Model 2 has accuracy of 0.9313
Model 3 has accuracy of 0.9339
Model 4 has accuracy of 0.9309
Model 5 has accuracy of 0.9337
Model 6 has accuracy of 0.9313
Model 7 has accuracy of 0.9338
Model 8 has accuracy of 0.9358
Model 9 has accuracy of 0.9334
Average score for models trained on 10.0% of data: 0.93252
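
To put the sampled runs next to the full-data result, the sketch below summarizes the ten accuracies printed above and compares their mean with the 0.9665 obtained on the full training set. The values are copied by hand from the notebook output. The variable names are illustrative, and using the sample standard deviation (`ddof=1`) is an assumption about how one might want to report the spread.

```python
import numpy as np

# Accuracies copied from the notebook output above
full_score = 0.9665
sample_scores = np.array([0.9308, 0.9303, 0.9313, 0.9339, 0.9309,
                          0.9337, 0.9313, 0.9338, 0.9358, 0.9334])

mean_score = sample_scores.mean()
std_score = sample_scores.std(ddof=1)  # sample standard deviation across runs
gap = full_score - mean_score          # accuracy lost by training on 10% of the data

print(f"Mean accuracy on 10% samples: {mean_score:.5f}")
print(f"Std across runs:              {std_score:.5f}")
print(f"Range:                        {sample_scores.min():.4f} to {sample_scores.max():.4f}")
print(f"Gap to full-data model:       {gap:.5f}")
```

The mean of 0.93252 is about 3.4 percentage points below the full-data accuracy, and the individual runs all fall between 0.9303 and 0.9358.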
