In [94]:
# Imports
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [95]:
# Loading dataset
mnist = fetch_openml('mnist_784')
# View the shape of the dataset
mnist.data.shape

(70000, 784)

In [96]:
# Setting features
X = mnist.data
# Setting target
y = mnist.target
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [97]:
# Creating StandardScaler object
scaler = StandardScaler()
# Creating PCA object
pca = PCA(n_components=.95)
# Creating pipeline for StandardScaler and PCA
transformer = make_pipeline(scaler, pca)
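
Passing a float between 0 and 1 as `n_components` asks `PCA` to keep the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch of how to inspect what was kept, assuming the small bundled `load_digits` dataset as a quick stand-in for MNIST (the `demo_` names are illustrative, not from this notebook):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 8x8 digit images flattened to 64 features
X_demo, _ = load_digits(return_X_y=True)

# Same structure as the transformer above: scale, then keep 95% of the variance
demo_transformer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = demo_transformer.fit_transform(X_demo)

# make_pipeline names each step after its lowercased class name
fitted_pca = demo_transformer.named_steps["pca"]
kept_variance = fitted_pca.explained_variance_ratio_.sum()

print(f"components kept: {fitted_pca.n_components_} of {X_demo.shape[1]}")
print(f"variance retained: {kept_variance:.3f}")
```

The same `named_steps["pca"].n_components_` lookup should work on a fitted copy of the MNIST transformer.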

In [98]:
# Creating first model with PCA pipeline (KMeans clustering, despite the knn name)
knn1 = make_pipeline(transformer, KMeans())
# Fitting first model; KMeans is unsupervised, so y_train is ignored
knn1.fit(X_train, y_train)
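
If the intent was k-nearest-neighbors classification, as the name `knn1` suggests, the final step would be `KNeighborsClassifier` rather than `KMeans`. The following is a minimal sketch under that assumption. It is not the code that produced the outputs in this notebook, and it uses the small bundled `load_digits` dataset as a stand-in for MNIST so that it runs quickly:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 1797 labeled 8x8 digit images
X_demo, y_demo = load_digits(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=42
)

# Scale, reduce with PCA, then classify with k-nearest neighbors
knn_pca = make_pipeline(
    StandardScaler(), PCA(n_components=0.95), KNeighborsClassifier()
)
knn_pca.fit(Xd_train, yd_train)

# For a classifier, score() returns mean accuracy against the true labels
test_accuracy = knn_pca.score(Xd_test, yd_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

With a classifier, `score` compares predictions against the true labels, so it needs `y_test` rather than the model's own predictions.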

---
---

> For this assignment, I left `%%time` commented out because the Jupyter extension for VSCode shows each cell's execution time by default. You can uncomment it to run the cell and see the result on your end.
> 
> - I know that Jupyter Notebook is required, but I have dojo-env and Anaconda loaded, and I like that VSCode uses my computer's resources (which tend to be faster than client-server IDEs). So, to be honest, I'm going to be stubborn about switching.

---
---

In [99]:
#%%time
# Creating prediction separately as task requires
preds_yes = knn1.predict(X_test)

In [100]:
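# Printing scores: KMeans.score ignores any y argument and returns the negative inertia
print(knn1.score(X_train))
print(knn1.score(X_test))
# Silhouette of the predicted clusters, computed on the raw (untransformed) test features
silhouette_score(X_test, preds_yes)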

-30466973.666883916
-14432980.628896967


0.06022245112051501

In [101]:
# Creating second model without PCA (KMeans clustering, despite the knn name)
knn2 = make_pipeline(scaler, KMeans())
# Fitting second model; KMeans is unsupervised, so y_train is ignored
knn2.fit(X_train, y_train)

In [102]:
#%%time
# Creating prediction separately as task requires
preds_no = knn2.predict(X_test)

In [103]:
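# Printing scores: KMeans.score ignores any y argument and returns the negative inertia
print(knn2.score(X_train))
print(knn2.score(X_test))
# Silhouette of the predicted clusters, computed on the raw (untransformed) test features
silhouette_score(X_test, preds_no)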

-32339231.471531514
-15637339.125852646


0.06021882499177731

---

#### a. Which model performed the best on the test set?

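The model with the best performance is the `PCA` model, `knn1`. The scores are large negative numbers because `KMeans.score` returns the negative of the K-means objective (the sum of squared distances from each sample to its nearest cluster center), so values closer to zero are better. The test score for `knn1` is `-14432980.62`, while the test score for `knn2` is lower at `-15637339.12`. Therefore, `knn1` has the best performance. One caveat: `knn1` is scored in the PCA-reduced space, which discards about 5% of the variance, so part of the gap may come from the smaller feature space rather than from tighter clusters. The silhouette scores are nearly identical (`0.0602` for both models).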
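
The relationship between `KMeans.score` and the K-means objective can be checked directly. A minimal sketch on synthetic blobs (the data and variable names are assumptions for illustration) compares the score against a hand-computed sum of squared distances to the nearest center, and confirms that passing a `y` argument has no effect:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption): 300 points around 4 centers
X_blobs, _ = make_blobs(n_samples=300, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_blobs)

# transform() gives the distance from each sample to every cluster center;
# the objective sums the squared distance to the nearest one
dists = km.transform(X_blobs)
manual_inertia = (dists.min(axis=1) ** 2).sum()

score_plain = km.score(X_blobs)
score_with_y = km.score(X_blobs, np.zeros(len(X_blobs)))  # y is ignored

print("score equals -inertia:", np.isclose(score_plain, -manual_inertia))
print("y has no effect:", np.isclose(score_plain, score_with_y))
```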

---

#### b. Which model was the fastest at making predictions?

Even though `knn1` performed better, `knn2` was faster at making predictions:

- `knn1 = 0.2s`
- `knn2 = 0.1s`
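
Single-run timings this close together can be affected by noise, so repeating the measurement may give a more reliable comparison. A hedged sketch using `time.perf_counter`, with synthetic data and a hypothetical helper `time_predict` (neither is part of this notebook):

```python
import time

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def time_predict(model, X, repeats=5):
    """Return the best wall-clock time in seconds over several predict calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        model.predict(X)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Synthetic stand-in data (assumption): 2000 samples with 50 features
X_syn, _ = make_blobs(n_samples=2000, n_features=50, centers=8, random_state=0)

with_pca = make_pipeline(
    StandardScaler(), PCA(n_components=0.95), KMeans(n_init=10, random_state=0)
).fit(X_syn)
without_pca = make_pipeline(
    StandardScaler(), KMeans(n_init=10, random_state=0)
).fit(X_syn)

print(f"with PCA:    {time_predict(with_pca, X_syn):.4f} s")
print(f"without PCA: {time_predict(without_pca, X_syn):.4f} s")
```

Taking the minimum over several runs reduces the influence of one-off delays such as other processes competing for the CPU.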

---