# Setup

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Exercise: *Load the MNIST dataset*

In [6]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", as_frame=False, parser="auto")

Exercise: *and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing).*

In [7]:
X_train = mnist.data[:60_000]
y_train = mnist.target[:60_000]

X_test = mnist.data[60_000:]
y_test = mnist.target[60_000:]

Exercise: *Train a random forest classifier on the dataset and time how long it takes.*

In [8]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42, n_jobs=-1)
%time forest_clf.fit(X_train, y_train)

CPU times: user 2min 53s, sys: 582 ms, total: 2min 53s
Wall time: 16.7 s
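
The `%time` magic only works inside IPython or Jupyter. As a rough sketch of the same measurement in a plain script, the hypothetical helper `timed_fit` below wraps `fit` with `time.perf_counter()`. It uses the small `load_digits` dataset and only 20 trees as stand-ins so it runs in a moment; neither choice comes from the exercise.

```python
import time

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier


def timed_fit(model, X, y):
    """Fit `model` on (X, y) and return (fitted model, wall-clock seconds).

    Hypothetical helper, not part of scikit-learn.
    """
    start = time.perf_counter()
    model.fit(X, y)
    return model, time.perf_counter() - start


# Small stand-in dataset (assumption: MNIST itself would take far longer).
X_small, y_small = load_digits(return_X_y=True)

demo_clf, seconds = timed_fit(
    RandomForestClassifier(n_estimators=20, random_state=42),
    X_small,
    y_small,
)
print(f"Wall time: {seconds:.2f} s")
```

Unlike `%time`, this reports wall time only, so it does not show the gap between CPU time and wall time that `n_jobs=-1` produces.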


Exercise: *then evaluate the resulting model on the test set*

In [9]:
from sklearn.metrics import accuracy_score

y_predict = forest_clf.predict(X_test)
accuracy_score(y_test, y_predict)  # signature is (y_true, y_pred)

0.9705

Exercise: *use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%*

In [10]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)
X_train_reduced = pca.fit_transform(X_train)
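
When `n_components` is a float between 0 and 1, PCA chooses the number of components itself, so it is worth checking how many dimensions were actually kept. The sketch below shows one way to do that through the fitted `n_components_` and `explained_variance_ratio_` attributes. It uses the small `load_digits` dataset as a stand-in (an assumption made to keep the example fast), so the numbers it prints are not the MNIST ones.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data: 1,797 images of 8x8 digits, i.e. 64 features each.
X_digits, _ = load_digits(return_X_y=True)

pca_demo = PCA(n_components=0.95)
X_digits_reduced = pca_demo.fit_transform(X_digits)

# Number of components PCA decided to keep, and the variance they preserve.
n_kept = pca_demo.n_components_
variance_kept = pca_demo.explained_variance_ratio_.sum()

print(f"{X_digits.shape[1]} features -> {n_kept} components")
print(f"explained variance preserved: {variance_kept:.4f}")
```

Running the same two attribute lookups on the `pca` object fitted above would report the figures for MNIST.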

Exercise: *Train a new random forest classifier on the reduced dataset and see how long it takes.*

In [11]:
forest_clf_with_pca = RandomForestClassifier(random_state=42, n_jobs=-1)
%time forest_clf_with_pca.fit(X_train_reduced, y_train)

CPU times: user 7min 55s, sys: 640 ms, total: 7min 56s
Wall time: 44.7 s


Training is now almost 3 times slower (about 45 seconds of wall time versus about 17)! How can that be? As we saw earlier in this chapter, **dimensionality reduction does not always lead to faster training**: it depends on the dataset, the model, and the training algorithm.

Exercise: *Evaluate the classifier on the test set*

In [12]:
X_test_reduced = pca.transform(X_test)

y_predict = forest_clf_with_pca.predict(X_test_reduced)
accuracy_score(y_test, y_predict)  # signature is (y_true, y_pred)

0.9481

It's common for dimensionality reduction to hurt a model's performance somewhat, since we lose some potentially useful signal in the process. But in this case the drop is rather severe: accuracy falls from 97.05% to 94.81%. So PCA really did not help here: it slowed down training and decreased performance.

Exercise: *Try again with an `SGDClassifier`*

In [13]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
%time sgd_clf.fit(X_train, y_train)

CPU times: user 2min 49s, sys: 154 ms, total: 2min 50s
Wall time: 2min 50s


In [14]:
y_predict = sgd_clf.predict(X_test)
accuracy_score(y_test, y_predict)  # signature is (y_true, y_pred)

0.874

In [15]:
sgd_clf_with_pca = SGDClassifier(random_state=42)
%time sgd_clf_with_pca.fit(X_train_reduced, y_train)

CPU times: user 42.7 s, sys: 317 ms, total: 43 s
Wall time: 42.7 s


Applying PCA beforehand leads to a 4x speedup (about 43 seconds instead of 2 minutes 50 seconds)! Let's see what the accuracy is.

In [16]:
y_predict = sgd_clf_with_pca.predict(X_test_reduced)
accuracy_score(y_test, y_predict)  # signature is (y_true, y_pred)

0.8959

Fantastic! Reducing dimensionality not only gives us a 4x speedup, it also improves accuracy a little, from 87.4% to 89.59%.

So there you have it: PCA can give a good speedup, and if you're lucky you also get a small performance boost. However, neither is guaranteed: it all depends on the model and the dataset!
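
The cells above apply PCA by hand: `fit_transform` on the training set, then `transform` on the test set. One way to bundle those steps is a scikit-learn pipeline, sketched below. This is an alternative arrangement, not what the notebook ran, and it again uses `load_digits` with a `train_test_split` as a quick stand-in for MNIST (both are assumptions made for illustration).

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Stand-in data and split (assumption: not the 60,000 / 10,000 MNIST split).
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# fit() learns the PCA projection on the training data only, then trains
# the classifier on the reduced features.
pca_sgd = make_pipeline(PCA(n_components=0.95), SGDClassifier(random_state=42))
pca_sgd.fit(X_tr, y_tr)

# score() applies the already-fitted projection to the test data before
# predicting, so the test set is never used to fit PCA.
acc = pca_sgd.score(X_te, y_te)
print(f"test accuracy: {acc:.4f}")
```

Keeping PCA inside the pipeline also makes it harder to call `fit_transform` on the test set by mistake.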