Reproduced from [PCA using Python (scikit-learn)](https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60)

# PCA for Machine Learning

One of the most important applications of PCA is **speeding up machine learning algorithms**. The Iris dataset would be impractical here because it has only 150 rows and 4 feature columns. The MNIST database of handwritten digits is more suitable: it has 784 feature columns (784 dimensions), a training set of 60,000 examples, and a test set of 10,000 examples.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.express as px  #if you don't have this, install first

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings("ignore")

### Download and Load the (image) Data

In [None]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')

In [None]:
mnist.data.shape

The images that you downloaded are contained in mnist.data, which has a shape of (70000, 784): there are 70,000 images, each with **784 dimensions** (784 features, one per pixel of a 28 x 28 image).
The labels are contained in mnist.target and are **simply the digits 0–9** (fetch_openml returns them as strings rather than integers).

The task is to predict which digit, from 0 to 9, each image shows.

<img src='https://miro.medium.com/max/530/1*VAjYygFUinnygIx9eVCrQQ.png' width = 300>
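To make the "784 dimensions = 28 x 28 image" relationship concrete, here is a minimal sketch. It uses a random vector as a stand-in for one row of mnist.data (so nothing is downloaded) and assumes the pixels are stored in row-major order, which is how these images are conventionally reshaped for display. The names `flat_image` and `image` are illustrative only.

```python
import numpy as np

# Stand-in for one row of mnist.data: 784 pixel intensities in [0, 255].
rng = np.random.default_rng(0)
flat_image = rng.integers(0, 256, size=784)

# A row-major reshape recovers the 28 x 28 pixel grid.
image = flat_image.reshape(28, 28)
print(image.shape)

# Under this layout, pixel (row r, column c) sits at index r * 28 + c
# in the flat feature vector.
r, c = 3, 5
print(image[r, c] == flat_image[r * 28 + c])
```

With the real data you could pass such a reshaped array to `plt.imshow(image, cmap="gray")` to view the digit.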

### Split Data into Training and Test Sets

In [None]:
from sklearn.model_selection import train_test_split
# test_size: proportion of the original data used for the test set
# (1/7 of 70,000 images = 10,000 test images)
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0
)

### Standardize the Data

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit on training set only.
scaler.fit(train_img)
# Apply transform to both the training set and the test set.
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)
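
The effect of fitting the scaler on the training set only can be seen on a small example. The sketch below uses synthetic Gaussian data in place of MNIST (an assumption made to keep it fast); the variable names are illustrative. The training features end up with mean 0 and unit variance, while the test features are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the image data: 700 samples, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(700, 4))

X_train, X_test = train_test_split(X, test_size=1/7.0, random_state=0)

# Learn the mean and standard deviation from the training set only.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0).round(6))  # essentially 0 for every feature
print(X_train_std.std(axis=0).round(6))   # essentially 1 for every feature
print(X_test_std.mean(axis=0).round(3))   # near 0, but not exactly 0
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.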

### Import and Apply PCA
Notice that the code below passes .95 as the number-of-components parameter (n_components). This tells scikit-learn to choose the minimum number of principal components such that 95% of the variance is retained.
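
As a rough illustration of what a fractional n_components does, the sketch below fits a full PCA on synthetic data (an assumption; it is not MNIST), finds by hand the smallest number of components whose cumulative explained variance exceeds 0.95, and compares that with what `PCA(n_components=0.95)` selects. The names `k_manual` and `pca_95` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 500 samples, 20 features with steadily decreasing spread.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)) * np.linspace(5.0, 0.1, 20)

# Fit PCA with all components and accumulate the explained variance ratios.
full_pca = PCA().fit(X)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.95.
k_manual = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Let scikit-learn make the same choice from the fractional argument.
pca_95 = PCA(n_components=0.95).fit(X)

print(k_manual, pca_95.n_components_)
print(pca_95.explained_variance_ratio_.sum())
```

The two counts should agree, and the retained variance should be at least 0.95.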

In [None]:
from sklearn.decomposition import PCA
# Make an instance of the Model
pca = PCA(.95)

In [None]:
# you are fitting PCA on the training set only
pca.fit(train_img)

In [None]:
pca.n_components_ 

In [None]:
print(pca.explained_variance_ratio_)            # explained variance of each component
print(pca.explained_variance_ratio_.cumsum())   # cumulative sum

In [None]:
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)

total_var = pca.explained_variance_ratio_.sum() * 100

px.area(
    x=range(1, exp_var_cumul.shape[0] + 1),
    y=exp_var_cumul,
    title=f'Total Explained Variance: {total_var:.2f}%',
    labels={"x": "# Components", "y": "Explained Variance"}
)

In [None]:
# Apply the mapping (transform) to both the training set and the test set.
train_img = pca.transform(train_img)
test_img = pca.transform(test_img)

### Apply Logistic Regression to the Transformed Data

In [None]:
from sklearn.linear_model import LogisticRegression

logisticRegr = LogisticRegression(solver='lbfgs', max_iter=2000)
logisticRegr.fit(train_img, train_lbl)
logisticRegr.score(test_img, test_lbl)

### Timing of Fitting Logistic Regression after PCA

The point of this section of the tutorial is to show that you can use PCA to speed up the fitting of machine learning algorithms. The table below shows how long it took to fit logistic regression on [the author's](https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60) MacBook after applying PCA, retaining a different amount of variance each time.

<img src='https://miro.medium.com/max/576/1*xKUK0wLnLHAJYS1zbt-7wA.png'>
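
If you want to run a similar comparison yourself, the following is a minimal sketch rather than the author's benchmark. To stay fast it assumes scikit-learn's small bundled digits dataset (`load_digits`, 8 x 8 images with 64 features) instead of MNIST, and the variance levels are chosen for illustration. Timings depend on your machine and, on a dataset this small, may not show a clear speed-up.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Small stand-in dataset (assumption): 1,797 images with 64 features each.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/7.0, random_state=0
)

# Standardize using statistics from the training set only.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

results = {}
for variance in (1.0, 0.95, 0.90, 0.85):
    if variance < 1.0:
        # Keep the fewest components that retain this share of the variance.
        pca = PCA(variance).fit(X_train)
        train_feats = pca.transform(X_train)
        test_feats = pca.transform(X_test)
    else:
        # Baseline: all original features, no PCA.
        train_feats, test_feats = X_train, X_test

    model = LogisticRegression(solver="lbfgs", max_iter=2000)
    start = time.perf_counter()
    model.fit(train_feats, y_train)
    elapsed = time.perf_counter() - start

    accuracy = model.score(test_feats, y_test)
    results[variance] = (train_feats.shape[1], elapsed, accuracy)
    print(f"variance={variance:.2f}  components={train_feats.shape[1]:3d}  "
          f"fit_time={elapsed:.3f}s  accuracy={accuracy:.3f}")
```

Only the call to `fit` is timed, so the cost of computing the PCA itself is not included in the reported numbers.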