<a href="https://colab.research.google.com/github/rahiakela/machine-learning-research-and-practice/blob/main/hands-on-machine-learning-with-scikit-learn-keras-and-tensorflow/8-dimensionality-reduction/01_pca_fundamentals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Principal Component Analysis Fundamentals

**Principal Component Analysis** (PCA) is by far the most popular dimensionality reduction
algorithm. First it identifies the hyperplane that lies closest to the data, and then
it projects the data onto it.

<img src='https://github.com/rahiakela/machine-learning-research-and-practice/blob/main/hands-on-machine-learning-with-scikit-learn-keras-and-tensorflow/8-dimensionality-reduction/images/0.png?raw=1' width='600'/>

**Preserving the Variance**

Before you can project the training set onto a lower-dimensional hyperplane, you
first need to choose the right hyperplane.

For example, a simple 2D dataset is represented on the left in the figure below,
along with three different axes (i.e., 1D hyperplanes).
On the right is the result of projecting the dataset onto each of these axes.

<img src='https://github.com/rahiakela/machine-learning-research-and-practice/blob/main/hands-on-machine-learning-with-scikit-learn-keras-and-tensorflow/8-dimensionality-reduction/images/1.png?raw=1' width='600'/>

As you can see, the projection onto the solid line preserves the maximum variance, while
the projection onto the dotted line preserves very little variance and the projection
onto the dashed line preserves an intermediate amount of variance.

It seems reasonable to select the axis that preserves the maximum amount of variance,
as it will most likely lose less information than the other projections. Another
way to justify this choice is that it is the axis that minimizes the mean squared distance
between the original dataset and its projection onto that axis. This is the rather
simple idea behind [**PCA**](https://www.tandfonline.com/doi/pdf/10.1080/14786440109462720).
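The link between the two justifications can be checked numerically. The sketch below is only an illustration on a made-up 2D dataset (`X_toy` and the three candidate angles are assumptions, not the data from the figure). For each axis it computes the variance preserved by the projection and the mean squared distance between the points and their projections. For centered data the two quantities always add up to the total variance, so the axis that maximizes one necessarily minimizes the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D dataset: wide along one direction, narrow along the other,
# then rotated by 30 degrees and centered.
stretched = rng.normal(size=(200, 2)) * np.array([2.0, 0.3])
t = np.deg2rad(30)
rotation = np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])
X_toy = stretched @ rotation.T
X_toy = X_toy - X_toy.mean(axis=0)

# Total variance = mean squared distance of the centered points from the origin.
total_variance = (X_toy ** 2).sum(axis=1).mean()

results = []
for angle_deg in (30, 75, 120):
    a = np.deg2rad(angle_deg)
    u = np.array([np.cos(a), np.sin(a)])      # unit vector along the candidate axis
    z = X_toy @ u                             # 1D coordinates of the projections
    preserved = z.var()                       # variance kept by the projection
    residual = X_toy - np.outer(z, u)         # part of each point the projection discards
    mse = (residual ** 2).sum(axis=1).mean()  # mean squared distance to the axis
    results.append((angle_deg, preserved, mse))
    print(f"{angle_deg:>3} deg: variance={preserved:.3f}  mse={mse:.3f}  "
          f"sum={preserved + mse:.3f}  total={total_variance:.3f}")
```

The 30 degree axis matches the direction along which the toy data was stretched, so it should show the largest preserved variance and the smallest mean squared distance.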













## Setup

In [1]:
# Common imports
import numpy as np
import os

from sklearn.decomposition import PCA

from sklearn.datasets import make_swiss_roll


# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.patches import FancyArrowPatch
from mpl_toolkits.mplot3d import proj3d
from mpl_toolkits.mplot3d import Axes3D

mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

Let's build a 3D dataset.

In [2]:
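np.random.seed(4)
m = 60               # number of samples
w1, w2 = 0.1, 0.3    # weights that tie the third feature to the first two
noise = 0.1          # scale of the Gaussian noise

# sample angles along an arc, then place noisy points on an oval in the first two features
angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
# the third feature is a noisy linear combination of the first two,
# so the dataset lies close to a plane
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)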

## Principal Components

PCA identifies the axis that accounts for the largest amount of variance in the training set.

It also finds a second axis, orthogonal to the first one, that accounts for the largest
amount of remaining variance. In the 2D example above there is no choice: it is the dotted line.
If it were a higher-dimensional dataset, PCA would also find a third axis, orthogonal to
both previous axes, and a fourth, a fifth, and so on: as many axes as the number of
dimensions in the dataset.

The $i^{th}$ axis is called the $i^{th}$ principal component (PC) of the data.

- The first PC is the axis on which vector $c_1$ lies
- The second PC is the axis on which vector $c_2$ lies

For a 3D dataset that lies close to a plane, such as the one built above, the first two PCs
are orthogonal axes lying in that plane, and the third PC is the axis orthogonal to it.

So how can you find the principal components of a training set?

Luckily, there is a standard matrix factorization technique called **Singular Value Decomposition** (SVD)
that can decompose the training set matrix $X$ into the matrix multiplication of three
matrices $U \Sigma V^T$, where the columns of $V$ are the unit vectors that define all the
principal components that we are looking for.

$
\mathbf{V} =
\begin{pmatrix}
  \mid & \mid & & \mid \\
  \mathbf{c_1} & \mathbf{c_2} & \cdots & \mathbf{c_n} \\
  \mid & \mid & & \mid
\end{pmatrix}
$
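The properties of this factorization are easy to check numerically. The sketch below uses a made-up matrix (`X_demo` is an assumption, not the notebook's dataset) and passes `full_matrices=False` so that the three factors can be multiplied back together directly. It checks that the product rebuilds the centered matrix, that the columns of $V$ are orthonormal unit vectors, and that the singular values are returned in decreasing order.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for a training set (60 samples, 3 features).
X_demo = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 3))
X_demo_centered = X_demo - X_demo.mean(axis=0)

# full_matrices=False returns the compact factors:
# U is (60, 3), s is (3,), Vt is (3, 3).
U, s, Vt = np.linalg.svd(X_demo_centered, full_matrices=False)
V = Vt.T

rebuilds = np.allclose(U @ np.diag(s) @ Vt, X_demo_centered)  # U Sigma V^T gives back the matrix
orthonormal = np.allclose(V.T @ V, np.eye(3))                 # columns of V are orthonormal
sorted_desc = bool(np.all(np.diff(s) <= 0))                   # singular values, largest first

print(rebuilds, orthonormal, sorted_desc)
```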

The following Python code uses NumPy’s `svd()` function to obtain all the principal
components of the training set, then extracts the two unit vectors that define the first
two PCs:

In [3]:
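# don’t forget to center the data first
X_centered = X - X.mean(axis=0)

# the columns of V (rows of Vt) are the principal components,
# ordered by decreasing singular value
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt.T[:, 0]  # first principal component
c2 = Vt.T[:, 1]  # second principal component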

PCA assumes that the dataset is centered around the origin.

If you implement PCA yourself, or if you use other libraries, don’t forget to center the data first.
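To see why centering matters, here is a small sketch on made-up data (`X_off` is an assumption chosen to exaggerate the effect). The points vary mostly along the second feature, but the whole cloud sits far from the origin along the first feature. Without centering, the first right singular vector mostly points toward the cloud's mean rather than along its direction of greatest spread.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2D data: large offset but tiny spread along x,
# no offset but large spread along y.
X_off = np.column_stack([
    100.0 + 0.1 * rng.normal(size=200),
    1.0 * rng.normal(size=200),
])

_, _, Vt_raw = np.linalg.svd(X_off)                             # no centering
_, _, Vt_centered = np.linalg.svd(X_off - X_off.mean(axis=0))   # centered first

first_raw = Vt_raw[0]
first_centered = Vt_centered[0]
print("without centering:", np.round(first_raw, 3))
print("with centering:   ", np.round(first_centered, 3))
```

The sign of each vector is arbitrary, so only the axis it lies on is meaningful here.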

In [4]:
c1

array([0.93636116, 0.29854881, 0.18465208])

In [5]:
c2

array([-0.34027485,  0.90119108,  0.2684542 ])

In [6]:
c3 = Vt.T[:, 2]
c3

array([-0.08626012, -0.31420255,  0.94542898])

## Projecting Down to d Dimensions

Once you have identified all the principal components, you can reduce the dimensionality
of the dataset down to `d` dimensions by projecting it onto the hyperplane
defined by the first `d` principal components. Selecting this hyperplane ensures that the
projection will preserve as much variance as possible.

To project the training set onto the hyperplane and obtain a reduced dataset $X_{d\text{-proj}}$ of
dimensionality `d`, compute the matrix multiplication of the (centered) training set matrix $X$ by
the matrix $W_d$, defined as the matrix containing the first `d` columns of $V$.

$
\mathbf{X}_{d\text{-proj}} = \mathbf{X} \mathbf{W}_d
$

The following Python code projects the training set onto the plane defined by the first
two principal components:

In [8]:
# W2 contains the first 2 columns of V
W2 = Vt.T[:, :2]
# project the centered training set onto the plane spanned by those columns
# to obtain a reduced dataset of dimensionality 2
X2D = X_centered.dot(W2)

In [9]:
X2D.shape  # reduced 

(60, 2)

In [10]:
X.shape  # original

(60, 3)

You now know how to reduce the dimensionality of any dataset
down to any number of dimensions, while preserving as much variance as possible.
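The steps above can be wrapped into a small helper. This is a minimal sketch, not a replacement for a library implementation: the function name `project_onto_first_pcs` and the data `X_demo` are hypothetical. It also reports the fraction of the total variance that survives the projection.

```python
import numpy as np

def project_onto_first_pcs(X, d):
    """Center X, then project it onto its first d principal components."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W_d = Vt.T[:, :d]              # first d columns of V
    return X_centered @ W_d

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 3))  # hypothetical data

X_proj = project_onto_first_pcs(X_demo, d=2)
print(X_proj.shape)

# Fraction of the total variance that survives the projection.
preserved = X_proj.var(axis=0).sum() / X_demo.var(axis=0).sum()
print(f"preserved variance: {preserved:.1%}")
```

A side effect worth noting is that the projected features are uncorrelated with each other, because each one is measured along a different orthogonal principal axis.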

## PCA using Scikit-Learn

Scikit-Learn’s `PCA` class uses SVD to implement PCA.

The following code applies PCA to reduce the dimensionality
of the dataset down to two dimensions (note that it automatically takes care of centering
the data):

In [11]:
pca = PCA(n_components=2)
X2D = pca.fit_transform(X)
X2D.shape

(60, 2)

After fitting the PCA transformer to the dataset, its `components_` attribute holds the
transpose of $W_d$ (e.g., the unit vector that defines the first principal component is
equal to `pca.components_.T[:, 0]`).

In [12]:
pca.components_

array([[-0.93636116, -0.29854881, -0.18465208],
       [ 0.34027485, -0.90119108, -0.2684542 ]])

In [13]:
pca.components_.T[:, 0]  # c1 = first principal component

array([-0.93636116, -0.29854881, -0.18465208])

In [14]:
pca.components_.T[:, 1]  # c2 = second principal component

array([ 0.34027485, -0.90119108, -0.2684542 ])
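These vectors have the same magnitudes as `c1` and `c2` obtained from `np.linalg.svd()` earlier, but with opposite signs. That is expected: a principal component defines an axis, and both a unit vector and its negative lie on the same axis. The sketch below, run on made-up data (`X_demo` is an assumption), compares the two approaches up to sign.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 3))  # hypothetical data

# Principal components from a plain SVD of the centered data.
_, _, Vt = np.linalg.svd(X_demo - X_demo.mean(axis=0))

# Principal components from Scikit-Learn.
pca = PCA(n_components=2).fit(X_demo)

# Dot product of matching unit vectors: +1 means same direction, -1 means flipped.
alignment = np.sum(Vt[:2] * pca.components_, axis=1)
print(np.round(alignment, 6))
```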

### Explained Variance Ratio

Another useful piece of information is the explained variance ratio of each principal
component, available via the `explained_variance_ratio_` attribute. The ratio indicates
the proportion of the dataset’s variance that lies along each principal component.

For example, let’s look at the explained variance ratios of the first two
components of the 3D dataset.

In [15]:
pca.explained_variance_ratio_

array([0.84248607, 0.14631839])

This output tells you that 84.2% of the dataset’s variance lies along the first PC, and
14.6% lies along the second PC. This leaves about 1.1% for the third PC, so it is
reasonable to assume that the third PC probably carries little information.

In [16]:
pca.explained_variance_

array([0.77830975, 0.1351726 ])
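The `explained_variance_` attribute holds the variance along each principal component, while `explained_variance_ratio_` divides it by the total variance. Both can be recovered from the singular values of the centered data. The following sketch checks this on made-up data (`X_demo` is an assumption), using the sample variance with an `m - 1` denominator.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X_demo = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 3))  # hypothetical data
m = X_demo.shape[0]

_, s, _ = np.linalg.svd(X_demo - X_demo.mean(axis=0))

explained_variance = s ** 2 / (m - 1)        # variance along each principal component
explained_variance_ratio = explained_variance / explained_variance.sum()

pca = PCA().fit(X_demo)
same_variance = np.allclose(explained_variance, pca.explained_variance_)
same_ratio = np.allclose(explained_variance_ratio, pca.explained_variance_ratio_)
print(same_variance, same_ratio)
```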

### Choosing the Right Number of Dimensions

Instead of arbitrarily choosing the number of dimensions to reduce down to, it is
simpler to choose the number of dimensions that add up to a sufficiently large portion
of the variance (e.g., 95%). The exception is dimensionality reduction for data
visualization: in that case you will want to reduce the dimensionality down to 2 or 3.

The following code performs PCA without reducing dimensionality, then computes
the minimum number of dimensions required to preserve 95% of the training set’s
variance:

In [18]:
pca = PCA()
pca.fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
d

2
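The `np.argmax(cumsum >= 0.95) + 1` expression is compact, so here is a step-by-step sketch of what it does. The first two ratios are copied from the output shown earlier. The third is an assumption: it is taken to be the remainder, since the ratios of all components sum to 1.

```python
import numpy as np

# First two ratios come from the earlier output; the third is assumed to be
# whatever is left, because the ratios of all components sum to 1.
ratios = np.array([0.84248607, 0.14631839])
ratios = np.append(ratios, 1.0 - ratios.sum())

cumsum = np.cumsum(ratios)      # running total of explained variance
reached = cumsum >= 0.95        # boolean mask, True once the target is met
d = np.argmax(reached) + 1      # argmax gives the index of the first True; +1 turns it into a count
print(cumsum, reached, d)
```

The first component alone explains about 84% of the variance, which is below the target, while the first two together explain about 99%. The mask is therefore False, True, True, and `d` is 2. Note that `np.argmax` also returns 0 when no entry is True, so this idiom relies on the cumulative sum eventually reaching the threshold.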

You could then set `n_components=d` and run PCA again. But there is a much better
option: instead of specifying the number of principal components you want to preserve,
you can set `n_components` to be a float between 0.0 and 1.0, indicating the ratio
of variance you wish to preserve:

In [20]:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
X_reduced.shape

(60, 2)
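When `n_components` is given as a float, the number of components that were actually kept can be read back from the fitted estimator's `n_components_` attribute. The sketch below shows this on made-up 10-dimensional data (`X_demo` is an assumption).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(11)
X_demo = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))  # hypothetical data

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_demo)

kept = pca.n_components_                             # number of components actually kept
total_ratio = pca.explained_variance_ratio_.sum()    # variance explained by the kept components
print(kept, X_reduced.shape, round(total_ratio, 4))
```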

Yet another option is to plot the explained variance as a function of the number of
dimensions.

<img src='https://github.com/rahiakela/machine-learning-research-and-practice/blob/main/hands-on-machine-learning-with-scikit-learn-keras-and-tensorflow/8-dimensionality-reduction/images/2.png?raw=1' width='600'/>

There will usually be an elbow in the curve, where the explained variance stops growing fast.
In the plot above, which comes from a dataset with far more dimensions than the 3D example
used in this notebook, you can see that reducing the dimensionality down to about
100 dimensions wouldn’t lose too much explained variance.
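A plot of this kind can be produced by drawing the cumulative sum of the explained variance ratios. The sketch below is an illustration only. It uses Scikit-Learn's bundled 8x8 digits dataset (64 features) as a stand-in, which is an assumption and not the dataset behind the figure above, so the curve and the elbow position will differ.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: the bundled digits dataset stands in for a higher-dimensional dataset.
X_digits = load_digits().data

pca = PCA().fit(X_digits)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1             # smallest number of dimensions reaching 95%

dims = np.arange(1, len(cumsum) + 1)
plt.figure(figsize=(6, 4))
plt.plot(dims, cumsum, linewidth=2)
plt.axhline(0.95, color="k", linestyle=":")   # 95% target
plt.axvline(d, color="k", linestyle=":")      # number of dimensions that reaches it
plt.xlabel("Dimensions")
plt.ylabel("Explained variance")
plt.grid(True)
plt.show()
```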

## PCA for Compression