# Dimensionality Reduction: PCA

Dimensionality reduction is the task of deriving a set of new
artificial features that is smaller than the original feature
set while retaining most of the variance of the original data.
Here we'll use a common and powerful dimensionality reduction
technique called Principal Component Analysis (PCA), and
apply it to the iris dataset that we saw before.

## Iris dataset

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

PCA builds the new features as linear combinations of the original
ones. It computes a truncated Singular Value Decomposition (SVD) of
the centered data matrix X and projects the data onto the basis
formed by the top singular vectors. If the number of retained
components is 2 or 3, PCA can be used to visualize the dataset.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2, whiten=True)
pca.fit(X)
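
To make the SVD description above concrete, here is a minimal sketch that reproduces the projection with plain NumPy. It assumes no whitening, and the names `X_centered`, `components`, `X_projected` and `pca_ref` are illustrative only. It is not a copy of scikit-learn's internal implementation, which may use a different solver.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Center the data: PCA works on deviations from the per-feature mean.
X_centered = X - X.mean(axis=0)

# Thin SVD: X_centered = U @ diag(S) @ Vt. The rows of Vt are the
# right singular vectors, sorted by decreasing singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Truncate: keep only the top singular vectors and project onto them.
n_components = 2
components = Vt[:n_components]
X_projected = X_centered @ components.T

# Compare with scikit-learn (whitening disabled). The sign of each
# singular vector is arbitrary, so compare absolute values.
pca_ref = PCA(n_components=n_components).fit(X)
same_axes = np.allclose(np.abs(components), np.abs(pca_ref.components_))
same_projection = np.allclose(np.abs(X_projected),
                              np.abs(pca_ref.transform(X)))
print(same_axes, same_projection)
```

Comparing absolute values sidesteps the sign ambiguity of the SVD: flipping a singular vector flips the corresponding projected coordinate but describes the same axis.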

Once fitted, the `pca` model exposes the top singular vectors (the principal axes) in the `components_` attribute:

In [None]:
pca.components_

The components have some useful properties, and other attributes are available as well:

### unit vector

In [None]:
print("|v1| = %.5f" % np.sqrt(np.power(pca.components_[0], 2).sum()))

### orthogonal vectors

In [None]:
print("inner product of v1 and v2 = %.5f" % np.dot(pca.components_[0], pca.components_[1]))

### explained_variance_ratio_

In [None]:
pca.explained_variance_ratio_

In [None]:
pca.explained_variance_ratio_.sum()
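
As a rough illustration of where these ratios come from, the sketch below derives them from the eigenvalues of the sample covariance matrix and checks them against the fitted model. The variable names (`eigenvalues`, `ratio_manual`, `pca_ref`) are illustrative, not part of the scikit-learn API.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# The eigenvalues of the sample covariance matrix are the variances
# along the principal axes. eigvalsh returns them in ascending order,
# so reverse to get the largest first.
eigenvalues = np.linalg.eigvalsh(np.cov(X.T))[::-1]

# Fraction of the total variance carried by each axis.
ratio_manual = eigenvalues / eigenvalues.sum()

# The first two ratios should match the fitted two-component model.
pca_ref = PCA(n_components=2).fit(X)
matches = np.allclose(ratio_manual[:2], pca_ref.explained_variance_ratio_)
print(ratio_manual, matches)
```

The ratios over all four axes sum to 1, so the sum over the two retained components tells us how much of the total variance the 2D projection keeps.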

Let us project the iris dataset along those first two dimensions:

In [None]:
X_pca = pca.transform(X)

Because we passed `whiten=True`, PCA centers the data and rescales
each component, which means that the projected data now has zero mean
and unit variance on both components:

In [None]:
np.round(X_pca.mean(axis=0), decimals=5)

In [None]:
# PCA scales by the sample variance (n - 1 denominator), hence ddof=1
np.round(X_pca.std(axis=0, ddof=1), decimals=5)

Furthermore, the sample components no longer carry any linear correlation:

In [None]:
np.corrcoef(X_pca.T)
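
To show what `whiten=True` does, the following sketch applies the two steps by hand: project the centered data onto the principal axes, then divide each coordinate by the standard deviation along its axis. The names `pca_w`, `scores` and `X_white` are illustrative. This mirrors the behavior of `transform` as far as the fitted attributes allow, and is not a copy of the library code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca_w = PCA(n_components=2, whiten=True).fit(X)

# Step 1: project the centered data onto the principal axes.
scores = (X - pca_w.mean_) @ pca_w.components_.T

# Step 2: rescale each axis by its standard deviation.
X_white = scores / np.sqrt(pca_w.explained_variance_)

# The manual result should agree with the model's own transform.
matches = np.allclose(X_white, pca_w.transform(X))

# explained_variance_ uses the n - 1 denominator, so the sample
# standard deviation (ddof=1) of the whitened data is 1.
std_sample = X_white.std(axis=0, ddof=1)
print(matches, std_sample)
```

Note that `X_white.std(axis=0)` with the default `ddof=0` comes out slightly below 1 for the same reason.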

The fitted model can also estimate the covariance matrix of the data
and its inverse, the precision matrix. We can compare the estimate
with the empirical covariance of the original features:

In [None]:
pca.get_covariance()

In [None]:
pca.get_precision()

In [None]:
np.cov(X.T)
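
The matrices above are related in a way that is easy to check numerically. The sketch below, using the illustrative names `pca_2` and `pca_full`, verifies that the precision matrix is the inverse of the model covariance, and that the model covariance only reproduces the empirical one when all components are kept.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
n_features = X.shape[1]
empirical = np.cov(X.T)

pca_2 = PCA(n_components=2).fit(X)              # truncated model
pca_full = PCA(n_components=n_features).fit(X)  # keeps every component

# The precision matrix is the inverse of the model covariance.
is_inverse = np.allclose(pca_2.get_precision() @ pca_2.get_covariance(),
                         np.eye(n_features))

# With all components kept, the model covariance equals the
# empirical covariance of the data.
full_matches = np.allclose(pca_full.get_covariance(), empirical)

# With only two components, it is an approximation.
max_abs_error_2 = np.abs(pca_2.get_covariance() - empirical).max()
print(is_inverse, full_matches, max_abs_error_2)
```

The truncated model spreads the variance of the discarded directions evenly as noise, which is why its covariance differs from `np.cov(X.T)`.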

Now we can visualize the results using the following utility function:

In [None]:
from itertools import cycle

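def plot_PCA_2D(data, target, target_names):
    """Scatter-plot the first two columns of data, one color per class."""
    # White ('w') is left out because it is invisible on a white background
    colors = cycle('rgbcmyk')
    target_ids = range(len(target_names))
    plt.figure()
    for i, c, label in zip(target_ids, colors, target_names):
        # Select the samples of class i with a boolean mask
        plt.scatter(data[target == i, 0], data[target == i, 1],
                    color=c, label=label)
    plt.legend()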

Calling this function on our data produces the plot:

In [None]:
plot_PCA_2D(X_pca, iris.target, iris.target_names)

## S curve

In [None]:
from sklearn.datasets import make_s_curve
X, y = make_s_curve(n_samples=1000)

# Importing Axes3D registers the '3d' projection on older matplotlib versions
from mpl_toolkits.mplot3d import Axes3D
ax = plt.axes(projection='3d')

ax.scatter3D(X[:, 0], X[:, 1], X[:, 2], c=y)
ax.view_init(10, -60)

This is a 2-dimensional manifold embedded in three dimensions, but the
embedding is nonlinear, so a linear method such as PCA cannot discover
the underlying data orientation:

In [None]:
X_pca = PCA(n_components=2).fit_transform(X)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)

Manifold learning algorithms, available in the `sklearn.manifold`
submodule, are however able to recover the underlying 2-dimensional manifold.
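
As one possible illustration, the sketch below unrolls the S curve with `Isomap`, one of the estimators in `sklearn.manifold`. The choice of Isomap, `n_neighbors=15` and `random_state=0` are assumptions made for this example, not settings taken from the text above. Other manifold learners could be used instead.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# t is the position of each sample along the curve.
X_s, t = make_s_curve(n_samples=1000, random_state=0)

# Isomap approximates distances along the manifold with shortest
# paths in a nearest-neighbor graph, then embeds them in 2D.
iso = Isomap(n_neighbors=15, n_components=2)
X_iso = iso.fit_transform(X_s)

# If the curve has been unrolled, the first embedded coordinate
# should vary almost linearly with the position along the curve.
corr = abs(np.corrcoef(X_iso[:, 0], t)[0, 1])
print(X_iso.shape, corr)
```

Plotting the result with `plt.scatter(X_iso[:, 0], X_iso[:, 1], c=t)` should show the colors changing smoothly along one axis, in contrast to the PCA projection.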

## Digits dataset

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()

In [None]:
from sklearn.decomposition import PCA

plt.figure()
pca = PCA(n_components=2, whiten=True)
projection = pca.fit_transform(digits.data)
plt.scatter(projection[:, 0], projection[:, 1], c=digits.target)
plt.title(pca.__class__.__name__)
