# Dimensionality Reduction - Principal Component Analysis (PCA)

1. Introduction to Principal Component Analysis
2. Principal Component Analysis as dimensionality reduction
3. Example of Principal Component Analysis 

## 1. Introduction to Principal Component Analysis (PCA)

According to <i>Wikipedia</i>, PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components.
<br>
<br>
<br>
<b>Where can we apply PCA?</b>

<b>Data Visualization</b>: In any data-related problem, the challenge is the sheer volume of data and the number of variables (features) that define it. Solving such a problem requires extensive data exploration, such as finding out how the variables are correlated or understanding the distribution of a few variables. When the data is spread across a large number of variables or dimensions, visualizing it directly is difficult, if not impossible.

PCA helps here because it projects the data into a lower dimension, allowing us to visualize it in a 2D or 3D space.

<b>Speeding Up Machine Learning (ML) Algorithms</b>: Since PCA's main idea is dimensionality reduction, we can use it to reduce the training and testing time of a machine learning algorithm when the data has many features and learning is too slow.
<br>
<br>
<br>
<b>What is a Principal Component?</b>

Principal components are the key to PCA; they represent what is under the hood of our data. In layman's terms, when the data is projected from a higher-dimensional space into a lower dimension (say, three dimensions), those three dimensions are the three principal components that capture (or hold) most of the variance (information) of our data.

Principal components have both direction and magnitude. The direction indicates the principal axis along which the data is most spread out (has the most variance), and the magnitude is the amount of variance the principal component captures when the data is projected onto that axis. Each principal component is a straight line, and the first principal component holds the most variance in the data. Each subsequent principal component is orthogonal to all previous ones and captures less variance. In this way, given a set of x correlated variables over y samples, we obtain a set of uncorrelated principal components over the same y samples.

The reason we obtain uncorrelated principal components from the original features is that correlated features contribute to the same principal component. The original features are thereby reduced to uncorrelated principal components, each representing a different set of correlated features with a different amount of variation.

Each principal component represents a percentage of total variation captured from the data.
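
The properties described above (orthogonal directions, decreasing variance, uncorrelated components) can be checked with a minimal NumPy sketch. It assumes PCA is computed through an eigendecomposition of the covariance matrix, which is one standard formulation and not necessarily how any particular library implements it. The toy data and all names (`X_toy`, `mixing`, `scores`) are hypothetical and used only for illustration.

```python
import numpy as np

# Hypothetical toy data: 200 samples of 3 correlated features.
rng = np.random.RandomState(0)
latent = rng.randn(200, 2)
mixing = np.array([[2.0, 0.5, 1.0],
                   [0.0, 1.0, -0.5]])
X_toy = latent @ mixing + 0.1 * rng.randn(200, 3)

# 1. Center the data and compute the feature covariance matrix.
X_centered = X_toy - X_toy.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# 2. Eigendecompose the covariance matrix. eigh returns eigenvalues in
#    ascending order, so sort them in descending order of variance.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Project the centered data onto the principal axes.
scores = X_centered @ eigvecs

# Each eigenvalue is the variance captured along the matching axis.
explained_ratio = eigvals / eigvals.sum()
print("share of total variance per component:", explained_ratio)

# The projected columns are uncorrelated: their covariance is diagonal.
print(np.round(np.cov(scores, rowvar=False), 6))
```

The columns of `eigvecs` play the role of the principal axes, and the off-diagonal entries of the final covariance matrix are zero up to rounding, which is what "uncorrelated principal components" means in practice.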

The material is taken from https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
and https://www.datacamp.com/community/tutorials/principal-component-analysis-in-python

In [None]:
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Display all the columns of the dataframe
pd.set_option('display.max_columns', None)

Principal component analysis is a fast and flexible unsupervised method for dimensionality reduction in data. Its behavior is easiest to visualize by looking at a two-dimensional dataset. Consider the following plot of 200 points:

In [None]:
# Plot for 200 random points
rng = np.random.RandomState(1)

plt.figure(figsize = (5,5))  # Set Figure size 

X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal')
plt.xlim(-3,3)
plt.ylim(-3,3)
plt.show()

In principal component analysis, the relationship between `x` and `y` is quantified by finding a list of the principal axes in the data and using those axes to describe the dataset. Using Scikit-Learn's `PCA` estimator, we can compute this as follows:

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)

The fit learns some quantities from the data, most importantly the "components" and "explained variance":

In [None]:
print(pca.components_)

In [None]:
print(pca.explained_variance_)

From the above output, we can observe that the data has a variance of about 0.763 along `principal component 1` and only about 0.018 along `principal component 2`. These values are absolute variances, not percentages: as a share of the total, the first component accounts for roughly 98% of the variance. Also note that no information is lost here, because the data is two-dimensional and we kept both components.
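
To express the explained variance as a share of the total, scikit-learn also exposes `explained_variance_ratio_`. The following is a small sketch, assuming the same synthetic `X` as in the cells above (it is regenerated here so the snippet runs on its own):

```python
import numpy as np
from sklearn.decomposition import PCA

# Recreate the 200-point dataset used above.
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T

pca = PCA(n_components=2).fit(X)

# Absolute variance along each principal axis.
print("explained variance:      ", pca.explained_variance_)

# The same quantities divided by the total variance of the data.
print("explained variance ratio:", pca.explained_variance_ratio_)

# With both components kept for 2D data, the ratios sum to 1 (nothing is lost).
print("sum of ratios:           ", pca.explained_variance_ratio_.sum())
```

The ratios are the numbers to read as percentages of information; the raw `explained_variance_` values depend on the scale of the data.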

To see what these numbers mean, let's visualize them as vectors over the input data, using the "components" to define the direction of the vector, and the "explained variance" to define the squared-length of the vector:

In [None]:
def draw_vector(v0, v1, ax=None):
    ax = ax or plt.gca()
    arrowprops=dict(arrowstyle='->',
                    linewidth=2,
                    shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

# plot data
plt.figure(figsize = (5,5))  # Set Figure size 
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal')
plt.xlim(-3,3)
plt.ylim(-3,3)
plt.show()

These vectors represent the principal axes of the data, and the length of each vector indicates how "important" that axis is in describing the distribution of the data. More precisely, it is a measure of the variance of the data when projected onto that axis. The projections of each data point onto the principal axes are the "principal components" of the data.
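
The projection itself is just a centering step followed by a dot product with the principal axes. A minimal sketch, assuming the default `PCA` settings (no whitening) and the same synthetic `X` as above; the names `manual` and `projected` are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Recreate the 200-point dataset used above.
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
pca = PCA(n_components=2).fit(X)

# Manual projection: subtract the mean, then take dot products with each axis.
manual = (X - pca.mean_) @ pca.components_.T

# The estimator's own projection.
projected = pca.transform(X)
print("manual projection matches transform:", np.allclose(manual, projected))

# The sample variance of each projected column is the explained variance.
print(np.var(projected, axis=0, ddof=1), pca.explained_variance_)
```

Note that with `whiten=True`, as used in the plotting cell that follows, the projected columns are additionally rescaled to unit variance, so this equality holds only for the unwhitened transform.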

If we plot these principal components beside the original data, we see the plots shown here:

In [None]:
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
pca = PCA(n_components=2, whiten=True)
pca.fit(X)

fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)

# plot data
ax[0].scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v, ax=ax[0])
ax[0].axis('equal');
ax[0].set(xlabel='x', ylabel='y', title='input')

# plot principal components
X_pca = pca.transform(X)
ax[1].scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.2)
draw_vector([0, 0], [0, 3], ax=ax[1])
draw_vector([0, 0], [3, 0], ax=ax[1])
ax[1].axis('equal')
ax[1].set(xlabel='component 1', ylabel='component 2',
          title='principal components',
          xlim=(-5, 5), ylim=(-3, 3.1))

plt.show()

## 2. Principal Component Analysis (PCA) as dimensionality reduction

Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance.

Here is an example of using PCA as a dimensionality reduction transform:

In [None]:
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape:   ", X.shape)
print("transformed shape:", X_pca.shape)

The transformed data has been reduced to a single dimension. To understand the effect of this dimensionality reduction, we can perform the inverse transform of this reduced data and plot it along with the original data:

In [None]:
X_new = pca.inverse_transform(X_pca)
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.8, color='blue')
plt.xlim(-3,3)
plt.ylim(-3,3)
plt.axis('equal');

The light points are the original data, while the dark points are the projected version. This makes clear what a PCA dimensionality reduction means: the information along the least important principal axis or axes is removed, leaving only the component(s) of the data with the highest variance. The fraction of variance that is cut out (proportional to the spread of points about the line formed in this figure) is roughly a measure of how much "information" is discarded in this reduction of dimensionality.

This reduced-dimension dataset is in some senses "good enough" to encode the most important relationships between the points: despite reducing the dimension of the data by 50%, the overall relationship between the data points is mostly preserved.
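
The link between discarded variance and lost information can be made concrete. The sketch below, which assumes the same synthetic `X` as above, compares the reconstruction error of the one-component projection with the variance of the dropped second component; `residual` and `discarded` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA

# Recreate the 200-point dataset used above.
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T

# Fit once with both components (for reference) and once with a single one.
pca_full = PCA(n_components=2).fit(X)
pca_1d = PCA(n_components=1).fit(X)

# Project to 1D and map back to the original 2D space.
X_new = pca_1d.inverse_transform(pca_1d.transform(X))

# Squared reconstruction error, normalized by n - 1 to match the
# sample-variance convention used by explained_variance_.
residual = np.sum((X - X_new) ** 2) / (X.shape[0] - 1)
discarded = pca_full.explained_variance_[1]

print("reconstruction error:           ", residual)
print("variance of discarded component:", discarded)
```

The two numbers agree, so the spread of the original points around the projected line is exactly the variance that was cut out.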

## 3. Example of Principal Component Analysis (PCA)

We are going to use the Breast Cancer dataset from the scikit-learn library.

The Breast Cancer dataset is a real-valued multivariate dataset that consists of two classes, where each class signifies whether a patient has breast cancer or not. The two categories are malignant and benign.

The malignant class has 212 samples, whereas the benign class has 357 samples.

It has 30 features shared across all classes: radius, texture, perimeter, area, smoothness, fractal dimension, etc.

In [None]:
# Breast Cancer Data Exploration
from sklearn.datasets import load_breast_cancer

`load_breast_cancer` will give us both labels and the data. To fetch the data, we will call `.data` and for fetching the labels `.target`.

The data has <b>569 samples</b> with <b>30 features</b>, and each sample has a label associated with it. There are two labels in this dataset.

In [None]:
# Get both data and target
breast = load_breast_cancer()

In [None]:
# Get data only
breast_data = breast.data
print('breast_data shape:',breast_data.shape)
breast_data

In [None]:
# Take a look at the target values (used later to color the plot)
breast_labels = breast.target
print('breast_labels shape:',breast_labels.shape)
breast_labels

In [None]:
# Get the feature names
features = breast.feature_names
features

In [None]:
# Convert to DataFrame
breast_dataset = pd.DataFrame(breast_data)
breast_dataset.columns = features
breast_dataset

In [None]:
# Add Label column
breast_dataset['label'] = breast_labels  # 1-D array with one label per row; no transpose is needed
breast_dataset

In [None]:
# The original labels are in 0/1 format. In scikit-learn's encoding, 0 is malignant
# and 1 is benign (see breast.target_names), so we map the integers to class names.
breast_dataset['label'] = breast_dataset['label'].map({0: 'Malignant', 1: 'Benign'})
breast_dataset.tail()

Now comes the most exciting part of this tutorial. As we learned earlier, PCA projects high-dimensional data onto a small number of principal components; now is the time to visualize that with the help of Python!

To visualize the Breast Cancer data, we start by standardizing the data, since PCA's output is influenced by the scale of the features.

It is a common practice to standardize data before feeding it to many machine learning algorithms.

To apply standardization, we will import the `StandardScaler` class from the sklearn library and select only the features from the `breast_dataset` we created. Once we have the features, we will apply scaling by calling `fit_transform` on the feature data.

`StandardScaler` rescales each feature to a mean of zero and a standard deviation of one. It does not require the features to be normally distributed, and it does not change the shape of their distributions.
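
What the scaler does can be reproduced by hand. A minimal sketch, assuming the default `StandardScaler` settings (which use the population standard deviation, `ddof=0`); the names `raw`, `scaled`, and `manual` are illustrative:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

raw = load_breast_cancer().data

# Scaling done by scikit-learn.
scaled = StandardScaler().fit_transform(raw)

# The same operation by hand: subtract each column's mean and
# divide by each column's (population) standard deviation.
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print("matches StandardScaler:", np.allclose(scaled, manual))
print("per-feature means ~ 0: ", np.allclose(scaled.mean(axis=0), 0))
print("per-feature stds ~ 1:  ", np.allclose(scaled.std(axis=0), 1))
```

Checking the mean and standard deviation per feature (with `axis=0`) is a stricter test than computing them over the whole array, because it confirms that every column was rescaled.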

In [None]:
from sklearn.preprocessing import StandardScaler
x = breast_dataset.loc[:, features].values
x = StandardScaler().fit_transform(x) # normalizing the features

In [None]:
x.shape

In [None]:
# Let's check whether the normalized data has a mean of zero and a standard deviation of one.
np.mean(x),np.std(x)

In [None]:
# Let's convert the normalized features into a tabular format with the help of DataFrame.
feat_cols = ['feature' + str(i) for i in range(x.shape[1])]
normalised_breast = pd.DataFrame(x,columns=feat_cols)
normalised_breast.tail()

Now comes the critical part, the next few lines of code will be projecting the thirty-dimensional Breast Cancer data to two-dimensional `principal components`.

We will import the `PCA` class from the sklearn library, pass the number of components `(n_components=2)` to its constructor, and finally call `fit_transform` on the standardized data. Here, the number of components is the dimension of the lower-dimensional space onto which we project our higher-dimensional data.

In [None]:
from sklearn.decomposition import PCA
pca_breast = PCA(n_components=2)
principalComponents_breast = pca_breast.fit_transform(x)

In [None]:
# Next, let's create a DataFrame that will have the principal component values for all 569 samples.
principal_breast_Df = pd.DataFrame(data = principalComponents_breast
             , columns = ['principal component 1', 'principal component 2'])
principal_breast_Df.tail()

Once we have the principal components, we can inspect `explained_variance_ratio_`. It gives the fraction of the total variance (information) that each principal component holds after projecting the data onto a lower-dimensional subspace.

In [None]:
print('Explained variation per principal component: {}'.format(pca_breast.explained_variance_ratio_))

From the above output, we can observe that `principal component 1` holds 44.2% of the information while `principal component 2` holds only 19% of the information. Note also that projecting the thirty-dimensional data onto two dimensions lost 36.8% of the information.
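
A common follow-up question is how many components are needed to retain a given share of the variance. The sketch below is one way to answer it, using the cumulative sum of `explained_variance_ratio_`. The 95% threshold is an arbitrary choice for illustration, and the names `x_scaled`, `cumulative`, and `n_for_95` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the 30 features, as in the cells above.
x_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit PCA with all components and accumulate the variance ratios.
pca_all = PCA().fit(x_scaled)
cumulative = np.cumsum(pca_all.explained_variance_ratio_)
print("cumulative variance, first 5 components:", cumulative[:5])

# Smallest number of components whose cumulative share reaches 95%.
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("components needed for 95% of the variance:", n_for_95)

# scikit-learn can make the same choice when n_components is a float in (0, 1).
pca_95 = PCA(n_components=0.95).fit(x_scaled)
print("components chosen by PCA(n_components=0.95):", pca_95.n_components_)
```

The second entry of `cumulative` is the share retained by the two-component projection used in this tutorial.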

Let's plot the visualization of the 569 samples along the `principal component - 1` and `principal component - 2` axis. It should give us good insight into how our samples are distributed among the two classes.

In [None]:
plt.figure(figsize=(10,10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1',fontsize=20)
plt.ylabel('Principal Component - 2',fontsize=20)
plt.title("Principal Component Analysis of Breast Cancer Dataset",fontsize=20)
targets = ['Benign', 'Malignant']
colors = ['r', 'g']
for target, color in zip(targets,colors):
    indicesToKeep = breast_dataset['label'] == target
    plt.scatter(principal_breast_Df.loc[indicesToKeep, 'principal component 1']
               , principal_breast_Df.loc[indicesToKeep, 'principal component 2'], c = color, s = 50)

plt.legend(targets,prop={'size': 15})
plt.show()

From the above graph, we can observe that the two classes, `benign` and `malignant`, are linearly separable to some extent when projected onto a two-dimensional space. Another observation is that the malignant samples (target 0 in scikit-learn's encoding) are more spread out than the benign samples (target 1).
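
One way to put a number on "linearly separable to some extent" is to train a linear classifier on the two principal components alone. The following is a rough sketch, not part of the original analysis: the choice of logistic regression, 5-fold cross-validation, and the names `model` and `scores` are all assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Standardize, project onto 2 principal components, then fit a linear model.
# Wrapping the steps in a pipeline refits the scaler and PCA inside each fold,
# so no information from the held-out fold leaks into the projection.
model = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())

# Accuracy on 5 cross-validation folds.
scores = cross_val_score(model, data.data, data.target, cv=5)
print("accuracy per fold:", scores)
print("mean accuracy:    ", scores.mean())
```

A high mean accuracy would indicate that a straight line in the two-dimensional projection already separates most of the samples, even though more than a third of the variance was discarded.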