<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Principal Component Analysis

_Authors: Justin Pounders, Matt Brems, Noelle Brown, Patrick Wales-Dinan_

### LEARNING OBJECTIVES
By the end of the lesson, students should be able to:
1. Differentiate between feature selection and feature extraction.
2. Describe the PCA algorithm.
3. Implement PCA in `scikit-learn`.
4. Calculate and interpret proportion of explained variance.
5. Identify use cases for PCA.

In [None]:
# Import our libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import from sklearn.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Set a random seed.
np.random.seed(42)

### Introduction of Problem

Today, we're going to be using the [wine quality](http://www3.dsi.uminho.pt/pcortez/wine/) dataset by Cortez, Cerdeira, Almeida, Matos and Reis.

Specifically, we are going to use physicochemical properties of the wine in order to predict the quality of the wine.

In [None]:
# Read in the wine quality datasets.


# Stack datasets together. (They have the same column names!)


# Check out head of our dataframe.


### Fit a multiple linear regression model in `sklearn`.

In [None]:
# Set y to be the quality column.


# Set X as all other columns.


# How much missing data do we have?


In [None]:
# To show off the strength of PCA, we're going to make many, many more features.


# Fit and transform our X data using Polynomial Features.


# How many features do we have now?


# How many features did we start out with?


In [None]:
# Train/test split our data.


In [None]:
# Instantiate and fit a linear regression model.




# Score on training set. (We'll use R^2 for the score today.)


# Score on testing set.


<details><summary>Check: What is the problem with this?</summary>
    
- We've clearly overfit our model to the data (so much so that our model's performance is really bad)!
- We have a lot of columns relative to our number of rows! (If you have $n$ rows and you're fitting a linear model, it's often advised to keep your number of columns below $\sqrt{n}$.)
</details>
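
To get a feel for how quickly the column count can grow, here is a minimal sketch on synthetic data (not the wine data). The 200 rows, 11 input columns, and degree-2 expansion are assumptions chosen for illustration; the names `X_demo` and `X_demo_poly` are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for a feature matrix: 200 rows, 11 columns (assumed sizes).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 11))

# A degree-2 expansion adds a bias column, squared terms, and pairwise interactions.
poly = PolynomialFeatures(degree=2, include_bias=True)
X_demo_poly = poly.fit_transform(X_demo)

n_rows, n_cols = X_demo_poly.shape
print(f"columns before: {X_demo.shape[1]}, columns after: {n_cols}")
print(f"sqrt(n) rule-of-thumb limit: {np.sqrt(n_rows):.1f}")
```

Even a modest expansion pushes the column count well past the $\sqrt{n}$ guideline for a data set of this size.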

<details><summary>Check: How can we overcome this problem?</summary>

- We can drop features from our model. (However, this loses any benefit we'd get from including those features! It can also be time-consuming and/or require subject-matter expertise.)
- Maybe we can combine features together so that we can get the benefits of most/all of our features. (This is what PCA will do.)
</details>

### Dimensionality Reduction

[Dimensionality reduction](https://www.analyticsvidhya.com/blog/2015/07/dimension-reduction-methods/) refers to (approximately) reducing the number of features we use in our model.

<details><summary>Dimensionality reduction has a number of advantages:</summary>

- Increases computational efficiency when fitting models.
- Can help with addressing a multicollinearity problem.
- Makes visualization simpler (or feasible).
</details>

<details><summary>Dimensionality reduction can suffer from some drawbacks, though:</summary>

- We've invested our time and money into collecting information... why do we want to get rid of it?
</details>

### Is there a way to get the advantages of dimensionality reduction while minimizing the drawbacks?

Dimensionality reduction can generally be broken down into one of two categories:

<img src="./images/dim_red.png" alt="drawing" width="550"/>

- **Feature Selection**
    - We drop variables from our model.
- **Feature Extraction**
    - In feature extraction, we take our existing features and combine them together in a particular way. We can then drop some of these "new" variables, but the variables we keep are still a combination of the old variables!
    - This allows us to still reduce the number of features in our model **but** we can keep all of the most important pieces of the original features!

<img src="./images/feast.png" alt="drawing" width="550"/>

### $$
\begin{eqnarray*}
X_1, \ldots, X_p &\Rightarrow& Z_1, \ldots, Z_p \\
\\
\text{most important: }Z_1 &=& w_{1,1}X_1 + w_{1,2}X_2 + \cdots + w_{1,p}X_p \\
\text{slightly less important: }Z_2 &=& w_{2,1}X_1 + w_{2,2}X_2 + \cdots + w_{2,p}X_p \\
&\vdots&\\
\text{least important: }Z_p &=& w_{p,1}X_1 + w_{p,2}X_2 + \cdots + w_{p,p}X_p \\
\end{eqnarray*}
$$

- We don't usually care about the values of weights here. They aren't very meaningful and we don't try to interpret them.
- You can think of $Z_1$ as a "high performance" predictor, where $Z_1$ has all of the best pieces of our original predictors $X_1$ through $X_p$.
- As we move down the list toward $Z_p$, the variables will consist of the more "redundant" parts of our $X$ variables. 
- You can think of $Z_p$ as a "low performance" predictor.
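
The weighted sums above can be sketched in code. This example uses random synthetic data and assumes `scikit-learn`'s `PCA` as the feature extractor; `X_demo`, `Z_demo`, and `Z_manual` are illustrative names. Note that `scikit-learn` subtracts each column's mean before applying the weights.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 100 rows, 4 correlated columns (illustrative only).
X_demo = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

pca = PCA()
Z_demo = pca.fit_transform(X_demo)

# Row k of components_ holds the weights w_{k,1}, ..., w_{k,p} for Z_k.
W = pca.components_

# Rebuild every Z column by hand: center X, then take the weighted sums.
Z_manual = (X_demo - pca.mean_) @ W.T

print(np.allclose(Z_demo, Z_manual))
```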

<details><summary>If I'm going to keep three of my new predictors, which three would I keep?</summary>
    
- The first three: $Z_1$, $Z_2$, and $Z_3$.
- This is how we do feature extraction.
    - We take our old features $X_1$, $X_2$, $X_3$, and $X_4$.
    - We turn them into new features $Z_1$, $Z_2$, $Z_3$, and $Z_4$.
    - The new features are combinations of our old features.
    - If we drop new features, we're doing dimensionality reduction, but we also keep parts of every old feature!
</details>

Dimensionality reduction can be used as an exploratory/unsupervised learning method or as a pre-processing step for supervised learning later.

**Principal component analysis** is one algorithm for doing feature extraction.

<details><summary>How would you describe the difference between feature selection and feature extraction?</summary>

- Feature selection is a process of dropping original features from our model.
- Feature extraction is a process of transforming our original features into "new" features, then dropping some of the "new" features from our model.
</details>
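
Here is a small side-by-side sketch of the two approaches. The data are random and the column names (`x1` through `x4`) are hypothetical; PCA stands in as one possible feature extraction method.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical feature names; the values are random.
X_demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=["x1", "x2", "x3", "x4"])

# Feature selection: keep two original columns and drop the rest.
X_selected = X_demo[["x1", "x2"]]

# Feature extraction: build new columns from all four, then keep the first two.
pca = PCA(n_components=2)
Z_extracted = pca.fit_transform(X_demo)

# Each new column carries a weight on every original column.
weights = pd.DataFrame(pca.components_, columns=X_demo.columns, index=["Z1", "Z2"])
print(weights.round(2))
```

Both results have two columns, but `X_selected` knows nothing about `x3` or `x4`, while each column of `Z_extracted` draws on all four inputs.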

## Principal Component Analysis

### Big picture, what is PCA doing?
1. We are going to look at how all of the $X$ variables relate to one another and summarize these relationships.
2. Then, we will take this summary and look at which combinations of our $X$ variables are most important.
3. We can also quantify how important each combination is and rank these combinations.

Once we've taken our original $X$ data and transformed it into $Z$, we can then drop the columns of $Z$ that are "least important."

Imagine you are this [whale shark](https://en.wikipedia.org/wiki/Whale_shark):

<img src="./images/whaleshark.png" alt="drawing" width="500"/>

And you want a snack. Which way would you tilt your head to eat the most krill at once?

<img src="./images/krill.png" alt="drawing" width="500"/>

Above artwork by [@allison_horst](https://twitter.com/allison_horst)

<img src="./images/pca.gif" alt="drawing" width="600"/>

[Source](https://rpubs.com/jormerod/594859).

**Visually...**

> Think of our data floating out in $p$-dimensional space. Each observation is a dot and you can imagine this massive cloud of dots that exists somewhere. PCA is a way to rotate this cloud of dots (formally, a [coordinate transformation](http://farside.ph.utexas.edu/teaching/336k/Newtonhtml/node153.html)). The old axes are the original $X_1$, $X_2$, $\ldots$ features. **The new axes are the principal components from PCA**.

The principal components are the most concise, informative descriptors of our data as a whole.
- What does this mean?
- If we wanted to condense our full data set into one dimension (think of a single axis), we'd use only $Z_1$.
- If we wanted to condense our full data set into two dimensions (think of a flat plot with two axes), we'd use $Z_1$ and $Z_2$.

Let's head to [this site](http://setosa.io/ev/principal-component-analysis/). Play around with the 2D data. Take 2-3 minutes.
1. As you interact with the data, how would you describe the red line?
2. As you interact with the data, how would you describe the green line?

---

### Principal Components

- We are looking for new *directions*. ([Insert Glee joke here](https://glee.fandom.com/wiki/New_Directions).)
- Each consecutive direction tries to explain the maximum *remaining variance* in our $X$ data.
- Each direction is *orthogonal* to all the others.

**These new *directions* are the "principal components."**

> Applying PCA to your data *transforms* your original data columns (variables) onto the new principal component axes.
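
The three properties listed above can be checked numerically. This is a sketch on synthetic data using `scikit-learn`'s `PCA`; the array names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Synthetic correlated data: 300 rows, 5 columns.
X_demo = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))

pca = PCA()
Z_demo = pca.fit_transform(X_demo)

# The directions are orthogonal unit vectors: W @ W.T is the identity matrix.
W = pca.components_
print(np.allclose(W @ W.T, np.eye(5)))

# Each direction explains no more variance than the one before it.
print(pca.explained_variance_)

# Once transformed, the new columns are uncorrelated with one another.
corr_Z = np.corrcoef(Z_demo, rowvar=False)
print(np.allclose(corr_Z, np.eye(5), atol=1e-6))
```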


### Two notes:

1. Train/test split **before** applying PCA!
2. Standardize our data **before** applying PCA!
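
Why both notes? PCA chases variance, so a column measured in large units would dominate unless we standardize. And if the scaler sees the test rows, information from the test set leaks into our preprocessing. A minimal sketch of the order of operations on synthetic data (the column scales and names like `X_tr_sc` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Three columns on very different scales (illustrative).
X_demo = rng.normal(loc=[0, 100, 5000], scale=[1, 10, 1000], size=(200, 3))
y_demo = rng.normal(size=200)

# 1. Split first.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# 2. Then standardize, learning means and standard deviations from training rows only.
sc = StandardScaler()
X_tr_sc = sc.fit_transform(X_tr)
X_te_sc = sc.transform(X_te)  # reuse the training means and standard deviations

print(X_tr_sc.mean(axis=0).round(6), X_tr_sc.std(axis=0).round(6))
```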

In [None]:
# Instantiate our StandardScaler.


# Standardize X_train.


# Standardize X_test.


In [None]:
# Import PCA.
from sklearn.decomposition import PCA

#### (BONUS) Why decomposition?
The way PCA works "under the hood" is it takes one matrix and **decomposes** that matrix into multiple matrices.

Written out, we might take some matrix $\mathbf{A}$ and break it down into multiple matrices like this:

$$
\begin{eqnarray*}
\mathbf{A} &=& \mathbf{P}\mathbf{D}\mathbf{P}^{-1}
\end{eqnarray*}
$$

Check out [the Wikipedia article](https://en.wikipedia.org/wiki/Matrix_decomposition) for a list of ways to decompose matrices.
- A specific method of decomposition commonly used for PCA is known as the [eigendecomposition](https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix) or spectral decomposition of a matrix. However, eigendecomposition requires [diagonalizable](https://en.wikipedia.org/wiki/Diagonalizable_matrix) matrices. To generalize this to non-square/non-diagonalizable matrices, we more commonly use the [Singular Value Decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition) (SVD) for PCA. [PCA in Sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) uses SVD.
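
For the curious, here is a small NumPy sketch of both decompositions. The matrices `B` and `A` are random examples made up for illustration; this is not how `sklearn` is implemented internally, just a way to see the pieces.

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.normal(size=(6, 3))  # a non-square matrix

# A symmetric matrix (like a covariance matrix) is always diagonalizable.
A = B.T @ B

# eigh is NumPy's eigendecomposition routine for symmetric matrices.
eigvals, P = np.linalg.eigh(A)
D = np.diag(eigvals)

# Rebuild A from its pieces: A = P D P^{-1}.
A_rebuilt = P @ D @ np.linalg.inv(P)
print(np.allclose(A, A_rebuilt))

# SVD works directly on the non-square matrix B.
# Its squared singular values match the eigenvalues of B.T @ B.
singular_values = np.linalg.svd(B, compute_uv=False)
print(np.allclose(np.sort(singular_values**2), np.sort(eigvals)))
```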

In [None]:
# Instantiate PCA.


# Fit PCA on the training data.


# Transform the training data with PCA.


# Don't forget to transform the test data!




In [None]:
# Instantiate PCA.


In [None]:
# Fit PCA on the training data.


In [None]:
# Transform the training data with PCA.


In [None]:
# Let's check out the resulting data


In [None]:
# Don't forget to transform the test data!


### So, like, big picture, what is PCA doing?
Well, we're transforming our data. Specifically:
1. We look at how all of the $X$ variables relate to one another, then summarize these relationships. (This is done with the **covariance matrix**.)
2. Then, we take this summary and look at which combinations of our $X$ variables are most important. (We decompose our covariance matrix into its **eigenvectors**. These vectors point along the most important "directions" in our data, which are our principal components!)
3. We can also see exactly how important each combination is, then rank these combinations. (With each eigenvector, we get an **eigenvalue**. This eigenvalue is a number that tells us how important each "direction" or principal component is.)
    - Want a better understanding of eigenvectors and eigenvalues? [Check this 3Blue1Brown video out!](https://www.youtube.com/watch?v=PFDu9oVAE-g)

Remember that one of our goals with PCA is to do dimensionality reduction (a.k.a. get rid of features).

We can: 
- measure how important each principal component is using the eigenvalue, 
- rank the columns of `Z_train` by their eigenvalues,
- then drop the columns with small eigenvalues (little importance) but keep the columns with big eigenvalues (very important).
    - In `sklearn`, when transformed by PCA, the columns will already be sorted by their eigenvalues from biggest to smallest! The first column will be the most important, the second column will be the next most important, and so on.
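
The three steps above (covariance matrix, eigenvectors, eigenvalues) can be sketched by hand and compared against `sklearn`. This uses synthetic data and NumPy's `eigh`; it is an illustration of the idea, not a claim about `sklearn`'s internals.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X_demo = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

# 1. Summarize relationships: the covariance matrix of the columns.
cov = np.cov(X_demo, rowvar=False)

# 2. Find the important directions: eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Rank the directions by eigenvalue, largest first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pca = PCA().fit(X_demo)

# The eigenvalues match sklearn's explained_variance_.
print(np.allclose(eigvals, pca.explained_variance_))
# The eigenvectors match components_ up to a sign flip.
print(np.allclose(np.abs(eigvecs.T), np.abs(pca.components_), atol=1e-6))
```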

#### But how many features do we discard?

A useful measure is the **proportion of explained variance**, which is calculated from the **eigenvalues**. 

The proportion of explained variance tells us what share of the total information (variance) is captured by each principal component.

### $$ \text{proportion of explained variance of }PC_k = \bigg(\frac{\text{eigenvalue of } PC_k}{\sum_{i=1}^p\text{eigenvalue of } PC_i}\bigg)$$

Rather than write out "$\text{eigenvalue of } PC_k$", we usually just write $\lambda_k$.

If I want to calculate the proportion of explained variance by retaining $PC_1$ and $PC_2$, I would calculate this as:

### $$ \text{proportion of explained variance of } PC_1 \text{ and } PC_2 = \bigg(\frac{\lambda_1 + \lambda_2}{\sum_{i=1}^p \lambda_i} \bigg)$$
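
As a quick check of the two formulas above, here is a sketch on synthetic data. It assumes a fitted `scikit-learn` `PCA` object, whose `explained_variance_` attribute holds the eigenvalues; `lambdas`, `prop_var`, and `cum_var` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
X_demo = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))

pca = PCA().fit(X_demo)

# The eigenvalues (lambda_k) are stored in explained_variance_.
lambdas = pca.explained_variance_

# Proportion of explained variance for each component.
prop_var = lambdas / lambdas.sum()
print(np.allclose(prop_var, pca.explained_variance_ratio_))

# Proportion explained by PC_1 and PC_2 together.
print((lambdas[0] + lambdas[1]) / lambdas.sum())

# Running total across components.
cum_var = np.cumsum(prop_var)
print(cum_var.round(3))
```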

In [None]:
# Pull the explained variance attribute.


In [None]:
# Generate the cumulative explained variance.


In [None]:
# Plot total explained variance vs components  


<details><summary>Check: If I wanted to explain at least 80% of the variability in my data with principal components, what is the smallest number of principal components that I would need to keep? </summary>

- Only six! 
- I could keep $Z_1, Z_2, \ldots, Z_6$ in my model, and this would explain 80.8% of the variability in my $X$ data.
</details>
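
Rather than reading the answer off a plot, we can also compute it. The sketch below uses synthetic data, so its answer will differ from the wine data; the 0.80 threshold mirrors the question above. It also shows that passing a float to `n_components` asks `scikit-learn` to pick the count for us.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 8))

pca_all = PCA(svd_solver="full").fit(X_demo)
cum_var = np.cumsum(pca_all.explained_variance_ratio_)

# Position of the first component where the running total reaches 0.80,
# plus one because component counts start at 1.
k = int(np.argmax(cum_var >= 0.80)) + 1
print(k, cum_var[k - 1])

# A float n_components tells scikit-learn to keep just enough components.
pca_80 = PCA(n_components=0.80, svd_solver="full").fit(X_demo)
print(pca_80.n_components_)
```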

## Let's compare our PCA'ed performance to our original performance!

#### Original performance:

<img src="./images/lr_performance.png" alt="drawing" width="800"/>

#### Principal Component Regression performance:

In [None]:
# Instantiate PCA with 10 components.


# Fit PCA to training data.


# Instantiate linear regression model.


# Transform the scaled X_train and X_test into Z_train and Z_test.


# Fit on Z_train.


# Score on training and testing sets.



Our final model here is:

$$
\begin{eqnarray*}
[\text{quality}] &=& \beta_0 + \beta_1Z_1 + \beta_2Z_2 + \cdots + \beta_{10}Z_{10} \\
\\
\text{where } Z_1 &=& \gamma_1X_1 + \gamma_2X_2 + \gamma_3X_3 + \cdots + \gamma_pX_p \\
\text{and } Z_2 &=& \delta_1X_1 + \delta_2X_2 + \delta_3X_3 + \cdots + \delta_pX_p \\
&\vdots& \\
\text{and } Z_{10} &=& \eta_1X_1 + \eta_2X_2 + \eta_3X_3 + \cdots + \eta_pX_p \\
\end{eqnarray*}
$$
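
One way to package a model of this form is a `scikit-learn` pipeline. This is a sketch on synthetic data, not the wine data, and the pipeline is a convenience swapped in for the step-by-step cells in this lesson; the sizes and names (`X_demo`, `pcr`) are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
# Synthetic data: 400 rows, 30 features, target driven by the first five.
X_demo = rng.normal(size=(400, 30))
y_demo = X_demo[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Scale, project onto 10 principal components, then regress on them.
pcr = make_pipeline(StandardScaler(), PCA(n_components=10), LinearRegression())
pcr.fit(X_tr, y_tr)

print(f"train R^2: {pcr.score(X_tr, y_tr):.3f}")
print(f"test R^2:  {pcr.score(X_te, y_te):.3f}")

# beta_0 and beta_1, ..., beta_10 from the equation above.
lr = pcr.named_steps["linearregression"]
print(lr.intercept_, lr.coef_.shape)
```

Because the pipeline fits the scaler and PCA on the training rows only, the test rows are transformed with training statistics automatically.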

In [None]:
# Make a PCA dataframe
columns = [f'PCA_{i+1}' for i in range(Z_train.shape[1])]
z_df = pd.DataFrame(data=Z_train, columns=columns)
z_df['Wine_Quality'] = y_train.values
z_df.head()

In [None]:
# Visualize PCA_1 vs. PCA_10
sns.lmplot(
    x="PCA_1",
    y="PCA_10",
    data=z_df, 
    fit_reg=False, 
    hue='Wine_Quality', # color by wine quality
    legend=True,   
    scatter_kws={"s": 30} # specify the point size
);

**Two assumptions that PCA makes:**
1. **Linearity:** PCA detects and controls for linear relationships, so we assume that the data does not hold nonlinear relationships (or that we don't care about these nonlinear relationships).
    - We use the covariance matrix to determine important "directions," and covariance measures the linear relationship between features!
    - There are other types of feature extraction like [t-SNE](https://lvdmaaten.github.io/tsne/) and [PPA](https://towardsdatascience.com/interesting-projections-where-pca-fails-fe64ddca73e6), though we won't formally cover those in a global lesson.
    
    
2. **Large variances define importance:** If data is spread in a direction, that direction is important! If there is little spread in a direction, that direction is not very important.
    - That aligns with what we saw [here](http://setosa.io/ev/principal-component-analysis/).
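
To see the linearity assumption bite, consider a toy example where two features are perfectly related, but not linearly. The circle data below are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Points evenly spaced on a unit circle: x and y are perfectly related,
# but the relationship (x^2 + y^2 = 1) is nonlinear.
theta = np.linspace(0, 2 * np.pi, 400, endpoint=False)
X_circle = np.column_stack([np.cos(theta), np.sin(theta)])

pca = PCA().fit(X_circle)

# Share of variance captured by each of the two directions.
ratio = pca.explained_variance_ratio_
print(ratio.round(3))
```

Each direction captures about half of the variance, so PCA sees no good way to condense this data into one dimension, even though a single angle describes every point.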

### Potential Use Cases for PCA
- Situations where $p \not\ll n$. (Situations where $p$ is not substantially smaller than $n$.)
- Situations in which there are variables with high multicollinearity. (Can be traditional models or models with highly correlated inputs by design, like images.)
- Situations in which there are many variables, even without explicit multicollinearity.
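
Here is a sketch of the multicollinearity case. It assumes a made-up setup in which two hidden factors drive ten observed columns; the weights and noise level are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)

# Two hidden factors drive ten observed columns (assumed setup).
latent = rng.normal(size=(500, 2))
weights = np.vstack([np.linspace(1, 2, 10), np.linspace(2, -1, 10)])
X_demo = latent @ weights + rng.normal(scale=0.05, size=(500, 10))

# Neighboring columns are close to copies of each other.
corr_01 = np.corrcoef(X_demo[:, 0], X_demo[:, 1])[0, 1]
print(corr_01)

X_demo_sc = StandardScaler().fit_transform(X_demo)
pca = PCA().fit(X_demo_sc)

# Share of variance captured by the first two components.
top_two = pca.explained_variance_ratio_[:2].sum()
print(top_two)
```

With inputs this redundant, two principal components carry nearly all of the variance in ten columns.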

### Interview Questions

<details><summary>Explain PCA to me.</summary>

- Principal component analysis is a method of dimensionality reduction that **identifies important relationships** in our data, **transforms the existing data** based on these relationships, and then **quantifies the importance** of these relationships so we can keep the most important relationships and drop the others!

<details><summary>How can I remember the above?</summary>

Matt's "Three Signposts:"
- Covariance Matrix
- Eigenvectors
- Eigenvalues
</details>
</details>

<details><summary>In what cases would I not use PCA?</summary>

- Since PCA makes our features harder to interpret, we should not use PCA if our goal is to interpret the output of our model.
- If we have relatively few features as inputs, PCA is unlikely to have a large positive impact on our model.
</details>