<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 10px;"> 
# Principal Components Analysis - PCA

---
DS-SF-31 | Lesson 13 | Instructor: Mario Javier Carrillo

$~$

_Lesson adapted from [An Introduction to Statistical Learning][1] and [A Tutorial on Principal Component Analysis][2]_

[1]: http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf    "An Introduction to Statistical Learning"
[2]: https://arxiv.org/pdf/1404.1100.pdf   "A Tutorial on Principal Component Analysis"

### Principal Components Analysis
---
* A technique that, when faced with a large set of correlated variables, allows us to summarize the set with a smaller number of representative variables that **collectively** explain most of the variability in the original set.
* **PCA** is the quintessential "dimensionality reduction" algorithm, where _"dimensionality reduction"_ is the process of combining or collapsing your existing features (columns in $X$) into fewer new features that retain the signal in the original data while ideally reducing noise.
* **PCA** is an unsupervised approach, since it involves only a set of features $X_1$, $X_2$, . . . , $X_p$, and no associated response $Y$.
* **PCA** produces derived variables that can be used in supervised methods, and it is also a tool for data visualization.



### Principal Components Analysis 
---
* Sure, but what does PCA really mean, and how does it work? See the link below (5 min).

<a href = http://setosa.io/ev/principal-component-analysis/> PCA Explained Visually </a>

### Principal Components Analysis - the optimization process - part 1
---
* The idea is that each of the $n$ observations lives in $p$-dimensional space, but not all of these dimensions are equally interesting.
* Each of the dimensions found by **PCA** is a linear combination of the $p$ features, so the first principal component of a set of features $X_1$, $X_2$, . . . , $X_p$ is the **normalized** linear combination of the features:
$$Z_1 = φ_{11}X_1 +φ_{21}X_2 +...+φ_{p1}X_p$$
that has the **largest variance**, where by **normalized** we mean:
$$\sum_{j=1}^p φ_{j1}^2 = 1$$
The elements $φ_{11}$,...,$φ_{p1}$ are the loadings of the **first** principal component. Together these loadings make up the principal component loading vector, whose **sum of squares** is equal to **1**: $$φ_1 = (φ_{11}  φ_{21} ... φ_{p1})^T$$
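To see why the normalization constraint is needed, here is a minimal NumPy sketch. The toy data and the names `X`, `phi` and `scores_variance` are illustrative assumptions, not part of the lesson. Stretching a loading vector by a constant $c$ multiplies the variance of $Z_1$ by $c^2$, so without the constraint the variance could be made arbitrarily large.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 observations of p = 3 features, centered column-wise.
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)

def scores_variance(X, phi):
    """Variance of the linear combination z = X @ phi (X is assumed centered)."""
    z = X @ phi
    return np.mean(z ** 2)

# An arbitrary direction, rescaled so that its squared loadings sum to 1.
phi = np.array([2.0, -1.0, 0.5])
phi_unit = phi / np.linalg.norm(phi)

sum_of_squares = np.sum(phi_unit ** 2)           # the normalization constraint
var_unit = scores_variance(X, phi_unit)
var_scaled = scores_variance(X, 10 * phi_unit)   # same direction, 10 times longer

print(sum_of_squares)          # 1 up to floating-point error
print(var_scaled / var_unit)   # 100 up to floating-point error
```

Note that it is the squared loadings that sum to 1; the loadings themselves generally do not.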

### Principal Components Analysis - the optimization process - part 2
---
* Assume we have an $n$ × $p$ data set $X$. How can we calculate the **first** principal component?
    * Remember we are **only** interested in **variance**, so assume that each of the variables in $X$ has been centered to have mean zero (_the column means of $X$ are zero_).
    * We then look for the linear combination of the sample feature values of the form
$$z_{i1} = φ_{11}x_{i1} +φ_{21}x_{i2} +...+φ_{p1}x_{ip}$$
    * that has the largest sample variance, subject to the constraint $$\sum_{j=1}^p φ_{j1}^2 = 1$$


### Principal Components Analysis - the optimization process - part 3
---
* Therefore the first principal component loading vector solves the following optimization problem:
$$ \text{maximize}_{φ_{11},...,φ_{p1}} \left(\frac1n\sum_{i=1}^n \left(\sum_{j=1}^p φ_{j1}x_{ij} \right)^2\right) \quad \text{subject to} \quad \sum_{j=1}^p φ_{j1}^2 = 1$$
* Since $z_{i1} = \sum_{j=1}^p φ_{j1}x_{ij}$, the objective can be written as:
$$\frac1n \sum_{i=1}^n z_{i1}^2$$
* and since $$\frac1n \sum_{i=1}^n x_{ij} = 0$$
the average of the $z_{11}$,...,$z_{n1}$ will be zero as well. Thus the quantity we **maximize** is the **sample variance of the $n$ values of $z_{i1}$**, where $z_{11}$,...,$z_{n1}$ are the **scores** of the first principal component. The problem is solved by eigen decomposition: the maximizing $φ_1$ is the eigenvector of $\frac1n X^T X$ with the largest eigenvalue.
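The optimization problem above can be checked numerically. The following is a rough sketch on made-up data (the `mixing` matrix, the sample sizes and the helper `objective` are assumptions for illustration). It compares the objective value attained by the top eigenvector of $\frac1n X^T X$ with the value attained by many random unit-length loading vectors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical correlated data: n = 500 observations, p = 3 features.
n, p = 500, 3
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.5, 1.0, 0.0],
                   [0.5, 0.2, 0.3]])
X = rng.normal(size=(n, p)) @ mixing
X = X - X.mean(axis=0)           # center every column

def objective(phi):
    """(1/n) * sum_i z_i1^2 for the scores z = X @ phi."""
    z = X @ phi
    return np.mean(z ** 2)

# Eigen decomposition of the symmetric matrix (1/n) X^T X.
# np.linalg.eigh returns the eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X / n)
phi_1 = eigenvectors[:, -1]      # eigenvector of the largest eigenvalue
z_1 = X @ phi_1                  # scores of the first principal component

# Compare against many random loading vectors of unit length.
random_phis = rng.normal(size=(1000, p))
random_phis /= np.linalg.norm(random_phis, axis=1, keepdims=True)
best_random = max(objective(phi) for phi in random_phis)

print(objective(phi_1) >= best_random)   # True
```

Because the scores have mean zero, `objective(phi_1)` coincides with `np.var(z_1)` and with the largest eigenvalue.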

### Principal Components Analysis - the optimization process - part 4
---

* Once the **first principal component** $Z_1$ has been determined, we can calculate the **second principal component**, which is the linear combination of $X_1$,...,$X_p$ that has maximal variance out of all linear combinations that are **uncorrelated with $Z_1$**.
* The second principal component scores $z_{12}$, $z_{22}$,...,$z_{n2}$ take the following form:
$$z_{i2} = φ_{12}x_{i1} +φ_{22}x_{i2} +...+φ_{p2}x_{ip}$$
where $φ_2$ is the second principal component loading vector with elements $φ_{12}$, $φ_{22}$,...,$φ_{p2}$.
* Note: constraining $Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining the direction of $φ_2$ to be orthogonal (perpendicular) to the direction of $φ_1$ (see graph on next slide).
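As a quick illustration of the note above, this sketch uses two made-up correlated features (the data-generating coefficients are assumptions) and checks that the two loading vectors are orthogonal and that the two sets of scores are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: two correlated features, centered column-wise.
x1 = rng.normal(size=300)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=300)
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)

# eigh returns ascending eigenvalues, so sort them largest first.
eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X / len(X))
order = np.argsort(eigenvalues)[::-1]
phi_1 = eigenvectors[:, order[0]]    # first loading vector
phi_2 = eigenvectors[:, order[1]]    # second loading vector

z_1 = X @ phi_1                      # first principal component scores
z_2 = X @ phi_2                      # second principal component scores

dot_product = phi_1 @ phi_2                      # ~0: loadings are orthogonal
score_correlation = np.corrcoef(z_1, z_2)[0, 1]  # ~0: scores are uncorrelated
```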



### Principal Components Analysis - PC 1 and PC 2
---

![](https://snag.gy/ECsJye.jpg)
*Image from Introduction to Statistical Learning*

### Principal Components Analysis - the process
---
* Linearly transform an $𝑁$×$𝑑$ matrix $𝑋$ into an $𝑁$×$𝑚$ matrix $𝑌$
    * Center the data (subtract the mean of each column).
    * Calculate the $𝑑$×$𝑑$ covariance matrix: $$𝐶 = \frac1{N-1} X^T X$$
        * $C_{i,j} = \frac1{N-1} \sum_{q=1}^N X_{q,i} X_{q,j}$
        * $C_{i,i}$ (diagonal) is the variance of variable $i$
        * $C_{i,j}$ (off-diagonal) is the covariance between variables $i$ and $j$
    * Calculate the **eigenvectors** of the covariance matrix
         * An **eigenvector** specifies a direction through the original coordinate space.
    * Select the $m$ **eigenvectors** that correspond to the **largest $m$ eigenvalues** to be the new basis.
         * The eigenvector with the largest corresponding **eigenvalue** gives the direction of the first principal component.
         * Each **eigenvalue** indicates the amount of variance in the direction of its corresponding eigenvector.
    * Project the centered data onto the new basis to obtain $𝑌$.
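The steps above can be sketched in a few lines of NumPy. This is one possible implementation under stated assumptions, not a reference one: the function name `pca`, its return values and the toy data are all hypothetical.

```python
import numpy as np

def pca(X, m):
    """Project an N x d matrix X onto its first m principal components.

    Returns (Y, components, eigenvalues): Y is N x m, components is d x m
    (one eigenvector per column), eigenvalues holds all d values, largest first.
    """
    X = np.asarray(X, dtype=float)

    # 1. Center the data: subtract each column's mean.
    X_centered = X - X.mean(axis=0)

    # 2. d x d covariance matrix C = X^T X / (N - 1).
    N = X_centered.shape[0]
    C = X_centered.T @ X_centered / (N - 1)

    # 3. Eigen decomposition; eigh is meant for symmetric matrices such as C.
    eigenvalues, eigenvectors = np.linalg.eigh(C)

    # 4. Sort by eigenvalue, largest first, and keep the top m eigenvectors.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    components = eigenvectors[:, order[:m]]

    # 5. Project the centered data onto the new basis.
    Y = X_centered @ components
    return Y, components, eigenvalues

# Usage on made-up data: 100 observations, 4 correlated features, keep 2 components.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
Y, components, eigenvalues = pca(X, m=2)
print(Y.shape)   # (100, 2)
```

The variance of each column of `Y` (computed with `ddof=1`) matches the corresponding eigenvalue, which is the sense in which eigenvalues measure variance along their eigenvectors.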

### Principal Components Analysis - eigenvectors
---
* If $A$ is a **square matrix**, a non-zero vector **$v$** is an **eigenvector** of $A$ if there is a scalar $λ$ **(eigenvalue)** such that $$Av = λv$$

* For example:
$$ Av = 
\left(\begin{array}{cc} 
2 & 3\\
2 & 1
\end{array}\right)
\left(\begin{array}{c} 
3 \\ 
2 
\end{array}\right) = 
\left(\begin{array}{c} 
12 \\
8
\end{array}\right) = 
4
\left(\begin{array}{c} 
3 \\
2
\end{array}\right)
= λv
$$ 
$~$
* If you think of the square matrix $A$ as a transformation matrix, then multiplying it by an **eigenvector does not change the line along which the eigenvector points**; it only rescales the vector by $λ$.

<a href = http://setosa.io/ev/eigenvectors-and-eigenvalues/> Please see Eigenvectors and Eigenvalues Visually </a>
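The worked example above can be verified with NumPy. This is a small sketch; the variable names are illustrative. `np.linalg.eig` returns eigenvectors scaled to unit length and with an arbitrary sign, so the check below tests that the computed eigenvector is parallel to $v$ rather than equal to it.

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])
v = np.array([3.0, 2.0])
lam = 4.0

Av = A @ v
print(Av)         # [12.  8.]
print(lam * v)    # [12.  8.]

# np.linalg.eig handles general (not necessarily symmetric) square matrices.
eigenvalues, eigenvectors = np.linalg.eig(A)

# Pick the eigenvector that belongs to the largest eigenvalue.
k = np.argmax(eigenvalues)
u = eigenvectors[:, k]

# Two 2-d vectors are parallel when their cross product is zero.
cross = u[0] * v[1] - u[1] * v[0]
```

The other eigenvalue of this matrix is $-1$: its eigenvector is flipped by $A$ but stays on the same line.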

### Principal Components Analysis - in sum
---

What is a principal component? **Principal components are the vectors that define the new coordinate system for your data.** Projecting your original data columns onto the principal component axes constructs new variables that are optimized to explain as much variance as possible and to be uncorrelated with one another.

Creating these variables is a well-defined mathematical process, but in essence **each component is created as a weighted sum of your original columns, such that all components are orthogonal (perpendicular) to each other**.


### Principal Components Analysis - in sum
---
![](https://snag.gy/0Hur9o.jpg)
*Image from http://setosa.io/ev/principal-component-analysis/*

### Principal Components Analysis - Why would we want to do PCA?
---
* We can reduce the number of dimensions (drop the components with the smallest eigenvalues) while losing the least possible amount of variance information in our data.
* Since we are assuming our variables are interrelated (at least in the sense that they together explain a dependent variable), the information of interest should lie along the directions with the largest variance.
* Under that assumption, the directions of largest variance should have the highest signal-to-noise ratio.
* Correlated predictor variables (also referred to as "redundancy" of information) are combined into uncorrelated variables. The predictors produced by PCA are guaranteed to be uncorrelated with one another, which is a weaker property than statistical independence.
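To make the first and last points concrete, here is a hedged sketch on simulated data. The two hidden signals, the noise level and the 95% threshold are arbitrary assumptions. It computes the share of variance explained by each component, picks the smallest number of components that reaches the threshold, and checks that the resulting scores are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: 5 observed features driven by 2 hidden signals plus small noise.
signals = rng.normal(size=(400, 2))
weights = rng.normal(size=(2, 5))
X = signals @ weights + 0.05 * rng.normal(size=(400, 5))
X = X - X.mean(axis=0)

# Covariance matrix and its eigen decomposition, sorted largest eigenvalue first.
C = X.T @ X / (len(X) - 1)
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of total variance explained by each component, and the running total.
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest m whose components explain at least 95% of the variance.
m = int(np.searchsorted(cumulative, 0.95) + 1)

# Reduced data set and the covariance matrix of the new variables.
Z = X @ eigenvectors[:, :m]
score_cov = Z.T @ Z / (len(Z) - 1)   # diagonal: the new variables are uncorrelated

print(m, cumulative[m - 1])
```

The off-diagonal entries of `score_cov` are zero up to floating-point error, while its diagonal holds the first `m` eigenvalues.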

### Principal Components Analysis - Want to explore more on PCA ?
---
[Performing PCA by Sebastian Raschka](http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html#pca-vs-lda)

[PCA 4 dummies](https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/)

[Cross Validated (Stack Exchange): making sense of PCA](http://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues)

[PCA and spectral theorem](http://stats.stackexchange.com/questions/217995/what-is-an-intuitive-explanation-for-how-pca-turns-from-a-geometric-problem-wit)

[PCA math and examples](http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch18.pdf)