# **Principal Component Analysis**

In Step M2.4.9 we introduced Principal Component Analysis (PCA) as a way of dealing with dimensionality reduction, but we didn't explain how PCA works or how it can also be used to create new features. In this step we will explain the background to PCA, what eigenvalues and eigenvectors are, and how they relate to principal components.

You might ask what is so important about PCA. Well, it can help us accomplish the following:
* Data Visualization
* Data Reduction
* Data Classification
* Trend Analysis
* Factor Analysis
* Noise Reduction

PCA is often described as the most common form of factor analysis. Factor analysis is a process where we try to infer latent variables, that is, variables that are not directly observable, from our data.

So let's try to explain in simple terms what we are trying to do with PCA. In PCA we are trying to create a new set of variables that explain the variation in the original dataset, subject to the constraint that they are orthogonal (not correlated) to each other. This is an easy concept to understand, but the maths behind it is a little tricky. So imagine we have a set of variables $X=[x_1,x_2,...,x_n]$. We are going to find a transformation that creates a new set of axes: the first describes the greatest variance in the data, and each later one describes the greatest remaining variance, subject to the condition that each new variable is orthogonal to the others. We can see in the figure below that we have created new axes $PC_1$ and $PC_2$, which are orthogonal.

![](https://www.computing.dcu.ie/~amccarren/mcm_images/PCA_1.png)

So let's go back a little and do a recap on what a covariance matrix is. The covariance between 2 variables $x$ and $y$, each observed $m$ times (we write $m$ to keep the number of observations distinct from the $n$ variables), can be given by the following equation:
>> $\sigma(x,y)=\frac{1}{m-1}\sum_{i=1}^m(x_i-\bar{x})(y_i-\bar{y})$

Now if we center each variable around its mean, with the following transformation:

>> $\hat{x}_i=x_i-\bar{x}$

then the covariance between $\hat{x}$ and $\hat{y}$ can be described as follows: 

>> $\sigma(\hat{x},\hat{y})=\frac{1}{m-1}\sum_{i=1}^m\hat{x}_i\hat{y}_i=\frac{1}{m-1}\hat{x}^T\hat{y}$

Now if we center all the variables in $X$ above and store them as the rows of $\hat{X}$, we can describe the covariance matrix as follows:

>>$\sigma = \frac{1}{m-1}\hat{X}.\hat{X^T}$

The constant $\frac{1}{m-1}$ only rescales the result, so we drop it in what follows and work with $\hat{X}.\hat{X^T}$.
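The covariance calculation above can be sketched in a few lines of NumPy. This is only an illustration: the data values and the names `X`, `X_hat` and `cov_manual` are made up, and the variables are stored as rows so that $\hat{X}.\hat{X^T}$ has one row and one column per variable. The result is divided by the number of observations minus one so that it can be compared with NumPy's own `np.cov`.

```python
import numpy as np

# Hypothetical toy data: 3 variables (rows) measured on 5 observations (columns).
X = np.array([
    [2.0, 4.0, 6.0, 8.0, 10.0],
    [1.0, 3.0, 2.0, 5.0, 4.0],
    [9.0, 7.0, 8.0, 4.0, 2.0],
])
n_obs = X.shape[1]

# Center each variable (row) around its own mean.
X_hat = X - X.mean(axis=1, keepdims=True)

# Covariance matrix built from the centered data.
cov_manual = (X_hat @ X_hat.T) / (n_obs - 1)

# np.cov treats rows as variables by default, so the two should agree.
cov_numpy = np.cov(X)

print(np.round(cov_manual, 3))
print(np.allclose(cov_manual, cov_numpy))
```

The diagonal of `cov_manual` holds the variance of each variable and the off-diagonal entries hold the covariances between pairs of variables.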

Now we want to transform our $\hat{X}$ matrix with a set of new vectors, the columns of $B$, to give us a new feature matrix $B^T\hat{X}$. We also want to maximise the variance explained by the new $B^T\hat{X}$, subject to the new vectors being orthogonal. For this condition to hold, $B^TB$ has to be the identity matrix, which confirms that we will have no collinearity in the $B$ vectors and that the new variables $B^T\hat{X}$ will be orthogonal.

We can accomplish all of this by maximising $B^T \hat{X}.\hat{X^T}B$ subject to $B^TB=I$.

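We can do this using a method called [Lagrange multipliers](http://mathonline.wikidot.com/the-method-of-lagrange-multipliers), which states that we should look for the stationary points of the following function with respect to $B$:

>>$L(B,\lambda)=B^T \hat{X}.\hat{X^T}B -\lambda (B^TB-I)$


Taking one column of $B$ at a time, each with its own multiplier $\lambda$, and setting the partial derivative with respect to $B$ equal to zero gives $2\hat{X}.\hat{X^T}B-2\lambda B=0$, so we get

>> $ \hat{X}.\hat{X^T}B =\lambda B$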
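The equation above can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated data (the seed, the sizes and the variable names are made up for illustration). It uses `np.linalg.eigh`, which is intended for symmetric matrices, to check that $\hat{X}.\hat{X^T}B=\lambda B$ holds, that $B^TB=I$, and that no random unit vector gives a larger value of the quantity being maximised than the eigenvector with the largest $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 4 variables (rows), 200 observations (columns).
X = rng.normal(size=(4, 200))
X[1] += 0.8 * X[0]  # make two of the variables correlated
X_hat = X - X.mean(axis=1, keepdims=True)

S = X_hat @ X_hat.T  # the matrix X_hat . X_hat^T from the text

# eigh returns the eigenvalues in ascending order and the
# eigenvectors as the columns of B.
eigenvalues, B = np.linalg.eigh(S)

# Check S b_k = lambda_k b_k for every column b_k of B.
print(np.allclose(S @ B, B * eigenvalues))

# Check the constraint B^T B = I.
print(np.allclose(B.T @ B, np.eye(4)))

# Compare b^T S b for the top eigenvector against 1000 random unit vectors.
b_top = B[:, -1]
random_dirs = rng.normal(size=(4, 1000))
random_dirs /= np.linalg.norm(random_dirs, axis=0)
quad_forms = np.einsum("ik,ij,jk->k", random_dirs, S, random_dirs)
print(quad_forms.max() <= b_top @ S @ b_top)
```

Note that `b_top @ S @ b_top` equals the largest eigenvalue, which is why the eigenvalues tell us how much variation each direction captures.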

So those of you who remember a little linear algebra will notice that this is exactly the equation that defines the eigenvalues $\lambda$ and the eigenvectors $B$ of the covariance matrix $\hat{X}.\hat{X^T}$. Now the great thing about all of this is that the eigenvector corresponding to the largest eigenvalue, multiplied by the $\hat{X}$ matrix, gives you the first component, the component that describes the most variation. The eigenvector with the second largest eigenvalue gives the second component, which describes the second highest amount of the variation, and so on. I have regularly found datasets with 50 or more variables where the first component accounts for over 50% of the overall variation within the data. You will also find that the proportion of each eigenvalue to the total sum of the eigenvalues gives you the % explained by each component, or

>> % Explained by $PC_i=\frac{\lambda_i}{\sum_{j=1}^n\lambda_j}\times 100$
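As a rough sketch of this formula in practice, the snippet below computes the % explained for a made-up dataset and then projects the centered data onto the eigenvectors to obtain the components. Everything about the data is an assumption for illustration: five variables are generated from one hidden factor plus noise, so the first component should account for most of the variation. The names `pct_explained` and `scores` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 5 variables (rows) driven mostly by one hidden factor.
factor = rng.normal(size=300)
X = np.vstack([w * factor + 0.3 * rng.normal(size=300)
               for w in (1.0, 0.9, 0.8, -0.7, 0.5)])
X_hat = X - X.mean(axis=1, keepdims=True)

eigenvalues, B = np.linalg.eigh(X_hat @ X_hat.T)

# Sort from largest to smallest eigenvalue so that PC_1 comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, B = eigenvalues[order], B[:, order]

# % explained by each component: lambda_i / (sum of all lambdas) * 100.
pct_explained = eigenvalues / eigenvalues.sum() * 100
print(np.round(pct_explained, 1))

# Each row of scores is one principal component (eigenvector times X_hat).
scores = B.T @ X_hat

# The components should be uncorrelated: off-diagonal entries near zero.
print(np.round(np.cov(scores), 3))
```

Plotting `pct_explained` against the component number gives the kind of chart that is usually called a scree plot.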

![](https://www.computing.dcu.ie/~amccarren/mcm_images/PCA_3.png)

You can see from the figure above that the amount of variance explained diminishes with the number of components. You do lose some information, but if the eigenvalues are small, you don’t lose much. So if you start with $n$ dimensions you will probably end up with about 

Now we mentioned this in the previous MOOC, but I will repeat it again. Don't use PCA on categorical data. It really is not appropriate. You can use a technique called Multiple Correspondence Analysis to do this, and we will look at it in step 2.7 of this topic.

If you are finding this a little difficult, have a look at a tutorial from [Lindsay Smith](http://www-labs.iro.umontreal.ca/~pift6080/H09/documents/papers/pca_tutorial.pdf) or this one from medium.com by [Rishav Kumar](https://medium.com/@aptrishu/understanding-principle-component-analysis-e32be0253ef0).

We will look at a very simple example of how we calculate the principal components in the next step.


