# DIMENSIONALITY REDUCTION USING PCA

<a id=section1></a>

## 1. Introduction to Principal Component Analysis

<a id=section101></a>

### 1.1 What is Data Dimensionality?

In the real world, the __number of columns__ is the number of __dimensions of the data__. However, some columns are __similar__, some are __correlated__, some are __duplicates__ in some way, and some are __junk__ or __useless__, so the actual number of dimensions can be unknown. It's a knotty problem.

<a id=section102></a>

### 1.2 What is high dimensionality?

Suppose we have __500 variables__ in a data set. That is such a large number that the data is difficult to read and understand. This is known as high dimensionality.
Anything that __can't be read and understood without external resources__ is an example of high dimensionality.

#### Lost in high dimensional space: ![Imgur](https://i.imgur.com/zPXvzBY.jpg)

<a id=section103></a>

### 1.3 MOTIVATION

When dealing with real problems and real data, we often face __high dimensional__ data whose number of dimensions can run into the __millions__.
- Sometimes we need to deal with data that has a large number of columns/variables, so we need to __reduce its dimensionality__.
- The need to reduce dimensionality is often __associated with visualizations__ (reducing to 2–3 dimensions so we can plot it) but _that is not always the case_.
- Sometimes we might value __performance over precision__, so we could reduce _1,000-dimensional data to 10 dimensions so we can manipulate it faster_ (e.g. calculate distances).
- Sometimes we want to find the essential __attributes/variables__.

The need to reduce dimensionality is __real and has many applications__.
There are __various techniques__ for doing so.<br/>
This sheet is entirely focused on __PCA (Principal Component Analysis)__.

![Imgur](https://i.imgur.com/XF6Quuj.png)

The picture above explains a simple dimension reduction in which a __3-D__ figure is compressed to a __2-D__ figure.
This helps in better __visualisation__ and better __understanding__ of data points.

<a id=section104></a>

### 1.4 PRINCIPAL COMPONENT ANALYSIS

> _Too many variables? Should you be using all possible variables to generate model?_

To handle the __'curse of dimensionality'__ and avoid issues like __over-fitting__ in high dimensional space, methods like __Principal Component Analysis__ are used.

PCA is a method used for __compressing__ a lot of data into something that captures the __essence__ of the _original data_.
- It reduces the dimension of your data with the __aim of retaining__ as _much information as possible_.
- It can be calculated efficiently with computer programs.
- It combines __highly correlated variables__ to form a smaller, artificial set of variables.
- This artificial set of variables, called __'principal components'__, accounts for __most of the variance__ in the data.
- For a more detailed, basic explanation, _click on the __video__ just below_:

<a href="http://www.youtube.com/watch?feature=player_embedded&v=BfTMmoDFXyE
" target="_blank"><img src="http://img.youtube.com/vi/BfTMmoDFXyE/0.jpg" 
alt="A layman's introduction to principal component analysis" width="240" height="180" border="10" /></a>


The image below is an example of how data points are _arranged in different numbers of dimensions_.
As the __dimensionality increases__, the __complexity of visualization increases__.
In the image below, we can see that:
- _1 dimension has 10 positions, which is easy to read and understand_.
- _2 dimensions have 100 positions, which is still manageable_.
- _3 dimensions have 1000 positions, which is now a bit difficult to read, as we have to check along 3 axes to understand the data well_.

__Note__: We can go to __N dimensions__ (N = 1, 2, 3, ..., 1000, ...), but __4-D and above__ cannot be drawn on a piece of paper the way 1-D, 2-D and 3-D can.

![Imgur](https://i.imgur.com/G4RkZPT.png)
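
The counts in the list above follow from simple arithmetic: with 10 positions per axis, a grid in `d` dimensions has `10 ** d` positions. A minimal sketch, where the function name `grid_positions` is illustrative and not part of any library:

```python
# Number of positions per axis, as in the 10 / 100 / 1000 example above.
positions_per_axis = 10


def grid_positions(n_dims: int, per_axis: int = positions_per_axis) -> int:
    """Return the number of distinct positions in an n_dims-dimensional grid."""
    return per_axis ** n_dims


for d in (1, 2, 3, 10):
    print(f"{d:>2} dimension(s): {grid_positions(d):,} positions")
```

The count grows exponentially with the number of dimensions, which is why high-dimensional data quickly becomes hard to inspect.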

<a id=section105></a>

### 1.4.1 PCA IS NOTHING BUT A COORDINATE SYSTEM TRANSFORMATION.

The output model uses three axes:<br/> L (Length), W (Width) and H (Height), which are perpendicular to each other, to represent the 3-D world. So each data point on that object can be written as a function of three variables:

    Data(i) = f(L(i), W(i), H(i))         [function 1]

In the new coordinate system, each data point on that ellipse can be re-written as a function of two variables:

    Data(j) = g(C1(j), C2(j))             [function 2]


- Function 2 has fewer variables (lower dimensions) than function 1.
        (L, W, H) --> (C1, C2)
- No information is lost, because the data points already lie on a flat 2-D ellipse.
        function 1 == function 2
        The relative geometric positions of all data points remain unchanged.



![Imgur](https://i.imgur.com/k0rvKf1.png)
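
The change of coordinates from (L, W, H) to (C1, C2) can be sketched with NumPy. The data here are hypothetical: points on a flat ellipse that is tilted inside 3-D space, so that two new axes are enough to describe them. The variable names and the use of SVD to find the new axes are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 points on a flat 2-D ellipse embedded in 3-D (L, W, H).
t = rng.uniform(0, 2 * np.pi, size=200)
ellipse_2d = np.column_stack([3.0 * np.cos(t), 1.0 * np.sin(t)])
# Two orthonormal 3-D directions that span the tilted plane.
plane_basis = np.array([[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
plane_basis[0] /= np.linalg.norm(plane_basis[0])
points_3d = ellipse_2d @ plane_basis + np.array([5.0, -2.0, 1.0])

# PCA via SVD of the centered data.
mean = points_3d.mean(axis=0)
centered = points_3d - mean
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Keep the two leading directions: (L, W, H) --> (C1, C2).
components = vt[:2]
coords_2d = centered @ components.T

# Map (C1, C2) back to (L, W, H) and measure what was lost.
reconstructed = coords_2d @ components + mean
max_error = np.abs(reconstructed - points_3d).max()
print("third singular value:", singular_values[2])
print("max reconstruction error:", max_error)
```

Because these points lie exactly in a plane, the third singular value and the reconstruction error are zero up to floating-point rounding. With real data that only approximately lie in a plane, some information would be lost.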

<a id=section106></a>

### 1.4.2 PCA has limitations : example of failures

Any algorithm can __fail__ when its __assumptions are not satisfied__.
- PCA makes the __"largest variance"__ assumption: the directions of largest variance are taken to be the most informative.
- If the data does not follow a multidimensional normal distribution, PCA may not give the best principal components.

![Imgur](https://i.imgur.com/8RS7F6E.png)
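
One way to see such a failure is with hypothetical ring-shaped data, which may differ from the case shown in the figure above. The points have a clear structure (a circle), yet the variance is the same in every direction, so PCA has no preferred axis and dropping either component discards about half of the variance.

```python
import numpy as np

# Hypothetical data: 400 points evenly spaced on a unit circle (not Gaussian).
angles = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])

# Eigenvalues of the covariance matrix give the variance along each principal axis.
eigenvalues = np.linalg.eigvalsh(np.cov(ring, rowvar=False))  # ascending order
explained_ratio = eigenvalues[::-1] / eigenvalues.sum()
print("explained variance ratio:", explained_ratio)
```

Both ratios come out at about 0.5, so a 1-D PCA projection of this data would not capture the ring structure.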

<a id=section107></a>

### 1.4.3 PCA as a whole

![Imgur](https://i.imgur.com/LN5YZVm.png)

<a id=section108></a>

### 1.4.4 PCA explanation through animation

PCA will find the __"best"__ line according to __two different criteria__ of what is "best".
- First, the variation of values along the line should be __maximal__.
    - Pay attention to how the __"spread" (variance)__ of the _red dots_ changes while the line rotates.
    - __Can you see when it _reaches maximum_?__
- Second, if we __reconstruct__ the original two characteristics (__position of a blue dot__) from the new one (__position of a red dot__), the __reconstruction error__ will be given by the _length of the connecting red line_.
    - Observe how the length of these red lines changes while the line rotates.
    - __Can you see when the total length _reaches minimum_?__

If you stare at this animation for some time:
- You will notice that __"the maximum variance"__ and __"the minimum error"__ are reached at the __same time__, namely when the line points to the magenta ticks I marked on both sides of the data cloud.
    - This line corresponds to the _new data property that will be constructed by PCA_.

![image.png](https://raw.githubusercontent.com/insaid2018/Term-3/master/Images/Q7HIP.gif)
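
The rotating line can be imitated numerically. The sketch below uses a hypothetical 2-D point cloud, tries one line per degree, and records both criteria. It measures the reconstruction error as the sum of squared red-line lengths, which is the quantity PCA minimizes. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical correlated 2-D cloud (the "blue dots"), centered at the origin.
x = rng.normal(size=300)
y = 0.6 * x + rng.normal(scale=0.4, size=300)
points = np.column_stack([x, y])
points -= points.mean(axis=0)

# One candidate line per degree, each described by a unit direction vector.
angles = np.deg2rad(np.arange(180))
directions = np.column_stack([np.cos(angles), np.sin(angles)])  # shape (180, 2)

# Position of every "red dot" along every line: shape (300, 180).
projections = points @ directions.T
variance = projections.var(axis=0)

# Rebuild each point from its red dot and measure the connecting red line.
reconstructed = projections[:, :, None] * directions[None, :, :]  # (300, 180, 2)
residuals = points[:, None, :] - reconstructed
error = (residuals ** 2).sum(axis=2).sum(axis=0)  # summed squared lengths per angle

best_by_variance = int(np.argmax(variance))
best_by_error = int(np.argmin(error))
print("angle with maximum variance:", best_by_variance, "degrees")
print("angle with minimum error:   ", best_by_error, "degrees")
```

The two angles coincide because of Pythagoras' theorem: for every point, the squared distance to the origin equals the squared position along the line plus the squared length of the red line. The total on the left is fixed, so making one term larger makes the other smaller.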

### Conclusion

Thus PCA is a method that brings together:
1. A measure of how the variables are associated with one another. (Covariance matrix.)
2. The directions in which our data are dispersed. (Eigenvectors.)
3. The relative importance of these different directions. (Eigenvalues.)

__PCA combines our predictors and allows us to drop the eigenvectors that are relatively unimportant__.
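
The three ingredients above can be put together in a few lines of NumPy. This is a minimal sketch under stated assumptions, not a reference implementation: the function name `pca_eig` and the test data are hypothetical, and library implementations such as scikit-learn's typically use SVD instead of an explicit covariance matrix.

```python
import numpy as np


def pca_eig(data: np.ndarray, n_components: int):
    """Sketch of PCA via the covariance matrix; names are illustrative."""
    centered = data - data.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)              # 1. covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)   # 2. directions, 3. importance
    order = np.argsort(eigenvalues)[::-1]                    # most important first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    kept = eigenvectors[:, :n_components]                    # drop unimportant directions
    explained = eigenvalues[:n_components] / eigenvalues.sum()
    return centered @ kept, explained


rng = np.random.default_rng(1)
base = rng.normal(size=(500, 1))
# Hypothetical data: three predictors, the first two strongly correlated.
data = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(500, 1)),
    rng.normal(scale=0.3, size=(500, 1)),
])
scores, explained = pca_eig(data, n_components=2)
print(scores.shape, explained)
```

Here `np.linalg.eigh` is used because a covariance matrix is symmetric. It returns eigenvalues in ascending order, hence the explicit sort.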

### Covariance Matrix

- A square matrix of numbers that describes the **variance of the data, and the covariance among variables** is called a covariance matrix.
- It is an **empirical description** of the data we observe.

- For two variables, x and y, the 2 x 2 covariance matrix might look like this:
![Imgur](https://i.imgur.com/4rUDI9N.jpg)

- The numbers on the upper left and lower right represent the **variance of the x and y** variables.
- The identical numbers on the lower left and upper right represent the **covariance between x and y**.


#### Graphical Representation:
- If two variables **increase and decrease together** (a line going up and to the right), they have a **positive covariance**.
- If one **decreases while the other increases**, they have a **negative covariance** (a line going down and to the right).
![Imgur](https://i.imgur.com/Suk3cPw.png)
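
The layout of the matrix and the sign of the covariance can be checked with `np.cov`. The numbers below are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.0, 4.0, 5.0, 4.0, 5.0])   # tends to rise with x
y_down = -y_up                                # tends to fall as x rises

# np.cov treats each argument as one variable; rows and columns are ordered (x, y).
cov_up = np.cov(x, y_up)
cov_down = np.cov(x, y_down)

print(cov_up)    # [[2.5 1.5]
                 #  [1.5 1.5]]
print(cov_down)  # [[ 2.5 -1.5]
                 #  [-1.5  1.5]]
```

The diagonal holds the variances of x and y, and the two off-diagonal entries are equal. Only their sign changes when the relationship flips from increasing to decreasing.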

### Let us understand the concept of Eigenvectors and Eigenvalues

![Imgur](https://i.imgur.com/0Aj8Ghr.jpg)

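- An **eigenvector** is a vector whose direction remains unchanged when a linear transformation is applied to it; the transformation only rescales it.
- In PCA, the **eigenvalue** of an eigenvector measures the spread of the data along that direction: it is proportional to the sum of squared distances from the origin of all (centered) data points projected onto that eigenvector.
- The eigenvector with the highest eigenvalue is therefore the **first principal component**.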
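
Both properties can be verified numerically. The sketch below uses hypothetical 2-D data and assumes the covariance matrix is computed with the usual `n - 1` denominator, which is the default for `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=200)
data = np.column_stack([x, 0.5 * x + rng.normal(scale=0.3, size=200)])
centered = data - data.mean(axis=0)

covariance = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # ascending order
top_value = eigenvalues[-1]
top_vector = eigenvectors[:, -1]                        # first principal component

# Applying the matrix only rescales the eigenvector; its direction is unchanged.
transformed = covariance @ top_vector
direction_unchanged = np.allclose(transformed, top_value * top_vector)

# Sum of squared distances from the origin of the projected points, over (n - 1).
projected = centered @ top_vector
ss_distances = (projected ** 2).sum()
matches_eigenvalue = np.isclose(ss_distances / (len(data) - 1), top_value)

print(direction_unchanged, matches_eigenvalue)
```

Both checks should print `True`: the covariance matrix maps the eigenvector onto a multiple of itself, and the eigenvalue equals the sum of squared projected distances divided by `n - 1`.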


<a id=section2></a>

<a id=section3></a>