## Dimensionality Reduction

### Motivation I: Data Compression

We may want to reduce the dimension of our features if we have a lot of redundant data.

To do this, we can take two highly correlated features, plot them, and fit a single line that describes both accurately. Projecting each point onto that line gives one new feature in place of the original two.

Doing dimensionality reduction will reduce the total data we have to store in computer memory and will speed up our learning algorithm.

Note: in dimensionality reduction, we reduce the number of features rather than the number of examples. Our variable m stays the same size; n, the number of features each example carries, is reduced.

### Motivation II: Visualization

It is not easy to visualize data that has more than three dimensions. We can reduce the dimensions of our data to 3 or fewer in order to plot it.

We need to find new features that can effectively summarize all the other features.

## Problem Formulation

The most popular dimensionality reduction algorithm is Principal Component Analysis (PCA).

Given two features, we want to find a single line that effectively describes both features at once. We then map our old features onto this new line to get a new single feature. The same can be done with three features, where we map them to a plane.

Reduce from n dimensions to k dimensions:
- Find k vectors $u^{1}, u^{2}, \dots, u^{k}$ that define the k-dimensional subspace onto which we project the data so as to minimize the projection error.

If we are converting from 3D to 2D, we will project our data onto two directions (a plane), so k will be 2.

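PCA is not linear regression:

- In linear regression, we minimize the squared error between every point and our predictor line. These are vertical distances, measured along the axis of the predicted variable y.
- In PCA, we minimize the squared orthogonal (shortest) distances from our data points to the line. No feature is singled out as the one to predict.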
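The difference between the two objectives can be checked numerically. The following is a minimal sketch on synthetic 2-D data (the data, the helper names `vertical_sse` and `orthogonal_sse`, and the choice to only center the data are assumptions made for illustration). It fits one line by least squares and one along the first principal direction, then compares both kinds of error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + 0.5 * rng.normal(size=200)

# Center both coordinates so that both fitted lines pass through the origin.
x = x - x.mean()
y = y - y.mean()
P = np.column_stack([x, y])  # one example per row

def vertical_sse(slope):
    """Sum of squared vertical distances to the line y = slope * x."""
    return np.sum((y - slope * x) ** 2)

def orthogonal_sse(slope):
    """Sum of squared perpendicular distances to the line y = slope * x."""
    return np.sum((y - slope * x) ** 2) / (1.0 + slope ** 2)

# Least-squares slope (linear regression of y on x, centered data).
slope_reg = np.sum(x * y) / np.sum(x * x)

# Slope of the first principal direction (first column of U).
U, _, _ = np.linalg.svd(P.T @ P / len(P))
slope_pca = U[1, 0] / U[0, 0]

print("regression slope:", slope_reg, "PCA slope:", slope_pca)
print("vertical SSE   (reg, pca):", vertical_sse(slope_reg), vertical_sse(slope_pca))
print("orthogonal SSE (reg, pca):", orthogonal_sse(slope_reg), orthogonal_sse(slope_pca))
```

On data like this, the regression line should win on vertical error and the PCA line on orthogonal error, and the two slopes generally differ.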


## Principal Component Analysis Algorithm

Before we can apply PCA, we have to standardize our dataset: for every feature, subtract the feature mean from each observation and divide the result by the feature standard deviation.
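As a rough sketch of this preprocessing step in NumPy (the function name `standardize` and the toy matrix are illustrative assumptions; examples are assumed to be stored one per row):

```python
import numpy as np

def standardize(X):
    """Return (X_norm, mu, sigma) where each column of X_norm has mean 0 and std 1."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    # Guard against constant features, which would otherwise divide by zero.
    sigma = np.where(sigma == 0, 1.0, sigma)
    return (X - mu) / sigma, mu, sigma

# Toy data: 3 examples (rows), 2 features (columns) on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
X_norm, mu, sigma = standardize(X)
print(X_norm)
```

Returning `mu` and `sigma` alongside the scaled data makes it possible to apply the same scaling to new examples later.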

After this, PCA has two tasks: find the k eigenvectors $u^{1}, u^{2}, \dots, u^{k}$, and compute the reduced representations $z^{(1)}, z^{(2)}, \dots, z^{(m)}$, one for each of the m examples.

### Compute the covariance matrix

<img src='files/cov_mat.png'>
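Assuming the figure shows the usual definition $\Sigma = \frac{1}{m}\sum_{i=1}^m x^{(i)} (x^{(i)})^T$ for mean-normalized data, the covariance matrix can be sketched in vectorized form as follows. The function name `covariance_matrix` and the random data are illustrative assumptions; examples are stored one per row.

```python
import numpy as np

def covariance_matrix(X_norm):
    """n x n covariance matrix of mean-normalized data with shape (m, n)."""
    m = X_norm.shape[0]
    return (X_norm.T @ X_norm) / m

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X_norm = X - X.mean(axis=0)  # mean normalization only, for brevity
Sigma = covariance_matrix(X_norm)
print(Sigma.shape)
```

The result is a symmetric n×n matrix, regardless of the number of examples m.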

### Compute the eigenvectors of the covariance matrix

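To do so, we apply singular value decomposition (SVD) to the covariance matrix. Because the covariance matrix is symmetric and positive semidefinite, the columns of the resulting U matrix are its eigenvectors.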

<img src='files/eigen.png'>

Take the first k columns of U to get the 'Ureduce' matrix. This is an n×k matrix, since each column is a vector in the original n-dimensional feature space.

We use it to compute our new features: $z = (Ureduce)^T * x$, which is a k-dimensional vector.
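The steps of this section can be sketched together in NumPy. This is a minimal illustration, not a reference implementation: the function name `pca_project` and the random data are assumptions, and examples are stored as rows, so the per-example formula $z = (Ureduce)^T * x$ becomes a single matrix product.

```python
import numpy as np

def pca_project(X_norm, k):
    """Project standardized data of shape (m, n) onto its first k principal components."""
    m = X_norm.shape[0]
    Sigma = (X_norm.T @ X_norm) / m      # n x n covariance matrix
    U, S, _ = np.linalg.svd(Sigma)       # columns of U are the eigenvectors
    U_reduce = U[:, :k]                  # n x k
    Z = X_norm @ U_reduce                # m x k; row i is U_reduce.T @ x^(i)
    return Z, U_reduce, S

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
Z, U_reduce, S = pca_project(X_norm, k=2)
print(Z.shape, U_reduce.shape)
```

Note that `np.linalg.svd` returns the singular values `S` as a 1-D array sorted in descending order, rather than as a diagonal matrix.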


## Reconstruction from Compressed Representation

If we use PCA to compress our data, how can we uncompress our data, or go back to our original number of features?

To go back from $z \in \mathbb{R}$ to $x \in \mathbb{R}^2$ (or, in general, from k dimensions back to n), we can use the following equation:

$x_{\rm approx} =  Ureduce * z$

Note that we can only get approximations of our original data.
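A small sketch of compression followed by reconstruction, on synthetic 2-D data with two strongly correlated features (the data and variable names are assumptions made for illustration). With examples stored as rows, $x_{\rm approx} = Ureduce * z$ becomes `Z @ U_reduce.T`.

```python
import numpy as np

rng = np.random.default_rng(1)
t = rng.normal(size=200)
# Two highly correlated features, one example per row.
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200)])
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

Sigma = X_norm.T @ X_norm / X_norm.shape[0]
U, S, _ = np.linalg.svd(Sigma)
U_reduce = U[:, :1]            # keep k = 1 component (n x k = 2 x 1)

Z = X_norm @ U_reduce          # compress: 2 features -> 1
X_approx = Z @ U_reduce.T      # reconstruct: 1 feature -> 2 (approximation)

# Average squared projection error over all examples.
mean_sq_error = np.mean(np.sum((X_norm - X_approx) ** 2, axis=1))
print(mean_sq_error)
```

The reconstructed points all lie on the line spanned by the first principal direction, so the error is small only when the discarded direction carried little variance.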

Note: It turns out that the U matrix has the special property that it is a unitary matrix. The inverse of a unitary matrix equals its conjugate transpose, which is simply the transpose since we are dealing with real numbers here:
$U^{-1} = U^{T}$

## Choosing the Number of Principal Components

How do we choose k, also called the number of principal components?

One way to choose k is by using the following formula :

- Given the average squared projection error: $\frac{1}{m}\sum_{i=1}^m\|x^{(i)}-x^{(i)}_\text{approx}\|^2$

- Also given the total variation in the data : $\frac{1}{m}\sum_{i=1}^m\|x^{(i)}\|^2$

- Choose k to be the smallest value such that: $\frac{ \frac{1}{m} \sum_{i=1}^m \|x^{(i)}- x^{(i)}_{\rm approx}\|^2}{\frac{1}{m} \sum_{i=1}^m \|x^{(i)}\|^2} \leq 0.01$ (or $0.05$ if retaining 95% of the variance is acceptable)


In other words, the average squared projection error divided by the total variation should be at most one percent, so that 99% of the variance is retained.

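Algorithm for choosing k:

1. Run PCA with the current value of k, starting at k=1.
2. Compute $Ureduce$, $z$, and $x_{\rm approx}$ for every example.
3. Check with the formula given above whether 99% of the variance is retained. If not, increase k by one and go back to step one.

This procedure is very inefficient, because it reruns PCA and recomputes every projection for each candidate k.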

But recall that when we used the covariance matrix to compute our eigenvectors we also got other matrices : 

U,S,V = svd(Sigma) where the columns of U are the eigenvectors and S is a diagonal matrix holding the corresponding eigenvalues, sorted from largest to smallest.

With that matrix S, we can actually check for 99% of retained variance as follows :

<img src='files/remaining_var_PCA.png'>
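Assuming the figure shows the usual criterion, $\frac{\sum_{i=1}^k S_{ii}}{\sum_{i=1}^n S_{ii}} \geq 0.99$, the smallest k can be found from a single SVD. In the sketch below, the function name `choose_k` and the example values are hypothetical, and `S` is assumed to be the 1-D array of values returned by `np.linalg.svd`, sorted in descending order.

```python
import numpy as np

def choose_k(S, retained=0.99):
    """Smallest k whose first k values of S account for at least `retained` of the total."""
    ratios = np.cumsum(S) / np.sum(S)   # variance retained for k = 1, 2, ..., n
    # First index where the retained fraction reaches the target, as a 1-based k.
    k = int(np.searchsorted(ratios, retained)) + 1
    return min(k, len(S))               # guard against floating-point round-off

# Hypothetical diagonal of S for a 5-feature dataset.
S = np.array([6.0, 3.0, 0.95, 0.04, 0.01])
k = choose_k(S, retained=0.99)
print(k)
```

Here the retained fractions are roughly 0.6, 0.9, 0.995, 0.999 and 1.0, so three components are the fewest that keep 99% of the variance.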



## Advice for Applying PCA

The most common use of PCA is to speed up supervised learning.

Given a training set with a large number of features, we can use PCA to reduce the number of features in each
example of the training set.

Note that we should fit the PCA mapping (the feature scaling parameters and Ureduce) on the training set only, not on the cross-validation or test sets. Afterwards, we apply that same mapping to the cross-validation and test sets.
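One way to keep this separation explicit is to split the work into a fitting step and an application step. The following is a sketch under stated assumptions: the function names `fit_pca` and `apply_pca` and the random data are hypothetical, and examples are stored one per row.

```python
import numpy as np

def fit_pca(X_train, k):
    """Learn scaling parameters and U_reduce from the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # avoid dividing by zero
    X_norm = (X_train - mu) / sigma
    Sigma = X_norm.T @ X_norm / X_norm.shape[0]
    U, _, _ = np.linalg.svd(Sigma)
    return mu, sigma, U[:, :k]

def apply_pca(X, mu, sigma, U_reduce):
    """Map any dataset to the reduced space using the training-set parameters."""
    return ((X - mu) / sigma) @ U_reduce

rng = np.random.default_rng(2)
X_train = rng.normal(size=(80, 6))
X_test = rng.normal(size=(20, 6))

mu, sigma, U_reduce = fit_pca(X_train, k=3)
Z_train = apply_pca(X_train, mu, sigma, U_reduce)
Z_test = apply_pca(X_test, mu, sigma, U_reduce)   # no refitting on test data
print(Z_train.shape, Z_test.shape)
```

The test set is scaled with the training means and standard deviations, so its projected features need not have exactly zero mean.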

Trying to prevent overfitting is a very bad use of PCA. 

It might work, but it is not recommended because PCA does not consider the values of our labels y, so it may discard information that is useful for prediction. Using just regularization will be at least as effective, and we will not lose as much information.

Don't assume you need to do PCA. Try your full machine learning algorithm without PCA first. Then use PCA if you find that you need it.


