## Visualizing High Dimensional Data Using PCA & t-SNE

1) Data analysis starts with exploration: looking at the distribution of individual variables and the correlations between them.

2) Visual exploration is an important way to understand the data quickly.

3) Visual representation becomes difficult with high-dimensional data, where the variables are spread across many dimensions.

4) Understanding high-dimensional data becomes easier if we can reduce the number of dimensions.

<b>PCA:</b> Principal Component Analysis<br>
<b>t-SNE:</b> t-Distributed Stochastic Neighbor Embedding.

## PCA

PCA is a linear transformation that finds the "principal components", or directions of greatest variance, in a data set. It can be used for dimension reduction, among other things. In this exercise we implement PCA and apply it to a small 3-dimensional data set (three samples, three features) to see how it works. Let's start by importing NumPy and defining the data set.

In [92]:
import numpy as np   
from numpy import linalg as LA

In [93]:
X = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]])

In [85]:
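def pca(X):
    # Note: X is used as given (no mean-centering), so this is the matrix
    # X^T X / m; it is the covariance matrix only if each column has zero mean.
    X = np.matrix(X)
    cov = (X.T * X) / X.shape[0]
    # perform SVD: columns of U are the principal directions, S holds the
    # singular values, and the third return value is the transpose of V
    U, S, V = np.linalg.svd(cov)
    return U, S, V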

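def project_data(X, U, k):
    # keep the first k principal directions and project X onto them
    U_reduced = U[:,:k]
    return np.dot(X, U_reduced)

def recover_data(Z, U, k):
    # map the k-dimensional projection Z back to the original feature space
    U_reduced = U[:,:k]
    return np.dot(Z, U_reduced.T)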
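
The `pca` function above decomposes `X.T * X / m` for `X` exactly as given, which is a covariance matrix only when every column of `X` already has zero mean. A minimal sketch of a variant that subtracts the column means first is shown here. The names `pca_centered`, `mu` and `X_demo` are hypothetical and not part of the original exercise, and its output differs from the results shown in this notebook.

```python
import numpy as np

def pca_centered(X):
    # Hypothetical helper: center each column, then decompose the covariance matrix.
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)                 # per-feature mean
    Xc = X - mu                         # centered data
    cov = (Xc.T @ Xc) / Xc.shape[0]     # covariance (divides by m, as in pca above)
    U, S, Vt = np.linalg.svd(cov)       # columns of U are the principal directions
    return U, S, mu

# Same values as X in this notebook, copied so the sketch is self-contained.
X_demo = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]])
U_c, S_c, mu = pca_centered(X_demo)
print(mu)      # column means that were subtracted
print(S_c)     # variance along each principal direction
```

When projecting with a centered variant like this, the same `mu` would need to be subtracted before projection and added back after recovery.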

In [86]:
U, S, V = pca(X)  

U, S, V

(matrix([[-0.26726124, -0.68796149, -0.6747447 ],
         [-0.53452248, -0.47677442,  0.69783369],
         [-0.80178373,  0.54717011, -0.24030756]]),
 array([  6.53333333e+01,   1.27541700e-15,   4.42091896e-16]),
 matrix([[-0.26726124, -0.53452248, -0.80178373],
         [ 0.17708179, -0.84512377,  0.50438859],
         [-0.94721353, -0.00717777,  0.32052303]]))
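
The singular values in `S` indicate how much of the data's (uncentered) second moment lies along each column of `U`. One way to read them, sketched here under the assumption that the same data and the same uncentered matrix are used, is to normalize them into fractions. The names `X_demo`, `second_moment` and `explained` are introduced only for this illustration.

```python
import numpy as np

# Same values as X in this notebook, copied so the sketch is self-contained.
X_demo = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]])
second_moment = (X_demo.T @ X_demo) / X_demo.shape[0]   # same matrix pca(X) decomposes
U_demo, S_demo, Vt_demo = np.linalg.svd(second_moment)

explained = S_demo / S_demo.sum()        # fraction captured by each direction
print(np.round(explained, 6))
```

For this data the first fraction is essentially 1, which matches the output above: only the first singular value is meaningfully different from zero.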

### Performing dimension reduction to 2D

In [89]:
Z = project_data(X, U, 2)  
print ('2D Reduced Data: \n')
print (Z) 

print ('\n***********************')

X_recovered = recover_data(Z, U, 2)  
print ('Reconstructed Data: \n')
print (X_recovered)

print ('\n***********************')
print ('Approximation Error: \n')
print (LA.norm(X_recovered-X))  # compute approximation error

2D Reduced Data: 

[[ -3.74165739e+00  -1.11022302e-16]
 [ -7.48331477e+00  -2.22044605e-16]
 [ -1.12249722e+01  -5.55111512e-16]]

***********************
Reconstructed Data: 

[[ 1.  2.  3.]
 [ 2.  4.  6.]
 [ 3.  6.  9.]]

***********************
Approximation Error: 

2.71039981919e-15


### Performing dimension reduction to 1D

In [91]:
Z = project_data(X, U, 1)  
print ('1D Reduced Data: \n')
print (Z) 

print ('\n***********************')

X_recovered = recover_data(Z, U, 1)  
print ('Reconstructed Data: \n')
print (X_recovered)

print ('\n***********************')
print ('Approximation Error: \n')
print (LA.norm(X_recovered-X))  # compute approximation error

1D Reduced Data: 

[[ -3.74165739]
 [ -7.48331477]
 [-11.22497216]]

***********************
Reconstructed Data: 

[[ 1.  2.  3.]
 [ 2.  4.  6.]
 [ 3.  6.  9.]]

***********************
Approximation Error: 

3.69555874282e-15
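
The reconstruction error stays at the level of floating-point round-off even with a single component because every row of `X` is a multiple of `[1, 2, 3]`, so the data has rank 1. A small sketch that checks this and repeats the project/recover round trip for each `k` follows. The names `X_demo`, `U_demo` and `errors` are introduced here for illustration only.

```python
import numpy as np

# Same values as X in this notebook, copied so the sketch is self-contained.
X_demo = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]])
print(np.linalg.matrix_rank(X_demo))     # number of linearly independent rows

# Same decomposition as pca(X), written with plain arrays.
U_demo, _, _ = np.linalg.svd((X_demo.T @ X_demo) / X_demo.shape[0])

errors = {}
for k in (1, 2, 3):
    U_k = U_demo[:, :k]                  # keep the first k directions
    Z_k = X_demo @ U_k                   # project to k dimensions
    X_back = Z_k @ U_k.T                 # map back to the original space
    errors[k] = np.linalg.norm(X_back - X_demo)
print(errors)
```

On data that is not rank 1, the error for small `k` would generally be larger, since the discarded directions would carry real information.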
