## Principal Component Analysis

PCA is a classical approach to dimensionality reduction. It applies an orthogonal transformation that maps the original, possibly correlated features onto a new set of linearly uncorrelated features called principal components. The basic idea is to choose the components so that the projected data has the highest possible variance along each one. (A component with higher variance captures more of the spread in the data and is therefore more significant.)
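To make the link between variance and significance concrete, here is a minimal sketch on made-up data: two strongly correlated features, where nearly all of the spread lies along one direction. The data, the seed, and the name `explained_ratio` are illustrative assumptions and are not part of the notebook.

```python
import numpy as np

# Hypothetical toy data: the second feature is roughly twice the first, plus a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + 0.1 * rng.normal(size=500)
points = np.column_stack([x, y])  # 500 data points, 2 features

# Eigenvalues of the covariance matrix are the variances along the principal components
C = np.cov(points, rowvar=False)
eigenvalues = np.linalg.eigvalsh(C)[::-1]  # eigvalsh returns ascending order, so reverse it

# Share of the total variance carried by each component
explained_ratio = eigenvalues / eigenvalues.sum()
print(explained_ratio)
```

Here the first component carries almost all of the variance, so dropping the second one loses very little information.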

Procedure:
1. If we have x data points with y features, arrange them into an $x \times y$ matrix X (one row per data point).
2. Subtract the mean of each feature so that the data is centered around the origin.
3. Calculate the $y \times y$ covariance matrix of the centered data (the diagonal elements are variances and the other elements are covariances): $C = (X^T X)/(x-1)$. This is what `np.cov` computes by default; dividing by $x$ instead only rescales the eigenvalues.
4. Diagonalize the covariance matrix C: compute its eigenvalues and eigenvectors, sort the eigenvectors by decreasing eigenvalue, and collect the n eigenvectors with the largest eigenvalues as the columns of a $y \times n$ matrix A. (Keeping only the n largest eigenvalues reduces the dimension to n.)
5. Y is the output with dimension n: $Y = XA$, an $x \times n$ matrix.
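The steps above can be sketched with plain NumPy arrays as follows. This is only an illustration under stated assumptions: `pca_sketch` is a hypothetical helper name, the data is seeded random integers, and `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca_sketch(X, n_components):
    """Hypothetical helper: project X (one row per data point) onto its top principal components."""
    X_centered = X - X.mean(axis=0)                    # Step 2: center each feature
    C = X_centered.T @ X_centered / (X.shape[0] - 1)   # Step 3: y-by-y covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(C)      # Step 4: eigh returns ascending eigenvalues
    order = np.argsort(eigenvalues)[::-1][:n_components]
    A = eigenvectors[:, order]                         # y-by-n matrix, one component per column
    Y = X_centered @ A                                 # Step 5: x-by-n projected data
    return Y, A, eigenvalues[order]

# Step 1: 1000 data points with 5 features (seeded so the run is repeatable)
rng = np.random.default_rng(0)
X = rng.integers(0, 1000, size=(1000, 5)).astype(float)

Y, A, top_eigenvalues = pca_sketch(X, 3)
print(Y.shape)  # (1000, 3)
```

The covariance matrix of `Y` is diagonal, with the selected eigenvalues on the diagonal, which is what "diagonalizing C" means in step 4.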

In [1]:
import numpy as np
import random as rd

In [2]:
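# Data Preparation (1000 datapoints with 5 features)
# No seed is set, so the values differ on every run
data=[]
for _ in range(1000):
    data1=[]
    for _ in range(5):  # separate loop variable; the original reused i in both loops
        data1.append(rd.randrange(0,1000))
    data.append(data1)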

dataset=np.mat(data) # Step 1

In [3]:
dataset

matrix([[618, 533, 800, 515, 597],
        [261, 184,  39, 463, 748],
        [501, 408, 270, 417, 641],
        ...,
        [198, 338, 352, 904,  26],
        [483, 263,  18, 856, 769],
        [264, 369, 583, 286, 607]])

In [4]:
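def PCA(data_input, target_dimension):
    # Step 2: center each feature (column) on zero
    Standardized_dataset = data_input - np.mean(data_input, axis=0)
    # Step 3: covariance matrix of the features (rowvar=0 treats columns as variables)
    Covariance_matrix = np.cov(Standardized_dataset, rowvar=0)
    # Eigenvalues and eigenvectors of the covariance matrix
    Eigen_values, Eigen_vectors = np.linalg.eig(np.mat(Covariance_matrix))
    # Step 4: indices of the target_dimension largest eigenvalues, in descending order
    Sorted_index = np.argsort(Eigen_values)[::-1][:target_dimension]
    Principal_Components = Eigen_vectors[:, Sorted_index]
    # Step 5: project onto the principal components. Note that this projects the
    # original (uncentered) data_input; projecting Standardized_dataset instead
    # would shift each output column by a constant.
    data_output = data_input * Principal_Components
    print(Principal_Components)
    return data_output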

In [14]:
PCA(dataset,3)

[[-0.50661826 -0.4048483   0.1228957 ]
 [-0.06072292 -0.62905902  0.33698531]
 [ 0.53293622 -0.48626836  0.27840452]
 [ 0.09972926 -0.44255239 -0.890296  ]
 [ 0.66759548  0.08985023  0.03466268]]


matrix([[  214.6460844 ,  -940.79813714,  -211.72432629],
        [   62.37464598,  -987.12497452,  -203.04666044],
        [  131.9846144 , -1076.13761916,   354.09278323],
        ...,
        [  588.5907383 ,  -580.75246291,  -449.93421729],
        [  128.42941207,  -771.598543  ,   -27.55315211],
        [  251.57967284,  -618.01705297,   -33.97749024]])
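As a cross-check on the eigendecomposition route, the same components can also be obtained from the singular value decomposition of the centered data. The sketch below uses its own seeded random data rather than the `dataset` above, so its numbers will not match the output shown; the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.integers(0, 1000, size=(1000, 5)).astype(float)
X_centered = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix
C = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1][:3]
components_eig = eigenvectors[:, order]

# Route 2: SVD of the centered data; the rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components_svd = Vt[:3].T
variances_svd = S[:3] ** 2 / (X.shape[0] - 1)

# Eigenvectors are only defined up to sign, so compare absolute dot products
alignment = np.abs(np.sum(components_eig * components_svd, axis=0))
print(alignment)
```

Each entry of `alignment` should be close to 1, meaning both routes find the same directions up to a sign flip. The sign ambiguity is also why the printed components can change sign between runs or libraries.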