# Linear Algebra

In this exercise, we develop a Principal Component Analysis (PCA) algorithm. PCA is widely used to reduce the number of features while preserving most of the information in the data. It is also commonly used to remove noise and produce a cleaner dataset.

This part only handles a dataset with 3 features, so the covariance matrix is 3×3.  
The outcome is a working example of PCA with 3 features.

In [1]:
import numpy as np

## Step 1

Preparing the dataset  
Let's assume the data has 3 features and 5 data points; each feature is stored as a column vector.

In [2]:
x1 = [2,4,5,1,2]
x2 = [5,1,2,2,3]
x3 = [4,2,4,3,1]

# stack the three features as columns, giving an array of shape (5, 3)
x = np.stack([x1,x2,x3]).T
x

array([[2, 5, 4],
       [4, 1, 2],
       [5, 2, 4],
       [1, 2, 3],
       [2, 3, 1]])

## Step 2

Calculating covariance matrix

In [3]:
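def calculate_covariance_matrix(x):
    # rows of x are samples, columns are features
    n_sample = x.shape[0]
    # center each feature on its mean
    centered_x = x - x.mean(axis=0)
    # population covariance: divide by n (np.cov divides by n - 1 by default)
    return (1/n_sample) * (centered_x.T.dot(centered_x))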

In [4]:
cov_mat = calculate_covariance_matrix(x)
cov_mat

array([[ 2.16, -0.88,  0.36],
       [-0.88,  1.84,  0.52],
       [ 0.36,  0.52,  1.36]])
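
As a sanity check, the hand-written function can be compared with NumPy's built-in `np.cov`. The sketch below assumes the same 5×3 data. `rowvar=False` tells NumPy that columns are features, and `bias=True` selects the 1/n normalization used in this exercise.

```python
import numpy as np

x = np.array([[2, 5, 4],
              [4, 1, 2],
              [5, 2, 4],
              [1, 2, 3],
              [2, 3, 1]])

def calculate_covariance_matrix(x):
    n_sample = x.shape[0]
    centered_x = x - x.mean(axis=0)
    return (1 / n_sample) * centered_x.T.dot(centered_x)

cov_manual = calculate_covariance_matrix(x)
# bias=True divides by n; the default (bias=False) divides by n - 1
cov_numpy = np.cov(x, rowvar=False, bias=True)

print(np.allclose(cov_manual, cov_numpy))
```

With the default `bias=False`, every entry would be larger by a factor of n / (n − 1), which rescales the eigenvalues but leaves the eigenvectors unchanged.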

## Step 3

Getting eigenvalues and eigenvectors  
Since the covariance matrix is 3×3, it has three eigenvalues, which we can find by solving this equation:

\begin{equation*}
\begin{vmatrix}
var(X_1) - \lambda & cov(X_1, X_2) & cov(X_1, X_3) \\
cov(X_1, X_2) &  var(X_2) - \lambda & cov(X_2, X_3) \\
cov(X_1, X_3) &  cov(X_2, X_3) & var(X_3) - \lambda
\end{vmatrix} = 0
\end{equation*}

Expanding the determinant gives a 3rd-degree polynomial in lambda; solving it yields the three eigenvalues.  
Each eigenvalue is then used to compute its corresponding eigenvector.
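
The two steps described here can be sketched directly in NumPy. This is only an illustration under the assumption that the covariance matrix is the one computed in Step 2: `np.poly` returns the coefficients of the characteristic polynomial of a square matrix, `np.roots` solves it, and each eigenvector is taken from the null space of `cov_mat - lambda * I` (obtained here through an SVD, which is one choice among several).

```python
import numpy as np

cov_mat = np.array([[ 2.16, -0.88,  0.36],
                    [-0.88,  1.84,  0.52],
                    [ 0.36,  0.52,  1.36]])

# Coefficients of det(lambda * I - cov_mat), highest degree first
char_poly = np.poly(cov_mat)

# The roots of the cubic are the eigenvalues (sorted in descending order here)
lambdas = np.sort(np.roots(char_poly).real)[::-1]

# For each lambda, the eigenvector spans the null space of (cov_mat - lambda * I).
# The right singular vector of the smallest singular value gives that direction.
vectors = []
for lam in lambdas:
    _, _, vt = np.linalg.svd(cov_mat - lam * np.eye(3))
    v = vt[-1]
    vectors.append(v)
    print(lam, v, np.allclose(cov_mat @ v, lam * v))
```

Root-finding on the characteristic polynomial becomes numerically unreliable for larger matrices, which is one reason to rely on a dedicated eigenvalue routine in practice.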

In [5]:
eigen_values, eigen_vectors = np.linalg.eig(cov_mat)
eigen_values

array([0.58960722, 2.89687973, 1.87351306])

In [6]:
eigen_vectors

array([[ 0.48363446,  0.75722987,  0.43897681],
       [ 0.60386509, -0.65172257,  0.4589168 ],
       [-0.63359661, -0.04313479,  0.77246018]])
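
It helps to confirm how the result is laid out: `np.linalg.eig` returns eigenvectors as the columns of the second array, with column `i` paired with `eigen_values[i]`, and it does not guarantee any ordering. The sketch below checks the defining relation for each pair and, as a side note, shows `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix and returns eigenvalues in ascending order.

```python
import numpy as np

cov_mat = np.array([[ 2.16, -0.88,  0.36],
                    [-0.88,  1.84,  0.52],
                    [ 0.36,  0.52,  1.36]])

eigen_values, eigen_vectors = np.linalg.eig(cov_mat)

# Column i of eigen_vectors is the eigenvector for eigen_values[i]
for i in range(len(eigen_values)):
    v = eigen_vectors[:, i]
    assert np.allclose(cov_mat @ v, eigen_values[i] * v)

# eigh exploits symmetry and returns eigenvalues sorted in ascending order
sym_values, sym_vectors = np.linalg.eigh(cov_mat)
print(sym_values)
```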

## Step 4

Sort eigen_values and eigen_vectors in descending order of eigenvalue, and calculate the cumulative percentage of variance explained

In [10]:
sorted_index = eigen_values.argsort()[::-1]
# eigenvectors are the columns of eigen_vectors, so reorder the columns, not the rows
eigen_values, eigen_vectors = eigen_values[sorted_index], eigen_vectors[:, sorted_index]

In [11]:
cumulative_variance_explained = np.cumsum(eigen_values) / np.sum(eigen_values)
cumulative_variance_explained

array([0.54046264, 0.88999865, 1.        ])

## Step 5

Choose the number of components to keep and transform the original dataset

In [13]:
n_component = 2
# project onto the two eigenvectors with the largest eigenvalues
# (x is projected as-is; centering it first would only shift each column by a constant)
x_transformed = x.dot(eigen_vectors[:, :n_component])
x_transformed.round(4)

array([[-1.9167,  6.2624],
       [ 2.2909,  3.7597],
       [ 2.3102,  6.2026],
       [-0.6756,  3.6742],
       [-0.4838,  3.0272]])
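
One common way to use the cumulative variance is to keep the smallest number of components that reaches a target fraction. The sketch below is only an illustration: the 0.85 threshold is a hypothetical choice, not something prescribed by this exercise.

```python
import numpy as np

cumulative_variance_explained = np.array([0.54046264, 0.88999865, 1.0])

threshold = 0.85  # hypothetical target fraction of variance to retain

# searchsorted returns the first index whose cumulative value is >= threshold;
# adding 1 converts that index into a component count
n_component = int(np.searchsorted(cumulative_variance_explained, threshold) + 1)
print(n_component)  # → 2
```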
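
Textbook PCA usually centers the data before projecting, so that each projected column has zero mean and a variance equal to its eigenvalue. A minimal sketch of that check, assuming the same data and using `np.linalg.eigh` (NumPy's solver for symmetric matrices) in place of `np.linalg.eig`:

```python
import numpy as np

x = np.array([[2, 5, 4],
              [4, 1, 2],
              [5, 2, 4],
              [1, 2, 3],
              [2, 3, 1]])

centered_x = x - x.mean(axis=0)
cov_mat = centered_x.T @ centered_x / x.shape[0]

# eigh returns ascending eigenvalues; reorder to descending
eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)
order = np.argsort(eigen_values)[::-1]
eigen_values = eigen_values[order]
eigen_vectors = eigen_vectors[:, order]  # eigenvectors are columns

n_component = 2
scores = centered_x @ eigen_vectors[:, :n_component]

# Population variance of each projected column matches its eigenvalue
print(scores.var(axis=0))
print(eigen_values[:n_component])
```

Projecting the uncentered data instead gives the same values shifted by a constant per column, so the relative positions of the data points are unchanged.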

## Reflection

In this exercise, I learned how to implement PCA with NumPy using its linear algebra package.
Implementing PCA fully from scratch would be tricky because of the complexity of calculating the eigenvalues: we would need to solve a polynomial equation to get them.

PCA is a useful method for reducing the dimensionality of the data and removing noise from the dataset. However, some information is lost after the feature reduction.
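
The amount of lost information can be made concrete. As a rough sketch, assuming the same data and the 1/n covariance from Step 2, the code below projects onto two components, maps the result back to the original 3-feature space, and compares the mean squared reconstruction error per sample with the sum of the discarded eigenvalues.

```python
import numpy as np

x = np.array([[2, 5, 4],
              [4, 1, 2],
              [5, 2, 4],
              [1, 2, 3],
              [2, 3, 1]])

mean = x.mean(axis=0)
centered_x = x - mean
cov_mat = centered_x.T @ centered_x / x.shape[0]

eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)
order = np.argsort(eigen_values)[::-1]
eigen_values = eigen_values[order]
eigen_vectors = eigen_vectors[:, order]

n_component = 2
top = eigen_vectors[:, :n_component]

# Project to 2 dimensions, then map back to the original 3-feature space
scores = centered_x @ top
x_reconstructed = scores @ top.T + mean

# Squared error summed over features, averaged over samples
mse = np.sum((x - x_reconstructed) ** 2) / x.shape[0]
print(mse, eigen_values[n_component:].sum())
```

The two printed numbers agree: the variance carried by the dropped component is exactly what the reconstruction cannot recover.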
