# Principal Component Analysis (PCA)

### Approach
1) Subtract the mean from X
2) Calculate the covariance matrix
3) Calculate the eigenvectors and eigenvalues of the covariance matrix
4) Sort the eigenvectors by their eigenvalues in decreasing order
5) Choose the first k eigenvectors; these define the new k dimensions
6) Transform the original n-dimensional data points into k dimensions (projection via dot product)
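
In symbols, the steps above can be summarised as follows. The notation is introduced here only for illustration and is not part of the original notebook: $n$ samples, $d$ features, column mean $\bar{x}$, covariance matrix $C$, and $W_k$ for the matrix of the first $k$ eigenvectors. The $1/(n-1)$ factor matches the default of `np.cov`.

```latex
\[
\begin{aligned}
\tilde{X} &= X - \mathbf{1}\,\bar{x}^{\top}, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \\
C &= \frac{1}{n-1}\,\tilde{X}^{\top}\tilde{X} \\
C\,v_j &= \lambda_j v_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \\
W_k &= [\,v_1, \dots, v_k\,] \\
Z &= \tilde{X}\,W_k
\end{aligned}
\]
```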

In [None]:
import numpy as np

class my_pca:
    def __init__(self, n_components):
        self.n_components = n_components
        self.components = None
        self.mean = None
        self.eigenvalues = None
        self.explained_variance = None
        self.cumulative_variance = None
        
    def fit(self, X):
        #mean
        self.mean = np.mean(X, axis = 0) # column wise
        X = X - self.mean
        
        #Covariance
        cov = np.cov(X.T)
        
        # Eigenvalues and eigenvectors; the covariance matrix is symmetric,
        # so eigh is used to guarantee real-valued results
        eigenvalues, eigenvectors = np.linalg.eigh(cov)
        
        #Sort eigenvectors
        eigenvectors = eigenvectors.T
        index = np.argsort(eigenvalues)[::-1] # Descending order
        eigenvalues = eigenvalues[index]
        eigenvectors = eigenvectors[index]
        
        #Store first n eigenvectors
        self.components = eigenvectors[0 : self.n_components]
        self.eigenvalues = eigenvalues[0 : self.n_components]
        
        # Explained variance: each kept eigenvalue as a fraction of the total
        # variance (the sum of ALL eigenvalues, not only the kept ones)
        total_variance = np.sum(eigenvalues)
        self.explained_variance = self.eigenvalues / total_variance
        self.cumulative_variance = np.cumsum(self.explained_variance)
        
    def transform(self, X):
        #project data
        X = X - self.mean
        return np.dot(X, self.components.T)
        
    def get_eigenvalues(self):
        # Access the eigenvalues
        return self.eigenvalues
    
    def get_explained_variance(self):
        # Access the explained variance
        return self.explained_variance
    
    def get_cumulative_variance(self):
        # Access the cumulative variance
        return self.cumulative_variance
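
As a quick sanity check of this procedure, the sketch below re-implements the same steps as a standalone function on synthetic data. It confirms two properties one would expect from PCA: the components are orthonormal, and the variance of each projected coordinate equals the corresponding eigenvalue. The helper name `pca_eig` and the random data are assumptions made for this illustration.

```python
import numpy as np

def pca_eig(X, k):
    """Hypothetical helper: PCA via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                      # center column-wise
    cov = np.cov(Xc.T)                           # d x d covariance (ddof=1)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]        # descending eigenvalue order
    eigenvalues = eigenvalues[order]
    components = eigenvectors[:, order].T[:k]    # first k eigenvectors as rows
    return Xc @ components.T, components, eigenvalues

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated columns
Z, components, eigenvalues = pca_eig(X_demo, 2)

# Components should be orthonormal
print(np.allclose(components @ components.T, np.eye(2)))
# Variance along each component should equal its eigenvalue
print(np.allclose(Z.var(axis=0, ddof=1), eigenvalues[:2]))
```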
        

In [1]:
#Import libraries
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


In [2]:
#Load dataset
data = datasets.load_iris()
X = data.data
y = data.target

## Statistical Tests to be done for PCA

We run the following two tests on the dataset to decide whether PCA is appropriate for it.

#### 1. Bartlett’s Test of Sphericity

It tests the hypothesis that the variables are uncorrelated in the population.

H0 (null hypothesis): All variables in the data are uncorrelated.

Ha (alternative hypothesis): At least one pair of variables in the data is correlated.

If the null hypothesis cannot be rejected, PCA is not advisable.

In [3]:
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value,p_value = calculate_bartlett_sphericity(X)
p_value

1.9226796044118258e-149

For our dataset the p-value is effectively zero (about 1.9e-149), far below 0.05. Hence we reject the null hypothesis and conclude that at least one pair of variables in the data is correlated, so PCA is recommended.
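
To make the test less of a black box, here is a minimal sketch of the usual Bartlett statistic, $\chi^2 = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln\lvert R\rvert$ with $p(p-1)/2$ degrees of freedom, where $R$ is the correlation matrix. The function name `bartlett_sphericity_stat` and the synthetic data are assumptions for illustration; this is not the `factor_analyzer` source.

```python
import numpy as np

def bartlett_sphericity_stat(X):
    """Hypothetical helper: Bartlett's sphericity statistic and degrees of freedom."""
    n, p = X.shape
    corr = np.corrcoef(X, rowvar=False)          # p x p correlation matrix
    _, logdet = np.linalg.slogdet(corr)          # log|R|, numerically stable
    statistic = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) // 2
    return statistic, dof

rng = np.random.default_rng(0)
shared = rng.normal(size=(150, 1))
X_corr = shared + 0.3 * rng.normal(size=(150, 4))   # columns share a common factor
X_indep = rng.normal(size=(150, 4))                 # independent columns

stat_corr, dof = bartlett_sphericity_stat(X_corr)
stat_indep, _ = bartlett_sphericity_stat(X_indep)
print(stat_corr, stat_indep, dof)
```

The p-value is the upper tail of a chi-square distribution with `dof` degrees of freedom, which could be obtained with, for example, `scipy.stats.chi2.sf(statistic, dof)`. Correlated data gives a large statistic and a tiny p-value; independent data gives a statistic close to `dof`.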

#### 2. Kaiser-Meyer-Olkin (KMO) Test

The KMO Measure of Sampling Adequacy (MSA) is an index used to determine how appropriate PCA is. Generally, if MSA is less than 0.5, PCA is not recommended, since no reduction is expected. On the other hand, MSA > 0.7 is expected to provide a considerable reduction in dimension and extraction of meaningful components.

In [5]:
from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all,kmo_model=calculate_kmo(X)
kmo_model

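0.5400766750097201

For our dataset the overall KMO is about 0.54, just above the 0.5 cutoff, so PCA is admissible, although the value is below 0.7 and this measure alone does not promise a large reduction.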
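
The overall KMO index compares squared correlations with squared partial correlations over all off-diagonal pairs: $\text{KMO} = \sum r_{ij}^2 / \left(\sum r_{ij}^2 + \sum a_{ij}^2\right)$, where the partial correlations $a_{ij}$ come from the inverse correlation matrix. The sketch below follows that textbook formula; the function name `kmo_overall` and the synthetic data are assumptions, and it is not claimed to reproduce `calculate_kmo` digit for digit.

```python
import numpy as np

def kmo_overall(X):
    """Hypothetical helper: overall Kaiser-Meyer-Olkin index."""
    corr = np.corrcoef(X, rowvar=False)
    inv = np.linalg.inv(corr)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)              # partial correlations a_ij
    off = ~np.eye(corr.shape[0], dtype=bool)     # mask for off-diagonal entries
    r2 = np.sum(corr[off] ** 2)
    a2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + a2)

rng = np.random.default_rng(0)
shared = rng.normal(size=(300, 1))
X_demo = shared + 0.3 * rng.normal(size=(300, 4))   # one strong common factor

kmo_demo = kmo_overall(X_demo)
print(kmo_demo)
```

When the variables share a common factor, the pairwise correlations stay large while the partial correlations shrink, which pushes the index toward 1.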

In [None]:
#Project data onto the first 2 principal components
pca = my_pca(2)
pca.fit(X)
X_projected = pca.transform(X)

In [None]:
pca.get_explained_variance()

In [None]:
pca.get_cumulative_variance()
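
Cumulative explained variance is commonly used to pick the number of components: keep the smallest k whose cumulative ratio reaches a chosen target. The sketch below shows this on synthetic data; the 0.95 target and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with clearly unequal spread per column (illustrative only)
X_demo = rng.normal(size=(300, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

Xc = X_demo - X_demo.mean(axis=0)
eigenvalues = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]   # descending order

ratio = eigenvalues / eigenvalues.sum()     # fraction of total variance
cumulative = np.cumsum(ratio)

threshold = 0.95                            # assumed target, not from the notebook
# First index where the cumulative ratio reaches the threshold, plus one
k = int(np.searchsorted(cumulative, threshold) + 1)
print(cumulative, k)
```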

In [None]:
print('Shape of X', X.shape)
print('Shape of Transformed X', X_projected.shape)

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(X_projected[:,0], X_projected[:,1], c = y, alpha=0.8, cmap = plt.get_cmap('viridis',3))
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')

## Using scikit-learn's Predefined PCA

In [6]:
data = datasets.load_iris()
X = data.data
y = data.target


In [7]:
#Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)

StandardScaler()

In [8]:
scaled_data = scaler.transform(X)
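
Standardization rescales every feature to zero mean and unit variance, so that features measured on larger scales do not dominate the principal components. The sketch below reproduces the computation by hand with NumPy on synthetic data, using the population standard deviation (`ddof=0`) as `StandardScaler` does. The data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three features on very different scales (illustrative only)
X_demo = rng.normal(loc=[10.0, -2.0, 0.5], scale=[4.0, 0.1, 25.0], size=(100, 3))

mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)               # ddof=0: population standard deviation
X_scaled = (X_demo - mean) / std       # z-score per column

print(X_scaled.mean(axis=0))           # approximately 0 for every column
print(X_scaled.std(axis=0))            # approximately 1 for every column
```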

In [None]:
#Applying PCA Algorithm
from sklearn.decomposition import PCA
pca = PCA(n_components =2)

In [None]:
data_pca = pca.fit_transform(scaled_data)

In [None]:
pca.explained_variance_
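
Note that `explained_variance_` holds the variances themselves (the eigenvalues), while `explained_variance_ratio_` holds the fractions of total variance. scikit-learn's PCA is based on a singular value decomposition rather than an explicit eigendecomposition; the two routes agree because the squared singular values of the centered data, divided by n - 1, equal the covariance eigenvalues. The NumPy sketch below checks this on synthetic data; all names in it are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4)) @ rng.normal(size=(4, 4))  # correlated columns
Xc = X_demo - X_demo.mean(axis=0)
n = Xc.shape[0]

# Route 1: SVD of the centered data
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
var_svd = S ** 2 / (n - 1)                      # variances along each component

# Route 2: eigenvalues of the covariance matrix, in descending order
var_eig = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]

print(np.allclose(var_svd, var_eig))
ratio = var_svd / var_svd.sum()                 # analogue of the variance ratio
print(ratio)
```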

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(data_pca[:,0], data_pca[:,1], c = y, alpha=0.8, cmap = plt.get_cmap('viridis',3))
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')