# Principal Component Analysis

A collection of analyses and code for principal component analysis (PCA). PCA works from the covariance between features to build new axes, called principal components, each of which is a linear combination of the original features. It is often used as an exploratory step to look at the structure of a dataset.

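Uses:
- reduce the number of dimensions
- find patterns
- visualize high-dimensional data
- suppress noise by dropping low-variance components
- improve classification by feeding a model fewer, uncorrelated inputs
- capture as much of the variance as possible in a few components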
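To make the idea of principal components as aggregates of features concrete, here is a minimal sketch on synthetic data. The data, the seed, and the variable names are assumptions chosen for illustration and are not part of the original notebook. Two strongly correlated features are generated, and the leading eigenvector of their covariance matrix ends up weighting both features roughly equally while capturing nearly all of the variance.

```python
import numpy as np

# Synthetic data (an assumption for illustration): two strongly correlated features.
rng = np.random.default_rng(0)
t = rng.normal(size=500)
data = np.column_stack([t, t + 0.1 * rng.normal(size=500)])

# Eigen-decomposition of the covariance matrix gives the principal axes.
cov = np.cov(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues in ascending order

# Share of the total variance captured by each axis, largest first.
explained_ratio = eigenvalues[::-1] / eigenvalues.sum()

# The leading axis has similar-sized weights on both features,
# i.e. it is a combination of the two rather than either one alone.
first_axis = eigenvectors[:, -1]

print("explained variance ratio:", explained_ratio)
print("first principal axis:", first_axis)
```

Because almost all of the variance lies along one direction, keeping only the first component would lose very little information here, which is the intuition behind using PCA for dimensionality reduction.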

## Potential Implementation of PCA

In [1]:
import numpy as np

def PCA(x, n_components):
    
    # subtract mean of each variable to center data around origin
    x_centered = x - np.mean(x, axis=0)
    
    # calculate covariance matrix (features are columns, hence rowvar=False)
    #    square matrix holding the covariance of each pair of features
    cov_mat = np.cov(x_centered, rowvar=False)
    
    # compute eigenvalues and eigenvectors of covariance matrix
    #     each eigenvector (a column) represents one principal axis
    #     eigh is used because the covariance matrix is symmetric
    eigenvalues, eigenvectors = np.linalg.eigh(cov_mat)
    
    # sort eigenvalues in descending order so the axes are ranked by variance explained
    sorted_index = np.argsort(eigenvalues)[::-1]
    sorted_eigenvalue = eigenvalues[sorted_index]
    sorted_eigenvectors = eigenvectors[:, sorted_index]
    
    # select a subset of eigenvectors based on selected components
    eigenvector_subset = sorted_eigenvectors[:, 0:n_components]
    
    # project the mean-centered data onto the selected eigenvectors to get the reduced dataset
    #     result has shape (n_samples, n_components)
    x_reduced = np.dot(x_centered, eigenvector_subset)
    
    return x_reduced
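As a cross-check on the covariance-based steps above, the same projection can be reached through a singular value decomposition of the centered data. This is a sketch of an alternative technique, not the notebook's own method: `pca_svd` is a hypothetical helper name, and the random test data is an assumption made only so the block runs on its own.

```python
import numpy as np

def pca_svd(x, n_components):
    """Hypothetical helper: PCA scores via SVD of the centered data."""
    x_centered = x - np.mean(x, axis=0)
    # Rows of vt are the principal axes, already ordered by decreasing singular value.
    u, s, vt = np.linalg.svd(x_centered, full_matrices=False)
    scores = x_centered @ vt[:n_components].T
    # Singular values relate to covariance eigenvalues by s**2 / (n_samples - 1).
    variances = s[:n_components] ** 2 / (x.shape[0] - 1)
    return scores, variances

# Random correlated data (an assumption for illustration).
rng = np.random.default_rng(42)
sample = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
scores, variances = pca_svd(sample, 2)

# The variances should match the two largest eigenvalues of the covariance matrix.
top_eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(sample, rowvar=False)))[::-1][:2]
print(np.allclose(variances, top_eigenvalues))
```

The sign of each component is arbitrary, so scores from the SVD route and the eigen-decomposition route can differ by a factor of -1 per column while describing the same axes.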

## scikit-learn Implementation of PCA and Related Graphs

In [6]:
# iris dataset for testing

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
import pandas as pd

iris = datasets.load_iris()
x = iris.data
y = iris.target
example_dataset = StandardScaler().fit_transform(x)
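The standardization step matters because PCA is driven by variance, so a feature measured on a larger scale would otherwise dominate the components. As a minimal sketch of what `StandardScaler` does to the iris features, the block below reproduces it by hand; the variable names `scaled` and `manual` are assumptions for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

x = datasets.load_iris().data
scaled = StandardScaler().fit_transform(x)

# Manual equivalent: subtract each column's mean and divide by its
# population standard deviation (ddof=0, NumPy's default for std).
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every feature has mean 0 and standard deviation 1, so each contributes on an equal footing to the covariance matrix.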

In [7]:
from sklearn.decomposition import PCA

df = example_dataset
# set number of components as needed
n_comp = 2

pca = PCA(n_components=n_comp)

principal_components = pca.fit_transform(df)

column_names = [f"PC{i}" for i in range(1, n_comp+1)]
principal_df = pd.DataFrame(data=principal_components, columns=column_names)
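A natural follow-up is to check how much variance the retained components capture and to attach the class labels so the scores can be grouped or colored in a plot. The sketch below is self-contained and repeats the loading and scaling steps; `scores_df`, `variance_ratio`, and the `species` column name are assumptions for illustration.

```python
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
scaled = StandardScaler().fit_transform(iris.data)

pca = PCA(n_components=2)
scores = pca.fit_transform(scaled)

# Fraction of the total variance captured by each retained component.
variance_ratio = pca.explained_variance_ratio_
print("explained variance ratio:", variance_ratio)
print("cumulative:", variance_ratio.cumsum())

# Attach species names so the scores can be grouped or colored by class.
scores_df = pd.DataFrame(scores, columns=["PC1", "PC2"])
scores_df["species"] = iris.target_names[iris.target]
print(scores_df.groupby("species")[["PC1", "PC2"]].mean())
```

A scatter plot of `PC1` against `PC2` colored by `species` is the usual way to visualize this table, and the cumulative ratio is what a scree plot would show when deciding how many components to keep.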