## Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction technique that reduces the complexity of high-dimensional data by transforming it into a lower-dimensional space while retaining as much of the original variability as possible. It is an unsupervised method: it needs no labels, and it re-expresses the data in terms of a smaller number of new variables called principal components, each a linear combination of the original variables.

The key idea behind PCA is to find the directions along which the data varies the most and to project the data onto them, producing a new set of variables that captures as much of the original information as possible. The first principal component is the direction that explains the most variance in the data; each subsequent principal component explains as much of the remaining variance as possible while being orthogonal to all the components before it.
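
The claim that the first principal component is the direction of greatest variance can be checked numerically. The sketch below is illustrative only: the synthetic data, the seed, and the helper `projected_variance` are assumptions made for this example, not part of the original notebook. It compares the variance along the top eigenvector of the covariance matrix with the variance along many random unit directions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic 2-D data (an assumption for illustration): stretched along one
# axis, then rotated by 30 degrees, then mean centered.
theta = np.pi / 6
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
X = X @ rotation.T
X = X - X.mean(axis=0)

# Eigendecomposition of the covariance matrix.
# eigh returns eigenvalues in ascending order, so the last one is the largest.
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
first_pc = eigenvectors[:, -1]


def projected_variance(direction):
    """Sample variance of the data projected onto a unit-length direction."""
    return np.var(X @ direction, ddof=1)


# Try many random unit directions and keep the best variance found.
random_dirs = rng.normal(size=(1000, 2))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
best_random = max(projected_variance(d) for d in random_dirs)

print("variance along first PC:", projected_variance(first_pc))
print("largest eigenvalue:     ", eigenvalues[-1])
print("best random direction:  ", best_random)
```

The variance along the first principal component equals the largest eigenvalue of the covariance matrix, and no random direction should exceed it.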

PCA can be summarized in the following steps:

1. Gather and preprocess the data.
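2. Standardize the data (zero mean and unit variance for each variable) so that variables measured on larger scales do not dominate. Mean centering is always required; scaling to unit variance can be skipped when all variables already share the same units.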
3. Compute the covariance matrix of the standardized data.
4. Compute the eigenvalues and eigenvectors of the covariance matrix.
5. Sort the eigenvalues in descending order, and select the top k eigenvectors corresponding to the k largest eigenvalues.
6. Transform the data by multiplying the centered data matrix by the matrix of selected eigenvectors; the resulting coordinates are the principal component scores.
7. Use the transformed data for analysis or visualization.
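
Steps 2 through 6 can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not a reference implementation: the function name `pca_steps`, the toy data, and the returned explained-variance ratio are introduced here for illustration, and the code assumes no column has zero variance.

```python
import numpy as np


def pca_steps(X, k):
    """Steps 2-6: standardize, covariance, eigendecomposition, sort, project."""
    # Step 2: standardize each column to zero mean and unit variance.
    # Assumes every column has nonzero variance.
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    Z = (X - mean) / std

    # Step 3: covariance matrix of the standardized data (features in columns).
    cov = np.cov(Z, rowvar=False)

    # Step 4: eigh is intended for symmetric matrices and returns real values.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Step 5: sort in descending order and keep the top k eigenvectors,
    # which are stored as columns.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    W = eigenvectors[:, :k]

    # Step 6: project the standardized data onto the selected directions.
    scores = Z @ W

    # Fraction of the total variance captured by each kept component.
    explained_ratio = eigenvalues[:k] / eigenvalues.sum()
    return scores, explained_ratio


# Hypothetical toy data: 200 samples, 4 features on very different scales.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
scores, explained_ratio = pca_steps(X, k=2)
print(scores.shape)
print(explained_ratio)
```

The explained-variance ratio is one common way to choose k: keep adding components until the cumulative ratio reaches a threshold that suits the application.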

PCA is widely used in applications such as image processing, finance, and biology. It is a flexible approach to dimensionality reduction and can help reveal patterns and relationships in high-dimensional data. PCA can also serve as a preprocessing step for other machine learning algorithms: reducing the dimensionality of the input lowers training cost and, in some cases, improves the model's performance.
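
As a sketch of the preprocessing use case, PCA can be placed in front of a classifier in a scikit-learn pipeline. The choice of dataset, classifier, and number of components below is an assumption made for illustration; scikit-learn's `PCA` is imported under the alias `SklearnPCA` to avoid clashing with any other class named `PCA`.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA as SklearnPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale the features, reduce 4 features to 2 components, then classify.
model = make_pipeline(
    StandardScaler(),
    SklearnPCA(n_components=2),
    LogisticRegression(max_iter=1000),
)

# 5-fold cross-validated accuracy. Because PCA sits inside the pipeline,
# it is refit on each training fold and never sees the held-out fold.
scores = cross_val_score(model, X, y, cv=5)
print("mean accuracy:", scores.mean())
```

Whether the reduced representation helps or hurts depends on the data and the model, so it is worth comparing against the same pipeline without the PCA step.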

In [None]:
import numpy as np


class PCA:
    
    def __init__(self, n_components):
        self.n_components = n_components
        self.components = None
        self.mean = None
        
    def fit(self, X):
        # mean centering
        self.mean = np.mean(X, axis = 0)
        X = X - self.mean
        
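        # covariance matrix of the features (np.cov expects variables in rows)
        cov = np.cov(X.T)
        
        # eigenvalues, eigenvectors
        # NumPy returns (eigenvalues, eigenvectors), in that order;
        # eigh is the variant for symmetric matrices and returns real values
        eigenvalues, eigenvectors = np.linalg.eigh(cov)
        # eigenvectors come back as columns; transpose so each row is one
        eigenvectors = eigenvectors.T
        
        # sort by descending eigenvalue
        idxs = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idxs]
        eigenvectors = eigenvectors[idxs]
        
        # keep the top n_components eigenvectors (one per row)
        self.components = eigenvectors[:self.n_components]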
    
    def transform(self, X):
        # data projection
        X = X - self.mean
        return np.dot(X, self.components.T)

In [None]:
import matplotlib.pyplot as plt
from sklearn import datasets

data = datasets.load_iris()
X = data.data
y = data.target

pca = PCA(2)
pca.fit(X)
X_projected = pca.transform(X)


x1 = X_projected[:, 0]
x2 = X_projected[:, 1]

plt.scatter(
    x1, x2, c=y, edgecolor="none", alpha=0.8, cmap=plt.get_cmap("viridis", 3)
)

plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.colorbar()
plt.show()
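
One way to sanity-check an eigendecomposition-based projection is to compare it with scikit-learn's `PCA`. The sketch below is self-contained and only illustrative: it recomputes the projection with NumPy using mean centering alone, then compares it with `sklearn.decomposition.PCA` (imported under the alias `SklearnPCA`, an assumption made here to avoid a name clash). The sign of an eigenvector is arbitrary, so the comparison is made on absolute values.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA as SklearnPCA

X = datasets.load_iris().data
X_centered = X - X.mean(axis=0)

# Eigendecomposition route: mean centering only, top 2 eigenvectors.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1][:2]
X_eig = X_centered @ eigenvectors[:, order]

# scikit-learn route, which also centers the data before projecting.
X_sk = SklearnPCA(n_components=2).fit_transform(X)

# Each component may be flipped in sign, so compare absolute values.
matches = np.allclose(np.abs(X_eig), np.abs(X_sk), atol=1e-6)
print("projections agree up to sign:", matches)
```

A sign flip mirrors the scatter plot along the corresponding axis but does not change the structure it shows.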