<a href="https://colab.research.google.com/github/rajdeepbanerjee-git/JNCLectures_Intro_to_ML/blob/main/Week12/2025/Lec12_pca_from_scratch_cleaned_2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PCA (Principal Component Analysis) from Scratch
This notebook demonstrates how to compute PCA manually using linear algebra. We'll use the Breast Cancer dataset from `sklearn.datasets` and reduce its dimensionality while preserving at least 80% of the variance.

### 🧠 What is PCA?
**Principal Component Analysis (PCA)** is a dimensionality reduction technique that re-expresses the features along new orthogonal axes (the principal components), ordered from the direction of greatest variance to the least.

### ✅ Basic Steps of PCA:
1. **Standardize the Data**
2. **Compute the Covariance Matrix**
3. **Compute Eigenvalues and Eigenvectors**
4. **Sort Eigenvectors by Eigenvalue (Explained Variance), in Descending Order**
5. **Select the top `k` components whose cumulative explained variance reaches a target fraction `p`**
6. **Project data onto those components**
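
The six steps above can be written compactly as follows. The notation ($n$ samples, $d$ features, column means $\mu_j$, column standard deviations $\sigma_j$, eigenpairs $(\lambda_i, v_i)$) is introduced here for illustration and does not appear elsewhere in the notebook.

```latex
\begin{aligned}
z_{ij} &= \frac{x_{ij} - \mu_j}{\sigma_j}
  && \text{(1) standardize} \\
C &= \frac{1}{n-1}\, Z^{\top} Z
  && \text{(2) covariance of the centered data} \\
C\, v_i &= \lambda_i\, v_i, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d
  && \text{(3, 4) eigenpairs, sorted} \\
k &= \min\Bigl\{\, m : \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \ge p \Bigr\}
  && \text{(5) smallest } k \text{ reaching fraction } p \\
Z_{\mathrm{pca}} &= Z\, V_k, \qquad V_k = [\, v_1, \dots, v_k \,]
  && \text{(6) project}
\end{aligned}
```

Using $1/n$ instead of $1/(n-1)$ in step 2 rescales every eigenvalue by the same factor, so the eigenvectors and the explained-variance ratios are unaffected.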

## 📊 Load the Dataset

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load breast cancer dataset
cancer = load_breast_cancer(as_frame=True)
cancer_df = cancer.frame
X = cancer_df[cancer['feature_names']]

print('Original Dataframe shape:', cancer_df.shape)
print('Input Feature Matrix shape:', X.shape)

Original Dataframe shape: (569, 31)
Input Feature Matrix shape: (569, 30)


## ⚙️ Standardize the Data

In [2]:
# Zero mean, unit variance per feature (np.std defaults to ddof=0, the population std)
Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
Z.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.097064 | -2.073335 | 1.269934 | 0.984375 | 1.568466 | 3.283515 | 2.652874 | 2.532475 | 2.217515 | 2.255747 | ... | 1.88669 | -1.359293 | 2.303601 | 2.001237 | 1.307686 | 2.616665 | 2.109526 | 2.296076 | 2.750622 | 1.937015 |
| 1 | 1.829821 | -0.353632 | 1.685955 | 1.908708 | -0.826962 | -0.487072 | -0.023846 | 0.548144 | 0.001392 | -0.868652 | ... | 1.805927 | -0.369203 | 1.535126 | 1.890489 | -0.375612 | -0.430444 | -0.146749 | 1.087084 | -0.24389 | 0.28119 |
| 2 | 1.579888 | 0.456187 | 1.566503 | 1.558884 | 0.94221 | 1.052926 | 1.363478 | 2.037231 | 0.939685 | -0.398008 | ... | 1.51187 | -0.023974 | 1.347475 | 1.456285 | 0.527407 | 1.082932 | 0.854974 | 1.955 | 1.152255 | 0.201391 |
| 3 | -0.768909 | 0.253732 | -0.592687 | -0.764464 | 3.283553 | 3.402909 | 1.915897 | 1.451707 | 2.867383 | 4.910919 | ... | -0.281464 | 0.133984 | -0.249939 | -0.550021 | 3.394275 | 3.893397 | 1.989588 | 2.175786 | 6.046041 | 4.93501 |
| 4 | 1.750297 | -1.151816 | 1.776573 | 1.826229 | 0.280372 | 0.53934 | 1.371011 | 1.428493 | -0.00956 | -0.56245 | ... | 1.298575 | -1.46677 | 1.338539 | 1.220724 | 0.220556 | -0.313395 | 0.613179 | 0.729259 | -0.868353 | -0.3971 |
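
A quick way to confirm that standardization did what it should is to check that every column ends up with mean 0 and standard deviation 1. The sketch below uses a small synthetic matrix (`X_demo` is a made-up stand-in, not the cancer data) and plain NumPy, so it runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a feature matrix: 200 samples, 3 features on very different scales
X_demo = rng.normal(loc=[10.0, -5.0, 100.0], scale=[2.0, 0.5, 30.0], size=(200, 3))

# Same formula as the standardization cell: subtract column means, divide by column std (ddof=0)
Z_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

col_means = Z_demo.mean(axis=0)
col_stds = Z_demo.std(axis=0)
print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))  # True True
```

Without this step, features measured on large scales (such as an area) would dominate the covariance matrix simply because of their units.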


## 📐 Covariance Matrix and Eigendecomposition

In [3]:
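# Covariance matrix of the standardized features (30 x 30)
cov_mat = Z.cov()
# cov_mat is symmetric, so np.linalg.eigh (which returns real eigenvalues) is an alternative
eigenvalues, eigenvectors = np.linalg.eig(cov_mat)

# Sort eigenvalues, and the matching eigenvector columns, in descending order
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Cumulative fraction of total variance explained by the first k components
explained_var = np.cumsum(eigenvalues) / np.sum(eigenvalues)
explained_var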

array([0.44272026, 0.63243208, 0.72636371, 0.79238506, 0.84734274,
       0.88758796, 0.9100953 , 0.92598254, 0.93987903, 0.95156881,
       0.961366  , 0.97007138, 0.97811663, 0.98335029, 0.98648812,
       0.98915022, 0.99113018, 0.99288414, 0.9945334 , 0.99557204,
       0.99657114, 0.99748579, 0.99829715, 0.99889898, 0.99941502,
       0.99968761, 0.99991763, 0.99997061, 0.99999557, 1.        ])
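
Because a covariance matrix is symmetric, `np.linalg.eigh` is a natural alternative to `np.linalg.eig`: it returns real eigenvalues in ascending order. The sketch below, on synthetic data (`Z_demo` and `cov_demo` are illustrative names), shows that variant together with a few sanity checks on the decomposition. Eigenvector signs are arbitrary, so either routine may return a component multiplied by -1.

```python
import numpy as np

rng = np.random.default_rng(1)
Z_demo = rng.normal(size=(100, 4))
Z_demo = (Z_demo - Z_demo.mean(axis=0)) / Z_demo.std(axis=0)

# Sample covariance (ddof=1), the same convention as DataFrame.cov()
cov_demo = np.cov(Z_demo, rowvar=False)

# eigh assumes a symmetric matrix and returns eigenvalues in ascending order
vals, vecs = np.linalg.eigh(cov_demo)
vals, vecs = vals[::-1], vecs[:, ::-1]  # flip to descending order

cum_ratio = np.cumsum(vals) / np.sum(vals)

# The eigenvalues sum to the trace (the total variance)
print(np.isclose(vals.sum(), np.trace(cov_demo)))   # True
# Each column v satisfies C v = lambda v
print(np.allclose(cov_demo @ vecs, vecs * vals))    # True
# The cumulative ratio ends at 1
print(np.isclose(cum_ratio[-1], 1.0))               # True
```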

## 🎯 Select Principal Components that Explain 80% Variance

In [4]:
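p = 0.8  # target fraction of variance to retain
# argmax on a boolean array gives the index of the first True; add 1 to turn it into a count
n_components = np.argmax(explained_var >= p) + 1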
eigenvectors_chosen = eigenvectors[:, :n_components]
pca_component = pd.DataFrame(eigenvectors_chosen, index=cancer['feature_names'])
print(f"Number of components explaining {p*100}% variance: {n_components}")

Number of components explaining 80.0% variance: 5
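
The selection rule relies on `np.argmax` returning the position of the first `True` in a boolean array. The toy example below uses rounded, made-up cumulative ratios to show this, compares it with `np.searchsorted`, and highlights one edge case: if no entry reaches the threshold, `argmax` silently returns 0.

```python
import numpy as np

# Toy cumulative explained-variance ratios (illustrative values only)
cum_ratio = np.array([0.44, 0.63, 0.73, 0.79, 0.85, 0.89, 1.00])
p = 0.8

mask = cum_ratio >= p                    # [F, F, F, F, T, T, T]
k_argmax = int(np.argmax(mask)) + 1      # first True is at index 4
k_search = int(np.searchsorted(cum_ratio, p, side="left")) + 1
print(k_argmax, k_search)                # 5 5

# Edge case: an unreachable threshold gives an all-False mask, and argmax returns 0
k_bad = int(np.argmax(cum_ratio >= 1.5)) + 1
print(k_bad)                             # 1, which is misleading
```

For thresholds very close to 1.0, floating-point rounding in the cumulative sum can trigger this edge case, so it is worth guarding against when `p` is user-supplied.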


## 🔁 Project Data onto Selected Components

In [5]:
Z_pca = Z @ pca_component
Z_pca.columns = [f"PC_{i+1}" for i in range(Z_pca.shape[1])]
Z_pca.head()

| | PC_1 | PC_2 | PC_3 | PC_4 | PC_5 |
|---|---|---|---|---|---|
| 0 | 9.192837 | 1.948583 | -1.123166 | 3.633731 | 1.19511 |
| 1 | 2.387802 | -3.768172 | -0.529293 | 1.118264 | -0.621775 |
| 2 | 5.733896 | -1.075174 | -0.551748 | 0.912083 | 0.177086 |
| 3 | 7.122953 | 10.275589 | -3.23279 | 0.152547 | 2.960878 |
| 4 | 3.935302 | -1.948072 | 1.389767 | 2.940639 | -0.546747 |
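
A useful property to verify after projection is that the new columns are uncorrelated and that each one's variance equals the corresponding eigenvalue. The following sketch checks this on synthetic correlated data; all names (`A`, `Z_demo`, `scores`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
# Build correlated features by mixing three independent columns
mix = np.array([[1.0, 0.8, 0.2],
                [0.0, 1.0, 0.5],
                [0.0, 0.0, 1.0]])
A = rng.normal(size=(300, 3)) @ mix
Z_demo = (A - A.mean(axis=0)) / A.std(axis=0)

vals, vecs = np.linalg.eigh(np.cov(Z_demo, rowvar=False))
vals, vecs = vals[::-1], vecs[:, ::-1]  # descending order

# Project onto the top two components
scores = Z_demo @ vecs[:, :2]
score_cov = np.cov(scores, rowvar=False)

# Off-diagonals are ~0 (uncorrelated); the diagonal holds the top two eigenvalues
print(np.allclose(score_cov, np.diag(vals[:2])))  # True
```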


## 🧰 Functional Implementation of PCA

In [6]:
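def calc_pca_components(X, p):
    """Return the top eigenvectors covering a fraction `p` of the variance, plus the standardized data."""
    # Standardize, then eigendecompose the covariance matrix
    Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
    eigenval, eigenvec = np.linalg.eig(Z.cov())
    # Sort by eigenvalue, descending
    idx = eigenval.argsort()[::-1]
    eigenval = eigenval[idx]
    eigenvec = eigenvec[:, idx]
    # Keep the smallest number of components whose cumulative ratio reaches p
    explained_var = np.cumsum(eigenval) / np.sum(eigenval)
    n_components = np.argmax(explained_var >= p) + 1
    pca_component = pd.DataFrame(eigenvec[:, :n_components])
    return pca_component, Z

def project_components(Z, pca_comp):
    """Project the standardized data `Z` onto the selected components."""
    # pca_comp has an integer row index while Z's columns are feature names, so
    # multiply by the raw array; otherwise pandas label alignment raises a ValueError
    Z_pca = Z @ pca_comp.to_numpy()
    Z_pca.columns = [f"PC_{i+1}" for i in range(Z_pca.shape[1])]
    return Z_pca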
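
For comparison, here is a NumPy-only sketch of the same pipeline in a single function. The name `pca_numpy` and its return values are hypothetical choices for this illustration, and it uses `np.linalg.eigh` rather than `np.linalg.eig`, so individual components may differ in sign from the pandas version.

```python
import numpy as np

def pca_numpy(X, p):
    """Hypothetical NumPy-only counterpart of calc_pca_components + project_components."""
    X = np.asarray(X, dtype=float)
    # Standardize each column (ddof=0)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Eigendecompose the symmetric covariance matrix and sort descending
    vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    # Smallest k whose cumulative explained-variance ratio reaches p
    cum_ratio = np.cumsum(vals) / vals.sum()
    k = int(np.argmax(cum_ratio >= p)) + 1
    components = vecs[:, :k]
    return Z @ components, components, cum_ratio[k - 1]

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(150, 6)) @ rng.normal(size=(6, 6))  # correlated synthetic data
scores, components, covered = pca_numpy(X_demo, 0.8)
print(scores.shape[0], components.shape[0], covered >= 0.8)   # 150 6 True
```

Returning the achieved ratio alongside the scores makes it easy to confirm that the chosen number of components actually meets the target.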

## ❓ PCA with Missing Values and Imputation

In [7]:
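def remove_vals(X, percentage):
    """Return a copy of X with roughly a fraction `percentage` of its cells set to NaN."""
    X_new = X.copy()
    num_vals = int(percentage * X.size)
    # Cells are drawn with replacement and without a fixed seed, so repeated hits can
    # leave the realized NaN fraction slightly below `percentage`, and runs will differ
    for _ in range(num_vals):
        i = np.random.randint(0, X.shape[0])
        j = np.random.randint(0, X.shape[1])
        X_new.iat[i, j] = np.nan
    return X_new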

X_with_null = remove_vals(X, 0.1)
# Mean imputation: replace each NaN with the mean of its column
X_with_null = X_with_null.fillna(X_with_null.mean())
pca_comp, Z = calc_pca_components(X_with_null, 0.8)
pca_comp.head()

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.213111 | 0.223376 | 0.046727 | -0.004057 | 0.077562 | -0.053838 |
| 1 | 0.103045 | 0.056667 | -0.082441 | 0.5956 | -0.04643 | 0.023488 |
| 2 | 0.230632 | 0.211633 | 0.013296 | -0.032525 | 0.057932 | -0.041808 |
| 3 | 0.225519 | 0.236833 | -0.02506 | -0.056184 | -0.004679 | 0.006242 |
| 4 | 0.135531 | -0.189908 | 0.104256 | -0.164956 | -0.362088 | 0.272952 |
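
If a reproducible experiment with an exact missing-value fraction is preferred, one option is to sample distinct cell positions from a seeded generator. The sketch below is a NumPy-only variant on synthetic data, not the notebook's `remove_vals`; the names `X_demo`, `frac`, and `flat_idx` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)  # fixed seed for reproducibility
X_demo = rng.normal(size=(50, 4))

frac = 0.1
n_missing = int(frac * X_demo.size)  # 20 of the 200 cells

# Sample distinct flat positions so exactly n_missing cells become NaN
flat_idx = rng.choice(X_demo.size, size=n_missing, replace=False)
X_missing = X_demo.copy()
X_missing.flat[flat_idx] = np.nan

# Column-mean imputation, analogous to DataFrame.fillna(df.mean())
col_means = np.nanmean(X_missing, axis=0)
X_imputed = np.where(np.isnan(X_missing), col_means, X_missing)

print(int(np.isnan(X_missing).sum()), int(np.isnan(X_imputed).sum()))  # 20 0
```

Mean imputation leaves each column's mean unchanged but shrinks its variance and weakens correlations with other columns, which can change how many components are needed to reach a given variance threshold.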
