### What is Principal Component Analysis (PCA)?

PCA is a statistical technique that projects a high-dimensional dataset onto a lower-dimensional space by identifying the **principal components**: directions (linear combinations of features) along which the data varies the most. The component directions are mutually orthogonal, the projected coordinates are uncorrelated, and the components are ordered by the amount of variance they explain.

- **Key Steps in PCA:**
  1. **Standardize the Data:** Center the features (zero mean) and scale them (unit variance) since PCA is sensitive to feature scales.
  2. **Compute Covariance Matrix:** Measure how features vary together.
  3. **Eigenvalue Decomposition:** Find the eigenvectors of the covariance matrix (the principal components) and their eigenvalues (the variance along each component).
  4. **Project Data:** Transform the original data onto the top `k` principal components, reducing dimensionality from `n` features to `k`.

- **Purpose:**
  - Reduce dimensionality to simplify models, decrease computation time, and mitigate overfitting.
  - Remove noise and redundant information by focusing on the most significant variance.

- **Trade-Off:** PCA discards the variance carried by the dropped components. Because it never looks at the target, a low-variance direction it drops may still be predictive, so accuracy can suffer if important patterns are lost.
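
The key steps above can be sketched directly in NumPy. This is a minimal illustration rather than a reference implementation: the function name `pca_sketch` and the synthetic array `X_demo` are hypothetical, and the sketch assumes every feature has non-zero variance.

```python
import numpy as np

def pca_sketch(X, k):
    """Project X (n_samples, n_features) onto its top-k principal components."""
    # Step 1: standardize each feature (zero mean, unit variance).
    # Assumes no feature is constant; a zero std would divide by zero.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Step 2: covariance matrix of the standardized features.
    cov = np.cov(Z, rowvar=False)

    # Step 3: eigendecomposition. eigh is meant for symmetric matrices and
    # returns eigenvalues in ascending order, so reorder to descending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 4: keep the top-k eigenvectors and project the data onto them.
    components = eigvecs[:, :k]
    scores = Z @ components
    explained_ratio = eigvals[:k] / eigvals.sum()
    return scores, components, explained_ratio

# Hypothetical synthetic data: 4 features, one nearly redundant with another.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
X_demo[:, 3] = X_demo[:, 0] + 0.1 * rng.normal(size=200)

scores, components, explained_ratio = pca_sketch(X_demo, k=2)
print("Projected shape:", scores.shape)
print("Explained variance ratio:", explained_ratio)
```

On the same standardized input, the explained variance ratios should agree with scikit-learn's `PCA`. Individual components may come out with flipped signs, because the sign of an eigenvector is arbitrary.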

In [2]:
### Example Using Scikit-learn’s `diabetes` Dataset
""" We’ll use the `diabetes` dataset (10 features, 442 samples) to apply Linear Regression before and after PCA, comparing MSE and R². """

#### Step 1: Load and Prepare the Data
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# Load diabetes dataset
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Original training data shape:", X_train.shape)
print("Sample of training data:")
print(X_train.head())

Original training data shape: (353, 10)
Sample of training data:
          age       sex       bmi        bp        s1        s2        s3  \
17   0.070769  0.050680  0.012117  0.056301  0.034206  0.049416 -0.039719   
66  -0.009147  0.050680 -0.018062 -0.033213 -0.020832  0.012152 -0.072854   
137  0.005383 -0.044642  0.049840  0.097615 -0.015328 -0.016345 -0.006584   
245 -0.027310 -0.044642 -0.035307 -0.029770 -0.056607 -0.058620  0.030232   
31  -0.023677 -0.044642 -0.065486 -0.081413 -0.038720 -0.053610  0.059685   

           s4        s5        s6  
17   0.034309  0.027364 -0.001078  
66   0.071210  0.000272  0.019633  
137 -0.002592  0.017036 -0.013504  
245 -0.039493 -0.049872 -0.129483  
31  -0.076395 -0.037129 -0.042499  


In [3]:
#### Step 2: Train Model Without PCA (Baseline)

# Train Linear Regression without PCA
lr_baseline = LinearRegression()
lr_baseline.fit(X_train, y_train)

# Predict and evaluate
y_pred_baseline = lr_baseline.predict(X_test)
mse_baseline = mean_squared_error(y_test, y_pred_baseline)
r2_baseline = r2_score(y_test, y_pred_baseline)

print("Performance without PCA:")
print(f"Mean Squared Error: {mse_baseline:.2f}")
print(f"R² Score: {r2_baseline:.2f}")

Performance without PCA:
Mean Squared Error: 2900.19
R² Score: 0.45


In [4]:
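#### Step 3: Apply PCA
""" We’ll standardize the data (recommended before PCA, which is sensitive to feature scales; scikit-learn ships this dataset already centered and scaled, so the step changes little here) and reduce it to, say, 5 components (half the original features), then retrain the model. """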

from sklearn.decomposition import PCA

# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply PCA (retain 5 components)
pca = PCA(n_components=5)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print("Training data shape after PCA:", X_train_pca.shape)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Cumulative explained variance:", np.cumsum(pca.explained_variance_ratio_))



Training data shape after PCA: (353, 5)
Explained variance ratio: [0.39688108 0.1477974  0.12516602 0.10108708 0.06582897]
Cumulative explained variance: [0.39688108 0.54467848 0.6698445  0.77093158 0.83676055]
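
The choice of 5 components above is arbitrary. One common alternative is to keep the smallest number of components whose cumulative explained variance reaches a target. The sketch below assumes a 95% target (an illustrative value, not something the example prescribes) and rebuilds the same split so that it runs on its own. Passing a float to `n_components` asks scikit-learn's `PCA` to make the same selection.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the same train split and scaling used in the steps above.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_scaled = StandardScaler().fit_transform(X_train)

# Fit PCA with all components to see the full variance profile.
pca_full = PCA().fit(X_train_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Assumed target: keep enough components to explain 95% of the variance.
threshold = 0.95
k = int(np.searchsorted(cumulative, threshold, side="right") + 1)

# A float n_components (with the full SVD solver) lets PCA pick k itself.
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_train_scaled)

print("Components needed (manual):", k)
print("Components chosen by PCA:", pca_auto.n_components_)
```

A variance target is only a heuristic: it says how much of the feature variance is retained, not how much predictive signal, so the downstream MSE and R² remain the final check.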


In [5]:
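# Number of components retained by the fitted PCA
print(pca.n_components_)

# Each row of pca.components_ is a unit-length principal direction;
# each entry is the weight (loading) of one original feature.
feature_names = X.columns
loadings_df = pd.DataFrame(
    pca.components_,
    columns=feature_names,
    index=[f'PC{i+1}' for i in range(pca.n_components_)]
)
print("Loadings DataFrame:\n", loadings_df)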

5
Loadings DataFrame:
           age       sex       bmi        bp        s1        s2        s3  \
PC1  0.210113  0.160236  0.312409  0.253470  0.354838  0.354777 -0.273562   
PC2  0.113868 -0.403087 -0.123077 -0.045113  0.544183  0.407079  0.565671   
PC3  0.423640 -0.107244  0.237920  0.557766 -0.159641 -0.352714  0.290746   
PC4  0.487631  0.682857 -0.443322  0.060452  0.066322  0.137428  0.106035   
PC5  0.676362 -0.345954 -0.060748 -0.564717 -0.124912 -0.124565 -0.197181   

           s4        s5        s6  
PC1  0.432013  0.383022  0.329204  
PC2 -0.140732 -0.019057 -0.073287  
PC3 -0.356359  0.109526  0.260588  
PC4 -0.038137 -0.229687 -0.083633  
PC5  0.087306  0.138717  0.058215  
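
A loadings table like this is easier to read when each component is reduced to its strongest contributors. The sketch below refits the same pipeline so that it is self-contained, then lists the three features with the largest absolute loading per component. The cutoff of three and the name `top_features` are illustrative choices. Absolute values are used because the sign of a component is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the same training split, scaling, and 5-component PCA as above.
X = load_diabetes(as_frame=True).data
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
X_train_scaled = StandardScaler().fit_transform(X_train)
pca = PCA(n_components=5).fit(X_train_scaled)

loadings_df = pd.DataFrame(
    pca.components_,
    columns=X.columns,
    index=[f"PC{i+1}" for i in range(pca.n_components_)],
)

# For each component, keep the 3 features with the largest |loading|.
top_features = {
    pc: row.abs().nlargest(3).index.tolist()
    for pc, row in loadings_df.iterrows()
}

for pc, names in top_features.items():
    print(pc, "->", names)
```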
