# PCA Basic Example - MA2003B Multivariate Statistics Course

This notebook demonstrates the fundamental concepts of Principal Component Analysis (PCA) using a simple 3×2 dataset. PCA is a dimensionality reduction technique that identifies the principal components (directions of maximum variance) in the data.

## Learning Objectives:
- Understand how PCA transforms correlated variables into uncorrelated principal components
- Interpret eigenvalues and explained variance ratios
- See how PCA rotates the coordinate system to align with data variance

**Data**: A simple matrix of 3 observations × 2 variables, used to illustrate the core concepts

**Expected Output**:
- PC1 captures most of the variance (75%, the horizontal spread along variable 1)
- PC2 captures the remaining variance (25%, the vertical spread along variable 2)
- Transformed data shows uncorrelated coordinates

In [1]:
# Import Required Libraries
import numpy as np
from sklearn.decomposition import PCA
from scipy.linalg import eigh

In [2]:
# Create Sample Data
# A small bivariate dataset whose two variables have zero sample covariance
X = np.array(
    [
        [5, 3],  # Largest value on variable 1
        [3, 1],  # Smallest value on variable 2
        [1, 3],  # Smallest value on variable 1
    ]
)

In [3]:
# Display Original Data
print("Original Data Matrix:")
print("Observations (rows) × Variables (columns)")
print(X)

Original Data Matrix:
Observations (rows) × Variables (columns)
[[5 3]
 [3 1]
 [1 3]]


In [4]:
# Initialize PCA
# No component limit means extract all possible components
pca = PCA()

In [5]:
# Fit and Transform Data
# fit_transform() centers the data and rotates it to align with principal components
X_transformed = pca.fit_transform(X)

In [6]:
# Calculate covariance matrix for X
cov_matrix = np.cov(X, rowvar=False)
print("Covariance Matrix of X:")
print(cov_matrix)

Covariance Matrix of X:
[[4.         0.        ]
 [0.         1.33333333]]
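
The covariance matrix above can be reproduced by hand, which makes it clear why the off-diagonal entry is 0. The following is a minimal sketch that assumes the same `X` as the notebook and the usual sample covariance with an `n - 1` denominator (the default in `np.cov`); the name `cov_manual` is introduced here for illustration only.

```python
import numpy as np

# Same data as the notebook
X = np.array([[5, 3], [3, 1], [1, 3]], dtype=float)

# Center each column on its own mean
X_centered = X - X.mean(axis=0)

# Sample covariance: (Xc^T Xc) / (n - 1)
n = X.shape[0]
cov_manual = X_centered.T @ X_centered / (n - 1)

print(cov_manual)
# Check against NumPy's own estimate
print(np.allclose(cov_manual, np.cov(X, rowvar=False)))
```

The cross-products of the centered columns cancel, so the two variables in this dataset are uncorrelated before any PCA is applied.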


In [7]:
# Calculate eigenvalues and eigenvectors using scipy
eigenvalues_scipy, eigenvectors_scipy = eigh(cov_matrix)

# Sort in descending order (eigh returns ascending)
idx = eigenvalues_scipy.argsort()[::-1]
eigenvalues_scipy = eigenvalues_scipy[idx]
eigenvectors_scipy = eigenvectors_scipy[:, idx]

print("Eigenvalues:")
print(eigenvalues_scipy)
print("\nEigenvectors (columns):")
print(eigenvectors_scipy)

Eigenvalues:
[4.         1.33333333]

Eigenvectors (columns):
[[-1.  0.]
 [-0. -1.]]
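
As a sanity check on the eigendecomposition, each eigenpair should satisfy `C v = λ v`, and so should the negated vector `-v`. The sketch below assumes the covariance matrix printed earlier; the variable names are illustrative and not part of the notebook.

```python
import numpy as np
from scipy.linalg import eigh

# Covariance matrix of the notebook's data
cov_matrix = np.array([[4.0, 0.0], [0.0, 4.0 / 3.0]])

# eigh returns eigenvalues in ascending order; reorder to descending
eigenvalues, eigenvectors = eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

checks = []
for lam, v in zip(eigenvalues, eigenvectors.T):
    holds_for_v = np.allclose(cov_matrix @ v, lam * v)
    holds_for_neg_v = np.allclose(cov_matrix @ (-v), lam * (-v))
    unit_length = np.isclose(np.linalg.norm(v), 1.0)
    checks.append((holds_for_v, holds_for_neg_v, unit_length))

print(checks)
```

Because both `v` and `-v` are valid unit eigenvectors, different libraries may legitimately return opposite signs for the same principal direction.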


In [8]:
# Manual transformation using scipy eigenvectors
#
# The transformed data is obtained by projecting the centered data onto the
# principal component directions (the eigenvectors of the covariance matrix).
#
# Steps:
# 1. Center the data: subtract each variable's mean from X.
# 2. Multiply by the eigenvector matrix: the columns of eigenvectors_scipy are the directions.
#
# The result can then be compared with scikit-learn's X_transformed.

X_centered = X - X.mean(axis=0)
X_manual_transformed = X_centered @ eigenvectors_scipy

print("Manual transformation (using scipy eigenvectors):")
print(X_manual_transformed)

Manual transformation (using scipy eigenvectors):
[[-2.         -0.66666667]
 [ 0.          1.33333333]
 [ 2.         -0.66666667]]
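
One way to confirm that the projection behaves as PCA should is to check that the covariance of the projected scores is diagonal, with the eigenvalues on the diagonal. This is a minimal sketch under the same assumptions as the cells above (`scores` and `cov_scores` are names introduced here).

```python
import numpy as np
from scipy.linalg import eigh

X = np.array([[5, 3], [3, 1], [1, 3]], dtype=float)
X_centered = X - X.mean(axis=0)

# Eigendecomposition of the sample covariance, sorted in descending order
cov_matrix = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project the centered data onto the eigenvectors
scores = X_centered @ eigenvectors

# The covariance of the scores should equal diag(eigenvalues)
cov_scores = np.cov(scores, rowvar=False)
print(np.allclose(cov_scores, np.diag(eigenvalues)))
```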


In [11]:
# Manual transformation (repeated with the scipy eigenvectors)
#
# This cell projects the centered data onto eigenvectors_scipy again.
# No sympy eigenvectors are computed in this notebook, so the result
# matches the previous cell.
#
# Steps:
# 1. Center the data: subtract each variable's mean from X.
# 2. Multiply by the eigenvector matrix: the columns of eigenvectors_scipy are the directions.

X_centered = X - X.mean(axis=0)
X_manual_transformed = X_centered @ eigenvectors_scipy

print("\nManual transformation (scipy eigenvectors, repeated):")
print(X_manual_transformed)


Manual transformation (scipy eigenvectors, repeated):
[[-2.         -0.66666667]
 [ 0.          1.33333333]
 [ 2.         -0.66666667]]


In [12]:
# Extract PCA Results from scikit-learn
eigenvalues = pca.explained_variance_
eigenvectors = pca.components_.T
variance_ratio = pca.explained_variance_ratio_

# Display Eigenvalues and Variance Ratios
print("\n" + "=" * 50)
print("PCA Results from scikit-learn:")
print("=" * 50)
print(f"Eigenvalues: {eigenvalues}")
print(f"Explained variance ratio (PC1): {variance_ratio[0]:.3f}")
print(f"Explained variance ratio (PC2): {variance_ratio[1]:.3f}")


PCA Results from scikit-learn:
Eigenvalues: [4.         1.33333333]
Explained variance ratio (PC1): 0.750
Explained variance ratio (PC2): 0.250
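
The explained variance ratios follow directly from the eigenvalues: each ratio is an eigenvalue divided by the sum of all eigenvalues. A minimal sketch, assuming the eigenvalues 4 and 4/3 reported above (`total_variance` and `cumulative_ratio` are illustrative names):

```python
import numpy as np

# Eigenvalues of the covariance matrix, in descending order
eigenvalues = np.array([4.0, 4.0 / 3.0])

# Total variance is the sum of the eigenvalues (the trace of the covariance matrix)
total_variance = eigenvalues.sum()

# Share of the total variance carried by each principal component
variance_ratio = eigenvalues / total_variance

# Running total, useful when deciding how many components to keep
cumulative_ratio = np.cumsum(variance_ratio)

print(variance_ratio)
print(cumulative_ratio)
```

Here 4 / (4 + 4/3) = 0.75, which agrees with the value scikit-learn reports for PC1.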


In [13]:
# Display Principal Component Directions from scikit-learn
print("\nPrincipal Component Directions (Eigenvectors from scikit-learn):")
print(eigenvectors)


Principal Component Directions (Eigenvectors from scikit-learn):
[[1. 0.]
 [0. 1.]]


In [14]:
# Compare transformations
print("\nTransformed Data from scikit-learn:")
print(X_transformed)

print("\n" + "=" * 50)
print("Comparison: Are both transformations equal?")
print("=" * 50)
print(
    f"Difference (max absolute error): {np.max(np.abs(X_transformed - X_manual_transformed)):.10f}"
)


Transformed Data from scikit-learn:
[[ 2.          0.66666667]
 [-0.         -1.33333333]
 [-2.          0.66666667]]

Comparison: Are both transformations equal?
Difference (max absolute error): 4.0000000000
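
The difference of 4.0 does not mean the two methods disagree about the principal directions. The scipy eigenvectors printed earlier are the negatives of scikit-learn's, so every score is mirrored (for example, -2 versus 2 gives an absolute difference of 4). A minimal sketch of one way to align the signs before comparing is shown below; the names `signs`, `vecs_aligned`, and `max_err` are introduced here, and the alignment rule (matching each manual eigenvector to scikit-learn's component via their dot product) is just one reasonable choice.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.decomposition import PCA

X = np.array([[5, 3], [3, 1], [1, 3]], dtype=float)

# scikit-learn scores
pca = PCA()
scores_sklearn = pca.fit_transform(X)

# Manual eigenvectors, sorted by descending eigenvalue
X_centered = X - X.mean(axis=0)
vals, vecs = eigh(np.cov(X, rowvar=False))
vecs = vecs[:, np.argsort(vals)[::-1]]

# For each component, +1 if the manual eigenvector points the same way
# as scikit-learn's, -1 if it points the opposite way
signs = np.sign(np.sum(vecs * pca.components_.T, axis=0))
vecs_aligned = vecs * signs

scores_manual = X_centered @ vecs_aligned
max_err = np.max(np.abs(scores_sklearn - scores_manual))
print(f"Max absolute error after sign alignment: {max_err:.10f}")
```

After alignment the two sets of scores should agree up to floating-point error.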


## Interpretation

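- **PC1** lies along the first variable (the horizontal axis) and captures 75% of the variance (eigenvalue 4)
- **PC2** lies along the second variable (the vertical axis) and captures the remaining 25% (eigenvalue 1.33)
- In this dataset the sample covariance is 0, so the original variables are already uncorrelated and PCA only centers the data. For correlated data, the transformation **decorrelates the original variables** - they become uncorrelated in the new coordinate system
- In general, PCA rotates the coordinate system to align with the directions of maximum variance
- The scipy-based and scikit-learn results differ only in sign (hence the max absolute difference of 4.0): an eigenvector and its negation describe the same direction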
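
The notebook's dataset has zero sample covariance, so the decorrelation effect is easier to see on data whose variables are correlated. The sketch below uses a hypothetical five-point dataset (`X_corr` is invented for illustration and is not part of the notebook) and performs PCA through an eigendecomposition of the covariance matrix.

```python
import numpy as np
from scipy.linalg import eigh

# Hypothetical dataset with a strong positive correlation (illustration only)
X_corr = np.array(
    [[1.0, 1.0], [2.0, 2.5], [3.0, 2.5], [4.0, 4.5], [5.0, 5.0]]
)

# Correlation between the two original variables
corr_before = np.corrcoef(X_corr, rowvar=False)[0, 1]

# PCA via the covariance matrix: center, decompose, sort, project
X_centered = X_corr - X_corr.mean(axis=0)
vals, vecs = eigh(np.cov(X_corr, rowvar=False))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]
scores = X_centered @ vecs

# Covariance between the two principal component scores
cov_after = np.cov(scores, rowvar=False)[0, 1]

print(f"Correlation before PCA: {corr_before:.3f}")
print(f"Covariance between PC scores: {cov_after:.10f}")
```

In this sketch the principal directions are genuinely rotated away from the original axes, and the covariance between the resulting scores vanishes up to floating-point error.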