# Dimensionality Reduction: PCA, t-SNE, and UMAP

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Explain PCA (Principal Component Analysis) and its role in dimensionality reduction
2. Compute PCA, interpret explained variance, and create scree plots
3. Use PCA as a preprocessing step before clustering
4. *(Optional)* Understand the basics of t-SNE and UMAP for visualization
5. Know when to use PCA vs. t-SNE vs. UMAP

## Prerequisites

- Notebooks 01-04 (KNN and clustering fundamentals)
- Linear algebra intuition (eigenvectors, covariance -- explained below)
- NumPy, Matplotlib, scikit-learn basics

## Table of Contents

1. [PCA Theory: Variance Maximization](#1-pca-theory-variance-maximization)
2. [PCA on the Iris Dataset](#2-pca-on-the-iris-dataset)
3. [Scree Plot: Explained Variance](#3-scree-plot-explained-variance)
4. [PCA as Preprocessing for Clustering](#4-pca-as-preprocessing-for-clustering)
5. [t-SNE (Optional)](#5-t-sne-optional)
6. [UMAP (Optional)](#6-umap-optional)
7. [When to Use PCA vs. t-SNE vs. UMAP](#7-when-to-use-pca-vs-t-sne-vs-umap)
8. [Common Mistakes](#8-common-mistakes)
9. [Exercise](#9-exercise)

---

## 1. PCA Theory: Variance Maximization

**PCA** finds new axes (principal components) that capture the maximum variance in the data. It is a **linear** transformation that projects data from a high-dimensional space to a lower-dimensional one while preserving as much of the variance as possible.

### The Math

Given a centered data matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ (n samples, p features), PCA proceeds as:

1. **Compute the covariance matrix:**

$$\mathbf{C} = \frac{1}{n-1} \mathbf{X}^T \mathbf{X}$$

2. **Eigendecomposition:**

$$\mathbf{C} \mathbf{v}_i = \lambda_i \mathbf{v}_i$$

where $\mathbf{v}_i$ are the eigenvectors (principal component directions) and $\lambda_i$ are the eigenvalues (variance captured by each component).

3. **Project:** select the top $d$ eigenvectors (sorted by $\lambda_i$ descending) and project:

$$\mathbf{X}_{\text{reduced}} = \mathbf{X} \mathbf{V}_d$$

where $\mathbf{V}_d \in \mathbb{R}^{p \times d}$ contains the top $d$ eigenvectors.
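
The three steps above can be sketched directly in NumPy. This is a minimal illustration, not how scikit-learn implements PCA internally; the function name `pca_eig` and the synthetic matrix `X_demo` are made up for this example.

```python
import numpy as np

def pca_eig(X, d):
    """Project X (n samples x p features) onto its top-d principal components."""
    X_centered = X - X.mean(axis=0)                    # PCA assumes centered data
    C = X_centered.T @ X_centered / (X.shape[0] - 1)   # step 1: covariance matrix (p x p)
    eigvals, eigvecs = np.linalg.eigh(C)               # step 2: eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                  # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V_d = eigvecs[:, :d]                               # step 3: keep the top-d directions
    return X_centered @ V_d, eigvals, V_d

# Synthetic, correlated 3-feature data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.2]])
X_demo = rng.normal(size=(200, 3)) @ mixing

X_red, eigvals, V_d = pca_eig(X_demo, d=2)
print("Reduced shape:", X_red.shape)
print("Eigenvalues (descending):", np.round(eigvals, 4))
# The sample variance of each projected column equals its eigenvalue
print("Projected variances:", np.round(X_red.var(axis=0, ddof=1), 4))
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because the covariance matrix is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.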

### Explained Variance Ratio

The fraction of total variance captured by the $i$-th component is:

$$\text{EVR}_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}$$
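
As a quick check of this formula, the sketch below computes the ratios from the eigenvalues of a covariance matrix and compares them with scikit-learn's `explained_variance_ratio_`. The data in `X_demo` is synthetic and exists only for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: four independent features with different spreads (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

C = np.cov(X_demo, rowvar=False)          # np.cov centers the data and divides by n-1
eigvals = np.linalg.eigvalsh(C)[::-1]     # eigvalsh is ascending, so reverse it
evr_manual = eigvals / eigvals.sum()      # EVR_i = lambda_i / sum_j lambda_j

evr_sklearn = PCA().fit(X_demo).explained_variance_ratio_

print("Manual EVR: ", np.round(evr_manual, 4))
print("sklearn EVR:", np.round(evr_sklearn, 4))
print("Match:", np.allclose(evr_manual, evr_sklearn))
```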

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris, load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

print("Imports complete.")

---

## 2. PCA on the Iris Dataset

The Iris dataset has 4 features. We will reduce it to 2 dimensions for visualization.

In [None]:
# Load and scale
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Original shape: {X_scaled.shape}")
print(f"Features: {feature_names}")

In [None]:
# Apply PCA
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

print(f"Reduced shape: {X_pca.shape}")
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance captured: {pca.explained_variance_ratio_.sum():.4f}")

In [None]:
# 2D scatter plot colored by true class
plt.figure(figsize=(10, 7))
for i, name in enumerate(target_names):
    mask = y == i
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], label=name, s=50, edgecolors="k", alpha=0.8)

plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% variance)")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% variance)")
plt.title("PCA of Iris Dataset (2D)")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
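
To see what each principal component is made of, one option is to inspect the component weights stored in `pca.components_`. The sketch below refits its own PCA so that it runs on its own; the names `iris_demo`, `pca_demo`, and `weights` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris_demo = load_iris()
X_demo = StandardScaler().fit_transform(iris_demo.data)
pca_demo = PCA(n_components=2).fit(X_demo)

# components_ has shape (n_components, n_features); transpose for one row per feature
weights = pd.DataFrame(
    pca_demo.components_.T,
    index=iris_demo.feature_names,
    columns=["PC1", "PC2"],
)
print(weights.round(3))

# Each component is a unit vector, so its squared weights sum to 1
print(np.round((weights ** 2).sum(axis=0).to_numpy(), 6))
```

Features with large absolute weights contribute most to a component. The sign of a component is arbitrary, so only the relative signs within one column carry meaning.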

---

## 3. Scree Plot: Explained Variance

A **scree plot** shows how much variance each principal component captures. It helps decide how many components to keep.

In [None]:
# Fit PCA with all components to see the full variance breakdown
pca_full = PCA(random_state=42)
pca_full.fit(X_scaled)

evr = pca_full.explained_variance_ratio_
cumulative_evr = np.cumsum(evr)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Individual explained variance
axes[0].bar(range(1, len(evr) + 1), evr, color="steelblue", edgecolor="k")
axes[0].set_xlabel("Principal Component")
axes[0].set_ylabel("Explained Variance Ratio")
axes[0].set_title("Scree Plot")
axes[0].set_xticks(range(1, len(evr) + 1))

# Cumulative explained variance
axes[1].plot(range(1, len(cumulative_evr) + 1), cumulative_evr, "bo-", linewidth=2)
axes[1].axhline(y=0.95, color="r", linestyle="--", label="95% threshold")
axes[1].set_xlabel("Number of Components")
axes[1].set_ylabel("Cumulative Explained Variance")
axes[1].set_title("Cumulative Explained Variance")
axes[1].set_xticks(range(1, len(cumulative_evr) + 1))
axes[1].legend()

plt.tight_layout()
plt.show()

for i, (ev, cev) in enumerate(zip(evr, cumulative_evr), 1):
    print(f"PC{i}: {ev:.4f} (cumulative: {cev:.4f})")
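
The number of components needed to reach a threshold can also be read off programmatically instead of from the plot. The sketch below is self-contained and shows two equivalent routes: searching the cumulative curve by hand, and passing a float to `n_components`. The variable names are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_iris().data)
cum_evr = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)

threshold = 0.95
# Index of the first cumulative value that reaches the threshold, converted to a count
n_keep = int(np.searchsorted(cum_evr, threshold) + 1)
print(f"Components needed for {threshold:.0%} variance: {n_keep}")

# Passing a float in (0, 1) asks scikit-learn to make the same choice
pca_auto = PCA(n_components=threshold).fit(X_demo)
print(f"PCA(n_components={threshold}) kept: {pca_auto.n_components_}")
```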

---

## 4. PCA as Preprocessing for Clustering

When data has many features, PCA can reduce dimensionality **before** clustering. This helps:
- Discard low-variance directions, which often carry mostly noise
- Speed up clustering
- Mitigate the curse of dimensionality

Below we use the digits dataset (64 features) and compare K-Means on the original data vs. PCA-reduced data.

In [None]:
# Load digits dataset (8x8 images, 64 features)
digits = load_digits()
X_digits = digits.data
y_digits = digits.target

scaler_d = StandardScaler()
X_digits_scaled = scaler_d.fit_transform(X_digits)

print(f"Digits dataset shape: {X_digits_scaled.shape}")
print(f"Number of classes: {len(np.unique(y_digits))}")

In [None]:
# K-Means on original 64 features
km_orig = KMeans(n_clusters=10, random_state=42, n_init=10)
labels_orig = km_orig.fit_predict(X_digits_scaled)
sil_orig = silhouette_score(X_digits_scaled, labels_orig)

# PCA to reduce to components that capture 95% variance
pca_digits = PCA(n_components=0.95, random_state=42)
X_digits_pca = pca_digits.fit_transform(X_digits_scaled)
print(f"PCA reduced 64 features to {X_digits_pca.shape[1]} components (95% variance)")

# K-Means on PCA-reduced features
km_pca = KMeans(n_clusters=10, random_state=42, n_init=10)
labels_pca = km_pca.fit_predict(X_digits_pca)
sil_pca = silhouette_score(X_digits_pca, labels_pca)

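# Caution: each silhouette score is measured in its own feature space,
# so the two values are only roughly comparable
print(f"\nSilhouette score (original 64 features): {sil_orig:.4f}")
print(f"Silhouette score (PCA-reduced):            {sil_pca:.4f}")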
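
Silhouette scores are computed from distances in whichever space the clustering ran in, so the two numbers above are not measured on the same footing. Because the digits dataset comes with true labels, an external metric such as the adjusted Rand index (ARI) offers a second, space-independent comparison. The sketch below is one way to do that; it refits everything so it runs on its own, and `ari_scores` is a name introduced here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

digits_demo = load_digits()
X_demo = StandardScaler().fit_transform(digits_demo.data)
X_demo_pca = PCA(n_components=0.95, random_state=42).fit_transform(X_demo)

ari_scores = {}
for name, data in [("original", X_demo), ("PCA-reduced", X_demo_pca)]:
    labels = KMeans(n_clusters=10, random_state=42, n_init=10).fit_predict(data)
    # ARI compares cluster assignments with the true digit labels (1.0 = perfect match)
    ari_scores[name] = adjusted_rand_score(digits_demo.target, labels)
    print(f"ARI ({name}): {ari_scores[name]:.4f}")
```

ARI needs ground-truth labels, so it is a diagnostic for teaching datasets like this one rather than something available in a typical unsupervised setting.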

In [None]:
# Visualize PCA-reduced digits in 2D
pca_2d = PCA(n_components=2, random_state=42)
X_digits_2d = pca_2d.fit_transform(X_digits_scaled)

plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_digits_2d[:, 0], X_digits_2d[:, 1],
                      c=y_digits, cmap="tab10", s=10, alpha=0.7)
plt.colorbar(scatter, label="Digit")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Digits Dataset: PCA 2D Projection")
plt.tight_layout()
plt.show()

---

## 5. t-SNE (Optional)

> **This section is OPTIONAL.** t-SNE is a visualization technique, not a general dimensionality reduction method.

**t-SNE** (t-distributed Stochastic Neighbor Embedding) is a **non-linear** method that excels at creating 2D/3D visualizations of high-dimensional data.

Key concepts:
- Converts pairwise distances into probabilities (Gaussian in high-D, Student-t in low-D)
- Minimizes the KL divergence between the two distributions
- **Perplexity** (typical range: 5-50) controls the balance between local and global structure
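
For reference, the quantities named in the list above can be written out as follows. The notation is the standard formulation of t-SNE and is introduced here rather than taken from this notebook: $\mathbf{x}_i$ are the original points, $\mathbf{y}_i$ their low-dimensional images, and $\sigma_i$ a per-point bandwidth.

High-dimensional similarities (Gaussian):

$$p_{j|i} = \frac{\exp\left(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|\mathbf{x}_i - \mathbf{x}_k\|^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

Low-dimensional similarities (Student-t with one degree of freedom):

$$q_{ij} = \frac{\left(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|\mathbf{y}_k - \mathbf{y}_l\|^2\right)^{-1}}$$

Objective minimized over the $\mathbf{y}_i$:

$$\text{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

Each $\sigma_i$ is chosen so that the conditional distribution $P_i$ has the requested perplexity:

$$\text{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}$$

Perplexity can therefore be read as a smooth measure of the effective number of neighbors each point considers.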

**Important caveats:**
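- t-SNE is **stochastic** -- different runs produce different results unless `random_state` is fixed
- **Distances between clusters and apparent cluster sizes are NOT meaningful** -- only local neighborhood structure is reliable
- Computationally expensive for large datasets
- Should NOT be used as input for downstream ML models (use PCA instead): scikit-learn's `TSNE` has no `transform` method, so new samples cannot be projected into an existing embedding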
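
One practical reason behind the last caveat can be checked directly: scikit-learn's `PCA` can project unseen samples with `transform`, while `TSNE` only offers `fit_transform`. The sketch below illustrates this; the train/test split and the choice of 10 components are arbitrary assumptions made for the example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

# PCA learns a reusable linear map; TSNE only embeds the data it was fitted on
print("PCA has transform: ", hasattr(PCA, "transform"))
print("TSNE has transform:", hasattr(TSNE, "transform"))

X_demo = load_digits().data
X_train, X_test = train_test_split(X_demo, test_size=0.25, random_state=0)

pca_demo = PCA(n_components=10).fit(X_train)   # fit on training data only
X_test_reduced = pca_demo.transform(X_test)    # project held-out samples with the same map
print("Held-out samples projected to", X_test_reduced.shape[1], "dimensions")
```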

In [None]:
# OPTIONAL: t-SNE on the digits dataset
from sklearn.manifold import TSNE

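# The iteration count is left at its default (1000); the keyword is `n_iter` in
# older scikit-learn releases and `max_iter` from version 1.5 onward
tsne = TSNE(n_components=2, perplexity=30, random_state=42)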
X_tsne = tsne.fit_transform(X_digits_scaled)

plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1],
                      c=y_digits, cmap="tab10", s=10, alpha=0.7)
plt.colorbar(scatter, label="Digit")
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.title("Digits Dataset: t-SNE 2D (perplexity=30)")
plt.tight_layout()
plt.show()

print("t-SNE often produces better-separated visual clusters than PCA.")
print("BUT: distances between clusters are NOT meaningful.")

In [None]:
# OPTIONAL: Compare different perplexity values
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, perp in zip(axes, [5, 30, 50]):
    # Default iteration count (1000); the keyword was renamed from n_iter to max_iter in scikit-learn 1.5
    tsne_p = TSNE(n_components=2, perplexity=perp, random_state=42)
    X_tsne_p = tsne_p.fit_transform(X_digits_scaled)
    ax.scatter(X_tsne_p[:, 0], X_tsne_p[:, 1], c=y_digits, cmap="tab10", s=8, alpha=0.7)
    ax.set_title(f"t-SNE (perplexity={perp})")

plt.suptitle("Effect of Perplexity on t-SNE", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
print("Lower perplexity emphasizes local structure; higher perplexity captures more global structure.")

---

## 6. UMAP (Optional)

> **This section is OPTIONAL.** UMAP requires the `umap-learn` package.

**UMAP** (Uniform Manifold Approximation and Projection) is a more recent non-linear technique that:
- Is generally **faster** than t-SNE
- Better preserves **global structure** (relative cluster positions)
- Can be used for both visualization and as a general-purpose dimensionality reduction

Key parameter: `n_neighbors` (plays a role similar to perplexity in t-SNE: small values emphasize local structure, large values emphasize global structure)

In [None]:
# OPTIONAL: UMAP on the digits dataset
try:
    import umap
    
    reducer = umap.UMAP(n_components=2, n_neighbors=15, random_state=42)
    X_umap = reducer.fit_transform(X_digits_scaled)
    
    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1],
                          c=y_digits, cmap="tab10", s=10, alpha=0.7)
    plt.colorbar(scatter, label="Digit")
    plt.xlabel("UMAP 1")
    plt.ylabel("UMAP 2")
    plt.title("Digits Dataset: UMAP 2D")
    plt.tight_layout()
    plt.show()
    
    print("UMAP often produces compact, well-separated clusters; results depend on n_neighbors and min_dist.")
except ImportError:
    print("UMAP is not installed. Install with: pip install umap-learn")
    print("Skipping UMAP demo.")

---

## 7. When to Use PCA vs. t-SNE vs. UMAP

| Method | Type | Best For | Preserves | Speed |
|---|---|---|---|---|
| **PCA** | Linear | Preprocessing, feature reduction, denoising | Global structure, variance | Fast |
| **t-SNE** | Non-linear | 2D/3D visualization | Local structure | Slow for large n |
| **UMAP** | Non-linear | Visualization + general reduction | Local + some global | Faster than t-SNE |

**Rules of thumb:**
- Use **PCA** when you need a preprocessor for downstream models (regression, clustering, etc.)
- Use **t-SNE** or **UMAP** when you want to visualize high-dimensional data in 2D
- Use **UMAP** if you also want to preserve global structure or need speed
- Do NOT use t-SNE output as features for supervised learning (use PCA instead); if you use UMAP output as features, validate it on held-out data
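
The first rule of thumb can be sketched as a scikit-learn pipeline. Wrapping the scaler and PCA in a pipeline means both are refitted on the training folds only during cross-validation, which avoids leaking information from the validation fold. The classifier, the 95% threshold, and `max_iter=2000` are assumptions chosen for this illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_digits(return_X_y=True)

# Scale -> PCA (keep 95% of variance) -> classifier, fitted together per fold
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=2000),
)

scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(f"Mean 5-fold accuracy with PCA preprocessing: {scores.mean():.4f}")
```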

---

## 8. Common Mistakes

| Mistake | Why It Matters |
|---|---|
| **Interpreting t-SNE distances as meaningful** | t-SNE does NOT preserve global distances. Two clusters far apart in the plot may or may not be far apart in the original space. |
| **Not scaling before PCA** | PCA finds directions of maximum variance. If features are on different scales, the largest-scale feature dominates. Standardize first unless all features already share the same unit and a comparable range. |
| **Using t-SNE/UMAP for preprocessing** | These are primarily visualization tools (especially t-SNE). Use PCA for feature reduction before training models. |
| **Keeping too few PCA components** | Check the cumulative explained variance. A common threshold is 95%. Keeping too few components loses important information. |
| **Running t-SNE with default perplexity on all data** | Perplexity must be smaller than the number of samples, and larger datasets usually call for larger values. Always try a few values (5, 30, 50) and compare. |
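
The scaling mistake in the table is easy to reproduce. The sketch below builds two independent synthetic features, one of them in much larger units, and compares the explained variance ratio with and without standardization. The data and the factor of 1000 are made up for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Two independent features; the second is expressed in much larger units
X_demo = np.column_stack([rng.normal(size=n), 1000.0 * rng.normal(size=n)])

evr_raw = PCA().fit(X_demo).explained_variance_ratio_
evr_scaled = PCA().fit(StandardScaler().fit_transform(X_demo)).explained_variance_ratio_

print("Unscaled EVR:", np.round(evr_raw, 4))     # PC1 is almost entirely the large-unit feature
print("Scaled EVR:  ", np.round(evr_scaled, 4))  # variance is shared roughly evenly
```

Without scaling, the first component mostly reflects the choice of units rather than any structure in the data.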

---

## 9. Exercise

**Task:** Load the wine dataset (`sklearn.datasets.load_wine`). Scale the features. Apply PCA and create a scree plot to determine how many components capture 90% of the variance. Reduce to that number of components, then run K-Means (k=3) on the reduced data. Visualize the first 2 PCs colored by K-Means labels.

Bonus: If t-SNE is available, create a 2D t-SNE visualization of the wine data colored by the true labels.

In [None]:
# YOUR CODE HERE
from sklearn.datasets import load_wine

# 1. Load wine data and scale
# 2. Fit PCA with all components, create scree plot
# 3. Determine number of components for 90% variance
# 4. Reduce dimensions, run KMeans(k=3)
# 5. Visualize first 2 PCs colored by KMeans labels
# 6. (Bonus) t-SNE visualization colored by true labels