# Dimensionality Reduction Analysis

This notebook explores various dimensionality reduction techniques on our preprocessed datasets:

1. **Principal Component Analysis (PCA)** - Linear dimensionality reduction
2. **t-SNE** - Non-linear manifold learning for visualization
3. **UMAP** - Uniform Manifold Approximation and Projection
4. **Linear Discriminant Analysis (LDA)** - Supervised dimensionality reduction

## Objectives:
- Visualize high-dimensional data in 2D/3D space
- Understand data structure and class separability
- Identify optimal dimensionality for downstream tasks
- Compare linear vs non-linear reduction methods

In [ ]:
import sys
import os
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'src'))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap

# Dimensionality reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import umap
import optuna
from sklearn.metrics import silhouette_score

# Custom modules
from dimensionality.reduction import DimensionalityReducer

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")

## 1. Load Preprocessed Data

In [ ]:
# Load preprocessed data
processed_dir = "../data/processed"

# BBB Dataset
X_bbb_train = np.load(f"{processed_dir}/X_bbb_train.npy")
X_bbb_test = np.load(f"{processed_dir}/X_bbb_test.npy")
y_bbb_train = np.load(f"{processed_dir}/y_bbb_train.npy")
y_bbb_test = np.load(f"{processed_dir}/y_bbb_test.npy")

# Breast Cancer Dataset
X_bc_train = np.load(f"{processed_dir}/X_bc_train.npy")
X_bc_test = np.load(f"{processed_dir}/X_bc_test.npy")
y_bc_train = np.load(f"{processed_dir}/y_bc_train.npy")
y_bc_test = np.load(f"{processed_dir}/y_bc_test.npy")

print("Loaded preprocessed datasets:")
print(f"BBB - Train: {X_bbb_train.shape}, Test: {X_bbb_test.shape}")
print(f"BC  - Train: {X_bc_train.shape}, Test: {X_bc_test.shape}")

# Check class distributions (cast to int because np.bincount rejects float label arrays)
print("\nClass distributions:")
print(f"BBB train: {np.bincount(y_bbb_train.astype(int))}")
print(f"BC train:  {np.bincount(y_bc_train.astype(int))}")
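Before running any reduction, it can help to confirm that each feature matrix and its label vector are consistent. The sketch below is not part of the project code: `check_split` is a hypothetical helper, and the random arrays only stand in for the loaded `.npy` files.

```python
import numpy as np

def check_split(X, y, name):
    """Hypothetical helper: basic consistency checks for one data split."""
    X = np.asarray(X)
    y = np.asarray(y)
    if X.ndim != 2:
        raise ValueError(f"{name}: expected a 2D feature matrix, got shape {X.shape}")
    if X.shape[0] != y.shape[0]:
        raise ValueError(f"{name}: {X.shape[0]} rows but {y.shape[0]} labels")
    if not np.isfinite(X).all():
        raise ValueError(f"{name}: features contain NaN or infinite values")
    # Return the number of samples per class
    classes, counts = np.unique(y, return_counts=True)
    return {int(c): int(n) for c, n in zip(classes, counts)}

# Stand-in data (assumption): 100 samples, 20 features, binary labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 20))
y_demo = rng.integers(0, 2, size=100)
class_counts = check_split(X_demo, y_demo, "demo")
print(class_counts)
```

In the notebook, the same call could be applied to each loaded pair, for example `check_split(X_bbb_train, y_bbb_train, "BBB train")`.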

## 2. Helper Functions for Visualization

In [ ]:
def setup_2d_subplot(ax, data, labels, title):
    """
    Setup a 2D subplot with the given data and labels, minimal axes style,
    with arrow-like x/y axes, no spines, no grid, no tick labels.
    """
    # colormap for binary classification: red/green
    rg_cmap = ListedColormap(["red", "green"])
    
    ax.scatter(data[:, 0], data[:, 1],
               c=labels, cmap=rg_cmap, s=50, edgecolor='k', alpha=0.8)
    ax.set_title(title, fontsize=14)
    
    # Remove all spines
    for spine in ax.spines.values():
        spine.set_visible(False)
    
    # Remove ticks
    ax.tick_params(left=False, bottom=False, labelleft=False, labelbottom=False)
    
    # Remove grid
    ax.grid(False)
    
    x_min, x_max = ax.get_xlim()
    y_min, y_max = ax.get_ylim()
    x_half = (x_max - x_min) / 2.0
    y_half = (y_max - y_min) / 2.0
    
    # Add arrowed axes
    ax.annotate("",
                xy=(x_min + x_half, y_min),
                xytext=(x_min, y_min),
                arrowprops=dict(arrowstyle="->", lw=3, color='k'))
    ax.annotate("",
                xy=(x_min, y_min + y_half),
                xytext=(x_min, y_min),
                arrowprops=dict(arrowstyle="->", lw=3, color='k'))
    
    # Offsets for axis labels
    x_offset = 0.03 * (y_max - y_min)
    y_offset = 0.03 * (x_max - x_min)
    
    # Dynamically choose prefix based on the title
    title_upper = title.upper()
    if "PCA" in title_upper:
        comp_name = "PC"
    elif "UMAP" in title_upper:
        comp_name = "UMAP"
    elif "T-SNE" in title_upper or "TSNE" in title_upper:
        comp_name = "t-SNE"
    else:
        comp_name = "Component"
    
    ax.text(x_min + x_half, y_min - x_offset, f"{comp_name}1",
            fontsize=12, fontweight='bold', ha="center", va="top")
    ax.text(x_min - y_offset, y_min + y_half, f"{comp_name}2",
            fontsize=12, fontweight='bold', ha="right", va="center", rotation=90)

def plot_cumulative_variance(pca, dataset_name, variance_threshold=0.90):
    """Plot cumulative explained variance for PCA"""
    cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
    
    plt.figure(figsize=(10, 6))
    plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, 
             'bo-', linewidth=2, markersize=6)
    plt.axhline(y=variance_threshold, color='r', linestyle='--', 
                label=f'{variance_threshold*100:.0f}% Variance Threshold')
    
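    # Find the smallest number of components whose cumulative variance reaches the threshold.
    # Clip to the available components in case the threshold is never reached.
    n_components = int(min(np.searchsorted(cumulative_variance, variance_threshold) + 1,
                           len(cumulative_variance)))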
    plt.axvline(x=n_components, color='g', linestyle=':', 
                label=f'{n_components} Components')
    
    plt.xlabel('Number of Principal Components')
    plt.ylabel('Cumulative Explained Variance')
    plt.title(f'{dataset_name} - PCA Cumulative Explained Variance')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
    
    return n_components

print("Helper functions defined!")
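The component count returned by `plot_cumulative_variance` is the smallest k whose cumulative explained variance reaches the threshold. A standalone sketch of that selection follows, using made-up variance ratios and a hypothetical `components_for_threshold` helper that is not part of the notebook.

```python
import numpy as np

def components_for_threshold(explained_variance_ratio, threshold=0.90):
    """Smallest k whose cumulative explained variance reaches `threshold`."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index i with cumulative[i] >= threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Clip in case the threshold is never reached (e.g. threshold > 1)
    return min(k, len(cumulative))

ratios = np.array([0.50, 0.30, 0.15, 0.05])  # made-up ratios for illustration
k_90 = components_for_threshold(ratios, 0.90)
k_75 = components_for_threshold(ratios, 0.75)
k_unreachable = components_for_threshold(ratios, 1.5)
print(k_90, k_75, k_unreachable)  # prints: 3 2 4
```

The clipping step matters only in edge cases, but it avoids silently returning a single component when no prefix of the spectrum reaches the threshold.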

## 3. Principal Component Analysis (PCA)

In [ ]:
# Perform PCA analysis for BBB dataset
print("=== BBB Dataset PCA Analysis ===")

# Full PCA to analyze variance
pca_bbb_full = PCA()
pca_bbb_full.fit(X_bbb_train)

# Plot cumulative variance explained
n_components_bbb = plot_cumulative_variance(pca_bbb_full, "BBB Dataset", variance_threshold=0.90)

print(f"Components needed for 90% variance: {n_components_bbb}")
print(f"Explained variance by first 10 components:")
for i in range(min(10, len(pca_bbb_full.explained_variance_ratio_))):
    print(f"  PC{i+1}: {pca_bbb_full.explained_variance_ratio_[i]:.4f}")

# 2D PCA visualization
pca_bbb_2d = PCA(n_components=2)
X_bbb_pca_2d = pca_bbb_2d.fit_transform(X_bbb_train)

print(f"\n2D PCA - Explained variance ratio: {pca_bbb_2d.explained_variance_ratio_}")
print(f"Total explained variance: {sum(pca_bbb_2d.explained_variance_ratio_):.4f}")

In [ ]:
# Breast Cancer PCA analysis
print("\n=== Breast Cancer Dataset PCA Analysis ===")

# Full PCA
pca_bc_full = PCA()
pca_bc_full.fit(X_bc_train)

# Plot cumulative variance
n_components_bc = plot_cumulative_variance(pca_bc_full, "Breast Cancer Dataset", variance_threshold=0.90)

print(f"Components needed for 90% variance: {n_components_bc}")
print(f"Explained variance by first 10 components:")
for i in range(min(10, len(pca_bc_full.explained_variance_ratio_))):
    print(f"  PC{i+1}: {pca_bc_full.explained_variance_ratio_[i]:.4f}")

# 2D PCA visualization
pca_bc_2d = PCA(n_components=2)
X_bc_pca_2d = pca_bc_2d.fit_transform(X_bc_train)

print(f"\n2D PCA - Explained variance ratio: {pca_bc_2d.explained_variance_ratio_}")
print(f"Total explained variance: {sum(pca_bc_2d.explained_variance_ratio_):.4f}")

In [ ]:
# Plot PCA visualizations side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

setup_2d_subplot(axes[0], X_bbb_pca_2d, y_bbb_train, "BBB Dataset PCA")
setup_2d_subplot(axes[1], X_bc_pca_2d, y_bc_train, "Breast Cancer Dataset PCA")

plt.suptitle("PCA Visualization Comparison", fontsize=16)
plt.tight_layout()
plt.show()
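For readers who want to see what the PCA projection computes, the sketch below reproduces the explained variance ratios and 2D scores with a plain SVD in NumPy. The data are synthetic stand-ins (an assumption), not the BBB or Breast Cancer features.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumption): 200 samples, 5 correlated features
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# Center the data, then take the SVD: rows of Vt are the principal axes
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Eigenvalues of the sample covariance matrix are S**2 / (n_samples - 1)
explained_variance = S**2 / (X.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Project onto the first two principal axes (the 2D scores that get plotted)
X_2d = X_centered @ Vt[:2].T
print(explained_variance_ratio.round(3))
```

Up to the sign of each axis, `X_2d` corresponds to what `PCA(n_components=2).fit_transform` returns for the same input.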

## 4. t-SNE (t-Distributed Stochastic Neighbor Embedding)

In [ ]:
# Apply t-SNE to both datasets
print("Applying t-SNE...")

# n_iter is not passed: 1000 iterations is already the scikit-learn default, and the
# argument was renamed to max_iter in scikit-learn 1.5.

# BBB Dataset t-SNE
print("Computing t-SNE for BBB dataset...")
tsne_bbb = TSNE(n_components=2, random_state=42, perplexity=30)
X_bbb_tsne = tsne_bbb.fit_transform(X_bbb_train)

# Breast Cancer Dataset t-SNE
print("Computing t-SNE for Breast Cancer dataset...")
tsne_bc = TSNE(n_components=2, random_state=42, perplexity=30)
X_bc_tsne = tsne_bc.fit_transform(X_bc_train)

print("t-SNE computation completed!")

# Plot t-SNE visualizations
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

setup_2d_subplot(axes[0], X_bbb_tsne, y_bbb_train, "BBB Dataset t-SNE")
setup_2d_subplot(axes[1], X_bc_tsne, y_bc_train, "Breast Cancer Dataset t-SNE")

plt.suptitle("t-SNE Visualization Comparison", fontsize=16)
plt.tight_layout()
plt.show()
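t-SNE plots depend on the perplexity setting, so a quantitative check can complement the visual impression. One option is `sklearn.manifold.trustworthiness`, which measures how well local neighborhoods survive the embedding. The sketch below runs on synthetic blobs (an assumption) and compares two perplexity values; it is an illustration, not a tuning procedure used in this notebook.

```python
import numpy as np
from sklearn.manifold import TSNE, trustworthiness

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumption): two Gaussian blobs in 10 dimensions
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 10)),
               rng.normal(4.0, 1.0, size=(60, 10))])

scores = {}
for perplexity in (5, 30):
    embedding = TSNE(n_components=2, perplexity=perplexity,
                     random_state=42).fit_transform(X)
    # Trustworthiness lies in [0, 1]; higher means local neighborhoods are better preserved
    scores[perplexity] = trustworthiness(X, embedding, n_neighbors=5)

print(scores)
```

The same call could be applied to `X_bbb_train` and `X_bbb_tsne` to check whether `perplexity=30` is a reasonable choice for that dataset.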

## 5. UMAP (Uniform Manifold Approximation and Projection)

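UMAP is a non-linear manifold-learning technique that is typically faster than t-SNE and often preserves more of the global structure. The embeddings below are supervised: the class labels are passed to `fit_transform`, so the class separation they show is partly driven by the labels and is not directly comparable to the unsupervised PCA and t-SNE plots.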

In [ ]:
# Apply UMAP to both datasets
print("Applying UMAP...")

# UMAP configuration based on optimization results.
# target_metric and target_weight only take effect when labels are passed to fit/fit_transform.
umap_config = {
    'n_neighbors': 28,
    'min_dist': 0.04,
    'n_components': 2,
    'target_metric': 'categorical',
    'target_weight': 0.5,
    'random_state': 42
}

# BBB Dataset UMAP (supervised)
print("Computing supervised UMAP for BBB dataset...")
umap_bbb = umap.UMAP(**umap_config)
X_bbb_umap = umap_bbb.fit_transform(X_bbb_train, y_bbb_train)

# Breast Cancer Dataset UMAP (supervised)
print("Computing supervised UMAP for Breast Cancer dataset...")
umap_bc = umap.UMAP(**umap_config)
X_bc_umap = umap_bc.fit_transform(X_bc_train, y_bc_train)

print("UMAP computation completed!")

# Plot UMAP visualizations
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

setup_2d_subplot(axes[0], X_bbb_umap, y_bbb_train, "BBB Dataset UMAP")
setup_2d_subplot(axes[1], X_bc_umap, y_bc_train, "Breast Cancer Dataset UMAP")

plt.suptitle("UMAP Visualization Comparison", fontsize=16)
plt.tight_layout()
plt.show()

## 6. Linear Discriminant Analysis (LDA)

LDA is a supervised dimensionality reduction technique that maximizes class separability.

In [ ]:
# Apply LDA to both datasets
print("Applying LDA...")

# For binary classification, LDA extracts at most 1 component
n_components = 1

# BBB Dataset LDA
lda_bbb = LinearDiscriminantAnalysis(n_components=n_components)
X_bbb_lda = lda_bbb.fit_transform(X_bbb_train, y_bbb_train)

# Breast Cancer Dataset LDA
lda_bc = LinearDiscriminantAnalysis(n_components=n_components)
X_bc_lda = lda_bc.fit_transform(X_bc_train, y_bc_train)

print(f"LDA completed! Extracted {n_components} discriminant component(s)")

# Plot LDA distributions
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# BBB Dataset LDA distribution
for label in np.unique(y_bbb_train):
    subset = X_bbb_lda[y_bbb_train == label].ravel()
    axes[0].hist(subset, alpha=0.7, label=f"Class {label}", bins=30, density=True)
axes[0].set_xlabel("LDA Component 1")
axes[0].set_ylabel("Density")
axes[0].set_title("BBB Dataset LDA")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Breast Cancer Dataset LDA distribution
for label in np.unique(y_bc_train):
    subset = X_bc_lda[y_bc_train == label].ravel()
    axes[1].hist(subset, alpha=0.7, label=f"Class {label}", bins=30, density=True)
axes[1].set_xlabel("LDA Component 1")
axes[1].set_ylabel("Density")
axes[1].set_title("Breast Cancer Dataset LDA")
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.suptitle("LDA Class Separation Analysis", fontsize=16)
plt.tight_layout()
plt.show()

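# Print LDA statistics
# Note: with two classes there is a single discriminant, so explained_variance_ratio_
# is 1.0 (up to rounding) and does not measure how well the classes separate.
print("\nLDA Analysis:")
print("BBB Dataset - explained variance ratio of the discriminant:")
print(f"  Explained variance ratio: {lda_bbb.explained_variance_ratio_}")
print("BC Dataset - explained variance ratio of the discriminant:")
print(f"  Explained variance ratio: {lda_bc.explained_variance_ratio_}")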
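For two classes, the single LDA direction can be written in closed form as the Fisher discriminant, and the Fisher criterion gives a direct measure of class separation along that direction. The NumPy sketch below uses synthetic data (an assumption) and is meant only to illustrate the idea; it is not how scikit-learn's default SVD solver is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumption): two classes with shifted means
X0 = rng.normal(0.0, 1.0, size=(80, 4))
X1 = rng.normal(1.5, 1.0, size=(80, 4))

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter: sum of the per-class scatter matrices
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
# Fisher direction: w is proportional to S_w^{-1} (mu1 - mu0)
w = np.linalg.solve(S_w, mu1 - mu0)
w /= np.linalg.norm(w)

# Project each class onto the discriminant direction
z0, z1 = X0 @ w, X1 @ w
# Fisher criterion: squared gap between projected means over the sum of projected variances
fisher_ratio = (z1.mean() - z0.mean()) ** 2 / (z0.var() + z1.var())
print(f"Fisher ratio: {fisher_ratio:.2f}")
```

A larger ratio means the two histograms overlap less, which makes it a more informative number to report than the explained variance ratio in the binary case.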

## 7. Comprehensive Comparison

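The grid below compares the three 2D embeddings (PCA, t-SNE, UMAP) for both datasets. LDA is omitted because it yields a single component for binary labels. Keep in mind that the UMAP panels are supervised, while PCA and t-SNE never see the labels: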

In [ ]:
# Comprehensive comparison plot
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# BBB Dataset (top row)
setup_2d_subplot(axes[0, 0], X_bbb_pca_2d, y_bbb_train, "BBB - PCA")
setup_2d_subplot(axes[0, 1], X_bbb_tsne, y_bbb_train, "BBB - t-SNE")
setup_2d_subplot(axes[0, 2], X_bbb_umap, y_bbb_train, "BBB - UMAP")

# Breast Cancer Dataset (bottom row)
setup_2d_subplot(axes[1, 0], X_bc_pca_2d, y_bc_train, "BC - PCA")
setup_2d_subplot(axes[1, 1], X_bc_tsne, y_bc_train, "BC - t-SNE")
setup_2d_subplot(axes[1, 2], X_bc_umap, y_bc_train, "BC - UMAP")

plt.suptitle("Dimensionality Reduction Techniques Comparison", fontsize=20)
plt.tight_layout()
plt.show()
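The visual comparison can be backed by a number. `silhouette_score` is imported at the top of the notebook, and one possible use is to score each 2D embedding with the true class labels as the grouping. The sketch below demonstrates this on synthetic embeddings (an assumption), one well separated and one overlapping.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
# Synthetic 2D "embeddings" (assumption): one well separated, one overlapping
separated = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                       rng.normal(6.0, 1.0, size=(100, 2))])
overlapping = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                         rng.normal(0.5, 1.0, size=(100, 2))])

# Using the true class labels as the "clusters" turns the silhouette into a
# separability score in [-1, 1]; values near 0 indicate heavy class overlap.
score_separated = silhouette_score(separated, labels)
score_overlapping = silhouette_score(overlapping, labels)
print(f"separated: {score_separated:.2f}, overlapping: {score_overlapping:.2f}")
```

In the notebook this would look like `silhouette_score(X_bbb_pca_2d, y_bbb_train)`. Scores for an embedding that was fitted with the labels will be optimistic, since the labels influenced the layout.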

## 8. Apply Optimal PCA for BBB Dataset

Based on the variance analysis, we'll apply PCA to reduce the BBB dataset to the optimal number of components:

In [ ]:
# Apply optimal PCA to BBB dataset for active learning
print(f"Applying PCA with {n_components_bbb} components to BBB dataset...")

# Use the optimal number of components identified earlier
pca_bbb_optimal = PCA(n_components=n_components_bbb)
X_bbb_train_pca = pca_bbb_optimal.fit_transform(X_bbb_train)
X_bbb_test_pca = pca_bbb_optimal.transform(X_bbb_test)

print(f"PCA transformation completed:")
print(f"  Original BBB train shape: {X_bbb_train.shape}")
print(f"  PCA BBB train shape: {X_bbb_train_pca.shape}")
print(f"  Original BBB test shape: {X_bbb_test.shape}")
print(f"  PCA BBB test shape: {X_bbb_test_pca.shape}")
print(f"  Explained variance: {sum(pca_bbb_optimal.explained_variance_ratio_):.4f}")

# Save the PCA-transformed data for active learning experiments
np.save(f"{processed_dir}/X_bbb_train_pca.npy", X_bbb_train_pca)
np.save(f"{processed_dir}/X_bbb_test_pca.npy", X_bbb_test_pca)

print(f"PCA-transformed BBB data saved to {processed_dir}/")

# For comparison: Breast Cancer dataset doesn't need PCA (only 30 features)
print(f"\nBreast Cancer dataset: {X_bc_train.shape[1]} features - no PCA needed")
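As an alternative to computing the component count by hand, scikit-learn's `PCA` accepts a float in (0, 1) for `n_components` and then keeps enough components to explain that fraction of the variance. The sketch below shows this on synthetic data (an assumption) together with the fit-on-train, transform-on-test pattern used above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumption): 50 features driven by 8 latent factors
latent = rng.normal(size=(300, 8))
X = latent @ rng.normal(size=(8, 50)) + 0.5 * rng.normal(size=(300, 50))
X_train, X_test = X[:240], X[240:]

# A float in (0, 1) asks PCA to keep enough components to explain that
# fraction of the variance; the full SVD solver is requested explicitly.
pca = PCA(n_components=0.90, svd_solver="full")
X_train_pca = pca.fit_transform(X_train)   # fit on the training split only
X_test_pca = pca.transform(X_test)         # reuse the fitted projection

retained = pca.explained_variance_ratio_.sum()
print(pca.n_components_, f"{retained:.3f}")
```

Fitting on the training split only keeps information from the test set out of the projection, which matters for the later active learning evaluation.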

## Summary

**Dimensionality Reduction Analysis Complete!**

### Key Findings:

#### BBB Dataset:
- **Original dimensions**: 500+ molecular features
- **PCA optimal**: 90% of the variance is retained with roughly 100-150 components (the exact count is `n_components_bbb`, computed in Section 3)
- **Visualization**: Class separation moderate in 2D projections
- **Recommendation**: Use PCA-reduced features for active learning

#### Breast Cancer Dataset:
- **Original dimensions**: 30 clinical features
- **PCA variance**: First 2 components explain ~63% variance
- **Visualization**: Strong class separation visible in all methods (the UMAP embedding is supervised, so its separation is partly label-driven)
- **Recommendation**: Use original features (no dimensionality reduction needed)

### Method Comparison:
1. **PCA**: Linear, fast, preserves global structure, good for feature reduction
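2. **t-SNE**: Non-linear, preserves local neighborhoods well; cluster sizes and inter-cluster distances are not reliable, and apparent clusters can be artifacts
3. **UMAP**: Non-linear, typically faster than t-SNE, often better global structure preservation; run here in supervised mode
4. **LDA**: Supervised, maximizes class separation, limited to n_classes-1 dimensions (one component for binary labels)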

### For Active Learning:
- BBB dataset: Use PCA-reduced features for efficiency while retaining 90% variance
- Breast Cancer: Use original 30 features directly
- Both datasets show sufficient class separability for active learning experiments

**Next Step**: Active learning experiments using optimally processed data