# Feature Decomposition Methods for RNA-Seq Data

## Introduction

This notebook demonstrates four popular dimensionality reduction techniques used in bioinformatics:

1. **PCA (Principal Component Analysis)** - Linear method for finding directions of maximum variance
2. **ICA (Independent Component Analysis)** - Finds statistically independent components
3. **t-SNE (t-Distributed Stochastic Neighbor Embedding)** - Nonlinear method for visualization
4. **UMAP (Uniform Manifold Approximation and Projection)** - Fast nonlinear method that preserves local structure and often retains more global structure than t-SNE

We'll use simulated RNA-Seq data representing different cell types and conditions.

---

In [None]:
# Install Libraries
# ! pip install umap-learn

In [None]:
# Attempted fix for the umap-learn import hang (did not resolve it)
# pip install --upgrade umap-learn numba

In [None]:
## Fix for the umap-learn import hang
#  Create a conda environment from the terminal before starting the notebook:
# conda create -n umap_env python=3.11 anaconda
# conda activate umap_env
# conda install -c conda-forge umap-learn

In [None]:
## Notes ##
# The conda-forge version of umap-learn typically doesn't have the threading issues that pip versions sometimes encounter on macOS.

## Setup: Import Libraries

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Dimensionality reduction methods
from sklearn.decomposition import PCA, FastICA
from sklearn.manifold import TSNE

# Confirm that the imports above succeeded (UMAP is imported separately in the next cell)
print("Importing core libraries... ✓")
print("Importing sklearn methods... ✓")

In [None]:

try:
    print("Importing UMAP (this may take 10-30 seconds on first run)...")
    import umap
    from umap import UMAP
    UMAP_AVAILABLE = True
    print("UMAP imported successfully! ✓")
except ImportError:
    print("WARNING: UMAP not available. Install with: pip install umap-learn")
    print("Continuing without UMAP...")
    UMAP_AVAILABLE = False


In [None]:
# Preprocessing
from sklearn.preprocessing import StandardScaler

# Set random seed for reproducibility
np.random.seed(42)

# Plotting settings
try:
    plt.style.use('seaborn-v0_8-darkgrid')
except OSError:
    # Matplotlib releases before 3.6 only know the old style name
    plt.style.use('seaborn-darkgrid')
    
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

print("\n" + "="*60)
print("✓ Setup complete!")
print("="*60)
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"UMAP available: {UMAP_AVAILABLE}")
print("="*60)

---
## Part 1: Generate Simulated RNA-Seq Data

We'll simulate RNA-Seq count data for four cell types:
- **T-cells**: High immune response genes
- **B-cells**: High antibody production genes
- **Neurons**: High neurotransmitter genes
- **Hepatocytes**: High metabolism genes

Each cell type will have characteristic gene expression patterns.
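
The generator in the next cell draws counts from a negative binomial distribution. As a minimal sketch of why that is a common choice for count data, the snippet below checks the mean and variance of NumPy's `negative_binomial(n, p)` parameterization against the closed-form values `n(1-p)/p` and `n(1-p)/p**2`, using the baseline parameters (`n=5`, `p=0.1`). The sample size and variable names are illustrative assumptions, not part of the pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; leaves the notebook's global seed alone

n, p = 5, 0.1  # baseline parameters used by generate_rnaseq_data
draws = rng.negative_binomial(n, p, size=100_000)

expected_mean = n * (1 - p) / p      # closed-form mean
expected_var = n * (1 - p) / p**2    # closed-form variance

print(f"mean: sample={draws.mean():.1f}, theory={expected_mean:.1f}")
print(f"var:  sample={draws.var():.1f}, theory={expected_var:.1f}")
print(f"variance / mean = {draws.var() / draws.mean():.1f} (a Poisson would give 1)")
```

The variance is roughly ten times the mean here, which is the overdispersion that a Poisson model cannot represent.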

In [None]:
def generate_rnaseq_data(n_samples_per_type=50, n_genes=2000, noise_level=0.3):
    """
    Generate simulated RNA-Seq count data for different cell types.
    
    Parameters:
    -----------
    n_samples_per_type : int
        Number of samples per cell type
    n_genes : int
        Total number of genes to simulate
    noise_level : float
        Scale of the added Poisson noise; each entry receives extra counts
        drawn from Poisson(noise_level * count)
    
    Returns:
    --------
    counts_df : DataFrame
        Gene expression counts (samples x genes)
    metadata_df : DataFrame
        Sample metadata with cell type labels
    """
    
    # Define cell types
    cell_types = ['T-cell', 'B-cell', 'Neuron', 'Hepatocyte']
    n_types = len(cell_types)
    n_samples = n_samples_per_type * n_types
    
    # Gene categories (genes per category)
    genes_per_category = n_genes // 5
    
    # Initialize count matrix
    counts = np.zeros((n_samples, n_genes))
    
    # Base expression (all cell types have some baseline expression)
    baseline = np.random.negative_binomial(n=5, p=0.1, size=(n_samples, n_genes))
    counts += baseline
    
    # Define gene sets with characteristic expression patterns
    # Category 1: Immune genes (high in T-cells and B-cells)
    immune_genes = slice(0, genes_per_category)
    
    # Category 2: Antibody genes (very high in B-cells)
    antibody_genes = slice(genes_per_category, 2*genes_per_category)
    
    # Category 3: Neural genes (high in Neurons)
    neural_genes = slice(2*genes_per_category, 3*genes_per_category)
    
    # Category 4: Metabolic genes (high in Hepatocytes)
    metabolic_genes = slice(3*genes_per_category, 4*genes_per_category)
    
    # Category 5: Housekeeping genes (expressed in all)
    housekeeping_genes = slice(4*genes_per_category, n_genes)
    
    # Add cell-type specific expression
    for i, cell_type in enumerate(cell_types):
        start_idx = i * n_samples_per_type
        end_idx = (i + 1) * n_samples_per_type
        
        if cell_type == 'T-cell':
            # High immune gene expression
            counts[start_idx:end_idx, immune_genes] += np.random.negative_binomial(
                n=20, p=0.2, size=(n_samples_per_type, genes_per_category)
            )
            # Moderate antibody genes
            counts[start_idx:end_idx, antibody_genes] += np.random.negative_binomial(
                n=10, p=0.3, size=(n_samples_per_type, genes_per_category)
            )
            
        elif cell_type == 'B-cell':
            # High immune gene expression
            counts[start_idx:end_idx, immune_genes] += np.random.negative_binomial(
                n=15, p=0.2, size=(n_samples_per_type, genes_per_category)
            )
            # Very high antibody genes
            counts[start_idx:end_idx, antibody_genes] += np.random.negative_binomial(
                n=30, p=0.15, size=(n_samples_per_type, genes_per_category)
            )
            
        elif cell_type == 'Neuron':
            # Very high neural gene expression
            counts[start_idx:end_idx, neural_genes] += np.random.negative_binomial(
                n=25, p=0.15, size=(n_samples_per_type, genes_per_category)
            )
            # Some metabolic genes
            counts[start_idx:end_idx, metabolic_genes] += np.random.negative_binomial(
                n=8, p=0.3, size=(n_samples_per_type, genes_per_category)
            )
            
        elif cell_type == 'Hepatocyte':
            # Very high metabolic gene expression
            counts[start_idx:end_idx, metabolic_genes] += np.random.negative_binomial(
                n=35, p=0.1, size=(n_samples_per_type, genes_per_category)
            )
    
    # Add housekeeping genes (moderate expression in all cells)
    counts[:, housekeeping_genes] += np.random.negative_binomial(
        n=12, p=0.25, size=(n_samples, n_genes - 4*genes_per_category)
    )
    
    # Add count-dependent Poisson noise (mean = noise_level * counts)
    noise = np.random.poisson(
        lam=noise_level * counts
    )
    counts = counts + noise
    
    # Create gene names
    gene_names = []
    gene_names += [f"IMMUNE_{i+1:03d}" for i in range(genes_per_category)]
    gene_names += [f"ANTIBODY_{i+1:03d}" for i in range(genes_per_category)]
    gene_names += [f"NEURAL_{i+1:03d}" for i in range(genes_per_category)]
    gene_names += [f"METABOLIC_{i+1:03d}" for i in range(genes_per_category)]
    gene_names += [f"HOUSEKEEP_{i+1:03d}" for i in range(n_genes - 4*genes_per_category)]
    
    # Create sample names and metadata
    sample_names = []
    cell_type_labels = []
    
    for i, cell_type in enumerate(cell_types):
        for j in range(n_samples_per_type):
            sample_names.append(f"{cell_type}_{j+1:02d}")
            cell_type_labels.append(cell_type)
    
    # Create DataFrames
    counts_df = pd.DataFrame(
        counts,
        index=sample_names,
        columns=gene_names
    )
    
    metadata_df = pd.DataFrame({
        'sample_id': sample_names,
        'cell_type': cell_type_labels,
        # Random batch labels for illustration only; no batch effect is added to the counts
        'batch': np.random.choice(['Batch1', 'Batch2'], size=n_samples)
    })
    
    return counts_df, metadata_df

# Generate data
print("Generating simulated RNA-Seq data...")
counts_df, metadata_df = generate_rnaseq_data(
    n_samples_per_type=50,
    n_genes=2000,
    noise_level=0.3
)

print(f"\nData shape: {counts_df.shape}")
print(f"Samples: {counts_df.shape[0]}")
print(f"Genes: {counts_df.shape[1]}")
print(f"\nCell type distribution:")
print(metadata_df['cell_type'].value_counts())
print(f"\nFirst few samples:")
display(counts_df.iloc[:5, :10])

### Visualize Raw Data Distribution

In [None]:
# Plot library size distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Total counts per sample
library_sizes = counts_df.sum(axis=1)
axes[0].hist(library_sizes, bins=30, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Total Read Count per Sample', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Library Size Distribution', fontsize=14, fontweight='bold')
axes[0].axvline(library_sizes.mean(), color='red', linestyle='--', 
                label=f'Mean: {library_sizes.mean():.0f}')
axes[0].legend()

# Mean expression per gene
mean_expression = counts_df.mean(axis=0)
axes[1].hist(np.log10(mean_expression + 1), bins=50, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('log10(Mean Expression + 1)', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Gene Expression Distribution', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print(f"Library size range: {library_sizes.min():.0f} - {library_sizes.max():.0f}")
print(f"Mean library size: {library_sizes.mean():.0f} ± {library_sizes.std():.0f}")

---
## Part 2: Data Preprocessing

Before applying dimensionality reduction, we need to preprocess RNA-Seq data:

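1. **Log transformation**: RNA-Seq counts are right-skewed and overdispersed, with variance growing with the mean; a log transform compresses the dynamic range and stabilizes the variance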
2. **Feature selection**: Select highly variable genes
3. **Standardization**: Scale features for PCA/ICA
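
To illustrate why the log transformation comes first, here is a small sketch with made-up parameters. It simulates one low-expression and one high-expression gene with the same dispersion and compares their variances before and after `log2(counts + 1)`. The two genes and their parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical genes with the same dispersion (n=5) but ~10x different means
low = rng.negative_binomial(5, 0.1, size=50_000)    # mean ~45
high = rng.negative_binomial(5, 0.01, size=50_000)  # mean ~495

raw_ratio = high.var() / low.var()
log_ratio = np.log2(high + 1).var() / np.log2(low + 1).var()

print(f"raw variance ratio (high/low): {raw_ratio:.1f}")
print(f"log variance ratio (high/low): {log_ratio:.2f}")
```

On the raw scale the highly expressed gene dominates the variance; after the log transform the two genes contribute on a comparable scale, so a variance-based method such as PCA is not driven by expression level alone.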

In [None]:
def preprocess_rnaseq(counts_df, n_top_genes=500):
    """
    Preprocess RNA-Seq data for dimensionality reduction.
    
    Steps:
    1. Log transformation: log2(counts + 1)
    2. Select highly variable genes
    3. Standardize (z-score normalization)
    """
    print("Preprocessing RNA-Seq data...")
    print("="*60)
    
    # Step 1: Log transformation
    log_counts = np.log2(counts_df + 1)
    print(f"1. Log transformation applied: log2(counts + 1)")
    
    # Step 2: Select highly variable genes
    # Rank genes by the variance-to-mean ratio (index of dispersion) of the
    # log counts. Note: this is not the coefficient of variation (std / mean).
    gene_means = log_counts.mean(axis=0)
    gene_vars = log_counts.var(axis=0)
    gene_dispersion = gene_vars / (gene_means + 1e-10)  # Avoid division by zero
    
    # Select top N most variable genes
    top_genes = gene_dispersion.nlargest(n_top_genes).index
    log_counts_hvg = log_counts[top_genes]
    
    print(f"2. Selected {n_top_genes} highly variable genes")
    print(f"   Original genes: {counts_df.shape[1]}")
    print(f"   Selected genes: {log_counts_hvg.shape[1]}")
    
    # Step 3: Standardization (z-score)
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(log_counts_hvg)
    data_scaled_df = pd.DataFrame(
        data_scaled,
        index=log_counts_hvg.index,
        columns=log_counts_hvg.columns
    )
    
    print(f"3. Standardization applied (mean=0, std=1)")
    print(f"   Mean: {data_scaled.mean():.2e}")
    print(f"   Std: {data_scaled.std():.2f}")
    print("="*60)
    
    return data_scaled_df, log_counts_hvg

# Preprocess data
data_scaled, log_counts_hvg = preprocess_rnaseq(counts_df, n_top_genes=500)

print(f"\nFinal preprocessed data shape: {data_scaled.shape}")
print(f"Ready for dimensionality reduction!")

---
## Part 3: Principal Component Analysis (PCA)

### What is PCA?

**Principal Component Analysis (PCA)** is a linear dimensionality reduction technique that:
- Finds orthogonal directions (principal components) of maximum variance
- Projects data onto these components
- Is fast and, with an exact SVD solver, deterministic

### When to use PCA:
- **Fast exploration** of high-dimensional data
- **Linear relationships** are important
- Need **interpretable components** (loadings)
- Want to **preserve global structure**

### Advantages:
- Fast and scalable
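- Deterministic with an exact SVD solver (components are defined only up to sign)
- Interpretable (loadings show gene contributions)
- Largely preserves global distances when the leading components capture most of the variance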

### Disadvantages:
- Linear method (may miss nonlinear patterns)
- Assumes variance = importance
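
As a minimal sketch of what PCA computes, the snippet below reproduces the explained variance ratios and the projected scores with a plain SVD of the centered data, then compares them to scikit-learn's `PCA`. The toy matrix `X_toy_pca` and the name `pca_check` are illustrative and not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy correlated data: 100 samples, 5 features
X_toy_pca = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

# PCA by hand: center the data, then take the SVD
Xc = X_toy_pca - X_toy_pca.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
manual_ratio = S**2 / np.sum(S**2)   # explained variance ratio per component
manual_scores = Xc @ Vt.T            # projections onto the principal axes

pca_check = PCA(n_components=5, svd_solver="full").fit(X_toy_pca)
sk_scores = pca_check.transform(X_toy_pca)

ratios_match = np.allclose(manual_ratio, pca_check.explained_variance_ratio_)
# Each component's sign is arbitrary, so compare absolute values
scores_match = np.allclose(np.abs(manual_scores), np.abs(sk_scores), atol=1e-6)
print(f"explained variance ratios match: {ratios_match}")
print(f"scores match up to sign:         {scores_match}")
```

The rows of `Vt` play the role of `pca.components_` (the loadings), which is why the loadings can be read as per-gene contributions to each component.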

In [None]:
# Apply PCA
print("Applying PCA...")
pca = PCA(n_components=50, random_state=42)
pca_result = pca.fit_transform(data_scaled)

# Create DataFrame with PCA results
pca_df = pd.DataFrame(
    pca_result[:, :10],  # First 10 components
    columns=[f'PC{i+1}' for i in range(10)],
    index=data_scaled.index
)
pca_df['cell_type'] = metadata_df['cell_type'].values

print(f"PCA completed!")
print(f"Shape: {pca_result.shape}")
print(f"\nVariance explained by first 10 components:")
for i in range(10):
    print(f"  PC{i+1}: {pca.explained_variance_ratio_[i]*100:.2f}%")
print(f"\nTotal variance explained (first 10 PCs): {pca.explained_variance_ratio_[:10].sum()*100:.2f}%")

### Visualize PCA Results

In [None]:
# Create comprehensive PCA visualization
fig = plt.figure(figsize=(16, 10))
gs = fig.add_gridspec(2, 3, hspace=0.3, wspace=0.3)

# 1. Scree plot (explained variance)
ax1 = fig.add_subplot(gs[0, 0])
ax1.plot(range(1, 21), pca.explained_variance_ratio_[:20], 'bo-', linewidth=2, markersize=8)
ax1.set_xlabel('Principal Component', fontsize=12)
ax1.set_ylabel('Explained Variance Ratio', fontsize=12)
ax1.set_title('Scree Plot', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

# 2. Cumulative variance explained
ax2 = fig.add_subplot(gs[0, 1])
cumsum = np.cumsum(pca.explained_variance_ratio_[:20])
ax2.plot(range(1, 21), cumsum, 'ro-', linewidth=2, markersize=8)
ax2.axhline(y=0.9, color='green', linestyle='--', label='90% variance')
ax2.set_xlabel('Number of Components', fontsize=12)
ax2.set_ylabel('Cumulative Variance Explained', fontsize=12)
ax2.set_title('Cumulative Variance', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

# 3. PC1 vs PC2 scatter plot
ax3 = fig.add_subplot(gs[0, 2])
cell_types = pca_df['cell_type'].unique()
colors = plt.cm.Set2(np.linspace(0, 1, len(cell_types)))

for cell_type, color in zip(cell_types, colors):
    mask = pca_df['cell_type'] == cell_type
    ax3.scatter(pca_df.loc[mask, 'PC1'], pca_df.loc[mask, 'PC2'],
               label=cell_type, alpha=0.6, s=80, color=color, edgecolors='black', linewidth=0.5)

ax3.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)', fontsize=12)
ax3.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)', fontsize=12)
ax3.set_title('PCA: PC1 vs PC2', fontsize=14, fontweight='bold')
ax3.legend(loc='best', frameon=True, shadow=True)
ax3.grid(True, alpha=0.3)

# 4. PC1 vs PC3 scatter plot
ax4 = fig.add_subplot(gs[1, 0])
for cell_type, color in zip(cell_types, colors):
    mask = pca_df['cell_type'] == cell_type
    ax4.scatter(pca_df.loc[mask, 'PC1'], pca_df.loc[mask, 'PC3'],
               label=cell_type, alpha=0.6, s=80, color=color, edgecolors='black', linewidth=0.5)

ax4.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)', fontsize=12)
ax4.set_ylabel(f'PC3 ({pca.explained_variance_ratio_[2]*100:.1f}%)', fontsize=12)
ax4.set_title('PCA: PC1 vs PC3', fontsize=14, fontweight='bold')
ax4.legend(loc='best', frameon=True, shadow=True)
ax4.grid(True, alpha=0.3)

# 5. PC2 vs PC3 scatter plot
ax5 = fig.add_subplot(gs[1, 1])
for cell_type, color in zip(cell_types, colors):
    mask = pca_df['cell_type'] == cell_type
    ax5.scatter(pca_df.loc[mask, 'PC2'], pca_df.loc[mask, 'PC3'],
               label=cell_type, alpha=0.6, s=80, color=color, edgecolors='black', linewidth=0.5)

ax5.set_xlabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)', fontsize=12)
ax5.set_ylabel(f'PC3 ({pca.explained_variance_ratio_[2]*100:.1f}%)', fontsize=12)
ax5.set_title('PCA: PC2 vs PC3', fontsize=14, fontweight='bold')
ax5.legend(loc='best', frameon=True, shadow=True)
ax5.grid(True, alpha=0.3)

# 6. Top gene loadings for PC1
ax6 = fig.add_subplot(gs[1, 2])
loadings_pc1 = pd.Series(pca.components_[0], index=data_scaled.columns)
top_genes_pc1 = loadings_pc1.abs().nlargest(15)
top_loadings = loadings_pc1[top_genes_pc1.index].sort_values()

colors_bar = ['red' if x < 0 else 'blue' for x in top_loadings.values]
ax6.barh(range(len(top_loadings)), top_loadings.values, color=colors_bar, alpha=0.7, edgecolor='black')
ax6.set_yticks(range(len(top_loadings)))
ax6.set_yticklabels(top_loadings.index, fontsize=9)
ax6.set_xlabel('Loading Value', fontsize=12)
ax6.set_title('Top 15 Genes Contributing to PC1', fontsize=14, fontweight='bold')
ax6.axvline(x=0, color='black', linestyle='--', linewidth=1)
ax6.grid(True, alpha=0.3, axis='x')

plt.suptitle('Principal Component Analysis (PCA) - Comprehensive View', 
             fontsize=16, fontweight='bold', y=0.995)
plt.show()

print(f"\nPCA Interpretation:")
print(f"- Cell types form distinct clusters in PC space")
print(f"- First 3 PCs capture {cumsum[2]*100:.1f}% of total variance")
print(f"- Gene loadings show which genes drive each PC")

---
## Part 4: Independent Component Analysis (ICA)

### What is ICA?

**Independent Component Analysis (ICA)** is a computational method that:
- Separates a multivariate signal into independent non-Gaussian signals
- Assumes observed data is a linear mixture of independent sources
- Maximizes statistical independence (not just orthogonality like PCA)

### When to use ICA:
- **Identifying biological processes**: Gene programs, regulatory modules
- **Signal separation**: Separate technical and biological variation
- **Finding hidden factors**: Independent biological processes

### ICA vs PCA:
- **PCA**: Finds orthogonal components with maximum variance
- **ICA**: Finds statistically independent components

### Advantages:
- Captures non-Gaussian structure
- Identifies independent biological processes
- Can separate overlapping signals

### Disadvantages:
- Stochastic initialization (runs can differ unless `random_state` is fixed)
- Harder to interpret than PCA
- No inherent ordering of components
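
The mixing model behind ICA can be sketched on a toy problem: two non-Gaussian sources are mixed linearly, and `FastICA` is asked to unmix them. The sources, the mixing matrix `A_mix`, and all names below are hypothetical choices for illustration; this is not the notebook's analysis.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two hypothetical non-Gaussian sources
s1 = np.sign(np.sin(3 * t))      # square wave (sub-Gaussian)
s2 = rng.laplace(size=t.size)    # heavy-tailed noise (super-Gaussian)
S_true = np.c_[s1, s2]

A_mix = np.array([[1.0, 0.5], [0.4, 1.0]])  # arbitrary mixing matrix
X_mixed = S_true @ A_mix.T                  # observed signals = linear mixtures

ica_demo = FastICA(n_components=2, random_state=0, max_iter=1000)
S_est = ica_demo.fit_transform(X_mixed)

# Sources are recovered only up to order, sign, and scale,
# so compare true and estimated sources via absolute correlations.
corr = np.abs(np.corrcoef(S_true.T, S_est.T)[:2, 2:])
print(np.round(corr, 2))
```

Each row of `corr` should contain one entry close to 1, showing that every true source is matched by one estimated component. The arbitrary order and sign are the reason ICA components have no inherent ranking.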

In [None]:
# Apply ICA
print("Applying ICA...")
ica = FastICA(n_components=10, random_state=42, max_iter=1000, tol=0.001)
ica_result = ica.fit_transform(data_scaled)

# Create DataFrame with ICA results
ica_df = pd.DataFrame(
    ica_result,
    columns=[f'IC{i+1}' for i in range(10)],
    index=data_scaled.index
)
ica_df['cell_type'] = metadata_df['cell_type'].values

print(f"ICA completed!")
print(f"Shape: {ica_result.shape}")
print(f"Components shape: {ica.components_.shape}")

# Calculate component statistics
print(f"\nComponent statistics:")
for i in range(5):
    print(f"  IC{i+1}: mean={ica_result[:, i].mean():.3f}, "
          f"std={ica_result[:, i].std():.3f}, "
          f"kurtosis={stats.kurtosis(ica_result[:, i]):.3f}")

### Visualize ICA Results

In [None]:
# Create comprehensive ICA visualization
fig = plt.figure(figsize=(16, 10))
gs = fig.add_gridspec(2, 3, hspace=0.3, wspace=0.3)

# 1. IC1 vs IC2 scatter plot
ax1 = fig.add_subplot(gs[0, 0])
for cell_type, color in zip(cell_types, colors):
    mask = ica_df['cell_type'] == cell_type
    ax1.scatter(ica_df.loc[mask, 'IC1'], ica_df.loc[mask, 'IC2'],
               label=cell_type, alpha=0.6, s=80, color=color, edgecolors='black', linewidth=0.5)

ax1.set_xlabel('IC1', fontsize=12)
ax1.set_ylabel('IC2', fontsize=12)
ax1.set_title('ICA: IC1 vs IC2', fontsize=14, fontweight='bold')
ax1.legend(loc='best', frameon=True, shadow=True)
ax1.grid(True, alpha=0.3)

# 2. IC1 vs IC3 scatter plot
ax2 = fig.add_subplot(gs[0, 1])
for cell_type, color in zip(cell_types, colors):
    mask = ica_df['cell_type'] == cell_type
    ax2.scatter(ica_df.loc[mask, 'IC1'], ica_df.loc[mask, 'IC3'],
               label=cell_type, alpha=0.6, s=80, color=color, edgecolors='black', linewidth=0.5)

ax2.set_xlabel('IC1', fontsize=12)
ax2.set_ylabel('IC3', fontsize=12)
ax2.set_title('ICA: IC1 vs IC3', fontsize=14, fontweight='bold')
ax2.legend(loc='best', frameon=True, shadow=True)
ax2.grid(True, alpha=0.3)

# 3. IC2 vs IC3 scatter plot
ax3 = fig.add_subplot(gs[0, 2])
for cell_type, color in zip(cell_types, colors):
    mask = ica_df['cell_type'] == cell_type
    ax3.scatter(ica_df.loc[mask, 'IC2'], ica_df.loc[mask, 'IC3'],
               label=cell_type, alpha=0.6, s=80, color=color, edgecolors='black', linewidth=0.5)

ax3.set_xlabel('IC2', fontsize=12)
ax3.set_ylabel('IC3', fontsize=12)
ax3.set_title('ICA: IC2 vs IC3', fontsize=14, fontweight='bold')
ax3.legend(loc='best', frameon=True, shadow=True)
ax3.grid(True, alpha=0.3)

# 4. Component distributions (kurtosis)
ax4 = fig.add_subplot(gs[1, 0])
kurtosis_values = [stats.kurtosis(ica_result[:, i]) for i in range(10)]
ax4.bar(range(1, 11), kurtosis_values, color='steelblue', alpha=0.7, edgecolor='black')
ax4.set_xlabel('Independent Component', fontsize=12)
ax4.set_ylabel('Kurtosis', fontsize=12)
ax4.set_title('Component Kurtosis (Non-Gaussianity)', fontsize=14, fontweight='bold')
ax4.axhline(y=0, color='red', linestyle='--', label='Gaussian')
ax4.legend()
ax4.grid(True, alpha=0.3, axis='y')

# 5. Top gene weights for IC1
ax5 = fig.add_subplot(gs[1, 1])
weights_ic1 = pd.Series(ica.components_[0], index=data_scaled.columns)
top_genes_ic1 = weights_ic1.abs().nlargest(15)
top_weights = weights_ic1[top_genes_ic1.index].sort_values()

colors_bar = ['red' if x < 0 else 'blue' for x in top_weights.values]
ax5.barh(range(len(top_weights)), top_weights.values, color=colors_bar, alpha=0.7, edgecolor='black')
ax5.set_yticks(range(len(top_weights)))
ax5.set_yticklabels(top_weights.index, fontsize=9)
ax5.set_xlabel('Component Weight', fontsize=12)
ax5.set_title('Top 15 Genes in IC1', fontsize=14, fontweight='bold')
ax5.axvline(x=0, color='black', linestyle='--', linewidth=1)
ax5.grid(True, alpha=0.3, axis='x')

# 6. Comparison: PCA vs ICA (first component of each)
ax6 = fig.add_subplot(gs[1, 2])
ax6.scatter(pca_result[:, 0], ica_result[:, 0], alpha=0.5, s=50, c='purple', edgecolors='black', linewidth=0.5)
ax6.set_xlabel('PC1 (PCA)', fontsize=12)
ax6.set_ylabel('IC1 (ICA)', fontsize=12)
ax6.set_title('PCA vs ICA: First Components', fontsize=14, fontweight='bold')
ax6.grid(True, alpha=0.3)

# Add correlation
correlation = np.corrcoef(pca_result[:, 0], ica_result[:, 0])[0, 1]
ax6.text(0.05, 0.95, f'Correlation: {correlation:.3f}',
        transform=ax6.transAxes, fontsize=11,
        verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.suptitle('Independent Component Analysis (ICA) - Comprehensive View', 
             fontsize=16, fontweight='bold', y=0.995)
plt.show()

print(f"\nICA Interpretation:")
print(f"- ICA identifies statistically independent sources")
print(f"- Components with large |excess kurtosis| are more non-Gaussian (positive: super-Gaussian, negative: sub-Gaussian)")
print(f"- Gene weights reveal independent biological processes")

---
## Part 5: t-SNE (t-Distributed Stochastic Neighbor Embedding)

### What is t-SNE?

**t-SNE** is a nonlinear dimensionality reduction technique that:
- Preserves local neighborhood structure
- Converts similarities between points to joint probabilities
- Minimizes divergence between high-D and low-D probability distributions

### When to use t-SNE:
- **Visualization**: Create 2D/3D plots for exploratory analysis
- **Cluster discovery**: Reveal hidden patterns and groupings
- **Publication figures**: Beautiful, informative visualizations

### Key Parameters:
- **perplexity**: Balance between local and global structure (5-50)
- **max_iter** (called `n_iter` in older scikit-learn releases): Number of iterations (1000-5000)
- **learning_rate**: Step size (10-1000)
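
To make the perplexity parameter more concrete, here is a rough sketch of the quantity it refers to. For one point, distances are turned into conditional probabilities with a Gaussian kernel of width `sigma`, and the perplexity is 2 raised to the entropy of that distribution. t-SNE searches for the `sigma` of each point that matches the requested perplexity; the helper names and toy data below are assumptions for illustration, not scikit-learn internals.

```python
import numpy as np

def conditional_probs(X, i, sigma):
    """p_{j|i} from a Gaussian kernel centered on point i (hypothetical helper)."""
    d2 = np.sum((X - X[i]) ** 2, axis=1)
    logits = -d2 / (2 * sigma**2)
    logits[i] = -np.inf                          # a point is not its own neighbor
    logits -= logits[np.isfinite(logits)].max()  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    nz = p[p > 0]
    return 2 ** (-np.sum(nz * np.log2(nz)))

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10))

sigmas = (0.5, 1.0, 2.0, 5.0)
perps = [perplexity(conditional_probs(X_toy, 0, s)) for s in sigmas]
for s, perp in zip(sigmas, perps):
    print(f"sigma={s}: perplexity = {perp:.1f}")
```

A wider kernel spreads probability over more neighbors, so perplexity can be read as an effective neighbor count. It must stay below the number of samples.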

### Advantages:
- Excellent for visualization
- Reveals local structure and clusters
- Handles nonlinear relationships

### Disadvantages:
- Computationally expensive (slow for large datasets)
- Stochastic (different runs give different embeddings unless `random_state` is fixed)
- Distances between clusters are not meaningful
- Cannot be applied to new data (no transform method)
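
The last point can be checked directly. This short sketch compares the estimator interfaces in scikit-learn: `PCA` learns a reusable projection, whereas `TSNE` only exposes `fit` and `fit_transform`. The variable names are illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA can project unseen samples; scikit-learn's TSNE has no transform() method
pca_has_transform = hasattr(PCA(n_components=2), "transform")
tsne_has_transform = hasattr(TSNE(n_components=2), "transform")

print(f"PCA has transform():  {pca_has_transform}")
print(f"TSNE has transform(): {tsne_has_transform}")
```

In practice this means a t-SNE embedding has to be recomputed whenever samples are added.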

In [None]:
# Apply t-SNE with different perplexity values
print("Applying t-SNE with different parameters...")
print("This may take a few minutes...\n")

perplexity_values = [5, 30, 50]
tsne_results = {}

for perp in perplexity_values:
    print(f"Running t-SNE with perplexity={perp}...")
    tsne = TSNE(
        n_components=2,
        perplexity=perp,
        max_iter=1000,  # n_iter was renamed to max_iter in scikit-learn 1.5 (n_iter removed in 1.7)
        random_state=42,
        verbose=0
    )
    tsne_result = tsne.fit_transform(data_scaled)
    tsne_results[perp] = tsne_result
    print(f"  KL divergence: {tsne.kl_divergence_:.4f}")

print("\nt-SNE completed!")

### Visualize t-SNE Results

In [None]:
# Create comprehensive t-SNE visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

# Plot t-SNE results with different perplexities
for idx, (perp, tsne_result) in enumerate(tsne_results.items()):
    ax = axes[idx]
    
    for cell_type, color in zip(cell_types, colors):
        mask = metadata_df['cell_type'] == cell_type
        ax.scatter(tsne_result[mask, 0], tsne_result[mask, 1],
                  label=cell_type, alpha=0.6, s=80, color=color, 
                  edgecolors='black', linewidth=0.5)
    
    ax.set_xlabel('t-SNE 1', fontsize=12)
    ax.set_ylabel('t-SNE 2', fontsize=12)
    ax.set_title(f't-SNE (perplexity={perp})', fontsize=14, fontweight='bold')
    ax.legend(loc='best', frameon=True, shadow=True)
    ax.grid(True, alpha=0.3)

# Compare with PCA (for reference)
ax = axes[3]
for cell_type, color in zip(cell_types, colors):
    mask = pca_df['cell_type'] == cell_type
    ax.scatter(pca_df.loc[mask, 'PC1'], pca_df.loc[mask, 'PC2'],
              label=cell_type, alpha=0.6, s=80, color=color, 
              edgecolors='black', linewidth=0.5)

ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)', fontsize=12)
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)', fontsize=12)
ax.set_title('PCA (for comparison)', fontsize=14, fontweight='bold')
ax.legend(loc='best', frameon=True, shadow=True)
ax.grid(True, alpha=0.3)

# Emphasized t-SNE plot (perplexity=30, larger markers)
ax = axes[4]
tsne_result = tsne_results[30]

for cell_type, color in zip(cell_types, colors):
    mask = metadata_df['cell_type'] == cell_type
    ax.scatter(tsne_result[mask, 0], tsne_result[mask, 1],
              label=cell_type, alpha=0.7, s=120, color=color, 
              edgecolors='black', linewidth=0.8)

ax.set_xlabel('t-SNE 1', fontsize=14, fontweight='bold')
ax.set_ylabel('t-SNE 2', fontsize=14, fontweight='bold')
ax.set_title('t-SNE (perplexity=30, scikit-learn default)', fontsize=16, fontweight='bold')
ax.legend(loc='best', frameon=True, shadow=True, fontsize=11)
ax.grid(True, alpha=0.3)

# Effect of perplexity
ax = axes[5]
perplexities = list(tsne_results.keys())
ax.bar(range(len(perplexities)), perplexities, color='steelblue', alpha=0.7, edgecolor='black')
ax.set_xticks(range(len(perplexities)))
ax.set_xticklabels([f'Perp={p}' for p in perplexities])
ax.set_ylabel('Perplexity Value', fontsize=12)
ax.set_title('Perplexity Parameter Effect', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')

# Add text explanation
ax.text(0.5, 0.5, 'Perplexity controls\nlocal vs global structure\n\nLow: focuses on local\nHigh: preserves global',
       transform=ax.transAxes, fontsize=11, ha='center', va='center',
       bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.suptitle('t-SNE Analysis - Effect of Parameters', 
             fontsize=18, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

print(f"\nt-SNE Interpretation:")
print(f"- Perplexity affects cluster separation and structure")
print(f"- Low perplexity: emphasizes local neighborhoods (may fragment clusters)")
print(f"- High perplexity: preserves more global structure")
print(f"- t-SNE excels at revealing cluster structure for visualization")

---
## Part 6: UMAP (Uniform Manifold Approximation and Projection)

### What is UMAP?

**UMAP** is a modern nonlinear dimensionality reduction technique that:
- Preserves local structure and often retains more global structure than t-SNE
- Uses manifold learning and topological data analysis
- Is typically faster than t-SNE, especially on large datasets
- Can be applied to new data (has transform method)

### When to use UMAP:
- **Fast visualization** of large datasets
- **Better preservation** of global structure than t-SNE
- **Downstream analysis**: Can use reduced dimensions for clustering
- **New data**: Can transform new samples without retraining

### Key Parameters:
- **n_neighbors**: Size of local neighborhood (2-100)
- **min_dist**: Minimum distance between points in the low-dimensional embedding (0.0-0.99)
- **metric**: Distance metric (euclidean, correlation, etc.)
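
UMAP starts from a weighted k-nearest-neighbor graph, and `n_neighbors` sets how many neighbors each point connects to. The sketch below is not UMAP itself: it only builds an unweighted k-neighbor graph with scikit-learn and SciPy (chosen here to avoid the `umap-learn` dependency) and counts connected components on two toy clusters. The data and the helper `n_graph_components` are hypothetical.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
# Two well-separated toy clusters of 30 points each
X_blobs = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(30, 5)),
    rng.normal(loc=50.0, scale=1.0, size=(30, 5)),
])

def n_graph_components(X, k):
    """Number of connected components in the undirected k-nearest-neighbor graph."""
    graph = kneighbors_graph(X, n_neighbors=k, mode="connectivity", include_self=False)
    n_comp, _ = connected_components(graph, directed=False)
    return n_comp

for k in (3, 15, 59):
    print(f"n_neighbors={k:>2}: {n_graph_components(X_blobs, k)} connected component(s)")
```

With few neighbors the graph only links points within a cluster, so nothing ties the clusters together. With many neighbors the graph spans both clusters, which is the sense in which larger `n_neighbors` values emphasize global structure.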

### Advantages:
- Faster than t-SNE
- Preserves global structure better
- Can transform new data
- Scales to larger datasets

### Disadvantages:
- Newer method (less established)
- Still has stochastic elements
- Parameter tuning needed

In [None]:
# Check if UMAP is available
if not UMAP_AVAILABLE:
    print("="*60)
    print("UMAP not available - skipping this section")
    print("To install UMAP, run: pip install umap-learn")
    print("="*60)
else:
    # Apply UMAP with different parameters
    print("Applying UMAP with different parameters...")
    print("UMAP is usually faster than t-SNE on large datasets.\n")
    
    n_neighbors_values = [5, 15, 50]
    umap_results = {}
    
    for n_neighbors in n_neighbors_values:
        print(f"Running UMAP with n_neighbors={n_neighbors}...")
        umap_model = UMAP(
            n_components=2,
            n_neighbors=n_neighbors,
            min_dist=0.1,
            metric='euclidean',
            random_state=42,
            verbose=False
        )
        umap_result = umap_model.fit_transform(data_scaled)
        umap_results[n_neighbors] = umap_result
        print(f"  Completed!")
    
    print("\nUMAP completed!")
    
    # Also create UMAP with different min_dist
    print("\nRunning UMAP with different min_dist values...")
    min_dist_values = [0.0, 0.5, 0.99]
    umap_results_dist = {}
    
    for min_dist in min_dist_values:
        print(f"Running UMAP with min_dist={min_dist}...")
        umap_model = UMAP(
            n_components=2,
            n_neighbors=15,
            min_dist=min_dist,
            metric='euclidean',
            random_state=42,
            verbose=False
        )
        umap_result = umap_model.fit_transform(data_scaled)
        umap_results_dist[min_dist] = umap_result
        print(f"  Completed!")
    
    print("\nAll UMAP runs completed!")

### Visualize UMAP Results

In [None]:
# Check if UMAP is available
if not UMAP_AVAILABLE:
    print("UMAP visualizations skipped (UMAP not installed)")
else:
    # Create comprehensive UMAP visualization
    fig = plt.figure(figsize=(18, 12))
    gs = fig.add_gridspec(2, 3, hspace=0.3, wspace=0.3)
    
    # Plot UMAP results with different n_neighbors
    for idx, (n_neigh, umap_result) in enumerate(umap_results.items()):
        ax = fig.add_subplot(gs[0, idx])
        
        for cell_type, color in zip(cell_types, colors):
            mask = metadata_df['cell_type'] == cell_type
            ax.scatter(umap_result[mask, 0], umap_result[mask, 1],
                      label=cell_type, alpha=0.6, s=80, color=color, 
                      edgecolors='black', linewidth=0.5)
        
        ax.set_xlabel('UMAP 1', fontsize=12)
        ax.set_ylabel('UMAP 2', fontsize=12)
        ax.set_title(f'UMAP (n_neighbors={n_neigh})', fontsize=14, fontweight='bold')
        ax.legend(loc='best', frameon=True, shadow=True)
        ax.grid(True, alpha=0.3)
    
    # Plot UMAP results with different min_dist
    for idx, (min_d, umap_result) in enumerate(umap_results_dist.items()):
        ax = fig.add_subplot(gs[1, idx])
        
        for cell_type, color in zip(cell_types, colors):
            mask = metadata_df['cell_type'] == cell_type
            ax.scatter(umap_result[mask, 0], umap_result[mask, 1],
                      label=cell_type, alpha=0.6, s=80, color=color, 
                      edgecolors='black', linewidth=0.5)
        
        ax.set_xlabel('UMAP 1', fontsize=12)
        ax.set_ylabel('UMAP 2', fontsize=12)
        ax.set_title(f'UMAP (min_dist={min_d})', fontsize=14, fontweight='bold')
        ax.legend(loc='best', frameon=True, shadow=True)
        ax.grid(True, alpha=0.3)
    
    plt.suptitle('UMAP Analysis - Effect of Parameters', 
                 fontsize=18, fontweight='bold', y=0.995)
    plt.show()
    
    # Plain strings: no placeholders, so no f-prefix is needed
    print("\nUMAP Parameter Effects:")
    print("\nn_neighbors (controls local vs global):")
    print("  - Low values: Focus on very local structure")
    print("  - High values: Preserve more global structure")
    print("\nmin_dist (controls point spacing):")
    print("  - 0.0: Tightly packed clusters")
    print("  - 0.5: Moderate spacing")
    print("  - 0.99: Loosely distributed points")

---
## Part 7: Side-by-Side Comparison

Let's compare all four methods directly.

In [None]:
# Create comprehensive comparison plot
if UMAP_AVAILABLE:
    fig, axes = plt.subplots(2, 2, figsize=(16, 14))
else:
    # Only show 3 methods if UMAP not available
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    axes = np.array([axes[0], axes[1], axes[2], None])  # Add None for consistency

# 1. PCA
ax = axes[0, 0] if UMAP_AVAILABLE else axes[0]
for cell_type, color in zip(cell_types, colors):
    mask = pca_df['cell_type'] == cell_type
    ax.scatter(pca_df.loc[mask, 'PC1'], pca_df.loc[mask, 'PC2'],
              label=cell_type, alpha=0.7, s=100, color=color, 
              edgecolors='black', linewidth=0.8)

ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)', fontsize=13, fontweight='bold')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)', fontsize=13, fontweight='bold')
ax.set_title('PCA\n(Linear, Fast, Deterministic)', fontsize=15, fontweight='bold')
ax.legend(loc='best', frameon=True, shadow=True, fontsize=10)
ax.grid(True, alpha=0.3)

# 2. ICA
ax = axes[0, 1] if UMAP_AVAILABLE else axes[1]
for cell_type, color in zip(cell_types, colors):
    mask = ica_df['cell_type'] == cell_type
    ax.scatter(ica_df.loc[mask, 'IC1'], ica_df.loc[mask, 'IC2'],
              label=cell_type, alpha=0.7, s=100, color=color, 
              edgecolors='black', linewidth=0.8)

ax.set_xlabel('IC1', fontsize=13, fontweight='bold')
ax.set_ylabel('IC2', fontsize=13, fontweight='bold')
ax.set_title('ICA\n(Independent Sources, Statistical)', fontsize=15, fontweight='bold')
ax.legend(loc='best', frameon=True, shadow=True, fontsize=10)
ax.grid(True, alpha=0.3)

# 3. t-SNE
ax = axes[1, 0] if UMAP_AVAILABLE else axes[2]
tsne_result = tsne_results[30]
for cell_type, color in zip(cell_types, colors):
    mask = metadata_df['cell_type'] == cell_type
    ax.scatter(tsne_result[mask, 0], tsne_result[mask, 1],
              label=cell_type, alpha=0.7, s=100, color=color, 
              edgecolors='black', linewidth=0.8)

ax.set_xlabel('t-SNE 1', fontsize=13, fontweight='bold')
ax.set_ylabel('t-SNE 2', fontsize=13, fontweight='bold')
ax.set_title('t-SNE\n(Nonlinear, Local Structure, Slow)', fontsize=15, fontweight='bold')
ax.legend(loc='best', frameon=True, shadow=True, fontsize=10)
ax.grid(True, alpha=0.3)

# 4. UMAP (only if available)
if UMAP_AVAILABLE:
    ax = axes[1, 1]
    umap_result = umap_results[15]
    for cell_type, color in zip(cell_types, colors):
        mask = metadata_df['cell_type'] == cell_type
        ax.scatter(umap_result[mask, 0], umap_result[mask, 1],
                  label=cell_type, alpha=0.7, s=100, color=color, 
                  edgecolors='black', linewidth=0.8)
    
    ax.set_xlabel('UMAP 1', fontsize=13, fontweight='bold')
    ax.set_ylabel('UMAP 2', fontsize=13, fontweight='bold')
    ax.set_title('UMAP\n(Nonlinear, Global+Local, Fast)', fontsize=15, fontweight='bold')
    ax.legend(loc='best', frameon=True, shadow=True, fontsize=10)
    ax.grid(True, alpha=0.3)

# Derive the cell-type count from the data instead of hardcoding it
title = f'Dimensionality Reduction Methods Comparison\nRNA-Seq Data ({len(cell_types)} Cell Types)'
if not UMAP_AVAILABLE:
    title += '\n(UMAP not available - install with: pip install umap-learn)'

plt.suptitle(title, fontsize=18, fontweight='bold', y=0.995 if UMAP_AVAILABLE else 1.02)
plt.tight_layout(rect=[0, 0, 1, 0.94])  # reserve space at the top for the suptitle
plt.show()
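
Scatter plots are hard to compare by eye, so a numeric score can complement them. The sketch below is one possible approach, not something the notebook computes: it assumes scikit-learn's `trustworthiness` as the metric (how well each sample's nearest neighbors in the 2D embedding match its neighbors in the original space) and uses a small hypothetical dataset, `X`, in place of the notebook's matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

rng = np.random.default_rng(0)

# Hypothetical toy data: 4 groups of 25 samples, 100 features each.
centers = rng.normal(scale=3.0, size=(4, 100))
X = np.vstack([c + rng.normal(size=(25, 100)) for c in centers])

# Two embeddings of the same matrix.
embeddings = {
    'PCA': PCA(n_components=2).fit_transform(X),
    't-SNE': TSNE(n_components=2, perplexity=20, init='pca',
                  random_state=0).fit_transform(X),
}

# Trustworthiness lies in [0, 1]; higher means local neighborhoods
# in the embedding are more faithful to the original space.
scores = {name: trustworthiness(X, emb, n_neighbors=10)
          for name, emb in embeddings.items()}

for name, score in scores.items():
    print(f'{name}: trustworthiness = {score:.3f}')
```

In the notebook, the same call could presumably be applied to `data_scaled` together with `tsne_results[30]` or `umap_results[15]`. The score only measures local neighborhood preservation, so it says nothing about how well global distances are kept.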

---
## Summary: When to Use Each Method

| Method | Best For | Speed | Preserves | Deterministic | Transforms New Data |
|--------|----------|-------|-----------|---------------|---------------------|
| **PCA** | Quick exploration, preprocessing | ⚡⚡⚡ Fast | Global variance (linear) | ✅ Yes | ✅ Yes |
| **ICA** | Finding independent processes | ⚡⚡ Medium | Statistical independence | ⚠️ With fixed seed | ✅ Yes |
| **t-SNE** | Visualization, cluster discovery | ⚡ Slow | Local structure | ⚠️ With fixed seed | ❌ No |
| **UMAP** | Large datasets, general use | ⚡⚡⚡ Fast | Local + some global | ⚠️ With fixed seed | ✅ Yes |
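
The last column of the table can be checked directly in scikit-learn. The sketch below uses hypothetical random data (`train` and `new` are placeholders, not the notebook's matrices). It projects held-out samples with a fitted PCA and lists which estimators expose a `transform` method. umap-learn's `UMAP` also provides `transform`; it is left out here so the example runs without that package.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)

# Hypothetical data: 60 "training" samples and 5 new samples, 50 features.
train = rng.normal(size=(60, 50))
new = rng.normal(size=(5, 50))

# A fitted PCA can project unseen samples onto the existing components.
pca_model = PCA(n_components=2).fit(train)
new_pcs = pca_model.transform(new)
print(new_pcs.shape)

# scikit-learn's TSNE only offers fit / fit_transform, so new samples
# cannot be placed into an existing t-SNE embedding.
supports_transform = {
    'PCA': hasattr(PCA, 'transform'),
    'FastICA': hasattr(FastICA, 'transform'),
    'TSNE': hasattr(TSNE, 'transform'),
}
print(supports_transform)
```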

### Recommendations:

1. **Start with PCA**
   - Quick overview of data structure
   - Identify major sources of variation
   - Check for batch effects or outliers

2. **Use ICA for**
   - Identifying independent biological processes
   - Gene module discovery
   - Separating technical from biological variation

3. **Use t-SNE for**
   - Publication-quality 2D visualizations
   - Discovering subtle subpopulations
   - Small to medium datasets (< 10,000 samples)

4. **Use UMAP for**
   - Large datasets (> 10,000 samples)
   - When you need to transform new data
   - Balance between speed and quality
   - Downstream analysis (clustering, etc.)

### Pro Tips:

- **Preprocessing matters**: Normalize for sequencing depth, log-transform the counts (e.g. `log2(x + 1)`), and select highly variable genes before reducing dimensions
- **Multiple methods**: Use PCA first, then t-SNE/UMAP for visualization
- **Parameter tuning**: Try several values of the key parameters (perplexity for t-SNE, n_neighbors and min_dist for UMAP) and trust only the structure that stays stable across settings
- **Biological validation**: Dimensionality reduction is exploratory - validate findings with differential expression analysis
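
The first two tips can be combined into one short pipeline. The following is a minimal sketch on hypothetical Poisson counts. The simulation, `n_top = 100`, and the choice of 20 principal components are illustrative assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical toy counts: 3 groups of 30 samples, 500 genes.
n_per_group, n_genes = 30, 500
groups = np.repeat(np.arange(3), n_per_group)
base_mean = rng.gamma(shape=2.0, scale=50.0, size=n_genes)
fold = np.ones((3, n_genes))
for g in range(3):
    fold[g, g * 20:(g + 1) * 20] = 4.0  # 20 marker genes per group
counts = rng.poisson(base_mean * fold[groups])

# 1. Normalize for library size (counts per million), then log-transform.
cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
log_expr = np.log2(cpm + 1.0)

# 2. Keep the most variable genes.
n_top = 100
top_idx = np.argsort(log_expr.var(axis=0))[::-1][:n_top]
hvg = log_expr[:, top_idx]

# 3. Scale, compress with PCA, then embed the PCs with t-SNE.
scaled = StandardScaler().fit_transform(hvg)
pcs = PCA(n_components=20, random_state=0).fit_transform(scaled)
embedding = TSNE(n_components=2, perplexity=15, init='pca',
                 random_state=0).fit_transform(pcs)

print(pcs.shape, embedding.shape)
```

Running t-SNE (or UMAP) on a few dozen principal components rather than on all genes reduces noise and runtime. The number of components to keep is itself a parameter worth varying.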

---

## Practical Exercise

Try modifying the code above to:

1. **Change the number of cell types** or samples per type
2. **Experiment with different preprocessing** (no log transform, different gene selections)
3. **Try 3D visualizations** by setting `n_components=3`
4. **Compare different distance metrics** in UMAP (correlation, cosine, manhattan)
5. **Add batch effects** to the simulated data and see how methods handle them
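
As a possible starting point for exercise 5, the sketch below adds a batch effect to toy data and checks which principal components pick it up. Everything here (sample counts, effect sizes, the balanced crossed design, and names such as `batch_shift`) is a hypothetical stand-in for the notebook's own simulation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
n_samples, n_genes = 80, 200

# Balanced crossed design: each batch contains both cell types equally.
cell_type = np.tile([0, 1], n_samples // 2)   # alternating 0, 1, 0, 1, ...
batch = np.repeat([0, 1], n_samples // 2)     # first half batch 0, second half batch 1

expr = rng.normal(size=(n_samples, n_genes))
expr[cell_type == 1, :20] += 3.0              # biological signal on 20 genes

# Batch effect: a gene-specific offset added to every sample in batch 1.
batch_shift = rng.normal(scale=1.5, size=n_genes)
expr[batch == 1] += batch_shift

scores = PCA(n_components=3).fit_transform(expr)

# Correlate each PC with the batch and cell-type labels.
batch_corr, type_corr = [], []
for k in range(3):
    batch_corr.append(abs(np.corrcoef(scores[:, k], batch)[0, 1]))
    type_corr.append(abs(np.corrcoef(scores[:, k], cell_type)[0, 1]))
    print(f'PC{k + 1}: |r| with batch = {batch_corr[k]:.2f}, '
          f'with cell type = {type_corr[k]:.2f}')
```

With a batch offset this strong, a leading component should track batch rather than biology, which is the pattern to look for when checking real data. Weakening `batch_shift` or confounding batch with cell type shows how the picture changes.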

### Challenge:
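Can you identify which genes drive the separation? For PCA and ICA, use the component loadings/weights. t-SNE and UMAP have no loadings, so for those you need an indirect route, such as comparing gene expression between the clusters they reveal.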
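
A minimal sketch of the PCA case, assuming toy data in which ten "driver" genes are planted on purpose. The matrix `expr`, the names in `gene_names`, and the planted effect are all hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n_samples, n_genes = 60, 200
gene_names = np.array([f'gene_{i}' for i in range(n_genes)])

# Two groups of 30 samples; genes 0-9 are shifted in the second group.
expr = rng.normal(size=(n_samples, n_genes))
expr[n_samples // 2:, :10] += 5.0

pca_model = PCA(n_components=2).fit(expr)

# components_ has shape (n_components, n_genes): one weight per gene.
loadings = pca_model.components_[0]

# Rank genes by absolute weight on PC1.
top = np.argsort(np.abs(loadings))[::-1][:10]
for name, weight in zip(gene_names[top], loadings[top]):
    print(f'{name}: {weight:+.3f}')
```

In the notebook, `pca.components_` from the fitted `pca` object would presumably play the role of `loadings`, indexed by the gene names of the filtered matrix. The sign of a loading is arbitrary, so rank genes by absolute value.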

In [None]:
# Your code here for experimentation!
