# Dimensionality Reduction Techniques: PCA, t-SNE, and UMAP

In this notebook, we explore three popular dimensionality reduction methods: **Principal Component Analysis (PCA)**, **t-SNE**, and **UMAP**. Each is introduced with a technical explanation and code examples. We begin with a synthetic 2D dataset to illustrate PCA, then apply PCA to the Iris dataset, and finally compare t-SNE, UMAP, and PCA on a synthetic 3D dataset.

## Objectives

By the end of this notebook, you will be able to:

- Understand the theory behind PCA and how it finds the directions of maximum variance.
- Apply PCA to both synthetic and real datasets (Iris data) for dimensionality reduction.
- Implement t-SNE and UMAP for non-linear dimensionality reduction and compare their performance with PCA.
- Visualize and interpret the results of each algorithm in terms of cluster structure and variance preservation.

In [None]:
# Install the required libraries (uncomment the next line and run it if they are not already installed)
# !pip install numpy pandas matplotlib scikit-learn plotly umap-learn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import make_blobs, load_iris
from sklearn.preprocessing import StandardScaler

import umap  # provided by the umap-learn package; exposes umap.UMAP
import plotly.express as px

# Configure matplotlib inline (if using Jupyter Notebook)
%matplotlib inline

## Part I: PCA on a Synthetic 2D Dataset

In this section, we generate a bivariate normal dataset and use PCA to find its principal components. We then project the data onto these axes and visualize both the original data and its projections.

In [None]:
# Generate a 2D dataset using a bivariate normal distribution
np.random.seed(42)
mean = [0, 0]
cov = [[3, 2], [2, 2]]  
X_2d = np.random.multivariate_normal(mean, cov, 200)

### Visualize the Original 2D Data

We start by plotting a scatter plot of our two features to see the structure of the data.

In [None]:
# Scatter plot of the original 2D data
plt.figure()
plt.scatter(X_2d[:, 0], X_2d[:, 1], edgecolor='k', alpha=0.7)
plt.title("Scatter Plot of 2D Bivariate Normal Distribution")
plt.xlabel("X1")
plt.ylabel("X2")
plt.axis('equal')
plt.grid(True)
plt.show()

### Apply PCA to the 2D Data

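We initialize a PCA model with 2 components, fit it to our data, and then transform the data into its principal component space. The principal components are mutually orthogonal unit vectors, ordered so that the first points along the direction of maximum variance and the second along the direction of maximum remaining variance.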

In [None]:
# Apply PCA on the 2D data
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_2d)

components = pca_2d.components_
explained_variance_ratio = pca_2d.explained_variance_ratio_

print("Principal Components:\n", components)
print("Explained Variance Ratio:", explained_variance_ratio)
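
As a cross-check on what `PCA` reports, the same directions can be recovered by hand from the eigendecomposition of the sample covariance matrix. The following is a minimal sketch, assuming the same seed and covariance as above; the variable names (`cov_hat`, `eigvals`, `eigvecs`) are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Recreate the same 2D sample used above
np.random.seed(42)
X = np.random.multivariate_normal([0, 0], [[3, 2], [2, 2]], 200)

# Sample covariance (rows are observations), then its eigendecomposition
cov_hat = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov_hat)  # eigh returns eigenvalues in ascending order

# Reorder so the largest-variance direction comes first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pca = PCA(n_components=2).fit(X)

# Eigenvectors are defined only up to sign, so compare absolute values
directions_match = np.allclose(np.abs(eigvecs.T), np.abs(pca.components_))
ratios_match = np.allclose(eigvals / eigvals.sum(), pca.explained_variance_ratio_)
print("Directions match:", directions_match)
print("Variance ratios match:", ratios_match)
```

The rows of `components_` correspond to the columns of `eigvecs`, which is why the eigenvector matrix is transposed before the comparison.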

### Projecting Data onto Principal Component Axes

The coordinate of each point along a principal component is the dot product of the mean-centered data with that component. PCA subtracts the feature means before projecting, so we do the same here using `pca_2d.mean_`. We then map these projections back to the original feature space, adding the mean back, for visualization.

In [None]:
# Center the data the same way PCA does, then project onto each principal component
X_centered = X_2d - pca_2d.mean_
projection_pc1 = np.dot(X_centered, components[0])
projection_pc2 = np.dot(X_centered, components[1])

# Map the projections back to the original feature space (restoring the mean)
x_pc1 = projection_pc1 * components[0][0] + pca_2d.mean_[0]
y_pc1 = projection_pc1 * components[0][1] + pca_2d.mean_[1]
x_pc2 = projection_pc2 * components[1][0] + pca_2d.mean_[0]
y_pc2 = projection_pc2 * components[1][1] + pca_2d.mean_[1]
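
Because PCA centers the data before projecting, a manual projection only agrees with `transform` when the fitted mean is subtracted first. The sketch below checks this under the same seed and covariance as above, and also confirms that what is left after projecting onto PC1 has no component along PC1. The names `scores_manual` and `X_on_pc1` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(42)
X = np.random.multivariate_normal([0, 0], [[3, 2], [2, 2]], 200)
pca = PCA(n_components=2).fit(X)

# Scores: center first, then take dot products with each component (rows of components_)
scores_manual = (X - pca.mean_) @ pca.components_.T
scores_sklearn = pca.transform(X)
scores_match = np.allclose(scores_manual, scores_sklearn)

# Rank-1 reconstruction: keep only the PC1 score, map back, and restore the mean
X_on_pc1 = np.outer(scores_manual[:, 0], pca.components_[0]) + pca.mean_

# The residual from the PC1 line should be orthogonal to PC1
residual = X - X_on_pc1
max_residual_along_pc1 = np.abs(residual @ pca.components_[0]).max()

print("Manual scores match transform():", scores_match)
print("Largest residual component along PC1:", max_residual_along_pc1)
```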

### Visualize the Projections

Here, we overlay the original data with its projections onto the first and second principal components.

In [None]:
# Plot the original data and its projections onto PC1 and PC2
plt.figure()
plt.scatter(X_2d[:, 0], X_2d[:, 1], label='Original Data', c='gray', edgecolor='k', alpha=0.6)
plt.scatter(x_pc1, y_pc1, marker='X', s=70, c='r', edgecolor='k', alpha=0.5, label='Projection on PC1')
plt.scatter(x_pc2, y_pc2, marker='X', s=70, c='b', edgecolor='k', alpha=0.5, label='Projection on PC2')
plt.title('2D Data Projected onto Principal Components')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.axis('equal')
plt.show()

### Reflection on PCA (2D Data)

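The printed explained variance ratios show the fraction of the total variance captured by each principal component. The covariance matrix used to generate the data, `[[3, 2], [2, 2]]`, has eigenvalues of approximately 4.56 and 0.44, so the first component should capture roughly 91% of the variance (the sample estimate will differ slightly). This is why the projections onto PC1 spread along the long axis of the point cloud, while the projections onto PC2 stay close to the center.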

## Part II: PCA on the Iris Dataset

Next, we apply PCA to a real-world dataset: the Iris dataset. It has four features, and we reduce it to two principal components to visualize how well the three flower species separate.

In [None]:
# Load and standardize the Iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
target_names = iris.target_names

scaler = StandardScaler()
X_iris_scaled = scaler.fit_transform(X_iris)
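
To make the standardization step concrete, the sketch below reproduces what `StandardScaler` does with plain NumPy: subtract each feature's mean and divide by its standard deviation. One detail worth knowing is that the scaler uses the population standard deviation (`ddof=0`), which is NumPy's default. The name `X_manual` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Manual z-scoring with the population standard deviation (ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
X_scaled = StandardScaler().fit_transform(X)

scaler_matches = np.allclose(X_manual, X_scaled)
print("Manual z-scores match StandardScaler:", scaler_matches)
print("Feature means after scaling:", X_scaled.mean(axis=0).round(6))
print("Feature std devs after scaling:", X_scaled.std(axis=0).round(6))
```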

### Apply PCA to the Iris Dataset

We now reduce the Iris dataset from four to two dimensions using PCA. This helps in visualizing the separability of the species.

In [None]:
# Apply PCA on the Iris dataset
pca_iris = PCA(n_components=2)
X_iris_pca = pca_iris.fit_transform(X_iris_scaled)

print("Explained Variance Ratio (Iris):", pca_iris.explained_variance_ratio_)
print("Combined Variance Explained:", np.sum(pca_iris.explained_variance_ratio_)*100, "%")

In [None]:
# Plot the PCA-transformed Iris data
plt.figure(figsize=(8,6))
colors = ['navy', 'turquoise', 'darkorange']
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_iris_pca[y_iris == i, 0], X_iris_pca[y_iris == i, 1], color=color, s=50, edgecolor='k', alpha=0.7, label=target_name)
plt.title('PCA: 2D Reduction of the Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='best')
plt.grid(True)
plt.show()

### Reflection on the Iris PCA Result

On the standardized data, the two principal components together capture roughly 96% of the variance in the Iris dataset (about 73% and 23%, respectively). This reduction makes the natural grouping of the species visible: setosa forms a clearly separate group, while versicolor and virginica partially overlap.
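
One way to judge whether two components are enough is to fit PCA with all four components and look at the cumulative explained variance. The sketch below does this for the standardized Iris data. The 95% cutoff is an arbitrary value chosen for illustration, not a recommendation from this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Fit with all components so the ratios sum to 1
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for k, frac in enumerate(cumulative, start=1):
    print(f"{k} component(s): {frac:.1%} of total variance")

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95  # hypothetical cutoff, for illustration only
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(f"Components needed to reach {threshold:.0%}:", n_needed)
```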

## Part III: Comparing t-SNE, UMAP, and PCA on a Synthetic 3D Dataset

In this section, we generate a synthetic dataset with clusters in a 3-dimensional space. We then apply t-SNE, UMAP, and PCA to reduce the data to 2 dimensions. This comparison helps illustrate the strengths and trade-offs of each algorithm in preserving cluster structure and local relationships.

In [None]:
# Generate a synthetic 3D dataset with 4 clusters
centers = [[2, -6, -6], [-1, 9, 4], [-8, 7, 2], [4, 7, 9]]
cluster_std = [1, 1, 2, 3.5]
X_3d, labels_3d = make_blobs(n_samples=500, centers=centers, n_features=3, cluster_std=cluster_std, random_state=42)

### Interactive 3D Visualization

Below is an interactive 3D scatter plot of the synthetic data using Plotly. Use the tools provided in the plot to rotate, zoom, and pan.

In [None]:
# Create a DataFrame and plot an interactive 3D scatter plot
df_3d = pd.DataFrame(X_3d, columns=['X', 'Y', 'Z'])
fig = px.scatter_3d(df_3d, x='X', y='Y', z='Z', color=labels_3d.astype(str), opacity=0.7,
                     title="Interactive 3D Scatter Plot of Synthetic Data")
fig.update_traces(marker=dict(size=5, line=dict(width=1, color='black')), showlegend=False)
fig.update_layout(width=800, height=600)
fig.show()

### Standardize the 3D Data

Standardization rescales each feature to zero mean and unit variance, so that no feature dominates the distance and variance computations simply because of its scale.

In [None]:
scaler_3d = StandardScaler()
X_3d_scaled = scaler_3d.fit_transform(X_3d)

### Apply t-SNE to the 3D Data

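We reduce the data from 3D to 2D using t-SNE. The algorithm is particularly good at preserving local neighborhoods, but the embedding depends on the `perplexity` parameter (roughly, the effective number of neighbors considered for each point), and distances between well-separated clusters in the plot should not be over-interpreted. Note that the `max_iter` argument requires scikit-learn 1.5 or later; earlier releases call it `n_iter`.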

In [None]:
tsne_model = TSNE(n_components=2, random_state=42, perplexity=30, max_iter=1000)
X_tsne = tsne_model.fit_transform(X_3d_scaled)
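
To see how sensitive t-SNE is to perplexity, the sketch below fits the model at three values on a smaller copy of the same blob data (300 points instead of 500, to keep it quick). The perplexity values are arbitrary choices for illustration. The reported KL divergence is the final value of the objective that t-SNE minimizes; it is useful as a convergence diagnostic within one setting, but values at different perplexities are not directly comparable because the target distribution changes.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Smaller version of the 3D blob data, standardized as in the notebook
centers = [[2, -6, -6], [-1, 9, 4], [-8, 7, 2], [4, 7, 9]]
X, _ = make_blobs(n_samples=300, centers=centers,
                  cluster_std=[1, 1, 2, 3.5], random_state=42)
X = StandardScaler().fit_transform(X)

# Fit one embedding per perplexity value (must stay below the number of samples)
embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    embeddings[perplexity] = tsne.fit_transform(X)
    print(f"perplexity={perplexity:>2}  final KL divergence={tsne.kl_divergence_:.3f}")
```

Plotting each entry of `embeddings` side by side, using the same scatter-plot code as in the next cell, shows how the cluster shapes and spacing shift with perplexity.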

### Plot the t-SNE Projection

The following scatter plot shows the 2D projection obtained from t-SNE.

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels_3d, cmap='viridis', s=50, edgecolor='k', alpha=0.7)
plt.title('2D t-SNE Projection of 3D Synthetic Data')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.xticks([])
plt.yticks([])
plt.show()

### Apply UMAP to the 3D Data

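UMAP is another non-linear dimensionality reduction technique. It preserves local neighborhoods and often retains more of the global layout than t-SNE. Here, we set `min_dist=0.5`, which controls how tightly points may be packed together in the embedding, and `spread=1`, which sets the overall scale of the embedded points. Fixing `random_state` makes the result reproducible, and setting `n_jobs=1` alongside it makes the single-threaded execution explicit.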

In [None]:
umap_model = umap.UMAP(n_components=2, random_state=42, min_dist=0.5, spread=1, n_jobs=1)
X_umap = umap_model.fit_transform(X_3d_scaled)

### Plot the UMAP Projection

We now visualize the 2D projection obtained from UMAP.

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(X_umap[:, 0], X_umap[:, 1], c=labels_3d, cmap='viridis', s=50, edgecolor='k', alpha=0.7)
plt.title('2D UMAP Projection of 3D Synthetic Data')
plt.xlabel('UMAP Component 1')
plt.ylabel('UMAP Component 2')
plt.xticks([])
plt.yticks([])
plt.show()

### Apply PCA to the 3D Data

Finally, we use PCA to reduce the 3D data to 2 dimensions. This serves as a baseline for comparison with t-SNE and UMAP.

In [None]:
pca_3d = PCA(n_components=2)
X_pca_3d = pca_3d.fit_transform(X_3d_scaled)

### Plot the PCA Projection for 3D Data

The following scatter plot shows the 2D projection from PCA.

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(X_pca_3d[:, 0], X_pca_3d[:, 1], c=labels_3d, cmap='viridis', s=50, edgecolor='k', alpha=0.7)
plt.title('2D PCA Projection of 3D Synthetic Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.xticks([])
plt.yticks([])
plt.show()

### Reflection on the 3D Data Projections

Compare the three 2D projections obtained from t-SNE, UMAP, and PCA:

- **t-SNE**: Excellent at preserving local structure; clusters often appear well separated, but the layout changes with the perplexity parameter, and the distances between clusters are not reliable.
- **UMAP**: Balances local and global structure; it often preserves more of the overall data connectivity than t-SNE.
- **PCA**: A linear method that preserves the directions of maximum variance; it is fast and easy to interpret, but it cannot capture non-linear relationships as well as t-SNE or UMAP.

Each method has its own trade-offs in computational cost and in the type of structure it preserves.
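
Visual comparison can be complemented with a number. One option, offered here as a sketch rather than the definitive metric, is scikit-learn's `trustworthiness` score, which measures how well each point's nearest neighbors in the 2D embedding were also neighbors in the original space (1.0 is best). The example below scores PCA and t-SNE on a smaller copy of the blob data; `n_neighbors=10` is an arbitrary choice. A UMAP embedding such as `X_umap` could be scored in the same way.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

# Smaller version of the 3D blob data, standardized as in the notebook
centers = [[2, -6, -6], [-1, 9, 4], [-8, 7, 2], [4, 7, 9]]
X, _ = make_blobs(n_samples=300, centers=centers,
                  cluster_std=[1, 1, 2, 3.5], random_state=42)
X = StandardScaler().fit_transform(X)

# Two-dimensional embeddings to compare
embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X),
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X),
}

# Trustworthiness lies in [0, 1]; higher means local neighborhoods are better preserved
scores = {name: trustworthiness(X, emb, n_neighbors=10)
          for name, emb in embeddings.items()}
for name, score in scores.items():
    print(f"{name}: trustworthiness = {score:.3f}")
```

Trustworthiness only assesses local neighborhoods, so it says nothing about how faithfully the distances between clusters are represented.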

## Conclusion

In this notebook, we demonstrated how to apply PCA, t-SNE, and UMAP to both synthetic and real datasets. We covered:

- The theory behind PCA and how to visualize its results on a 2D dataset and the Iris dataset.
- The application of t-SNE and UMAP on a synthetic 3D dataset, highlighting how non-linear methods can capture more complex relationships.

By comparing the results of these methods, you can choose the best dimensionality reduction technique for your specific problem.

Happy coding and exploring your data!