
### **Hands-on: Visualizing High-Dimensional Data with PCA, t-SNE, and UMAP**

#### **Objective**
In this assignment, you will learn how to reduce high-dimensional data to two dimensions and visualize it using three popular dimensionality reduction techniques: PCA (Principal Component Analysis), t-SNE (t-Distributed Stochastic Neighbor Embedding), and UMAP (Uniform Manifold Approximation and Projection). The goal is to observe how well each method captures and visualizes clusters based on different wine cultivars.

#### **Dataset**
We will use the **Wine Dataset** from `scikit-learn`, which contains 13 chemical features of wines from three cultivars. Each wine sample is labeled according to its cultivar, allowing us to assess the clustering effectiveness of each dimensionality reduction technique.

#### **Instructions**

1. **Load and Prepare the Data**
   - Load the Wine dataset from `sklearn.datasets`.
   ```
   from sklearn.datasets import load_wine
   wine = load_wine()  # Load the dataset before accessing its fields
   X = wine.data  # Features
   y = wine.target  # Labels
   target_names = wine.target_names # Wine Cultivars (class_0, class_1, class_2)
   ```
   - Extract the features (`X`) and target labels (`y`), where each row represents the chemical composition of a wine sample, and each label represents its cultivar.
   
   - Standardize the data using `StandardScaler` to ensure each feature has a mean of 0 and a standard deviation of 1. Standardization improves the performance of dimensionality reduction techniques by balancing the scale of features.

   ```
   from sklearn.preprocessing import StandardScaler
   scaler = StandardScaler()
   X_scaled = scaler.fit_transform(X)
   ```

2. **Implement Dimensionality Reduction Techniques**
   - **PCA**: Use `sklearn.decomposition.PCA` to reduce the 13-dimensional standardized data to 2D.
   - **t-SNE**: Use `sklearn.manifold.TSNE` to reduce the 13-dimensional standardized data to 2D. Experiment with the `perplexity` parameter to observe its effect on clustering.
   - **UMAP**: Use `umap.UMAP()` to reduce the 13-dimensional standardized data to 2D. Experiment with `n_neighbors` and `min_dist` parameters to see how they influence clustering.

3. **Track and Discuss Time Efficiency**
   - For each method (PCA, t-SNE, and UMAP), measure the time it takes to execute the dimensionality reduction.
   - To do this, use `time.time()` to record the start time before the reduction and end time after it. Calculate the duration by taking the difference between the start and end times.
   - Discuss the time efficiency of each method based on these measurements, noting which methods are faster and which may require more computational time.

4. **Visualize the Results**
   - For each method, create a 2D scatter plot of the first two components.
   - You may use **either Plotly Express or Seaborn’s `relplot`** to visualize the data:
     - If using **Plotly Express**: Use `px.scatter` to create an interactive scatter plot with the wine cultivars as colors.
     - If using **Seaborn’s `relplot`**: Use `sns.relplot` to create a scatter plot with `hue` set to the wine cultivars.
   - Each plot should clearly indicate the dimensionality reduction technique used (PCA, t-SNE, or UMAP) in the title.

5. **Analysis**
   - Observe and compare the clustering patterns for each method. Describe which method(s) provide the clearest separation of the wine cultivars.
   - Refer to your time measurements and discuss which technique is most computationally efficient and why this may be the case.
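
The timing steps in item 3 can be wrapped in a small helper. The sketch below is one possible way to do it, not a required approach: `timed_fit_transform` is a hypothetical name, and it swaps `time.time()` for `time.perf_counter()` on the assumption that a monotonic, high-resolution clock is preferable for short intervals.

```python
import time

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def timed_fit_transform(reducer, X):
    """Hypothetical helper: run reducer.fit_transform(X) and time it.

    Returns the embedding and the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    embedding = reducer.fit_transform(X)
    elapsed = time.perf_counter() - start
    return embedding, elapsed


X_scaled = StandardScaler().fit_transform(load_wine().data)
pca_embedding, pca_seconds = timed_fit_transform(PCA(n_components=2), X_scaled)
print(f"PCA execution time: {pca_seconds:.4f} seconds")
```

Because `TSNE` and `UMAP` also expose `fit_transform`, the same helper should accept them unchanged. A single run is a noisy measurement, so repeating each call a few times gives a more reliable comparison.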

#### **Deliverables**
- A Colab Notebook with:
  - Code to load, standardize, and apply PCA, t-SNE, and UMAP to the Wine dataset.
  - Scatter plots for each method using either Plotly Express or Seaborn `relplot`, with each plot showing the two-dimensional projection of the data.
  - A brief analysis of which technique produced the clearest clusters and a discussion on the computation time for each method.


In [10]:
!pip install umap-learn

Collecting umap-learn
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting pynndescent>=0.5 (from umap-learn)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Downloading umap_learn-0.5.7-py3-none-any.whl (88 kB)
Downloading pynndescent-0.5.13-py3-none-any.whl (56 kB)
Installing collected packages: pynndescent, umap-learn
Successfully installed pynndescent-0.5.13 umap-learn-0.5.7


In [17]:
# Imports and dataset loading
import time
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from umap import UMAP
import plotly.express as px

wine = load_wine()
X = wine.data  # Features
y = wine.target  # Labels
target_names = wine.target_names # Wine Cultivars (class_0, class_1, class_2)


In [2]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
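
A quick sanity check can confirm that standardization did what the instructions describe. This is a minimal sketch that reloads the data so it runs on its own; the variable names `col_means` and `col_stds` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X = load_wine().data
X_scaled = StandardScaler().fit_transform(X)

# StandardScaler uses the population standard deviation (ddof=0),
# which matches NumPy's default for std().
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

print("all means ~ 0:", np.allclose(col_means, 0.0))
print("all stds  ~ 1:", np.allclose(col_stds, 1.0))
```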

In [23]:
start_time = time.time()

pca_2d = PCA(n_components=2)
pca_projections = pca_2d.fit_transform(X_scaled)

end_time = time.time()

print(f"PCA execution time: {end_time - start_time:.4f} seconds")

# Plot the PCA projections
fig = px.scatter(
    pca_projections, x=pca_projections[:, 0], y=pca_projections[:, 1],
    color=wine.target.astype(str), labels={'color': 'Wine Cultivar'}
)
fig.update_layout(title="PCA Projections of Wine Dataset")
fig.show()



PCA execution time: 0.0081 seconds
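
To judge how much information the 2D PCA view keeps, one option is to inspect `explained_variance_ratio_` on the fitted model. The sketch below refits PCA on the standardized data so it is self-contained; `retained` is an illustrative variable name.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

pca = PCA(n_components=2).fit(X_scaled)
ratios = pca.explained_variance_ratio_  # fraction of variance per component
retained = ratios.sum()

print(f"PC1: {ratios[0]:.1%}, PC2: {ratios[1]:.1%}, total: {retained:.1%}")
```

If the total is well below 100%, some of the overlap in the scatter plot may come from structure that lives in the discarded components rather than from the cultivars being truly inseparable.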


In [36]:
start_time = time.time()
# Applying t-SNE to reduce dimensions to 2
tsne_2d = TSNE(n_components=2, perplexity=47, random_state=42)

# Creating 2D projection of wine data
tsne_projections = tsne_2d.fit_transform(X_scaled)
end_time = time.time()

print(f"t-SNE execution time: {end_time - start_time:.4f} seconds")

# Plot the t-SNE projections
fig = px.scatter(
    tsne_projections, x=0, y=1,
    color=wine.target.astype(str), labels={'color': 'Wine Cultivar'}
)
fig.update_layout(title="t-SNE Projections of Wine Dataset")
fig.show()


t-SNE execution time: 2.2974 seconds
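
The instructions ask for experimenting with `perplexity`. A minimal sketch of such a sweep follows; the values 5 and 30 are illustrative choices (47 is the one used in the cell above), and scoring each embedding with scikit-learn's `trustworthiness` is an assumption about what makes a useful comparison, not part of the assignment.

```python
from sklearn.datasets import load_wine
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Illustrative perplexity values; 47 matches the t-SNE cell above.
perplexities = [5, 30, 47]
scores = {}
for perplexity in perplexities:
    embedding = TSNE(
        n_components=2, perplexity=perplexity, random_state=42
    ).fit_transform(X_scaled)
    # Trustworthiness lies in [0, 1]; higher means the nearest neighbors
    # in the 2D embedding were also close in the original 13D space.
    scores[perplexity] = trustworthiness(X_scaled, embedding, n_neighbors=5)

for perplexity, score in scores.items():
    print(f"perplexity={perplexity}: trustworthiness={score:.3f}")
```

Trustworthiness only measures how well local neighborhoods are preserved, so it complements the scatter plots rather than replacing them.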


In [40]:
start_time = time.time()
# Initialize UMAP with n_neighbors and min_dist parameters
umap_2d = UMAP(n_components=2, n_neighbors=9, min_dist=0.1, random_state=42)

# Fitting the UMAP model to the wine data
umap_2d.fit(X_scaled)

umap_projections = umap_2d.transform(X_scaled)

end_time = time.time()

print(f"UMAP execution time: {end_time - start_time:.4f} seconds")

# Create scatter plot of projection

fig = px.scatter(
    umap_projections, x=0, y=1,
    color = wine.target.astype(str), labels ={'color': 'Wine Cultivar'}
)
fig.update_layout(title="UMAP Projections of Wine Dataset")

fig.show()


n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



UMAP execution time: 0.9775 seconds


**A brief analysis of which technique produced the clearest clusters and a discussion on the computation time for each method.**

PCA shows some separation between the cultivars, but the margins are not clear and the clusters overlap. This is probably because PCA is a linear dimensionality reduction method and may not be the best at capturing non-linear relationships in the data. However, it was the fastest of the three methods (0.0081 seconds).

t-SNE produces more defined clusters than PCA, but there is still some overlap. t-SNE is good at revealing local clusters and very effective at capturing complex relationships, but its output can vary a lot with different perplexity values. It took significantly longer to run than PCA (2.2974 seconds), which is to be expected because it is an iterative optimization approach.

UMAP had the most clearly separated clusters, with very minimal overlap. This suggests that it is better at maintaining both local and global data relationships. Its run time was also adequate (0.9775 seconds): faster than t-SNE but slower than PCA.
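
The visual comparison can be backed by a rough numeric check. The sketch below scores each 2D embedding with the silhouette coefficient, using the known cultivar labels as the cluster assignment; this choice of metric is an assumption, and `separation` is an illustrative name. Only PCA and t-SNE are included to keep the example limited to scikit-learn.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X_scaled),
    "t-SNE": TSNE(
        n_components=2, perplexity=47, random_state=42
    ).fit_transform(X_scaled),
}

# Silhouette ranges from -1 to 1; higher means the cultivars are
# tighter and farther apart in the 2D embedding.
separation = {name: silhouette_score(emb, y) for name, emb in embeddings.items()}
for name, score in separation.items():
    print(f"{name}: silhouette={score:.3f}")
```

A UMAP embedding such as `umap_projections` could be added as another entry in `embeddings` and scored the same way.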