# In-Class Exercise: Dimensionality Reduction with PCA and t-SNE

**Time: 10 minutes**

**Instructions:**
- Work in groups of 2-3 people
- Complete the code cells below by filling in the missing parts
- After completing the exercise, pair up with another group to review and compare your solutions

---


## Setup: Load the Data

We'll use the Wine dataset from scikit-learn, which has 178 samples, 13 features (chemical properties), and 3 wine classes.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Load the wine dataset
wine = load_wine()
X = wine.data
y = wine.target

print(f"Dataset shape: {X.shape}")
print(f"Number of features: {X.shape[1]}")
print(f"Number of samples: {X.shape[0]}")
print(f"Classes: {wine.target_names}")


## Part 1: Data Preprocessing (2 minutes)

**Task:** Before applying PCA, we need to standardize the data. Why is this important?

**Your answer here:** _(Discuss with your group and write a brief answer)_
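
If your group wants evidence to support its answer, the sketch below may help. It prints the spread of each raw feature and, as an illustrative extra step that is not part of the exercise, fits PCA to the unscaled data to show how much variance the first component claims. The variable names (`raw_std`, `ratio_raw`) are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA

wine = load_wine()
X = wine.data

# Standard deviation of each raw (unscaled) feature, largest first
raw_std = X.std(axis=0)
for name, s in sorted(zip(wine.feature_names, raw_std), key=lambda t: -t[1]):
    print(f"{name:<30s} std = {s:8.3f}")

# Illustration only: PCA on the unscaled data
ratio_raw = PCA(n_components=2).fit(X).explained_variance_ratio_
print(f"\nUnscaled PCA, variance explained by PC1: {ratio_raw[0]:.3f}")
```

Compare the feature with the largest spread against the share of variance that PC1 captures, and think about what that implies for the other twelve features.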

---

Complete the code below to standardize the data:


In [None]:
# TODO: Create a StandardScaler and fit_transform the data
scaler = StandardScaler()
X_scaled = ___  # Fill in: fit the scaler on X and transform X in one call

# Verify: the mean should be ~0 and std should be ~1
print(f"Mean of first feature (should be ~0): {X_scaled[:, 0].mean():.6f}")
print(f"Std of first feature (should be ~1): {X_scaled[:, 0].std():.6f}")


## Part 2: Apply PCA (3 minutes)

**Task:** Apply PCA to reduce the data from 13 dimensions to 2 dimensions for visualization.


In [None]:
# TODO: Create a PCA object with 2 components and fit_transform the scaled data
pca = PCA(n_components=___)  # Fill in: how many components?
X_pca = ___  # Fill in: fit PCA on X_scaled and transform it in one call

print(f"Reduced data shape: {X_pca.shape}")

# Calculate explained variance
explained_var = pca.explained_variance_ratio_
print(f"\nExplained variance by PC1: {explained_var[0]:.3f}")
print(f"Explained variance by PC2: {explained_var[1]:.3f}")
print(f"Total explained variance: {explained_var.sum():.3f}")


**Discussion Question:** What percentage of the total variance is explained by the first two principal components? Is this enough for the 2D plot to be a faithful picture of the 13-dimensional data?
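
One way to reason about "enough" is to look at the cumulative explained variance over all 13 components. The sketch below is a self-contained illustration, not part of the graded exercise. It uses `sklearn.preprocessing.scale` as a shortcut for standardization, and the 90% `threshold` is an arbitrary value chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Standardize each feature to zero mean and unit variance
X_std = scale(load_wine().data)

# Keep all components so the ratios sum to 1
pca_full = PCA().fit(X_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for k, c in enumerate(cumulative, start=1):
    print(f"{k:2d} components: {c:.3f} of total variance")

# Smallest number of components reaching the (hypothetical) threshold
threshold = 0.90
k_needed = int(np.searchsorted(cumulative, threshold)) + 1
print(f"\nComponents needed for {threshold:.0%}: {k_needed}")
```

Whether two components suffice depends on the goal: a rough visual overview tolerates more lost variance than a feature-compression step feeding a downstream model.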


## Part 3: Visualize PCA Results (2 minutes)

**Task:** Complete the plotting code to visualize the wine data in the PCA space.


In [None]:
# TODO: Complete the scatter plot
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], 
                     c=___,  # Fill in: what should color the points?
                     cmap='viridis', 
                     edgecolor='k', 
                     s=50)
plt.xlabel('___')  # Fill in: appropriate label
plt.ylabel('___')  # Fill in: appropriate label
plt.title('Wine Dataset - PCA Projection')
plt.colorbar(scatter, label='Wine Class')
plt.show()


**Discussion Question:** Are the three wine classes well-separated in the PCA space?
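
To put a rough number on "well-separated", one option is the silhouette score of the 2D projection, computed with the true class labels. This is an assumption made for this sketch and not something the exercise requires. Scores near 1 suggest compact, well-separated classes, and scores near 0 suggest overlap.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import scale

wine = load_wine()
X_std = scale(wine.data)  # zero mean, unit variance per feature

# Project onto the first two principal components
pca_2d = PCA(n_components=2).fit(X_std)
X_2d = pca_2d.transform(X_std)

# Silhouette score using the known wine classes as cluster labels
score = silhouette_score(X_2d, wine.target)
print(f"Silhouette score in PCA space: {score:.3f}")
```

Treat the score as a complement to the plot, not a replacement. It summarizes all three classes in a single number and can hide one overlapping pair.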


## Part 4: Compare with t-SNE (3 minutes)

**Task:** Apply t-SNE to the same dataset and compare the results with PCA.


In [None]:
# TODO: Apply t-SNE with perplexity=30 and random_state=42
tsne = TSNE(n_components=2, 
            perplexity=___,  # Fill in: use 30
            random_state=42)
X_tsne = ___  # Fill in: fit t-SNE on X_scaled and get the embedding in one call (TSNE has no separate transform method)

print(f"t-SNE reduced data shape: {X_tsne.shape}")


In [None]:
# Visualize t-SNE results
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], 
                     c=y, 
                     cmap='viridis', 
                     edgecolor='k', 
                     s=50)
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.title('Wine Dataset - t-SNE Projection')
plt.colorbar(scatter, label='Wine Class')
plt.show()


## Part 5: Comparison and Discussion

**Group Discussion Questions:**

1. **Separation:** Which method (PCA or t-SNE) shows better separation between the three wine classes?

2. **Interpretability:** Can you interpret what PC1 and PC2 represent? Can you interpret what the t-SNE components represent?

3. **Use Cases:** Based on what you learned in the lecture:
   - When would you use PCA over t-SNE?
   - When would you use t-SNE over PCA?

4. **Limitations:** What are the key limitations of each method?
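
For question 2, it can help to inspect the PCA loadings, the weights each original feature contributes to a component. The sketch below is one possible way to list the three strongest features per component. The cutoff of three and the variable names are choices made for this example only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

wine = load_wine()
X_std = scale(wine.data)  # zero mean, unit variance per feature

pca = PCA(n_components=2).fit(X_std)

# Each row of components_ is a unit vector of feature weights (loadings)
for i, comp in enumerate(pca.components_, start=1):
    order = np.argsort(np.abs(comp))[::-1][:3]  # indices of the 3 largest |weights|
    top = ", ".join(f"{wine.feature_names[j]} ({comp[j]:+.2f})" for j in order)
    print(f"PC{i}: {top}")
```

scikit-learn's `TSNE` exposes no comparable attribute, because the embedding is not a linear combination of the input features. Keep that in mind when answering the second half of the question.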

---

**Write your group's answers below:**

1. Separation:

2. Interpretability:

3. Use Cases:

4. Limitations:


---

## Next Steps: Peer Review

**Now that you've completed the exercise:**

1. **Find another group** to pair up with
2. **Compare your solutions:**
   - Did you get similar visualizations?
   - Do you agree on the discussion questions?
   - Did you have different interpretations?
3. **Discuss any differences** in your approaches or conclusions
4. **Share insights** you discovered during the exercise

---

## Bonus Challenge (If Time Permits)

Try experimenting with different perplexity values for t-SNE (e.g., 5, 30, 50). How does the choice of perplexity change the layout and apparent grouping of the points?


In [None]:
# Bonus: Experiment with different perplexity values
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
perplexities = [5, 30, 50]

for idx, perp in enumerate(perplexities):
    tsne_temp = TSNE(n_components=2, perplexity=perp, random_state=42)
    X_tsne_temp = tsne_temp.fit_transform(X_scaled)
    
    scatter = axes[idx].scatter(X_tsne_temp[:, 0], X_tsne_temp[:, 1], 
                               c=y, cmap='viridis', edgecolor='k', s=30)
    axes[idx].set_title(f't-SNE (perplexity={perp})')
    axes[idx].set_xlabel('Component 1')
    axes[idx].set_ylabel('Component 2')

plt.tight_layout()
plt.show()
