# Complete Guide to Unsupervised Learning Algorithms

## 📚 Introduction to Unsupervised Learning

**Unsupervised Learning** is a type of machine learning where we discover hidden patterns in data **without labeled examples**. Unlike supervised learning (where we have input-output pairs), unsupervised learning works with raw data to find structure, relationships, and patterns that aren't immediately obvious.

### 🌟 Why is Unsupervised Learning Important?

1. **Data Exploration**: Understand your data before applying supervised methods
2. **Feature Engineering**: Create new features for better model performance
3. **Anomaly Detection**: Find unusual patterns or outliers
4. **Data Compression**: Reduce data size while preserving important information
5. **Customer Segmentation**: Group customers based on behavior patterns

### 🏢 Real-World Examples

- **Netflix**: Grouping users with similar viewing preferences
- **Google News**: Clustering similar news articles together
- **Banking**: Detecting fraudulent transactions (anomaly detection)
- **Genetics**: Finding patterns in DNA sequences
- **Marketing**: Customer segmentation for targeted campaigns

Let's dive into the most important unsupervised learning algorithms! 🚀

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs, load_iris, make_swiss_roll
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage
import warnings
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("📦 All libraries imported successfully!")

📦 All libraries imported successfully!


## 1. K-Means Clustering

### 🧠 Intuitive Explanation

Imagine you're a teacher organizing students into study groups. You want each group to have students with similar study habits and performance levels. **K-Means** is like that organizing process:

1. You decide upfront how many groups (k) you want
2. You randomly place "group leaders" (centroids) in the classroom
3. Each student joins the nearest group leader
4. Group leaders move to the center of their group
5. Students might switch groups if they're now closer to a different leader
6. Repeat until everyone is happy with their groups!

### ⚙️ How It Works (Mechanism)

K-Means uses the concept of **centroids** (cluster centers) and **Euclidean distance** to group similar data points:

1. **Initialize**: Randomly place k centroids in the data space
2. **Assignment**: Each data point joins the closest centroid (using Euclidean distance: √[(x₁-x₂)² + (y₁-y₂)²])
3. **Update**: Move each centroid to the center (mean) of its assigned points
4. **Repeat**: Steps 2-3 until centroids stop moving significantly

The algorithm minimizes **inertia** (sum of squared distances from points to their centroid).

### 📝 Pseudo Structure / Workflow

```
ALGORITHM: K-Means Clustering
INPUT: Dataset X, number of clusters k
OUTPUT: Cluster assignments, centroids

1. INITIALIZE k centroids randomly
2. REPEAT until convergence:
   a. FOR each data point:
      - Calculate distance to all centroids
      - Assign to nearest centroid
   b. FOR each centroid:
      - Move to mean position of assigned points
   c. CHECK if centroids moved significantly
3. RETURN final clusters and centroids
```
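
The workflow above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the tolerance, and the choice to start from `k` randomly picked data points are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, tol=1e-6, seed=0):
    """Plain Lloyd's algorithm: random init, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2a: distance from every point to every centroid, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2b: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 2c: stop once the centroids barely move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment and inertia (sum of squared distances to own centroid)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = float((dists[np.arange(len(X)), labels] ** 2).sum())
    return labels, centroids, inertia

# Two well-separated synthetic blobs
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids, inertia = kmeans_sketch(X_demo, k=2)
print("Centroids:\n", centroids.round(2))
print("Inertia:", round(inertia, 2))
```

In practice `sklearn.cluster.KMeans` is the better choice: it defaults to k-means++ initialization and can run several restarts (`n_init`), which reduces the sensitivity to the starting centroids.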

### ✅ Use Cases

• **Customer Segmentation**: Group customers by purchase behavior  
• **Market Research**: Segment survey responses  
• **Image Compression**: Reduce color palette  
• **Gene Sequencing**: Group similar genetic patterns  
• **Recommendation Systems**: Group users with similar preferences  
• **Quality Control**: Identify defective products  

### 💡 Why & When To Use

**Strengths:**
- Simple and fast
- Works well with spherical clusters
- Scales well to large datasets
- Easy to implement and interpret

**Limitations:**
- Must choose k beforehand
- Assumes spherical clusters
- Sensitive to initialization
- Affected by outliers
- Struggles with varying cluster sizes
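
Since k must be chosen up front, a common workaround is to fit several values and compare them. The sketch below assumes synthetic blob data and an arbitrary search range of 2 to 6; it reports inertia (for an elbow check) and the silhouette score (higher is better) for each k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups (an assumption for the demo)
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_demo)
    # inertia_ always shrinks as k grows; silhouette peaks near a natural k
    results[k] = (model.inertia_, silhouette_score(X_demo, model.labels_))
    print(f"k={k}: inertia={results[k][0]:.1f}, silhouette={results[k][1]:.3f}")

best_k = max(results, key=lambda k: results[k][1])
print("k with the highest silhouette score:", best_k)
```

Neither measure is definitive. On real data the elbow is often ambiguous, so treat the result as a starting point and check it against domain knowledge.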

### 💻 Code Example

> **Problem**: We have customer data with annual spending and frequency of purchases. Let's use K-Means to segment customers into 3 groups for targeted marketing campaigns.

In [None]:
# Generate sample customer data
np.random.seed(42)
X_customers, _ = make_blobs(n_samples=300, centers=3, n_features=2, 
                          cluster_std=1.5, random_state=42)

# Create a DataFrame for better understanding
customer_data = pd.DataFrame(X_customers, columns=['Annual_Spending', 'Purchase_Frequency'])

# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
customer_clusters = kmeans.fit_predict(X_customers)

# Add cluster labels to our data
customer_data['Cluster'] = customer_clusters

print("🎯 Customer Segmentation Results:")
print(f"Total customers: {len(customer_data)}")
print(f"Number of clusters: {len(np.unique(customer_clusters))}")
print("\n📊 Cluster Summary:")
print(customer_data.groupby('Cluster').agg({
    'Annual_Spending': ['mean', 'std'],
    'Purchase_Frequency': ['mean', 'std']
}).round(2))

🎯 Customer Segmentation Results:
Total customers: 300
Number of clusters: 3

📊 Cluster Summary:
        Annual_Spending       Purchase_Frequency      
                   mean   std               mean   std
Cluster                                               
0                 -2.70  1.30               9.06  1.50
1                 -6.89  1.53              -7.04  1.49
2                  4.80  1.56               2.03  1.39


In [None]:
# Visualize the clustering results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Original data
ax1.scatter(X_customers[:, 0], X_customers[:, 1], alpha=0.6, s=50)
ax1.set_title(' Original Customer Data', fontsize=14, fontweight='bold')
ax1.set_xlabel('Annual Spending ($)')
ax1.set_ylabel('Purchase Frequency (times/year)')
ax1.grid(True, alpha=0.3)

# Clustered data
colors = ['red', 'blue', 'green', 'purple', 'orange']
for i in range(3):
    cluster_points = X_customers[customer_clusters == i]
    ax2.scatter(cluster_points[:, 0], cluster_points[:, 1], 
               c=colors[i], label=f'Cluster {i}', alpha=0.6, s=50)

# Plot centroids
centroids = kmeans.cluster_centers_
ax2.scatter(centroids[:, 0], centroids[:, 1], 
           marker='x', s=300, linewidths=3, color='black', label='Centroids')

ax2.set_title(' K-Means Clustering Results', fontsize=14, fontweight='bold')
ax2.set_xlabel('Annual Spending ($)')
ax2.set_ylabel('Purchase Frequency (times/year)')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

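# Interpret the clusters. The data is synthetic (make_blobs), so the values are
# illustrative only and can be negative. K-Means label order is arbitrary, so
# names are assigned by ranking clusters on mean spending.
print("\nCluster Interpretation:")
cluster_names = ['Low Value', 'Medium Value', 'High Value']
spending_rank = customer_data.groupby('Cluster')['Annual_Spending'].mean().sort_values().index
for name, i in zip(cluster_names, spending_rank):
    cluster_data = customer_data[customer_data['Cluster'] == i]
    avg_spending = cluster_data['Annual_Spending'].mean()
    avg_frequency = cluster_data['Purchase_Frequency'].mean()
    print(f"Cluster {i} ({name}): {len(cluster_data)} customers")
    print(f"  - Average spending: ${avg_spending:.2f}")
    print(f"  - Average frequency: {avg_frequency:.2f} purchases/year")
    print()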

## 2. Hierarchical Clustering

### 🧠 Intuitive Explanation

Think of **Hierarchical Clustering** like building a family tree, but in reverse! Imagine you have a big family reunion with hundreds of relatives:

**Agglomerative (Bottom-up)**: 
- Start with each person as their own "family"
- Find the two most similar people and group them
- Keep combining the most similar groups until everyone is in one big family
- You get a tree showing how everyone is related!

**Divisive (Top-down)**: 
- Start with everyone in one big group
- Keep splitting groups based on differences
- Stop when each person is alone

### ⚙️ How It Works (Mechanism)

Hierarchical clustering builds a **dendrogram** (tree diagram) showing relationships between clusters:

**Key Components:**
1. **Distance Metric**: How to measure similarity (Euclidean, Manhattan, etc.)
2. **Linkage Criteria**: How to measure distance between clusters:
   - **Single**: Minimum distance between any two points
   - **Complete**: Maximum distance between any two points  
   - **Average**: Average distance between all pairs
   - **Ward**: Minimizes variance when merging

**Process (Agglomerative):**
1. Calculate distance matrix between all points
2. Merge closest pair of clusters
3. Update distance matrix
4. Repeat until one cluster remains

### 📝 Pseudo Structure / Workflow

```
ALGORITHM: Agglomerative Hierarchical Clustering
INPUT: Dataset X, linkage method
OUTPUT: Dendrogram, cluster hierarchy

1. INITIALIZE each point as its own cluster
2. CALCULATE distance matrix between all clusters
3. WHILE more than one cluster exists:
   a. FIND pair of clusters with minimum distance
   b. MERGE these clusters
   c. UPDATE distance matrix using linkage method
   d. RECORD merge in dendrogram
4. RETURN complete dendrogram
5. CUT dendrogram at desired level to get k clusters
```
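
Step 5 (cutting the dendrogram) can be done with SciPy's `fcluster`. The sketch below assumes three synthetic blobs and an illustrative cut height of 5.0; it shows both ways of cutting: by a target number of clusters and by a distance threshold.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three tight synthetic blobs of 20 points each
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal([0, 0], 0.3, (20, 2)),
    rng.normal([5, 5], 0.3, (20, 2)),
    rng.normal([0, 5], 0.3, (20, 2)),
])

# Build the full merge history: one row per merge, (n - 1) rows in total
Z = linkage(X_demo, method="ward")

# Option A: ask for at most 3 flat clusters
labels_k = fcluster(Z, t=3, criterion="maxclust")

# Option B: cut the tree at a fixed height (5.0 is an assumed threshold)
labels_d = fcluster(Z, t=5.0, criterion="distance")

print("Linkage matrix shape:", Z.shape)
print("Clusters (maxclust):", len(np.unique(labels_k)))
print("Clusters (distance cut):", len(np.unique(labels_d)))
```

Cutting by distance is useful when the dendrogram shows a clear gap between merge heights; cutting by count is useful when the number of groups is fixed by the application.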

### ✅ Use Cases

• **Phylogenetic Analysis**: Evolutionary relationships between species  
• **Social Network Analysis**: Community detection  
• **Gene Expression**: Grouping genes with similar expression patterns  
• **Document Clustering**: Organizing research papers by topic  
• **Image Segmentation**: Grouping similar pixels  
• **Recommendation Systems**: Finding user groups with similar preferences  

### 💡 Why & When To Use

**Strengths:**
- No need to specify number of clusters beforehand
- Produces a hierarchy of clusters (dendrogram)
- Deterministic results (same input = same output)
- Can capture non-spherical clusters with a suitable linkage (e.g., single linkage)
- Good for small to medium datasets

**Limitations:**
- Computationally expensive: O(n³) time in the naive implementation and O(n²) memory for the distance matrix
- Sensitive to outliers
- Difficult to handle large datasets
- Once merged, cannot undo (greedy approach)
- Choice of linkage method affects results significantly

### 💻 Code Example

> **Problem**: We have data about different wine samples with their chemical properties. Let's use Hierarchical Clustering to group similar wines and create a dendrogram to understand the relationships.

In [None]:
# Load the Wine dataset (178 samples, 13 chemical features, 3 cultivars)
from sklearn.datasets import load_wine

wine = load_wine()
X_wine = wine.data
feature_names = wine.feature_names
true_labels = wine.target

# Standardize the features (important for distance-based algorithms)
scaler = StandardScaler()
X_wine_scaled = scaler.fit_transform(X_wine)

print("Wine Dataset Information:")
print(f"Number of wine samples: {X_wine.shape[0]}")
print(f"Number of chemical features: {X_wine.shape[1]}")
print(f"Features: {feature_names}")
print(f"Original wine types: {len(np.unique(true_labels))}")

In [None]:
# Perform hierarchical clustering with different linkage methods
linkage_methods = ['ward', 'complete', 'average', 'single']

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.ravel()

for idx, method in enumerate(linkage_methods):
    # Calculate linkage matrix
    linkage_matrix = linkage(X_wine_scaled, method=method)
    
    # Create dendrogram
    dendrogram(linkage_matrix, ax=axes[idx], truncate_mode='level', p=5,
              leaf_rotation=90, leaf_font_size=8)
    axes[idx].set_title(f' Dendrogram - {method.capitalize()} Linkage', 
                       fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Sample Index')
    axes[idx].set_ylabel('Distance')

plt.tight_layout()
plt.show()

In [None]:
# Apply hierarchical clustering and compare with K-means
# Use Ward linkage (generally works well)
hierarchical = AgglomerativeClustering(n_clusters=3, linkage='ward')
hier_clusters = hierarchical.fit_predict(X_wine_scaled)

# Compare with K-means
kmeans_wine = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_clusters = kmeans_wine.fit_predict(X_wine_scaled)

# Visualize results using first two features
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# True labels
scatter1 = axes[0].scatter(X_wine[:, 0], X_wine[:, 1], c=true_labels, 
                          cmap='viridis', alpha=0.7, s=50)
axes[0].set_title(' True Wine Types', fontsize=14, fontweight='bold')
axes[0].set_xlabel(feature_names[0])
axes[0].set_ylabel(feature_names[1])
axes[0].grid(True, alpha=0.3)

# Hierarchical clustering results
scatter2 = axes[1].scatter(X_wine[:, 0], X_wine[:, 1], c=hier_clusters, 
                          cmap='viridis', alpha=0.7, s=50)
axes[1].set_title(' Hierarchical Clustering', fontsize=14, fontweight='bold')
axes[1].set_xlabel(feature_names[0])
axes[1].set_ylabel(feature_names[1])
axes[1].grid(True, alpha=0.3)

# K-means clustering results
scatter3 = axes[2].scatter(X_wine[:, 0], X_wine[:, 1], c=kmeans_clusters, 
                          cmap='viridis', alpha=0.7, s=50)
axes[2].set_title(' K-Means Clustering', fontsize=14, fontweight='bold')
axes[2].set_xlabel(feature_names[0])
axes[2].set_ylabel(feature_names[1])
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Measure agreement with the true labels using the Adjusted Rand Index (ARI).
# ARI is not accuracy: it compares two partitions, so it does not depend on
# how the cluster IDs happen to be numbered.
from sklearn.metrics import adjusted_rand_score

hier_ari = adjusted_rand_score(true_labels, hier_clusters)
kmeans_ari = adjusted_rand_score(true_labels, kmeans_clusters)

print("\nClustering Performance Comparison:")
print(f"Hierarchical Clustering ARI Score: {hier_ari:.3f}")
print(f"K-Means Clustering ARI Score: {kmeans_ari:.3f}")
print("\n(ARI Score: 1.0 = perfect match, ~0.0 = chance-level agreement, negative = worse than chance)")

## 3. DBSCAN (Density-Based Spatial Clustering)

### 🧠 Intuitive Explanation

Imagine you're looking at a satellite image of Earth at night, trying to identify cities. **DBSCAN** works like this:

- **Dense areas** (lots of lights close together) = Cities (clusters)
- **Sparse areas** (few scattered lights) = Countryside (noise/outliers)
- You don't need to know how many cities exist beforehand!
- Cities can have any shape (not just circular like K-means assumes)

DBSCAN finds **dense regions** separated by **sparse regions**, making it perfect for:
- Irregularly shaped clusters
- Automatically determining the number of clusters
- Identifying outliers

### ⚙️ How It Works (Mechanism)

DBSCAN uses two key parameters and three types of points:

**Parameters:**
- **eps (ε)**: Maximum distance between two points to be neighbors
- **min_samples**: Minimum points needed to form a dense region

**Point Types:**
1. **Core Point**: Has at least min_samples points within eps distance (scikit-learn counts the point itself)
2. **Border Point**: Within eps of a core point but has < min_samples neighbors
3. **Noise Point**: Neither core nor border (outlier)

**Algorithm Process:**
1. For each point, count neighbors within eps distance
2. Mark core points (≥ min_samples neighbors)
3. Form clusters by connecting core points and their neighbors
4. Assign border points to nearby clusters
5. Mark remaining points as noise

### 📝 Pseudo Structure / Workflow

```
ALGORITHM: DBSCAN
INPUT: Dataset X, eps, min_samples
OUTPUT: Cluster assignments, noise points

1. INITIALIZE all points as unvisited
2. FOR each unvisited point P:
   a. MARK P as visited
   b. FIND all neighbors of P within eps distance
   c. IF neighbors < min_samples:
      - MARK P as noise (for now)
   d. ELSE:
      - CREATE new cluster
      - ADD P to cluster
      - FOR each neighbor N:
         - IF N is unvisited: recursively expand
         - IF N not in any cluster: add to current cluster
3. RETURN clusters and noise points
```
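
The three point types can be recovered from a fitted scikit-learn `DBSCAN` model. The sketch below uses a tiny hand-made dataset (a 3×3 grid, one point just off its edge, and one far-away point); the `eps` and `min_samples` values are chosen for this toy layout only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Dense 3x3 grid (spacing 0.5), one point near the grid, one isolated point
grid = np.array([[x, y] for x in (0.0, 0.5, 1.0) for y in (0.0, 0.5, 1.0)])
X_demo = np.vstack([grid, [[1.6, 1.0]], [[5.0, 5.0]]])

db = DBSCAN(eps=0.75, min_samples=4).fit(X_demo)

# Core points are listed explicitly; noise points carry the label -1
core_mask = np.zeros(len(X_demo), dtype=bool)
core_mask[db.core_sample_indices_] = True
noise_mask = db.labels_ == -1
# Border points belong to a cluster but are not core points
border_mask = ~core_mask & ~noise_mask

print("Labels:", db.labels_)
print("Core:", core_mask.sum(), "Border:", border_mask.sum(), "Noise:", noise_mask.sum())
# → Core: 9 Border: 1 Noise: 1
```

The point at (1.6, 1.0) is within `eps` of the core point (1.0, 1.0) but has too few neighbors of its own, so it joins the cluster as a border point. The point at (5.0, 5.0) is reachable from nothing and is labeled noise.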

### ✅ Use Cases

• **Anomaly Detection**: Fraud detection, network intrusion  
• **Image Processing**: Segmenting objects in images  
• **Geolocation Analysis**: Finding hotspots in GPS data  
• **Customer Behavior**: Identifying unusual purchase patterns  
• **Bioinformatics**: Protein structure analysis  
• **Social Media**: Detecting communities in networks  
• **Quality Control**: Finding defective products  

### 💡 Why & When To Use

**Strengths:**
- Finds clusters of arbitrary shape
- Automatically determines number of clusters
- Robust to outliers (marks them as noise)
- No need to specify cluster centers
- Works well with non-spherical data

**Limitations:**
- Sensitive to hyperparameters (eps, min_samples)
- Struggles with clusters of different densities
- Performance depends on distance metric choice
- Can be sensitive to data scaling
- May struggle with high-dimensional data
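
Because results depend so heavily on `eps`, a common heuristic is to inspect the sorted distance from each point to its k-th nearest neighbor and pick `eps` near the "knee" of that curve. The sketch below is one way to compute those distances; the synthetic data and `min_samples = 5` are assumptions for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Two dense blobs plus uniformly scattered background points
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal([2, 2], 0.5, (50, 2)),
    rng.normal([8, 6], 0.5, (50, 2)),
    rng.uniform(0, 12, (15, 2)),
])

min_samples = 5
# When the training data itself is queried, each point is its own nearest
# neighbor (distance 0), matching how scikit-learn's DBSCAN counts min_samples
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

# Distance to the farthest of the min_samples neighbors, sorted ascending
k_dist = np.sort(distances[:, -1])

for q in (50, 75, 90, 95):
    print(f"{q}th percentile of k-distance: {np.percentile(k_dist, q):.2f}")
```

Plotting `k_dist` (for example with `plt.plot(k_dist)`) usually shows a flat region for points inside dense areas followed by a sharp rise for outliers. A value of `eps` just before the rise is a reasonable first guess, not a guaranteed optimum.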

### 💻 Code Example

> **Problem**: We have GPS coordinates of crime incidents in a city. Let's use DBSCAN to identify crime hotspots (dense areas) and isolated incidents (noise), without knowing beforehand how many hotspots exist.

In [None]:
# Generate synthetic crime data with different shaped clusters and noise
np.random.seed(42)

# Create irregular shaped clusters (crime hotspots)
# Hotspot 1: Dense circular area (downtown)
cluster1 = np.random.multivariate_normal([2, 2], [[0.5, 0], [0, 0.5]], 50)

# Hotspot 2: Elongated area (along a highway)
cluster2 = np.random.multivariate_normal([8, 6], [[2, 1.5], [1.5, 1]], 40)

# Hotspot 3: Small dense area (near a mall)
cluster3 = np.random.multivariate_normal([5, 10], [[0.3, 0], [0, 0.8]], 30)

# Add random noise points (isolated crimes)
noise = np.random.uniform(0, 12, (25, 2))

# Combine all data
X_crime = np.vstack([cluster1, cluster2, cluster3, noise])

print(" Crime Data Information:")
print(f"Total crime incidents: {len(X_crime)}")
print(f"Expected hotspots: 3")
print(f"Expected noise points: 25")

# Visualize original data
plt.figure(figsize=(10, 8))
plt.scatter(X_crime[:, 0], X_crime[:, 1], alpha=0.6, s=50)
plt.title(' Crime Incidents in City (GPS Coordinates)', fontsize=14, fontweight='bold')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Apply DBSCAN with different parameter settings
dbscan_params = [
    {'eps': 0.5, 'min_samples': 5},   # Strict parameters
    {'eps': 1.0, 'min_samples': 5},   # Medium parameters
    {'eps': 1.5, 'min_samples': 3},   # Loose parameters
]

fig, axes = plt.subplots(1, 3, figsize=(20, 6))

for idx, params in enumerate(dbscan_params):
    # Apply DBSCAN
    dbscan = DBSCAN(eps=params['eps'], min_samples=params['min_samples'])
    crime_clusters = dbscan.fit_predict(X_crime)
    
    # Count clusters and noise points
    n_clusters = len(set(crime_clusters)) - (1 if -1 in crime_clusters else 0)
    n_noise = list(crime_clusters).count(-1)
    
    # Create colors for visualization
    unique_labels = set(crime_clusters)
    colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
    
    # Plot results
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Noise points in black
            col = 'black'
            marker = 'x'
            size = 50
            label = f'Noise ({n_noise} points)'
        else:
            marker = 'o'
            size = 60
            label = f'Hotspot {k}'
        
        class_member_mask = (crime_clusters == k)
        xy = X_crime[class_member_mask]
        axes[idx].scatter(xy[:, 0], xy[:, 1], c=[col], marker=marker, s=size, 
                         alpha=0.7, label=label)
    
    axes[idx].set_title(f'🎯 DBSCAN Results\neps={params["eps"]}, min_samples={params["min_samples"]}\n'
                       f'Clusters: {n_clusters}, Noise: {n_noise}', 
                       fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Longitude')
    axes[idx].set_ylabel('Latitude')
    axes[idx].legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n Parameter Impact Analysis:")
for idx, params in enumerate(dbscan_params):
    dbscan = DBSCAN(eps=params['eps'], min_samples=params['min_samples'])
    crime_clusters = dbscan.fit_predict(X_crime)
    n_clusters = len(set(crime_clusters)) - (1 if -1 in crime_clusters else 0)
    n_noise = list(crime_clusters).count(-1)
    
    print(f"Configuration {idx+1}: eps={params['eps']}, min_samples={params['min_samples']}")
    print(f"  - Found {n_clusters} hotspots")
    print(f"  - {n_noise} isolated incidents")
    print(f"  - {len(X_crime) - n_noise} incidents in hotspots")
    print()

In [None]:
# Compare DBSCAN with K-means on the same data
# Use the medium setting from the parameter sweep above (eps=1.0, min_samples=5)
dbscan_optimal = DBSCAN(eps=1.0, min_samples=5)
dbscan_labels = dbscan_optimal.fit_predict(X_crime)

# K-means with 3 clusters (we know there should be 3 hotspots)
kmeans_crime = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_labels = kmeans_crime.fit_predict(X_crime)

# Visualization comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))

# DBSCAN results
unique_dbscan = set(dbscan_labels)
colors_db = plt.cm.Spectral(np.linspace(0, 1, len(unique_dbscan)))

for k, col in zip(unique_dbscan, colors_db):
    if k == -1:
        col = 'black'
        marker = 'x'
        alpha = 0.5
        label = 'Noise'
    else:
        marker = 'o'
        alpha = 0.7
        label = f'Hotspot {k}'
    
    class_mask = (dbscan_labels == k)
    ax1.scatter(X_crime[class_mask, 0], X_crime[class_mask, 1], 
               c=[col], marker=marker, s=60, alpha=alpha, label=label)

ax1.set_title(' DBSCAN: Finds Irregular Hotspots + Noise', fontsize=14, fontweight='bold')
ax1.set_xlabel('Longitude')
ax1.set_ylabel('Latitude')
ax1.legend()
ax1.grid(True, alpha=0.3)

# K-means results
colors_km = ['red', 'blue', 'green']
for i in range(3):
    cluster_mask = (kmeans_labels == i)
    ax2.scatter(X_crime[cluster_mask, 0], X_crime[cluster_mask, 1], 
               c=colors_km[i], s=60, alpha=0.7, label=f'Cluster {i}')

# Plot K-means centroids
centroids = kmeans_crime.cluster_centers_
ax2.scatter(centroids[:, 0], centroids[:, 1], marker='X', s=200, 
           c='black', linewidths=3, label='Centroids')

ax2.set_title(' K-Means: Assumes Spherical Clusters', fontsize=14, fontweight='bold')
ax2.set_xlabel('Longitude')
ax2.set_ylabel('Latitude')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary comparison
n_dbscan_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
n_dbscan_noise = list(dbscan_labels).count(-1)

print("\n Algorithm Comparison Summary:")
print("="*50)
print(f"DBSCAN Results:")
print(f"  - Hotspots found: {n_dbscan_clusters}")
print(f"  - Noise points: {n_dbscan_noise}")
print(f"  - Can handle irregular shapes: ✅")
print(f"  - Automatically finds clusters: ✅")
print()
print("K-Means Results:")
print("  - Clusters found: 3 (predefined)")
print("  - Noise points: 0 (assigns everything)")
print("  - Can handle irregular shapes: ❌")
print("  - Automatically finds clusters: ❌ (k must be specified)")
print()
print("DBSCAN is better for this crime hotspot analysis!")

## 4. Principal Component Analysis (PCA)

### 🧠 Intuitive Explanation

Imagine you're a photographer trying to capture the essence of a 3D sculpture in a 2D photo. **PCA** is like finding the best angle to take that photo:

- You want to capture **the most important features** while losing as little information as possible
- You rotate your camera to find the angle that shows **maximum variation** in the sculpture
- Some details will be lost, but you keep the **essential characteristics**

**Real-world analogy**: Imagine describing people using 100 characteristics (height, weight, age, income, etc.). PCA finds that maybe just 10 "super-characteristics" (like "size", "wealth", "lifestyle") can capture most of what makes people different from each other.

### ⚙️ How It Works (Mechanism)

PCA finds new axes (Principal Components) that capture maximum variance in your data:

**Key Concepts:**
- **Variance**: How spread out data points are (PCA treats high-variance directions as the most informative)
- **Principal Components**: New axes ordered by importance (PC1 captures most variance)
- **Eigenvalues**: How much variance each component captures
- **Eigenvectors**: The direction of each principal component

**Mathematical Process:**
1. **Standardize** data (mean=0, std=1) to give equal weight to all features
2. **Calculate covariance matrix** showing how features relate to each other
3. **Find eigenvalues and eigenvectors** of the covariance matrix
4. **Sort** by eigenvalues (largest first = most important components)
5. **Keep top k components** that explain the desired share of variance (e.g., 95%)
6. **Transform** the standardized data by projecting it onto those k components

### 📝 Pseudo Structure / Workflow

```
ALGORITHM: Principal Component Analysis (PCA)
INPUT: Dataset X (n_samples × n_features)
OUTPUT: Transformed data, principal components

1. STANDARDIZE data: X_std = (X - mean) / std
2. COMPUTE covariance matrix: C = (X_std^T × X_std) / (n-1)
3. FIND eigenvalues λ and eigenvectors v of C
4. SORT eigenvalues in descending order
5. SELECT top k eigenvectors (principal components)
6. CREATE projection matrix W from selected eigenvectors
7. TRANSFORM data: X_pca = X_std × W
8. RETURN transformed data and components
```
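
The eight steps above translate almost line for line into NumPy. This is a sketch on made-up data (two hidden factors mixed into five features), followed by a sanity check against scikit-learn's `PCA`. Component signs are arbitrary, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 2 latent factors mixed into 5 observed features, plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# 1. Standardize
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
# 2. Covariance matrix (np.cov divides by n - 1)
C = np.cov(X_std, rowvar=False)
# 3. Eigen-decomposition (eigh is meant for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(C)
# 4. Sort by eigenvalue, largest first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 5-6. Keep the top k eigenvectors as the projection matrix W
k = 2
W = eigvecs[:, :k]
# 7. Project the standardized data
X_pca = X_std @ W
explained_ratio = eigvals / eigvals.sum()

# Sanity check against scikit-learn
sk = PCA(n_components=k).fit(X_std)
print("Explained variance ratio:", explained_ratio[:k].round(3))
print("Ratios match sklearn:", np.allclose(explained_ratio[:k], sk.explained_variance_ratio_))
print("Projections match up to sign:", np.allclose(np.abs(X_pca), np.abs(sk.transform(X_std))))
```

scikit-learn computes the same result through a singular value decomposition rather than an explicit covariance matrix, which is numerically more stable for wide datasets.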

### ✅ Use Cases

• **Dimensionality Reduction**: Reduce features before machine learning  
• **Data Visualization**: Plot high-dimensional data in 2D/3D  
• **Image Compression**: Reduce image file sizes  
• **Face Recognition**: Extract key facial features  
• **Gene Analysis**: Find patterns in genetic data  
• **Stock Market**: Identify market factors  
• **Noise Reduction**: Remove less important variations  

### 💡 Why & When To Use

**Strengths:**
- Reduces computational complexity
- Removes correlated features
- Helps visualize high-dimensional data
- Can reduce overfitting in ML models
- Preserves the maximum variance achievable by any linear projection onto k dimensions
- Fast and deterministic

**Limitations:**
- Components are hard to interpret (linear combinations)
- Assumes linear relationships
- Sensitive to feature scaling
- May lose important information
- Not suitable for sparse data
- Requires choosing number of components
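
One way to sidestep picking the number of components by hand is to give scikit-learn a variance target instead. In the sketch below, passing a float between 0 and 1 as `n_components` asks `PCA` to keep just enough components to reach that share of variance. The synthetic data (3 hidden factors behind 10 features) is an assumption for the demo.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 3 latent factors spread across 10 noisy features
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.2 * rng.normal(size=(300, 10))

X_scaled = StandardScaler().fit_transform(X_demo)

# A float n_components is a variance target; svd_solver="full" supports it
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

# inverse_transform maps back to the original space, showing what was lost
X_restored = pca.inverse_transform(X_reduced)
reconstruction_mse = np.mean((X_scaled - X_restored) ** 2)

print("Components kept:", pca.n_components_)
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.3f}")
print(f"Reconstruction MSE: {reconstruction_mse:.4f}")
```

The reconstruction error gives a direct measure of the "may lose important information" limitation: if it is too large for the task, raise the variance target.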

### 💻 Code Example

> **Problem**: We have a dataset with many features about house characteristics (size, rooms, age, location factors, etc.). Let's use PCA to reduce dimensions while preserving the most important information for visualization and analysis.

In [None]:
# Create a synthetic high-dimensional dataset (house characteristics)
np.random.seed(42)
n_samples = 300

# Generate correlated features (realistic for house data)
# Primary factors that determine house characteristics
size_factor = np.random.normal(0, 1, n_samples)  # Overall house size
quality_factor = np.random.normal(0, 1, n_samples)  # Build quality
location_factor = np.random.normal(0, 1, n_samples)  # Location desirability

# Create 12 house features based on underlying factors (with noise)
house_data = {
    'sqft': 2000 + 500 * size_factor + np.random.normal(0, 100, n_samples),
    'bedrooms': 3 + 0.8 * size_factor + np.random.normal(0, 0.5, n_samples),
    'bathrooms': 2 + 0.6 * size_factor + 0.3 * quality_factor + np.random.normal(0, 0.3, n_samples),
    'garage_spaces': 2 + 0.4 * size_factor + np.random.normal(0, 0.4, n_samples),
    'lot_size': 8000 + 2000 * size_factor + np.random.normal(0, 500, n_samples),
    'age': 20 - 5 * quality_factor + np.random.normal(0, 5, n_samples),
    'renovation_score': 3 + quality_factor + np.random.normal(0, 0.5, n_samples),
    'appliance_quality': 3 + quality_factor + np.random.normal(0, 0.4, n_samples),
    'school_rating': 7 + location_factor + np.random.normal(0, 0.5, n_samples),
    'crime_safety': 6 + location_factor + np.random.normal(0, 0.4, n_samples),
    'walkability': 5 + 0.7 * location_factor + np.random.normal(0, 0.6, n_samples),
    'commute_time': 30 - 5 * location_factor + np.random.normal(0, 5, n_samples)
}

# Convert to DataFrame
df_houses = pd.DataFrame(house_data)

# Add house categories for coloring (based on overall desirability)
overall_score = size_factor + quality_factor + location_factor
df_houses['house_type'] = pd.cut(overall_score, bins=3, labels=['Basic', 'Standard', 'Premium'])

print("House Dataset Information:")
print(f"Number of houses: {len(df_houses)}")
print(f"Number of features: {len(df_houses.columns) - 1}")
print("\n Feature Summary:")
print(df_houses.describe().round(2))

print("\n House Type Distribution:")
print(df_houses['house_type'].value_counts())

In [None]:
# Prepare data for PCA
X_houses = df_houses.drop('house_type', axis=1)
house_types = df_houses['house_type']

# Standardize features (crucial for PCA!)
scaler = StandardScaler()
X_houses_scaled = scaler.fit_transform(X_houses)

print(" Data Standardization:")
print(f"Original data shape: {X_houses.shape}")
print(f"Standardized data shape: {X_houses_scaled.shape}")
print(f"Mean after scaling: {X_houses_scaled.mean(axis=0).round(3)}")
print(f"Std after scaling: {X_houses_scaled.std(axis=0).round(3)}")

In [None]:
# Apply PCA and analyze components
pca_full = PCA()
X_pca_full = pca_full.fit_transform(X_houses_scaled)

# Calculate explained variance
explained_variance_ratio = pca_full.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)

# Visualization of explained variance
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Individual component variance
ax1.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio, 
        alpha=0.7, color='steelblue')
ax1.set_title(' Variance Explained by Each Principal Component', 
              fontsize=14, fontweight='bold')
ax1.set_xlabel('Principal Component')
ax1.set_ylabel('Explained Variance Ratio')
ax1.grid(True, alpha=0.3)

# Add percentage labels on bars
for i, v in enumerate(explained_variance_ratio):
    ax1.text(i + 1, v + 0.01, f'{v*100:.1f}%', ha='center', va='bottom')

# Cumulative variance
ax2.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, 
         'bo-', linewidth=2, markersize=8)
ax2.axhline(y=0.95, color='red', linestyle='--', label='95% Variance')
ax2.axhline(y=0.90, color='orange', linestyle='--', label='90% Variance')
ax2.set_title(' Cumulative Explained Variance', fontsize=14, fontweight='bold')
ax2.set_xlabel('Number of Components')
ax2.set_ylabel('Cumulative Explained Variance')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Add percentage labels
for i, v in enumerate(cumulative_variance):
    if i % 2 == 0:  # Show every other label to avoid crowding
        ax2.text(i + 1, v + 0.02, f'{v*100:.1f}%', ha='center', va='bottom')

plt.tight_layout()
plt.show()

# Find number of components for different variance thresholds
n_90 = np.argmax(cumulative_variance >= 0.90) + 1
n_95 = np.argmax(cumulative_variance >= 0.95) + 1

print("\n PCA Analysis Results:")
print(f"Original dimensions: {X_houses.shape[1]}")
print(f"Components for 90% variance: {n_90} (reduction: {((X_houses.shape[1] - n_90) / X_houses.shape[1] * 100):.1f}%)")
print(f"Components for 95% variance: {n_95} (reduction: {((X_houses.shape[1] - n_95) / X_houses.shape[1] * 100):.1f}%)")
print(f"\nTop 3 components explain: {cumulative_variance[2]*100:.1f}% of variance")

In [None]:
# Analyze what each principal component represents
# Get component loadings (how much each original feature contributes)
component_matrix = pca_full.components_
feature_names = X_houses.columns

# Create a heatmap of component loadings
plt.figure(figsize=(12, 8))
sns.heatmap(component_matrix[:4], 
            xticklabels=feature_names,
            yticklabels=[f'PC{i+1}' for i in range(4)],
            cmap='RdBu_r', center=0, annot=True, fmt='.2f',
            cbar_kws={'label': 'Component Loading'})
plt.title(' Principal Component Loadings (Top 4 Components)', 
          fontsize=14, fontweight='bold')
plt.xlabel('Original Features')
plt.ylabel('Principal Components')
plt.tight_layout()
plt.show()

# Interpret the components
print("\n Component Interpretation:")
print("="*50)
for i in range(min(4, len(component_matrix))):
    print(f"\n**Principal Component {i+1}** (explains {explained_variance_ratio[i]*100:.1f}% variance):")
    
    # Find features with highest absolute loadings
    loadings = component_matrix[i]
    feature_importance = list(zip(feature_names, loadings))
    feature_importance.sort(key=lambda x: abs(x[1]), reverse=True)
    
    print("  Most influential features:")
    for feat, loading in feature_importance[:5]:
        direction = "positively" if loading > 0 else "negatively"
        print(f"    - {feat}: {loading:.3f} ({direction} correlated)")
    
    # Provide interpretation
    # These labels are illustrative guesses; check them against the loadings printed above
    if i == 0:
        print("  Interpretation: Likely represents 'Overall House Size/Quality'")
    elif i == 1:
        print("  Interpretation: Likely represents 'Location Desirability'")
    elif i == 2:
        print("  Interpretation: Likely represents 'House Age vs. Quality'")
    else:
        print("  Interpretation: Secondary factors or noise")

In [None]:
# Visualize data in reduced dimensions
# Apply PCA with different numbers of components
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_houses_scaled)

pca_3d = PCA(n_components=3)
X_pca_3d = pca_3d.fit_transform(X_houses_scaled)

# Create visualizations
fig = plt.figure(figsize=(18, 6))

# 2D PCA visualization
ax1 = plt.subplot(131)
colors = {'Basic': 'red', 'Standard': 'blue', 'Premium': 'green'}
for house_type in colors.keys():
    mask = house_types == house_type
    ax1.scatter(X_pca_2d[mask, 0], X_pca_2d[mask, 1], 
               c=colors[house_type], label=house_type, alpha=0.7, s=50)

ax1.set_title(f'2D PCA Visualization\n({pca_2d.explained_variance_ratio_.sum()*100:.1f}% variance explained)', 
              fontsize=12, fontweight='bold')
ax1.set_xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}%)')
ax1.set_ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}%)')
ax1.legend()
ax1.grid(True, alpha=0.3)

# 3D PCA visualization
ax2 = plt.subplot(132, projection='3d')
for house_type in colors.keys():
    mask = house_types == house_type
    ax2.scatter(X_pca_3d[mask, 0], X_pca_3d[mask, 1], X_pca_3d[mask, 2],
               c=colors[house_type], label=house_type, alpha=0.7, s=30)

ax2.set_title(f'3D PCA Visualization\n({pca_3d.explained_variance_ratio_.sum()*100:.1f}% variance explained)', 
              fontsize=12, fontweight='bold')
ax2.set_xlabel(f'PC1 ({pca_3d.explained_variance_ratio_[0]*100:.1f}%)')
ax2.set_ylabel(f'PC2 ({pca_3d.explained_variance_ratio_[1]*100:.1f}%)')
ax2.set_zlabel(f'PC3 ({pca_3d.explained_variance_ratio_[2]*100:.1f}%)')
ax2.legend()

# Comparison with original 2D projection (first 2 features)
ax3 = plt.subplot(133)
for house_type in colors.keys():
    # Convert to a plain boolean array so .iloc accepts the mask even if house_types is a Series
    mask = np.asarray(house_types == house_type)
    ax3.scatter(X_houses.iloc[mask, 0], X_houses.iloc[mask, 1], 
               c=colors[house_type], label=house_type, alpha=0.7, s=50)

ax3.set_title('Original Features\n(sqft vs bedrooms)', fontsize=12, fontweight='bold')
ax3.set_xlabel('Square Feet')
ax3.set_ylabel('Bedrooms')
ax3.legend()
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✨ PCA Benefits Demonstrated:")
print(f"Dimensionality: Reduced from {X_houses.shape[1]} to 2-3 dimensions")
print(f"Information Preserved: {pca_2d.explained_variance_ratio_.sum()*100:.1f}% with 2D, {pca_3d.explained_variance_ratio_.sum()*100:.1f}% with 3D")
print("Visualization: High-dimensional house data can now be plotted in 2D or 3D")
print("Patterns: Check the plots above for separation between house types in the reduced space")
print(f"Efficiency: Feature count reduced by {(1 - 2/X_houses.shape[1])*100:.1f}% (2D) for downstream ML models")

## 5. t-SNE (t-Distributed Stochastic Neighbor Embedding)

### 🧠 Intuitive Explanation

Imagine you have a **giant sculpture** that you need to **flatten into a painting** while preserving how all the parts relate to each other:

**t-SNE** is like having a magical artist who:
- Looks at which parts of the sculpture are **close neighbors**
- Tries to keep those **same neighbor relationships** in the flat painting  
- **Stretches apart** different groups so you can see them clearly
- **Compresses** similar groups together

Unlike PCA (which finds straight-line projections), t-SNE can:
- Handle **curved and complex relationships**
- Produce **visually well-separated clusters**
- Reveal **hidden patterns** that linear methods miss

**Key insight**: t-SNE focuses on preserving **local neighborhoods** rather than global distances!

### ⚙️ How It Works (Mechanism)

t-SNE uses probability distributions to maintain neighborhood relationships:

**Two-Step Process:**

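1. **High-Dimensional Space**: 
   - Calculate the similarity between every pair of points using a Gaussian kernel centered on each point
   - Convert similarities to probabilities (closer points = higher probability)
   - The **perplexity** parameter sets the effective number of neighbors per point by tuning each point's Gaussian bandwidth σ

2. **Low-Dimensional Space**:
   - Use a Student t-distribution (heavy tails) for similarities in the embedding space, so moderately distant points can sit far apart without a large penalty
   - Start with random low-D positions
   - Gradually adjust positions so the low-D probabilities match the high-D probabilities
   - Use gradient descent to minimize the **Kullback-Leibler divergence** between the two distributions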

**Key Parameters:**
- **Perplexity**: Effective neighborhood size, balancing local vs global structure (5-50 typical; must be smaller than the number of samples)
- **Learning Rate**: Step size for position updates (100-1000 typical)
- **Iterations**: Number of optimization steps (1000+ recommended)
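
To make the role of perplexity concrete, here is a minimal sketch of how a bandwidth σ can be tuned for a single point so that its Gaussian neighborhood has a chosen perplexity. The helper names (`conditional_probs`, `perplexity`, `sigma_for_perplexity`) and the random data are assumptions made for illustration; library implementations do the same search for every point, with more careful numerics.

```python
import numpy as np

def conditional_probs(sq_dists, sigma):
    """Gaussian similarities p_(j|i) from one point to all the others."""
    logits = -sq_dists / (2.0 * sigma ** 2)
    logits = logits - logits.max()          # keep exp() numerically stable
    weights = np.exp(logits)
    return weights / weights.sum()

def perplexity(probs):
    """Perplexity = 2 ** entropy (in bits): the effective number of neighbors."""
    nonzero = probs[probs > 0]
    return 2.0 ** (-(nonzero * np.log2(nonzero)).sum())

def sigma_for_perplexity(sq_dists, target, n_steps=60):
    """Binary search for the bandwidth whose perplexity matches the target."""
    low, high = 1e-6, 1e3
    for _ in range(n_steps):
        sigma = (low + high) / 2.0
        if perplexity(conditional_probs(sq_dists, sigma)) > target:
            high = sigma      # neighborhood too wide: shrink sigma
        else:
            low = sigma       # neighborhood too narrow: grow sigma
    return (low + high) / 2.0

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))                         # hypothetical data
sq_dists_0 = ((X_demo[1:] - X_demo[0]) ** 2).sum(axis=1)   # point 0 to the other 99

for target in (5, 30):
    sigma = sigma_for_perplexity(sq_dists_0, target)
    achieved = perplexity(conditional_probs(sq_dists_0, sigma))
    print(f"target perplexity {target:>2}: sigma = {sigma:.3f}, achieved = {achieved:.2f}")
```

A larger target perplexity leads to a larger σ, which is why high perplexity values make each point "listen" to more distant neighbors.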

### 📝 Pseudo Structure / Workflow

```
ALGORITHM: t-SNE
INPUT: High-dimensional data X, perplexity, learning_rate
OUTPUT: Low-dimensional embedding Y

1. COMPUTE pairwise distances in high-D space
2. CONVERT distances to probabilities using Gaussian:
   - p_ij = similarity between points i and j
   - Use perplexity to determine σ (bandwidth)
3. INITIALIZE random positions Y in low-D space
4. FOR each iteration:
   a. COMPUTE probabilities in low-D using t-distribution
   b. CALCULATE gradient of KL divergence
   c. UPDATE positions Y using gradient descent
   d. APPLY momentum for stability
5. RETURN final low-dimensional embedding Y
```
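
The workflow above can be sketched in plain NumPy. This is a deliberately simplified version, not the scikit-learn implementation: it assumes one fixed bandwidth `sigma` for all points (real t-SNE tunes σ per point from the perplexity), skips early exaggeration, and uses hypothetical demo data and hand-picked step sizes.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distance between every pair of rows."""
    sq = (A ** 2).sum(axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * A @ A.T, 0.0)

def simple_tsne(X, sigma=1.0, n_iter=300, learning_rate=10.0, momentum=0.5, seed=0):
    n = X.shape[0]
    # Steps 1-2: Gaussian affinities in high-D, symmetrized into joint probabilities
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)          # conditional p_(j|i)
    P = np.maximum((P + P.T) / (2.0 * n), 1e-12)

    # Step 3: small random starting positions in 2-D
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-4, size=(n, 2))
    velocity = np.zeros_like(Y)
    kl_history = []

    for _ in range(n_iter):
        # Step 4a: Student-t affinities (one degree of freedom) in low-D
        W = 1.0 / (1.0 + pairwise_sq_dists(Y))
        np.fill_diagonal(W, 0.0)
        Q = np.maximum(W / W.sum(), 1e-12)
        kl_history.append(float((P * np.log(P / Q)).sum()))

        # Step 4b: gradient of KL(P || Q): 4 * sum_j (p_ij - q_ij) * w_ij * (y_i - y_j)
        PQW = (P - Q) * W
        grad = 4.0 * (PQW.sum(axis=1)[:, None] * Y - PQW @ Y)

        # Steps 4c-4d: gradient descent with momentum
        velocity = momentum * velocity - learning_rate * grad
        Y = Y + velocity
        Y -= Y.mean(axis=0)                    # keep the embedding centered
    return Y, kl_history

rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.5, size=(30, 5)),
                    rng.normal(3.0, 0.5, size=(30, 5))])   # two hypothetical blobs
Y_demo, kl_history = simple_tsne(X_demo)
print(f"KL divergence: {kl_history[0]:.3f} -> {kl_history[-1]:.3f}")
```

For real work, prefer `sklearn.manifold.TSNE`, which adds per-point bandwidths, early exaggeration, and the faster Barnes-Hut approximation.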

### ✅ Use Cases

• **Data Visualization**: Explore high-dimensional datasets  
• **Single Cell Analysis**: Visualize gene expression patterns  
• **Image Recognition**: Visualize learned features  
• **Natural Language Processing**: Visualize word embeddings  
• **Customer Segmentation**: Reveal hidden customer groups  
• **Genomics**: Visualize genetic variations  
• **Social Network Analysis**: Visualize community structures  

### 💡 Why & When To Use

**Strengths:**
- Excellent for visualization (2D/3D)
- Reveals complex non-linear patterns
- Great cluster separation
- Handles curved manifolds well
- Creates beautiful, interpretable plots
- Works well with various data types

**Limitations:**
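- Computationally expensive: O(n²) for the exact algorithm; the Barnes-Hut approximation (scikit-learn's default) is O(n log n) but still slow on large datasets
- Non-deterministic: different random seeds give different embeddings (fix `random_state` for repeatable runs)
- Cannot embed new points easily (there is no `transform` step for unseen data)
- Global distances not preserved: cluster sizes and the gaps between clusters are not meaningful
- Sensitive to hyperparameters, especially perplexity
- Mainly for visualization (rarely suitable as a preprocessing step in ML pipelines)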
- Can create false clusters
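
A few of these limitations are easy to check directly. The sketch below uses a small hypothetical blob dataset (`X_demo`) and random initialization to show that different seeds produce different layouts, and that scikit-learn's `TSNE` exposes `fit_transform` but no `transform` for new points.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Hypothetical demo data: 150 points, 10 features, 3 groups
X_demo, _ = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)

# Same data, two different seeds (random initialization makes the effect visible)
emb_a = TSNE(n_components=2, perplexity=20, init="random", random_state=0).fit_transform(X_demo)
emb_b = TSNE(n_components=2, perplexity=20, init="random", random_state=1).fit_transform(X_demo)

print("Embeddings identical across seeds:", np.allclose(emb_a, emb_b))

# TSNE only offers fit / fit_transform, so unseen points cannot be projected later
print("Has transform():", hasattr(TSNE(), "transform"))

# The exact O(n^2) algorithm is available via method="exact";
# the default method="barnes_hut" is the O(n log n) approximation
print("Default method:", TSNE().method)
```

If new points must be embedded later, PCA or a parametric model such as an autoencoder is usually a better fit than t-SNE.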

### 💻 Code Example

> **Problem**: We have high-dimensional data from different smartphone user behavior patterns (apps used, usage time, preferences, etc.). Let's use t-SNE to visualize user segments that might not be obvious in the original high-dimensional space.

In [None]:
# Generate synthetic smartphone user behavior data
np.random.seed(42)
n_users = 400

# Define user types with different behavior patterns
user_types = {
    'Business': {
        'email_hours': (6, 1.5),      # High email usage
        'social_hours': (1, 0.5),     # Low social media
        'gaming_hours': (0.5, 0.3),   # Very low gaming
        'productivity_apps': (8, 2),   # Many productivity apps
        'entertainment_apps': (3, 1),  # Few entertainment apps
        'news_hours': (2, 0.5),       # Moderate news consumption
        'calls_per_day': (15, 5),     # Many calls
        'battery_usage': (85, 10),    # High battery usage
    },
    'Social': {
        'email_hours': (1, 0.5),      # Low email
        'social_hours': (6, 2),       # High social media
        'gaming_hours': (2, 1),       # Moderate gaming
        'productivity_apps': (2, 1),   # Few productivity apps
        'entertainment_apps': (12, 3), # Many entertainment apps
        'news_hours': (0.5, 0.3),     # Low news
        'calls_per_day': (8, 3),      # Moderate calls
        'battery_usage': (75, 15),    # Moderate battery
    },
    'Gamer': {
        'email_hours': (0.5, 0.3),    # Very low email
        'social_hours': (3, 1),       # Moderate social
        'gaming_hours': (8, 3),       # Very high gaming
        'productivity_apps': (1, 0.5), # Very few productivity
        'entertainment_apps': (15, 4), # Many entertainment
        'news_hours': (0.3, 0.2),     # Very low news
        'calls_per_day': (3, 2),      # Few calls
        'battery_usage': (95, 8),     # Very high battery
    },
    'Senior': {
        'email_hours': (2, 1),        # Moderate email
        'social_hours': (1, 0.5),     # Low social media
        'gaming_hours': (1, 0.5),     # Low gaming
        'productivity_apps': (3, 1),   # Few apps overall
        'entertainment_apps': (2, 1),  # Few entertainment
        'news_hours': (3, 1),         # High news consumption
        'calls_per_day': (12, 4),     # Many calls
        'battery_usage': (40, 10),    # Low battery usage
    }
}

# Generate data for each user type
all_data = []
all_labels = []
users_per_type = n_users // len(user_types)

for user_type, characteristics in user_types.items():
    for _ in range(users_per_type):
        user_data = []
        for feature, (mean, std) in characteristics.items():
            value = np.random.normal(mean, std)
            user_data.append(max(0, value))  # Ensure non-negative values
        
        all_data.append(user_data)
        all_labels.append(user_type)

# Convert to numpy arrays
X_users = np.array(all_data)
y_users = np.array(all_labels)
feature_names = list(user_types['Business'].keys())

# Create DataFrame for better understanding
df_users = pd.DataFrame(X_users, columns=feature_names)
df_users['user_type'] = y_users

print("Smartphone User Behavior Dataset:")
print(f"Total users: {len(df_users)}")
print(f"Features: {len(feature_names)}")
print(f"User types: {list(user_types.keys())}")
print("\nUser Type Distribution:")
print(df_users['user_type'].value_counts())

print("\nFeature Summary by User Type:")
summary = df_users.groupby('user_type')[feature_names].mean().round(2)
print(summary)

In [None]:
# Standardize the data for t-SNE
scaler = StandardScaler()
X_users_scaled = scaler.fit_transform(X_users)

# Compare PCA vs t-SNE visualization
# First, apply PCA for comparison
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_users_scaled)

# Apply t-SNE with different perplexity values
perplexity_values = [5, 30, 50]
tsne_results = {}

print("Running t-SNE with different perplexity values...")
print("(This may take a moment - t-SNE is computationally intensive)")

for perp in perplexity_values:
    print(f"  Computing t-SNE with perplexity={perp}...")
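    # The iteration count is left at its default (1000): the keyword is n_iter in older
    # scikit-learn releases and max_iter in newer ones, so omitting it works in both
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42, 
                learning_rate=200, verbose=0)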
    tsne_results[perp] = tsne.fit_transform(X_users_scaled)

print("t-SNE computations completed!")

In [None]:
# Visualize PCA vs t-SNE results
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.ravel()

# Color mapping for user types
color_map = {'Business': 'red', 'Social': 'blue', 'Gamer': 'green', 'Senior': 'orange'}
colors = [color_map[label] for label in y_users]

# Plot PCA results
scatter = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=colors, alpha=0.7, s=40)
axes[0].set_title(f'PCA Visualization\n({pca.explained_variance_ratio_.sum()*100:.1f}% variance explained)', 
                  fontsize=12, fontweight='bold')
axes[0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
axes[0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
axes[0].grid(True, alpha=0.3)

# Plot t-SNE results with different perplexity values
for idx, perp in enumerate(perplexity_values):
    ax_idx = idx + 1
    X_tsne = tsne_results[perp]
    
    axes[ax_idx].scatter(X_tsne[:, 0], X_tsne[:, 1], c=colors, alpha=0.7, s=40)
    axes[ax_idx].set_title(f't-SNE (perplexity={perp})', fontsize=12, fontweight='bold')
    axes[ax_idx].set_xlabel('t-SNE 1')
    axes[ax_idx].set_ylabel('t-SNE 2')
    axes[ax_idx].grid(True, alpha=0.3)

# Add legend
legend_elements = [plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=color, 
                             markersize=10, label=user_type) 
                  for user_type, color in color_map.items()]
fig.legend(handles=legend_elements, loc='center', bbox_to_anchor=(0.5, 0.02), ncol=4)

plt.tight_layout()
plt.subplots_adjust(bottom=0.1)
plt.show()

print("\nVisualization Comparison:")
print("="*60)
print("PCA (what to look for):")
print("  - Linear projection preserving maximum variance")
print("  - User types may overlap where two components are not enough")
print("  - Global structure is preserved; local patterns can be less clear")
print()
print("t-SNE (what to look for):")
print("  - Non-linear embedding preserving local neighborhoods")
print("  - Usually tighter, more visually separated groups of user types")
print("  - Different perplexity values show different granularity:")
print("    • Low perplexity (5): Very tight, local clusters")
print("    • Medium perplexity (30): Balanced local-global structure")
print("    • High perplexity (50): Wider neighborhoods, more weight on broader structure")

In [None]:
# Detailed analysis of one t-SNE result (perplexity=30, a commonly used default)
best_tsne = tsne_results[30]

# Calculate cluster quality metrics
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Convert labels to numeric for scoring
label_encoder = {label: idx for idx, label in enumerate(color_map.keys())}
y_numeric = [label_encoder[label] for label in y_users]

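# Calculate scores for original data vs t-SNE embedding
# Caveat: t-SNE exaggerates gaps between groups, so higher scores in the embedding
# measure visual separation, not how separable the original data is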
original_silhouette = silhouette_score(X_users_scaled, y_numeric)
tsne_silhouette = silhouette_score(best_tsne, y_numeric)

original_ch = calinski_harabasz_score(X_users_scaled, y_numeric)
tsne_ch = calinski_harabasz_score(best_tsne, y_numeric)

# Create detailed visualization of the best t-SNE result
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))

# Scatter plot with better styling
for user_type, color in color_map.items():
    mask = y_users == user_type
    ax1.scatter(best_tsne[mask, 0], best_tsne[mask, 1], 
               c=color, label=user_type, alpha=0.7, s=60, edgecolors='black', linewidth=0.5)

ax1.set_title('t-SNE User Behavior Segmentation\n(Perplexity=30)', 
              fontsize=14, fontweight='bold')
ax1.set_xlabel('t-SNE Component 1')
ax1.set_ylabel('t-SNE Component 2')
ax1.legend(title='User Types', loc='best')
ax1.grid(True, alpha=0.3)

# Feature correlation heatmap (original features vs t-SNE axes)
# t-SNE axes have no intrinsic meaning, so treat these correlations as a rough, run-specific guide
feature_correlations = []
for i, feature in enumerate(feature_names):
    corr_comp1 = np.corrcoef(X_users_scaled[:, i], best_tsne[:, 0])[0, 1]
    corr_comp2 = np.corrcoef(X_users_scaled[:, i], best_tsne[:, 1])[0, 1]
    feature_correlations.append([corr_comp1, corr_comp2])

corr_matrix = np.array(feature_correlations)
im = ax2.imshow(corr_matrix, cmap='RdBu_r', aspect='auto', vmin=-1, vmax=1)
ax2.set_title('Feature Correlation with t-SNE Axes', fontsize=14, fontweight='bold')
ax2.set_xlabel('t-SNE Components')
ax2.set_ylabel('Original Features')
ax2.set_xticks([0, 1])
ax2.set_xticklabels(['t-SNE 1', 't-SNE 2'])
ax2.set_yticks(range(len(feature_names)))
ax2.set_yticklabels([name.replace('_', ' ').title() for name in feature_names])

# Add correlation values as text
for i in range(len(feature_names)):
    for j in range(2):
        ax2.text(j, i, f'{corr_matrix[i, j]:.2f}', 
                 ha='center', va='center', fontsize=10, fontweight='bold')

plt.colorbar(im, ax=ax2, label='Correlation')
plt.tight_layout()
plt.show()

# Print analysis summary
print("\nQuantitative Analysis:")
print("="*50)
print(f"Silhouette Score:")
print(f"  Original space: {original_silhouette:.3f}")
print(f"  t-SNE space: {tsne_silhouette:.3f}")
print(f"  Improvement: {((tsne_silhouette - original_silhouette) / original_silhouette * 100):+.1f}%")
print()
print(f"Calinski-Harabasz Score:")
print(f"  Original space: {original_ch:.1f}")
print(f"  t-SNE space: {tsne_ch:.1f}")
print(f"  Improvement: {((tsne_ch - original_ch) / original_ch * 100):+.1f}%")
print()
print("Key Insights:")
print("• Compare the plots to see how well t-SNE separates the 4 user behavior types")
print("• The heatmap shows which features correlate with each t-SNE axis in this run")
print("• t-SNE axes have no fixed meaning, so these correlations change between runs")
print("• Higher scores in t-SNE space reflect tighter visual clusters, not better true separation")
print("• This synthetic data consists of Gaussian clusters, so PCA may separate the groups well too")
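
Because t-SNE aims to preserve local neighborhoods, one way to check an embedding is scikit-learn's `trustworthiness` score, which measures how many of each point's neighbors in the embedding were also neighbors in the original space (1.0 is best). The sketch below is an illustration on hypothetical blob data rather than the smartphone dataset.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

# Hypothetical demo data: 200 points, 8 features, 4 groups
X_demo, _ = make_blobs(n_samples=200, n_features=8, centers=4, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

emb_pca = PCA(n_components=2).fit_transform(X_demo)
emb_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_demo)

# Fraction-style score in [0, 1]: are embedded neighbors true neighbors?
trust_pca = trustworthiness(X_demo, emb_pca, n_neighbors=10)
trust_tsne = trustworthiness(X_demo, emb_tsne, n_neighbors=10)

print(f"Trustworthiness (PCA, 2D):   {trust_pca:.3f}")
print(f"Trustworthiness (t-SNE, 2D): {trust_tsne:.3f}")
```

Unlike silhouette scores computed on the embedding, this metric compares the embedding against the original data, so it is less easily inflated by t-SNE's tendency to pull groups apart.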

## 6. Autoencoders (Deep Learning Approach)

### 🧠 Intuitive Explanation

Imagine you have a **magic photocopier** that works in a special way:

1. **Encoder**: Takes your original document and compresses it into a **tiny summary** (like creating a Twitter summary of a book)
2. **Decoder**: Takes that tiny summary and tries to **recreate the original document**
3. **Training**: You keep adjusting the photocopier until it gets really good at recreating documents from summaries

**Autoencoder** works similarly:
- **Input**: Original high-dimensional data (like 784 pixels in an image)
- **Encoder**: Compresses to lower dimensions (like 64 numbers)
- **Decoder**: Tries to reconstruct the original from the compressed version
- **Goal**: Learn the most important features for reconstruction

The magic happens in the **bottleneck** (compressed representation) - it captures the essence of your data!

### ⚙️ How It Works (Mechanism)

Autoencoders use neural networks to learn efficient data representations:

**Architecture:**
- **Input Layer**: Original data dimensions (e.g., 100 features)
- **Encoder**: Gradually reduces dimensions (100 → 50 → 25 → 10)
- **Bottleneck/Latent Space**: Smallest layer (e.g., 10 dimensions)
- **Decoder**: Gradually increases dimensions (10 → 25 → 50 → 100)
- **Output Layer**: Reconstructed data (same size as input)

**Training Process:**
1. **Forward Pass**: Input → Encoder → Bottleneck → Decoder → Reconstruction
2. **Loss Calculation**: Compare reconstruction with original (MSE, BCE, etc.)
3. **Backpropagation**: Adjust weights to minimize reconstruction error
4. **Repeat**: Until the autoencoder learns good representations

**Types of Autoencoders:**
- **Vanilla**: Basic compression/reconstruction
- **Sparse**: Forces most neurons to be inactive (sparse representations)
- **Denoising**: Learns to remove noise from data
- **Variational (VAE)**: Learns probability distributions
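
As a small illustration of the denoising variant, the sketch below trains scikit-learn's `MLPRegressor` to map noisy inputs to clean targets. The data, noise level, and layer sizes are assumptions chosen for the demo; a real denoising autoencoder would normally be built in TensorFlow or PyTorch.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical clean data: 10 features generated from 2 latent factors
latent = rng.normal(size=(600, 2))
X_clean = np.tanh(latent @ rng.normal(size=(2, 10)))
X_noisy = X_clean + 0.3 * rng.normal(size=X_clean.shape)
train, test = slice(0, 500), slice(500, 600)

# Denoising setup: corrupted input, clean target, narrow middle layer
denoiser = MLPRegressor(hidden_layer_sizes=(16, 4, 16), activation="tanh",
                        max_iter=2000, random_state=0)
denoiser.fit(X_noisy[train], X_clean[train])

# Evaluate on held-out rows
X_denoised = denoiser.predict(X_noisy[test])
mse_before = np.mean((X_noisy[test] - X_clean[test]) ** 2)
mse_after = np.mean((X_denoised - X_clean[test]) ** 2)

print(f"MSE of noisy input vs clean:     {mse_before:.4f}")
print(f"MSE of denoised output vs clean: {mse_after:.4f}")
```

The only change from a vanilla autoencoder is the training pair: `fit(X_noisy, X_clean)` instead of `fit(X, X)`.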

### 📝 Pseudo Structure / Workflow

```
ALGORITHM: Autoencoder Training
INPUT: Training data X, architecture config
OUTPUT: Trained encoder and decoder networks

1. DEFINE architecture:
   - Encoder: [input_dim → hidden1 → hidden2 → latent_dim]
   - Decoder: [latent_dim → hidden2 → hidden1 → input_dim]
2. INITIALIZE network weights randomly
3. FOR each training epoch:
   a. FOR each batch of data:
      - Forward pass: x → encoder → z → decoder → x_reconstructed
      - Calculate loss: L = ||x - x_reconstructed||²
      - Backpropagate gradients
      - Update weights using optimizer (Adam, SGD, etc.)
4. RETURN trained encoder (for dimensionality reduction)
```
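
The training loop above can be written out by hand for a tiny network. This is a minimal NumPy sketch under simplifying assumptions: one `tanh` encoder layer, one linear decoder layer, plain mini-batch SGD, and hypothetical data that lies on a 2-D subspace. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 6 features that really live on a 2-D subspace
X = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Steps 1-2: define the architecture (6 -> 2 -> 6) and initialize weights
input_dim, latent_dim = X.shape[1], 2
W_enc = rng.normal(scale=0.1, size=(input_dim, latent_dim))
b_enc = np.zeros(latent_dim)
W_dec = rng.normal(scale=0.1, size=(latent_dim, input_dim))
b_dec = np.zeros(input_dim)

def reconstruct(x):
    z = np.tanh(x @ W_enc + b_enc)          # encoder
    return z, z @ W_dec + b_dec             # decoder (linear output)

def mse(x):
    return float(np.mean((reconstruct(x)[1] - x) ** 2))

learning_rate, batch_size = 0.01, 32
losses = [mse(X)]                           # loss before any training

# Step 3: epochs over shuffled mini-batches
for epoch in range(300):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        x = X[order[start:start + batch_size]]
        z, x_rec = reconstruct(x)                       # forward pass
        d_out = 2.0 * (x_rec - x) / len(x)              # gradient of batch-averaged ||x - x_rec||^2
        grad_W_dec, grad_b_dec = z.T @ d_out, d_out.sum(axis=0)
        d_pre = (d_out @ W_dec.T) * (1.0 - z ** 2)      # backprop through tanh
        grad_W_enc, grad_b_enc = x.T @ d_pre, d_pre.sum(axis=0)
        W_dec -= learning_rate * grad_W_dec             # SGD updates
        b_dec -= learning_rate * grad_b_dec
        W_enc -= learning_rate * grad_W_enc
        b_enc -= learning_rate * grad_b_enc
    losses.append(mse(X))

# Step 4: the encoder alone gives the compressed representation
Z = np.tanh(X @ W_enc + b_enc)
print(f"Reconstruction MSE: {losses[0]:.3f} -> {losses[-1]:.3f}, latent shape: {Z.shape}")
```

Frameworks such as PyTorch compute these gradients automatically; writing them out once makes it clearer what "backpropagate gradients" means in the pseudocode.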

### ✅ Use Cases

• **Dimensionality Reduction**: Alternative to PCA for non-linear data  
• **Anomaly Detection**: Identify data that doesn't reconstruct well  
• **Image Denoising**: Remove noise from images  
• **Data Compression**: Compress data while preserving important features  
• **Feature Learning**: Learn representations for downstream tasks  
• **Generative Modeling**: Generate new data similar to training data  
• **Recommender Systems**: Learn user-item representations  

### 💡 Why & When To Use

**Strengths:**
- Handles non-linear relationships (unlike PCA)
- Flexible architecture for different data types
- Can learn complex patterns and features
- Useful for both supervised and unsupervised tasks
- Can be combined with other deep learning models
- Scalable to large datasets

**Limitations:**
- Requires large amounts of data
- Computationally expensive to train
- Many hyperparameters to tune
- Can overfit on small datasets
- "Black box" - difficult to interpret
- Requires deep learning expertise
- May not outperform simpler methods on simple data
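
Before reaching for an autoencoder, it is worth measuring a PCA baseline with the same bottleneck size. The sketch below does this on hypothetical data that is linear by construction, so PCA should do well here; the layer sizes and dataset are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical linear data: 12 features from 3 latent factors plus small noise
X_raw = rng.normal(size=(600, 3)) @ rng.normal(size=(3, 12)) + 0.1 * rng.normal(size=(600, 12))
scaler = StandardScaler().fit(X_raw[:480])          # fit scaling on the training rows only
X_train, X_test = scaler.transform(X_raw[:480]), scaler.transform(X_raw[480:])

# PCA baseline: project to 3 components, then map back to 12 features
pca = PCA(n_components=3).fit(X_train)
X_pca_rec = pca.inverse_transform(pca.transform(X_test))
mse_pca = np.mean((X_test - X_pca_rec) ** 2)

# Small autoencoder with the same bottleneck size
ae = MLPRegressor(hidden_layer_sizes=(8, 3, 8), activation="tanh",
                  max_iter=1000, random_state=0)
ae.fit(X_train, X_train)
mse_ae = np.mean((X_test - ae.predict(X_test)) ** 2)

print(f"PCA reconstruction MSE (3 components):      {mse_pca:.4f}")
print(f"Autoencoder reconstruction MSE (3 latents): {mse_ae:.4f}")
```

If the autoencoder does not clearly beat this baseline on held-out data, the simpler and more interpretable PCA is usually the better choice.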

### 💻 Code Example

> **Problem**: We have high-dimensional sensor data from manufacturing equipment. Let's build an autoencoder to learn compressed representations and detect anomalies (equipment malfunctions) by identifying data points that don't reconstruct well.

In [None]:
# Note: This example uses basic neural network concepts
# In practice, you'd use TensorFlow/PyTorch for more complex autoencoders

from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

# Generate synthetic manufacturing sensor data
np.random.seed(42)
n_samples = 1000
n_features = 20

# Create normal operating conditions (most of the data)
# Simulate sensors that are correlated (realistic for manufacturing)
normal_data = []

for i in range(int(0.9 * n_samples)):
    # Base operating conditions
    temperature_base = np.random.normal(75, 5)  # °C
    pressure_base = np.random.normal(100, 10)   # PSI
    vibration_base = np.random.normal(2, 0.5)   # mm/s
    
    # Create 20 sensor readings with realistic correlations
    sensor_reading = [
        temperature_base + np.random.normal(0, 2),                    # Temp sensor 1
        temperature_base + np.random.normal(0, 2),                    # Temp sensor 2  
        temperature_base * 1.1 + np.random.normal(0, 3),            # Related temp sensor
        pressure_base + np.random.normal(0, 5),                      # Pressure sensor 1
        pressure_base + np.random.normal(0, 5),                      # Pressure sensor 2
        pressure_base * 0.95 + np.random.normal(0, 4),              # Related pressure
        vibration_base + np.random.normal(0, 0.3),                   # Vibration X
        vibration_base + np.random.normal(0, 0.3),                   # Vibration Y
        vibration_base * 1.2 + np.random.normal(0, 0.4),            # Vibration Z
        50 + 0.3 * temperature_base + np.random.normal(0, 5),       # RPM (temp dependent)
        30 + 0.2 * pressure_base + np.random.normal(0, 3),          # Flow rate
        200 + temperature_base * 2 + np.random.normal(0, 20),       # Power consumption
        np.random.normal(0.95, 0.05),                                # Efficiency
        np.random.normal(7.2, 0.3),                                  # pH level
        np.random.normal(45, 5),                                     # Humidity %
        temperature_base - 10 + np.random.normal(0, 2),             # Cooling temp
        vibration_base * 50 + np.random.normal(0, 10),              # Acoustic level
        pressure_base / 10 + np.random.normal(0, 1),                # Secondary pressure
        np.random.normal(12, 1),                                     # Voltage
        np.random.normal(5, 0.5),                                    # Current
    ]
    normal_data.append(sensor_reading)

# Create anomalous data (equipment malfunctions)
anomaly_data = []
anomaly_types = []

for i in range(int(0.1 * n_samples)):
    # Choose anomaly type
    anomaly_type = np.random.choice(['overheating', 'pressure_drop', 'high_vibration', 'power_spike'])
    anomaly_types.append(anomaly_type)
    
    # Start with normal base values
    temp_base = np.random.normal(75, 5)
    pressure_base = np.random.normal(100, 10)
    vibration_base = np.random.normal(2, 0.5)
    
    # Introduce specific anomalies
    if anomaly_type == 'overheating':
        temp_base += 25  # Significant temperature increase
    elif anomaly_type == 'pressure_drop':
        pressure_base -= 30  # Pressure drop
    elif anomaly_type == 'high_vibration':
        vibration_base += 3  # High vibration
    elif anomaly_type == 'power_spike':
        temp_base += 10  # Temperature increase from power spike
    
    # Create sensor readings with anomaly
    sensor_reading = [
        temp_base + np.random.normal(0, 2),
        temp_base + np.random.normal(0, 2),
        temp_base * 1.1 + np.random.normal(0, 3),
        pressure_base + np.random.normal(0, 5),
        pressure_base + np.random.normal(0, 5),
        pressure_base * 0.95 + np.random.normal(0, 4),
        vibration_base + np.random.normal(0, 0.3),
        vibration_base + np.random.normal(0, 0.3),
        vibration_base * 1.2 + np.random.normal(0, 0.4),
        50 + 0.3 * temp_base + np.random.normal(0, 5),
        30 + 0.2 * pressure_base + np.random.normal(0, 3),
        (300 if anomaly_type == 'power_spike' else 200) + temp_base * 2 + np.random.normal(0, 20),
        np.random.normal(0.85, 0.05),  # Efficiency is reduced for every anomaly type
        np.random.normal(7.2, 0.3),
        np.random.normal(45, 5),
        temp_base - 10 + np.random.normal(0, 2),
        vibration_base * 50 + np.random.normal(0, 10),
        pressure_base / 10 + np.random.normal(0, 1),
        np.random.normal(12, 1),
        np.random.normal(5, 0.5),
    ]
    anomaly_data.append(sensor_reading)

# Combine normal and anomalous data
X_sensors = np.array(normal_data + anomaly_data)
y_labels = np.array(['normal'] * len(normal_data) + ['anomaly'] * len(anomaly_data))

# Create feature names
feature_names = [
    'temp_1', 'temp_2', 'temp_3', 'pressure_1', 'pressure_2', 'pressure_3',
    'vibration_x', 'vibration_y', 'vibration_z', 'rpm', 'flow_rate', 
    'power_consumption', 'efficiency', 'ph_level', 'humidity', 
    'cooling_temp', 'acoustic_level', 'pressure_secondary', 'voltage', 'current'
]

print("🏭 Manufacturing Sensor Dataset:")
print(f"Total samples: {len(X_sensors)}")
print(f"Features (sensors): {len(feature_names)}")
print(f"Normal samples: {np.sum(y_labels == 'normal')}")
print(f"Anomaly samples: {np.sum(y_labels == 'anomaly')}")
print(f"Anomaly rate: {np.mean(y_labels == 'anomaly') * 100:.1f}%")

# Display data statistics
df_sensors = pd.DataFrame(X_sensors, columns=feature_names)
df_sensors['label'] = y_labels

print("\n📊 Sensor Data Summary:")
summary_stats = df_sensors.groupby('label')[feature_names[:5]].agg(['mean', 'std']).round(2)
print(summary_stats)

In [None]:
# Standardize the sensor data
# Fit the scaler on normal samples only so anomalies do not influence the scaling
normal_mask = y_labels == 'normal'
scaler = StandardScaler()
scaler.fit(X_sensors[normal_mask])
X_sensors_scaled = scaler.transform(X_sensors)

# Split data: train only on normal data (realistic for anomaly detection)
X_normal = X_sensors_scaled[normal_mask]
X_test = X_sensors_scaled  # Evaluate on all data (normal + anomalies, including the training rows)

# Split normal data for training/validation
X_train, X_val = train_test_split(X_normal, test_size=0.2, random_state=42)

print("🔄 Building Autoencoder Architecture...")
print(f"Input dimension: {X_sensors.shape[1]}")
print(f"Training samples (normal only): {len(X_train)}")
print(f"Validation samples (normal only): {len(X_val)}")
print(f"Test samples (normal + anomalies): {len(X_test)}")

# Create a simple autoencoder using MLPRegressor
# Architecture: 20 → 15 → 8 → 15 → 20 (bottleneck dimension = 8)
# Note: This is a simplified approach. Production autoencoders use TensorFlow/PyTorch

print("\nTraining Autoencoder...")
print("Architecture: Input(20) → Hidden(15) → Bottleneck(8) → Hidden(15) → Output(20)")

# Since MLPRegressor doesn't directly support encoder-decoder architecture,
# we'll train it to map input to input (identity mapping) with a bottleneck
autoencoder = MLPRegressor(
    hidden_layer_sizes=(15, 8, 15),  # Bottleneck architecture
    activation='tanh',
    solver='adam',  # Adam adapts step sizes itself; learning_rate schedules apply only to solver='sgd'
    max_iter=500,
    random_state=42,
    early_stopping=True,
    validation_fraction=0.1
)

# Train autoencoder on normal data only
autoencoder.fit(X_train, X_train)  # Input = Output for autoencoder

print("Autoencoder training completed!")
print(f"Final training R² score: {autoencoder.score(X_train, X_train):.4f}")
print(f"Validation R² score: {autoencoder.score(X_val, X_val):.4f}")
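
`MLPRegressor` has no separate encoder object, but the bottleneck activations can be recovered from the fitted weights in `coefs_` and `intercepts_`. The helper below (`encode`, a hypothetical name) is a sketch of that idea, demonstrated on random stand-in data; with the notebook's model, a call such as `encode(autoencoder, X_test, n_hidden_layers=2)` would be expected to return the 8-dimensional codes.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def encode(mlp, X, n_hidden_layers):
    """Run X through the first n_hidden_layers hidden layers of a fitted MLP."""
    activations = {
        "tanh": np.tanh,
        "relu": lambda a: np.maximum(a, 0.0),
        "logistic": lambda a: 1.0 / (1.0 + np.exp(-a)),
        "identity": lambda a: a,
    }
    act = activations[mlp.activation]
    h = np.asarray(X, dtype=float)
    for W, b in zip(mlp.coefs_[:n_hidden_layers], mlp.intercepts_[:n_hidden_layers]):
        h = act(h @ W + b)
    return h

# Hypothetical stand-in for standardized 20-sensor data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))
mlp = MLPRegressor(hidden_layer_sizes=(15, 8, 15), activation="tanh",
                   max_iter=200, random_state=0).fit(X_demo, X_demo)

# Two hidden layers in: 20 -> 15 -> 8, i.e. the bottleneck
Z_demo = encode(mlp, X_demo, n_hidden_layers=2)
print("Bottleneck representation shape:", Z_demo.shape)

# Sanity check: all hidden layers plus the linear output layer reproduce predict()
full_pass = encode(mlp, X_demo, n_hidden_layers=3) @ mlp.coefs_[3] + mlp.intercepts_[3]
print("Manual forward pass matches predict():", np.allclose(full_pass, mlp.predict(X_demo)))
```

The extracted codes can then be used like PCA components, for example as inputs to clustering or for a 2D plot.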

In [None]:
# Use autoencoder for anomaly detection
# Normal data should reconstruct well (low error)
# Anomalous data should reconstruct poorly (high error)

# Get reconstructions for all test data
X_reconstructed = autoencoder.predict(X_test)

# Calculate reconstruction errors (MSE per sample)
reconstruction_errors = np.mean((X_test - X_reconstructed) ** 2, axis=1)

# Create DataFrame for analysis
results_df = pd.DataFrame({
    'reconstruction_error': reconstruction_errors,
    'true_label': y_labels,
    'sample_id': range(len(y_labels))
})

# Visualize reconstruction errors
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Box plot of errors by true label
normal_errors = results_df[results_df['true_label'] == 'normal']['reconstruction_error']
anomaly_errors = results_df[results_df['true_label'] == 'anomaly']['reconstruction_error']

ax1.boxplot([normal_errors, anomaly_errors], patch_artist=True,
           boxprops=dict(facecolor='lightblue', alpha=0.7),
           medianprops=dict(color='red', linewidth=2))
# Set tick labels separately: the boxplot 'labels' keyword was renamed in newer Matplotlib releases
ax1.set_xticklabels(['Normal', 'Anomaly'])
ax1.set_title('Reconstruction Error Distribution', fontsize=14, fontweight='bold')
ax1.set_ylabel('Reconstruction Error (MSE)')
ax1.grid(True, alpha=0.3)

# Scatter plot of reconstruction errors
colors = {'normal': 'blue', 'anomaly': 'red'}
for label in ['normal', 'anomaly']:
    mask = results_df['true_label'] == label
    ax2.scatter(results_df[mask]['sample_id'], results_df[mask]['reconstruction_error'],
               c=colors[label], label=label, alpha=0.7, s=30)

# Add threshold line: the 95th percentile of the normal-sample errors
threshold = np.percentile(normal_errors, 95)
ax2.axhline(y=threshold, color='orange', linestyle='--', linewidth=2, 
           label=f'Threshold ({threshold:.4f})')

ax2.set_title('Anomaly Detection Results', fontsize=14, fontweight='bold')
ax2.set_xlabel('Sample ID')
ax2.set_ylabel('Reconstruction Error (MSE)')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate performance metrics
predicted_anomalies = reconstruction_errors > threshold
true_anomalies = y_labels == 'anomaly'

from sklearn.metrics import classification_report, confusion_matrix

print("\nAnomaly Detection Performance:")
print("="*50)
print(f"Threshold (95th percentile of normal): {threshold:.6f}")
print()
print("Reconstruction Error Statistics:")
print(f"Normal data - Mean: {normal_errors.mean():.6f}, Std: {normal_errors.std():.6f}")
print(f"Anomaly data - Mean: {anomaly_errors.mean():.6f}, Std: {anomaly_errors.std():.6f}")
print(f"Separation ratio: {anomaly_errors.mean() / normal_errors.mean():.2f}x")
print()

# Classification report
print("Classification Report:")
print(classification_report(true_anomalies, predicted_anomalies, 
                           target_names=['Normal', 'Anomaly']))

# Confusion matrix
cm = confusion_matrix(true_anomalies, predicted_anomalies)
print("\nConfusion Matrix:")
print(f"                 Predicted")
print(f"True       Normal  Anomaly")
print(f"Normal     {cm[0,0]:6d}  {cm[0,1]:6d}")
print(f"Anomaly    {cm[1,0]:6d}  {cm[1,1]:6d}")
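
The report above depends on one chosen threshold. A complementary, threshold-free view is to score the reconstruction errors directly with ROC AUC and average precision, then see how precision and recall move as the threshold changes. The sketch below uses simulated errors (`errors`, `is_anomaly`) as stand-ins for `reconstruction_errors` and `true_anomalies`.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical reconstruction errors: 900 normal samples, 100 anomalies with larger errors
errors = np.concatenate([rng.gamma(2.0, 0.05, size=900),
                         rng.gamma(2.0, 0.25, size=100)])
is_anomaly = np.concatenate([np.zeros(900, dtype=bool), np.ones(100, dtype=bool)])

# Threshold-free scores: the error itself is used as the anomaly score
roc_auc = roc_auc_score(is_anomaly, errors)
avg_precision = average_precision_score(is_anomaly, errors)
print(f"ROC AUC: {roc_auc:.3f}, average precision: {avg_precision:.3f}")

# Precision/recall trade-off for thresholds set at percentiles of the normal errors
for pct in (90, 95, 99):
    threshold = np.percentile(errors[~is_anomaly], pct)
    flagged = errors > threshold
    true_pos = (flagged & is_anomaly).sum()
    precision = true_pos / max(flagged.sum(), 1)
    recall = true_pos / is_anomaly.sum()
    print(f"{pct}th percentile: precision = {precision:.2f}, recall = {recall:.2f}")
```

With only about 10% anomalies, average precision and the precision/recall pairs tend to be more informative than plain accuracy.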

In [None]:
# Analyze what the autoencoder learned - feature reconstruction quality
# Calculate reconstruction error per feature
feature_errors = np.mean((X_test - X_reconstructed) ** 2, axis=0)

# Create feature importance visualization
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10))

# Feature reconstruction errors
bars = ax1.bar(range(len(feature_names)), feature_errors, alpha=0.7, color='steelblue')
ax1.set_title('Feature Reconstruction Difficulty', fontsize=14, fontweight='bold')
ax1.set_xlabel('Sensor Features')
ax1.set_ylabel('Average Reconstruction Error')
ax1.set_xticks(range(len(feature_names)))
ax1.set_xticklabels(feature_names, rotation=45, ha='right')
ax1.grid(True, alpha=0.3)

# Add value labels on bars
for i, bar in enumerate(bars):
    height = bar.get_height()
    if height > 0.01:  # Only show labels for significant errors
        ax1.text(bar.get_x() + bar.get_width()/2., height + 0.001,
                f'{height:.3f}', ha='center', va='bottom', fontsize=8)

# Compare original vs reconstructed for a few key features
sample_idx = 850  # Pick an anomalous sample
key_features = ['temp_1', 'pressure_1', 'vibration_x', 'power_consumption', 'efficiency']
key_indices = [feature_names.index(feat) for feat in key_features]

original_values = X_test[sample_idx, key_indices]
reconstructed_values = X_reconstructed[sample_idx, key_indices]

x = np.arange(len(key_features))
width = 0.35

ax2.bar(x - width/2, original_values, width, label='Original', alpha=0.8, color='green')
ax2.bar(x + width/2, reconstructed_values, width, label='Reconstructed', alpha=0.8, color='red')

ax2.set_title(f'Reconstruction Quality - Sample {sample_idx} ({y_labels[sample_idx]})', 
              fontsize=14, fontweight='bold')
ax2.set_xlabel('Key Features')
ax2.set_ylabel('Standardized Values')
ax2.set_xticks(x)
ax2.set_xticklabels(key_features)
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary of autoencoder capabilities
print("\nAutoencoder Analysis Summary:")
print("="*60)
print("Successfully compressed 20D sensor data to 8D bottleneck")
print("Learned to reconstruct normal operating conditions well")
print(f"Anomalies show {anomaly_errors.mean()/normal_errors.mean():.1f}x higher reconstruction error")
print(f"Achieved {((predicted_anomalies == true_anomalies).mean()*100):.1f}% accuracy in anomaly detection")
print()
print("Key Insights:")
most_difficult = feature_names[np.argmax(feature_errors)]
easiest = feature_names[np.argmin(feature_errors)]
print(f"• Most difficult to reconstruct: {most_difficult} (highest mean squared error)")
print(f"• Easiest to reconstruct: {easiest} (lowest mean squared error)")
print("• The compressed representation keeps enough information to separate anomalies from normal data")
print("• The same approach can be applied to larger datasets and more complex patterns")
print()
print("Real-world Applications:")
print("• Continuous monitoring of manufacturing equipment")
print("• Early detection of equipment failures before breakdown")
print("• Reduced maintenance costs through predictive maintenance")
print("• Quality control in production lines")
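
The detection rule used above flags a sample when its reconstruction error exceeds a percentile of the errors seen on normal data. The following is a minimal, self-contained sketch of how that percentile choice trades precision against recall. The gamma-distributed errors, the sample counts, and the helper `evaluate_threshold` are assumptions made for illustration; they are not the notebook's sensor data or its trained autoencoder.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(42)

# Synthetic reconstruction errors (assumed distributions, for illustration only)
normal_errors = rng.gamma(shape=2.0, scale=0.05, size=800)   # mostly small errors
anomaly_errors = rng.gamma(shape=4.0, scale=0.10, size=200)  # larger on average, some overlap

errors = np.concatenate([normal_errors, anomaly_errors])
is_anomaly = np.concatenate([np.zeros(800, dtype=int), np.ones(200, dtype=int)])


def evaluate_threshold(percentile):
    """Flag samples whose error exceeds a percentile of the normal errors."""
    threshold = np.percentile(normal_errors, percentile)
    predicted = (errors > threshold).astype(int)
    precision = precision_score(is_anomaly, predicted)
    recall = recall_score(is_anomaly, predicted)
    return threshold, precision, recall


# Compare a few candidate percentiles side by side
results = {p: evaluate_threshold(p) for p in (90, 95, 99)}
for p, (threshold, precision, recall) in results.items():
    print(f"{p}th percentile: threshold={threshold:.4f}, "
          f"precision={precision:.3f}, recall={recall:.3f}")
```

A higher percentile raises the threshold, so recall can only stay the same or drop, while false alarms on normal data become rarer. Which point on that trade-off is acceptable depends on the cost of a missed failure versus an unnecessary inspection.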

## Conclusion

### 🎯 Key Takeaways from Unsupervised Learning

Congratulations! You've now explored the fascinating world of **unsupervised learning algorithms**. Each algorithm we've covered offers unique strengths and is suited for different types of problems and data characteristics.

### 📊 Algorithm Comparison Summary

| Algorithm | Best For | Strengths | Main Limitations |
|-----------|----------|-----------|------------------|
| **K-Means** | Spherical clusters, large datasets | Fast, simple, scalable | Need to choose k, assumes spherical clusters |
| **Hierarchical** | Small datasets, unknown cluster count | No k needed up front, creates dendrogram | Expensive (up to O(n³) time, O(n²) memory), sensitive to outliers |
| **DBSCAN** | Arbitrary shapes, outlier detection | Finds any shape, detects noise | Sensitive to parameters, struggles with varying densities |
| **PCA** | Dimensionality reduction, visualization | Fast, preserves variance | Linear only, each component mixes all original features |
| **t-SNE** | Non-linear visualization | Preserves local neighborhoods, reveals cluster structure | Slow on large datasets, mainly for visualization, results vary with random seed and perplexity |
| **Autoencoders** | Complex patterns, anomaly detection | Handles non-linearity, flexible | Needs large data, computationally expensive |

### 🛠️ Choosing the Right Algorithm

**For Clustering:**
- Start with **K-Means** for quick exploration
- Use **Hierarchical** when you don't know the number of clusters
- Choose **DBSCAN** for irregular shapes and outlier detection
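
To make the guidance above concrete, here is a small sketch that runs all three clustering algorithms on the same non-spherical dataset. The two-moons data, the `eps=0.3` and `min_samples=5` settings, and the use of the adjusted Rand index (which needs ground-truth labels and is used here only because the toy data has them) are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaving half-circles: clusters that are clearly not spherical
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

models = {
    "K-Means": KMeans(n_clusters=2, n_init=10, random_state=42),
    "Hierarchical (Ward)": AgglomerativeClustering(n_clusters=2, linkage="ward"),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    # Adjusted Rand index: 1.0 means the clustering matches the true grouping
    scores[name] = adjusted_rand_score(y_true, labels)
    print(f"{name:20s} ARI = {scores[name]:.3f}")
```

On data shaped like this, the density-based approach tends to follow the curved clusters, while K-Means cuts the plane with a straight boundary. On compact, well-separated blobs the ranking can easily reverse, so treat this as one data point and not a general verdict.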

**For Dimensionality Reduction:**
- Use **PCA** for linear relationships and when you want components you can inspect through their loadings
- Apply **t-SNE** for 2D or 3D visualizations of complex, non-linear data
- Consider **Autoencoders** for non-linear patterns with large datasets
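
As a quick illustration of why PCA is a reasonable first choice, the sketch below projects the Iris dataset to two components and prints how much variance each one keeps, along with the loadings that show how the original features contribute. Using Iris and two components is an assumption for the example, not a recommendation for every dataset.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
# PCA is sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(iris.data)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_
print(f"Variance kept by PC1 and PC2: {explained[0]:.1%}, {explained[1]:.1%}")
print(f"Total variance kept: {explained.sum():.1%}")

# Loadings: the weight of each original feature in each component
for i, component in enumerate(pca.components_, start=1):
    weights = ", ".join(
        f"{name}: {weight:+.2f}"
        for name, weight in zip(iris.feature_names, component)
    )
    print(f"PC{i} -> {weights}")
```

If two components already keep most of the variance, a linear method is probably sufficient. If the explained variance is spread thinly across many components, that is a hint to try t-SNE for visualization or an autoencoder for a non-linear encoding.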

### 🚀 Next Steps for Your Learning Journey

1. **Practice with Real Data**: Apply these algorithms to datasets from your domain of interest
2. **Experiment with Parameters**: Understanding hyperparameter tuning is crucial for success
3. **Combine Algorithms**: Use PCA before clustering, or t-SNE after clustering for visualization
4. **Learn Evaluation Metrics**: Study silhouette score, Davies-Bouldin index, and other clustering metrics
5. **Explore Advanced Variants**: HDBSCAN, UMAP, Variational Autoencoders, and more
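
Steps 3 and 4 can be combined in a few lines. The sketch below reduces synthetic data with PCA, clusters it with K-Means for several values of k, and scores each result with the silhouette score (higher is better) and the Davies-Bouldin index (lower is better). The blob dataset, the choice of three components, and the range of k are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic 10-dimensional data with 4 underlying groups (assumed for the demo)
X, _ = make_blobs(n_samples=600, n_features=10, centers=4,
                  cluster_std=1.0, random_state=42)

# Step 3: reduce dimensionality before clustering
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=3, random_state=42).fit_transform(X_scaled)

# Step 4: score each candidate k without using any labels
metrics = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_reduced)
    metrics[k] = (
        silhouette_score(X_reduced, labels),
        davies_bouldin_score(X_reduced, labels),
    )
    print(f"k={k}: silhouette={metrics[k][0]:.3f}, Davies-Bouldin={metrics[k][1]:.3f}")

best_k = max(metrics, key=lambda k: metrics[k][0])
print(f"Highest silhouette score at k={best_k}")
```

Both metrics are computed from the data and the cluster assignments alone, which is what makes them usable when no ground truth exists. They can disagree, so it is worth looking at both alongside a plot of the clusters.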

### 🌟 The Power of Unsupervised Learning

Remember that unsupervised learning is often the **first step** in understanding your data. It helps you:
- Discover hidden patterns and structures
- Prepare data for supervised learning
- Generate insights for business decisions
- Create better features for machine learning models

### 💡 Final Advice

**"The best algorithm is the one you understand well and can explain to others."** 

Start simple, understand the fundamentals, and gradually move to more complex methods. Always validate your results and remember that domain expertise is just as important as algorithmic knowledge.

Good luck on your machine learning journey! 🚀✨