# Module 2 — Unsupervised Learning

**Created:** 2025-12-04 14:06:54 UTC

## Overview

Unsupervised learning finds structure in unlabeled data through tasks such as clustering, dimensionality reduction, and density estimation.

### Why Unsupervised Learning?

Unsupervised learning is particularly useful when labeled data is scarce or expensive to obtain. In many real-world scenarios, collecting labels for data can be time-consuming, costly, or even impossible (e.g., for historical data or new domains). Instead of relying on supervised methods that require labeled examples, unsupervised algorithms discover hidden patterns, group similar data points, or reduce data complexity without explicit guidance.

### Where is it Applied?

- **Clustering Customer Segments**: Grouping customers based on purchasing behavior to personalize marketing strategies.
- **Anomaly Detection**: Identifying unusual patterns in network traffic for cybersecurity or detecting fraud in financial transactions.
- **Image Segmentation**: Partitioning images into meaningful regions for computer vision tasks.
- **Topic Modeling**: Discovering themes in large text corpora for content recommendation.
- **Recommendation Systems**: Finding similar items or users without explicit ratings.

### When to Choose Unsupervised Learning?

- **Data Exploration**: When you want to understand the underlying structure of your data without predefined categories.
- **Pattern Discovery**: For finding hidden relationships or groupings that aren't obvious.
- **Dimensionality Reduction**: To simplify high-dimensional data for visualization or faster processing.
- **Pre-processing**: As a first step before applying supervised learning, to reduce noise or extract features.

## Learning objectives
- Understand clustering vs dimensionality reduction.
- Run KMeans on synthetic data (beginner).
- Try PCA and t-SNE (intermediate).
- Advanced: mixture models, clustering validation, and practical tips.


## Beginner — Concept + Simple Example

### What is Clustering?

Clustering is like organizing a messy room: you group similar items together without knowing in advance what each group should be called. Imagine a pile of mixed fruit scattered around: you naturally put apples with apples and oranges with oranges. Clustering algorithms do this automatically by finding natural groupings based on similarity.

Another analogy: Think of clustering as sorting a pile of mixed coins into piles of the same denomination. You don't know the labels ("nickel," "dime"), but you can group them based on size, weight, and appearance.

**Key Concept:** No labels — algorithms try to group or summarize data based on patterns they discover.

### K-Means Clustering: The Basics

K-Means is one of the most widely used clustering algorithms. Think of it as placing k shops in a city so that each house is served by its nearest shop. The algorithm:
1. Places k initial 'centroids' (centers), at random in the basic version
2. Assigns each data point to the nearest centroid
3. Moves each centroid to the average position of its assigned points
4. Repeats steps 2 and 3 until the centroids stop moving
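
The four steps above can be sketched directly in NumPy. This is a minimal illustration, not how scikit-learn implements K-Means: the function name `kmeans_sketch`, its parameters, and the toy data are all assumptions made for this example, and the initial centroids are simply k randomly chosen data points.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Minimal K-Means loop; names and defaults are illustrative only."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (a centroid that lost all of its points stays where it is)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy groups (made-up data)
rng = np.random.default_rng(1)
X_toy = np.vstack([
    rng.normal(0, 0.5, size=(50, 2)),
    rng.normal(5, 0.5, size=(50, 2)),
])
centroids, labels = kmeans_sketch(X_toy, k=2)
print(centroids)
```

In practice the scikit-learn `KMeans` class used in the next cell is the better choice; the sketch only makes the assign-and-update loop visible.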

**Simple Example:** KMeans clustering on synthetic data.


In [None]:
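# Beginner example: K-Means on synthetic 2D blobs
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# 300 points drawn around 4 centers; y_true is kept only for reference
X, y_true = make_blobs(n_samples=300, centers=4, random_state=42)

# n_init=10 reruns K-Means from 10 different starts and keeps the best result
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans.fit(X)
print('Cluster centers:\n', kmeans.cluster_centers_)
# Inertia is the sum of squared distances from each point to its centroid;
# lower is better only when comparing runs with the same k
print('Inertia:', kmeans.inertia_)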
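
Inertia tends to keep falling as k grows, so on its own it cannot pick the number of clusters. A common heuristic, sketched here under the assumption that the same synthetic blobs are used, is to compute inertia for a range of k values and look for the point where the improvement flattens (the "elbow"). The variable names `X_elbow` and `inertias` and the range 1 to 8 are arbitrary choices for this illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Same kind of synthetic data as the beginner example
X_elbow, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit K-Means for several values of k and record the inertia of each fit
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_elbow)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(f'k={k}: inertia={value:.1f}')
```

Plotting `inertias` against k usually makes the elbow easier to see than the printed numbers.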


### Hierarchical Clustering: Building a Tree of Clusters

Hierarchical clustering is like creating a family tree of your data. The agglomerative (bottom-up) version used here starts by treating each data point as its own cluster, then merges the two closest clusters step by step until one big cluster remains. In the resulting tree, the leaves are individual items and each branch point marks two groups merging together.

**Key Concept:** Produces a dendrogram (tree diagram) showing the hierarchy of clusters.

**Example:** Hierarchical clustering on the same synthetic data.


In [None]:
# Import necessary libraries
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate the same synthetic data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Perform hierarchical clustering using Ward's method (minimizes variance)
linked = linkage(X, method='ward')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.show()

# Cut the dendrogram to get 4 clusters
clusters = fcluster(linked, 4, criterion='maxclust')
print('Number of clusters found:', len(set(clusters)))
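
To see the step-by-step merging, it can help to print the linkage matrix for a tiny dataset. The sketch below uses five made-up one-dimensional points; each row returned by `linkage` records one merge as the two cluster ids joined, the distance between them, and the size of the new cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five made-up 1D points: two tight pairs and one far-away point
points = np.array([[0.0], [0.1], [5.0], [5.2], [20.0]])
Z = linkage(points, method='ward')

# Ids 0..n-1 are the original points; merge i creates cluster id n + i
n = len(points)
for i, (a, b, dist, size) in enumerate(Z):
    print(f'merge {i}: {int(a)} + {int(b)} -> cluster {n + i} '
          f'(distance={dist:.2f}, size={int(size)})')
```

With n points there are always n - 1 merges, and the last one contains every point. The dendrogram is a drawing of exactly this table.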


### DBSCAN: Density-Based Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is like finding islands in the ocean. It groups dense regions of points together, leaving sparse areas as noise. Analogy: Imagine points as people at a party; dense groups are clusters, isolated people are outliers.

**Key Concept:** Doesn't require specifying the number of clusters beforehand. Good for arbitrary-shaped clusters and handling noise.

**Example:** DBSCAN on synthetic data with noise.


In [None]:
# Import necessary libraries
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt

# Generate data with some noise
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

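# Add some uniformly scattered noise points (seeded for reproducibility)
rng = np.random.default_rng(42)
X_noise = rng.uniform(low=-10, high=10, size=(50, 2))
X_with_noise = np.vstack([X, X_noise])

# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)  # eps: neighborhood radius, min_samples: core point threshold
clusters = dbscan.fit_predict(X_with_noise)

# Visualize results: DBSCAN labels noise points -1, so draw them separately in gray
noise_mask = clusters == -1
plt.scatter(X_with_noise[~noise_mask, 0], X_with_noise[~noise_mask, 1], c=clusters[~noise_mask], cmap='plasma')
plt.scatter(X_with_noise[noise_mask, 0], X_with_noise[noise_mask, 1], c='gray', marker='x')
plt.title('DBSCAN Clustering (Noise points in gray)')
plt.show()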

print('Number of clusters found (excluding noise):', len(set(clusters)) - (1 if -1 in clusters else 0))
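
The roles of `eps` and `min_samples` can be checked by hand. The sketch below counts, for every point, how many points (itself included) lie within `eps`, marks those with at least `min_samples` neighbors as core points, and compares the result with what `DBSCAN` reports. It assumes the default Euclidean metric; the names `X_demo`, `neighbor_counts`, and `manual_core` are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
eps, min_samples = 0.5, 5

# Count the points (including the point itself) within eps of each point
nn = NearestNeighbors(radius=eps).fit(X_demo)
neighborhoods = nn.radius_neighbors(X_demo, return_distance=False)
neighbor_counts = np.array([len(idx) for idx in neighborhoods])

# A point is a core point when its eps-neighborhood holds at least min_samples points
manual_core = np.flatnonzero(neighbor_counts >= min_samples)

db = DBSCAN(eps=eps, min_samples=min_samples).fit(X_demo)
print('Core points (manual count):', len(manual_core))
print('Core points (DBSCAN):      ', len(db.core_sample_indices_))
print('Noise points:', int(np.sum(db.labels_ == -1)))
```

Points that are neither core points nor within `eps` of a core point are the ones DBSCAN labels as noise (-1).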


## Intermediate — Dimensionality reduction & visualization

**What to learn:** PCA for compression, t-SNE/UMAP for visualization, interpreting components.

**Code idea:** Apply PCA then plot first two principal components.


In [None]:
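# Intermediate example: project 10-dimensional blobs onto 2 principal components
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# The earlier X is already 2D, so use higher-dimensional data to make the reduction meaningful
X_high, y_high = make_blobs(n_samples=300, centers=4, n_features=10, random_state=42)

pca = PCA(n_components=2)
X2 = pca.fit_transform(X_high)
# Fraction of the total variance captured by each component
print('Explained variance ratio:', pca.explained_variance_ratio_)

# Colors show the generating blob, for reference only; PCA never sees these labels
plt.scatter(X2[:, 0], X2[:, 1], c=y_high, cmap='plasma')
plt.title('PCA projection (2D)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()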
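
For the t-SNE part of this section, a comparable sketch with scikit-learn's `TSNE` might look like the following. The 10-dimensional synthetic data, `perplexity=30`, and `init='pca'` are assumptions chosen for illustration rather than recommended settings. UMAP lives in a separate package and is not shown.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# 10-dimensional synthetic blobs (illustrative data)
X_tsne_in, _ = make_blobs(n_samples=300, centers=4, n_features=10, random_state=42)

# perplexity roughly controls how many neighbors each point attends to;
# it must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=30, init='pca', random_state=42)
X_tsne = tsne.fit_transform(X_tsne_in)

print('Embedded shape:', X_tsne.shape)
print('KL divergence after optimization:', tsne.kl_divergence_)
```

The two columns of `X_tsne` can be scatter-plotted the same way as the PCA projection. Unlike PCA, t-SNE has no `transform` for new points, and distances between far-apart groups in the plot should not be over-interpreted.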


## Advanced — Probabilistic models & cluster validation

**Topics:** Gaussian Mixture Models, Silhouette score, Davies–Bouldin, stability, choosing k.

**Advanced code sketch:** Fit a GMM and compute silhouette score.


In [None]:
# Advanced example (sketch)
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

gmm = GaussianMixture(n_components=4, random_state=42)
gmm.fit(X)
labels = gmm.predict(X)
print('Silhouette score:', silhouette_score(X, labels))

# Tips:
# - Try multiple initializations
# - Scale features before distance-based clustering
