# L1c: Unsupervised Learning and Clustering Approaches
In this lecture, we explore unsupervised learning and the K-means clustering algorithm, a foundational method for discovering patterns and structure in unlabeled data.

> __Learning Objectives:__
> 
> By the end of this lecture, you should be able to:
>
> * __Understand unsupervised learning and clustering:__ Define unsupervised learning and describe how clustering techniques organize data points into groups based on similarity without predefined labels.
> * __Apply Lloyd's K-means algorithm:__ Implement the K-means clustering algorithm, including the assignment and update steps, and understand how it converges to cluster centroids.
> * __Evaluate K-means limitations and select clusters:__ Identify limitations of K-means such as sensitivity to initialization and the challenge of selecting the number of clusters, and apply methods to determine optimal cluster numbers.

Let's get started!
___

## Example
Today, we will use the following example to illustrate key concepts:
 
> [▶ K-means clustering on a consumer spending dataset](CHEME-5820-L1c-Example-K-Means-Spring-2026.ipynb). In this example, we apply Lloyd's algorithm to customer demographics and spending behavior. We'll observe how K-means partitions customers into distinct segments, visualize the cluster assignments, and examine how centroid placement affects the final groupings.

___

## Background: What is unsupervised learning and clustering?
Unsupervised learning is a branch of machine learning that deals with unlabeled data. Its goal is to discover hidden patterns and structures without predefined target variables. One of the most common tasks in unsupervised learning is clustering.

> __What is clustering?__
> 
> __Clustering__ is an unsupervised machine learning technique that organizes data points into groups, or clusters, based on their similarities without prior knowledge of the group memberships. This method is widely used for exploratory data analysis, enabling the discovery of patterns and relationships within complex datasets. Besides K-means, common clustering approaches include:
> 
> * __Hierarchical clustering__ is an unsupervised machine learning technique that organizes data points into a tree-like structure of nested clusters. This allows for the identification of relationships and patterns within the dataset. This method can be implemented using two main approaches: agglomerative, which merges individual points into larger clusters, and divisive, which splits a single cluster into smaller ones.
> * __Density-based spatial clustering of applications with noise (DBSCAN)__ groups closely packed data points while identifying outliers, making it particularly useful for datasets with noise and clusters of arbitrary shapes. By defining clusters as dense regions separated by areas of lower density, DBSCAN can discover meaningful patterns in complex data distributions.
> * __Gaussian mixture models (GMMs)__ are probabilistic models that represent a dataset as a combination of multiple Gaussian distributions, each characterized by its mean and covariance. This allows for the identification of underlying subpopulations within the data. This approach is useful in clustering and density estimation, providing a flexible framework for modeling complex, multimodal distributions.


Today, we'll consider [the K-means algorithm](https://en.wikipedia.org/wiki/K-means_clustering), arguably the most straightforward clustering algorithm. Despite its simplicity, [K-means](https://en.wikipedia.org/wiki/K-means_clustering) has some shortcomings, which is why alternatives such as the hierarchical, density-based, and mixture-model approaches described above are also widely used.

___

## K-means clustering (Lloyd's algorithm)
The [K-means algorithm](https://en.wikipedia.org/wiki/K-means_clustering), originally developed by [Lloyd in the 1950s but not published until 1982](https://ieeexplore.ieee.org/document/1056489), is a foundational approach to unsupervised clustering. Suppose we have a dataset $\mathcal{D}=\left\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{n}\in\mathbb{R}^{m}\right\}$ where each $\mathbf{x}\in\mathbb{R}^{m}$ is an $m$-dimensional feature vector.

[K-means](https://en.wikipedia.org/wiki/K-means_clustering) partitions data points into $K$ distinct groups by minimizing the within-cluster sum of squared distances. The algorithm groups data points (feature vectors) $\mathbf{x}\in\mathcal{D}$ into clusters $\mathcal{C} = \left\{c_{1},\dots,c_{K}\right\}$ based on proximity to cluster centroids.

> __What is similarity in K-means?__
> 
> __Similarity__ is measured by distance in the feature space. The most common metric is [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance), $d(\mathbf{x},\mathbf{y}) = \left\|\mathbf{x} - \mathbf{y}\right\|_{2}$. Points are grouped with the centroid they are closest to. Other distance metrics (e.g., Manhattan, cosine) are possible, but Euclidean distance is standard for K-means.

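To make the distance-based notion of similarity concrete, here is a minimal sketch in plain Python. The helper names `euclidean` and `nearest_centroid` and the two-centroid toy data are illustrative assumptions, not part of the lecture's example notebook.

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest_centroid(x, centroids):
    """Index of the centroid closest to x under Euclidean distance."""
    return min(range(len(centroids)), key=lambda j: euclidean(x, centroids[j]))

# Hypothetical 2-D example: two centroids and one query point.
centroids = [(0.0, 0.0), (10.0, 10.0)]
point = (3.0, 4.0)

print(euclidean(point, centroids[0]))       # 5.0
print(nearest_centroid(point, centroids))   # 0
```

The point $(3, 4)$ lies at distance 5 from the first centroid and much farther from the second, so it is grouped with the first.
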
#### Algorithm: Lloyd's K-means Clustering

__Initialize__: Dataset $\mathcal{D} = \{\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{n} \in \mathbb{R}^{m}\}$, number of clusters $K \in \mathbb{Z}^{+}$, maximum iterations $\texttt{maxiter} \in \mathbb{Z}^{+}$, convergence tolerance $\epsilon > 0$, and initial centroids $\{\boldsymbol{\mu}_{1}, \boldsymbol{\mu}_{2}, \ldots, \boldsymbol{\mu}_{K} \in \mathbb{R}^{m}\}$. Set convergence flag $\texttt{converged} \leftarrow \texttt{false}$ and iteration counter $\texttt{iter} \leftarrow 0$.

__Output__: Cluster assignments $\mathcal{C} = \{c_{1}, c_{2}, \ldots, c_{K}\}$ and updated cluster centroids $\{\boldsymbol{\mu}_{1}, \boldsymbol{\mu}_{2}, \ldots, \boldsymbol{\mu}_{K}\}$.

While $\texttt{converged}$ is $\texttt{false}$ and $\texttt{iter} < \texttt{maxiter}$ __do__:
1. **Assignment step**: For each data point $\mathbf{x}_{i} \in \mathcal{D}$, assign it to the nearest cluster centroid using Euclidean distance:
   $$c_{i} \leftarrow \arg\min_{j} \underbrace{\lVert\mathbf{x}_{i} - \boldsymbol{\mu}_{j}\rVert_{2}^{2}}_{=\;d(\mathbf{x}_{i}, \boldsymbol{\mu}_{j})^2}$$
    where $c_{i} \in \{1, \ldots, K\}$ is the cluster assignment for data point $\mathbf{x}_{i}$.

2. **Update step**: Store the current centroids $\hat{\boldsymbol{\mu}} \leftarrow \boldsymbol{\mu}$. Then, for each cluster $j = 1$ to $K$, recompute the centroid as the mean of all points assigned to cluster $j$:
   $$\boldsymbol{\mu}_{j} \leftarrow \frac{1}{|c_{j}|} \sum_{\mathbf{x} \in c_{j}} \mathbf{x}$$
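   where $|c_{j}|$ is the number of data points currently assigned to cluster $j$. If a cluster ends up empty, this mean is undefined, so implementations typically keep the previous centroid or reinitialize it.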

3. **Convergence check**: If $\left\|\boldsymbol{\mu} - \hat{\boldsymbol{\mu}}\right\|_{2}^{2} \leq \epsilon$ (the total squared movement of all $K$ centroids), set $\texttt{converged} \leftarrow \texttt{true}$ and terminate.
    - Otherwise, increment $\texttt{iter} \leftarrow \texttt{iter} + 1$. If $\texttt{iter} \geq \texttt{maxiter}$, issue a warning that the maximum number of iterations was reached without convergence and exit; otherwise, continue to the next iteration.
    - Typical values for the convergence tolerance are $\epsilon \in \{10^{-4}, 10^{-6}\}$. Smaller values yield tighter convergence at the cost of more iterations.

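The assignment, update, and convergence steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the example notebook: the function name `lloyd_kmeans`, the default values of `maxiter` and `eps`, the choice to keep the previous centroid when a cluster is empty, and the toy data are all hypothetical. Labels are 0-based here, whereas the pseudocode indexes clusters from 1.

```python
import warnings

def sq_dist(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def lloyd_kmeans(data, centroids, maxiter=100, eps=1e-6):
    """Run Lloyd's algorithm; return (labels, centroids, converged)."""
    centroids = [tuple(mu) for mu in centroids]
    labels = [0] * len(data)
    converged = False
    iteration = 0
    while not converged and iteration < maxiter:
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))
                  for x in data]
        # Update step: store the old centroids, then recompute each as the cluster mean.
        previous = centroids
        centroids = []
        for j, old in enumerate(previous):
            members = [x for x, c in zip(data, labels) if c == j]
            if members:
                centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                centroids.append(old)
        # Convergence check: total squared movement of all centroids.
        shift = sum(sq_dist(new, old) for new, old in zip(centroids, previous))
        converged = shift <= eps
        iteration += 1
    if not converged:
        warnings.warn("maximum iterations reached without convergence")
    return labels, centroids, converged

# Hypothetical toy data: two well-separated groups in 2-D.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids, converged = lloyd_kmeans(data, centroids=[data[0], data[3]])
print(labels, converged)   # [0, 0, 0, 1, 1, 1] True
```

On this toy data the centroids stop moving after the second pass, so the loop exits through the convergence check rather than the iteration limit. For real datasets, a vectorized implementation would be much faster than these nested Python loops.
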
> __Practical considerations:__
>
> * __Convergence and multiple restarts__: K-means minimizes the objective function $J = \sum_{i=1}^{n} \left\|\mathbf{x}_{i} - \boldsymbol{\mu}_{c_{i}}\right\|_{2}^{2}$. The assignment step and the update step each reduce (or maintain) this objective, so $J$ is non-increasing across iterations. Because $J$ is bounded below by zero and there are only finitely many ways to assign $n$ points to $K$ clusters, the algorithm must eventually stabilize at a local minimum, which is not necessarily the global minimum. In practice, convergence often occurs within 10–50 iterations, depending on data structure, initialization, and the convergence tolerance $\epsilon$. Because of sensitivity to initialization, practitioners often run K-means 10–20 times with different random seeds and select the result with the smallest $J$.
>
> * __Computational complexity and initialization strategy__: The computational complexity is $O(nKm \cdot t)$, where $n$ is the number of data points, $K$ is the number of clusters, $m$ is the dimensionality, and $t$ is the number of iterations. For very large datasets, this cost can be prohibitive; in such cases, mini-batch K-means offers a scalable alternative by updating centroids on small random subsets of data at each iteration, reducing per-iteration cost to $O(BKm)$ where $B$ is the mini-batch size. Initial centroid placement significantly affects results; random initialization can lead to poor local minima. More robust strategies like $k$-means++ choose initial centroids more carefully to improve convergence quality and reduce the number of restarts needed.

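As an illustration of a more careful initialization, the sketch below follows the usual description of $k$-means++ seeding: the first centroid is drawn uniformly at random, and each subsequent centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The function name `kmeans_plus_plus`, the seed value, and the toy data are assumptions made for illustration, and the sketch assumes the data contain at least `k` distinct points.

```python
import random

def sq_dist(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans_plus_plus(data, k, rng):
    """Pick k initial centroids from data using k-means++ style seeding."""
    centroids = [rng.choice(data)]   # first centroid: uniform at random
    while len(centroids) < k:
        # Squared distance from each point to its nearest already-chosen centroid.
        weights = [min(sq_dist(x, mu) for mu in centroids) for x in data]
        # Sample the next centroid with probability proportional to that distance;
        # points that are already centroids have weight zero and cannot be redrawn.
        centroids.append(rng.choices(data, weights=weights, k=1)[0])
    return centroids

# Hypothetical toy data: two well-separated groups in 2-D.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
seeds = kmeans_plus_plus(data, k=2, rng=random.Random(0))
print(seeds)
```

Passing an explicit `random.Random` instance keeps the seeding reproducible, which also addresses the reproducibility concern that arises with multiple random restarts.
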
___

## What are the limitations of K-means?

K-means is effective for many clustering tasks, but several limitations can affect its performance. Understanding these constraints helps us recognize when K-means is appropriate and when alternative methods may be preferable.

> __Issues with K-means:__
>
> * __The number of clusters must be specified in advance__: K-means requires users to specify the number of clusters $K$ before running the algorithm. This is a fundamental choice with no automatic mechanism to determine the right value. If $K$ is too high, the algorithm may fragment natural groups into artificial subclusters. If $K$ is too low, distinct patterns may be merged together. Neither extreme is desirable, yet K-means provides no guidance.
> * __Sensitivity to initialization__: The algorithm is sensitive to the initial placement of centroids. Different random initializations can lead to convergence at different local minima, producing substantially different clustering results. In the example, you may observe how two runs with different starting centroids produce different final clusters, even on the same data. This variability makes results less reproducible unless a seed is set or a principled initialization strategy is used.
> * __Vulnerability to outliers__: Individual outliers can disproportionately pull centroid positions away from the main data clusters. Because the centroid is the mean of all assigned points, a single extreme value can shift it significantly. This causes nearby points to be misassigned and degrades cluster quality. In the example, observe whether any unusual data points distort the centroid positions. Handling outliers often requires preprocessing before clustering.
> * __Assumption of spherical, well-separated clusters__: K-means assumes clusters are roughly spherical and well-separated in the feature space. When clusters have different shapes, sizes, or overlap substantially, K-means may assign boundary points incorrectly. The decision boundary between two clusters is the perpendicular bisector of the line segment connecting their centroids. This linear boundary works well for spherical clusters but fails for elongated or crescent-shaped clusters.

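The outlier effect is easy to see with a small worked calculation. In this hypothetical one-dimensional cluster, a single extreme value moves the centroid (the mean) far from where most of the points lie:

```python
from statistics import mean

cluster = [1.0, 2.0, 3.0]           # hypothetical 1-D cluster
with_outlier = cluster + [100.0]    # same cluster plus one extreme value

print(mean(cluster))        # 2.0
print(mean(with_outlier))   # 26.5
```

The centroid shifts from 2.0 to 26.5, so it no longer represents the three original points well.
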
#### Feature scaling and data preprocessing

Before applying K-means, it is important to consider feature scaling. Since the algorithm uses Euclidean distance, features with larger scales (e.g., income in dollars) will dominate distance calculations compared to features with smaller scales (e.g., age in years). A common approach is to standardize features to zero mean and unit variance:

$$\mathbf{x}_{\text{scaled}} = \frac{\mathbf{x} - \boldsymbol{\mu}}{\boldsymbol{\sigma}}$$

where $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are the vectors of per-feature means and standard deviations computed over the whole dataset (not the cluster centroids), and the division is applied elementwise. Outliers should be identified and handled appropriately before clustering, as they can distort centroid calculations and degrade cluster quality.

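A minimal sketch of this standardization in plain Python is shown below. The function name `standardize`, the use of the population standard deviation (`pstdev`), the handling of constant features, and the customer values are assumptions made for illustration.

```python
from statistics import mean, pstdev

def standardize(data):
    """Scale each feature (column) to zero mean and unit variance."""
    columns = list(zip(*data))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    # Assumption: a constant feature (sigma == 0) is mapped to 0.0
    # instead of dividing by zero.
    return [tuple((v - mu) / s if s > 0 else 0.0
                  for v, mu, s in zip(row, mus, sigmas))
            for row in data]

# Hypothetical customers: (age in years, income in dollars).
customers = [(25, 30000.0), (35, 60000.0), (45, 90000.0)]
scaled = standardize(customers)
for row in scaled:
    print(row)
```

Before scaling, income differences of tens of thousands of dollars would swamp age differences of a few years in any Euclidean distance. After scaling, both features vary over the same range and contribute comparably.
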
### How many clusters should we choose?

Since we must specify $K$ in advance, we need principled methods to select it. Several approaches are available:

- **Elbow method**: Plot the within-cluster sum of squares (the objective function $J$) versus $K$. Look for an "elbow" or bend point where further increases in $K$ yield diminishing improvements in $J$. This method is visual and intuitive but subjective.
- **Silhouette method**: Measure how similar each point is to its own cluster compared to other clusters. Higher silhouette scores indicate better-defined clusters. This method is more objective than the elbow method.
- **Calinski-Harabasz index**: Computes the ratio of between-cluster to within-cluster variance. Higher values suggest more distinct clustering and more separation.

See the [Silhouette method](CHEME-5820-L1c-Advanced-SilhouetteScore-Spring-2026.ipynb) and [Calinski-Harabasz index](CHEME-5820-L1c-Advanced-CHI-Score-Spring-2026.ipynb) notebooks for detailed treatments of these approaches. Domain knowledge about the problem may also inform the choice of $K$, especially when multiple methods suggest different values.

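As a rough illustration of the silhouette method, the sketch below computes the mean silhouette coefficient for a given labeling. For each point, $a$ is the mean distance to the other points in its own cluster, $b$ is the smallest mean distance to the points of any other cluster, and the score is $(b - a)/\max(a, b)$. The function name `silhouette_score`, the score of zero for singleton clusters (a common convention), and the toy data are assumptions, and the sketch requires at least two clusters.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def silhouette_score(data, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for x, c in zip(data, labels):
        clusters.setdefault(c, []).append(x)
    scores = []
    for x, c in zip(data, labels):
        own = clusters[c]
        if len(own) == 1:
            scores.append(0.0)   # assumed convention: singleton clusters score 0
            continue
        # a: mean distance to the other points in the same cluster
        # (the distance from x to itself is zero, so divide by len(own) - 1).
        a = sum(euclidean(x, y) for y in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(sum(euclidean(x, y) for y in pts) / len(pts)
                for other, pts in clusters.items() if other != c)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data with a sensible labeling and a scrambled one.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
good = silhouette_score(data, [0, 0, 0, 1, 1, 1])
bad = silhouette_score(data, [0, 1, 0, 1, 0, 1])
print(good, bad)
```

The labeling that matches the two visible groups scores much higher than the scrambled one. To choose $K$, one would compute this score for the K-means labeling at each candidate $K$ and prefer the value with the highest score.
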
___

## Summary
K-means clustering partitions data into $K$ groups by iteratively assigning points to nearest centroids and updating centroid positions until convergence.

> __Key Takeaways:__
>
> * **Lloyd's algorithm alternates between assignment and update steps**: The algorithm assigns each data point to the nearest centroid using Euclidean distance, then recomputes centroids as the mean of assigned points. This process repeats until convergence.
> * **K-means requires specifying the number of clusters in advance**: The algorithm requires users to specify $K$ before clustering. Poor choices of $K$ lead to either overfitting (too many clusters) or underfitting (too few clusters).
> * **K-means has practical limitations**: Sensitivity to initial centroid placement, vulnerability to outliers, and difficulty with non-spherical or overlapping clusters are key limitations. Feature scaling should be applied before clustering. The elbow method, silhouette method, and Calinski-Harabasz index are useful for selecting the number of clusters, and running multiple random restarts improves solution quality.

K-means provides a scalable approach to unsupervised clustering but requires careful consideration of its assumptions and limitations.
___