# L1c: Unsupervised Learning and Clustering Approaches
Fill me in. 

> __Learning Objectives:__
> 
> By the end of this lecture, you should be able to:
> Three learning objectives go here.

Let's get started!
___

## Examples
Today, we will use the following examples to illustrate key concepts:

> [▶ Let's explore a collection of simple and expert agents](CHEME-5660-L15a-Example-Wolfram-NetworkSimulation-Fall-2025.ipynb). In this example, we build a homogeneous collection of Wolfram market agents using simple rules. These agents watch experts and mimic their actions, leading to emergent market dynamics. We analyze how the collective behavior of these agents influences market stability and price movements.

___

## Background: What is unsupervised learning and clustering?
Unsupervised learning is a branch of machine learning that deals with unlabeled data. Its goal is to discover hidden patterns and structures without predefined target variables. One of the most common tasks in unsupervised learning is clustering.

> __What is clustering?__
> 
> __Clustering__ is an unsupervised machine learning technique that organizes data points into groups, or clusters, based on their similarities without prior knowledge of the group memberships. This method is widely used for exploratory data analysis, enabling the discovery of patterns and relationships within complex datasets.

### Clustering approaches
Today, we'll consider [the K-means algorithm](https://en.wikipedia.org/wiki/K-means_clustering), arguably the most straightforward clustering algorithm. Despite its simplicity, we'll see that [K-means](https://en.wikipedia.org/wiki/K-means_clustering) has some shortcomings. In addition to the [K-means approach](https://en.wikipedia.org/wiki/K-means_clustering), there are several other clustering algorithms:
* __Hierarchical clustering__ is an unsupervised machine learning technique that organizes data points into a tree-like structure of nested clusters. This allows for the identification of relationships and patterns within the dataset. This method can be implemented using two main approaches: agglomerative, which merges individual points into larger clusters, and divisive, which splits a single cluster into smaller ones.
* __Density-based spatial clustering of applications with noise (DBSCAN)__ is a density-based clustering algorithm that groups closely packed data points while effectively identifying outliers, making it particularly useful for datasets with noise and clusters of arbitrary shapes. By defining clusters as dense regions separated by areas of lower density, DBSCAN can efficiently discover meaningful patterns in complex data distributions.
* __Gaussian mixture models (GMMs)__ are probabilistic models that represent a dataset as a combination of multiple Gaussian distributions, each characterized by its mean and covariance. This allows for the identification of underlying subpopulations within the data. This approach is useful in clustering and density estimation, providing a flexible framework for modeling complex, multimodal distributions.

## K-means clustering
The [K-means algorithm](https://en.wikipedia.org/wiki/K-means_clustering), originally developed by [Lloyd in the 1950s but not published until much later in 1982](https://ieeexplore.ieee.org/document/1056489), is our first example of $\texttt{unsupervised learning}$. Suppose we have a dataset $\mathcal{D}=\left\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{n}\in\mathbb{R}^{m}\right\}$, where each $\mathbf{x}_{i}\in\mathbb{R}^{m}$ is an $m$-dimensional feature vector.
[K-means](https://en.wikipedia.org/wiki/K-means_clustering) is a popular unsupervised machine learning algorithm for clustering data points (feature vectors) $\mathbf{x}\in\mathcal{D}$ into a set of distinct groups (clusters) $\mathcal{C} = \left\{\mathcal{c}_{1},\dots,\mathcal{c}_{K}\right\}$ based on _similarity_.

> __What is similarity?__
> 
> __Similarity__ refers to how _close_ data points are to each other in the feature space, i.e., how close $\mathbf{x}_{i}$ is to $\mathbf{x}_{j}$ under a distance or similarity measure $d(\mathbf{x},\mathbf{y})$. _Feature vectors that are close are assumed to be similar_. The most commonly used similarity measure in [K-means clustering](https://en.wikipedia.org/wiki/K-means_clustering) is [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance). However, we can use other types of measures. The choice of similarity measure can significantly impact the resulting clusters.
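
To see how the choice of measure can change which points count as "close", here is a minimal sketch comparing Euclidean distance with Manhattan (city-block) distance. The function names and the three sample vectors are illustrative assumptions, not part of the lecture's codebase.

```python
import math

def euclidean(x, y):
    """Straight-line (L2) distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """City-block (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical feature vectors chosen so the two measures disagree
x = [0.0, 0.0]
y = [3.0, 4.0]
z = [5.0, 0.5]

# Under Euclidean distance, y is the nearer neighbor of x
print("Euclidean:", euclidean(x, y), euclidean(x, z))

# Under Manhattan distance, z is the nearer neighbor of x
print("Manhattan:", manhattan(x, y), manhattan(x, z))
```

With these vectors, the nearest neighbor of `x` depends on the measure, so a clustering built on one measure need not match a clustering built on the other.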

Let's develop some pseudocode for the [K-means algorithm](https://en.wikipedia.org/wiki/K-means_clustering).

* __Initialize__. You provide the algorithm with the dataset $\mathcal{D}$, which contains `n` data vectors (measurements, observations, etc.) $\mathbf{x}_{1},\dots,\mathbf{x}_{n}$, where each vector $\mathbf{x}_{i}$ has `m` features, along with the number of clusters $K$ and an initial location for each cluster. Each cluster $\mathcal{c}_{k}$ is represented by a $\texttt{centroid}$, i.e., the mean of the set of data points in the cluster $\left\{\mathbf{x}_{i}\right\}_{i\in\mathcal{c}_{k}}$. This initial guess is then iteratively refined.
* __Update__. The [K-means algorithm](https://en.wikipedia.org/wiki/K-means_clustering) employs an iterative process in which data points are assigned to the nearest cluster centroid, and the centroids are subsequently updated based on the mean of the assigned points. This process continues until a predetermined stopping criterion is satisfied.
* __Stopping__. The [K-means algorithm](https://en.wikipedia.org/wiki/K-means_clustering) can terminate in several ways: when the cluster centroids no longer change significantly, when data points remain in the same clusters across iterations, or when a maximum number of iterations is reached.
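
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the course example: the function names (`squared_distance`, `mean`, `kmeans`), the tolerance `tol`, the tiny dataset, and the initial centroids are all hypothetical, and the similarity measure is squared Euclidean distance.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    m = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(m)]

def kmeans(data, centroids, max_iter=100, tol=1e-8):
    """Refine the initial centroids; return (centroids, assignments)."""
    centroids = [list(c) for c in centroids]  # copy the initial guess
    assignments = None
    for _ in range(max_iter):  # stopping rule: maximum number of iterations
        # Assignment step: each point joins its nearest centroid
        new_assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(x, centroids[k]))
            for x in data
        ]
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid; this is an assumption)
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [x for x, a in zip(data, new_assignments) if a == k]
            new_centroids.append(mean(members) if members else old)
        # Stopping rules: assignments unchanged, or centroids barely moved
        shift = max(squared_distance(c, n)
                    for c, n in zip(centroids, new_centroids))
        converged = new_assignments == assignments or shift < tol
        centroids, assignments = new_centroids, new_assignments
        if converged:
            break
    return centroids, assignments

# Hypothetical dataset: two well-separated groups of three points each
data = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
        [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]]
initial_centroids = [[0.0, 0.0], [10.0, 10.0]]  # user-supplied initial guess

centroids, assignments = kmeans(data, initial_centroids)
print(centroids)
print(assignments)
```

The sketch takes the initial centroids as an argument rather than drawing them at random, which keeps the run deterministic and makes the role of the initial guess explicit.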

Let's look at an example of the [K-means algorithm](https://en.wikipedia.org/wiki/K-means_clustering) in action.

> __Example:__
> 
> [▶ Let's explore a collection of simple and expert agents](CHEME-5660-L15a-Example-Wolfram-NetworkSimulation-Fall-2025.ipynb). In this example, we build a homogeneous collection of Wolfram market agents using simple rules. These agents watch experts and mimic their actions, leading to emergent market dynamics. We analyze how the collective behavior of these agents influences market stability and price movements.

___

## What are the problems with K-means?
Our K-means implementation works well on this sample customer spending dataset. However, what problems could we encounter in practice with an arbitrary dataset? Let's explore a few of these possible issues:
* __Specified number of clusters__. K-means requires the user to specify the number of clusters, $K$, in advance. This requirement poses a challenge, as choosing an inappropriate $K$ can lead to poor clustering results. For example, if $K$ is set too high, it may cause overfitting, where noise is regarded as distinct clusters. Conversely, setting $K$ too low may merge distinct groups and hide important structure in the data.
* __Sensitivity to initial conditions__. The K-means method is sensitive to the initial placement of centroids. Our implementation randomly initializes cluster centers, and different initializations can converge to different clusterings. This variability may affect the reproducibility of the clustering.
* __Sensitivity to outliers__. The presence of outliers can significantly compromise the accuracy of K-means clustering results. Outliers can skew the centroids' positions, resulting in misleading cluster assignments. Preprocessing steps may be required to address the impact of outliers.
* __Overlapping clusters__. In cases where clusters overlap, K-means does not have an intrinsic mechanism for handling uncertainty regarding which cluster a data point belongs to. This can result in ambiguous assignments and reduced clarity in cluster definitions. 
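
The outlier issue is easy to see with simple arithmetic. The sketch below, which uses a hypothetical four-point cluster and a single hypothetical outlier, computes the centroid (the mean) with and without the outlier. It illustrates only the centroid shift, not a full K-means run.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

# Hypothetical tight cluster and one far-away outlier
cluster = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
outlier = [50.0, 50.0]

clean = centroid(cluster)                # centroid of the tight cluster alone
skewed = centroid(cluster + [outlier])   # centroid once the outlier is included

print("without outlier:", clean)
print("with outlier:   ", skewed)
```

A single extreme point pulls the centroid well outside the region where the other four points sit, which is why later assignments based on that centroid can be misleading.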

### How many clusters should we choose?
Of K-means' shortcomings, the need to specify the number of clusters $K$ in advance can be addressed with several heuristic methods. These include the [elbow method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)), the [silhouette method](https://en.wikipedia.org/wiki/Silhouette_(clustering)), and cluster validity indices such as the [Davies-Bouldin index](https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index), the [Dunn index](https://en.wikipedia.org/wiki/Dunn_index), and the [Calinski-Harabasz index](https://en.wikipedia.org/wiki/Calinski%E2%80%93Harabasz_index).
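
As one illustration of these heuristics, the sketch below computes the mean silhouette coefficient for two candidate labelings of the same data. For each point, $a$ is the mean distance to the other members of its own cluster, $b$ is the smallest mean distance to the members of any other cluster, and the coefficient is $(b-a)/\max(a,b)$. The function name `silhouette_score`, the dataset, and both labelings are hypothetical; the labelings are written by hand to stand in for the output of a clustering run with $K=3$ and $K=2$.

```python
import math

def silhouette_score(data, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for x, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(x)
    scores = []
    for x, lab in zip(data, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the sum includes x itself, which contributes zero)
        a = sum(math.dist(x, y) for y in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(x, y) for y in pts) / len(pts)
            for other, pts in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical data: three compact groups of three points each
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 0.0], [10.0, 1.0], [11.0, 0.0],
        [5.0, 9.0], [5.0, 10.0], [6.0, 9.0]]

labels_k3 = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # one label per group
labels_k2 = [0, 0, 0, 1, 1, 1, 1, 1, 1]  # last two groups merged

score_k3 = silhouette_score(data, labels_k3)
score_k2 = silhouette_score(data, labels_k2)
print("K = 3:", score_k3)
print("K = 2:", score_k2)
```

Under the silhouette method, the candidate $K$ with the higher mean coefficient is preferred. In practice, the labelings would come from running the clustering algorithm for each candidate $K$ rather than being written by hand.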

___

## Summary
One concise summary sentence goes here.

> __Key Takeaways:__
>
> Three key takeaways go here.

One concise, direct concluding sentence goes here.
___