# Clustering Methods


_Summarized by QH_  
_First version: 2022-11-02_  
_Last updated on : 2022-11-02_  

The objective of clustering is to find homogeneous subgroups among the observations.

## K-Means Clustering

K-Means clustering partitions a dataset into $K$ distinct, non-overlapping clusters. The idea is to find a partition that makes the _within-cluster variation_ as small as possible.

The most common way to define the _within-cluster variation_ is through Euclidean distance.
* Euclidean distance: $\sqrt{\sum_{j=1}^p (x_{ij} - x_{i'j})^2}$ between observations $i$ and $i'$.

The _within-cluster variation_ for cluster $k$ is then defined as the sum of the pairwise squared Euclidean distances between the observations in that cluster, divided by the cluster size:
$$ W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2$$  

where $|C_k|$ denotes the number of observations in the $k$th cluster.

Our objective is to minimize the total _within-cluster variation_ over all $K$ clusters:
$$ \min_{C_1, \cdots, C_K} \sum_{k=1}^K W(C_k) $$
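The definitions above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names and the toy data are illustrative and not part of the original notes. The sum runs over all ordered pairs $(i, i')$, as in the formula for $W(C_k)$.

```python
import numpy as np

def within_cluster_variation(X_k):
    """W(C_k): squared Euclidean distances summed over all ordered pairs, divided by |C_k|."""
    diffs = X_k[:, None, :] - X_k[None, :, :]  # shape (m, m, p): every pair (i, i')
    return (diffs ** 2).sum() / len(X_k)

def total_within_cluster_variation(X, labels):
    """The K-Means objective: the sum of W(C_k) over all clusters present in labels."""
    return sum(within_cluster_variation(X[labels == k]) for k in np.unique(labels))

# Toy data (hypothetical): four observations, two features.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])      # groups nearby points together
alt_labels = np.array([0, 1, 0, 1])  # groups distant points together

print(total_within_cluster_variation(X, labels))
print(total_within_cluster_variation(X, alt_labels))
```

On this toy data the first grouping gives a total of 20 and the second gives 204, so the objective prefers the partition that keeps nearby observations together.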



The algorithm, for a dataset with $n$ observations and $p$ features:
1. Randomly initialize $K$ _centroids_.
2. Iterate until the cluster assignments stop changing:  
    a. Assign each observation to the closest cluster centroid, where closeness is measured by Euclidean distance.  
    b. For each of the $K$ clusters, recompute the cluster _centroid_. The $k$th cluster centroid is a vector of $p$ values, one per feature, each being the mean of that feature over the observations currently assigned to cluster $k$.
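The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reference implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, the choice to initialize centroids by sampling $K$ observations, and the handling of empty clusters are all assumptions not stated in the notes.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1 (assumed initialization): use K distinct observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2a: assign each observation to its closest centroid.
        # Squared distance gives the same argmin as Euclidean distance.
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Step 2b: recompute each centroid as the per-feature mean of its members.
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# Toy data (hypothetical): two well-separated groups of three observations each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centroids = kmeans(X, K=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label values.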

Notes:
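* The algorithm is guaranteed never to increase the total _within-cluster variation_: step 2b sets each centroid to the cluster mean, which minimizes the sum of squared deviations for the current assignment, and step 2a moves each observation to its closest centroid, which cannot increase that sum. The link to $W(C_k)$ is the identity below, where $\bar{x}_{kj}$ is the mean of feature $j$ in cluster $k$:
$$ \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2 = 2\sum_{i \in C_k} \sum_{j=1}^p (x_{ij} - \bar{x}_{kj})^2$$ 
* The algorithm may converge to a local rather than a global optimum, and which one it reaches depends on the initialization.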
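
The identity relating pairwise distances to deviations from the cluster mean can be derived in a few lines. As a sketch, fix a feature $j$ and write $a_i = x_{ij} - \bar{x}_{kj}$ (a shorthand introduced here, not used elsewhere in these notes), so that $\sum_{i \in C_k} a_i = 0$ and $x_{ij} - x_{i'j} = a_i - a_{i'}$. Then:

$$ \sum_{i, i' \in C_k} (a_i - a_{i'})^2 = \sum_{i, i' \in C_k} \left( a_i^2 + a_{i'}^2 \right) - 2 \Big( \sum_{i \in C_k} a_i \Big) \Big( \sum_{i' \in C_k} a_{i'} \Big) = 2 |C_k| \sum_{i \in C_k} a_i^2 $$

Dividing by $|C_k|$ and summing over $j = 1, \cdots, p$ gives:

$$ \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^p (x_{ij} - \bar{x}_{kj})^2 $$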

Advantages:
* Easy and straightforward to implement and understand

Drawbacks:
* The number of clusters $K$ must be specified in advance. 
    * We can use the __elbow method__ on the total _within-cluster variation_, or the _average silhouette score_, to choose $K$.
    * Business knowledge should be used together with these mathematical methods when choosing $K$.
* The $K$ centroids must be initialized, and different initializations may lead to different local optima. 
    * It is advisable to run the algorithm several times with different initializations and keep the run that minimizes the total _within-cluster variation_.

### Elbow method
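
The elbow method fits K-Means for a range of $K$ values, records the total within-cluster variation for each, and looks for the value of $K$ after which further increases yield only small improvements. The following is a minimal sketch using scikit-learn's `KMeans` and `silhouette_score`; the synthetic data, the range of $K$, and the variable names are assumptions made for illustration. Note that scikit-learn's `inertia_` is the sum of squared distances to the closest centroid, which by the identity in the notes above equals half of $\sum_k W(C_k)$, so it produces the same elbow.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data (hypothetical): three well-separated Gaussian blobs in 2D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [0.0, 8.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

inertias, silhouettes = {}, {}
for k in range(1, 7):
    # n_init=10 reruns K-Means with 10 initializations and keeps the best run.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    if k >= 2:  # the silhouette score is undefined for a single cluster
        silhouettes[k] = silhouette_score(X, model.labels_)

for k in inertias:
    print(k, round(inertias[k], 2), round(silhouettes.get(k, float("nan")), 3))

# One simple way to pick K from the silhouette scores: take the maximum.
best_k = max(silhouettes, key=silhouettes.get)
print("K with the highest average silhouette score:", best_k)
```

Plotting `inertias` against $K$ (for example with matplotlib) shows the elbow visually. The elbow is a heuristic and is often less sharp on real data than on well-separated synthetic blobs.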


## Agglomerative Hierarchical Clustering



# References
1. Introduction to Statistical Learning
2. Machine Learning Course by Andrew Ng on Coursera
3. Scikit-Learn Online Clustering documents