<font size="6"><b>CLUSTERING: BASICS</b></font>

<font size="5"><b>Serhat Çevikel</b></font>

In [None]:
library(tidyverse)
library(data.table)
library(ggConvexHull)
library(BBmisc)
library(fields)
library(plotly)

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # for limiting the number of top and bottom rows of tables printed 

In [None]:
#options(repr.plot.width = 15, repr.plot.height = 15)

![xkcd](../imagesba/chat_systems.png)

(https://xkcd.com/1810/)

In this session we will see two different clustering algorithms: K-means clustering and hierarchical clustering.

Clustering or cluster analysis is a method of unsupervised learning, a sub-domain of machine learning.

In supervised learning, a model is trained to fit the labels - or a response variable - of data. In unsupervised methods, the aim is not to fit or predict a response variable but to learn patterns in unlabeled data.

(https://en.wikipedia.org/wiki/Unsupervised_learning)

Cluster analysis, or clustering, is a data analysis technique aimed at partitioning a set of objects into groups such that objects within the same group (called a cluster) exhibit greater similarity to one another (in some specific sense defined by the analyst) than to those in other groups (clusters).

(https://en.wikipedia.org/wiki/Cluster_analysis)

First we will again simulate a toy dataset to demonstrate hierarchical and k-means clustering algorithms.

# Data Generation and Preparation

We first generate data which conform to a data generation process such that the data is drawn from three different bivariate normal distributions, so the true structure of the data is comprised of three clusters:

In [None]:
km <- 3 # number of clusters for data generation
sizex <- 100 # observations

In [None]:
# bivariate data, x and y values
# sample cluster means from normal, cluster sd's from exponential distribution
# multivariate cluster means are known as centroids
set.seed(15)
means1 <- rnorm(km, 0, 1.2)
sds1 <- rexp(km, 1.2)
means2 <- rnorm(km, 0, 2)
sds2 <- rexp(km, 1.5)

In [None]:
# randomly assign each observation to a cluster
set.seed(20)
datax <- data.table(clstr = sample(km, sizex, replace = T))

In [None]:
# add the x/y means and sds of each observation by its cluster
datax[, (c("meanx1", "sdx1")) := .(means1[clstr], sds1[clstr])]
datax[, (c("meanx2", "sdx2")) := .(means2[clstr], sds2[clstr])]

In [None]:
# sample x/y values using corresponding mean and sd of the respective cluster
set.seed(30)
datax[, xval := rnorm(.N, meanx1, sdx1)]
datax[, yval := rnorm(.N, meanx2, sdx2)]

In [None]:
# z-score normalize x and y values
datax2 <- datax %>% select(clstr, xval, yval) %>%
mutate_at(c("xval", "yval"), normalize) %>% copy()
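
As a small illustration of what z-score normalization does, the sketch below standardizes a hypothetical vector `v` by hand and compares the result with base R's `scale()`. It assumes that `normalize()` from BBmisc is used with its default method, which standardizes to zero mean and unit standard deviation; the vector and the variable names are made up for this example.

```r
v <- c(2, 4, 6, 8)                 # hypothetical raw values

z_manual <- (v - mean(v)) / sd(v)  # subtract the mean, divide by the sample sd
z_scale  <- as.vector(scale(v))    # base R helper that centers and scales a column

z_manual
mean(z_manual)  # numerically zero
sd(z_manual)    # one
```

After this transformation both coordinates are on the same scale, so neither dominates the Euclidean distances used later.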

In [None]:
datax2[, cl := .I] # create initial clusters to update later with algorithms

Let's visualize the data with the original clusters:

In [None]:
# visualize original clusters
ggplotly(datax2 %>%
ggplot(aes(x = xval, y = yval, color = as.factor(clstr), fill = as.factor(clstr), size = 5)) +
geom_point() +
geom_convexhull(alpha=.5, aes(color = NULL, fill = as.factor(clstr)))
        , height = 800)

# K-Means Clustering

K-means clustering is a method that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid). k-means clustering minimizes within-cluster variances.

(https://en.wikipedia.org/wiki/K-means_clustering)
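
The objective can be written compactly as follows. The notation is an assumption made for this note and does not appear elsewhere in the notebook: $S_1, \dots, S_k$ are the clusters, $x_i$ the observations and $\mu_j$ the centroid of cluster $S_j$.

$$
\underset{S_1, \dots, S_k}{\arg\min} \; \sum_{j=1}^{k} \sum_{x_i \in S_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i
$$

The inner sum is the within-cluster sum of squared distances to the centroid, which is proportional to the within-cluster variance.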

After an initialization step, k-means iterates two steps, in a scheme that resembles expectation-maximization (EM):

1) Initial cluster centroids (multivariate means) are randomly chosen
2) Each observation is assigned to the cluster with the closest centroid (expectation step)
3) With the new assignments, the cluster centroids are recalculated (maximization step)

Steps 2-3 are repeated until convergence or until the maximum number of iterations is reached.
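
A minimal sketch of one pass through steps 2 and 3, written in base R on a tiny hypothetical dataset so that the numbers can be checked by hand. The points, the starting centroids and all variable names are assumptions for illustration; this is not the implementation used later in the notebook.

```r
# toy data: four points in two obvious groups (hypothetical values)
pts   <- rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))
cents <- rbind(c(1, 1), c(9, 9))  # assumed starting centroids

# step 2 (assignment): squared Euclidean distance of every point to every centroid
d2 <- sapply(seq_len(nrow(cents)),
             function(j) rowSums(sweep(pts, 2, cents[j, ])^2))
assign_cl <- apply(d2, 1, which.min)  # index of the closest centroid per point

# step 3 (update): each centroid becomes the mean of the points assigned to it
new_cents <- t(sapply(seq_len(nrow(cents)),
                      function(j) colMeans(pts[assign_cl == j, , drop = FALSE])))

assign_cl  # 1 1 2 2
new_cents  # rows: (0, 0.5) and (10, 10.5)
```

Repeating the two steps with `new_cents` in place of `cents` would leave the assignments unchanged here, which is the convergence condition.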

Now for Step 1, from the range of x and y values, initial centroids - multivariate centers of clusters - are drawn with uniform distribution. We have two objects, one for the data, the other for the centroids:

In [None]:
ranges <- lapply(datax2 %>% select(xval, yval), range) # ranges of x and y values of observations are calculated
k <- 3 # cluster number is given

# initial centroids are randomly sampled from uniform distribution
set.seed(40)
centroids <- lapply(ranges, function(x) runif(k, x[1], x[2]))
cent_dt <- cbind(as.data.table(centroids), cntr = 1, cl = 1:k) # make centroids a data table, with a centroid flag and cluster ids
datax3 <- copy(datax2) # create a deep copy of data

## One iteration of algorithm

We first step into how a single iteration of the algorithm is implemented.

In [None]:
iterx <- 1 # the iteration count

In [None]:
maxtimes <- 20 # maximum number of iterations

We track the cluster assignments at each iteration and how many observations shifted clusters. When all observations have settled in their respective clusters, we can stop the algorithm early:

In [None]:
deltas <- rep(NA, maxtimes) # object to hold number of observations that shift clusters in each iteration
classes <- rep(Inf, datax3[, .N]) # initiate object to hold cluster assignments

We record the iteration number in the objects:

In [None]:
cent_dt[, iter := iterx]
datax3[, iter := iterx]

Although the expectation step (cluster assignment) comes first among the iterative steps of the algorithm, the loop body here begins with the maximization step (recalculation of the centroids), because we start from random centroids rather than from cluster assignments. In the first iteration the random centroids are used as they are; in later iterations the new centroid values are assigned to the centroid data.table:

In [None]:
# the steps are reversed in this implementation since we already start with centroids
# in the first iteration do not recalculate centroids, use the random ones
if (iterx > 1) cent_dt <- datax3[, lapply(.SD, mean), by = cl, .SDcols = c("xval", "yval")][order(cl)]

For the expectation step, we have to calculate the distance of each observation to each centroid and then find the closest one for each value:

In [None]:
# calculate distance of each observation to each centroid
dists <- fields::rdist(datax3 %>% select(xval, yval), cent_dt %>% select(xval, yval)) # calculate the distances of each point to each centroid

# get new clusters of each observation
classes2 <- max.col(-dists) # get the closest centroid for each point, shuffles classes

classes2
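
To make the `max.col(-dists)` idiom more concrete, the sketch below applies it to a small hypothetical distance matrix `d` (rows are observations, columns are centroids; the values are made up). Negating the distances turns "smallest distance" into "largest value", so `max.col()` returns the column of the nearest centroid. Note that the default of `max.col()` is `ties.method = "random"`; the sketch uses `"first"` so that the result is deterministic.

```r
# hypothetical 3 x 2 distance matrix: rows are observations, columns are centroids
d <- rbind(c(0.5, 2.0),
           c(3.0, 1.0),
           c(1.5, 1.5))  # the third observation is equally far from both centroids

nearest_first <- max.col(-d, ties.method = "first")  # deterministic tie-breaking
nearest_apply <- apply(d, 1, which.min)              # which.min also picks the first minimum

nearest_first  # 1 2 1
nearest_apply  # 1 2 1
```

With continuous data exact ties are unlikely, so the choice of tie-breaking rule rarely matters in practice.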

We add the cluster assignments to the dataset, which completes the expectation step:

In [None]:
# add cluster assignments to data
datax3[, cl := classes2]

Next comes the maximization step again. The new centroids are only printed, not assigned, since this is a demonstration of a single iteration:

In [None]:
# calculate new centroids. they are not assigned to cent_dt object since this is just a demonstration of a single iteration
datax3[, lapply(.SD, mean), by = cl, .SDcols = c("xval", "yval")][order(cl)]

Check how many values changed cluster. If there are no more cluster changes, the algorithm has converged:

In [None]:
# calculate the number of cluster changes
changes <- sum(classes2 != classes) # get the number of class changes
changes

Since the cluster assignments are initiated with Inf values, in the first iteration all the values change clusters. This will stabilize in subsequent iterations.

And some more steps that we will use in the actual run of the algorithm:

In [None]:
deltas[iterx] <- changes # log the change count
if (changes == 0) break # if no more class changes stop
classes <- classes2 # backup classes to compare in the next iteration

## Multi iteration algorithm

Now let's run the algorithm from the beginning until either the maximum number of iterations is reached or there are no more changes in cluster assignments.

Since it will involve multiple steps, we initiate empty lists to hold the results of each iteration, to visualize better afterwards:

In [None]:
# initiate empty lists to hold centroid and cluster assignments in each iteration
cent_l <- list()
datal_k <- list()

And create the initial centroid table and a fresh copy of the data:

In [None]:
cent_dt <- cbind(as.data.table(centroids), cntr = 1, cl = 1:k) # create centroid table
datax3 <- copy(datax2 %>% select(-clstr, -cl)) # create copy of data

To track each observation separately in visualization we create an ID from them:

In [None]:
# create observation id's for further manipulation for visualizations
datax3[, id := .I]

Maximum iteration number for early stop:

In [None]:
# maximum iteration to stop before convergence
maxtimes <- 20

And empty objects to track the cluster assignments and number of cluster changes in each iteration:

In [None]:
# initiate objects to hold reassignment numbers and assigned clusters
deltas <- rep(NA, maxtimes)
classes <- rep(Inf, datax3[, .N])

Now this is the main loop - the steps provided in the previous section bundled together into a loop:

In [None]:
# main loop
for (iterx in 1:maxtimes)
{
    datax3[, iter := iterx]
    # recalculate centroids: maximization step
    if (iterx > 1) cent_dt <- datax3[, lapply(.SD, mean), by = cl, .SDcols = c("xval", "yval")][order(cl)]
    cent_dt[, (c("iter", "cntr")) := .(iterx, 1)]
    
    # recalculate observation/centroid distances
    dists <- fields::rdist(datax3 %>% select(xval, yval), cent_dt %>% select(xval, yval)) # calculate the distances of each point to each centroid
    classes2 <- max.col(-dists) # get the closest centroid for each point, shuffles classes
    changes <- sum(classes2 != classes) # get the number of class changes
    deltas[iterx] <- changes # log the change count
    datax3[, cl := classes2]
    classes <- classes2 # backup classes
    
    # save the states into lists
    cent_l[[iterx]] <- copy(cent_dt)
    datal_k[[iterx]] <- copy(datax3)

    if (changes == 0) break # if no more class changes stop
}

See how the algorithm converged by checking the number of cluster changes in each iteration. Note that we stop when the number of changes is 0:

In [None]:
deltas # number of cluster reassignments in each iteration. Note that algorithm converges at the 6th iteration

And total number of iterations until convergence:

In [None]:
iterx

Since we did not wait until the maximum number of iterations, let's trim the empty parts of the lists:

In [None]:
# get the part of lists until convergence
cent_l <- cent_l[1:iterx]
datal_k <- datal_k[1:iterx]

And combine the lists into single tables, in which each iteration's values are tracked with the `iter` column:

In [None]:
# combine list parts into single data tables
cent_all <- rbindlist(cent_l, fill = T)
data_all_k <- datal_k %>% rbindlist(fill = T)

In the visualization it would be better to view the data and the centroids together:

In [None]:
# combine  observations with centroids for the sake of visualization
data_all_k2 <- rbind(data_all_k, cent_all, fill = T)

The following steps prepare the data for a better visualization; they are not part of the main algorithm:

In [None]:
# some reordering and cleaning
setorder(data_all_k2, iter, cntr, id)
data_all_k2[is.na(cntr), cntr := 0]

In [None]:
# record forward and backward cluster changes for the purposes of visualization
data_all_k2[, chng := cl != lead(cl), by = id]
data_all_k2[is.na(chng), chng := F]
data_all_k2[, chng2 := cl != lag(cl), by = id]
data_all_k2[is.na(chng2), chng2 := F]
data_all_k2[, chng := chng * 2 + chng2]
data_all_k2[, chng := pmin(chng, 2)]
data_all_k2[, chng2 := NULL]
data_all_k2[cntr == 1, chng := NA]
data_all_k2[, cntr := cntr]
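
The change coding above can be hard to follow, so here is a base R sketch of the same idea for a single hypothetical observation tracked over five iterations. The vector `cl` and the helper names are assumptions for illustration; `c(cl[-1], NA)` and `c(NA, cl[-n])` stand in for `lead()` and `lag()`.

```r
cl <- c(1, 2, 3, 3, 3)  # cluster of one hypothetical observation over five iterations
n  <- length(cl)

leaves_next <- cl != c(cl[-1], NA)  # will the cluster differ in the next iteration?
came_prev   <- cl != c(NA, cl[-n])  # did the cluster differ in the previous iteration?
leaves_next[is.na(leaves_next)] <- FALSE
came_prev[is.na(came_prev)]     <- FALSE

# 2 = about to shift (also when it has just shifted), 1 = just shifted, 0 = stable
chng <- pmin(leaves_next * 2 + came_prev, 2)
chng  # 2 2 1 0 0
```

In the plot this code is used as the stroke width, so points about to shift get the boldest outline.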

In [None]:
# set hex colors fixed to ensure consistency across iterations
colrs <- data_all_k2[, uniqueN(cl)]
set.seed(10)
colrs_hex <- do.call(rgb, replicate(3, runif(colrs), simplify = F))
data_all_k2[, clr := colrs_hex[cl]]

In [None]:
# set the order for color consistency
setorder(data_all_k2, iter, cl, cntr)

In [None]:
# serialize the data for shiny app
#saveRDS(data_all_k2, "~/databa/rds/data_all_k2.rds")

## Visualization

See all iterations in facets:

- The larger and more transparent points are centroids
- The polygons are convex hulls of clusters
- The points that will shift in the next iteration are circled with bolder lines
- The points that shifted since the last iteration are circled with lighter lines

In [None]:
p1 <- data_all_k2 %>%
mutate_at("cl", factor) %>%
ggplot(aes(x = xval, y = yval, fill = cl, size = 2 + cntr * 5)) +
geom_point(shape=21, color = "black", aes(stroke = chng * 0.5, fill = cl, alpha = 1 - cntr * 0.5)) +
scale_size_identity() +
geom_convexhull(alpha= 0.2, aes(color = NULL, fill = cl)) +
scale_fill_manual(labels = unique(data_all_k2$cl), values = unique(data_all_k2$clr)) +
scale_alpha_identity() +
facet_wrap(~ iter, ncol = 2)

In [None]:
ggplotly(p1, height=800)

Now you can:
- Open a shiny interface from launcher
- Navigate to 12_clustering/apps directory and select the "K-Means Clustering" tab to interact with an animation of iterations
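
As a possible cross-check, base R ships a `kmeans()` function in the stats package. The sketch below runs it on a small hypothetical dataset with user-supplied starting centroids. `algorithm = "Lloyd"` is chosen on the assumption that it is the variant closest to the assign-then-update loop written above; the default of `kmeans()` is a different variant (Hartigan-Wong). The data and the choice of starting rows are made up for this example.

```r
# hypothetical toy data: three well separated groups of points
x <- rbind(cbind(c(0, 0.2, 0.1, 0.3), c(0, 0.1, 0.3, 0.2)),
           cbind(c(5, 5.2, 5.1),      c(5, 5.1, 5.3)),
           cbind(c(0, 0.1, 0.2),      c(5, 5.2, 5.1)))
init <- x[c(1, 5, 8), ]  # assumption: one starting centroid taken from each group

km <- kmeans(x, centers = init, algorithm = "Lloyd", iter.max = 20)

km$cluster       # cluster assignment of each point
km$centers       # final centroids
km$tot.withinss  # total within-cluster sum of squares, the quantity being minimized
```

When the starting centroids are passed as a matrix, the cluster numbers follow the row order of that matrix, which makes results easier to compare with a hand-written loop.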

# Hierarchical Clustering

Hierarchical Clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. The kind of hierarchical clustering we will study is "agglomerative clustering".

Agglomerative clustering, often referred to as a "bottom-up" approach, begins with each data point as an individual cluster. At each step, the algorithm merges the two most similar clusters based on a chosen distance metric (e.g., Euclidean distance) and linkage criterion (e.g., single-linkage, complete-linkage).
This process continues until all data points are combined into a single cluster or a stopping criterion is met.

(https://en.wikipedia.org/wiki/Hierarchical_clustering)
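
To illustrate how the linkage criterion changes the distance between two clusters, the sketch below computes three common variants for two tiny hypothetical one-dimensional clusters. The values and names are assumptions for illustration only.

```r
a <- c(0, 1)  # hypothetical cluster A (one-dimensional for simplicity)
b <- c(3, 6)  # hypothetical cluster B

pair_d <- abs(outer(a, b, "-"))  # all pairwise distances between A and B

single_link   <- min(pair_d)             # closest pair of points: 2
complete_link <- max(pair_d)             # farthest pair of points: 6
centroid_link <- abs(mean(a) - mean(b))  # distance between the centroids: 4
```

The same pair of clusters can therefore look closer or farther apart depending on the criterion, which in turn can change the order of merges.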

In this algorithm, all observations are initiated as separate clusters of their own.

Then pairwise distances between clusters are calculated. There are several options, but here the simplest one, the centroid approach, will be followed, so distances between cluster centroids will be calculated.

In each iteration the two closest clusters will be merged into one.

At the end all observations will be merged into a single cluster.

So no single number of clusters is fixed in advance; any number of clusters can be selected from among the iterations for the final analysis.
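
For reference, base R provides `hclust()` and `cutree()` in the stats package, which can build the full merge hierarchy and then cut it at any number of clusters. The sketch below is run on hypothetical toy data. The `hclust()` documentation indicates that the `"centroid"` method is meant to be used with squared Euclidean distances, hence `dist(x)^2`; whether it reproduces a hand-written centroid merge exactly is not verified here.

```r
# hypothetical toy data: three well separated groups of points
x <- rbind(cbind(c(0, 0.2, 0.1, 0.3), c(0, 0.1, 0.3, 0.2)),
           cbind(c(5, 5.2, 5.1),      c(5, 5.1, 5.3)),
           cbind(c(0, 0.1, 0.2),      c(5, 5.2, 5.1)))

hc <- hclust(dist(x)^2, method = "centroid")  # full hierarchy, one merge per step

grp3 <- cutree(hc, k = 3)  # cut the hierarchy where three clusters remain
grp3
nrow(hc$merge)             # number of merges: one fewer than the number of points
```

`cutree()` can be called again with a different `k` on the same `hc` object, without re-running the clustering.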

Now let's prepare the data, initiate the objects and make the initial cluster assignments, such that each observation gets its own sequential cluster ID.

This object will hold which clusters will be merged in each iteration:

In [None]:
# initiate an empty data.table that will hold which clusters to be merged in each iteration
agglo_dt <- NULL

In [None]:
# create a deep copy of data
datax3 <- copy(datax2)
datax3[, clstr := NULL]

In [None]:
# observations are initially assigned to sequential cluster ids
datax3[, cl := .I]

To compute weighted means for the centroids, we have to track the cluster sizes, which are initialized to 1:

In [None]:
# in order to get weighted means for centroids, the size of clusters will be tracked.
# they are initiated with 1
datax3[, N := 1]

## One iteration of algorithm

Now let's start with a single iteration:

In [None]:
iterx <- 1 # iteration number

All pairwise distances are calculated. Initially each observation is a separate cluster, so this is basically pairwise distances between all observations:

In [None]:
# pairwise distances are calculated
distx <- datax3 %>% select(xval, yval) %>% dist

Cluster pairs are listed into a table (in subsequent iterations the number of clusters will be fewer):

In [None]:
# two-way combinations of remaining clusters are saved into a table
dist_dt <- t(combn(datax3$cl, 2)) %>% as.data.table

The index of the minimum distance is taken; the respective clusters will be merged:

In [None]:
# the index of minimum distance is taken
minind <- which.min(distx)[1]

We record the iteration count and the ID's of clusters to be merged:

In [None]:
# the iteration, and the id's of clusters to be merged are assigned to an object
aggx <- dist_dt[minind, .(iterx, V1, V2)]
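
The lookup above relies on the pairs from `combn()` being listed in the same order as the values stored in a `dist` object. A small base R sketch on four hypothetical one-dimensional observations shows the correspondence; the data and names are made up for this example.

```r
x <- c(0, 1, 3, 7)               # four hypothetical one-dimensional observations
d <- dist(x)                     # pairwise distances stored as a flat vector
pairs <- t(combn(length(x), 2))  # row i holds the two ids of the i-th pair

cbind(pairs, d = as.vector(d))   # ids side by side with their distance
closest <- pairs[which.min(d), ] # ids of the closest pair
closest                          # 1 2
```

Because both objects enumerate the pairs as (1,2), (1,3), (1,4), (2,3), ..., the position returned by `which.min()` can be used directly as a row index into the pair table.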

And we append this info to the respective object that tracks which clusters are merged in each iteration:

In [None]:
# the newly merged cluster info is appended
agglo_dt <- rbind(agglo_dt, aggx)
agglo_dt

Now we update the cluster info in the dataset. Of the pair of clusters to be merged, the higher cluster ID is changed to the lower one, for convenience and consistency across iterations:

In [None]:
# higher cluster id is converted to lower cluster id, for convenience
datax3[cl == aggx$V2, cl := aggx$V1]

Let's check the dimension of the data before completing the cluster merge:

In [None]:
dim(datax3)

Here we aggregate cluster centroids by weighting the centroid values of pre-merge clusters with their respective sizes:

In [None]:
# the total size and weighted centroid of the two merged clusters are calculated, along with the other clusters
# cl is kept in the result so that the cluster ids remain available
datax3 <- datax3[, .(N = sum(N), xval = sum(xval * N), yval = sum(yval * N)), by = cl][, .(N, xval = xval / N, yval = yval / N, cl)]

And check the dimensions again:

In [None]:
dim(datax3)

Since we merged the two clusters with the lowest distance into a single cluster, the number of clusters has decreased by one.

So basically `datax3` object will start with separate observations but in time these observations will be merged into clusters until we have a single remaining row.
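
The size-weighted centroid update can be checked by hand on two hypothetical clusters. In the sketch below, a cluster of size 2 at (1, 1) is merged with a cluster of size 1 at (4, 7); the table `cl_tab` and the variable names are assumptions for illustration.

```r
# two hypothetical clusters about to be merged
cl_tab <- data.frame(N = c(2, 1), xval = c(1, 4), yval = c(1, 7))

merged_N <- sum(cl_tab$N)                           # 3
merged_x <- sum(cl_tab$xval * cl_tab$N) / merged_N  # (1*2 + 4*1) / 3 = 2
merged_y <- sum(cl_tab$yval * cl_tab$N) / merged_N  # (1*2 + 7*1) / 3 = 3

# the same result with base R's weighted.mean()
weighted.mean(cl_tab$xval, cl_tab$N)
```

Weighting by size makes the merged centroid equal to the plain mean of all original observations in the merged cluster, without having to keep those observations around.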

## Multi iteration algorithm

Now let's start from the beginning where each observation constitutes a separate cluster until all observations are combined into a single cluster.

Initiate the objects:

In [None]:
# create a deep copy of data
datax3 <- copy(datax2)
datax3[, clstr := NULL]

In [None]:
# observations are initially assigned to sequential cluster ids
datax3[, cl := .I]

In [None]:
# in order to get weighted means for centroids, the size of clusters will be tracked.
# they are initiated with 1
datax3[, N := 1]

And initiate the table that will hold the information on which clusters to be merged in each iteration:

In [None]:
# initiate an empty data.table that will hold which clusters to be merged in each iteration
agglo_dt <- NULL

Initiate the iterations:

In [None]:
# initiate number of iterations
iterx <- 1

This is the main loop that will run until number of clusters is 1:

In [None]:
# main loop. will continue as long as number of clusters is > 1
while(datax3[, .N] > 1)
{
    distx <- datax3 %>% select(xval, yval) %>% dist # distances
    dist_dt <- t(combn(datax3$cl, 2)) %>% as.data.table # two-way combinations of cluster id's
    minind <- which.min(distx)[1] # get the index of minimum distance
    aggx <- dist_dt[minind, .(iterx, V1, V2)] # get the cluster id's to be merged
    agglo_dt <- rbind(agglo_dt, aggx) # append the merged clusters' info
    datax3[cl == aggx$V2, cl := aggx$V1] # merge cluster id's
    # recalculate weighted centroids and total sizes
    datax3 <- datax3[, .(N = sum(N), xval = sum(xval * N), yval = sum(yval * N)), by = cl][, .(N, xval = xval / N, yval = yval / N, cl)]
    iterx <- iterx + 1 # increment iteration
}

At the end a single cluster remains, holding all observations:

In [None]:
datax3 # the last iteration leaves a single cluster holding all observations

## Visualization

### Data wrangling for visualization

Now we will create an object that will pair the cluster id of each observation at each iteration with the initial cluster id.

The steps are for a better visualization at the end.

The details are not critical, so don't worry about what each line does. Basically we track which cluster each original observation is assigned to in each subsequent iteration, until a single cluster remains at the end. You will see the final object:

In [None]:
datal <- rep(list(NULL), datax2[, .N]) # list with empty objects

hier_dt <- datax2 %>% select(cl) %>% mutate(cl0 = cl) %>% copy # get initial clusters for each observation
hier_dt[, iter := 0] # initiate iteration at zero
hier_dt[, linex := as.integer(NA)] # this will be the width of lines to show the observations inside the clusters to be merged

datal[[1]] <- copy(hier_dt) # cluster at initial iteration are saved 

In [None]:
# main loop through iterations
for (i in agglo_dt[, .I])
{
    aggx <- agglo_dt[i] # the id's of clusters to be merged
    hier_dt <- copy(hier_dt) # create a deep copy of the previous state of observation clusters. we need it otherwise all previous iterations will also be updated
    hier_dt[, linex := NA] # line width info that shows merged cluster membership is reset
    hier_dt[, iter := i] # save iteration
    hier_dt[cl == aggx$V2, cl := aggx$V1] # merge cluster id's
    if(i < agglo_dt[, .N]) # other than the last iteration
    {
        aggx2 <- agglo_dt[i + 1] # get the next iterations cluster id's to be merged
        hier_dt[cl == aggx2$V2, linex := 2] # the cluster to be merged will have bolder lines
        hier_dt[cl == aggx2$V1, linex := 1] # the cluster to merge will have lighter lines
    }    
    datal[[i + 1]] <- hier_dt # save the state
}

Here we see the cluster assignments of each observation in each iteration. We start with a separate cluster for each observation and at the end all observations will be assigned to the same remaining single cluster:

In [None]:
# combine all states into a single table
hier_all <- datal %>% rbindlist

# view the cluster assignments and changes in a wide table
hier_all %>% mutate(iter = paste("i", iter, sep = "")) %>%
dcast(cl0 ~ iter, value.var = "cl")

In [None]:
# join the observation-iteration-cluster assignments with the coordinate information
data_all <- datax2 %>% rename("cl0" = "cl") %>%
left_join(hier_all, by = "cl0")

# create hex color codes in advance to keep colors in each iteration consistent
colrs <- data_all[, uniqueN(cl0)]
set.seed(10)
colrs_hex <- do.call(rgb, replicate(3, runif(colrs), simplify = F))
data_all[, clr := colrs_hex[cl]]

In [None]:
# save the data for shiny app
#saveRDS(data_all, "~/databa/rds/data_all.rds")

Now let's visualize selected iterations:

In [None]:
# select among iterations
iters <- c(0, 20, 50, 80, 90, 96:100)

The clusters are wrapped inside colored convex hulls.

The observations in the cluster that will be merged (absorbed) in the next iteration have bold outlines.

The observations in the cluster that will absorb it in the next iteration have light outlines.

In [None]:
data_all2 <- data_all %>%
mutate(cl = sprintf("%03d", cl)) %>%
filter(iter %in% iters)

p <-  data_all2 %>%
        ggplot(aes(x = xval, y = yval, stroke = linex, size = 5)) +
        geom_point(shape=21, color = "red", aes(fill = cl)) +
        geom_convexhull(alpha=.5, aes(fill = cl), colour = "black", linewidth = 0.1) +
        scale_fill_manual(labels = unique(data_all2$cl), values = unique(data_all2$clr)) +
        scale_size_identity() +
        theme(legend.position="none") +
        facet_wrap(~ iter, ncol = 2)

ggplotly(p, height = 1500)

Now you can:
- Open a shiny interface from launcher
- Navigate to 12_clustering/apps directory and select the "Hierarchical Clustering" tab to interact with an animation of iterations

# Resources

- Lantz 2015, Machine Learning with R, Ch. 9
- James et al. 2023, An Introduction to Statistical Learning with Applications in R, Second Edition, Corrected Printing, Ch. 12
- Yu-Wei 2015, Machine Learning with R Cookbook, Ch. 9
- Agresti and Kateri 2021, Foundations of Statistics for Data Scientists With R and Python, Ch. 8