# 1 Intro

This notebook illustrates how to run a k-means clustering analysis in R. We will use the country risk dataset. The dataset contains variables of risk measures for about 120 countries (year unknown). Our goal is to group/cluster these countries based on those risk measures.

# 2 Load and prepare the data

We will

1. Load the dataset.
2. Perform a simple correlation analysis and decide which risk measures/variables to use for the clustering analysis.
3. Standardize the variables to prepare for the clustering.

In [None]:
# load the readxl package (for importing Excel datasets)
library(readxl)

In [None]:
# download the dataset first: read_xlsx() from the readxl package cannot read an Excel file directly from a URL
data_url <- "https://github.com/tdmdal/datasets-teaching/raw/main/crisk/country_risk.xlsx"
# mode = "wb" writes the file in binary mode, which keeps the .xlsx file intact on Windows
download.file(url = data_url, destfile = "country_risk.xlsx", mode = "wb")

In [None]:
# import the data to a dataframe
country_risk <- read_xlsx(path = "country_risk.xlsx", sheet = "raw_kmeans", skip = 1)
head(country_risk)

In [None]:
# take a look at the structure (str) of the dataframe/tibble
str(country_risk)

You can find the variable descriptions in the Excel data file (in sheet `data_description`). I also copy them below.

| Variable   | Description                                                                    |
|------------|--------------------------------------------------------------------------------|
| Corruption | Corruption index is on a scale from 0 (high corruption) to 100 (no corruption) |
| Peace      | Peace index is on a scale from 1 (very peaceful) to 5 (not at all peaceful)    |
| Legal      | Legal risk index is on a scale from 0 (high legal risk) to 10 (no legal risk)  |
The four most relevant risk variables/features for clustering are `Corruption`, `Peace`, `Legal`, and `GDP Growth`. We start by performing a correlation analysis on these four variables.

In [None]:
# correlation analysis for the four most relevant variables/features
cor(country_risk[c("Corruption", "Peace", "Legal", "GDP Growth")])

We see that `Corruption` and `Legal` are highly correlated. As a result, we will just use one of the two for our k-means clustering analysis.

We choose `Peace`, `Legal`, and `GDP Growth` for our clustering analysis. We first standardize the three variables/features, i.e., subtract the column mean from each data point and divide the result by the column standard deviation. K-means clustering is a distance-based unsupervised learning algorithm, so features measured on large scales can dominate the clustering result if they are not scaled properly.
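To see why scaling matters for a distance-based method, here is a minimal sketch on made-up data (the data frame `toy` and its values are assumptions for illustration, not taken from the country risk dataset). It compares Euclidean distances before and after standardizing with base R's `scale()`.

```r
# Toy data (made-up values): one feature on a 0-100 scale, one on a 1-5 scale
toy <- data.frame(
  wide_scale   = c(20, 80, 25, 75),
  narrow_scale = c(1.0, 1.2, 4.8, 5.0)
)

# Euclidean distances on the raw data: the 0-100 feature dominates,
# so rows 1 and 3 look close even though they differ a lot on the 1-5 feature
raw_d <- as.matrix(dist(toy))
round(raw_d, 2)

# Euclidean distances after standardization: both features contribute
scaled_d <- as.matrix(dist(scale(toy)))
round(scaled_d, 2)
```

On the raw data, the distance between rows 1 and 3 is a small fraction of the distance between rows 1 and 2. After standardization the two distances are of comparable size, because the difference on the narrow-scale feature is no longer drowned out.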

We use the `scale()` function in base R for the standardization.

In [None]:
crisk_3col_scaled <- scale(country_risk[c("Peace", "Legal", "GDP Growth")])
str(crisk_3col_scaled)

We see that `scale()` returns a matrix with two attributes that store the feature means and standard deviations (SDs). Attribute vector `scaled:center` stores the means, and attribute vector `scaled:scale` stores the SDs. As you will see later, we will use those means and SDs to unscale the cluster centroids for the purpose of interpreting (i.e., labeling/naming) the clusters.

Let us take a look at the scaled data (first 3 rows), and verify that we indeed performed a mean-SD scaling.

In [None]:
# take a look at first three rows
crisk_3col_scaled[1:3, 1:3]

In [None]:
# verify it's indeed mean and SD scaling.
mean_peace <- mean(country_risk$Peace)
sd_peace <- sd(country_risk$Peace)
peace_scaled <- (country_risk$Peace - mean_peace) / sd_peace

peace_scaled[1:3]
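Instead of comparing the printed numbers by eye, the check can also be done programmatically. The sketch below uses a made-up numeric vector `x` as a stand-in for a data column. With the notebook's objects, the analogous comparison would presumably be between `peace_scaled` and the `Peace` column of `crisk_3col_scaled`.

```r
# Toy stand-in for a numeric column (made-up values)
x <- c(1.2, 2.5, 3.1, 1.8, 4.0)

# manual standardization: subtract the mean, divide by the SD
x_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix with attributes; as.vector() strips them
x_scaled <- as.vector(scale(x))

# TRUE if the two agree up to floating-point tolerance
all.equal(x_manual, x_scaled)
```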

# 3 K-means clustering

## 3.1 Determine the $k$

The k-means clustering algorithm does not learn the number of clusters, $k$. We need to set the value of $k$ before we run the algorithm. There are many methods to determine $k$. We will use a *heuristic* and *visual* approach called the elbow method. Although the elbow method is widely used, note that not all researchers are happy with it ([Schubert, 2022](https://arxiv.org/abs/2212.12189)).

The elbow method plots the total within-cluster sum of squares (total WSS), the measure that the k-means clustering algorithm minimizes, against the number of clusters. The optimal total WSS never increases as the number of clusters grows, so the plotted curve should slope downward (the algorithm starts from random centroids and may stop at a local optimum, so in practice this relies on trying enough random starts). The method picks the number of clusters corresponding to the "elbow" of the curve. The "elbow" is the point after which an additional cluster reduces the total WSS much less than the previous one did (i.e., the marginal gain of adding a cluster drops sharply at the "elbow").
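For reference, total WSS can be written as follows. The notation is introduced here for illustration only: $C_j$ denotes the set of data points assigned to cluster $j$, and $\mu_j$ denotes the centroid (mean) of cluster $j$.

$$
\text{Total WSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$

Each inner sum is the WSS of one cluster, i.e., the sum of squared Euclidean distances between the cluster's points and its centroid.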

Let us walk through the elbow method below.

In [None]:
# set a random seed so you can reproduce the result
set.seed(123)

# max number of clusters (k) to try
num_cluster <- 8

# a vector to hold "total within sum of squares" for each number of clusters tried
twss <- numeric(num_cluster)

# try 1 to num_cluster possible clusters
for (i in 1:num_cluster) {
  # fit the k-means model
  km_fit <- kmeans(crisk_3col_scaled, centers = i, nstart = 10)
  # save the total within cluster sum of squares
  twss[i] <- km_fit$tot.withinss
}

k <- 1:num_cluster
plot(k, twss, type = "b")

As expected, total WSS decreases as the number of clusters increases. Beyond $k=3$, an additional cluster seems to reduce total WSS much less than the step from $k=2$ to $3$ did. I will therefore pick $k=3$, i.e., the elbow is the point corresponding to $k=3$. Again, the elbow method is a *heuristic* way of determining $k$. Often the elbow is not clear-cut, and there is no precise formula to follow.
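One way to make the "marginal gain" reading of the plot more concrete is to tabulate how much total WSS drops with each added cluster. The sketch below does this on synthetic data with three well-separated groups (the data and the names `toy`, `toy_twss`, and `wss_drop` are assumptions for illustration). It is an aid to eyeballing the elbow, not a formal selection rule.

```r
set.seed(123)

# Synthetic data (made up): three well-separated 2-D groups of 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)

# total WSS for k = 1, ..., max_k
max_k <- 8
toy_twss <- numeric(max_k)
for (i in 1:max_k) {
  toy_twss[i] <- kmeans(toy, centers = i, nstart = 10)$tot.withinss
}

# drop in total WSS gained by going from k - 1 to k clusters
wss_drop <- -diff(toy_twss)
names(wss_drop) <- paste0("k=", 2:max_k)
round(wss_drop, 1)
```

For data like this, the drops for $k=2$ and $k=3$ should be large and the drop for $k=4$ should be small by comparison, which is the pattern the elbow plot shows visually.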

## 3.2 Perform k-means clustering ($k=3$)

Since we decided to set $k=3$, let us re-fit the model at $k=3$.

In [None]:
km_fit_3 <- kmeans(crisk_3col_scaled, centers = 3, nstart = 10)
km_fit_3

The clustering report prints the number of data points (countries) in each cluster, the cluster means (i.e., cluster centroids), the cluster label (1, 2, or 3 in our case) assigned to each data point, and the within-cluster sum of squares (WSS) of each cluster.

Note that the returned fitted object (the variable `km_fit_3`) is a named list, from which you can retrieve all the stored information. For example, `km_fit_3$cluster` gives all the cluster labels, and `km_fit_3$tot.withinss` gives the total WSS.

In [None]:
# structure of km_fit_3: it is a named list
str(km_fit_3)

In [None]:
# create a new column "cluster" in the original data frame to store the cluster label (1, 2, or 3)
country_risk["cluster"] <- km_fit_3$cluster
country_risk

Let's see how many countries we have in each of these three clusters.

In [None]:
# count countries in each cluster (this is given in the clustering report too)
table(country_risk$cluster)

## 3.3 Interpret/name the clusters

To meaningfully label/name each cluster (e.g., high risk cluster, low risk cluster, etc.), we can take a look at the cluster centroids. Countries that belong to a given cluster tend to lie close to that cluster's centroid.

In [None]:
# retrieve cluster centroids (the list element is named "centers")
centroid <- km_fit_3$centers
centroid

Note that these centroids are computed on the scaled variables. To better interpret them, we can scale the centroids back to the original units.

We will use the `attributes()` function to retrieve the attributes of the `crisk_3col_scaled` matrix. `attributes(crisk_3col_scaled)` returns a named list, so we can retrieve the means and SDs used for the data scaling from its `scaled:center` and `scaled:scale` elements.

In [None]:
# the three means and SDs used for scaling
str(attributes(crisk_3col_scaled))

In [None]:
# the means used for scaling
scaled_mean <- attributes(crisk_3col_scaled)$"scaled:center"
print(scaled_mean)

In [None]:
# turn the means used for scaling to a matrix with repeated rows
scaled_mean_mat <- matrix(scaled_mean, nrow = 3, ncol = 3, byrow = TRUE)
scaled_mean_mat

In [None]:
# turn the SDs used for scaling to a matrix with repeated rows
scaled_sd <- attributes(crisk_3col_scaled)$"scaled:scale"
scaled_sd_mat <- matrix(scaled_sd, nrow = 3, ncol = 3, byrow = TRUE)
scaled_sd_mat

We are ready to scale the centroids back. The inverse operation of the data standardization is simply multiplying by the SD and then adding the mean back.

In [None]:
# scale each centroid back to the original scale
# that is, centroid * SD + mean
centroid_unscaled <- centroid * scaled_sd_mat + scaled_mean_mat
centroid_unscaled
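An equivalent way to undo the scaling, sketched below on synthetic data, uses base R's `sweep()` to apply the SDs and means column by column instead of building matrices of repeated rows. The data frame `toy`, its column names, and the choice of two clusters are made up for illustration. The last step cross-checks the result against the per-cluster means of the original, unscaled data.

```r
set.seed(123)

# Synthetic stand-in for three features on different scales (made-up values)
toy <- data.frame(
  x1 = c(rnorm(20, mean = 2, sd = 0.3), rnorm(20, mean = 4, sd = 0.3)),
  x2 = c(rnorm(20, mean = 10, sd = 2), rnorm(20, mean = 30, sd = 2)),
  x3 = c(rnorm(20, mean = -1, sd = 1), rnorm(20, mean = 3, sd = 1))
)

toy_scaled <- scale(toy)
toy_fit <- kmeans(toy_scaled, centers = 2, nstart = 10)

# undo the standardization column by column: multiply by the SD, then add the mean
centers_unscaled <- sweep(toy_fit$centers, 2, attr(toy_scaled, "scaled:scale"), `*`)
centers_unscaled <- sweep(centers_unscaled, 2, attr(toy_scaled, "scaled:center"), `+`)
centers_unscaled

# cross-check: per-cluster means of the original (unscaled) data
raw_means <- aggregate(toy, by = list(cluster = toy_fit$cluster), FUN = mean)
raw_means
```

The two tables should agree up to floating-point error: a centroid is the mean of its cluster's scaled points, and standardization is a linear transformation, so unscaling the centroid gives the cluster mean in the original units.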

### Exercise

Can you name the three clusters? Hint: take a look at the data description and understand how `Peace` and `Legal` are scored and their score ranges.