# Comparing Different Clustering Methods on Toy Datasets (R Version)

This notebook explores the performance of different clustering techniques on toy datasets of various shapes and properties, using R and tidyverse-friendly approaches. The structure and flow mirror the Python version for easy comparison.

## Step 1: Load Required Libraries

We use the `tidyverse` for data manipulation and plotting, and clustering packages such as `cluster`, `factoextra`, and `dbscan`.

- If you are new to R, you can install any missing packages with `install.packages("package_name")` in your R console.

In [ ]:
library(tidyverse)
library(cluster)
library(factoextra)
library(dbscan)
set.seed(42)
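
The installation hint above can be turned into a quick check before the `library()` calls. This is a minimal sketch: the names `required`, `installed`, and `missing_pkgs` are introduced here for illustration, and the package list simply repeats the four packages attached in this step.

```r
# Packages attached in Step 1
required <- c("tidyverse", "cluster", "factoextra", "dbscan")

# requireNamespace() returns TRUE when a package is installed, without attaching it
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  # Print the install command instead of running it, so nothing is installed silently
  message(
    "Run: install.packages(c(",
    paste0('"', missing_pkgs, '"', collapse = ", "),
    "))"
  )
}
```

Printing the command rather than calling `install.packages()` directly keeps the check free of side effects.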

## Step 2: Generate Toy Datasets

We create several datasets to demonstrate clustering methods. Each dataset has a different structure:

- **Circles**: Two concentric circles (non-linearly separable)
- **Moons**: Two interleaving half circles (non-linearly separable)
- **Blobs**: Well-separated Gaussian blobs (good for basic clustering)
- **Uniform**: Random points with no structure
- **Anisotropic**: Blobs stretched by a linear transformation, so they are elongated rather than round
- **Varied Blobs**: Blobs with different variances

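We generate every dataset directly with base R random-number functions, following the usual definitions of the circles and moons toy datasets.

In [ ]:
# Circles dataset: two concentric rings (radii 1 and 0.5) with Gaussian noise
n_half <- 750
theta <- runif(2 * n_half, 0, 2 * pi)
radius <- rep(c(1, 0.5), each = n_half)
X_circles <- tibble(
  V1 = radius * cos(theta) + rnorm(2 * n_half, sd = 0.05),
  V2 = radius * sin(theta) + rnorm(2 * n_half, sd = 0.05)
)
y_circles <- factor(rep(1:2, each = n_half))

# Moons dataset: two interleaving half circles with Gaussian noise
angle <- runif(n_half, 0, pi)
X_moons <- tibble(
  V1 = c(cos(angle), 1 - cos(angle)) + rnorm(2 * n_half, sd = 0.05),
  V2 = c(sin(angle), 0.5 - sin(angle)) + rnorm(2 * n_half, sd = 0.05)
)
y_moons <- factor(rep(1:2, each = n_half))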

# Blobs dataset
blobs <- tibble(
  x = c(rnorm(500, 0, 0.5), rnorm(500, 3, 0.5), rnorm(500, 6, 0.5)),
  y = c(rnorm(500, 0, 0.5), rnorm(500, 3, 0.5), rnorm(500, 6, 0.5))
)

# Uniform (no structure)
X_no_structure <- tibble(
  x = runif(1500),
  y = runif(1500)
)

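# Anisotropic dataset: stretch the blobs with a linear transformation
aniso_mat <- as.matrix(blobs) %*% matrix(c(0.6, -0.6, -0.4, 0.8), nrow = 2)
colnames(aniso_mat) <- c("V1", "V2")  # name the columns so aes(V1, V2) resolves
X_aniso <- as_tibble(aniso_mat)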

# Blobs with varied variances
blobs_varied <- bind_rows(
  tibble(x = rnorm(500, 0, 1.0), y = rnorm(500, 0, 1.0)),
  tibble(x = rnorm(500, 3, 2.5), y = rnorm(500, 3, 2.5)),
  tibble(x = rnorm(500, 6, 0.5), y = rnorm(500, 6, 0.5))
)
X_varied <- blobs_varied

## Step 3: Visualize Toy Datasets

Let's plot each dataset to understand its structure. This helps us see why some clustering methods work better than others.

In [ ]:
# Circles
ggplot(X_circles, aes(V1, V2, color = y_circles)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Circles Dataset')

In [ ]:
# Moons
ggplot(X_moons, aes(V1, V2, color = y_moons)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Moons Dataset')

In [ ]:
# Blobs
ggplot(blobs, aes(x, y)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Blobs Dataset')

In [ ]:
# Uniform (no structure)
ggplot(X_no_structure, aes(x, y)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Uniform (No Structure) Dataset')

In [ ]:
# Anisotropic
ggplot(X_aniso, aes(V1, V2)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Anisotropic Dataset')

In [ ]:
# Blobs with varied variances
ggplot(X_varied, aes(x, y)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Blobs with Varied Variances')
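
The plotting cells in this step repeat the same ggplot2 layers. As a minimal sketch, the repetition could be folded into a helper. `plot_points` and its arguments are hypothetical names introduced here for illustration; they are not part of the notebook, and the helper assumes the first two columns of the data hold the coordinates.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of the first two columns of `df`,
# optionally coloured by a vector of labels.
plot_points <- function(df, title, labels = NULL) {
  plot_df <- data.frame(x = df[[1]], y = df[[2]])
  p <- if (is.null(labels)) {
    ggplot(plot_df, aes(x, y))
  } else {
    plot_df$label <- as.factor(labels)
    ggplot(plot_df, aes(x, y, color = label))
  }
  p + geom_point(size = 1.5) + theme_minimal() + ggtitle(title)
}

# Example: 100 uniform points, coloured by which half of the x-axis they fall in
set.seed(42)
demo <- data.frame(x = runif(100), y = runif(100))
p <- plot_points(demo, "Demo", labels = demo$x > 0.5)
```

Printing `p` draws the plot. The same call shape would work for cluster labels, for example `plot_points(df, "K-means", labels = km$cluster)`.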

## Step 4: Define Clustering Functions

We define reusable functions for each clustering method:

- **K-means**: Partitions data into k clusters by minimizing within-cluster variance.
- **Agglomerative (Hierarchical)**: Starts with every point as its own cluster and successively merges the closest pair of clusters, with closeness defined by the linkage method.
- **DBSCAN**: Groups together points that are closely packed and labels points in low-density regions as noise (cluster 0 in the `dbscan` package).

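Each function returns the cluster assignments and the mean silhouette score, which ranges from -1 to 1; higher values indicate points that sit closer to their own cluster than to the nearest other cluster. The K-means and agglomerative functions also return the cluster centers.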
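
To make the silhouette score concrete, the sketch below computes it by hand on a tiny dataset and compares the result with `cluster::silhouette()`. For a point i, a is the mean distance to the other members of its own cluster, b is the mean distance to the nearest other cluster, and the silhouette width is (b - a) / max(a, b). The data and the helper `silhouette_one` are made up for illustration.

```r
library(cluster)

# Six 1-D points in two tight groups
x <- matrix(c(0, 1, 2, 10, 11, 12), ncol = 1)
labels <- rep(1:2, each = 3)
d <- as.matrix(dist(x))

# Silhouette width of point i, computed directly from the definition
silhouette_one <- function(i) {
  own <- labels == labels[i]
  # Mean distance to the other members of its own cluster (the self-distance is 0)
  a <- sum(d[i, own]) / (sum(own) - 1)
  # Mean distance to the nearest other cluster
  others <- setdiff(unique(labels), labels[i])
  b <- min(sapply(others, function(k) mean(d[i, labels == k])))
  (b - a) / max(a, b)
}

manual <- sapply(seq_len(nrow(x)), silhouette_one)
packaged <- as.numeric(silhouette(labels, dist(x))[, "sil_width"])
all.equal(manual, packaged)
```

For the first point, a = 1.5 and b = 11, giving a width of about 0.86, which reflects how far apart the two groups are relative to their internal spread.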

In [ ]:
# K-means clustering
cluster_kmeans <- function(df, nclust) {
  km <- kmeans(df, centers = nclust, nstart = 10)
  sil <- silhouette(km$cluster, dist(df))
  list(silhouette = mean(sil[, 3]), cluster = km$cluster, centers = km$centers)
}
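
"Minimizing within-cluster variance" refers to the total within-cluster sum of squares that `kmeans()` reports as `tot.withinss`. As a small check on made-up data, the sketch below recomputes that quantity from the returned centers and assignments.

```r
set.seed(42)
# Two well-separated Gaussian groups of 50 points each
pts <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2)
)
km <- kmeans(pts, centers = 2, nstart = 10)

# Look up the centre assigned to each point, then sum the squared distances
assigned_centers <- km$centers[km$cluster, , drop = FALSE]
wss_manual <- sum((pts - assigned_centers)^2)

all.equal(wss_manual, km$tot.withinss)
```

The `nstart = 10` argument reruns the algorithm from ten random starts and keeps the run with the smallest value of this objective.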

In [ ]:
# Agglomerative clustering
cluster_agglom <- function(df, nclust, method = 'complete') {
  hc <- hclust(dist(df), method = method)
  cluster <- cutree(hc, k = nclust)
  sil <- silhouette(cluster, dist(df))
  centers <- df %>%
    mutate(cluster = cluster) %>%
    group_by(cluster) %>%
    summarise(across(everything(), mean)) %>%
    select(-cluster)
  list(silhouette = mean(sil[, 3]), cluster = cluster, centers = as.matrix(centers))
}
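
The `method` argument matters because each linkage measures the distance between two clusters differently: single linkage uses the closest pair of points, complete linkage the farthest pair. The made-up 1-D example below is a sketch of how the two can disagree on the same data.

```r
# Seven 1-D points: a loose chain (gaps grow from 1.0 to 1.3) and a tight pair
x <- c(0, 1.0, 2.1, 3.3, 4.6, 6.4, 6.5)
d <- dist(x)

single_labels <- cutree(hclust(d, method = "single"), k = 2)
complete_labels <- cutree(hclust(d, method = "complete"), k = 2)

rbind(x = x, single = single_labels, complete = complete_labels)
# single:   1 1 1 1 1 2 2  (the chain stays together)
# complete: 1 1 1 1 2 2 2  (4.6 joins the tight pair to keep clusters compact)
```

Single linkage keeps the whole chain in one cluster because every gap inside it is smaller than the gap to the pair. Complete linkage penalizes the chain's overall length, so it moves the last chain point over to the pair.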

In [ ]:
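# DBSCAN clustering
cluster_dbscan <- function(df, eps, minPts = 5) {
  db <- dbscan::dbscan(as.matrix(df), eps = eps, minPts = minPts)
  cluster <- db$cluster
  # Label 0 marks noise; exclude it so noise is not scored as a cluster
  keep <- cluster != 0
  # Silhouette only makes sense if there are at least 2 clusters
  sil <- if (length(unique(cluster[keep])) > 1) {
    mean(silhouette(cluster[keep], dist(df[keep, ]))[, 3])
  } else {
    NA_real_
  }
  list(silhouette = sil, cluster = cluster)
}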
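
The `eps` values used in the following steps are fixed by hand. One common way to pick a starting value is to look at each point's distance to its k-th nearest neighbor with `dbscan::kNNdist()`, using k = minPts - 1. The sketch below does this on made-up data; taking the 95th percentile as `eps_guess` is an assumption made for illustration, not a rule from the notebook.

```r
library(dbscan)

set.seed(42)
# One dense Gaussian blob of 200 points plus three far-away outliers
dense <- matrix(rnorm(400, sd = 0.3), ncol = 2)
outliers <- matrix(c(5, 5, -5, 5, 5, -5), ncol = 2, byrow = TRUE)
pts <- rbind(dense, outliers)

# Distance from each point to its 4th nearest neighbour (k = minPts - 1 for minPts = 5)
knn_dist <- kNNdist(pts, k = 4)

# Illustrative heuristic (an assumption): use a high quantile of those distances as eps
eps_guess <- unname(quantile(knn_dist, 0.95))

db <- dbscan(pts, eps = eps_guess, minPts = 5)
table(db$cluster)  # label 0 counts the noise points
```

The three outliers are far from everything else, so they end up with label 0. `kNNdistplot()` from the same package draws the sorted distances if you prefer to pick the "knee" by eye.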

## Step 5: Circles Dataset — Compare Clustering Methods

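Let's apply each clustering method to the circles dataset and visualize the results. K-means and complete linkage favor compact, roughly convex clusters, so they tend to cut across the rings; single linkage and DBSCAN follow chains of nearby points and are better suited to this shape.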

In [ ]:
# K-means
res_km <- cluster_kmeans(X_circles, 2)
X_circles %>% mutate(cluster = as.factor(res_km$cluster)) %>%
  ggplot(aes(V1, V2, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('K-means on Circles')

In [ ]:
# Agglomerative (complete)
res_agglom <- cluster_agglom(X_circles, 2, method = 'complete')
X_circles %>% mutate(cluster = as.factor(res_agglom$cluster)) %>%
  ggplot(aes(V1, V2, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Agglomerative (Complete) on Circles')

In [ ]:
# DBSCAN
res_db <- cluster_dbscan(X_circles, eps = 0.2)
X_circles %>% mutate(cluster = as.factor(res_db$cluster)) %>%
  ggplot(aes(V1, V2, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('DBSCAN on Circles')

In [ ]:
# Agglomerative (single)
res_agglom_single <- cluster_agglom(X_circles, 2, method = 'single')
X_circles %>% mutate(cluster = as.factor(res_agglom_single$cluster)) %>%
  ggplot(aes(V1, V2, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Agglomerative (Single) on Circles')
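
To put a number on what the plots show, cluster labels can be compared with the known ring membership. The sketch below is self-contained and uses its own small ring dataset; the radii, noise level, and the helper `agreement` are assumptions made for illustration.

```r
set.seed(42)
# Two rings with evenly spaced angles and light noise (radii 1 and 0.5 are assumptions)
n_ring <- 200
theta <- rep(seq(0, 2 * pi, length.out = n_ring + 1)[-1], times = 2)
radius <- rep(c(1, 0.5), each = n_ring)
rings <- cbind(
  radius * cos(theta) + rnorm(2 * n_ring, sd = 0.02),
  radius * sin(theta) + rnorm(2 * n_ring, sd = 0.02)
)
truth <- rep(1:2, each = n_ring)

# Fraction of points matching the true ring, allowing the two label names to be swapped
agreement <- function(pred) max(mean(pred == truth), mean(pred != truth))

km_labels <- kmeans(rings, centers = 2, nstart = 10)$cluster
single_labels <- cutree(hclust(dist(rings), method = "single"), k = 2)

table(truth, kmeans = km_labels)
table(truth, single = single_labels)
agreement(km_labels)
agreement(single_labels)
```

With this little noise, single linkage recovers both rings, while K-means splits the plane in half and puts roughly half of each ring in each cluster. The `agreement` helper only handles the two-cluster case.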

## Step 6: Moons Dataset — Compare Clustering Methods

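Now, let's try the same clustering methods on the moons dataset. The moons are also non-convex, so methods that favor compact clusters tend to split each moon, while single linkage and DBSCAN can follow the curved shapes as long as the two moons do not touch.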

In [ ]:
# K-means
res_km <- cluster_kmeans(X_moons, 2)
X_moons %>% mutate(cluster = as.factor(res_km$cluster)) %>%
  ggplot(aes(V1, V2, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('K-means on Moons')

In [ ]:
# Agglomerative (complete)
res_agglom <- cluster_agglom(X_moons, 2, method = 'complete')
X_moons %>% mutate(cluster = as.factor(res_agglom$cluster)) %>%
  ggplot(aes(V1, V2, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Agglomerative (Complete) on Moons')

In [ ]:
# DBSCAN
res_db <- cluster_dbscan(X_moons, eps = 0.2)
X_moons %>% mutate(cluster = as.factor(res_db$cluster)) %>%
  ggplot(aes(V1, V2, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('DBSCAN on Moons')

In [ ]:
# Agglomerative (single)
res_agglom_single <- cluster_agglom(X_moons, 2, method = 'single')
X_moons %>% mutate(cluster = as.factor(res_agglom_single$cluster)) %>%
  ggplot(aes(V1, V2, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Agglomerative (Single) on Moons')

## Step 7: Blobs Dataset — Compare Clustering Methods

The blobs dataset is well-suited for clustering. Most algorithms should perform well here.

In [ ]:
# K-means
res_km <- cluster_kmeans(blobs, 3)
blobs %>% mutate(cluster = as.factor(res_km$cluster)) %>%
  ggplot(aes(x, y, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('K-means on Blobs')

In [ ]:
# Agglomerative (complete)
res_agglom <- cluster_agglom(blobs, 3, method = 'complete')
blobs %>% mutate(cluster = as.factor(res_agglom$cluster)) %>%
  ggplot(aes(x, y, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Agglomerative (Complete) on Blobs')

In [ ]:
# DBSCAN
res_db <- cluster_dbscan(blobs, eps = 0.5)
blobs %>% mutate(cluster = as.factor(res_db$cluster)) %>%
  ggplot(aes(x, y, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('DBSCAN on Blobs')

In [ ]:
# Agglomerative (single)
res_agglom_single <- cluster_agglom(blobs, 3, method = 'single')
blobs %>% mutate(cluster = as.factor(res_agglom_single$cluster)) %>%
  ggplot(aes(x, y, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Agglomerative (Single) on Blobs')

## Step 8: Uniform (No Structure) Dataset — Compare Clustering Methods

This dataset has no real clusters. Let's see how the algorithms behave when there is no structure.
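
K-means and hierarchical clustering always return the number of clusters you ask for, even when none exist, so a quality score is needed to tell the two situations apart. As a minimal sketch on freshly generated data, the code below compares the mean silhouette of K-means on three separated blobs with that on uniform noise. The helper `mean_silhouette` is a hypothetical name used only here.

```r
library(cluster)

set.seed(42)
n <- 300
# Three well-separated Gaussian blobs versus uniform noise on the unit square
blob_pts <- cbind(
  c(rnorm(n / 3, 0, 0.5), rnorm(n / 3, 3, 0.5), rnorm(n / 3, 6, 0.5)),
  c(rnorm(n / 3, 0, 0.5), rnorm(n / 3, 3, 0.5), rnorm(n / 3, 6, 0.5))
)
unif_pts <- cbind(runif(n), runif(n))

# Mean silhouette width of a k-means partition with k clusters
mean_silhouette <- function(pts, k) {
  km <- kmeans(pts, centers = k, nstart = 10)
  mean(silhouette(km$cluster, dist(pts))[, "sil_width"])
}

sil_blobs <- mean_silhouette(blob_pts, 3)
sil_unif <- mean_silhouette(unif_pts, 3)
c(blobs = sil_blobs, uniform = sil_unif)
```

The uniform data still gets three clusters, but its mean silhouette is clearly lower than for the blobs, which signals that the partition is arbitrary.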

In [ ]:
# K-means
res_km <- cluster_kmeans(X_no_structure, 3)
X_no_structure %>% mutate(cluster = as.factor(res_km$cluster)) %>%
  ggplot(aes(x, y, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('K-means on No Structure')

In [ ]:
# Agglomerative (complete)
res_agglom <- cluster_agglom(X_no_structure, 3, method = 'complete')
X_no_structure %>% mutate(cluster = as.factor(res_agglom$cluster)) %>%
  ggplot(aes(x, y, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Agglomerative (Complete) on No Structure')

In [ ]:
# DBSCAN
res_db <- cluster_dbscan(X_no_structure, eps = 0.1)
X_no_structure %>% mutate(cluster = as.factor(res_db$cluster)) %>%
  ggplot(aes(x, y, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('DBSCAN on No Structure')

In [ ]:
# Agglomerative (single)
res_agglom_single <- cluster_agglom(X_no_structure, 3, method = 'single')
X_no_structure %>% mutate(cluster = as.factor(res_agglom_single$cluster)) %>%
  ggplot(aes(x, y, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Agglomerative (Single) on No Structure')

## Step 9: Anisotropic Dataset — Compare Clustering Methods

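Here, every blob has been stretched by the same linear transformation, so the clusters are elongated rather than round. Methods that favor compact, round clusters, such as K-means, may struggle with this kind of data.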
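
The effect of the transformation can be checked numerically: multiplying the data by a matrix A changes its covariance from S to t(A) %*% S %*% A. The sketch below applies the matrix from Step 2 to a single made-up round blob and compares the spread before and after. `eig_ratio` is a hypothetical helper introduced for illustration.

```r
set.seed(42)
# Isotropic 2-D Gaussian blob: roughly equal spread in every direction
blob <- matrix(rnorm(2000, sd = 0.5), ncol = 2)

# Same numbers as the Step 2 matrix (R fills matrices column by column)
A <- matrix(c(0.6, -0.6, -0.4, 0.8), nrow = 2)
stretched <- blob %*% A

# The sample covariance transforms as t(A) %*% cov(blob) %*% A
cov_expected <- t(A) %*% cov(blob) %*% A
cov_observed <- cov(stretched)
all.equal(cov_expected, cov_observed)

# Ratio of largest to smallest eigenvalue: near 1 for a round blob, much larger once stretched
eig_ratio <- function(m) {
  e <- eigen(m, symmetric = TRUE)$values
  e[1] / e[2]
}
c(round_blob = eig_ratio(cov(blob)), stretched = eig_ratio(cov_observed))
```

A large eigenvalue ratio means the blob is long and thin, which is the situation where distance-to-center methods tend to cut clusters across their long axis.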

In [ ]:
# K-means
res_km <- cluster_kmeans(X_aniso, 3)
X_aniso %>% mutate(cluster = as.factor(res_km$cluster)) %>%
  ggplot(aes(V1, V2, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('K-means on Anisotropic')

In [ ]:
# Agglomerative (complete)
res_agglom <- cluster_agglom(X_aniso, 3, method = 'complete')
X_aniso %>% mutate(cluster = as.factor(res_agglom$cluster)) %>%
  ggplot(aes(V1, V2, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Agglomerative (Complete) on Anisotropic')

In [ ]:
# DBSCAN
res_db <- cluster_dbscan(X_aniso, eps = 0.3)
X_aniso %>% mutate(cluster = as.factor(res_db$cluster)) %>%
  ggplot(aes(V1, V2, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('DBSCAN on Anisotropic')

In [ ]:
# Agglomerative (single)
res_agglom_single <- cluster_agglom(X_aniso, 3, method = 'single')
X_aniso %>% mutate(cluster = as.factor(res_agglom_single$cluster)) %>%
  ggplot(aes(V1, V2, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Agglomerative (Single) on Anisotropic')

## Step 10: Blobs with Varied Variances — Compare Clustering Methods

Finally, let's see how the algorithms handle blobs with different variances (spread).

In [ ]:
# K-means
res_km <- cluster_kmeans(X_varied, 3)
X_varied %>% mutate(cluster = as.factor(res_km$cluster)) %>%
  ggplot(aes(x, y, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('K-means on Varied Blobs')

In [ ]:
# Agglomerative (complete)
res_agglom <- cluster_agglom(X_varied, 3, method = 'complete')
X_varied %>% mutate(cluster = as.factor(res_agglom$cluster)) %>%
  ggplot(aes(x, y, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Agglomerative (Complete) on Varied Blobs')

In [ ]:
# DBSCAN
res_db <- cluster_dbscan(X_varied, eps = 0.6)
X_varied %>% mutate(cluster = as.factor(res_db$cluster)) %>%
  ggplot(aes(x, y, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('DBSCAN on Varied Blobs')

In [ ]:
# Agglomerative (single)
res_agglom_single <- cluster_agglom(X_varied, 3, method = 'single')
X_varied %>% mutate(cluster = as.factor(res_agglom_single$cluster)) %>%
  ggplot(aes(x, y, color = cluster)) +
  geom_point(size=1.5) +
  theme_minimal() +
  ggtitle('Agglomerative (Single) on Varied Blobs')