# Clustering evaluation on high dimensional data

The goal of this notebook is to provide a basic template walkthrough of obtaining and preparing a number of (simple) high dimensional datasets that can reasonably be used for clustering evaluation. The datasets chosen have associated class labels that *should* be meaningful in terms of how the data clusters, and thus we can use label-based clustering evaluation measures such as ARI and AMI to determine how well different clustering approaches are performing.

The primary purpose of this notebook is to provide a set of baseline datasets that clustering algorithm developers can try their algorithms out on. Performing reasonably well on these datasets is a necessary but not sufficient condition for a good clustering algorithm.
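
As a small illustration of the label-based scores mentioned above, the sketch below compares a toy ground truth against three made-up clusterings. The arrays are invented for the example; only `adjusted_rand_score` and `adjusted_mutual_info_score` come from scikit-learn. Both scores ignore the actual cluster ids and only look at how the points are partitioned.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# Toy ground-truth classes and three candidate clusterings (all values are made up).
true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
relabelled = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])    # same partition, different ids
one_mistake = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2])   # one point in the wrong cluster
single_cluster = np.zeros(9, dtype=int)               # everything lumped together

ari_relabelled = adjusted_rand_score(true_labels, relabelled)
ami_relabelled = adjusted_mutual_info_score(true_labels, relabelled)
ari_mistake = adjusted_rand_score(true_labels, one_mistake)
ari_single = adjusted_rand_score(true_labels, single_cluster)
ami_single = adjusted_mutual_info_score(true_labels, single_cluster)

print(f"relabelled:     ARI={ari_relabelled:.3f} AMI={ami_relabelled:.3f}")
print(f"one mistake:    ARI={ari_mistake:.3f}")
print(f"single cluster: ARI={ari_single:.3f} AMI={ami_single:.3f}")
```

A perfect partition scores 1 regardless of how the clusters are numbered, while the uninformative single-cluster labelling scores about 0, which is what makes these adjusted scores convenient for comparing algorithms.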

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from src import paths
data_folder = paths['data_path']

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score, silhouette_score
from sklearn.decomposition import PCA
from sklearn import cluster

import numpy as np
import pandas as pd
import requests
import zipfile
import imageio
import os
from PIL import Image
from glob import glob
import re
import rarfile

import matplotlib.pyplot as plt
import seaborn as sns

import hdbscan
import umap
from sklearn.neighbors import KNeighborsTransformer
import pynndescent

import networkx as nx
import cdlib.algorithms as cd

sns.set()

# MNIST, USPS and Pendigits are easy

We can use the sklearn API to fetch data for the Pendigits, MNIST and USPS datasets.

Of these datasets Pendigits is the smallest, with only 1797 samples in 64 dimensions. This makes it a good first dataset to test things out on: it is small enough that practically any algorithm should run on it efficiently.

USPS provides a slightly more challenging dataset, with almost 10,000 samples and 256 dimensions, but is still small enough to be tractable for even naive clustering implementations.

MNIST provides a good basic scaling test with 70,000 samples in 784 dimensions. In practice this is not a very large dataset compared to many that people want to cluster, although the dimensionality may provide some challenges.
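
To make the scaling remarks above concrete, here is a rough back-of-the-envelope sketch. It assumes a naive implementation that materialises a dense float64 pairwise distance matrix, which is an assumption for illustration and not something every algorithm does. The helper name `dense_distance_matrix_gib` is hypothetical, and the USPS count of 9298 is our assumption (the text only says almost 10,000).

```python
def dense_distance_matrix_gib(n_samples, bytes_per_value=8):
    """Memory (in GiB) of a dense n x n distance matrix; hypothetical helper."""
    return n_samples * n_samples * bytes_per_value / 2**30

# Sample counts: Pendigits and MNIST are quoted in the text; USPS is assumed.
sizes = {"Pendigits": 1797, "USPS": 9298, "MNIST": 70000}
footprints = {name: dense_distance_matrix_gib(n) for name, n in sizes.items()}

for name, gib in footprints.items():
    print(f"{name:>9}: {gib:8.2f} GiB for all pairwise distances")
```

Under this assumption Pendigits and USPS fit comfortably in memory, while the full MNIST matrix runs to tens of GiB, which is one reason a quadratic-memory method may need a subsample.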

In [None]:
digits = load_digits()
mnist = fetch_openml("MNIST_784")
usps = fetch_openml("USPS")

# Buildings and COIL are harder

The buildings and COIL-20 datasets provide some slightly more challenging image-based problems, with more complex images to be dealt with. Both are still small in number of samples, so should be easily tractable. COIL *should* be relatively easy to cluster since the different classes should provide fairly tight and distinct clusters (being 72 images of the same object from different angles for each class). The buildings dataset, which has colour images from many angles and different lighting conditions, should be a much more challenging problem to cluster if using simple Euclidean distance on the flattened vectors.
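
The sketch below illustrates why Euclidean distance on flattened pixels can struggle with lighting changes. It uses a tiny made-up 4x4 RGB array rather than real images, and treats a vertically flipped copy as a stand-in for a different object; both choices are assumptions for illustration only.

```python
import numpy as np

# Toy 4x4 RGB "image" (made-up values); float dtype avoids uint8 overflow when brightening.
image = np.arange(48, dtype=np.float64).reshape(4, 4, 3)
brighter = image + 50.0   # same object under a uniform lighting change
flipped = image[::-1]     # stand-in for a different object

# Flatten each image into one long feature vector, as the notebook does for its image data.
flat_image, flat_brighter, flat_flipped = (a.flatten() for a in (image, brighter, flipped))

dist_lighting = np.linalg.norm(flat_image - flat_brighter)
dist_other = np.linalg.norm(flat_image - flat_flipped)

print(f"same object, brighter lighting: {dist_lighting:.1f}")
print(f"different object:               {dist_other:.1f}")
```

In this toy case the brightened copy of the same image ends up farther away than the different image, so a distance-based clustering would have little reason to group the two lighting conditions together.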

In [None]:
# Create the data folder (and any missing parents) without shelling out
os.makedirs(data_folder, exist_ok=True)

## COIL-20

In [None]:
%%time
if not os.path.isfile(data_folder / 'coil20.zip'):
    results = requests.get('http://www.cs.columbia.edu/CAVE/databases/SLAM_coil-20_coil-100/coil-20/coil-20-proc.zip')
    with open(data_folder / 'coil20.zip', "wb") as code:
        code.write(results.content)

In [None]:
images_zip = zipfile.ZipFile(data_folder / 'coil20.zip')
mylist = images_zip.namelist()
r = re.compile(r".*\.png$")  # raw string avoids an invalid-escape warning for \.
filelist = list(filter(r.match, mylist))
images_zip.extractall(str(data_folder) + '/.')

In [None]:
%%time
coil_feature_vectors = []
for filename in filelist:
    im = imageio.imread(data_folder / filename)
    coil_feature_vectors.append(im.flatten())
coil_20_data = np.asarray(coil_feature_vectors)
coil_20_target = pd.Series(filelist).str.extract("obj([0-9]+)", expand=False).values.astype(np.int32)

## Buildings

In [None]:
if not os.path.isfile(data_folder / 'buildings.rar'):
    results = requests.get('http://eprints.lincoln.ac.uk/id/eprint/16079/1/dataset.rar')
    with open(data_folder / 'buildings.rar', "wb") as code:
        code.write(results.content)

In [None]:
if not os.path.isfile(data_folder / 'sheffield_buildings/Dataset/Dataset/1/S1-01.jpeg'):
    rf = rarfile.RarFile(f'{data_folder}/buildings.rar')
    rf.extractall(f'{data_folder}/sheffield_buildings')

In [None]:
buildings_data = []
buildings_target = []
for i in range(1, 41):
    directory = data_folder / f"sheffield_buildings/Dataset/Dataset/{i}"
    images = np.vstack([np.asarray(Image.open(filename).resize((96, 96))).flatten() for filename in glob(f"{directory}/*")])
    labels = np.full(len(glob(f"{directory}/*")), i, dtype=np.int32)
    buildings_data.append(images)
    buildings_target.append(labels)
buildings_data = np.vstack(buildings_data)
buildings_target = np.hstack(buildings_target)

# Clustering metric eval

To make things easier later, we will write some short functions to evaluate clusterings (with some special handling of singleton clusters and noise points for clustering algorithms that support such things), and to plot the results for easy comparison.
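
The special handling described above can be sketched on a toy example. This is a simplified illustration and not the evaluation function itself: it only relabels singleton clusters as noise (label `-1`) and then restricts scoring to the clustered points. All label values here are made up.

```python
import numpy as np

# Made-up cluster assignments: -1 marks noise, and cluster 2 is a singleton.
cluster_labels = np.array([0, 0, 0, 1, 1, 1, 2, -1, -1, 1])
true_labels = np.array([5, 5, 5, 7, 7, 7, 7, 5, 7, 7])

# Count members per label and find non-noise clusters with exactly one point.
labels, counts = np.unique(cluster_labels, return_counts=True)
singletons = labels[(counts == 1) & (labels >= 0)]

# Work on a copy so the original assignments are left untouched.
cleaned = cluster_labels.copy()
cleaned[np.isin(cleaned, singletons)] = -1

# Only clustered points take part in label-based scoring.
clustered = cleaned >= 0
pct_clustered = clustered.mean()
scored_true, scored_pred = true_labels[clustered], cleaned[clustered]

print(f"cleaned labels: {cleaned.tolist()}")
print(f"pct clustered:  {pct_clustered * 100:.0f}%")
```

Reporting the fraction of points that were actually clustered alongside the scores matters, because an algorithm can otherwise look good by declaring every hard point to be noise.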

In [None]:
def eval_clusters(cluster_labels, true_labels, raw_data, cluster_method="None", min_cluster_size=5):
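    cluster_labels = np.array(cluster_labels)  # copy so the caller's labels are not modified in place
    # Count members per label directly; np.histogram with the labels as bin
    # edges would merge the last two labels into a single bin.
    unique_labels, cluster_sizes = np.unique(cluster_labels, return_counts=True)
    if np.any(cluster_sizes == 1): # Has singleton clusters -- call them (and other tiny clusters) noise
        small_clusters = unique_labels[cluster_sizes <= min_cluster_size]
        for c in small_clusters:
            cluster_labels[cluster_labels == c] = -1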
    if np.any(cluster_labels < 0): # Has noise points
        clustered_points = (cluster_labels >= 0)
        ari = adjusted_rand_score(true_labels[clustered_points], cluster_labels[clustered_points])
        ami = adjusted_mutual_info_score(true_labels[clustered_points], cluster_labels[clustered_points])
        sil = silhouette_score(raw_data[clustered_points], cluster_labels[clustered_points])
        pct_clustered = (np.sum(clustered_points) / cluster_labels.shape[0])
        print(f"ARI: {ari:.4f}\nAMI: {ami:.4f}\nSilhouette: {sil:.4f}\nPct clustered: {pct_clustered * 100:.2f}%")
    else:
        ari = adjusted_rand_score(true_labels, cluster_labels)
        ami = adjusted_mutual_info_score(true_labels, cluster_labels)
        sil = silhouette_score(raw_data, cluster_labels)
        print(f"ARI: {ari:.4f}\nAMI: {ami:.4f}\nSilhouette: {sil:.4f}")
        pct_clustered = 1.0
    
    return {"Method": cluster_method, "ARI": ari, "AMI": ami, "Silhouette": sil, "Pct Clustered": pct_clustered}

In [None]:
def plot_scores(results_dataframe, score_types=("ARI", "AMI"), colors=list(sns.color_palette()), width=0.75):
    fig, axs = plt.subplots(1, len(score_types), figsize=(8 * len(score_types), 8))
    x_ticklabels = results_dataframe.Method.unique()
    x_positions = np.arange(len(x_ticklabels), dtype=np.float32) - width / 2
    dim_red_types = results_dataframe["Dim Reduction"].unique()
    bar_width = width / len(dim_red_types)
    for offset_idx, dim_red in enumerate(dim_red_types):
        color = colors[offset_idx]
        for i, score_type in enumerate(score_types):
            sub_dataframe = results_dataframe[
                (results_dataframe["Score Type"] == score_type) &
                (results_dataframe["Dim Reduction"] == dim_red)
            ]
            axs[i].bar(
                x=x_positions,
                height=sub_dataframe["Score"],
                width=bar_width,
                align="edge",
                color=[(*color, v) for v in sub_dataframe["Pct Clustered"]],
                label=dim_red if i == 0 else None,
            )
            axs[i].set_xlabel("Cluster Method")
            axs[i].set_xticks(np.arange(len(x_ticklabels)))
            axs[i].set_xticklabels(x_ticklabels)
            axs[i].set_ylabel(f"{score_type} Score")
            axs[i].set_title(score_type, fontsize=20)
            axs[i].grid(visible=False, axis="x")
            axs[i].set_ylim([0, 1.05])
        x_positions += bar_width
        
    if len(dim_red_types) > 1:
        fig.legend(loc="center right", bbox_to_anchor=(1.125, 0.5), borderaxespad=0.0, fontsize=20)
        
    fig.tight_layout()

# Pendigits clustering scores

In [None]:
raw_pendigits = digits.data.astype(np.float32)

In [None]:
%%time
km_labels = cluster.KMeans(n_clusters=10).fit_predict(raw_pendigits)
cl_labels = cluster.AgglomerativeClustering(n_clusters=10, linkage="complete").fit_predict(raw_pendigits)
sl_labels = cluster.AgglomerativeClustering(n_clusters=160, linkage="single").fit_predict(raw_pendigits)
db_labels = cluster.DBSCAN(eps=20.0).fit_predict(raw_pendigits)
hd_labels = hdbscan.HDBSCAN(min_samples=5, min_cluster_size=100).fit_predict(raw_pendigits)

In [None]:
pendigits_raw_results = pd.DataFrame(
    [
        eval_clusters(km_labels, digits.target, raw_pendigits, cluster_method="K-Means"),
        eval_clusters(cl_labels, digits.target, raw_pendigits, cluster_method="Complete\nLinkage"),
        eval_clusters(sl_labels, digits.target, raw_pendigits, cluster_method="Single\nLinkage"),
        eval_clusters(db_labels, digits.target, raw_pendigits, cluster_method="DBSCAN"),
        eval_clusters(hd_labels, digits.target, raw_pendigits, cluster_method="HDBSCAN"),
    ]
)
pendigits_raw_results_long = pendigits_raw_results.melt(["Method", "Pct Clustered"], var_name="Score Type", value_name="Score")
pendigits_raw_results_long["Dim Reduction"] = "None"

In [None]:
plot_scores(pendigits_raw_results_long)

In [None]:
pca_pendigits = PCA(n_components=16).fit_transform(raw_pendigits)

In [None]:
%%time
km_pca_labels = cluster.KMeans(n_clusters=10).fit_predict(pca_pendigits)
cl_pca_labels = cluster.AgglomerativeClustering(n_clusters=10, linkage="complete").fit_predict(pca_pendigits)
sl_pca_labels = cluster.AgglomerativeClustering(n_clusters=160, linkage="single").fit_predict(pca_pendigits)
db_pca_labels = cluster.DBSCAN(eps=15.0).fit_predict(pca_pendigits)
hd_pca_labels = hdbscan.HDBSCAN(min_samples=5, min_cluster_size=100).fit_predict(pca_pendigits)

In [None]:
pendigits_pca_results = pd.DataFrame(
    [
        eval_clusters(km_pca_labels, digits.target, raw_pendigits, cluster_method="K-Means"),
        eval_clusters(cl_pca_labels, digits.target, raw_pendigits, cluster_method="Complete\nLinkage"),
        eval_clusters(sl_pca_labels, digits.target, raw_pendigits, cluster_method="Single\nLinkage"),
        eval_clusters(db_pca_labels, digits.target, raw_pendigits, cluster_method="DBSCAN"),
        eval_clusters(hd_pca_labels, digits.target, raw_pendigits, cluster_method="HDBSCAN"),
    ]
)
pendigits_pca_results_long = pendigits_pca_results.melt(["Method", "Pct Clustered"], var_name="Score Type", value_name="Score")
pendigits_pca_results_long["Dim Reduction"] = "PCA"

In [None]:
plot_scores(pd.concat([pendigits_raw_results_long, pendigits_pca_results_long]))

In [None]:
umap_pendigits = umap.UMAP(n_components=4, min_dist=1e-8, random_state=0).fit_transform(raw_pendigits)

In [None]:
%%time
km_umap_labels = cluster.KMeans(n_clusters=10).fit_predict(umap_pendigits)
cl_umap_labels = cluster.AgglomerativeClustering(n_clusters=10, linkage="complete").fit_predict(umap_pendigits)
sl_umap_labels = cluster.AgglomerativeClustering(n_clusters=20, linkage="single").fit_predict(umap_pendigits)
db_umap_labels = cluster.DBSCAN(eps=0.5).fit_predict(umap_pendigits)
hd_umap_labels = hdbscan.HDBSCAN(min_samples=5, min_cluster_size=100).fit_predict(umap_pendigits)

In [None]:
pendigits_umap_results = pd.DataFrame(
    [
        eval_clusters(km_umap_labels, digits.target, raw_pendigits, cluster_method="K-Means"),
        eval_clusters(cl_umap_labels, digits.target, raw_pendigits, cluster_method="Complete\nLinkage"),
        eval_clusters(sl_umap_labels, digits.target, raw_pendigits, cluster_method="Single\nLinkage"),
        eval_clusters(db_umap_labels, digits.target, raw_pendigits, cluster_method="DBSCAN"),
        eval_clusters(hd_umap_labels, digits.target, raw_pendigits, cluster_method="HDBSCAN"),
    ]
)
pendigits_umap_results_long = pendigits_umap_results.melt(["Method", "Pct Clustered"], var_name="Score Type", value_name="Score")
pendigits_umap_results_long["Dim Reduction"] = "UMAP"

In [None]:
plot_scores(pd.concat([pendigits_raw_results_long, pendigits_pca_results_long, pendigits_umap_results_long]))

In [None]:
plot_scores(pd.concat([pendigits_raw_results_long, pendigits_pca_results_long, pendigits_umap_results_long]), 
            score_types=("ARI", "AMI", "Silhouette"))

# COIL-20 Clustering

In [None]:
raw_coil = coil_20_data.astype(np.float32)

In [None]:
%%time
km_labels = cluster.KMeans(n_clusters=20).fit_predict(raw_coil)

In [None]:
%%time
cl_labels = cluster.AgglomerativeClustering(n_clusters=20, linkage="complete").fit_predict(raw_coil)

In [None]:
%%time
sl_labels = cluster.AgglomerativeClustering(n_clusters=80, linkage="single").fit_predict(raw_coil)

In [None]:
%%time
db_labels = cluster.DBSCAN(eps=5000.0).fit_predict(raw_coil)

In [None]:
np.unique(db_labels)

In [None]:
%%time
hd_labels = hdbscan.HDBSCAN(min_samples=3, min_cluster_size=20).fit_predict(raw_coil)

In [None]:
coil_raw_results = pd.DataFrame(
    [
        eval_clusters(km_labels, coil_20_target, raw_coil, cluster_method="K-Means"),
        eval_clusters(cl_labels, coil_20_target, raw_coil, cluster_method="Complete\nLinkage"),
        eval_clusters(sl_labels, coil_20_target, raw_coil, cluster_method="Single\nLinkage"),
        eval_clusters(db_labels, coil_20_target, raw_coil, cluster_method="DBSCAN"),
        eval_clusters(hd_labels, coil_20_target, raw_coil, cluster_method="HDBSCAN"),
    ]
)
coil_raw_results_long = coil_raw_results.melt(["Method", "Pct Clustered"], var_name="Score Type", value_name="Score")
coil_raw_results_long["Dim Reduction"] = "None"

In [None]:
plot_scores(coil_raw_results_long)

In [None]:
pca_coil = PCA(n_components=64).fit_transform(raw_coil)

In [None]:
%%time
km_pca_labels = cluster.KMeans(n_clusters=20).fit_predict(pca_coil)
cl_pca_labels = cluster.AgglomerativeClustering(n_clusters=20, linkage="complete").fit_predict(pca_coil)
sl_pca_labels = cluster.AgglomerativeClustering(n_clusters=80, linkage="single").fit_predict(pca_coil)
db_pca_labels = cluster.DBSCAN(eps=4000.0).fit_predict(pca_coil)
hd_pca_labels = hdbscan.HDBSCAN(min_samples=3, min_cluster_size=20).fit_predict(pca_coil)

In [None]:
coil_pca_results = pd.DataFrame(
    [
        eval_clusters(km_pca_labels, coil_20_target, raw_coil, cluster_method="K-Means"),
        eval_clusters(cl_pca_labels, coil_20_target, raw_coil, cluster_method="Complete\nLinkage"),
        eval_clusters(sl_pca_labels, coil_20_target, raw_coil, cluster_method="Single\nLinkage"),
        eval_clusters(db_pca_labels, coil_20_target, raw_coil, cluster_method="DBSCAN"),
        eval_clusters(hd_pca_labels, coil_20_target, raw_coil, cluster_method="HDBSCAN"),
    ]
)
coil_pca_results_long = coil_pca_results.melt(["Method", "Pct Clustered"], var_name="Score Type", value_name="Score")
coil_pca_results_long["Dim Reduction"] = "PCA"

In [None]:
plot_scores(pd.concat([coil_raw_results_long, coil_pca_results_long]))

In [None]:
umap_coil = umap.UMAP(n_neighbors=5, n_components=4, min_dist=1e-8, random_state=0, n_epochs=1000).fit_transform(raw_coil)

In [None]:
%%time
km_umap_labels = cluster.KMeans(n_clusters=20).fit_predict(umap_coil)
cl_umap_labels = cluster.AgglomerativeClustering(n_clusters=20, linkage="complete").fit_predict(umap_coil)
sl_umap_labels = cluster.AgglomerativeClustering(n_clusters=80, linkage="single").fit_predict(umap_coil)
db_umap_labels = cluster.DBSCAN(eps=0.3).fit_predict(umap_coil)
hd_umap_labels = hdbscan.HDBSCAN(min_samples=3, min_cluster_size=20).fit_predict(umap_coil)

In [None]:
coil_umap_results = pd.DataFrame(
    [
        eval_clusters(km_umap_labels, coil_20_target, raw_coil, cluster_method="K-Means"),
        eval_clusters(cl_umap_labels, coil_20_target, raw_coil, cluster_method="Complete\nLinkage"),
        eval_clusters(sl_umap_labels, coil_20_target, raw_coil, cluster_method="Single\nLinkage"),
        eval_clusters(db_umap_labels, coil_20_target, raw_coil, cluster_method="DBSCAN"),
        eval_clusters(hd_umap_labels, coil_20_target, raw_coil, cluster_method="HDBSCAN"),
    ]
)
coil_umap_results_long = coil_umap_results.melt(["Method", "Pct Clustered"], var_name="Score Type", value_name="Score")
coil_umap_results_long["Dim Reduction"] = "UMAP"

In [None]:
plot_scores(pd.concat([coil_raw_results_long, coil_pca_results_long, coil_umap_results_long]))

In [None]:
plot_scores(
    pd.concat([coil_raw_results_long, coil_pca_results_long, coil_umap_results_long]),
    score_types=("ARI", "AMI", "Silhouette")
)

# MNIST Clustering

In [None]:
raw_mnist = mnist.data.astype(np.float32)[:35000]

In [None]:
%%time
km_labels = cluster.KMeans(n_clusters=10).fit_predict(raw_mnist)
cl_labels = cluster.AgglomerativeClustering(n_clusters=10, linkage="complete").fit_predict(raw_mnist)
sl_labels = cluster.AgglomerativeClustering(n_clusters=80, linkage="single").fit_predict(raw_mnist)
db_labels = cluster.DBSCAN(eps=1000.0).fit_predict(raw_mnist)
hd_labels = hdbscan.HDBSCAN(min_samples=10, min_cluster_size=100).fit_predict(raw_mnist)

In [None]:
mnist_raw_results = pd.DataFrame(
    [
        eval_clusters(km_labels, mnist.target[:35000], raw_mnist, cluster_method="K-Means"),
        eval_clusters(cl_labels, mnist.target[:35000], raw_mnist, cluster_method="Complete\nLinkage"),
        eval_clusters(sl_labels, mnist.target[:35000], raw_mnist, cluster_method="Single\nLinkage"),
        eval_clusters(db_labels, mnist.target[:35000], raw_mnist, cluster_method="DBSCAN"),
        eval_clusters(hd_labels, mnist.target[:35000], raw_mnist, cluster_method="HDBSCAN"),
    ]
)
mnist_raw_results_long = mnist_raw_results.melt(["Method", "Pct Clustered"], var_name="Score Type", value_name="Score")
mnist_raw_results_long["Dim Reduction"] = "None"

In [None]:
plot_scores(mnist_raw_results_long)

In [None]:
pca_mnist = PCA(n_components=32).fit_transform(raw_mnist)

In [None]:
%%time
km_pca_labels = cluster.KMeans(n_clusters=10).fit_predict(pca_mnist)
cl_pca_labels = cluster.AgglomerativeClustering(n_clusters=10, linkage="complete").fit_predict(pca_mnist)
sl_pca_labels = cluster.AgglomerativeClustering(n_clusters=80, linkage="single").fit_predict(pca_mnist)
db_pca_labels = cluster.DBSCAN(eps=600.0).fit_predict(pca_mnist)
hd_pca_labels = hdbscan.HDBSCAN(min_samples=10, min_cluster_size=100).fit_predict(pca_mnist)

In [None]:
mnist_pca_results = pd.DataFrame(
    [
        eval_clusters(km_pca_labels, mnist.target[:35000], raw_mnist, cluster_method="K-Means"),
        eval_clusters(cl_pca_labels, mnist.target[:35000], raw_mnist, cluster_method="Complete\nLinkage"),
        eval_clusters(sl_pca_labels, mnist.target[:35000], raw_mnist, cluster_method="Single\nLinkage"),
        eval_clusters(db_pca_labels, mnist.target[:35000], raw_mnist, cluster_method="DBSCAN"),
        eval_clusters(hd_pca_labels, mnist.target[:35000], raw_mnist, cluster_method="HDBSCAN"),
    ]
)
mnist_pca_results_long = mnist_pca_results.melt(["Method", "Pct Clustered"], var_name="Score Type", value_name="Score")
mnist_pca_results_long["Dim Reduction"] = "PCA"

In [None]:
plot_scores(pd.concat([mnist_raw_results_long, mnist_pca_results_long]))

In [None]:
umap_mnist = umap.UMAP(n_neighbors=10, n_components=4, min_dist=1e-8, random_state=42, n_epochs=500).fit_transform(raw_mnist)

In [None]:
%%time
km_umap_labels = cluster.KMeans(n_clusters=10).fit_predict(umap_mnist)
cl_umap_labels = cluster.AgglomerativeClustering(n_clusters=10, linkage="complete").fit_predict(umap_mnist)
sl_umap_labels = cluster.AgglomerativeClustering(n_clusters=80, linkage="single").fit_predict(umap_mnist)
db_umap_labels = cluster.DBSCAN(eps=0.1).fit_predict(umap_mnist)
hd_umap_labels = hdbscan.HDBSCAN(min_samples=10, min_cluster_size=100).fit_predict(umap_mnist)

In [None]:
mnist_umap_results = pd.DataFrame(
    [
        eval_clusters(km_umap_labels, mnist.target[:35000], raw_mnist, cluster_method="K-Means"),
        eval_clusters(cl_umap_labels, mnist.target[:35000], raw_mnist, cluster_method="Complete\nLinkage"),
        eval_clusters(sl_umap_labels, mnist.target[:35000], raw_mnist, cluster_method="Single\nLinkage"),
        eval_clusters(db_umap_labels, mnist.target[:35000], raw_mnist, cluster_method="DBSCAN"),
        eval_clusters(hd_umap_labels, mnist.target[:35000], raw_mnist, cluster_method="HDBSCAN"),
    ]
)
mnist_umap_results_long = mnist_umap_results.melt(["Method", "Pct Clustered"], var_name="Score Type", value_name="Score")
mnist_umap_results_long["Dim Reduction"] = "UMAP"

In [None]:
plot_scores(pd.concat([mnist_raw_results_long, mnist_pca_results_long, mnist_umap_results_long]))

In [None]:
plot_scores(
    pd.concat([mnist_raw_results_long, mnist_pca_results_long, mnist_umap_results_long]),
    score_types=("ARI", "AMI", "Silhouette")
)

# USPS Clustering

In [None]:
raw_usps = usps.data.astype(np.float32)

In [None]:
%%time
km_labels = cluster.KMeans(n_clusters=10).fit_predict(raw_usps)
cl_labels = cluster.AgglomerativeClustering(n_clusters=10, linkage="complete").fit_predict(raw_usps)
sl_labels = cluster.AgglomerativeClustering(n_clusters=80, linkage="single").fit_predict(raw_usps)
db_labels = cluster.DBSCAN(eps=3.5).fit_predict(raw_usps)
hd_labels = hdbscan.HDBSCAN(min_samples=10, min_cluster_size=100).fit_predict(raw_usps)

In [None]:
usps_raw_results = pd.DataFrame(
    [
        eval_clusters(km_labels, usps.target, raw_usps, cluster_method="K-Means"),
        eval_clusters(cl_labels, usps.target, raw_usps, cluster_method="Complete\nLinkage"),
        eval_clusters(sl_labels, usps.target, raw_usps, cluster_method="Single\nLinkage"),
        eval_clusters(db_labels, usps.target, raw_usps, cluster_method="DBSCAN"),
        eval_clusters(hd_labels, usps.target, raw_usps, cluster_method="HDBSCAN"),
    ]
)
usps_raw_results_long = usps_raw_results.melt(["Method", "Pct Clustered"], var_name="Score Type", value_name="Score")
usps_raw_results_long["Dim Reduction"] = "None"

In [None]:
plot_scores(usps_raw_results_long)

In [None]:
pca_usps = PCA(n_components=32).fit_transform(raw_usps)

In [None]:
%%time
km_pca_labels = cluster.KMeans(n_clusters=10).fit_predict(pca_usps)
cl_pca_labels = cluster.AgglomerativeClustering(n_clusters=10, linkage="complete").fit_predict(pca_usps)
sl_pca_labels = cluster.AgglomerativeClustering(n_clusters=80, linkage="single").fit_predict(pca_usps)
db_pca_labels = cluster.DBSCAN(eps=2.0).fit_predict(pca_usps)
hd_pca_labels = hdbscan.HDBSCAN(min_samples=10, min_cluster_size=100).fit_predict(pca_usps)

In [None]:
usps_pca_results = pd.DataFrame(
    [
        eval_clusters(km_pca_labels, usps.target, raw_usps, cluster_method="K-Means"),
        eval_clusters(cl_pca_labels, usps.target, raw_usps, cluster_method="Complete\nLinkage"),
        eval_clusters(sl_pca_labels, usps.target, raw_usps, cluster_method="Single\nLinkage"),
        eval_clusters(db_pca_labels, usps.target, raw_usps, cluster_method="DBSCAN"),
        eval_clusters(hd_pca_labels, usps.target, raw_usps, cluster_method="HDBSCAN"),
    ]
)
usps_pca_results_long = usps_pca_results.melt(["Method", "Pct Clustered"], var_name="Score Type", value_name="Score")
usps_pca_results_long["Dim Reduction"] = "PCA"

In [None]:
plot_scores(pd.concat([usps_raw_results_long, usps_pca_results_long]))

In [None]:
umap_usps = umap.UMAP(n_neighbors=10, n_components=4, min_dist=1e-8, random_state=42, n_epochs=500).fit_transform(raw_usps)

In [None]:
%%time
km_umap_labels = cluster.KMeans(n_clusters=10).fit_predict(umap_usps)
cl_umap_labels = cluster.AgglomerativeClustering(n_clusters=10, linkage="complete").fit_predict(umap_usps)
sl_umap_labels = cluster.AgglomerativeClustering(n_clusters=80, linkage="single").fit_predict(umap_usps)
db_umap_labels = cluster.DBSCAN(eps=0.15).fit_predict(umap_usps)
hd_umap_labels = hdbscan.HDBSCAN(min_samples=10, min_cluster_size=100).fit_predict(umap_usps)

In [None]:
usps_umap_results = pd.DataFrame(
    [
        eval_clusters(km_umap_labels, usps.target, raw_usps, cluster_method="K-Means"),
        eval_clusters(cl_umap_labels, usps.target, raw_usps, cluster_method="Complete\nLinkage"),
        eval_clusters(sl_umap_labels, usps.target, raw_usps, cluster_method="Single\nLinkage"),
        eval_clusters(db_umap_labels, usps.target, raw_usps, cluster_method="DBSCAN"),
        eval_clusters(hd_umap_labels, usps.target, raw_usps, cluster_method="HDBSCAN"),
    ]
)
usps_umap_results_long = usps_umap_results.melt(["Method", "Pct Clustered"], var_name="Score Type", value_name="Score")
usps_umap_results_long["Dim Reduction"] = "UMAP"

In [None]:
plot_scores(pd.concat([usps_raw_results_long, usps_pca_results_long, usps_umap_results_long]))

In [None]:
plot_scores(
    pd.concat([usps_raw_results_long, usps_pca_results_long, usps_umap_results_long]),
    score_types=("ARI", "AMI", "Silhouette")
)

# Buildings Clustering

In [None]:
raw_buildings = buildings_data.astype(np.float32)

In [None]:
%%time
km_labels = cluster.KMeans(n_clusters=40).fit_predict(raw_buildings)
cl_labels = cluster.AgglomerativeClustering(n_clusters=40, linkage="complete").fit_predict(raw_buildings)
sl_labels = cluster.AgglomerativeClustering(n_clusters=120, linkage="single").fit_predict(raw_buildings)
db_labels = cluster.DBSCAN(eps=6000).fit_predict(raw_buildings)
hd_labels = hdbscan.HDBSCAN(min_samples=3, min_cluster_size=20).fit_predict(raw_buildings)

In [None]:
buildings_raw_results = pd.DataFrame(
    [
        eval_clusters(km_labels, buildings_target, raw_buildings, cluster_method="K-Means"),
        eval_clusters(cl_labels, buildings_target, raw_buildings, cluster_method="Complete\nLinkage"),
        eval_clusters(sl_labels, buildings_target, raw_buildings, cluster_method="Single\nLinkage"),
        eval_clusters(db_labels, buildings_target, raw_buildings, cluster_method="DBSCAN"),
        eval_clusters(hd_labels, buildings_target, raw_buildings, cluster_method="HDBSCAN"),
    ]
)
buildings_raw_results_long = buildings_raw_results.melt(["Method", "Pct Clustered"], var_name="Score Type", value_name="Score")
buildings_raw_results_long["Dim Reduction"] = "None"

In [None]:
plot_scores(buildings_raw_results_long)

In [None]:
pca_buildings = PCA(n_components=32).fit_transform(raw_buildings)

In [None]:
%%time
km_pca_labels = cluster.KMeans(n_clusters=40).fit_predict(pca_buildings)
cl_pca_labels = cluster.AgglomerativeClustering(n_clusters=40, linkage="complete").fit_predict(pca_buildings)
sl_pca_labels = cluster.AgglomerativeClustering(n_clusters=120, linkage="single").fit_predict(pca_buildings)
db_pca_labels = cluster.DBSCAN(eps=2000.0).fit_predict(pca_buildings)
hd_pca_labels = hdbscan.HDBSCAN(min_samples=3, min_cluster_size=20).fit_predict(pca_buildings)

In [None]:
buildings_pca_results = pd.DataFrame(
    [
        eval_clusters(km_pca_labels, buildings_target, raw_buildings, cluster_method="K-Means"),
        eval_clusters(cl_pca_labels, buildings_target, raw_buildings, cluster_method="Complete\nLinkage"),
        eval_clusters(sl_pca_labels, buildings_target, raw_buildings, cluster_method="Single\nLinkage"),
        eval_clusters(db_pca_labels, buildings_target, raw_buildings, cluster_method="DBSCAN"),
        eval_clusters(hd_pca_labels, buildings_target, raw_buildings, cluster_method="HDBSCAN"),
    ]
)
buildings_pca_results_long = buildings_pca_results.melt(["Method", "Pct Clustered"], var_name="Score Type", value_name="Score")
buildings_pca_results_long["Dim Reduction"] = "PCA"

In [None]:
plot_scores(pd.concat([buildings_raw_results_long, buildings_pca_results_long]))

In [None]:
umap_buildings = umap.UMAP(n_neighbors=8, n_components=4, min_dist=1e-8, random_state=42, n_epochs=1000).fit_transform(raw_buildings)

In [None]:
%%time
km_umap_labels = cluster.KMeans(n_clusters=40).fit_predict(umap_buildings)
cl_umap_labels = cluster.AgglomerativeClustering(n_clusters=40, linkage="complete").fit_predict(umap_buildings)
sl_umap_labels = cluster.AgglomerativeClustering(n_clusters=120, linkage="single").fit_predict(umap_buildings)
db_umap_labels = cluster.DBSCAN(eps=0.25).fit_predict(umap_buildings)
hd_umap_labels = hdbscan.HDBSCAN(min_samples=3, min_cluster_size=20).fit_predict(umap_buildings)

In [None]:
buildings_umap_results = pd.DataFrame(
    [
        eval_clusters(km_umap_labels, buildings_target, raw_buildings, cluster_method="K-Means"),
        eval_clusters(cl_umap_labels, buildings_target, raw_buildings, cluster_method="Complete\nLinkage"),
        eval_clusters(sl_umap_labels, buildings_target, raw_buildings, cluster_method="Single\nLinkage"),
        eval_clusters(db_umap_labels, buildings_target, raw_buildings, cluster_method="DBSCAN"),
        eval_clusters(hd_umap_labels, buildings_target, raw_buildings, cluster_method="HDBSCAN"),
    ]
)
buildings_umap_results_long = buildings_umap_results.melt(["Method", "Pct Clustered"], var_name="Score Type", value_name="Score")
buildings_umap_results_long["Dim Reduction"] = "UMAP"

In [None]:
plot_scores(pd.concat([buildings_raw_results_long, buildings_pca_results_long, buildings_umap_results_long]))

In [None]:
plot_scores(
    pd.concat([buildings_raw_results_long, buildings_pca_results_long, buildings_umap_results_long]),
    score_types=("ARI", "AMI", "Silhouette")
)