# HDBSCAN Soft Clustering Benchmark

This notebook provides a quick benchmark comparing RAPIDS cuML's HDBSCAN soft clustering on the GPU against the scikit-learn-contrib `hdbscan` implementation on the CPU.

This benchmark uses the [A Million News Headlines dataset](https://www.kaggle.com/datasets/therohk/million-headlines) from Kaggle, which contains over 1 million news article headlines from the Australian Broadcasting Corporation. The dataset will need to be downloaded to run this notebook.

To run this notebook, you will need RAPIDS cuML installed in addition to HDBSCAN and the `sentence-transformers` library. All of these libraries can be installed with `conda`.

In [1]:
import os
import time
import json
from datetime import datetime

import numpy as np
import pandas as pd

import cuml
import hdbscan

from sentence_transformers import SentenceTransformer


Adjust the path below to point to the zip file of the Million News Headlines dataset downloaded from Kaggle. 

In [2]:
million_articles_path = "/home/cjnolet/Downloads/archive.zip"

Adjust the name below to change the output file. Any existing file with the same name is deleted first.

In [3]:
DATE_TAG = datetime.now().strftime("%Y-%m-%d")

outpath = f"hdbscan-apmv-benchmark-results-{DATE_TAG}.jsonl"
if os.path.exists(outpath):
    os.remove(outpath)    
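
The benchmark loop later in this notebook appends one JSON object per line to this file, with keys such as `backend`, `nrows`, and `fit_time`. A minimal sketch of how such a file could be read back and summarized is shown here. The helper names `load_results` and `fit_speedups` are hypothetical and not part of the notebook, and the speedup definition (CPU fit time divided by GPU fit time) is an assumption about what a reader might want to compute.

```python
import json
from collections import defaultdict


def load_results(path):
    """Read a JSON-lines file: one JSON object per non-empty line."""
    with open(path) as fh:
        return [json.loads(line) for line in fh if line.strip()]


def fit_speedups(records, baseline="hdbscan", candidate="cuml"):
    """Map nrows -> baseline fit_time / candidate fit_time.

    Sizes that lack a record for either backend are skipped.
    """
    by_rows = defaultdict(dict)
    for rec in records:
        by_rows[rec["nrows"]][rec["backend"]] = rec["fit_time"]
    return {
        nrows: times[baseline] / times[candidate]
        for nrows, times in sorted(by_rows.items())
        if baseline in times and candidate in times
    }
```

Once the benchmark has finished, calling `fit_speedups(load_results(outpath))` would give one ratio per dataset size for which both backends completed.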

Some options and settings for controlling the benchmarking behavior. 

In [4]:
benchmark_soft_cluster = True

MIN_SAMPLES = 50
MIN_CLUSTER_SIZE = 5

BACKENDS = {
    "cuml": cuml.cluster.hdbscan,
    "hdbscan": hdbscan
}

SIZES = [
    25000,
    50000,
    100000,
    200000,
    400000,
    800000,
    1600000
]

Creating a CUDA context adds a small one-time overhead on the GPU. Warm up the GPU so that this overhead is excluded from the benchmarks.

In [5]:
%%time
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    prediction_data=True
)
clusterer.fit(np.arange(1000).reshape(50,20))

CPU times: user 3.84 s, sys: 2.03 s, total: 5.87 s
Wall time: 5.89 s


HDBSCAN()

Create a lightweight Python context manager to time the HDBSCAN soft clustering steps.

In [6]:
class Timer:    
    def __enter__(self):
        self.tick = time.time()
        return self

    def __exit__(self, *args, **kwargs):
        self.tock = time.time()
        self.elapsed = self.tock - self.tick
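
As a usage illustration, the sketch below times a stand-in workload with the same context-manager pattern. It swaps `time.time()` for `time.perf_counter()`, a monotonic clock that is generally better suited to measuring intervals. That swap, the class name `PerfTimer`, and the workload are assumptions made for illustration and are not what the notebook itself uses.

```python
import time


class PerfTimer:
    """Variant of the Timer above that uses a monotonic clock."""

    def __enter__(self):
        self.tick = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        self.tock = time.perf_counter()
        self.elapsed = self.tock - self.tick
        # Returning None lets any exception raised in the block propagate.


with PerfTimer() as t:
    total = sum(range(100_000))  # stand-in workload

print(f"elapsed: {t.elapsed:.6f} s")
```

The elapsed time is only available after the `with` block exits, which is why the benchmark loop reads `.elapsed` outside the block.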

Read the dataset into a pandas DataFrame.

In [7]:
df = pd.read_csv(million_articles_path)

Encode the headlines with a sentence-transformer model, reduce the embedding dimensions with cuML's UMAP, and shuffle the rows.

In [12]:
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(df.headline_text)
umap = cuml.manifold.UMAP(n_components=15, n_neighbors=15, min_dist=0.0, random_state=12)
reduced_data = umap.fit_transform(embeddings)
np.random.shuffle(reduced_data)
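
The shuffle above is unseeded, so the first-`n` subsets used by the benchmark differ from run to run. If reproducible subsets are wanted, a seeded permutation is one option. The sketch below assumes a NumPy array and uses a small stand-in for `reduced_data`; the seed value and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(12)  # seed value is an arbitrary choice

# Stand-in for reduced_data: 10 rows, 3 columns.
data = np.arange(30, dtype=np.float32).reshape(10, 3)

# permutation() returns a shuffled row order; indexing with it makes a
# shuffled copy and leaves the original array unchanged.
shuffled = data[rng.permutation(len(data))]

subset = shuffled[:4, :]  # analogous to reduced_data[:n, :]
print(subset.shape)
```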

In [13]:
k = reduced_data.shape[1]

Run the benchmark for each configured number of data points.

In [None]:
for n in SIZES:
    for library, backend in BACKENDS.items():
        bench_data = reduced_data[:n,:]

        benchmark_payload = {}
        benchmark_payload["backend"] = library
        
        with Timer() as fit_timer:
            clusterer = backend.HDBSCAN(
                min_samples=MIN_SAMPLES,
                min_cluster_size=MIN_CLUSTER_SIZE,
                metric='euclidean',
                prediction_data=True
            )
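            clusterer.fit(bench_data)
        benchmark_payload["fit_time"] = fit_timer.elapsed

        # Counted outside the timed block; the count includes the noise
        # label (-1) when any points are labeled as noise.
        nclusters = len(np.unique(clusterer.labels_))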

        if benchmark_soft_cluster:
            with Timer() as membership_timer:
                soft_clusters = backend.all_points_membership_vectors(clusterer)
            benchmark_payload["membership_time"] = membership_timer.elapsed

        benchmark_payload["ncols"] = k
        benchmark_payload["nrows"] = bench_data.shape[0]
        benchmark_payload["min_samples"] = MIN_SAMPLES
        benchmark_payload["min_cluster_size"] = MIN_CLUSTER_SIZE
        benchmark_payload["num_clusters"] = nclusters
        print(benchmark_payload)

        with open(outpath, "a") as fh:
            fh.write(json.dumps(benchmark_payload))
            fh.write("\n")

        time.sleep(1)

{'backend': 'cuml', 'fit_time': 1.1983146667480469, 'membership_time': 0.004135608673095703, 'ncols': 15, 'nrows': 25000, 'min_samples': 50, 'min_cluster_size': 5, 'num_clusters': 58}
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)