# Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN is a density-based clustering algorithm: it groups points that lie in regions of high density and labels points in sparse regions as noise.

The model accepts array-like inputs, either on the host (NumPy arrays) or on the device (Numba device arrays or any object that implements `__cuda_array_interface__`), as well as cuDF DataFrames.

For information about the cuDF format, refer to the [cuDF documentation](https://rapidsai.github.io/projects/cudf/en/latest/).

For information about cuML's DBSCAN implementation, refer to the [cuML API documentation](https://rapidsai.github.io/projects/cuml/en/latest/api.html#dbscan).

In [1]:
import cudf as gd

from cuml.cluster import DBSCAN as cumlDBSCAN
from cuml.datasets import make_blobs

from sklearn.cluster import DBSCAN as skDBSCAN
from sklearn.metrics import adjusted_rand_score

## Generate Data

In [30]:
n_samples = 100000
n_features = 128

In [31]:
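# Generate the blobs directly on the GPU
device_data, device_labels = make_blobs(
    n_samples=n_samples, n_features=n_features, centers=5, random_state=7)

# Wrap the device arrays in cuDF objects.
# Note: from_gpu_matrix is a legacy cuDF API; recent releases may require
# gd.DataFrame(device_data) instead.
device_data = gd.DataFrame.from_gpu_matrix(device_data)
device_labels = gd.Series(device_labels)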

In [32]:
host_data = device_data.to_pandas()
host_labels = device_labels.to_pandas()

## Define Parameters

In [33]:
eps = 3
min_samples = 2
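
To make the roles of `eps` and `min_samples` concrete, here is a minimal pure-Python sketch of the textbook DBSCAN procedure on a toy 2-D dataset. It is an illustration only, not the cuML or scikit-learn implementation: the names `region_query` and `dbscan_sketch` are hypothetical, the neighbor search is a naive O(n²) scan, and it assumes the scikit-learn convention that a point counts as its own neighbor when comparing against `min_samples`.

```python
from math import dist

NOISE = -1  # label used for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan_sketch(points, eps, min_samples):
    """Naive DBSCAN: returns one cluster label per point, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            # Not a core point; it may still be claimed later as a border point.
            labels[i] = NOISE
            continue
        # points[i] is a core point: start a new cluster and expand it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels


toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan_sketch(toy_points, eps=1.5, min_samples=2))  # [0, 0, 0, 1, 1, -1]
print(dbscan_sketch(toy_points, eps=1.5, min_samples=3))  # [0, 0, 0, -1, -1, -1]
```

Raising `min_samples` from 2 to 3 turns the two-point group into noise, because neither of its points has three neighbors within `eps`.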

## Scikit-learn Model

In [34]:
%%time
clustering_sk = skDBSCAN(eps=eps,
                         min_samples=min_samples,
                         algorithm="brute",
                         n_jobs=-1)
clustering_sk.fit(host_data)

CPU times: user 31min 50s, sys: 42min 26s, total: 1h 14min 16s
Wall time: 2min 3s


## cuML Model

In [22]:
%%time
clustering_cuml = cumlDBSCAN(eps=eps,
                             min_samples=min_samples)
clustering_cuml.fit(device_data)

CPU times: user 244 ms, sys: 12 ms, total: 256 ms
Wall time: 258 ms


## Evaluate Results

In [23]:
%%time
cuml_score = adjusted_rand_score(host_labels, clustering_cuml.labels_)
sk_score = adjusted_rand_score(host_labels, clustering_sk.labels_)

CPU times: user 2.83 s, sys: 28 ms, total: 2.86 s
Wall time: 2.85 s
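
The adjusted Rand index is used here because cluster ids are arbitrary: two runs can find the same grouping and still number the clusters differently. The following pure-Python sketch computes the index from its pair-counting definition to show that property. The function name `adjusted_rand_index` is hypothetical and the code is a simplified illustration, not scikit-learn's implementation; it assumes at least two samples.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting adjusted Rand index (assumes len(labels_true) >= 2)."""
    n = len(labels_true)
    # Contingency counts: how many samples share each (true, predicted) label pair.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (sum_cells - expected) / (max_index - expected)


truth = [0, 0, 1, 1, 2, 2]
renumbered = [2, 2, 0, 0, 1, 1]  # same grouping, different cluster ids
coarser = [0, 0, 0, 1, 1, 1]     # a genuinely different grouping

print(adjusted_rand_index(truth, renumbered))
print(adjusted_rand_index(truth, coarser))
```

The renumbered labeling scores 1.0 even though no individual label matches, while the coarser grouping scores well below 1.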


In [29]:
# Compare the absolute difference so a lower cuML score cannot pass by accident
passed = abs(cuml_score - sk_score) < 1e-10
print('compare dbscan: cuml vs sklearn labels_ are ' + ('equal' if passed else 'NOT equal'))

compare dbscan: cuml vs sklearn labels_ are equal
