# Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

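DBSCAN is a density-based clustering algorithm: it groups points that lie in dense regions and labels points in sparse regions as noise. It does not need the number of clusters to be specified in advance, which makes it a good fit for datasets whose clusters form regions of high density.

The model accepts array-like inputs, either in host memory as NumPy arrays or in device memory (as Numba arrays or any object that implements `__cuda_array_interface__`), as well as cuDF DataFrames.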

For information about cuDF, refer to the [cuDF documentation](https://docs.rapids.ai/api/cudf/stable).

For information about cuML's DBSCAN implementation, refer to the [cuML API documentation](https://rapidsai.github.io/projects/cuml/en/stable/api.html#cuml.DBSCAN).
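To make the density idea concrete, the following is a minimal pure-Python sketch of the textbook DBSCAN procedure. It is for illustration only and is not how cuML or scikit-learn implement the algorithm: the names `dbscan_sketch` and `region_query` are hypothetical, the neighbor search is brute force, and the toy points are made up for this example. The parameters `eps` and `min_samples` play the same roles as in the rest of this notebook.

```python
from math import dist

NOISE = -1        # label for points that belong to no cluster
UNVISITED = None  # placeholder for points not yet processed


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan_sketch(points, eps, min_samples):
    """Illustration-only DBSCAN with an O(n^2) brute-force neighbor search."""
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            # Not a core point; may be claimed later as a border point.
            labels[i] = NOISE
            continue
        # points[i] is a core point: start a new cluster and expand it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Two tight groups of three points and one isolated point (made-up data).
toy_points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 10)]
toy_labels = dbscan_sketch(toy_points, eps=0.2, min_samples=3)
print(toy_labels)
```

The isolated point has no neighbors within `eps`, so it keeps the noise label `-1`, the same convention used for noise later in this notebook.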

## Imports

In [None]:
import cudf
import matplotlib.pyplot as plt
import numpy as np
from cuml.datasets import make_blobs
from cuml.cluster import DBSCAN as cuDBSCAN
from sklearn.cluster import DBSCAN as skDBSCAN
from sklearn.metrics import adjusted_rand_score

%matplotlib inline

## Define Parameters

In [None]:
n_samples = 10**4
n_features = 2

eps = 0.15
min_samples = 3
random_state = 23

## Generate Data

In [None]:
%%time
device_data, device_labels = make_blobs(n_samples=n_samples, 
                                        n_features=n_features,
                                        centers=5,
                                        cluster_std=0.1,
                                        random_state=random_state)

device_data = cudf.DataFrame.from_gpu_matrix(device_data)
device_labels = cudf.Series(device_labels)

In [None]:
# Copy dataset from GPU memory to host memory.
# This is done to later compare CPU and GPU results.
host_data = device_data.to_pandas()
host_labels = device_labels.to_pandas()

## Scikit-learn Model

### Fit

In [None]:
%%time
clustering_sk = skDBSCAN(eps=eps,
                         min_samples=min_samples,
                         algorithm="brute",
                         n_jobs=-1)

clustering_sk.fit(host_data)

## cuML Model

### Fit

In [None]:
%%time
clustering_cuml = cuDBSCAN(eps=eps,
                           min_samples=min_samples,
                           verbose=True,
                           max_mbytes_per_batch=13e3)

clustering_cuml.fit(device_data, out_dtype="int32")

## Visualize Clusters

Chart the resulting clusters from cuML's DBSCAN, where each color represents one cluster found by the algorithm and black points are those not assigned to any cluster. (Unlike many clustering algorithms, DBSCAN can label outlier points as "noise", meaning they do not belong to any cluster.)
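Because noise points share the label `-1`, counting clusters means counting the distinct labels other than `-1`. The sketch below shows that bookkeeping on a made-up label list; `summarize_labels` is a hypothetical helper for this example and is not part of cuML or scikit-learn.

```python
from collections import Counter


def summarize_labels(labels, noise_label=-1):
    """Return (number of clusters, number of noise points, cluster sizes)."""
    counts = Counter(labels)
    n_noise = counts.pop(noise_label, 0)  # remove noise before counting clusters
    return len(counts), n_noise, dict(counts)


# Made-up labels: three clusters (0, 1, 2) and two noise points.
example_labels = [0, 0, 1, -1, 1, 2, -1]
n_clusters, n_noise, sizes = summarize_labels(example_labels)
print(n_clusters, n_noise, sizes)
```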

In [None]:
fig = plt.figure(figsize=(16, 10))

X = np.array(host_data)
labels = clustering_cuml.labels_

# Black is reserved for noise points.
unique_labels = labels.unique()

# Count the clusters found, excluding the noise label (-1).
n_clusters_ = len(unique_labels) - int((unique_labels == -1).any())
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markersize=5, markeredgecolor=tuple(col))

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

## Evaluate Results

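Use `adjusted_rand_score` to score each model's labels against the ground-truth labels returned by `make_blobs`, then check that the two scores agree. The adjusted Rand score depends only on which points are grouped together, so the two models can score the same even if the exact numerical labels are not identical.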
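The sketch below shows why the adjusted Rand score is a suitable comparison: it is computed from counts of point pairs that are grouped together, so renaming the clusters does not change it. `ari_sketch` is a hypothetical pure-Python version of the standard adjusted Rand index formula, written for illustration and assuming at least two samples; use `sklearn.metrics.adjusted_rand_score` for real work.

```python
from collections import Counter
from math import comb


def ari_sketch(labels_a, labels_b):
    """Adjusted Rand index from pair counts (assumes len(labels_a) >= 2)."""
    n = len(labels_a)
    # Contingency counts: how many points carry each (label_a, label_b) pair.
    pair_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every point in a single cluster
    return (sum_pairs - expected) / (max_index - expected)


truth = [0, 0, 0, 1, 1, 1]
relabeled = [5, 5, 5, 2, 2, 2]  # same grouping, different label values
shuffled = [0, 0, 1, 1, 0, 1]   # a different grouping

print(ari_sketch(truth, relabeled))
print(ari_sketch(truth, shuffled))
```

The relabeled grouping scores 1.0 against the truth, while the different grouping scores lower.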

In [None]:
%%time
sk_score = adjusted_rand_score(host_labels, clustering_sk.labels_)
cuml_score = adjusted_rand_score(host_labels, clustering_cuml.labels_)

In [None]:
# Compare the absolute difference so a lower cuML score cannot pass by accident.
passed = abs(cuml_score - sk_score) < 1e-10
print('compare dbscan: cuml vs sklearn labels_ are ' + ('equal' if passed else 'NOT equal'))