# Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

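DBSCAN is a density-based clustering algorithm: it groups points that lie in regions of high density and marks points in sparse regions as noise. It does not need the number of clusters to be specified in advance, and it can find clusters of arbitrary shape.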

The cuML model accepts array-like inputs: NumPy arrays in host memory, device arrays (Numba device arrays or any object that is `__cuda_array_interface__`-compliant), and cuDF DataFrames.
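As a rough illustration of the host/device distinction, the sketch below checks whether an object exposes the `__cuda_array_interface__` attribute. The helper `array_location` and the class `FakeDeviceArray` are hypothetical names introduced here for illustration. This is not how cuML dispatches on input types internally, and the fake class owns no real GPU memory.

```python
def array_location(obj):
    """Return 'device' if obj advertises the CUDA Array Interface, else 'host'."""
    return "device" if hasattr(obj, "__cuda_array_interface__") else "host"


class FakeDeviceArray:
    """Hypothetical stand-in for a GPU array; it owns no real device memory."""
    __cuda_array_interface__ = {
        "shape": (4, 2),       # 4 samples, 2 features
        "typestr": "<f4",      # little-endian float32
        "data": (0, False),    # (device pointer, read_only); dummy pointer here
        "version": 2,
    }


print(array_location([[0.0, 1.0]]))       # a plain Python list lives on the host
print(array_location(FakeDeviceArray()))  # exposes the interface attribute
```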

For information about the cuDF format, refer to the [cuDF documentation](https://rapidsai.github.io/projects/cudf/en/latest/).

For information about cuML's DBSCAN implementation, refer to the [cuML documentation](https://rapidsai.github.io/projects/cuml/en/latest/index.html).

In [None]:
import numpy as np
import pandas as pd
import cudf as gd

from sklearn import datasets
# adjusted_rand_score is used in the evaluation cells
from sklearn.metrics import adjusted_rand_score

from sklearn.cluster import DBSCAN as skDBSCAN
from cuml.cluster import DBSCAN as cumlDBSCAN

## Generate Data

In [None]:
n_samples = 100000
n_features = 2

In [5]:
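data, labels = datasets.make_blobs(
    n_samples=n_samples, n_features=n_features, centers=5, random_state=7)

# Wrap the features in a pandas DataFrame; the fitting cells expect `X`
X = pd.DataFrame({'fea%d' % i: data[:, i] for i in range(n_features)})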

## Define Parameters

**eps:** maximum distance between two sample points for them to be considered part of the same neighborhood

**min_samples:** minimum number of samples that must be present in a neighborhood for a point to be considered a core point
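To make the roles of the two parameters concrete, here is a minimal pure-Python sketch that labels each point as core, border, or noise. It assumes the scikit-learn convention that a point counts as a member of its own neighborhood and that distances equal to `eps` are included. The function `classify_points` is a hypothetical helper for illustration, not part of scikit-learn or cuML, and its brute-force pairwise loop only suits tiny inputs.

```python
from math import dist


def classify_points(points, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' (illustrative only)."""
    # Neighborhood of a point: indices of all points within eps, itself included.
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    # A core point has at least min_samples points in its neighborhood.
    is_core = [len(nbrs) >= min_samples for nbrs in neighbors]

    kinds = []
    for i, nbrs in enumerate(neighbors):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in nbrs):
            kinds.append("border")  # not dense itself, but within eps of a core point
        else:
            kinds.append("noise")
    return kinds


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
kinds = classify_points(points, eps=1.0, min_samples=3)
print(kinds)  # ['core', 'border', 'core', 'border', 'noise']
```

Lowering `min_samples` or raising `eps` turns more points into core points, which tends to merge nearby clusters and leave fewer points labeled as noise.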

In [None]:
eps = 3
min_samples = 2

## Fit Scikit-learn Model

In [None]:
%%time
clustering_sk = skDBSCAN(eps=eps, min_samples=min_samples)
clustering_sk.fit(X)

## Fit cuML Model

In [2]:
%%time
X = gd.DataFrame.from_pandas(X)

In [None]:
%%time
clustering_cuml = cumlDBSCAN(eps=eps, min_samples=min_samples)
clustering_cuml.fit(X)

## Evaluate Results

In [None]:
%%time
cuml_score = adjusted_rand_score(labels, clustering_cuml.labels_)
sk_score = adjusted_rand_score(labels, clustering_sk.labels_)

In [None]:
threshold = 1e-5

# Compare the absolute difference so a lower cuML score cannot pass by accident
passed = abs(cuml_score - sk_score) < threshold
message = 'compare dbscan: cuml vs sklearn labels_ are ' + ('equal' if passed else 'NOT equal')
print(message)
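The comparison above goes through the adjusted Rand index rather than comparing `labels_` element by element, because the cluster ids assigned by two implementations need not match even when the partitions are identical. The sketch below is an illustrative pure-Python version of the pair-counting formula, not scikit-learn's implementation. The name `adjusted_rand_index` is hypothetical, and the code assumes at least two samples.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index from pair counts (illustrative re-implementation)."""
    n = len(labels_a)  # assumed to be >= 2
    # Pairs of samples that share a cluster in both labelings.
    cells = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    # Pairs of samples that share a cluster in each labeling separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put every sample in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


truth = [0, 0, 0, 1, 1, 1]
renamed = [5, 5, 5, 2, 2, 2]    # same partition, different label ids
different = [0, 0, 1, 1, 2, 2]  # a genuinely different partition

score_renamed = adjusted_rand_index(truth, renamed)
score_different = adjusted_rand_index(truth, different)
print(score_renamed)    # 1.0
print(score_different)  # strictly between 0 and 1
```

Two labelings that describe the same partition score 1.0 regardless of which integer each cluster received, so equal scores against the ground truth are a reasonable, though not conclusive, signal that the two models agree.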