DBSCAN produces different number of clusters using cuML compared to sklearn #63

Closed
raghavmi opened this issue Dec 13, 2018 · 3 comments
Labels: ? - Needs Triage (Need team to review and classify), bug (Something isn't working)

Comments

@raghavmi

DBSCAN produces a different number of clusters in cuML than in sklearn when run on the same data with the same parameters.

Dataset to reproduce:
https://github.com/PatWalters/gpu_kmeans/blob/master/fp.csv

Code to reproduce:

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN as skDBSCAN
from cuml import DBSCAN as cumlDBSCAN
import cudf
import os

X = pd.read_csv("fp.csv")
print('data',X.shape)

eps = 3
min_samples = 2

clustering_sk = skDBSCAN(eps = eps, min_samples = min_samples)
clustering_sk.fit(X)
print("# of sklearn clusters", len(set(clustering_sk.labels_)))

X = cudf.DataFrame.from_pandas(X)
clustering_cuml = cumlDBSCAN(eps = eps, min_samples = min_samples)
clustering_cuml.fit(X)
print("# of cuML clusters", clustering_cuml.labels_.unique_count())
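
A note on counting: scikit-learn marks noise points with the label -1, so `len(set(labels_))` counts noise as one extra "cluster" whenever any point is noise. The helper below is a minimal sketch for counting clusters only. `count_clusters` and `noise_label` are hypothetical names, and it assumes the cuML labels follow the same -1 convention and have been converted to a plain Python sequence first. This does not explain the discrepancy reported here; it only makes the two counts comparable.

```python
def count_clusters(labels, noise_label=-1):
    """Count distinct cluster labels, ignoring the noise label."""
    return len({int(label) for label in labels if int(label) != noise_label})


# Toy labels in the shape DBSCAN returns them: -1 marks noise.
example_labels = [0, 0, 1, -1, 2, 2, -1]
print(len(set(example_labels)))        # 4: noise is counted as a cluster
print(count_clusters(example_labels))  # 3: clusters only
```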
@dantegd
Member

dantegd commented Dec 13, 2018

@cjnolet has a new version of dbscan that is almost ready for the next version. Corey, if you get the chance, could you see if this issue is still there on the new dbscan?

@cjnolet
Member

cjnolet commented Dec 17, 2018

I am able to reproduce the inconsistency in the clusters with the new algorithm. I believe this is related to how DBSCAN batches large datasets to scale on a single GPU. Deleting half the data (so that it fits in a single batch), either from the beginning or the end, yields consistent clusters.

Running the new DBSCAN on randomly generated (and dense) data of exactly the same size consistently yields the same results as sklearn.
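
Checking whether two runs yield "consistent clusters" needs more than equal cluster counts, because cluster IDs are arbitrary and may be numbered differently by each implementation. A minimal sketch of such a check, in plain Python, is below. `same_partition` is a hypothetical helper, not part of sklearn or cuML, and it assumes both label sequences use -1 for noise and have been converted to Python lists. Note that DBSCAN can legitimately assign a border point to different clusters depending on processing order, so a strict match like this may be too demanding on some data.

```python
def same_partition(labels_a, labels_b, noise_label=-1):
    """Return True if both labelings group the points identically,
    allowing cluster IDs to be renamed but requiring noise to match noise."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        a, b = int(a), int(b)
        # A point must be noise in both labelings or in neither.
        if (a == noise_label) != (b == noise_label):
            return False
        # The mapping between cluster IDs must be one-to-one in both directions.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


# Same grouping with renamed IDs, then a labeling that splits a cluster.
print(same_partition([0, 0, 1, -1], [5, 5, 2, -1]))  # True
print(same_partition([0, 0, 1, -1], [0, 1, 1, -1]))  # False
```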

@mike-wendt added the ? - Needs Triage (Need team to review and classify) label Dec 20, 2018
@mike-wendt added this to Needs prioritizing in Other Issue Triage Dec 20, 2018
@dantegd added the bug (Something isn't working) label Dec 22, 2018
@datametrician

Closing this issue, as it will be resolved in cuML 0.5.
