dbscan crashes when the data set grows large #31

Closed
michaelkingdom opened this issue Nov 11, 2018 · 18 comments
Comments

@michaelkingdom

GPU used: 1u Tesla P40;
OS and version: Ubuntu 18.04;
CUDA version: 9.2;
Driver: 410.48;
gcc version: 7.3;
python version: 3.5

I have a data set that contains 180,914 rows and 48 columns; most values are integers (from 0 to 10,000):
(screenshot of the data set omitted)

I convert the full data frame to float64.
It works well when we run it with the sklearn library (CPU).
When I tried the DBSCAN implementation in cuML, it crashed with no response at all, and I had to restart the whole kernel.
Then I reduced the rows in our data set from 180K to 10K: that works, but it is slow, taking about 3 s; 20K rows takes 6 s, 30K rows takes 9 s, and it crashes when the data reaches 70K rows.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN as skDBSCAN
from sklearn.datasets.samples_generator import make_blobs
from cuML import DBSCAN as cumlDBSCAN
import pygdf
from pygdf.dataframe import DataFrame
import os
import dask
# import dask_gdf
from dask.delayed import delayed
from dask.distributed import Client, wait
from collections import OrderedDict
from glob import glob
from sklearn.cluster import KMeans
import re
from itertools import cycle
from sklearn.preprocessing import StandardScaler

## LOAD DATA
X = load_data()

eps = 0.5
min_samples = 3

# run using the sklearn lib
clustering_sk = skDBSCAN(eps=eps, min_samples=min_samples)
clustering_sk.fit(X)

# run using the cuML lib
Y = pygdf.DataFrame.from_pandas(X)

clustering_cuml = cumlDBSCAN(eps=eps, min_samples=min_samples)
Z = Y.head(70000)
clustering_cuml.fit(Z)
```
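The scaling behavior described above (about 3 s at 10K rows, 6 s at 20K, 9 s at 30K, crash around 70K) can be reproduced with a small loop over row counts. A minimal sketch, assuming the imports, `X`, `eps`, and `min_samples` from the snippet above:

```python
import time

# Time cuML DBSCAN on progressively larger slices of the data set to find
# where the slowdown or crash begins.
for n_rows in (10000, 20000, 30000, 70000):
    Z = pygdf.DataFrame.from_pandas(X.head(n_rows))
    model = cumlDBSCAN(eps=eps, min_samples=min_samples)
    t0 = time.time()
    model.fit(Z)
    print("{} rows: {:.1f} s".format(n_rows, time.time() - t0))
```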

@datametrician

Thank you. Others have experienced this and we're actively debugging the issue.

@dantegd
Member

dantegd commented Nov 14, 2018

@michaelkingdom is this still the case after the recent PRs that included bug fixes for dbscan?

In particular, these were recently merged and potentially affect this bug:
#33 and
#30

@dantegd dantegd added the "bug: Something isn't working" label Nov 14, 2018
@michaelkingdom
Author

@dantegd, we tried the new dbscan file.
It no longer crashes, but the speed is slow and the result is NOT equal to the sklearn DBSCAN result.
Should we file another issue for the speed and result problems?

@datametrician

@dantegd any update on this issue?

@datametrician

#30
#34

@dantegd
Member

dantegd commented Nov 16, 2018

@michaelkingdom I believe the latest merged PR #35, which reworked dbscan, should alleviate the issues; at least it did for me and a few others. Let us know if that is the case; if not, we'll dig further. Thanks!

@michaelkingdom
Author

@dantegd Yes, the crash issue is fixed, but the GPU speed is slow. Could you please help me figure it out?
My code is below:
```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN as skDBSCAN
from cuml import DBSCAN as cumlDBSCAN
import cudf
from cudf.dataframe import DataFrame

from timeit import default_timer

class Timer(object):
    def __init__(self):
        self._timer = default_timer

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *args):
        self.stop()

    def start(self):
        """Start the timer."""
        self.start_time = self._timer()

    def stop(self):
        """Stop the timer and compute the interval in seconds."""
        self.end_time = self._timer()
        self.interval = self.end_time - self.start_time

import gzip

def load_data(cached='/datasets/mr_clean.xlsx'):
    mr_data = pd.read_excel(cached, header=0)
    mr_cols = mr_data.columns
    print('data.columns = \n', mr_data.columns)
    mr_len = mr_data.iloc[:, 0].size
    print("mr length:", mr_len)

    # Fill missing values with the column means and cast to float64.
    X = mr_data.fillna(mr_data.mean())
    X = X.astype(np.float64)

    print("test data info:", mr_data.head())

    return X

## LOAD DATA 1 TIME
X = load_data()

from sklearn.metrics import mean_squared_error

def array_equal(a, b, threshold=5e-3, with_sign=True):
    a = to_nparray(a)
    b = to_nparray(b)
    if with_sign == False:
        a, b = np.abs(a), np.abs(b)
    res = mean_squared_error(a, b) < threshold
    return res

def to_nparray(x):
    if isinstance(x, np.ndarray) or isinstance(x, pd.DataFrame):
        return np.array(x)
    elif isinstance(x, np.float64):
        return np.array([x])
    elif isinstance(x, cudf.DataFrame) or isinstance(x, cudf.Series):
        return x.to_pandas().values
    return x

## Run tests
eps = 3
min_samples = 3

%%time
clustering_sk = skDBSCAN(eps=eps, min_samples=min_samples)
clustering_sk.fit(X)

%%time
Y = cudf.DataFrame.from_pandas(X)

%%time
clustering_cuml = cumlDBSCAN(eps=eps, min_samples=min_samples)
clustering_cuml.fit(Y)

passed = array_equal(clustering_sk.labels_, clustering_cuml.labels_)
message = 'compare dbscan: cuml vs sklearn labels_ %s' % ('equal' if passed else 'NOT equal')
print(message)
```
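For reference, a minimal usage sketch of the Timer context manager defined above (the posted script times the cells with the %%time magic instead):

```python
# Hypothetical usage of the Timer defined above.
with Timer() as t:
    clustering_cuml = cumlDBSCAN(eps=eps, min_samples=min_samples)
    clustering_cuml.fit(Y)
print("cuML DBSCAN fit took {:.2f} s".format(t.interval))
```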
My dataset:
https://drive.google.com/file/d/1Gap8_O7rralnN_TvO6WpU6ijXL0-ZtAd/view?usp=sharing

The CPU run takes 7.5 s, but the GPU run takes 65 s.
Could you please help me find the problem?

Thank you.

@dantegd
Member

dantegd commented Nov 30, 2018

Thanks for the update @michaelkingdom. I managed to reproduce the situation on my workstation with a newer version of cuML (less of a difference since I'm running on a GV100, but still slower than the CPU), so we're currently looking into it.

@dantegd dantegd added the "? - Needs Triage: Need team to review and classify" label and removed the "? - Needs Triage" and "bug: Something isn't working" labels Nov 30, 2018
@dantegd
Member

dantegd commented Nov 30, 2018

@michaelkingdom on the first run I missed that you are using the default dbscan algorithm in sklearn (which means it is using algorithm='auto', if I'm not mistaken). The equivalent of cuML's implementation is algorithm='brute'; with your example on my workstation I get the following times:

CPU auto: 9 seconds
GPU brute: 48 seconds
CPU brute: 238 seconds

So the code is working for the current implementation, though improvements are being worked on actively.
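For a like-for-like benchmark, the three configurations timed above would look roughly like this. A sketch, assuming the `X` pandas frame and `Y` cudf frame from the script posted earlier; the timings in the comments are the ones reported above for that particular workstation:

```python
# CPU, default neighborhood search (algorithm='auto'): ~9 s reported above
sk_auto = skDBSCAN(eps=eps, min_samples=min_samples, algorithm='auto')
sk_auto.fit(X)

# CPU, brute-force neighborhood search: ~238 s reported above
sk_brute = skDBSCAN(eps=eps, min_samples=min_samples, algorithm='brute')
sk_brute.fit(X)

# GPU: cuML's implementation corresponds to the brute-force search: ~48 s reported above
cu = cumlDBSCAN(eps=eps, min_samples=min_samples)
cu.fit(Y)
```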

@dantegd
Member

dantegd commented Dec 2, 2018

Since this issue seems to be solved now and the time difference is normal behavior for the currently implemented algorithm, I'll close the issue. If any other issues arise, please submit an additional issue. Thanks @michaelkingdom!

@dantegd dantegd closed this as completed Dec 2, 2018
@michaelkingdom
Author

@dantegd Thank you for your great help!

@michaelkingdom
Author

@dantegd This issue happens again in cuml 0.5.1.
cumlDBSCAN crashes when the data set grows to 50,000 rows using my data set (https://drive.google.com/file/d/1Gap8_O7rralnN_TvO6WpU6ijXL0-ZtAd/view?usp=sharing).
Could you please help figure it out?

@dantegd dantegd reopened this Feb 14, 2019
@dantegd
Member

dantegd commented Feb 14, 2019

@teju85 I think you mentioned that this might be related to an issue in the distance calculation; you might have insights about this issue.

@cjnolet
Member

cjnolet commented Feb 14, 2019

I believe this is fixed in PR #211.

@michaelkingdom
Author

@cjnolet @dantegd I tried the demo code in cuml 0.5.1 and it works, but it fails with my data set.
Could you please try my data set?
https://drive.google.com/file/d/1Gap8_O7rralnN_TvO6WpU6ijXL0-ZtAd/view?usp=sharing

@cjnolet
Member

cjnolet commented Feb 14, 2019

@michaelkingdom, PR #221 will be included in release 0.6.0. We are currently considering whether there should be a 0.5.2 that includes it.

Here's my output from your data (I'm using a GV100):

```python
%%time
X = pd.read_csv("/home/cjnolet/workspace/notebooks/cuml/data/mr_clean.csv", skiprows=1, dtype=np.float32).fillna(0.0)
X = cudf.DataFrame.from_pandas(X)
# CPU times: user 2.95 s, sys: 90 ms, total: 3.04 s
# Wall time: 764 ms

X
# <cudf.DataFrame ncols=48 nrows=182822 >

%%time
clustering_cuml = cumlDBSCAN(eps=eps, min_samples=min_samples)
clustering_cuml.fit(X)
# n_rows: 182822
# CPU times: user 10.2 s, sys: 6.22 s, total: 16.4 s
# Wall time: 16.4 s
```

This PR will likely be merged today. I know this issue has been open for a couple of months now; however, you do have the option of building from source if you would like to use the patch before it is included in a release.

@cjnolet
Member

cjnolet commented Feb 14, 2019

Regarding the differences in the labels: we have noticed that the results of sklearn and cuml are affected differently by even slight variations of the eps parameter. On toy datasets, like the embedded circles, we were able to reproduce (near) exact matches to sklearn by scaling the eps parameter by ±10%.

We are working to figure out where and why this happens and, more importantly, which one (sklearn or cuml), if either, is the absolute ground truth. Even though dbscan is deterministic, there are places in both algorithms where very small decisions (e.g. the order in which elements are presented for labeling) can affect the resulting labels. We have some hypotheses about why this behavior happens and really appreciate your bearing with us through this process.

For now, when doing a direct comparison, it may make more sense to look at the densities of the resulting labels and compare the proportion of inputs that both algorithms place in the same cluster, rather than requiring exact label matches; a sketch of such a comparison follows.
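A minimal sketch of such a permutation-invariant comparison, assuming `clustering_sk`, `clustering_cuml`, and the `to_nparray` helper from the script posted earlier in this thread; using the adjusted Rand index as the agreement metric is an illustration, not part of the original comment:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Labels from the two fitted models (objects assumed from the earlier script).
sk_labels = np.asarray(clustering_sk.labels_)
cu_labels = to_nparray(clustering_cuml.labels_).ravel()

# Cluster-size ("density") distributions, ignoring noise points labeled -1.
print("sklearn cluster sizes:", np.unique(sk_labels[sk_labels >= 0], return_counts=True)[1])
print("cuml cluster sizes:   ", np.unique(cu_labels[cu_labels >= 0], return_counts=True)[1])

# Agreement on which points end up clustered together, independent of label values.
print("adjusted Rand index:", adjusted_rand_score(sk_labels, cu_labels))
```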

@michaelkingdom
Author

@cjnolet Thank you for your great feedback.
I'll try to build from source later.
