dbscan crashes when the data set grows large #31

Closed
michaelkingdom opened this issue Nov 11, 2018 · 18 comments
Comments

@michaelkingdom

GPU used: 1u Tesla P40;
OS and version: Ubuntu 18.04;
CUDA version: 9.2;
Driver: 410.48;
gcc version: 7.3;
python version: 3.5

I have a data set that contains 180,914 rows and 48 columns; most values are integers (from 0 to 10,000):
(screenshot of the data set omitted)

I convert the full data frame to float64.
It works well when we run it with the sklearn library (CPU).
When I tried the DBSCAN implementation in cuML, it crashed with no response at all, and I had to restart the whole kernel.
Then I reduced the rows in our data set from 180K to 10K: that works, but it is slow, taking about 3 s; 20K rows takes 6 s, 30K rows takes 9 s, and it crashes when the data reaches 70K rows.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN as skDBSCAN
from sklearn.datasets.samples_generator import make_blobs
from cuML import DBSCAN as cumlDBSCAN
import pygdf
from pygdf.dataframe import DataFrame
import os
import dask
# import dask_gdf
from dask.delayed import delayed
from dask.distributed import Client, wait
from collections import OrderedDict
from glob import glob
from sklearn.cluster import KMeans
import re
from itertools import cycle
from sklearn.preprocessing import StandardScaler

## LOAD DATA
X = load_data()

eps = 0.5
min_samples = 3

# run using the sklearn lib
clustering_sk = skDBSCAN(eps=eps, min_samples=min_samples)
clustering_sk.fit(X)

# run using the cuML lib
Y = pygdf.DataFrame.from_pandas(X)

clustering_cuml = cumlDBSCAN(eps=eps, min_samples=min_samples)
Z = Y.head(70000)
clustering_cuml.fit(Z)
```
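The scaling behavior described above (about 3 s at 10K rows, 6 s at 20K, 9 s at 30K, crash around 70K) can be reproduced with a small loop over row counts. A minimal sketch, assuming the imports, `X`, `eps`, and `min_samples` from the snippet above:

```python
import time

# Time cuML DBSCAN on progressively larger slices of the data set to find
# where the slowdown or crash begins.
for n_rows in (10000, 20000, 30000, 70000):
    Z = pygdf.DataFrame.from_pandas(X.head(n_rows))
    model = cumlDBSCAN(eps=eps, min_samples=min_samples)
    t0 = time.time()
    model.fit(Z)
    print("{} rows: {:.1f} s".format(n_rows, time.time() - t0))
```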

@datametrician

Thank you. Others have experienced this and we're actively debugging the issue.

@dantegd
Member

dantegd commented Nov 14, 2018

@michaelkingdom is this still the case after the recent PRs that included bug fixes for dbscan?

In particular, these were recently merged and potentially affect this bug:
#33 and
#30

@dantegd dantegd added the "bug: Something isn't working" label Nov 14, 2018
@michaelkingdom
Author

@dantegd, we tried the new dbscan file.
It no longer crashes, but the speed is slow and the result is NOT equal to the sklearn DBSCAN result.
Should we file another issue for the speed and result problems?

@datametrician

@dantegd any update on this issue?

@datametrician

#30
#34

@dantegd
Member

dantegd commented Nov 16, 2018

@michaelkingdom I believe the latest merged PR #35, which reworked dbscan, should alleviate the issues; at least it did for me and a few others. Let us know if that is the case; if not, we'll dig further. Thanks!

@michaelkingdom
Author

@dantegd Yes, the crash issue is fixed, but the GPU speed is slow. Could you please help me figure it out?
My code is below:
```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN as skDBSCAN
from cuml import DBSCAN as cumlDBSCAN
import cudf
from cudf.dataframe import DataFrame

from timeit import default_timer

class Timer(object):
    def __init__(self):
        self._timer = default_timer

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, *args):
        self.stop()

    def start(self):
        """Start the timer."""
        self.start_time = self._timer()

    def stop(self):
        """Stop the timer and compute the interval in seconds."""
        self.end_time = self._timer()
        self.interval = self.end_time - self.start_time

import gzip

def load_data(cached='/datasets/mr_clean.xlsx'):
    mr_data = pd.read_excel(cached, header=0)
    mr_cols = mr_data.columns
    print('data.columns = \n', mr_data.columns)
    mr_len = mr_data.iloc[:, 0].size
    print("mr length:", mr_len)

    # Fill missing values with the column means and cast to float64.
    X = mr_data.fillna(mr_data.mean())
    X = X.astype(np.float64)

    print("test data info:", mr_data.head())

    return X

## LOAD DATA 1 TIME
X = load_data()

from sklearn.metrics import mean_squared_error

def array_equal(a, b, threshold=5e-3, with_sign=True):
    a = to_nparray(a)
    b = to_nparray(b)
    if with_sign == False:
        a, b = np.abs(a), np.abs(b)
    res = mean_squared_error(a, b) < threshold
    return res

def to_nparray(x):
    if isinstance(x, np.ndarray) or isinstance(x, pd.DataFrame):
        return np.array(x)
    elif isinstance(x, np.float64):
        return np.array([x])
    elif isinstance(x, cudf.DataFrame) or isinstance(x, cudf.Series):
        return x.to_pandas().values
    return x

## Run tests
eps = 3
min_samples = 3

%%time
clustering_sk = skDBSCAN(eps=eps, min_samples=min_samples)
clustering_sk.fit(X)

%%time
Y = cudf.DataFrame.from_pandas(X)

%%time
clustering_cuml = cumlDBSCAN(eps=eps, min_samples=min_samples)
clustering_cuml.fit(Y)

passed = array_equal(clustering_sk.labels_, clustering_cuml.labels_)
message = 'compare dbscan: cuml vs sklearn labels_ %s' % ('equal' if passed else 'NOT equal')
print(message)
```
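For reference, a minimal usage sketch of the Timer context manager defined above (the posted script times the cells with the %%time magic instead):

```python
# Hypothetical usage of the Timer defined above.
with Timer() as t:
    clustering_cuml = cumlDBSCAN(eps=eps, min_samples=min_samples)
    clustering_cuml.fit(Y)
print("cuML DBSCAN fit took {:.2f} s".format(t.interval))
```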
My dataset:
https://drive.google.com/file/d/1Gap8_O7rralnN_TvO6WpU6ijXL0-ZtAd/view?usp=sharing

The CPU run takes 7.5 s, but the GPU run takes 65 s.
Could you please help me find the problem?

Thank you.

@dantegd
Member

dantegd commented Nov 30, 2018

Thanks for the update @michaelkingdom. I managed to reproduce the situation on my workstation with a newer version of cuML (less of a difference since I'm running on a GV100, but still slower than the CPU), so we're currently looking into it.

@dantegd dantegd added the "? - Needs Triage: Need team to review and classify" label and removed the "? - Needs Triage" and "bug: Something isn't working" labels Nov 30, 2018
@dantegd
Member

dantegd commented Nov 30, 2018

@michaelkingdom on the first run I missed that you are using the default dbscan algorithm in sklearn (which means it is using algorithm='auto', if I'm not mistaken). The equivalent of cuML's implementation is algorithm='brute'; with your example on my workstation I get the following times:

CPU auto: 9 seconds
GPU brute: 48 seconds
CPU brute: 238 seconds

So the code is working for the current implementation, though improvements are being worked on actively.
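For a like-for-like benchmark, the three configurations timed above would look roughly like this. A sketch, assuming the `X` pandas frame and `Y` cudf frame from the script posted earlier; the timings in the comments are the ones reported above for that particular workstation:

```python
# CPU, default neighborhood search (algorithm='auto'): ~9 s reported above
sk_auto = skDBSCAN(eps=eps, min_samples=min_samples, algorithm='auto')
sk_auto.fit(X)

# CPU, brute-force neighborhood search: ~238 s reported above
sk_brute = skDBSCAN(eps=eps, min_samples=min_samples, algorithm='brute')
sk_brute.fit(X)

# GPU: cuML's implementation corresponds to the brute-force search: ~48 s reported above
cu = cumlDBSCAN(eps=eps, min_samples=min_samples)
cu.fit(Y)
```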

@dantegd
Member

dantegd commented Dec 2, 2018

Since this issue seems to be solved now and the time difference is normal behavior for the currently implemented algorithm, I'll close the issue. If any other issues arise, please submit an additional issue. Thanks @michaelkingdom!

@dantegd dantegd closed this as completed Dec 2, 2018
@michaelkingdom
Author

@dantegd Thank you for your great help!

@michaelkingdom
Author

@dantegd This issue happens again in cuml 0.5.1.
cumlDBSCAN crashes when the data set grows to 50,000 rows using my data set (https://drive.google.com/file/d/1Gap8_O7rralnN_TvO6WpU6ijXL0-ZtAd/view?usp=sharing).
Could you please help figure it out?

@dantegd dantegd reopened this Feb 14, 2019
@dantegd
Member

dantegd commented Feb 14, 2019

@teju85 I think you mentioned that this might be related to an issue in the distance calculation; you might have insights about this issue.

@cjnolet
Member

cjnolet commented Feb 14, 2019

I believe this is fixed in PR #211.

@michaelkingdom
Author

@cjnolet @dantegd I tried the demo code in cuml 0.5.1 and it works, but it fails with my data set.
Could you please try my data set?
https://drive.google.com/file/d/1Gap8_O7rralnN_TvO6WpU6ijXL0-ZtAd/view?usp=sharing

@cjnolet
Member

cjnolet commented Feb 14, 2019

@michaelkingdom, PR #221 will be included in release 0.6.0. We are currently considering whether there should be a 0.5.2 that includes it.

Here's my output from your data (I'm using a GV100):

```python
%%time
X = pd.read_csv("/home/cjnolet/workspace/notebooks/cuml/data/mr_clean.csv", skiprows=1, dtype=np.float32).fillna(0.0)
X = cudf.DataFrame.from_pandas(X)
# CPU times: user 2.95 s, sys: 90 ms, total: 3.04 s
# Wall time: 764 ms

X
# <cudf.DataFrame ncols=48 nrows=182822 >

%%time
clustering_cuml = cumlDBSCAN(eps=eps, min_samples=min_samples)
clustering_cuml.fit(X)
# n_rows: 182822
# CPU times: user 10.2 s, sys: 6.22 s, total: 16.4 s
# Wall time: 16.4 s
```

This PR will likely be merged today. I know this issue has been open for a couple of months now; however, you do have the option of building from source if you would like to use the patch before it is included in a release.

@cjnolet
Member

cjnolet commented Feb 14, 2019

Regarding the differences in the labels: we have noticed that the results of sklearn and cuml are affected differently by even slight variations of the eps parameter. On toy datasets, like the embedded circles, we were able to reproduce (near) exact matches to sklearn by scaling the eps parameter by ±10%.

We are working to figure out where and why this happens and, more importantly, which one (sklearn or cuml), if either, is the absolute ground truth. Even though dbscan is deterministic, there are places in both algorithms where very small decisions (e.g. the order in which elements are presented for labeling) can affect the resulting labels. We have some hypotheses about why this behavior happens and really appreciate your bearing with us through this process.

For now, when doing a direct comparison, it may make more sense to look at the densities of the resulting labels and compare the proportion of inputs that both algorithms place in the same cluster, rather than requiring exact label matches; a sketch of such a comparison follows.
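A minimal sketch of such a permutation-invariant comparison, assuming `clustering_sk`, `clustering_cuml`, and the `to_nparray` helper from the script posted earlier in this thread; using the adjusted Rand index as the agreement metric is an illustration, not part of the original comment:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Labels from the two fitted models (objects assumed from the earlier script).
sk_labels = np.asarray(clustering_sk.labels_)
cu_labels = to_nparray(clustering_cuml.labels_).ravel()

# Cluster-size ("density") distributions, ignoring noise points labeled -1.
print("sklearn cluster sizes:", np.unique(sk_labels[sk_labels >= 0], return_counts=True)[1])
print("cuml cluster sizes:   ", np.unique(cu_labels[cu_labels >= 0], return_counts=True)[1])

# Agreement on which points end up clustered together, independent of label values.
print("adjusted Rand index:", adjusted_rand_score(sk_labels, cu_labels))
```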

@michaelkingdom
Author

@cjnolet Thank you for your great feedback.
I'll try to build from source later.
