dbscan crashed when the data set grows large #31
Comments
Thank you. Others have experienced this and we're actively debugging the issue.
@michaelkingdom is this still the case after the recent PRs that included bug fixes for dbscan? In particular, this got recently merged and potentially affects this bug:
@dantegd, we tried the new dbscan file.
@dantegd any update on this issue?
@michaelkingdom I believe the latest merged PR #35, which reworked dbscan, should alleviate the issues; at least it did so for me and a few others. Let us know if that is the case; if not, we'll dig further. Thanks!
@dantegd Yes, the crash issue is fixed. But the GPU speed is slow, could you please help to figure it out?

from timeit import default_timer
class Timer(object):
    ...
import gzip

##LOAD DATA 1 TIME
from sklearn.metrics import mean_squared_error
def to_nparray(x):
    ...

##Run tests
%%time
...
%%time
...
%%time
passed = array_equal(clustering_sk.labels_, clustering_cuml.labels_)

CPU cost 7.5s, but GPU cost 65s. Thank you.
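[Editor's note: a minimal sketch of the benchmark this comment describes. The bodies of Timer and to_nparray were lost in the original post, so the versions below are assumed common patterns, not the commenter's exact code; load_data() is the loader from the issue body further down.]

# Hypothetical reconstruction of the CPU-vs-GPU benchmark; Timer and
# to_nparray are assumed implementations, not the original code.
from timeit import default_timer
import numpy as np
from numpy import array_equal
from sklearn.cluster import DBSCAN as skDBSCAN
from cuML import DBSCAN as cumlDBSCAN
import pygdf

class Timer(object):
    """Context manager that prints elapsed wall-clock time (assumed helper)."""
    def __init__(self, name):
        self.name = name
    def __enter__(self):
        self.start = default_timer()
        return self
    def __exit__(self, *exc):
        print('%s took %.1fs' % (self.name, default_timer() - self.start))

def to_nparray(x):
    # Assumed helper: bring labels back to a host numpy array for comparison.
    return x if isinstance(x, np.ndarray) else np.asarray(x.to_array())

X = load_data()  # same loader as in the issue body

with Timer('sklearn (CPU)'):
    clustering_sk = skDBSCAN(eps=0.5, min_samples=3).fit(X)

Y = pygdf.DataFrame.from_pandas(X)
with Timer('cuML (GPU)'):
    clustering_cuml = cumlDBSCAN(eps=0.5, min_samples=3).fit(Y)

passed = array_equal(to_nparray(clustering_sk.labels_),
                     to_nparray(clustering_cuml.labels_))
print('labels match:', passed)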
Thanks for the update @michaelkingdom. I managed to reproduce the situation on my workstation with a newer version of cuML (less of a difference since I'm running on a GV100, but still slower than CPU), so we're currently looking into it.
@michaelkingdom at first run I missed that you are using the default dbscan algorithm in sklearn (which means it is running on the CPU). So the code is working as expected for the current implementation, though improvements are being actively worked on.
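[Editor's note: to make the CPU-only point concrete, here is a small illustration not from the original thread. sklearn's DBSCAN has no GPU path; its algorithm kwarg only selects the neighbor-search strategy, which is what drives the CPU timing. X is the data frame from the issue body.]

# sklearn's DBSCAN always runs on the CPU; the `algorithm` kwarg only picks
# the neighbor-search backend, which can change runtime considerably.
from timeit import default_timer
from sklearn.cluster import DBSCAN

for algo in ('auto', 'ball_tree', 'kd_tree', 'brute'):
    start = default_timer()
    DBSCAN(eps=0.5, min_samples=3, algorithm=algo).fit(X)
    print('%s: %.1fs' % (algo, default_timer() - start))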
Since this issue seems to be solved now and the time difference is actually normal behavior due to the currently implemented algorithm, I'll close the issue. If any other issues arise, please submit an additional issue. Thanks @michaelkingdom!
@dantegd Thank you for your great help!
@dantegd This issue happens again in cuml 0.5.1.
@teju85 I think you had mentioned that this might be related to an issue in the distance calculation? You might have insights about this issue.
I believe this is fixed in PR #211.
@cjnolet @dantegd I tried the demo code in cuml 0.5.1 and it works, but it fails on my data set.
@michaelkingdom, PR #221 will be included in release 0.6.0. We are currently considering whether or not there should be a 0.5.2 that includes it. Here's my output from your data (I'm using a GV100):
This PR will likely be merged in today. I know this issue has been open for a couple of months now; however, you do have the option of building from source if you would like to use the patch before it is included in a release.
In regards to the differences in the labels, we have noticed the results of sklearn and cuml are affected differently by only slight variations of the eps parameter. On toy datasets, like the embedded circles, we were able to reproduce (near) exact matches to sklearn by scaling the eps parameter by ±10%. We are working to figure out where and why this happens and, more importantly, which one (sklearn or cuml) is the absolute ground truth, if either. Even though dbscan is deterministic, there are places in both algorithms where very small decisions (e.g. the order of elements presented for labeling) can affect the resulting labels. We have some hypotheses about why this behavior is happening and really appreciate your bearing with us through this process. For now, when doing a direct comparison, it might make more sense to look at the densities of the resulting labels and compare the proportion of inputs that were placed in the same cluster by both algorithms, rather than expecting exact matches.
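[Editor's note: a sketch of one way to do that comparison, using sklearn's adjusted Rand index, which is invariant to label permutation; the to_nparray helper is the assumed one from the earlier sketch, and clustering_sk / clustering_cuml are the fitted models from the benchmark.]

# Compare cluster memberships rather than raw label values: the adjusted
# Rand index ignores label permutations, and the per-cluster densities show
# what proportion of points each algorithm grouped together.
import numpy as np
from sklearn.metrics import adjusted_rand_score

labels_sk = to_nparray(clustering_sk.labels_)
labels_cuml = to_nparray(clustering_cuml.labels_)

print('adjusted Rand index:', adjusted_rand_score(labels_sk, labels_cuml))

for name, labels in (('sklearn', labels_sk), ('cuml', labels_cuml)):
    ids, counts = np.unique(labels, return_counts=True)
    density = dict(zip(ids.tolist(), np.round(counts / counts.sum(), 3).tolist()))
    print(name, density)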
@cjnolet Thank you for your great feedback.
GPU used: 1u Tesla P40;
OS and version: Ubuntu 18.04;
CUDA version: 9.2;
Driver: 410.48;
gcc version: 7.3;
python version: 3.5
I have a data set that contains 180,914 rows and 48 columns; most values are integers (from 0 to 10,000).
I convert this full data frame to a float64 data type.
It works well when we use the sklearn lib (CPU) to run.
When I tried to run the DBSCAN lib in cuML, it crashed with no response at all, and I had to restart the whole kernel.
Then I reduced the rows in our data set from 180K to 10K: it works, but very slowly, taking about 3s; 20K rows take 6s and 30K rows take 9s, and it crashes when the data reaches 70K rows (see the scaling sketch after the code below).
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN as skDBSCAN
from sklearn.datasets.samples_generator import make_blobs
from cuML import DBSCAN as cumlDBSCAN
import pygdf
import os
import dask
#import dask_gdf
from dask.delayed import delayed
from dask.distributed import Client, wait
from pygdf.dataframe import DataFrame
from collections import OrderedDict
from glob import glob
from sklearn.cluster import KMeans
import re
from itertools import cycle
from sklearn.preprocessing import StandardScaler
##LOAD DATA
X = load_data()
eps = 0.5
min_samples = 3
# run using sklearn lib
clustering_sk = skDBSCAN(eps = eps, min_samples = min_samples)
clustering_sk.fit(X)
# run using cuML lib
Y = pygdf.DataFrame.from_pandas(X)
clustering_cuml = cumlDBSCAN(eps = eps, min_samples = min_samples)
Z = Y.head(70000)
clustering_cuml.fit(Z)
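[Editor's note: a sketch of the scaling measurement described above, built from the snippet's own names (X, eps, min_samples, cumlDBSCAN); timings will vary by hardware.]

# Time cuML DBSCAN on growing subsets, reproducing the 10K/20K/30K/70K
# progression reported above.
from timeit import default_timer

Y = pygdf.DataFrame.from_pandas(X)
for n in (10000, 20000, 30000, 70000):
    Z = Y.head(n)
    start = default_timer()
    cumlDBSCAN(eps=eps, min_samples=min_samples).fit(Z)
    print('%d rows: %.1fs' % (n, default_timer() - start))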