
DBSCAN seems not to use multiple processors (n_jobs argument ignored) #8003

Closed
pfaucon opened this issue Dec 7, 2016 · 9 comments
Comments

@pfaucon pfaucon commented Dec 7, 2016

Description

DBSCAN seems not to use multiple processors (the n_jobs argument is ignored).
It looks like DBSCAN hands its arguments off to NearestNeighbors, but NearestNeighbors only uses n_jobs for certain algorithm choices (presumably not the ones DBSCAN selects by default). It would be good to document how to choose the inputs so that the n_jobs parameter takes effect, and possibly to change the default values to make it useful.
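
A minimal sketch of this at the NearestNeighbors level (not part of the original report; it assumes that only the brute-force backend parallelizes the radius query that DBSCAN performs, which the discussion below confirms):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).rand(10000, 2)

# Tree-based search, which algorithm="auto" tends to pick for
# low-dimensional dense data: n_jobs=-1 is accepted but the radius
# query runs on a single core.
tree_nn = NearestNeighbors(radius=0.3, algorithm="kd_tree", n_jobs=-1).fit(X)
tree_nn.radius_neighbors(X, return_distance=False)

# Brute-force search: the pairwise-distance computation is split across
# cores, at the cost of a large in-memory distance matrix.
brute_nn = NearestNeighbors(radius=0.3, algorithm="brute", n_jobs=-1).fit(X)
brute_nn.radius_neighbors(X, return_distance=False)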

Steps/Code to Reproduce

Code taken from:
http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler

centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=1000000, centers=centers, cluster_std=0.4,
                            random_state=0)

X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=10, n_jobs=-1).fit(X)

Expected Results

The answer is correct, but the work should be split across processors and the time consumed should be significantly lower.

Actual Results

It seems to run on only one processor.

Versions

import platform; print(platform.platform())
Linux-3.13.0-101-generic-x86_64-with-Ubuntu-14.04-trusty
import sys; print("Python", sys.version)
Python 3.4.3 (default, Nov 17 2016, 01:08:31)
[GCC 4.8.4]
import numpy; print("NumPy", numpy.__version__)
NumPy 1.11.2
import scipy; print("SciPy", scipy.__version__)
SciPy 0.18.1
import sklearn; print("Scikit-Learn", sklearn.__version__)
Scikit-Learn 0.18.1


@amueller amueller commented Dec 7, 2016

You can set algorithm="brute" to use multiple cores, but that will probably make it slower. The neighbors module decides it wants to use a tree, which we haven't parallelized yet.

How many cores do you have? And can you report times for the default setting and for algorithm="brute"?
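
A minimal timing sketch for that comparison (not from the thread; it uses a much smaller sample than the original report because the brute-force path builds an n_samples x n_samples distance matrix in memory):

import time

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

centers = [[1, 1], [-1, -1], [1, -1]]
# 10000 samples instead of the 1000000 in the report: brute force needs the
# full pairwise distance matrix, which would not fit at the original size.
X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.4,
                  random_state=0)
X = StandardScaler().fit_transform(X)

for algorithm in ("auto", "brute"):  # "auto" resolves to a kd-tree here
    start = time.time()
    DBSCAN(eps=0.3, min_samples=10, algorithm=algorithm, n_jobs=-1).fit(X)
    print("algorithm=%r: %.1fs" % (algorithm, time.time() - start))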


@jnothman jnothman commented Dec 8, 2016

@pfaucon, a clarification in the documentation is welcome. Please submit a pull request. Otherwise I'm closing this as something we can't do much about.

@jnothman jnothman closed this Dec 8, 2016

@jnothman jnothman commented Dec 8, 2016

Actually, as you seem to be requesting documentation changes, I'll leave it open and you or someone else can contribute a fix.


@Don86 Don86 commented Dec 12, 2016

Hi, I'm new to scikit-learn, but I'd like to contribute to this.


@jnothman jnothman commented Dec 12, 2016

Go ahead


@kushagraagrawal kushagraagrawal commented Dec 16, 2016

Is this issue still open? I'm new to scikit-learn and would like to try.


@recamshak recamshak mentioned this issue Mar 28, 2018

@sp7412 sp7412 commented Jan 29, 2020

Just wondering if there's any update on this issue.
