DBSCAN seems not to use multiple processors (n_jobs argument ignored) #8003

Closed

pfaucon opened this Issue Dec 7, 2016 · 8 comments

pfaucon commented Dec 7, 2016

Description

DBSCAN seems not to use multiple processors (the n_jobs argument is ignored).
It looks like DBSCAN hands its arguments off to NearestNeighbors, but NearestNeighbors only honours n_jobs for certain neighbor-search algorithms (presumably not the one DBSCAN selects by default). It would be good to document how to choose parameters so that n_jobs actually takes effect, and possibly to change the defaults so that it is useful.
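
For illustration, a rough sketch of the hand-off being described; the kd_tree choice and the sample size are assumptions for demonstration, not the exact internal call, but DBSCAN delegates its neighborhood queries to a NearestNeighbors model along these lines:

from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=10000, centers=3, random_state=0)

# DBSCAN forwards its n_jobs value to a NearestNeighbors model; with a
# tree-based algorithm the radius queries still run on a single core.
nn = NearestNeighbors(radius=0.3, algorithm="kd_tree", n_jobs=-1).fit(X)
neighborhoods = nn.radius_neighbors(X, return_distance=False)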

Steps/Code to Reproduce

code taken from:
http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler

# Three Gaussian blobs, one million points in total.
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=1000000, centers=centers, cluster_std=0.4,
                            random_state=0)

X = StandardScaler().fit_transform(X)

# n_jobs=-1 should use all available cores, but the run stays on one processor.
db = DBSCAN(eps=0.3, min_samples=10, n_jobs=-1).fit(X)

Expected Results

The answer is correct, but the work should be split across processors and the time consumed should be significantly lower.

Actual Results

The computation appears to run on only one processor.

Versions

import platform; print(platform.platform())
Linux-3.13.0-101-generic-x86_64-with-Ubuntu-14.04-trusty
import sys; print("Python", sys.version)
Python 3.4.3 (default, Nov 17 2016, 01:08:31)
[GCC 4.8.4]
import numpy; print("NumPy", numpy.__version__)
NumPy 1.11.2
import scipy; print("SciPy", scipy.__version__)
SciPy 0.18.1
import sklearn; print("Scikit-Learn", sklearn.__version__)
Scikit-Learn 0.18.1

amueller (Member) commented Dec 7, 2016

You can set algorithm="brute" to use multiple cores, but that will probably make it slower. The neighbors module decides it wants to use a tree, which we haven't parallelized yet.

How many cores do you have? And can you report times for the default setting and for algorithm="brute"?
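
For anyone who wants to run the comparison, a rough timing sketch (the reduced sample size is an assumption so the brute-force distance matrix fits in memory; absolute times will depend on the machine):

import time

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Same blob layout as the report above, but fewer points.
X, _ = make_blobs(n_samples=10000, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

for algorithm in ("auto", "brute"):
    start = time.time()
    DBSCAN(eps=0.3, min_samples=10, algorithm=algorithm, n_jobs=-1).fit(X)
    print("algorithm=%r took %.1f s" % (algorithm, time.time() - start))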

jnothman (Member) commented Dec 8, 2016

@pfaucon, a clarification in the documentation is welcome. Please submit a pull request. Otherwise I'm closing this as something we can't do much about.

jnothman closed this Dec 8, 2016

jnothman (Member) commented Dec 8, 2016

Actually, as you seem to be requesting documentation changes, I'll leave it open and you or someone else can contribute a fix.

Don86 (Contributor) commented Dec 12, 2016

Hi, I'm new to scikit-learn, but I'd like to contribute to this.

jnothman (Member) commented Dec 12, 2016

Go ahead

kushagraagrawal commented Dec 16, 2016

Is this issue still open? I'm new to scikit-learn and would like to try.

amueller (Member) commented Dec 16, 2016

recamshak referenced this issue Mar 28, 2018: [MRG+1] Parallel radius neighbors #10887 (Merged)