Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Add DBSCAN support for additional metric params #8139

Merged
merged 2 commits into from Jan 2, 2017

Conversation

@naoyak
Copy link
Contributor

@naoyak naoyak commented Jan 1, 2017

Fixes #4520

  • Added support for the additional metrics_params argument in DBSCAN (i.e. passing {'p': p} when using 'minkowski' for the metric)
  • Added a test to confirm that this works
@naoyak naoyak changed the title Add DBSCAN support for additional metric params [MRG] Add DBSCAN support for additional metric params Jan 1, 2017
@naoyak naoyak force-pushed the naoyak:dbscan-metric-params branch from f8a83fe to 5cd2357 Jan 1, 2017
Copy link
Member

@jnothman jnothman left a comment

Thanks for the contribution

@@ -50,6 +50,11 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
must be square. X may be a sparse matrix, in which case only "nonzero"
elements may be considered neighbors for DBSCAN.
metric_params : dict, optional (default = None)

This comment has been minimized.

@jnothman

jnothman Jan 1, 2017
Member

I think "optional" is sufficient when the semantics of the default are not clear.

@@ -184,6 +190,11 @@ class DBSCAN(BaseEstimator, ClusterMixin):
.. versionadded:: 0.17
metric *precomputed* to accept precomputed sparse matrix.
metric_params : dict, optional (default = None)

This comment has been minimized.

@jnothman

jnothman Jan 1, 2017
Member

I think "optional" is sufficient when the semantics of the default are not clear.

# Different eps to other test, because distance is not normalised.
eps = 0.8
min_samples = 10
p = 2

This comment has been minimized.

@jnothman

jnothman Jan 1, 2017
Member

this isn't a very strong test as p=2 is the default for minkowski in our nearest neighbors implementation. trying this for p=1 would be more demonstrative. Or both.

This comment has been minimized.

@naoyak

naoyak Jan 1, 2017
Author Contributor

Changing to p=1.


# number of clusters, ignoring noise if present
n_clusters_1 = len(set(labels)) - int(-1 in labels)
assert_equal(n_clusters_1, n_clusters)

This comment has been minimized.

@jnothman

jnothman Jan 1, 2017
Member

I would've thought that the complete set of labels were identical. Why are we using as weak an assertion as this?

This comment has been minimized.

@naoyak

naoyak Jan 1, 2017
Author Contributor

That struck me as odd too until I just read up on DBSCAN and its implementation:

The algorithm is non-deterministic, but the core samples will always belong to the same clusters (although the labels may be different). The non-determinism comes from deciding to which cluster a non-core sample belongs. A non-core sample can have a distance lower than eps to two core samples in different clusters. By the triangular inequality, those two core samples must be more distant than eps from each other, or they would be in the same cluster. The non-core sample is assigned to whichever cluster is generated first, where the order is determined randomly. Other than the ordering of the dataset, the algorithm is deterministic, making the results relatively stable between runs on the same data.

So if you run the algorithm with the same parameters twice, it can differ in two ways:

  • The cluster labels themselves (meaning cluster 1 can be cluster 2 when run again, I think)
  • The non-core samples (which are included here in labels returned by dbscan()

Even then I agree that the tests for DBSCAN seem pretty weak. It should probably at least test whether the core samples end up in the same clusterings, though you have to match the cluster labels between runs.

This comment has been minimized.

@naoyak

naoyak Jan 1, 2017
Author Contributor

I'm not sure that I'd be the best person to undertake a refactoring of the DBSCAN tests, though. If they aren't up to par it might be best addressed in a separate issue.

This comment has been minimized.

@jnothman

jnothman Jan 1, 2017
Member

For a given random state the labels should be deterministic. The only difference here might be numerical precision errors, but they shouldn't be an issue, I'd think.

This comment has been minimized.

@jnothman

jnothman Jan 1, 2017
Member

You can start by making this one better.

This comment has been minimized.

@jnothman

jnothman Jan 1, 2017
Member

Note that that blurb on determinism has been changed in master: http://scikit-learn.org/dev/modules/clustering.html#dbscan

This comment has been minimized.

@jnothman

jnothman Jan 1, 2017
Member

And apologies: no random_state involved.

This comment has been minimized.

@naoyak

naoyak Jan 1, 2017
Author Contributor

Oh wow, the new explanation is much stronger (did the implementation change?).

You're right, the labels (as well as the core sample subset) are deterministic - I rewrote the test to validate this in the latest commit. Maybe at some point I can get around to refactoring some of the other tests.

This comment has been minimized.

@jnothman

jnothman Jan 1, 2017
Member

Oh wow, the new explanation is much stronger (did the implementation change?).

The implementation didn't change. @jbednar, who was frustrated at the previous description's inaccuracy, is to thank.

@naoyak
Copy link
Contributor Author

@naoyak naoyak commented Jan 1, 2017

Updated the new test to hopefully be a bit more rigorous.

Copy link
Member

@jnothman jnothman left a comment

LGTM

@jnothman jnothman changed the title [MRG] Add DBSCAN support for additional metric params [MRG+1] Add DBSCAN support for additional metric params Jan 1, 2017
@agramfort agramfort merged commit 543b056 into scikit-learn:master Jan 2, 2017
3 checks passed
3 checks passed
ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@agramfort
Copy link
Member

@agramfort agramfort commented Jan 2, 2017

thx @naoyak

@naoyak naoyak deleted the naoyak:dbscan-metric-params branch Jan 3, 2017
raghavrv added a commit to raghavrv/scikit-learn that referenced this pull request Jan 5, 2017
sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
@Przemo10 Przemo10 mentioned this pull request Mar 17, 2017
Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.