
[MRG+1] Add DBSCAN support for additional metric params #8139

Merged
merged 2 commits into scikit-learn:master from naoyak:dbscan-metric-params on Jan 2, 2017

Conversation

@naoyak
Contributor

naoyak commented Jan 1, 2017

Fixes #4520

  • Added support for the additional metric_params argument in DBSCAN (i.e. passing {'p': p} when using 'minkowski' as the metric); see the usage sketch after this list
  • Added a test to confirm that this works
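
For context, a minimal usage sketch of the new argument (toy data and parameter values are illustrative, not taken from the PR):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data for illustration only.
X = np.random.RandomState(0).rand(100, 2)

# metric_params forwards extra keyword arguments to the metric;
# here it sets the Minkowski power parameter p.
db = DBSCAN(eps=0.3, min_samples=5, metric='minkowski',
            metric_params={'p': 1}).fit(X)
print(db.labels_)
```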

@naoyak naoyak changed the title from Add DBSCAN support for additional metric params to [MRG] Add DBSCAN support for additional metric params Jan 1, 2017

@jnothman

Thanks for the contribution

sklearn/cluster/dbscan_.py
@@ -50,6 +50,11 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
must be square. X may be a sparse matrix, in which case only "nonzero"
elements may be considered neighbors for DBSCAN.
metric_params : dict, optional (default = None)

@jnothman

jnothman Jan 1, 2017

Member

I think "optional" is sufficient when the semantics of the default are not clear.

sklearn/cluster/dbscan_.py
@@ -184,6 +190,11 @@ class DBSCAN(BaseEstimator, ClusterMixin):
.. versionadded:: 0.17
metric *precomputed* to accept precomputed sparse matrix.
metric_params : dict, optional (default = None)

@jnothman

jnothman Jan 1, 2017

Member

I think "optional" is sufficient when the semantics of the default are not clear.

sklearn/cluster/tests/test_dbscan.py
# Different eps to other test, because distance is not normalised.
eps = 0.8
min_samples = 10
p = 2

@jnothman

jnothman Jan 1, 2017

Member

This isn't a very strong test, as p=2 is the default for minkowski in our nearest neighbors implementation. Trying this for p=1 would be more demonstrative. Or both.

@naoyak

naoyak Jan 1, 2017

Contributor

Changing to p=1.
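
A minimal sketch of why the p=1 case is more telling (illustrative data and values, not the PR diff): with p=1 the Minkowski distance reduces to the Manhattan distance, so the metric_params route can be cross-checked against an explicit 'manhattan' run.

```python
import numpy as np
from sklearn.cluster import dbscan

X = np.random.RandomState(42).rand(50, 3)

# Minkowski with p=1 is the Manhattan (cityblock) distance, so these two
# calls should agree only if metric_params is actually forwarded.
core_1, labels_1 = dbscan(X, eps=0.8, min_samples=10,
                          metric='minkowski', metric_params={'p': 1})
core_2, labels_2 = dbscan(X, eps=0.8, min_samples=10, metric='manhattan')

assert np.array_equal(labels_1, labels_2)
assert np.array_equal(core_1, core_2)
```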

sklearn/cluster/tests/test_dbscan.py
# number of clusters, ignoring noise if present
n_clusters_1 = len(set(labels)) - int(-1 in labels)
assert_equal(n_clusters_1, n_clusters)

@jnothman

jnothman Jan 1, 2017

Member

I would've thought that the complete set of labels were identical. Why are we using as weak an assertion as this?

@naoyak

naoyak Jan 1, 2017

Contributor

That struck me as odd too until I just read up on DBSCAN and its implementation:

> The algorithm is non-deterministic, but the core samples will always belong to the same clusters (although the labels may be different). The non-determinism comes from deciding to which cluster a non-core sample belongs. A non-core sample can have a distance lower than eps to two core samples in different clusters. By the triangular inequality, those two core samples must be more distant than eps from each other, or they would be in the same cluster. The non-core sample is assigned to whichever cluster is generated first, where the order is determined randomly. Other than the ordering of the dataset, the algorithm is deterministic, making the results relatively stable between runs on the same data.

So if you run the algorithm with the same parameters twice, it can differ in two ways:

  • The cluster labels themselves (meaning cluster 1 can be cluster 2 when run again, I think)
  • The non-core samples (which are included here in the labels returned by dbscan())

Even then I agree that the tests for DBSCAN seem pretty weak. It should probably at least test whether the core samples end up in the same clusters, though you have to match the cluster labels between runs.
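
If runs could only differ by a relabelling, a permutation-invariant comparison such as the adjusted Rand index would sidestep the label-matching problem. A rough sketch with illustrative parameters (not the PR test):

```python
import numpy as np
from sklearn.cluster import dbscan
from sklearn.metrics import adjusted_rand_score

X = np.random.RandomState(0).rand(60, 2)
_, labels_a = dbscan(X, eps=0.3, min_samples=5)
_, labels_b = dbscan(X, eps=0.3, min_samples=5)

# adjusted_rand_score is invariant to permuting cluster ids, so it compares
# the partitions rather than the raw label integers; 1.0 means the two runs
# produced the same clustering up to relabelling.  (Noise points labelled -1
# are treated as one extra cluster here, which is a simplification.)
assert adjusted_rand_score(labels_a, labels_b) == 1.0
```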

@naoyak

naoyak Jan 1, 2017

Contributor

I'm not sure that I'd be the best person to undertake a refactoring of the DBSCAN tests, though. If they aren't up to par it might be best addressed in a separate issue.

@jnothman

jnothman Jan 1, 2017

Member

For a given random state the labels should be deterministic. The only difference here might be numerical precision errors, but they shouldn't be an issue, I'd think.

@jnothman

jnothman Jan 1, 2017

Member

You can start by making this one better.

@jnothman

jnothman Jan 1, 2017

Member

Note that that blurb on determinism has been changed in master: http://scikit-learn.org/dev/modules/clustering.html#dbscan

@jnothman

jnothman Jan 1, 2017

Member

And apologies: no random_state involved.

@naoyak

naoyak Jan 1, 2017

Contributor

Oh wow, the new explanation is much stronger (did the implementation change?).

You're right, the labels (as well as the core sample subset) are deterministic - I rewrote the test to validate this in the latest commit. Maybe at some point I can get around to refactoring some of the other tests.

@jnothman

jnothman Jan 1, 2017

Member

> Oh wow, the new explanation is much stronger (did the implementation change?).

The implementation didn't change. @jbednar, who was frustrated at the previous description's inaccuracy, is to thank.

@naoyak

naoyak commented Jan 1, 2017

Contributor

Updated the new test to hopefully be a bit more rigorous.
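
One way such a stricter test could be structured (a sketch with illustrative data, not the actual committed test): compare the metric_params route against the same clustering computed from an explicitly precomputed p=1 distance matrix, asserting full label and core-sample equality.

```python
import numpy as np
from sklearn.cluster import dbscan
from sklearn.metrics import pairwise_distances

X = np.random.RandomState(42).rand(50, 3)
eps, min_samples = 0.8, 10

core_a, labels_a = dbscan(X, eps=eps, min_samples=min_samples,
                          metric='minkowski', metric_params={'p': 1})

# The same clustering from a precomputed p=1 distance matrix; the labels
# and core-sample indices should match element for element.
D = pairwise_distances(X, metric='minkowski', p=1)
core_b, labels_b = dbscan(D, eps=eps, min_samples=min_samples,
                          metric='precomputed')

np.testing.assert_array_equal(labels_a, labels_b)
np.testing.assert_array_equal(core_a, core_b)
```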

@jnothman jnothman changed the title from [MRG] Add DBSCAN support for additional metric params to [MRG+1] Add DBSCAN support for additional metric params Jan 1, 2017

@agramfort agramfort merged commit 543b056 into scikit-learn:master Jan 2, 2017

3 checks passed

  • ci/circleci: Your tests passed on CircleCI!
  • continuous-integration/appveyor/pr: AppVeyor build succeeded
  • continuous-integration/travis-ci/pr: The Travis CI build passed
@agramfort

agramfort commented Jan 2, 2017

Member

thx @naoyak

@naoyak naoyak deleted the naoyak:dbscan-metric-params branch Jan 3, 2017

raghavrv added a commit to raghavrv/scikit-learn that referenced this pull request Jan 5, 2017

[MRG+1] Add DBSCAN support for additional metric params (#8139)
* Add DBSCAN support for additional metric params

sergeyf added a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017

[MRG+1] Add DBSCAN support for additional metric params (#8139)
* Add DBSCAN support for additional metric params

@Przemo10 Przemo10 referenced this pull request Mar 17, 2017

Closed

update fork (#1) #8606

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+1] Add DBSCAN support for additional metric params (#8139)
* Add DBSCAN support for additional metric params

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

[MRG+1] Add DBSCAN support for additional metric params (#8139)
* Add DBSCAN support for additional metric params

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+1] Add DBSCAN support for additional metric params (#8139)
* Add DBSCAN support for additional metric params

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG+1] Add DBSCAN support for additional metric params (#8139)
* Add DBSCAN support for additional metric params