Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify DBSCAN eps parameter misunderstanding #13563

Merged
merged 1 commit into from
Apr 3, 2019

Conversation

kno10
Copy link
Contributor

@kno10 kno10 commented Apr 2, 2019

As seen here: https://stackoverflow.com/a/55388827/1939754
the old description of the eps parameter can be misunderstood as a maximum distance of any two points.
Also, people really need to tune this parameter, not rely on the bad default value.

@kno10 kno10 changed the title Clarify eps parameter misunderstanding Clarify DBSCAN eps parameter misunderstanding Apr 2, 2019
Copy link
Member

@qinhanmin2014 qinhanmin2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @kno10

as in the same neighborhood.
The maximum distance between two samples for one to be considered
as in the neighborhood of the other. This is not a maximum bound
on the distances of points within a cluster, and the most important
Copy link
Member

@qinhanmin2014 qinhanmin2014 Apr 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the most important -> one of the most important / an important since min_samples is also important?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @kno10 that min_samples is not as important because having more or fewer core samples in a region is less essential than determining whether samples are in the same (or any) cluster.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like "and" here as it's not clear what "the most important ..." applies to. Start a new sentence "This is the most important"

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might note the importance of tuning eps in the user guide (doc/modules/cluster.rst) or summary section of the docstring. Do we have an example illustrating the effect of this parameter? What would be a good dataset to illustrate with?

as in the same neighborhood.
The maximum distance between two samples for one to be considered
as in the neighborhood of the other. This is not a maximum bound
on the distances of points within a cluster, and the most important
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @kno10 that min_samples is not as important because having more or fewer core samples in a region is less essential than determining whether samples are in the same (or any) cluster.

as in the same neighborhood.
The maximum distance between two samples for one to be considered
as in the neighborhood of the other. This is not a maximum bound
on the distances of points within a cluster, and the most important
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like "and" here as it's not clear what "the most important ..." applies to. Start a new sentence "This is the most important"

As seen here: https://stackoverflow.com/a/55388827/1939754
the old description of the eps parameter can be misunderstood as a maximum distance of any two points.

Also add a reference that discusses parameterization.
@kno10
Copy link
Contributor Author

kno10 commented Apr 3, 2019

Made this two sentences, added a reference with discussion of parameterization, and added a paragraph to the user guide on parameterization, too.

Copy link
Member

@qinhanmin2014 qinhanmin2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks!

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvement!

@jnothman jnothman merged commit 51b1852 into scikit-learn:master Apr 3, 2019
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Apr 25, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants