
Add something about parameter selection to the notebooks? #31

Closed
gwerbin opened this issue Mar 30, 2016 · 4 comments

gwerbin commented Mar 30, 2016

I really appreciate the work you've done with the "How HDBSCAN Works" notebook, but it doesn't really touch on the issue of parameter selection. You do mention at the end of the clustering comparison notebook that HDBSCAN is not that sensitive to the choice of min_samples, but that isn't much to go on.

If I'm not mistaken, the alpha and min_samples parameters are both "inherited" from robust single-linkage clustering. But Chaudhuri and Dasgupta are not terribly forthcoming about what their alpha parameter actually does, apart from stating that "we'd like alpha as small as possible" (p. 6, "Rates of convergence for the cluster tree"). So I'm left scratching my head as to how to select these.

Do you think you'd be able to talk about how to intelligently choose the core size (min_samples) and that alpha parameter? You also mention sensible defaults, but you don't mention what makes them sensible.

Unfortunately, I don't have access to Campello et al.'s original paper, so if it's explained there I won't be able to see it. Dr. Campello mentions on his site that unformatted copies of his work are available, and I plan to reach out to him for a copy if I can't get access through my old university. But I still think parameter selection would be a great addition to these already very helpful notebooks (and it can't hurt the adoption of HDBSCAN itself as a "sensible default" for clustering applications).

lmcinnes (Collaborator) commented

You're right that it might be useful to have a notebook on parameter selection. I'll see if I can get something useful put together.

In summary, though, the short answer is that HDBSCAN is fairly robust to parameter selection. Some rules of thumb (with a short usage sketch after the list):

  • Don't touch alpha unless you know what you're doing; the default (1.0) should be fine.
  • Think of min_samples as a measure of how noisy you expect your data to be: larger min_samples anticipates noisier data (to some degree).
  • Results should be largely robust to min_samples; something in the 5–15 range works well for a wide variety of data, and the exact choice shouldn't matter too much.
  • We set min_samples to min_cluster_size by default (if min_samples isn't specified), and this is usually the right choice.
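
For concreteness, here's a minimal sketch of what these defaults look like in practice (the toy data is purely illustrative):

```python
import hdbscan
from sklearn.datasets import make_blobs

# Illustrative toy data; substitute your own.
data, _ = make_blobs(n_samples=1000, centers=5, random_state=42)

# Default behaviour: alpha stays at 1.0 and min_samples falls back
# to min_cluster_size, which is usually the right choice.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(data)

# For noisier data, raise min_samples independently of min_cluster_size.
noisy_clusterer = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=10)
noisy_labels = noisy_clusterer.fit_predict(data)
```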

I am also working on eliminating min_samples (and alpha) as parameters altogether, leaving only min_cluster_size, which should be relatively intuitive to pick. This involves some theoretical considerations, and while I have the skeleton of the relevant theory, I haven't yet worked out the details needed for an efficient implementation.

lmcinnes (Collaborator) commented

As an added note, from Chaudhuri and Dasgupta one can find bounds of

```
alpha = sqrt(2)
min_samples ~ d * log(n)
```

where d is the dimension of the data and n is the number of data points. These aren't actually very practical values, however: they are the values needed to prove convergence to the level-set tree (a valuable theoretical result!), but in practice they are larger than you really want to use. They might help get you into an appropriate ballpark, though.
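
As a back-of-the-envelope calculation (the helper name here is just for illustration):

```python
import math

def chaudhuri_dasgupta_bounds(n_points, n_dims):
    """Illustrative helper: the theoretical values discussed above.
    In practice these tend to be larger than you actually want."""
    alpha = math.sqrt(2)
    min_samples = math.ceil(n_dims * math.log(n_points))
    return alpha, min_samples

# e.g. 10,000 points in 3 dimensions
alpha, min_samples = chaudhuri_dasgupta_bounds(10_000, 3)
print(alpha, min_samples)  # ~1.414, 28
```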

gwerbin (Author) commented Apr 1, 2016

Thanks for the reply!

Basically, you're arguing that convergence is less important than keeping those parameters small. Is that an appropriate summary?

Also, they mention in the paper that regular old single-linkage clustering always converges in one dimension. Do you think this would be true of the hierarchical clustering underlying HDBSCAN as well?

lmcinnes (Collaborator) commented Apr 1, 2016

The convergence guarantees are asymptotic: the method converges given infinite data. The stated bounds are merely what's needed for their method of proof to go through, not hard requirements for convergence. They comment themselves that they believe better bounds may exist and still provide convergence; they just couldn't construct the proof for such at the time of writing the paper.

Given all that I wouldn't worry too much about the bounds for practical purposes.

I do also expect any convergence result true of single linkage ought to go through for HDBSCAN since, for suitable choices of parameters, HDBSCAN will replicate the hierarchy of single linkage. Of course I don't have a proof of comparable convergence, but I believe I can handwave in the appropriate direction.
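
To make that handwave concrete, here is a rough sanity check (assuming the single_linkage_tree_ attribute exposed by recent hdbscan versions): with min_samples=1 the mutual-reachability distance should collapse to the raw metric, so HDBSCAN's internal hierarchy ought to match plain single linkage.

```python
import numpy as np
import hdbscan
from scipy.cluster.hierarchy import linkage

data = np.random.RandomState(0).rand(50, 2)

# With min_samples=1, each point's core distance is at most its distance
# to any other point, so mutual reachability reduces to the raw metric
# and the resulting hierarchy should be ordinary single linkage.
clusterer = hdbscan.HDBSCAN(min_samples=1, min_cluster_size=5).fit(data)
hdb_heights = np.sort(clusterer.single_linkage_tree_.to_numpy()[:, 2])

scipy_heights = np.sort(linkage(data, method='single')[:, 2])
print(np.allclose(hdb_heights, scipy_heights))  # merge heights should agree
```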

gwerbin closed this as completed Apr 17, 2016