The Proper Way of Calculating Performance Metrics #283

Closed · psygo opened this issue Mar 20, 2019 · 7 comments

psygo commented Mar 20, 2019

I didn't find this in the documentation (which is awesome, in my opinion), so I figured I could ask it here.

Can you simply use common performance metrics with HDBSCAN? For example, one thing I tried with the dataset I'm working on was to take out the data that's being considered noise (the results with the noise weren't that much different) and apply the Silhouette Index, the Calinski-Harabasz Index, and the Davies-Bouldin Index to it. However, I have no way of knowing whether this is correct. CHI and DBI might be OK, but the SIL is really weird: when I calculate it for K-Means I get close to 0.8, while for HDBSCAN I get a horrible 0 (actually, -0.03). So what is the proper way of using these performance metrics (or any others) on HDBSCAN?
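For concreteness, what I tried looks roughly like the sketch below (assuming a recent scikit-learn; the dataset and hyperparameters are placeholders, not my real data):

```python
# Sketch: drop the points HDBSCAN labels as noise (label == -1) and score the
# remainder with the usual scikit-learn metrics. Synthetic placeholder data.
import hdbscan
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X)

mask = labels != -1  # keep only points assigned to an actual cluster
print("Silhouette:        ", silhouette_score(X[mask], labels[mask]))
print("Calinski-Harabasz: ", calinski_harabasz_score(X[mask], labels[mask]))
print("Davies-Bouldin:    ", davies_bouldin_score(X[mask], labels[mask]))
```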

lmcinnes (Collaborator) commented

Performance metrics often match up with clustering algorithms -- in that there is almost always an algorithm that optimizes the same objective that the metric measures (or, less charitably, there is almost always a clustering metric that makes a given clustering algorithm the best). A lot of metrics are not geared toward density-based clustering and tend to assume spherical clusters. You might consider DBCV, which is a metric designed for density-based clustering. There is an implementation in validation.py in the hdbscan package.
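A minimal sketch of calling that implementation (synthetic data; the hyperparameters are placeholders):

```python
# Sketch: score an HDBSCAN clustering with the DBCV-style validity index
# shipped in the hdbscan package. Synthetic placeholder data.
import hdbscan
import numpy as np
from hdbscan.validity import validity_index
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
X = X.astype(np.float64)  # the index is computed on double-precision arrays

labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X)

# Higher is better; DBCV scores lie in [-1, 1].
print(validity_index(X, labels))
```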

psygo commented Mar 22, 2019

Thanks for the reply. (I think the file is called validity.py?)

Well, validating this yielded worse results than I expected T.T (-0.7).

On a side note (it seems you're way, way beyond me when it comes to clustering): would you recommend scaling the data (which I did) before applying a density-based clustering algorithm?

lmcinnes (Collaborator) commented

I think ultimately it is the comparison among hyperparameters and with other algorithms that will be most telling.

As to scaling data -- are you scaling features independently? If so, that can be either sensible or a bad idea, very much depending on context. Ultimately, what matters for most clustering applications is the distance between points. Rescaling features (or the data in general) is an attempt to make easy distance measures, like Euclidean distance, meaningful for your data.

If you have features on different scales that are of roughly equal importance, then you'll want to rescale them to all have similar ranges. On the other hand, if you have, say, pixel values in an image, then there is an inherent scale, and rescaling based on the range of values in the data would effectively over-weight the importance of pixels that are more likely to take on low values. The question is really about what distance should mean for your data, and what would make the most sense for your specific domain. That, unfortunately, I can't answer for you.
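To make the "rescale features independently" case concrete, here is a small sketch (the features and their scales are invented purely for illustration):

```python
# Sketch: per-feature standardization before a density-based clustering,
# for the case where columns are of roughly equal importance but live on
# very different scales. Synthetic placeholder data.
import hdbscan
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(loc=0.0, scale=1.0, size=500),     # feature on a unit-ish scale
    rng.normal(loc=0.0, scale=1000.0, size=500),  # feature on a much larger scale
])

# Each column gets zero mean and unit variance, so Euclidean distance no
# longer ignores the small-scale feature.
X_scaled = StandardScaler().fit_transform(X)

labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X_scaled)
```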

psygo commented Mar 23, 2019

After thinking a little bit more about the problem of having multiple indexes that could be offering mixed opinions on which algorithm is best, I'm wondering about simply approximating the pdfs of the found cluster distributions and calculating the area under the curve of these functions' product (or maybe their difference?). The closer that area gets to zero, the better the clustering.

However, it would be difficult to analyze (graphically) high-dimensional pdfs in those terms; visualizations would yield partially incorrect views of the data, since there might be interactions between the variables. Thus, I would suggest using PCA (or maybe kernel PCA, if the data is highly non-linear) to orthogonalize the data before running the clustering algorithm, which would "guarantee" that there are no feature interactions.
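Something like this hypothetical pipeline (the component count and hyperparameters are arbitrary placeholders):

```python
# Sketch: decorrelate the features with PCA and then run HDBSCAN on the
# transformed data. Synthetic placeholder data and arbitrary settings.
import hdbscan
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=4, n_features=10, random_state=0)

# Standardize first so PCA isn't dominated by large-scale features, then
# keep a handful of orthogonal components (KernelPCA could replace PCA here
# if the structure is highly non-linear).
X_std = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=5).fit_transform(X_std)

labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X_pca)
```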

What do you think about my idea? Do you see any mistakes (or, hopefully, improvements)?

lmcinnes (Collaborator) commented

What you describe is essentially what hdbscan does. First it attempts to approximate the pdf; then, given an approximate pdf, it attempts to essentially optimize the area of the pdf given by a clustering (where the clustering is derived from the tree representation of the pdf). Of course, all of this is largely computationally intractable, so it all comes down to how one performs these approximations. You can try reading section 2.1, "Statistically Motivated HDBSCAN*", of our paper on the topic and see if it aligns with what you have in mind.
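For what it's worth, the tree representation mentioned above is exposed on a fitted model; a rough sketch of inspecting it (synthetic data, arbitrary hyperparameters):

```python
# Sketch: look at the condensed tree that HDBSCAN extracts clusters from,
# and the per-cluster persistence scores. Synthetic placeholder data.
import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(X)

# The condensed tree is the hierarchy the flat clustering is read off of;
# cluster_persistence_ reports how stable each extracted cluster is.
print(clusterer.condensed_tree_.to_pandas().head())
print(clusterer.cluster_persistence_)
```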

psygo commented Mar 23, 2019

It's clear that I need to study more. In my defense, I wouldn't say it's laziness, but rather the feeling that I didn't know what would be enough to solve my problems. Anyway, before I close this issue, do you mind addressing some final thoughts? :[]

So, HDBSCAN optimizes directly for the index I had mentioned... is there a way of extracting that index from HDBSCAN's fit? And wouldn't this index be useful for evaluating any clustering algorithms?

Which books would you recommend reading when starting out in this area? Since the documentation of HDBSCAN has an educational feel to it, it would be nice to have a section on what to read (and in what order) to fully understand what you're doing.

lmcinnes (Collaborator) commented

HDBSCAN optimizes for a computable approximation of the metric you are suggesting, where the approximation involves cutting significant corners. Which corners to cut, where, and how, make up most of the difference between various density-based clustering algorithms. There isn't necessarily an obvious right answer.

On the other side of things, there isn't a correct metric for clustering either. Ultimately there is no ground truth, no single true objective measure, for unsupervised learning tasks. That is kind of the point -- for whatever measure you divine there is ostensibly an algorithm to optimize against that measure, or some acceptable approximation thereof. This is more of a philosophical argument, but I would recommend Christian Hennig's "What are the True Clusters?" as a good resource on this.

psygo closed this as completed Mar 26, 2019