Add silhouette width using generalized mean #17817

Open
candalfigomoro opened this issue Jul 2, 2020 · 4 comments

@candalfigomoro

Describe the workflow you want to enable

According to https://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient

The Silhouette Coefficient is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.

This seems to be true for all scores (available in sklearn) that are usable when ground truth labels are not known, and it could be an issue when you try to evaluate clusters obtained through algorithms such as DBSCAN and OPTICS.

Describe your proposed solution

The paper at https://www.researchgate.net/publication/337370281_Silhouette_width_using_generalized_mean-A_flexible_method_for_assessing_clustering_efficiency proposes using the generalized mean (https://en.wikipedia.org/wiki/Generalized_mean) instead of the arithmetic mean to calculate the average within- and between-cluster distances. The arithmetic mean is the special case of the generalized mean with p=1, so by changing the p parameter we can change the sensitivity of the index to compactness versus connectedness. This can be useful when we try to evaluate non-spherical clusters.
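For reference, here is a minimal sketch of the generalized (power) mean from the Wikipedia page linked above. The function name `generalized_mean` is only an illustration (it is not a scikit-learn API), and the sketch assumes strictly positive inputs, which would not hold for duplicate points with zero distance.

```python
import math


def generalized_mean(values, p):
    """Power mean M_p of strictly positive values (illustrative helper).

    p = 1 is the arithmetic mean, p = 0 the geometric mean (the limit
    as p -> 0), p = -1 the harmonic mean, and p = +inf / -inf give the
    maximum / minimum.
    """
    values = list(values)
    if not values:
        raise ValueError("values must be non-empty")
    if any(v <= 0 for v in values):
        raise ValueError("this sketch assumes strictly positive values")
    if p == 0:
        # Limit case: geometric mean, computed in log space.
        return math.exp(sum(math.log(v) for v in values) / len(values))
    if math.isinf(p):
        return max(values) if p > 0 else min(values)
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)
```

For a fixed set of distances the result does not decrease as p grows, so a lower p gives more weight to the small distances and a higher p gives more weight to the large ones.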

I know that this paper doesn't have 200+ citations. However, it's a minor addition (a generalization of the silhouette width calculation), we can keep p=1 as the default value of the p parameter to preserve the current behavior, and it covers a scenario that sklearn currently doesn't cover (evaluation of non-convex clusters when ground truth labels are not available).

Is this something that we would like to have in sklearn?

@adrinjalali
Member

silhouette_score already accepts a metric param which you can use to pass any of the metrics we have in sklearn, or even a callable (function), and the extra metric parameters through the **kwargs. So it seems to me that we already have what you need?
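As a rough illustration of the existing options mentioned here, the sketch below calls `silhouette_score` with the default metric, with a named metric plus an extra keyword argument, and with a callable. The toy data and the `chebyshev` helper are assumptions made for the example. Note that the `p` passed here is the order of the Minkowski distance between two points, forwarded to the distance computation; it is a different parameter from the generalized-mean exponent proposed in this issue.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Toy data (assumption for this example): two well-separated 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(20, 2)),
    rng.normal(5.0, 0.5, size=(20, 2)),
])
labels = np.array([0] * 20 + [1] * 20)

# Default: Euclidean distances between samples.
score_euclidean = silhouette_score(X, labels)

# Named metric; extra keyword arguments are forwarded to the distance function.
score_minkowski = silhouette_score(X, labels, metric="minkowski", p=3)


def chebyshev(u, v):
    """Hypothetical user-defined distance between two 1-D sample vectors."""
    return float(np.max(np.abs(u - v)))


# Callable metric, evaluated on pairs of rows of X.
score_callable = silhouette_score(X, labels, metric=chebyshev)
```

In all three calls only the point-to-point distance changes; the per-cluster averaging of those distances stays the arithmetic mean.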

@candalfigomoro
Author

candalfigomoro commented Jul 6, 2020

@adrinjalali
You can still use the metric param to pass any of the metrics we have in sklearn. What changes is how the averages over those distances are computed (the arithmetic mean is currently computed with a simple summation here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/cluster/_unsupervised.py#L138).
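To make the distinction concrete, here is a rough sketch of per-sample silhouette widths in which the averaging step is a generalized mean. It is not the scikit-learn implementation and not necessarily the paper's exact formulation: the function names are hypothetical, it works on a precomputed distance matrix, and it assumes at least two clusters and no zero distances between distinct points.

```python
import numpy as np


def power_mean(d, p):
    """Generalized mean of a 1-D array of strictly positive distances."""
    d = np.asarray(d, dtype=float)
    if p == 0:
        # Limit case p -> 0: geometric mean.
        return float(np.exp(np.mean(np.log(d))))
    return float(np.mean(d ** p) ** (1.0 / p))


def silhouette_samples_generalized(D, labels, p=1.0):
    """Per-sample silhouette widths from a precomputed distance matrix D.

    Hypothetical helper. With p=1 the averages are arithmetic means,
    which is the usual silhouette definition.
    """
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        own[i] = False  # exclude the sample itself from its own cluster
        if not own.any():
            continue  # singleton cluster: silhouette width is left at 0
        # a: mean distance to the other members of the sample's cluster.
        a = power_mean(D[i, own], p)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            power_mean(D[i, labels == c], p)
            for c in clusters
            if c != labels[i]
        )
        s[i] = (b - a) / max(a, b)
    return s
```

The metric only determines how `D` is built; the exponent `p` only affects how each row of distances is averaged, so the two choices are independent.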

@adrinjalali
Member

I'd be open to considering its inclusion if there's a PR, the change is simple and maintainable enough, and there are benchmarks showing its effectiveness. But I'd wait for at least a second opinion first, considering the article has only 1 citation that I could find.

Maybe @amueller or @ogrisel could have a look?

@candalfigomoro
Author

Related issue about clustering evaluation for non-convex clusters: #12937

2 participants