Add silhouette width using generalized mean #17817

candalfigomoro · 2020-07-02T16:19:42Z

Describe the workflow you want to enable

According to https://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient:

"The Silhouette Coefficient is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN."

This seems to be true for all scores (available in sklearn) that are usable when ground truth labels are not known, and it could be an issue when you try to evaluate clusters obtained through algorithms such as DBSCAN and OPTICS.

Describe your proposed solution

The paper at https://www.researchgate.net/publication/337370281_Silhouette_width_using_generalized_mean-A_flexible_method_for_assessing_clustering_efficiency proposes using the generalized mean (https://en.wikipedia.org/wiki/Generalized_mean) instead of the arithmetic mean to calculate the average within-cluster and between-cluster distances. The arithmetic mean is the special case of the generalized mean with p=1. Changing the p parameter changes how sensitive the index is to compactness versus connectedness, which can be useful when evaluating non-spherical clusters.
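To make the idea concrete, here is a minimal pure-Python sketch of a silhouette width in which the within-cluster and nearest-other-cluster averages are power means rather than arithmetic means. The function names `generalized_mean` and `silhouette_samples_generalized` are hypothetical (nothing like them exists in scikit-learn), the distance is fixed to Euclidean for brevity, and the handling of singleton clusters and of p=0 is my own assumption rather than something taken from the paper.

```python
import math


def generalized_mean(values, p):
    """Power mean of strictly positive values; p=1 is the arithmetic mean."""
    values = list(values)
    if p == 0:
        # The limit p -> 0 of the power mean is the geometric mean.
        return math.exp(sum(math.log(v) for v in values) / len(values))
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


def silhouette_samples_generalized(points, labels, p=1.0):
    """Per-sample silhouette widths using power-mean averaging.

    Hypothetical helper, not part of scikit-learn. Assumes at least two
    clusters and no duplicate points (zero distances break p <= 0).
    """
    n = len(points)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        # Distances from point i to every other point, grouped by cluster.
        by_cluster = {k: [] for k in clusters}
        for j in range(n):
            if j != i:
                by_cluster[labels[j]].append(math.dist(points[i], points[j]))
        own = by_cluster[labels[i]]
        if not own:
            widths.append(0.0)  # singleton cluster: width 0 (assumed convention)
            continue
        a = generalized_mean(own, p)  # within-cluster average
        b = min(  # average distance to the nearest other cluster
            generalized_mean(d, p)
            for k, d in by_cluster.items()
            if k != labels[i]
        )
        widths.append((b - a) / max(a, b))
    return widths


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
widths_arithmetic = silhouette_samples_generalized(points, labels, p=1.0)
widths_harmonic = silhouette_samples_generalized(points, labels, p=-1.0)
```

With p=1.0 this reduces to the usual silhouette definition. Smaller values of p weight short distances more heavily and larger values weight long distances more heavily, which is the knob the proposal is about.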

I know that this paper doesn't have 200+ citations. However, it's a minor addition (a generalization of the silhouette width calculation), we can keep p=1 as the default value of the p parameter to preserve the current behavior, and it covers a scenario that sklearn currently doesn't cover: evaluating non-convex clusters when ground truth labels are not available.

Is this something that we would like to have in sklearn?

adrinjalali · 2020-07-05T11:48:19Z

silhouette_score already accepts a metric param which you can use to pass any of the metrics we have in sklearn, or even a callable (function), and the extra metric parameters through the **kwargs. So it seems to me that we already have what you need?
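For reference, a small sketch of the existing options mentioned here: a named metric, a named metric with an extra keyword argument, and a callable. The toy data and the `chebyshev` helper are made up for illustration. Note that the `p` passed below is the Minkowski distance exponent, which changes the pairwise distances themselves, not how they are averaged.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Toy data (illustrative only): two well-separated Gaussian blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels = np.repeat([0, 1], 50)

# A built-in metric selected by name.
score_euclidean = silhouette_score(X, labels, metric="euclidean")

# Extra keyword arguments are forwarded to the distance function;
# here p is the Minkowski exponent.
score_minkowski = silhouette_score(X, labels, metric="minkowski", p=3)


def chebyshev(u, v):
    """Hypothetical user-defined metric: maximum coordinate difference."""
    return np.max(np.abs(u - v))


# A callable that takes two 1-D arrays and returns a distance.
score_callable = silhouette_score(X, labels, metric=chebyshev)
```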

candalfigomoro · 2020-07-06T07:18:57Z

@adrinjalali
With this proposal you can still use the metric param to pass any of the metrics we have in sklearn. What changes is how the averages are computed; the arithmetic mean is currently computed with a simple summation here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/cluster/_unsupervised.py#L138.
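Written out, the distinction is roughly the following. The notation is assumed here for illustration and is not copied from the paper: d is whatever distance the metric param selects, C_i is the cluster containing sample i, and M_p is the generalized mean.

```latex
M_p(x_1, \dots, x_n) = \left( \frac{1}{n} \sum_{j=1}^{n} x_j^{\,p} \right)^{1/p}

a_p(i) = M_p\bigl( \{ d(i, j) : j \in C_i,\ j \neq i \} \bigr)
\qquad
b_p(i) = \min_{C \neq C_i} M_p\bigl( \{ d(i, j) : j \in C \} \bigr)

s_p(i) = \frac{b_p(i) - a_p(i)}{\max\bigl( a_p(i),\, b_p(i) \bigr)}
```

Setting p = 1 makes M_p the arithmetic mean and recovers the standard silhouette width, while the choice of d stays independent of p.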

adrinjalali · 2020-07-06T08:58:21Z

I'd be open to considering its inclusion if there's a PR, the change is simple and maintainable enough, and there are benchmarks showing its effectiveness. But I'd wait for at least a second opinion first, considering that the article has only 1 citation that I could find.

Maybe @amueller or @ogrisel could have a look?

candalfigomoro · 2021-03-25T08:27:59Z

Related issue about clustering evaluation for non-convex clusters: #12937

candalfigomoro added the New Feature label Jul 2, 2020
