silhouette_score already accepts a metric parameter, which you can use to pass any of the metrics we have in sklearn, or even a callable (function), and extra metric parameters can be passed through **kwargs. So it seems to me that we already have what you need?
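For reference, a quick sketch of what the existing metric parameter allows, both with a built-in metric name and with a custom distance callable (the data here is made up for illustration):

```python
import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])

# Built-in metric selected by name:
s_manhattan = silhouette_score(X, labels, metric="manhattan")

# Arbitrary callable taking two 1-D arrays and returning a float:
def chebyshev(u, v):
    return np.max(np.abs(u - v))

s_chebyshev = silhouette_score(X, labels, metric=chebyshev)
print(s_manhattan, s_chebyshev)
```

This covers custom distances, but note it does not change how the within- and between-cluster averages themselves are computed, which is what the proposal below is about.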
I'd be open to considering its inclusion if there's a PR, the change is simple and maintainable enough, and benchmarks show its effectiveness. But I'd wait for at least a second opinion first, considering the article has only one citation that I could find.
Describe the workflow you want to enable
According to https://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient:

> The Silhouette Coefficient is generally higher for convex clusters than other concepts of clusters, such as density based clusters like those obtained through DBSCAN.

This seems to be true for all scores (available in sklearn) that are usable when ground truth labels are not known, and it can be an issue when you try to evaluate clusters obtained through algorithms such as DBSCAN and OPTICS.
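To illustrate the issue with a toy example (synthetic data, chosen for illustration): well-separated convex blobs score high on the silhouette even when labelled by their true generators, while perfectly labelled non-convex half-moons score much lower:

```python
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import silhouette_score

# Convex, well-separated blobs: silhouette rewards them.
X_blobs, y_blobs = make_blobs(
    n_samples=150, centers=[[-5, 0], [5, 0], [0, 8]],
    cluster_std=0.5, random_state=0,
)
blob_score = silhouette_score(X_blobs, y_blobs)

# Non-convex half-moons labelled with the ground truth: the clustering
# is "correct", yet the silhouette score is much lower.
X_moons, y_moons = make_moons(n_samples=200, noise=0.05, random_state=0)
moon_score = silhouette_score(X_moons, y_moons)
print(blob_score, moon_score)
```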
Describe your proposed solution
The following paper https://www.researchgate.net/publication/337370281_Silhouette_width_using_generalized_mean-A_flexible_method_for_assessing_clustering_efficiency proposes using the generalized mean (https://en.wikipedia.org/wiki/Generalized_mean) instead of the arithmetic mean to compute the average within- and between-cluster distances. By changing the p parameter of the generalized mean (the arithmetic mean is the special case with p=1), we can change the sensitivity of the index to compactness versus connectedness. This can be useful when we try to evaluate non-spherical clusters.

I know that this paper doesn't have 200+ citations, but it's a minor addition (a generalization of the silhouette width calculation): we can keep p=1 as the default value for the p parameter to preserve the current behavior, and it covers a scenario which is currently uncovered by sklearn (evaluation of non-convex clusters when ground truth labels are not available).

Is this something that we would like to have in sklearn?
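As a rough sketch of the idea (not sklearn API — generalized_mean and generalized_silhouette_score are hypothetical names, Euclidean distances and p != 0 are assumed):

```python
import numpy as np
from scipy.spatial.distance import cdist

def generalized_mean(d, p):
    # M_p(d) = (mean(d ** p)) ** (1 / p); p = 1 recovers the arithmetic mean.
    # Assumes p != 0 (p = 0 is the geometric mean and would need a special case).
    d = np.asarray(d, dtype=float)
    return np.mean(d ** p) ** (1.0 / p)

def generalized_silhouette_score(X, labels, p=1.0):
    # Hypothetical generalization: a(i) and b(i) are computed with the
    # generalized mean instead of the arithmetic mean.
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = cdist(X, X)  # pairwise Euclidean distances
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from a(i)
        if not same.any():
            continue  # singleton cluster: s(i) = 0 by convention
        a = generalized_mean(D[i, same], p)
        b = min(generalized_mean(D[i, labels == c], p)
                for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
score_p1 = generalized_silhouette_score(X, labels, p=1.0)  # current behaviour
score_p3 = generalized_silhouette_score(X, labels, p=3.0)  # weights large distances more
```

With p=1 this reduces to the standard silhouette computation, which is why the default would be backward compatible.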