Clustering based on cophenetic distance added #6234
In sklearn.cluster.hierarchical.py clustering based on cophenetic distance was added to Agglomerative Clustering as defined in issue #6197
Sorry for the lack of response. Can you give some context? I'm not familiar with this metric.
@BiaDarkia I would really like to see the cophenetic distance in scikit-learn. Any chance you can respond to the comment and get the PR accepted?
@Jeltje, are you able to help answer the query above?
@jnothman The cophenetic distance is a measure of how similar two objects have to be in order to be grouped into the same cluster. So if you want to cluster based on this distance, you create a distance matrix in which the original pairwise distances between the objects are replaced by the computed distances between their clusters at the time those clusters merge. Cophenetic distances can also be used to determine the cophenetic correlation coefficient of any other clustering method. It is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points, so cophenetic correlation can be used to evaluate clustering performance without knowing the ground truth. Currently only the silhouette score will do that in scikit-learn. The current PR only calculates the distance scores, but getting the coefficient should be relatively trivial, I think.
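To make the definition above concrete, here is a minimal sketch using SciPy rather than the code in this PR. It assumes average linkage and a small synthetic dataset (both are illustrative choices, not taken from the PR), and it uses `scipy.cluster.hierarchy.cophenet` to obtain both the cophenetic distances and the cophenetic correlation coefficient.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

# Synthetic data (assumption for illustration): two loose blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(5, 2)),
    rng.normal(3.0, 0.3, size=(5, 2)),
])

# Original pairwise distances in condensed form.
original = pdist(X)

# Build the hierarchy; average linkage is an arbitrary choice here.
Z = linkage(original, method="average")

# cophenet returns the cophenetic correlation coefficient and the
# condensed cophenetic distances (the merge height at which each pair
# of points first ends up in the same cluster).
coph_corr, coph_condensed = cophenet(Z, original)

# Square matrix in which the original distances are replaced by the
# merge-time distances, as described in the comment above.
coph_matrix = squareform(coph_condensed)

print("cophenetic correlation:", coph_corr)
print("cophenetic distance matrix shape:", coph_matrix.shape)
```

The largest cophenetic distance equals the height of the final merge in `Z`, since every pair of points is joined by then at the latest.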
Let's set the correlation coefficient aside for now (though we have an implementation of calinski-harabasz as well as silhouette). If cophenetic distances are merely the distance at cluster merge time, then I think #9069 is a simpler implementation of the same, which simply defines n_clusters as the number of merge distances above the threshold. Is this what you want? It is likely to be merged soon, imo. If the implementation needs to be more complex, can we just make it use fcluster? I've not yet understood the relationship between these implementations.
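As a rough illustration of the relationship asked about here, the sketch below uses SciPy's `fcluster` with `criterion="distance"`. It is not the implementation from #9069; the data, the average linkage, and the threshold value are assumptions chosen for the example. With a monotone linkage, cutting the tree at a threshold leaves one more cluster than the number of merges whose height exceeds that threshold.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic data (assumption for illustration): three blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.2, size=(6, 2)),
    rng.normal(4.0, 0.2, size=(6, 2)),
    rng.normal(8.0, 0.2, size=(6, 2)),
])

# Average linkage is monotone, so merge heights never decrease.
Z = linkage(X, method="average")

# Hypothetical cut height on the cophenetic (merge) distance.
threshold = 1.5

# Flat clusters in which every pair of members has a cophenetic
# distance no greater than the threshold.
labels = fcluster(Z, t=threshold, criterion="distance")

# Count the merges that happen above the threshold.
n_above = int(np.sum(Z[:, 2] > threshold))
n_clusters = len(np.unique(labels))

print("merges above threshold:", n_above)
print("flat clusters:", n_clusters)
```

Under these assumptions, thresholding the merge distances and calling `fcluster` on cophenetic distance describe the same cut of the dendrogram.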
Apologies for not replying in a timely manner. #9069 provides an elegant solution to cluster data into a specific number of clusters. Agglomerative clustering uses the same principle, i.e. clustering data into a specific number of clusters, but does not use Euclidean distance as a measure of distance. Instead, the cophenetic correlation coefficient is used as a measure of distance. However, adding the cophenetic correlation coefficient as a measure of distance once #9069 is merged would provide the requested feature. If you could let me know once #9069 is merged, I could look into adding the cophenetic correlation coefficient as a measure of distance.
Closing this PR, as it was meant to solve #6197, which has already been closed in a different way. Feel free to reopen if this is not the case.