Clustering based on cophenetic distance added #6234

Closed
wants to merge 11 commits into from


BiaDarkia

This PR adds clustering based on cophenetic distance to agglomerative clustering
in sklearn/cluster/hierarchical.py, as defined in issue #6197.

@amueller
Member

amueller commented Oct 7, 2016

Sorry for the lack of response. Can you give some context? I'm not familiar with this metric.
Also, an example and some explanation in the user guide would be great.

@Jeltje

Jeltje commented Nov 8, 2017

@BiaDarkia I would really like to see the cophenetic distance in scikit-learn - any chance you can respond to the comment and get the PR accepted?

@jnothman
Member

jnothman commented Nov 8, 2017

Can you give some context? I'm not familiar with this metric. Also, an example and some explanation in the user guide would be great.

@Jeltje, are you able to help answer this query above?

@Jeltje

Jeltje commented Nov 9, 2017

@jnothman The cophenetic distance is a measure of how similar two objects have to be in order to be grouped into the same cluster. So if you want to cluster based on this distance, you create a distance matrix in which each original pairwise distance between two objects is replaced by the distance between their clusters at the moment those clusters are merged.

Cophenetic distances can also be used to determine the cophenetic correlation coefficient of any other clustering method. The coefficient measures how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points, so it can be used to evaluate clustering performance without knowing the ground truth. Currently only the silhouette score will do that in scikit-learn.

The current PR only calculates the distance scores, but getting the coefficient should be relatively trivial, I think.
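
For readers unfamiliar with the metric, here is a minimal sketch of both quantities described above. It uses SciPy's `linkage` and `cophenet` rather than the code in this PR. The toy data and the choice of single linkage are assumptions made purely for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist, squareform

# Toy data (an assumption for illustration): four 1-D points.
X = np.array([[0.0], [1.0], [5.0], [7.0]])

# Condensed vector of the original pairwise Euclidean distances.
original = pdist(X)

# Single-linkage merge tree; column 2 of Z holds each merge distance.
Z = linkage(original, method="single")

# When the original distances are passed, cophenet returns the
# cophenetic correlation coefficient and the condensed cophenetic distances.
coph_corr, coph_dists = cophenet(Z, original)

# Square form: entry (i, j) is the height at which i and j first
# end up in the same cluster.
coph_matrix = squareform(coph_dists)

print(coph_matrix)
print(coph_corr)
```

In this example points 0 and 1 merge at distance 1, points 2 and 3 merge at distance 2, and the two pairs merge at distance 4, so every cross-pair entry of `coph_matrix` is 4 regardless of the original distance between those points.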

@jnothman
Member

jnothman commented Nov 9, 2017

Let's set the correlation coefficient aside for now (though we have an implementation of Calinski-Harabasz as well as silhouette).

If cophenetic distances are merely the distance at cluster merge time, then I think #9069 is a simpler implementation of the same, which simply defines n_clusters as the number of merge distances above the threshold. Is this what you want? It is likely to be merged soon, imo. If the implementation needs to be more complex, can we just make it use fcluster? I've not yet understood the relationship between these implementations.
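
A rough sketch of the relationship in question, using SciPy only (this is not the code from #9069 or from this PR). It assumes a monotonic linkage such as single linkage, a hypothetical `threshold` value, and the reading that the number of clusters is one more than the number of merge distances above the threshold. Under those assumptions, counting merge distances and calling `fcluster` with `criterion="distance"` give the same partition size.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data (an assumption for illustration): four 1-D points.
X = np.array([[0.0], [1.0], [5.0], [7.0]])

# Single-linkage merge tree; merge distances here are 1, 2 and 4.
Z = linkage(X, method="single")

threshold = 3.0  # hypothetical cut height, chosen for illustration

# Every merge whose distance exceeds the threshold is left undone,
# and each undone merge leaves one additional cluster.
n_clusters = int(np.sum(Z[:, 2] > threshold)) + 1

# fcluster with criterion="distance" groups observations whose
# cophenetic distance is at most the threshold.
labels = fcluster(Z, t=threshold, criterion="distance")

print(n_clusters, labels)
```

Both approaches cut the dendrogram at the same height here, which yields the two pairs as separate clusters.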

@BiaDarkia
Author

Apologies for not replying in a timely manner. #9069 provides an elegant solution to cluster data into a specific number of clusters. Agglomerative clustering uses the same principle, i.e. clustering data into a specific number of clusters, but does not use Euclidean distance as a measure of distance. Instead the cophenetic correlation coefficient is used as a measure of distance. However, adding the cophenetic correlation coefficient as a measure of distance once #9069 is merged would provide the requested feature. If you could let me know once #9069 is merged, I could have a look into adding the cophenetic correlation coefficient as a measure of distance.

@cmarmo
Member

cmarmo commented Dec 14, 2020

Closing this PR, as it was meant to solve #6197, which has already been closed in a different way. Feel free to reopen if this is not the case.

@cmarmo cmarmo closed this Dec 14, 2020

5 participants