Clustering based on cophenetic distance added #6234

BiaDarkia · 2016-01-26T13:10:24Z

In sklearn.cluster.hierarchical.py clustering based on cophenetic
distance was added to Agglomerative Clustering as defined in issue #6197

amueller · 2016-10-07T22:36:38Z

Sorry for the lack of response. Can you give some context? I'm not familiar with this metric.
Also, an example and some explanation in the user guide would be great.

Jeltje · 2017-11-08T19:35:13Z

@BiaDarkia I would really like to see the cophenetic distance in scikit-learn - any chance you can respond to the comment and get the PR accepted?

jnothman · 2017-11-08T22:28:12Z

> Can you give some context? I'm not familiar with this metric. Also, an example and some explanation in the user guide would be great.

@Jeltje, are you able to help answer this query above?

Jeltje · 2017-11-09T00:32:40Z

@jnothman The cophenetic distance is a measure of how similar two objects have to be in order to be grouped into the same cluster. So if you want to cluster based on this distance, you create a distance matrix in which the original pairwise distances between the objects are replaced by the computed distances between their clusters at the time of these clusters' merge.
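For illustration, the replacement of pairwise distances by merge-time distances might look roughly like this. This is a minimal sketch using SciPy's `linkage` and `cophenet` rather than the code in this PR; the toy data and the choice of single linkage are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

# Toy data (assumed for illustration): two tight pairs of points on a line.
X = np.array([[0.0], [1.0], [10.0], [12.0]])

# Original pairwise Euclidean distances, as a square matrix.
original = squareform(pdist(X))

# Build the hierarchy; single linkage is an arbitrary choice here.
Z = linkage(X, method="single")

# cophenet(Z) returns the condensed cophenetic distances: for each pair of
# points, the linkage distance at which they first end up in the same cluster.
cophenetic = squareform(cophenet(Z))

print(original)
print(cophenetic)
```

In the cophenetic matrix every pair that is only joined by the final merge shares that merge's distance, whatever the original distance between the two points was.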

Cophenetic distances can also be used to determine the cophenetic correlation coefficient of any other clustering method. It is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points, and so cophenetic correlation can be used to evaluate clustering performance without knowing the ground truth. Currently only the silhouette score will do that in scikit-learn.
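As a rough sketch of that idea, SciPy's `cophenet` returns the coefficient when it is given the original condensed distances along with the linkage matrix. The synthetic data, the random seed and the set of linkage methods compared below are assumptions for the example and are not part of this PR.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Synthetic data (assumed for illustration): two Gaussian blobs in 2D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])

# Condensed vector of original pairwise distances.
Y = pdist(X)

scores = {}
for method in ("single", "average", "complete"):
    Z = linkage(Y, method=method)
    # With Y supplied, cophenet returns (coefficient, condensed cophenetic
    # distances). The coefficient is the Pearson correlation between Y and
    # the cophenetic distances.
    c, coph_dists = cophenet(Z, Y)
    scores[method] = c

for method, c in scores.items():
    print(f"{method}: {c:.3f}")
```

No labels are needed: the score only compares the dendrogram's merge heights with the input distances, so it can be used to compare linkage methods on the same data.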

The current PR only calculates the distance scores, but getting the coefficient should be relatively trivial, I think.

jnothman · 2017-11-09T02:50:28Z

Let's set the correlation coefficient aside for now (though we have an implementation of Calinski-Harabasz as well as silhouette).

If cophenetic distances are merely the distance at cluster merge time, then I think #9069 is a simpler implementation of the same, which simply defines n_clusters as the number of merge distances above the threshold. Is this what you want? It is likely to be merged soon, imo. If the implementation needs to be more complex, can we just make it use fcluster? I've not yet understood the relationship between these implementations.
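To make the comparison concrete, here is a small sketch of the two approaches using SciPy only. It does not reproduce the code in #9069 or in this PR; the toy data and the threshold value are assumptions, and the `+ 1` reflects that a cut which undoes k merges leaves k + 1 clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (assumed for illustration): two tight pairs of points on a line.
X = np.array([[0.0], [1.0], [10.0], [12.0]])
Z = linkage(X, method="single")

threshold = 5.0  # hypothetical cut height

# Approach 1: let fcluster cut the tree so that points in the same flat
# cluster have a cophenetic distance no greater than the threshold.
labels = fcluster(Z, t=threshold, criterion="distance")

# Approach 2: derive the number of clusters from the merge distances, which
# are stored in the third column of the linkage matrix.
merge_distances = Z[:, 2]
n_clusters = int(np.count_nonzero(merge_distances > threshold)) + 1

print(labels, n_clusters)
```

On this example both approaches agree on the number of clusters, which is consistent with the suggestion that thresholding the merge distances gives the same result as clustering on cophenetic distance.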

BiaDarkia · 2018-05-12T14:51:20Z

Apologies for not replying in a timely manner. #9069 provides an elegant solution to cluster data into a specific number of clusters. Agglomerative clustering uses the same principle, i.e. clustering data into a specific number of clusters, but does not use Euclidean distance as a measure of distance. Instead, the cophenetic correlation coefficient is used as a measure of distance. However, adding the cophenetic correlation coefficient as a measure of distance once #9069 is merged would provide the requested feature. If you could let me know once #9069 is merged, I could have a look into adding the cophenetic correlation coefficient as a measure of distance.

cmarmo · 2020-12-14T20:49:07Z

Closing this PR as it was meant to solve #6197 already closed in a different way. Feel free to reopen if this is not the case.

BiaDarkia added 11 commits January 26, 2016 22:06

Clustering based on cophenetic distance added

fa287a2

In sklearn.cluster.hierarchical.py clustering based on cophenetic distance was added to Agglomerative Clustering as defined in issue #6197

Fixed syntax error for pull request

636be86

Fixed syntax for pull request

b20c4ff

Fixed syntax for pull request

e251eeb

Added test case for clustering based on cophenetic distance

84d8355

Fixed missing import for test case

45dacbd

Fixed ValueError and test case

2b82ca0

Fixed syntax error for pull request

2ebbf16

Removed test case

47d3e29

Added corrected version of test case

bba2af8

Added updated version of test case

c5db218

amueller added the Waiting for Reviewer label Oct 7, 2016

github-actions bot added the module:cluster label Mar 2, 2020

cmarmo removed the Waiting for Reviewer label Dec 14, 2020

cmarmo closed this Dec 14, 2020
