implement different cut criteria for agglomerative clustering #6197

Closed
amueller opened this issue Jan 20, 2016 · 11 comments

Comments

@amueller
Member

we currently have "n_clusters" as a criterion, and a single way to cut the tree (I'm not sure what our strategy is called). scipy implements many more strategies, in particular "distance": http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster

@GaelVaroquaux do we have the "inconsistent" in this nomenclature?
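To make the difference between the two criteria concrete, here is a minimal pure-Python sketch. It assumes the merge tree is given as a `children` list plus non-decreasing merge `distances`, with merged node `i` receiving id `n_samples + i`. The function name `cut_tree`, its parameters, and the toy data are hypothetical and are not the scikit-learn or scipy API. The `<= t` comparison follows the wording of the scipy "distance" criterion.

```python
def cut_tree(n_samples, children, distances, n_clusters=None, distance_threshold=None):
    """Return one flat-cluster label per sample from a merge tree (illustrative only).

    children[i] holds the two node ids merged at step i; the merged node gets
    id n_samples + i. distances[i] is the height of that merge and is assumed
    to be non-decreasing in i.
    """
    if (n_clusters is None) == (distance_threshold is None):
        raise ValueError("give exactly one of n_clusters or distance_threshold")

    if n_clusters is not None:
        # "n_clusters" criterion: stop once the requested number of clusters remains
        n_merges = n_samples - n_clusters
    else:
        # "distance" criterion: apply every merge whose height is <= the threshold
        n_merges = sum(1 for d in distances if d <= distance_threshold)

    # parent[node] for every leaf and internal node; each node starts as its own root
    parent = list(range(2 * n_samples - 1))
    for i in range(n_merges):
        left, right = children[i]
        parent[left] = parent[right] = n_samples + i

    def find(node):
        while parent[node] != node:
            node = parent[node]
        return node

    # number the surviving roots in order of first appearance
    labels, root_to_label = [], {}
    for sample in range(n_samples):
        root = find(sample)
        labels.append(root_to_label.setdefault(root, len(root_to_label)))
    return labels


# Toy single-linkage tree for the 1-D points 0.0, 0.1, 1.0, 1.2, 5.0 (made-up data)
children = [(0, 1), (2, 3), (5, 6), (4, 7)]
distances = [0.1, 0.2, 0.9, 3.8]

labels_by_count = cut_tree(5, children, distances, n_clusters=2)
labels_by_distance = cut_tree(5, children, distances, distance_threshold=0.5)
print(labels_by_count)     # [0, 0, 0, 0, 1]
print(labels_by_distance)  # [0, 0, 1, 1, 2]
```

With a count, the caller fixes how many clusters come out. With a threshold, the number of clusters falls out of the data.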

@BiaDarkia

I can take a closer look at the current implementation and extend it to support more clustering strategies, similar to scipy.

@GaelVaroquaux
Member

> scipy implements many more strategies, in particular "distance":
> http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.fcluster.html
>
> @GaelVaroquaux do we have the "inconsistent" in this nomenclature?

I don't understand a significant fraction of the information on the above
page :$. I don't know the definitions of terms like "cophenetic
distance" or "inconsistent value".

The strategy "distance" is what people have been most asking for, and I
guess that it should be the priority for scikit-learn.

@BiaDarkia

I will look into implementing the "distance" criterion over the weekend.

@BiaDarkia

I am nearly done with a working implementation for distance, but I have one more thing I would like to check on. I will submit a pull request once I am done.

@BiaDarkia

I submitted a pull request for clustering based on cophenetic distance. When the additional arguments distance="True" and a threshold (float) are passed to AgglomerativeClustering, the data are clustered based on the cophenetic distance.

@twistedcubic

@BiaDarkia I looked over your code, and I'm wondering: how and where are you determining the cophenetic distances? I only see the distance list being used. Thanks.

@BiaDarkia

@twistedcubic: When building the tree in lines 842 to 846, I set the argument 'return_distance' to true. According to the documentation of linkage_tree, the returned distance[i] is the distance between children[i][0] and children[i][1] at the point where they are merged. If I understand the concept of cophenetic distance correctly, it is the distance at which two children are merged into a single branch, which is exactly what the returned distance list reflects.
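The relationship described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `children` and `distances` use the same layout as the lists discussed here (merged node `i` gets id `n_samples + i`), and the helper name `cophenetic_distances` and the toy tree are hypothetical, not part of the pull request or of scikit-learn.

```python
def cophenetic_distances(n_samples, children, distances):
    """Return an n_samples x n_samples matrix of cophenetic distances.

    The cophenetic distance between two samples is the height of the merge
    at which they first end up in the same cluster.
    """
    # members[node] lists the original samples underneath each live tree node
    members = {leaf: [leaf] for leaf in range(n_samples)}
    coph = [[0.0] * n_samples for _ in range(n_samples)]
    for i, (left, right) in enumerate(children):
        # every pair split across the two children first meets at this merge
        for a in members[left]:
            for b in members[right]:
                coph[a][b] = coph[b][a] = distances[i]
        members[n_samples + i] = members.pop(left) + members.pop(right)
    return coph


# Toy tree with made-up merge heights
children = [(0, 1), (2, 3), (5, 6), (4, 7)]
distances = [0.1, 0.2, 0.9, 3.8]

coph = cophenetic_distances(5, children, distances)

# Pairs whose cophenetic distance is at most 0.5 end up in the same flat cluster
threshold = 0.5
close_pairs = [(a, b) for a in range(5) for b in range(a + 1, 5)
               if coph[a][b] <= threshold]
print(close_pairs)  # [(0, 1), (2, 3)]
```

Because every entry of the matrix is one of the merge heights, thresholding the per-merge distance list gives the same grouping as thresholding the full cophenetic matrix, without ever building the matrix.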

@twistedcubic

@BiaDarkia I see, you are using the distance tree_builder returns. I will look over the code for linkage_tree more carefully.

@jnothman
Member

jnothman commented Nov 9, 2017

See also #3796, #9069

@jnothman
Member

jnothman commented Nov 9, 2017

From the MATLAB docs: "The inconsistency coefficient characterizes each link in a cluster tree by comparing its height with the average height of other links at the same level of the hierarchy. The higher the value of this coefficient, the less similar the objects connected by the link." I still think some of the terms here are confusing. The distances between clusters at the time they are merged are included when calculating this coefficient.
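As a rough illustration of that description, the sketch below computes one coefficient per link as (link height minus mean height) divided by the standard deviation, taken over the link itself and the links up to `depth` levels beneath it. It assumes the sample standard deviation, assigns 0 when there is nothing to compare against, and uses hypothetical names (`inconsistency`, `depth`) and toy data. It is not scipy's or MATLAB's code.

```python
from statistics import mean, stdev


def inconsistency(n_samples, children, distances, depth=2):
    """Return one inconsistency coefficient per link of the merge tree."""
    coeffs = []
    for k in range(len(children)):
        # gather the heights of link k and of the links below it, up to `depth` levels
        heights = []
        stack = [(n_samples + k, 1)]
        while stack:
            node, level = stack.pop()
            if node < n_samples or level > depth:
                continue  # leaves carry no height; deeper links are ignored
            link = node - n_samples
            heights.append(distances[link])
            for child in children[link]:
                stack.append((child, level + 1))
        # sample standard deviation (an assumption); undefined for a single link
        spread = stdev(heights) if len(heights) > 1 else 0.0
        if spread > 0:
            coeffs.append((distances[k] - mean(heights)) / spread)
        else:
            coeffs.append(0.0)
    return coeffs


# Toy tree with made-up merge heights
children = [(0, 1), (2, 3), (5, 6), (4, 7)]
distances = [0.1, 0.2, 0.9, 3.8]

coeffs = inconsistency(5, children, distances)
print([round(c, 4) for c in coeffs])
```

Links that join only leaves get a coefficient of 0 here, while a link that sits well above the links beneath it gets a larger value. The merge distances are the only input, which is the point made above.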

@amueller
Member Author

we have distance thresholding now


5 participants