implement different cut criteria for agglomerative clustering #6197
Comments
I can take a closer look at the current implementation and expand it to support more clustering strategies, similar to scipy. |
I don't understand a significant fraction of the information on the above The strategy "distance" is what people have been most asking for, and I |
I will look into implementing distance at the weekend. |
I am nearly done with a working implementation for distance, but I have one more thing I would like to check on. I will submit a pull request once I am done. |
I submitted a pull request for clustering based on cophenetic distance. When the additional arguments distance="True" and a threshold (float) are passed to Agglomerative Clustering, the data is clustered based on the cophenetic distance. |
@BiaDarkia I looked over your code, and I'm wondering: how and where are you determining the cophenetic distances? I only see the distance list being used. Thanks. |
@twistedcubic: When building the tree in lines 842 to 846, I set the argument 'return_distance' to true. According to the documentation of linkage_tree, the returned distance[i] is the distance between children[i][0] and children[i][1] at which they are merged. If I understand the concept of cophenetic distance correctly, it is the distance at which two children are merged into a single branch, which is what the returned list of distances reflects. |
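To make the role of the merge distances concrete, here is a minimal sketch of cutting a tree at a distance threshold. It assumes the tree is given as `children` and `distances` in the layout described above (merge i joins children[i][0] and children[i][1] at distances[i], and the merged node gets the id n_samples + i). The helper name `cut_by_distance` is hypothetical and is not part of scikit-learn or of the pull request. It also assumes that merge distances never decrease going up the tree.

```python
def cut_by_distance(children, distances, n_samples, threshold):
    """Flat cluster labels from applying every merge with distance <= threshold.

    Hypothetical helper for illustration only. Assumes merge distances are
    monotone, i.e. a parent never merges at a smaller distance than its children.
    """
    parent = list(range(n_samples))  # union-find forest over the leaves

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # rep[node] is one leaf contained in `node`; internal nodes are appended
    # in merge order, so node n_samples + i ends up at index n_samples + i.
    rep = list(range(n_samples))
    for (a, b), d in zip(children, distances):
        if d <= threshold:
            parent[find(rep[a])] = find(rep[b])
        rep.append(rep[a])

    # Relabel the union-find roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n_samples)]


# Toy tree over 4 samples: (0, 1) merge at 1.0, (2, 3) at 2.0, then both at 5.0.
children = [(0, 1), (2, 3), (4, 5)]
distances = [1.0, 2.0, 5.0]
labels = cut_by_distance(children, distances, n_samples=4, threshold=3.0)
```

Whether the comparison is `<=` or `<` is a convention choice; the sketch treats a merge exactly at the threshold as applied.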
@BiaDarkia I see, you are using the distance tree_builder returns. I will look over the code for linkage_tree more carefully. |
From the MATLAB docs: "The inconsistency coefficient characterizes each link in a cluster tree by comparing its height with the average height of other links at the same level of the hierarchy. The higher the value of this coefficient, the less similar the objects connected by the link." I still think some of the terms here are confusing. The distances between clusters at the time they are merged are included when calculating this coefficient. |
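For reference, scipy exposes this statistic as `scipy.cluster.hierarchy.inconsistent`. The sketch below runs it on a small made-up dataset; the data, the linkage method, and the depth `d=2` are illustrative assumptions, not something taken from this thread.

```python
import numpy as np
from scipy.cluster.hierarchy import inconsistent, linkage

# Five 2-D points: two tight pairs and one far-away point (made-up data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0], [50.0, 0.0]])
Z = linkage(X, method="single")

# One row per merge. Columns: mean link height, standard deviation of link
# heights, number of links included, and the inconsistency coefficient.
# d=2 means each link is compared with the links directly below it.
R = inconsistent(Z, d=2)
```

Each coefficient is the link's own height minus the mean, divided by the standard deviation, so the merge heights (the distances between clusters when merged) are exactly what goes into it.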
we have distance thresholding now |
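A minimal usage sketch, assuming a scikit-learn release in which `AgglomerativeClustering` accepts the `distance_threshold` parameter (with `n_clusters=None`). The data and the threshold value are made up for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two tight pairs of points, 10 units apart (made-up data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

# n_clusters must be None when a distance threshold is given; merges whose
# linkage distance reaches the threshold are not performed.
model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=5.0, linkage="single"
)
labels = model.fit_predict(X)
n_found = model.n_clusters_  # number of clusters implied by the threshold
```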
we currently have "n_clusters" as a criterion, and a single way to cut the tree (I'm not sure what our strategy is called). scipy implements many more strategies, in particular "distance": http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster
@GaelVaroquaux do we have the "inconsistent" in this nomenclature?
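For comparison, here is a small sketch of how scipy's `fcluster` exposes these cut strategies through its `criterion` argument. The data and threshold values are made up for illustration; "maxclust" is shown as the closest scipy analogue of cutting by a number of clusters, which is an assumption about the mapping rather than something confirmed in this thread.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two tight pairs and one far-away point (made-up data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0], [50.0, 0.0]])
Z = linkage(X, method="single")

# Cut so that at most 2 flat clusters remain.
by_count = fcluster(Z, t=2, criterion="maxclust")

# Cut so that points within a cluster have cophenetic distance <= 5.0.
by_distance = fcluster(Z, t=5.0, criterion="distance")

# Cut on the inconsistency coefficient (depth 2 is scipy's default).
by_inconsistency = fcluster(Z, t=0.5, criterion="inconsistent", depth=2)
```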