The documentation for the scipy.cluster.hierarchy.linkage() function states that it accepts a redundant distance matrix as input:
"y : ndarray
A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array."
Accepting a redundant distance matrix as input is also described in many popular books, such as "SciPy and NumPy: An Overview for Developers" by Eli Bressert:
However, when a redundant distance matrix is passed to linkage(), it is silently treated as a set of observation vectors. This can be seen by reading the source code:
If the input is a 2-dimensional matrix, it is not checked with the predicate scipy.spatial.distance.is_valid_dm(). Instead, it is simply taken as observation vectors for hierarchical clustering. This can also be seen from the following example:
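import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform
arr = np.array([[0,5],[1,5],[-1,5],[0,-5],[1,-5],[-1,-5], [-1.1,-5]])
pd = pdist(arr)  # condensed distance matrix
dm = squareform(pd)  # redundant (square) distance matrix
linkage(pd)  # clusters on the condensed distances
linkage(dm) # result is different from the previous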
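To make the difference concrete, here is a minimal sketch that compares the two results numerically. It assumes the default single linkage with Euclidean metric; the variable names are only illustrative. It checks that linkage(dm) matches clustering the rows of dm as 7-dimensional observation vectors, not clustering the distances themselves. Some SciPy versions emit a ClusterWarning for this kind of input, so warnings are silenced here.

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

arr = np.array([[0, 5], [1, 5], [-1, 5], [0, -5], [1, -5], [-1, -5], [-1.1, -5]])
pd = pdist(arr)          # condensed distances between the 7 points
dm = squareform(pd)      # same distances as a 7 x 7 square matrix

Z_condensed = linkage(pd)  # clusters using the given distances

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # newer SciPy may warn about square input
    Z_square = linkage(dm)           # rows of dm are treated as observations

# What linkage(dm) actually computes: distances between the ROWS of dm.
Z_rows_as_points = linkage(pdist(dm))

same_as_condensed = np.allclose(Z_square, Z_condensed)
same_as_rows = np.allclose(Z_square, Z_rows_as_points)
print("matches linkage(pd):       ", same_as_condensed)
print("matches linkage(pdist(dm)):", same_as_rows)
```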
I think this issue can have fairly serious consequences if users do not check their clustering results carefully.
I suggest either updating the documentation for linkage() to reflect the real behavior, or adding a predicate check using scipy.spatial.distance.is_valid_dm() when a two-dimensional matrix is given as input, so that a distance matrix is processed properly by linkage().
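The second option could look roughly like the sketch below. This is not SciPy code: linkage_from_distances is a hypothetical user-side wrapper, and the choice to raise ValueError on invalid input is an assumption. It only uses is_valid_dm(), is_valid_y() and squareform() from scipy.spatial.distance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import is_valid_dm, is_valid_y, pdist, squareform


def linkage_from_distances(d, method="single"):
    """Hypothetical helper: cluster from distances only, never from observations."""
    d = np.asarray(d, dtype=float)
    if d.ndim == 2:
        # Square input must be a valid distance matrix (symmetric, zero diagonal).
        if not is_valid_dm(d):
            raise ValueError("2-D input is not a valid redundant distance matrix")
        d = squareform(d, checks=False)  # convert to condensed form
    elif not is_valid_y(d):
        raise ValueError("1-D input is not a valid condensed distance matrix")
    return linkage(d, method=method)


pts = np.array([[0, 5], [1, 5], [-1, 5], [0, -5], [1, -5], [-1, -5], [-1.1, -5]])
condensed = pdist(pts)
square = squareform(condensed)

Z_from_square = linkage_from_distances(square)
Z_from_condensed = linkage(condensed)
print(np.allclose(Z_from_square, Z_from_condensed))
```

With such a wrapper, a square distance matrix and its condensed form give the same linkage, and an m by n observation array is rejected instead of being silently reinterpreted.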
@hongbo-zhu-cn So how did this turn out in the end? I am also facing this problem. It is very serious if one doesn't look carefully. Can I now feed my distance matrix in directly, or do I have to make it condensed first?
@perfectionming Yes, the bug is indeed serious. I just wonder how much code is still around and how many results have been published using the code :-) It is not yet fixed, I believe. So you will have to take care of your own matrix before running linkage().
@hongbo-zhu-cn LOL on "how many results have been published using the code". OK, so given the status quo, to be safe, I had better pass the condensed version of it. Please help check whether my solution to this problem on StackOverflow (http://stackoverflow.com/questions/18952587/use-distance-matrix-in-scipy-cluster-hierarchy-linkage) is correct. Thanks! 👍 Oh, BTW, we are compatriots :)
I agree this is really misleading, and that nearly everyone who uses this function probably passes distance matrices, which the function then treats as arrays of observations.
I have a 20 (number of features) x 8000 (number of samples) matrix.
Will I need to calculate the distance matrix (pdist(X, 'euclidean')) and pass it to linkage?
How can I cut the dendrogram and get the distance between the clusters after the cut? How can I plot the clusters?
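For the questions above, a minimal sketch under some assumptions: random stand-in data with 200 samples instead of 8000, average linkage, and a cut into k = 4 clusters (all three are illustrative choices, not recommendations). Since linkage() expects one row per observation, a features x samples matrix has to be transposed first. Passing the observations directly is equivalent to passing pdist(obs, 'euclidean'); for 8000 samples the condensed matrix has 31,996,000 entries, about 256 MB as float64.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 200))  # stand-in data: 20 features x 200 samples
obs = X.T                       # one row per sample, as linkage() expects

# Equivalent to linkage(pdist(obs, "euclidean"), method="average").
Z = linkage(obs, method="average", metric="euclidean")

k = 4  # assumed number of clusters to cut into
labels = fcluster(Z, t=k, criterion="maxclust")  # cluster id for every sample

# The last k - 1 rows of Z are the merges that join the k clusters;
# column 2 holds the linkage distance at which each merge happens.
merge_heights = Z[-(k - 1):, 2]

# Dendrogram layout without drawing; drop no_plot=True to plot (needs matplotlib).
tree = dendrogram(Z, no_plot=True, truncate_mode="lastp", p=k)

print(np.bincount(labels)[1:])  # cluster sizes
print(merge_heights)
```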
Has this been fixed yet?