linkage() function mistakes distance matrix as observation vectors #2614

hongbo-zhu-cn opened this Issue Jul 3, 2013 · 6 comments


The documentation for the scipy.cluster.hierarchy.linkage() function states that it accepts a redundant distance matrix as input:
"y : ndarray
A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array."

This feature of accepting a redundant distance matrix as input is also described in popular books such as "SciPy and NumPy: An Overview for Developers" by Eli Bressert.

However, when a distance matrix is passed to linkage(), it is silently treated as a set of observation vectors. This can be seen by reading the source code.

If the input is a 2-dimensional array, it is not checked with the predicate scipy.spatial.distance.is_valid_dm(). Instead, it is simply taken as observation vectors for hierarchical clustering. This can also be seen from the following example:

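import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform
arr = np.array([[0,5],[1,5],[-1,5],[0,-5],[1,-5],[-1,-5], [-1.1,-5]])
pd = pdist(arr)      # condensed distance matrix
linkage(pd)          # clusters using the condensed distances
dm = squareform(pd)  # redundant (square) distance matrix
linkage(dm)  # result is different from the previous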

I think this issue can have relatively serious consequences if users do not check their clustering results carefully.

I suggest either updating the documentation for linkage() to reflect the actual behavior, or adding a check with scipy.spatial.distance.is_valid_dm() when a two-dimensional array is given as input, so that a distance matrix is processed properly by linkage().
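The second suggestion could look roughly like the sketch below. The wrapper name `linkage_from_distances` is hypothetical and is not part of SciPy; it only illustrates checking a square input with is_valid_dm() and condensing it before calling linkage(). It assumes that any square array passing the check is meant as a distance matrix, which will not hold for every data set.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import is_valid_dm, pdist, squareform


def linkage_from_distances(y, method="single"):
    """Hypothetical helper (not a SciPy function).

    If y is a square array that passes is_valid_dm(), condense it first so
    that linkage() interprets it as distances rather than observations.
    """
    y = np.asarray(y, dtype=float)
    if y.ndim == 2 and y.shape[0] == y.shape[1] and is_valid_dm(y):
        y = squareform(y)  # square (redundant) form -> condensed form
    return linkage(y, method=method)


arr = np.array([[0, 5], [1, 5], [-1, 5], [0, -5], [1, -5], [-1, -5], [-1.1, -5]])
pd = pdist(arr)
dm = squareform(pd)

Z_condensed = linkage(pd)               # reference result from condensed input
Z_wrapped = linkage_from_distances(dm)  # square input, condensed by the helper
same = np.allclose(Z_condensed, Z_wrapped)
```

With this check in place, the square and condensed forms of the same distances should produce the same linkage matrix.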

@hongbo-zhu-cn hongbo-zhu-cn referenced this issue in hongbo-zhu-cn/scipy Jul 3, 2013
@hongbo-zhu-cn hongbo-zhu-cn Update af2cd07

@hongbo-zhu-cn So how did it turn out in the end? I am also facing this problem. It is very serious if one doesn't look carefully. Can I now feed my distance matrix in directly, or do I have to make it condensed first?


@perfectionming Yes, the bug is indeed serious. I just wonder how much code is still around and how many results have been published using the code :-) It is not yet fixed, I believe. So you will have to take care of your own matrix before running linkage().
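As a minimal sketch of "taking care of your own matrix", assuming you already hold a symmetric distance matrix with a zero diagonal: squareform() converts it to the condensed form, which is the flattened upper triangle that linkage() interprets as distances. The matrix `D` and the choice of method="average" are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Illustrative 4x4 symmetric distance matrix with a zero diagonal.
D = np.array([
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 4.0],
    [4.0, 3.0, 0.0, 1.5],
    [5.0, 4.0, 1.5, 0.0],
])

condensed = squareform(D)            # 1-D array of length n*(n-1)/2
upper = D[np.triu_indices(4, k=1)]   # the same entries, read row by row

# Passing the 1-D condensed array makes linkage() treat the values as distances.
Z = linkage(condensed, method="average")
```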


@hongbo-zhu-cn LOL on "how many results had been published using the code". OK, so given the status quo, to be safe, I had better pass the condensed version of it. Please help check whether my solution to this problem on StackOverflow is correct. Thanks! 👍 Oh, BTW, we are compatriots :)


I agree this is really misleading and that nearly everyone who uses this function probably uses distance matrices which the function treats as arrays of observations.

Arnold1 commented May 16, 2016 edited

I have a 20 (number of features) x 8000 (number of samples) matrix.
Will I need to calculate the distance matrix (pdist(X, 'euclidean')) and pass it to linkage?

How can I cut the dendrogram and get the distance between the clusters after the cut? How can I plot the clusters?
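One possible approach, sketched under assumptions: pdist() expects one observation per row, so a features x samples matrix would be transposed first. The random data, the "average" method, and the cut into at most 3 clusters are placeholders, not recommendations. fcluster() is one way to cut the tree into flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Stand-in data: 20 features x 50 samples (a small substitute for 20 x 8000).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))

samples = X.T                            # rows are now observations (samples)
condensed = pdist(samples, "euclidean")  # condensed pairwise distances
Z = linkage(condensed, method="average")

# Passing the observation matrix directly is expected to give the same tree,
# because linkage() defaults to Euclidean distances for 2-D input.
Z_direct = linkage(samples, method="average")

# Cut the tree into at most 3 flat clusters; labels[i] is the cluster of sample i.
labels = fcluster(Z, t=3, criterion="maxclust")
```

The third column of `Z` holds the distance at which each pair of clusters was merged, which is one place to read off distances between clusters around a cut.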


Has this been fixed yet?

@rgommers rgommers closed this in #6342 Jul 5, 2016