linkage() function mistakes distance matrix as observation vectors #2614

Closed
hongbo-zhu-cn opened this Issue Jul 3, 2013 · 6 comments

4 participants

@hongbo-zhu-cn

The documentation for the scipy.cluster.hierarchy.linkage() function states that it accepts a redundant distance matrix as input:

http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html
"y : ndarray
A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array."

This ability to accept a redundant distance matrix as input is also described in popular books such as "SciPy and NumPy: An Overview for Developers" by Eli Bressert:
http://books.google.co.uk/books?id=c-xzkDMDev0C&pg=PA48&dq=scipy+linkage+input&hl=en&sa=X&ei=4BzUUZqpNoSbPaj3gLgN&ved=0CDgQ6AEwAQ#v=onepage&q=scipy%20linkage%20input&f=false

However, when a distance matrix is passed to the linkage() function, it is silently treated as observation vectors. This can be seen by reading the source code:
https://github.com/scipy/scipy/blob/master/scipy/cluster/hierarchy.py

If the input is a 2-dimensional matrix, it is not checked with the predicate scipy.spatial.distance.is_valid_dm(). Instead, it is simply taken as observation vectors for hierarchical clustering. This can also be seen from the following example:

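import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

arr = np.array([[0,5],[1,5],[-1,5],[0,-5],[1,-5],[-1,-5], [-1.1,-5]])
pd = pdist(arr)      # condensed distance matrix
dm = squareform(pd)  # redundant (square) distance matrix
linkage(pd)
linkage(dm)  # result is different from the previous one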
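
One way to confirm what linkage(dm) actually computes is to compare it with an explicit pdist() over the rows of the square matrix. The following is a minimal sketch, assuming the behavior described above (a 2-D input is passed through pdist() with the default Euclidean metric); the variable names are illustrative only:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

points = np.array([[0, 5], [1, 5], [-1, 5], [0, -5], [1, -5], [-1, -5], [-1.1, -5]])
condensed = pdist(points)       # condensed distances between the 7 points
square = squareform(condensed)  # redundant 7 x 7 distance matrix

Z_condensed = linkage(condensed)  # clusters the original points
Z_square = linkage(square)        # 2-D input: rows are treated as observations
Z_rows = linkage(pdist(square))   # explicit version of the line above

# Compare the three linkage matrices (all have shape (6, 4)).
same_as_rows = np.allclose(Z_square, Z_rows)
same_as_condensed = np.allclose(Z_square, Z_condensed)
print(same_as_rows, same_as_condensed)
```

If the first comparison holds and the second does not, the square matrix was clustered as seven 7-dimensional observations rather than as pairwise distances. Newer SciPy releases may also emit a warning for the linkage(square) call.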

I think this issue has relatively serious consequences if users do not check their clustering results carefully.

I suggest either updating the documentation for linkage() to reflect the actual behavior, or adding a predicate check using scipy.spatial.distance.is_valid_dm() when a two-dimensional matrix is given as input, so that a distance matrix is processed properly by linkage().
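
Until something like the second option exists, the check can be done on the caller's side. Below is a rough sketch of that idea, not SciPy's own implementation: linkage_from_square is a hypothetical helper name, and it assumes the input is meant to be a square distance matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import is_valid_dm, pdist, squareform


def linkage_from_square(dm, method="single"):
    """Hypothetical helper: cluster from a square (redundant) distance matrix."""
    dm = np.asarray(dm, dtype=float)
    # is_valid_dm() checks for a square, symmetric matrix with a zero diagonal.
    if not is_valid_dm(dm):
        raise ValueError("expected a symmetric distance matrix with a zero diagonal")
    # squareform() converts the square matrix to the condensed form,
    # which linkage() interprets as distances rather than observations.
    return linkage(squareform(dm), method=method)


arr = np.array([[0, 5], [1, 5], [-1, 5], [0, -5], [1, -5], [-1, -5], [-1.1, -5]])
Z_from_square = linkage_from_square(squareform(pdist(arr)))
Z_from_condensed = linkage(pdist(arr))
print(np.allclose(Z_from_square, Z_from_condensed))
```

Raising on an invalid matrix, rather than falling back to treating the input as observations, is a deliberate choice in this sketch so that the ambiguity never passes silently.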

@hongbo-zhu-cn hongbo-zhu-cn referenced this issue in hongbo-zhu-cn/scipy Jul 3, 2013
@hongbo-zhu-cn hongbo-zhu-cn Update hierarchy.py af2cd07
@ghost

@hongbo-zhu-cn So how did it turn out in the end? I am also facing this problem. It is very serious if one doesn't look carefully. Can I now feed my distance matrix in directly, or do I have to condense it first?

@hongbo-zhu-cn

@perfectionming Yes, the bug is indeed serious. I just wonder how much code is still around and how many results have been published using the code :-) It is not fixed yet, I believe. So you will have to take care of your own matrix before running linkage().

@ghost

@hongbo-zhu-cn LOL on "how many results had been published using the code". OK, so given the status quo, to be safe, I had better pass the condensed version of it. Please help check whether my solution to this problem on StackOverflow (http://stackoverflow.com/questions/18952587/use-distance-matrix-in-scipy-cluster-hierarchy-linkage) is correct. Thanks! 👍 Oh, BTW, we are compatriots :)

@argriffing

I agree this is really misleading, and that nearly everyone who uses this function probably passes distance matrices, which the function treats as arrays of observations.

@Arnold1
Arnold1 commented May 16, 2016 edited

I have a 20 (number of features) x 8000 (number of samples) matrix.
Will I need to calculate the distance matrix (pdist(X, 'euclidean')) and pass it to linkage?

How can I cut the dendrogram and get the distance between the clusters after the cut? How can I plot the clusters?

@jolespin

Has this been fixed yet?

@rgommers rgommers closed this in #6342 Jul 5, 2016