linkage() function mistakes distance matrix as observation vectors #2614

Open
hongbo-zhu-cn opened this Issue · 4 comments

2 participants

@hongbo-zhu-cn

The documentation for the scipy.cluster.hierarchy.linkage() function states that it accepts a redundant distance matrix as input:

http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html
"y : ndarray
A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array."

Accepting a redundant distance matrix as input is also described in popular books such as "SciPy and NumPy: An Overview for Developers" by Eli Bressert:
http://books.google.co.uk/books?id=c-xzkDMDev0C&pg=PA48&dq=scipy+linkage+input&hl=en&sa=X&ei=4BzUUZqpNoSbPaj3gLgN&ved=0CDgQ6AEwAQ#v=onepage&q=scipy%20linkage%20input&f=false

However, when a redundant distance matrix is passed to linkage(), it is silently treated as a set of observation vectors. This can be seen by reading the source code:
https://github.com/scipy/scipy/blob/master/scipy/cluster/hierarchy.py

If the input is a 2-dimensional array, it is not checked with the predicate scipy.spatial.distance.is_valid_dm(). Instead, it is simply taken as observation vectors for hierarchical clustering. The following example also shows this:

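import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

arr = np.array([[0, 5], [1, 5], [-1, 5], [0, -5], [1, -5], [-1, -5], [-1.1, -5]])
pd = pdist(arr)      # condensed distance matrix
dm = squareform(pd)  # redundant (square) distance matrix
linkage(pd)
linkage(dm)  # result is different from the previous one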
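To make the difference concrete, here is a minimal sketch that compares the two results numerically. It assumes a reasonably recent SciPy and NumPy; the variable names are mine, and the warning filter is there because, as far as I know, newer SciPy releases warn when a 2-D input looks like a distance matrix. The last call treats the rows of the square matrix explicitly as observation vectors, which is what linkage() appears to do with any 2-D input.

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

arr = np.array([[0, 5], [1, 5], [-1, 5], [0, -5], [1, -5], [-1, -5], [-1.1, -5]])
condensed = pdist(arr)          # flat array of pairwise distances
square = squareform(condensed)  # redundant 7 x 7 distance matrix

with warnings.catch_warnings():
    # Assumption: some SciPy versions warn about the square input; silence it here.
    warnings.simplefilter("ignore")
    from_condensed = linkage(condensed)
    from_square = linkage(square)
    # Explicitly treat each row of the square matrix as an observation vector.
    from_rows = linkage(pdist(square))

same_as_condensed = (from_condensed.shape == from_square.shape
                     and np.allclose(from_condensed, from_square))
same_as_rows = np.allclose(from_square, from_rows)
print("square input matches condensed input:", same_as_condensed)
print("square input matches rows-as-observations:", same_as_rows)
```

If the second comparison holds and the first does not, the square matrix was clustered as observations rather than as distances.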

I think this issue can have relatively serious consequences if users do not check their clustering results carefully.

I suggest either updating the documentation for linkage() to reflect the actual behavior, or adding a check with scipy.spatial.distance.is_valid_dm() when a two-dimensional array is given as input, so that a distance matrix is processed properly by linkage().
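Until something like that lands, the second suggestion can be approximated on the user side. The sketch below is only an illustration of the idea, not a proposed patch: linkage_from_distances is a hypothetical helper name, and it deliberately refuses observation vectors instead of guessing what a 2-D input means.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import is_valid_dm, pdist, squareform


def linkage_from_distances(y, method="single"):
    """Hypothetical helper: cluster from distances only, never from observations."""
    y = np.asarray(y, dtype=float)
    if y.ndim == 2:
        # Accept only a square, symmetric matrix with a zero diagonal.
        if y.shape[0] != y.shape[1] or not is_valid_dm(y):
            raise ValueError("expected a square, symmetric distance matrix")
        y = squareform(y)  # square form -> condensed form
    return linkage(y, method=method)


points = np.array([[0, 5], [1, 5], [-1, 5], [0, -5], [1, -5], [-1, -5], [-1.1, -5]])
condensed = pdist(points)
# Both forms of the same distances now give the same linkage matrix.
print(np.allclose(linkage_from_distances(squareform(condensed)),
                  linkage_from_distances(condensed)))
```

Raising on anything that is not a valid distance matrix is a design choice for this sketch: it trades the convenience of passing observations for the guarantee that a 2-D input is never silently reinterpreted.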

@hongbo-zhu-cn hongbo-zhu-cn referenced this issue from a commit in hongbo-zhu-cn/scipy
@hongbo-zhu-cn hongbo-zhu-cn Update hierarchy.py af2cd07
@ghost

@hongbo-zhu-cn So how did this turn out in the end? I am also facing this problem. It is very serious if one doesn't look carefully. For now, can I feed my distance matrix in directly, or do I have to condense it first?

@hongbo-zhu-cn

@perfectionming Yes, the bug is indeed serious. I just wonder how much code is still around and how many results have been published using it :-) It is not yet fixed, I believe, so you will have to take care of your own matrix before running linkage().

@ghost

@hongbo-zhu-cn LOL on "how many results had been published using the code". OK, so given the status quo, to be safe, I had better pass the condensed version. Please help check whether my solution to this problem on StackOverflow (http://stackoverflow.com/questions/18952587/use-distance-matrix-in-scipy-cluster-hierarchy-linkage) is correct. Thanks! :+1: Oh, BTW, we are compatriots :)
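For reference, a minimal sketch of what "the condensed version" means in practice, using made-up example points. squareform() converts in both directions, and the condensed form should be the strict upper triangle of the square matrix read row by row, which the snippet checks with NumPy indexing.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])  # made-up example data
square = squareform(pdist(points))  # 3 x 3 redundant distance matrix

condensed = squareform(square)      # square form -> flat condensed form
# Strict upper triangle (k=1 skips the zero diagonal), read row by row.
upper = square[np.triu_indices(square.shape[0], k=1)]

print(condensed)
print(np.array_equal(condensed, upper))
```

The condensed array is what can be handed to linkage() without the input being reinterpreted as observations.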

@argriffing
Collaborator

I agree this is really misleading, and that nearly everyone who uses this function probably passes distance matrices, which the function then treats as arrays of observations.
