# linkage() function mistakes distance matrix as observation vectors #2614

Closed
opened this Issue Jul 3, 2013 · 6 comments

The documentation for the scipy.cluster.hierarchy.linkage() function states that it accepts a redundant distance matrix as input:

"y : ndarray
A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array."

Accepting a redundant distance matrix as input is also described in popular books such as "SciPy and NumPy: An Overview for Developers" by Eli Bressert.

However, when a distance matrix is passed to the linkage() function, it is silently treated as a set of observation vectors. This can be seen by reading the source code:
https://github.com/scipy/scipy/blob/master/scipy/cluster/hierarchy.py

If the input is a 2-dimensional matrix, it is not checked with the predicate scipy.spatial.distance.is_valid_dm(). Instead, it is simply taken as observation vectors for hierarchical clustering. This can also be seen from the following example:

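```
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform
arr = np.array([[0, 5], [1, 5], [-1, 5], [0, -5], [1, -5], [-1, -5], [-1.1, -5]])
pd = pdist(arr)      # condensed distance matrix
dm = squareform(pd)  # redundant (square) distance matrix
linkage(pd)          # clusters from the actual pairwise distances
linkage(dm)          # result is different from the previous call
```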
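The difference can be checked numerically. The following is a minimal sketch, assuming a recent NumPy and SciPy; the variable names are chosen for illustration only. It compares the linkage of the square matrix with the linkage obtained by treating each row of that matrix as a point, and with the linkage of the condensed distances. Newer SciPy releases may emit a warning for the square input, so warnings are silenced here.

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

arr = np.array([[0, 5], [1, 5], [-1, 5], [0, -5], [1, -5], [-1, -5], [-1.1, -5]])
condensed = pdist(arr)           # flat upper triangle, as pdist returns it
square = squareform(condensed)   # redundant 7 x 7 distance matrix

# Linkage computed from the real pairwise distances.
Z_from_condensed = linkage(condensed)

# Linkage of the square matrix; any warning about the input shape is ignored.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    Z_from_square = linkage(square)

# Treat each row of the square matrix as an observation vector explicitly.
Z_rows_as_points = linkage(pdist(square))

same_as_rows = np.allclose(Z_from_square, Z_rows_as_points)
same_as_condensed = np.allclose(Z_from_square, Z_from_condensed)
```

If the behavior described in this issue holds, `same_as_rows` is true and `same_as_condensed` is false: the square matrix is clustered as seven points in seven dimensions, not as seven points with known distances.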

I think this issue can have relatively serious consequences if users do not check their clustering results carefully.

I suggest either updating the documentation of the linkage() function to reflect its actual behavior, or adding a predicate check with scipy.spatial.distance.is_valid_dm() when a two-dimensional matrix is given as input, so that a distance matrix is processed properly by linkage().
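The second suggestion could look roughly like the sketch below. It is written as a user-side wrapper, not as SciPy's implementation; `linkage_checked` is a hypothetical name, and the choice to condense any input that passes `is_valid_dm()` is an assumption. Such a check is inherently ambiguous, because a genuine observation matrix that happens to be square, symmetric, and zero on the diagonal would also be condensed.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import is_valid_dm, pdist, squareform


def linkage_checked(y, method="single", metric="euclidean"):
    """Hypothetical wrapper: condense a valid square distance matrix first.

    Anything that is not a valid redundant distance matrix is passed to
    linkage() unchanged, so condensed vectors and observation arrays keep
    their usual meaning.
    """
    y = np.asarray(y, dtype=float)
    if y.ndim == 2 and is_valid_dm(y):
        # Square, symmetric, zero diagonal: convert to the condensed form.
        y = squareform(y)
    return linkage(y, method=method, metric=metric)


arr = np.array([[0, 5], [1, 5], [-1, 5], [0, -5], [1, -5], [-1, -5], [-1.1, -5]])
dm = squareform(pdist(arr))

Z_checked = linkage_checked(dm)        # dm is condensed before clustering
Z_reference = linkage(pdist(arr))      # linkage of the true distances
Z_observations = linkage_checked(arr)  # 7 x 2 array is left as observations
```

With this wrapper, `Z_checked` should match `Z_reference`. Passing the condensed form explicitly (`squareform(dm)`) avoids the ambiguity altogether.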

referenced this issue Jul 3, 2013
Closed

#### linkage() can only take condensed distance matrix or observations #2613

referenced this issue in hongbo-zhu-cn/scipy Jul 3, 2013
 hongbo-zhu-cn `Update hierarchy.py` `af2cd07`
referenced this issue Jul 3, 2013
Closed

#### "linkage() function mistakes distance matrix as observation vectors" #2615

@hongbo-zhu-cn So how is it in the end? I am also facing this problem. It is very serious if one doesn't look carefully. Can I now feed my distance matrix in directly, or do I have to condense it first?

@perfectionming Yes, the bug is indeed serious. I just wonder how much code is still around and how many results had been published using the code :-) It is not fixed yet, I believe. So you will have to take care of your own matrix before running linkage().

@hongbo-zhu-cn LOL on "how many results had been published using the code". OK, so given the status quo, to be safe, I'd better pass the condensed version of it. Please help check whether my solution to this problem on StackOverflow (http://stackoverflow.com/questions/18952587/use-distance-matrix-in-scipy-cluster-hierarchy-linkage) is correct. Thanks! 👍 Oh, BTW, we are compatriots :)

referenced this issue Feb 27, 2014
Closed

#### scipy.cluster.hierarchy.linkage docstring is confusing #3401

I agree this is really misleading, and nearly everyone who uses this function probably passes distance matrices, which the function treats as arrays of observations.

This was referenced Feb 27, 2014

#### scipy.cluster.hierarchy.linkage() redundant distance matrix is mistreated #5508

commented May 16, 2016

I have a 20 (number of features) x 8000 (number of samples) matrix.
Do I need to calculate the distance matrix (pdist(X, 'euclidean')) and pass it to linkage?

How can I cut the dendrogram and get the distance between the clusters after the cut? How can I plot the clusters?
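One possible way to approach these questions is sketched below. It is a sketch under stated assumptions, not an answer given in this thread: the data is synthetic and smaller than 20 x 8000, the `average` method and the cut into 4 clusters are arbitrary choices, and all variable names are illustrative. Since `pdist` and `linkage` expect one observation per row, a features-by-samples matrix is transposed first.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Synthetic stand-in for the real data: 20 features, 300 samples.
X = rng.normal(size=(20, 300))

# One observation (sample) per row, one feature per column.
samples = X.T

# Either pass the observations directly, or pass the condensed distances.
Z_obs = linkage(samples, method="average", metric="euclidean")
Z_cond = linkage(pdist(samples, "euclidean"), method="average")

# Cut the tree into at most 4 flat clusters; labels start at 1.
labels = fcluster(Z_cond, t=4, criterion="maxclust")

# Column 2 of the linkage matrix holds the distance at which each merge
# happened, which is what a dendrogram draws as the merge height.
merge_heights = Z_cond[:, 2]
```

Both calls to `linkage` should give the same result here, so computing `pdist` separately is optional when the input is an observation array. For plotting, `scipy.cluster.hierarchy.dendrogram(Z_cond)` draws the tree when matplotlib is installed.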

Has this been fixed yet?

referenced this issue Jul 1, 2016
Merged

#### DOC: cluster: clarify hierarchy.linkage usage #6342

closed this in #6342 Jul 5, 2016