Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

"linkage() function mistakes distance matrix as observation vectors" #2615

Open
wants to merge 2 commits into from

9 participants

@hongbo-zhu-cn

this is to handle issue # 2614 ( scipy#2614 )
"linkage() function mistakes distance matrix as observation vectors"
If input to linkage() is a 2-dimensional matrix, it is first checked by predicate is_valid_dm(). If it is a distance matrix, it is converted to condensed vector. Otherwise it is considered to be observation vectors and condensed distance matrix is calculated using pdist() function.

@hongbo-zhu-cn

this is to handle issue # 2614 ( scipy#2614 )
"linkage() function mistakes distance matrix as observation vectors"
If input to linkage() is a 2-dimensional matrix, it is first checked by predicate is_valid_dm(). If it is a distance matrix, it is converted to condensed vector. Otherwise it is considered to be observation vectors and condensed distance matrix is calculated using pdist() function.

@rgommers
Owner

Hmm, looks like this has the potential to silently return different results for existing code. I'm not a user so can't judge what the best solution is here. @jakevdp any opinion on this?

@jakevdp
Collaborator

I haven't used this much either, but I agree with @rgommers's concerns. In particular, the is_valid_dm function returns False if the matrix is not of type double. This might cause confusion if someone were to pass a float32 array, expecting it to be interpreted as a distance matrix. Also, there exist corner cases where a set of vectors might look like a distance matrix (however contrived that might be), and without a way to specifically flag the user's intent, it's hard to guess what the algorithm should do.

My first preference would be to keep the current behavior, but change the documentation to be reflective of the algorithm. It's clean, it involves no guesswork, and it won't break any legacy code.

If you really feel that the user should have the ability to pass the full distance matrix, then I would prefer adding a boolean flag indicating that this is the case, so the algorithm doesn't have to guess.

@hongbo-zhu-cn

I agree that there can be cases when a set of observation vectors look like a distance matrix. Yes, in that case the code will lead to bigger confusion.

My opinion is to keep the documentation and update the code accordingly. This is because
a) it makes linkage() function more convenient;
b) users of the function linkage() have learned that they can use distance matrix as input (through the documentation and other books). Removing this option will break their code. Also, the code still needs to guess if the input is a distance matrix so it can reject it.

c) it is just intuitive to use distance matrix as input.

Adding a boolean flag is a good idea. It will break users code, too (less dramatic than rejecting distance matrix). But, it is better than leaving a bug in the code.

@WarrenWeckesser
Collaborator

I agree with @jakevdp, for exactly the reasons he gives.

@rgommers
Owner

@hongbo-zhu-cn the conclusion from the above discussion is that distance matrix input can be accepted if you add a new keyword argument for that (with default False so it doesn't change return values of existing code). Could you please do this? It's also necessary to add a unit test that covers distance matrix input.

@hongbo-zhu-cn hongbo-zhu-cn Add bool redundant_dm to indicate the cases when redundant distance m…
…atrix is used as input

Add bool redundant_dm to indicate the cases when redundant distance matrix is used as input. In such cases, the redundant distance matrices are converted to condensed form using squareform.
fb1d365
@coveralls

Coverage Status

Changes Unknown when pulling fb1d365 on hongbo-zhu-cn:master into ** on scipy:master**.

@hongbo-zhu-cn

@coveralls
I do not understand your comment. I checked the master code and it is not merged there. The pull request is still Open. Can I do any further improvement to have the bug fixed?

@jakevdp
Collaborator

@hongbo-zhu-cn - coveralls is an automatic process which checks how a PR will affect test coverage

@hongbo-zhu-cn

@jakevdp oh, that's awkward :) Thank you for the explanation.
So I guess the issue is still open and more buggy clustering results are generated ... (hopefully not).

@pv pv added the PR label
@rgwlawson

Hi. I am a new user to Python and to SciPy (and recently installed Python 2.7 in the Anaconda environment). Given that I only recently installed Python 2.7 and I too noticed (and was confused) by this issue would it be fair to say that this issue (ie whether the appropriate input in linkage is a distance matrix or an observation matrix) has still not been resolved? If so, who should I write to in order to ask it get resolved (I am new to GitHub too, so don't understand how this works either)? Many thanks for your help

PS The results from my cluster analysis using a distance matrix that I had precalculated (rather than the observation matrix) actually resulted in a dendogram that clustered exactly as I wanted/expected ... very weird (and scary) indeed.

@richardtsai

@rgwlawson The reason why you got a dendrogram that looked like what you wanted is that linkage regarded your distance matrix as an observation matrix. And a distance matrix is in fact a transformation of the original observation vectors from the M-dimensional space to the N-dimensional space. It is expected but not the correct way. If you check the resulting linkage matrices, you will find that their third columns are different.

@rgwlawson

Thanks Richard. I thought it was something like that ... ie if two rows in the underlying observation matrix are highly correlated I was thinking this would show up in the correlation matrix too and hence in the clusters. Interestingly when I correctly put the condensed matrix into linkage (rather than the distance matrix) the clustering did improve marginally.

Do you happen to know if there is a new version of the code with the boolean switch to tag when the input is a distance matrix instead of the underlying observation matrix?

Thanks

@richardtsai

Both this PR and #3406 are working on this issue. @argriffing @hongbo-zhu-cn any opinions?

@argriffing
Collaborator

@richardtsai my opinion is that the well-intentioned permissiveness of the clustering interface is causing bugs in user code, and that it should be changed to be made more explicit.

@hongbo-zhu-cn

@richardtsai I think the conclusion had been reached (see @rgommers' comment on 22 Sep 2013). A fix had been proposed and a pull request is sent.

@richardtsai richardtsai commented on the diff
scipy/cluster/hierarchy.py
@@ -640,7 +642,8 @@ def linkage(y, method='single', metric='euclidean'):
if method not in _cpy_linkage_methods:
raise ValueError('Invalid method: %s' % method)
if method in _cpy_non_euclid_methods:
- dm = distance.pdist(X, metric)
+ if redundant_dm: dm = distance.squareform(X)
+ else: dm = distance.pdist(X, metric)
if redundant_dm:
    dm = distance.squareform(X)
else:
    dm = distance.pdist(X, metric)

might be better.

@rgommers Owner

and same change below

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@richardtsai

@hongbo-zhu-cn The second paragraph of the docstring and the description of y parameter should be updated.
Some unit tests are needed as well.

@pv pv removed the PR label
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Jul 3, 2013
  1. @hongbo-zhu-cn

    Update hierarchy.py

    hongbo-zhu-cn authored
Commits on Sep 23, 2013
  1. @hongbo-zhu-cn

    Add bool redundant_dm to indicate the cases when redundant distance m…

    hongbo-zhu-cn authored
    …atrix is used as input
    
    Add bool redundant_dm to indicate the cases when redundant distance matrix is used as input. In such cases, the redundant distance matrices are converted to condensed form using squareform.
This page is out of date. Refresh to see the latest.
Showing with 7 additions and 3 deletions.
  1. +7 −3 scipy/cluster/hierarchy.py
View
10 scipy/cluster/hierarchy.py
@@ -465,7 +465,7 @@ def ward(y):
return linkage(y, method='ward', metric='euclidean')
-def linkage(y, method='single', metric='euclidean'):
+def linkage(y, method='single', metric='euclidean', redundant_dm=False):
"""
Performs hierarchical/agglomerative clustering on the condensed
distance matrix y.
@@ -607,6 +607,8 @@ def linkage(y, method='single', metric='euclidean'):
metric : str, optional
The distance metric to use. See the ``distance.pdist`` function for a
list of valid distance metrics.
+ redundant_dm: bool, optional
+ Boolean. True if y is a redundant distance matrix.
Returns
-------
@@ -640,7 +642,8 @@ def linkage(y, method='single', metric='euclidean'):
if method not in _cpy_linkage_methods:
raise ValueError('Invalid method: %s' % method)
if method in _cpy_non_euclid_methods:
- dm = distance.pdist(X, metric)
+ if redundant_dm: dm = distance.squareform(X)
+ else: dm = distance.pdist(X, metric)
if redundant_dm:
    dm = distance.squareform(X)
else:
    dm = distance.pdist(X, metric)

might be better.

@rgommers Owner

and same change below

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Z = np.zeros((n - 1, 4))
_hierarchy_wrap.linkage_wrap(dm, Z, n,
int(_cpy_non_euclid_methods[method]))
@@ -648,7 +651,8 @@ def linkage(y, method='single', metric='euclidean'):
if metric != 'euclidean':
raise ValueError(('Method %s requires the distance metric to '
'be euclidean') % s)
- dm = distance.pdist(X, metric)
+ if redundant_dm: dm = distance.squareform(X)
+ else: dm = distance.pdist(X, metric)
Z = np.zeros((n - 1, 4))
_hierarchy_wrap.linkage_euclid_wrap(dm, Z, X, m, n,
int(_cpy_euclid_methods[method]))
Something went wrong with that request. Please try again.