Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Use MST algorithm instead of SLINK for single linkage clustering #6495

Merged
merged 4 commits into from Sep 28, 2016

Conversation

@nmayorov
Copy link
Contributor

nmayorov commented Aug 17, 2016

Following #5960 I replaced SLINK algorithm with MST. It was rather straightforward to implement, except that post-labeling algorithm had to be changed slightly as well.

Timings show that MST is slightly faster on a random input and NN-CHAIN is about 2 times slower (so using MST makes sense).

Tests with ties (the very reason to replace SLINK) were taken from #5960.

@rgommers

This comment has been minimized.

Copy link
Member

rgommers commented Sep 25, 2016

Interesting that the Mullner paper has 54 citations already but was never published in a journal.

This PR implements the method originally by Rohlf (1973), so citing that paper would be good as well.

@rgommers

This comment has been minimized.

Copy link
Member

rgommers commented Sep 25, 2016

@nmayorov it would be good to cherry-pick b82937e so you preserve the authorship info.

@rgommers rgommers added this to the 0.19.0 milestone Sep 25, 2016
@nmayorov

This comment has been minimized.

Copy link
Contributor Author

nmayorov commented Sep 25, 2016

@rgommers I think it's done.

Copy link
Member

rgommers left a comment

LGTM

if x_root < y_root:
Z[i, 0], Z[i, 1] = x_root, y_root
else:
Z[i, 0], Z[i, 1] = y_root, x_root
Z[i, 3] = uf.merge(x_root, y_root)

This comment has been minimized.

Copy link
@rgommers

rgommers Sep 28, 2016

Member

This is a fix in label that affects nn_chain as well right? And the issue is covered by one of the new tests?

This comment has been minimized.

Copy link
@nmayorov

nmayorov Sep 28, 2016

Author Contributor

This change doesn't affect nn_chain, but is required for mst_linkage to label clusters correctly. It is not related to the ties issue.

So the new variant is more general and the original version relies on the order of merges in nn_chain, which is quite nontrivial thing. Partially my confidence is based on the extensive test suite available in scipy. I also run couple of comparisons with fastcluster on random inputs.

This comment has been minimized.

Copy link
@rgommers

rgommers Sep 28, 2016

Member

the original version relies on the order of merges in nn_chain

ah thanks, that wasn't clear to me.

@rgommers rgommers merged commit a6bf808 into scipy:master Sep 28, 2016
3 checks passed
3 checks passed
ci/circleci Your tests passed on CircleCI!
Details
codecov/project 78.79% (target 1.00%)
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@rgommers

This comment has been minimized.

Copy link
Member

rgommers commented Sep 28, 2016

In it goes then. Thanks @nmayorov!

@rgommers

This comment has been minimized.

Copy link
Member

rgommers commented Apr 4, 2017

won't do an 0.18.x release anymore, so removing the backport label

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.