Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Fast, low memory, single linkage implementation #11514

Merged
merged 73 commits into from Nov 22, 2019

Conversation

lmcinnes
Copy link
Contributor

Reference Issues/PRs

Fixes #11513

What does this implement/fix? Explain your changes.

This PR provides a small cython routine for computing the core mst computation in single linkage clustering in a way that computes distances only as needed, releasing them when they have been used. This allows for single linkage clustering on vector data to run successfully without exploding memory. As a side benefit this provides a speed improvement over the single linkage clustering implementation available in scipy.

Any other comments?

I originally sought to add this upstream to scipy. However scipy's distance computations are not exposed to cython, so on the fly distance computations are not feasible. In constrast Jake Vanderplas implemented most distance functions in the neighbors package in Cython, making fast computation available for all of these distance metrics easily implementable in scikit-learn.

…highlighting the relative strengths and weaknesses.
# Conflicts:
#	doc/modules/clustering.rst
#	examples/cluster/plot_linkage_comparison.py
#	sklearn/cluster/_hierarchical.pyx
#	sklearn/cluster/hierarchical.py
#	sklearn/cluster/tests/test_hierarchical.py
@lmcinnes
Copy link
Contributor Author

Sure -- I was following the kmeans example, which did 3d plots. I'll try to come up with something reasonable with colour schemes.

@lmcinnes
Copy link
Contributor Author

I modified the benchmarking script for 2D plots. This involved a little finessing of the parameters, so I updated the outputs of the scripts:

Base sklearn: https://gist.github.com/lmcinnes/6469b8d17c14330134f7564397c412b8
This patch: https://gist.github.com/lmcinnes/11cdc47c3403e4ca3422eeeb89dea8c7

Plots are as follows:

Base sklearn:

Base sklearn benchmark

This patch:

This patch benchmark

The final results are essentially as described above.

@GaelVaroquaux
Copy link
Member

Ball-park estimate, speed-up by a factor ~ 1.5x in this branch, right?

@GaelVaroquaux
Copy link
Member

Alright, I had a last read at the code. It's nice and compact. Excellent job. Merging! Thank you!

@GaelVaroquaux
Copy link
Member

Wait! I spoke too early. Can you please update the changelog (in doc/whats_new).

Thanks!

@NicolasHug
Copy link
Member

Needs a what's new entry before merging ;)

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Merging! Thank you!

@GaelVaroquaux GaelVaroquaux merged commit db59dd7 into scikit-learn:master Nov 22, 2019
@amueller
Copy link
Member

Awesome!

adrinjalali pushed a commit to adrinjalali/scikit-learn that referenced this pull request Nov 25, 2019
…1514)

* First cut at basic single linkage internals

* Refer to correct dist_metrics package

* Add csgraph sparse implementation for single linkage

* Add fast labelling/conversion from MST to single linkage tree; remove uneeded single_linkage.pyx file.

* Ensure existing tests cover single linkage

* Name cingle linkage labelling correctly.

* Iterating toward correct solution. Still have to get n_clusters, compute_full_tree=False working

* Get n_components correct.

* Update docstrings.

* Fix the parents array when we don't get the "full tree"

* Add single linkage to agglomerative clustering example.

* Add single linkage to digits agglomerative clustering example.

* Update documentation to reflect the addition of single linkage.

* Update documentation to reflect the addition of single linkage.

* Pep8 fix for class declaration in cython

* Fix heading in clustering docs

* Update the digits clustering text to reflect the new reality.

* Provide a more complete comparison of the different linkage methods, highlighting the relative strengths and weaknesses.

* We don't need connectivity here, and we can ignore issues with warnings for spectral clustering.

* Add an explicit test that single linkage successfully works on examples it should perform well on.

* Update docs with a more complete comparison on linkage methods (scale to be determined?)

* List formatting in example linkage comparison.

* Flake8 fixes.

* Flake8 fixes.

* More Flake8 fixes.

* Fix agglomerative plot example with correct subplot spec

* Explicitly test linkages (including single) produce results identical to scipy.cluster.hierarchical

* Fix comment on why we sort (consistency)

* Make dense single linkage faster

* Add docstring to new mst-linkage-core computations.

* Add a test that new single linkage code matches scipy

* Ensure we only attemtp this for metrics Jake implemented.

* Per amueller; it's a long paper, ref the figure.

* Clean up a few things.

* Too many blank lines for flake8

* Bad scipy slink input

* Flake8 fixes

* Clean up cython a little; fix typo/carryover

* Convert memoryview to numpy array on return

* Just convert to the correct dtype

* Update sklearn/cluster/_hierarchical.pyx

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Update sklearn/cluster/_hierarchical.pyx

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Update sklearn/cluster/_hierarchical.pyx

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Update sklearn/cluster/tests/test_hierarchical.py

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Fixes as per @NicolasHug suggestions.

* Update renaming of params in test_hierarchical

* Relative import?

* Ah, it got renamed in master...

* A bad merge on my part.

* In principle this is in sklearn.neighbors now...

* No; not that way...

* Declare dim before use.

* Update sklearn/cluster/tests/test_hierarchical.py

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Remaining fixes per Nicolas Hug.

* Update sklearn/cluster/tests/test_hierarchical.py

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Fix flake8 issues.

* Switch from stable to mergesort per jnotham

* Update sklearn/cluster/_hierarchical.py

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Skip checks that are already validated.

* Update docstring per Gael's suggestion

* Add a benchmark script for agglomerative clustering

* Fix some flake8 issues

* No flake8 on the one line

* Update parameters and output for benchmark hierarchical

* Switch to 2D plotting for hierarchical benchmark

* Wrong colormap name

* Formatting fpr bench hierarchical

* Add an item to WhatsNew
This was referenced Nov 25, 2019
@ogrisel ogrisel added this to the 0.22 milestone Nov 25, 2019
@ogrisel
Copy link
Member

ogrisel commented Nov 25, 2019

I put the 0.22 label on this PR because it is currently included in #15715 but @jnothman stated that this should not be part of the 0.22 because it was merged after the feature freeze and 0.22.X branch (and it is not a bugfix). I am +0 to move it to 0.23.

Feel free to retag to 0.23 if everyone agrees.

@jnothman
Copy link
Member

I don't feel strongly about it, really. I just feel like there's an ideal release process where RC means no more features, and you need to draw that line somewhere.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Nov 25, 2019 via email

@jnothman
Copy link
Member

jnothman commented Nov 25, 2019 via email

@lmcinnes
Copy link
Contributor Author

Sorry -- my mistake; I got the wrong what's new file apparently. Is there anything I need to do to correct this?

@jnothman
Copy link
Member

jnothman commented Nov 25, 2019 via email

adrinjalali added a commit to adrinjalali/scikit-learn that referenced this pull request Nov 26, 2019
jnothman pushed a commit that referenced this pull request Nov 28, 2019
* First cut at basic single linkage internals

* Refer to correct dist_metrics package

* Add csgraph sparse implementation for single linkage

* Add fast labelling/conversion from MST to single linkage tree; remove uneeded single_linkage.pyx file.

* Ensure existing tests cover single linkage

* Name cingle linkage labelling correctly.

* Iterating toward correct solution. Still have to get n_clusters, compute_full_tree=False working

* Get n_components correct.

* Update docstrings.

* Fix the parents array when we don't get the "full tree"

* Add single linkage to agglomerative clustering example.

* Add single linkage to digits agglomerative clustering example.

* Update documentation to reflect the addition of single linkage.

* Update documentation to reflect the addition of single linkage.

* Pep8 fix for class declaration in cython

* Fix heading in clustering docs

* Update the digits clustering text to reflect the new reality.

* Provide a more complete comparison of the different linkage methods, highlighting the relative strengths and weaknesses.

* We don't need connectivity here, and we can ignore issues with warnings for spectral clustering.

* Add an explicit test that single linkage successfully works on examples it should perform well on.

* Update docs with a more complete comparison on linkage methods (scale to be determined?)

* List formatting in example linkage comparison.

* Flake8 fixes.

* Flake8 fixes.

* More Flake8 fixes.

* Fix agglomerative plot example with correct subplot spec

* Explicitly test linkages (including single) produce results identical to scipy.cluster.hierarchical

* Fix comment on why we sort (consistency)

* Make dense single linkage faster

* Add docstring to new mst-linkage-core computations.

* Add a test that new single linkage code matches scipy

* Ensure we only attemtp this for metrics Jake implemented.

* Per amueller; it's a long paper, ref the figure.

* Clean up a few things.

* Too many blank lines for flake8

* Bad scipy slink input

* Flake8 fixes

* Clean up cython a little; fix typo/carryover

* Convert memoryview to numpy array on return

* Just convert to the correct dtype

* Update sklearn/cluster/_hierarchical.pyx

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Update sklearn/cluster/_hierarchical.pyx

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Update sklearn/cluster/_hierarchical.pyx

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Update sklearn/cluster/tests/test_hierarchical.py

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Fixes as per @NicolasHug suggestions.

* Update renaming of params in test_hierarchical

* Relative import?

* Ah, it got renamed in master...

* A bad merge on my part.

* In principle this is in sklearn.neighbors now...

* No; not that way...

* Declare dim before use.

* Update sklearn/cluster/tests/test_hierarchical.py

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Remaining fixes per Nicolas Hug.

* Update sklearn/cluster/tests/test_hierarchical.py

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Fix flake8 issues.

* Switch from stable to mergesort per jnotham

* Update sklearn/cluster/_hierarchical.py

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Skip checks that are already validated.

* Update docstring per Gael's suggestion

* Add a benchmark script for agglomerative clustering

* Fix some flake8 issues

* No flake8 on the one line

* Update parameters and output for benchmark hierarchical

* Switch to 2D plotting for hierarchical benchmark

* Wrong colormap name

* Formatting fpr bench hierarchical

* Add an item to WhatsNew
jnothman pushed a commit that referenced this pull request Nov 28, 2019
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
…1514)

* First cut at basic single linkage internals

* Refer to correct dist_metrics package

* Add csgraph sparse implementation for single linkage

* Add fast labelling/conversion from MST to single linkage tree; remove uneeded single_linkage.pyx file.

* Ensure existing tests cover single linkage

* Name cingle linkage labelling correctly.

* Iterating toward correct solution. Still have to get n_clusters, compute_full_tree=False working

* Get n_components correct.

* Update docstrings.

* Fix the parents array when we don't get the "full tree"

* Add single linkage to agglomerative clustering example.

* Add single linkage to digits agglomerative clustering example.

* Update documentation to reflect the addition of single linkage.

* Update documentation to reflect the addition of single linkage.

* Pep8 fix for class declaration in cython

* Fix heading in clustering docs

* Update the digits clustering text to reflect the new reality.

* Provide a more complete comparison of the different linkage methods, highlighting the relative strengths and weaknesses.

* We don't need connectivity here, and we can ignore issues with warnings for spectral clustering.

* Add an explicit test that single linkage successfully works on examples it should perform well on.

* Update docs with a more complete comparison on linkage methods (scale to be determined?)

* List formatting in example linkage comparison.

* Flake8 fixes.

* Flake8 fixes.

* More Flake8 fixes.

* Fix agglomerative plot example with correct subplot spec

* Explicitly test linkages (including single) produce results identical to scipy.cluster.hierarchical

* Fix comment on why we sort (consistency)

* Make dense single linkage faster

* Add docstring to new mst-linkage-core computations.

* Add a test that new single linkage code matches scipy

* Ensure we only attemtp this for metrics Jake implemented.

* Per amueller; it's a long paper, ref the figure.

* Clean up a few things.

* Too many blank lines for flake8

* Bad scipy slink input

* Flake8 fixes

* Clean up cython a little; fix typo/carryover

* Convert memoryview to numpy array on return

* Just convert to the correct dtype

* Update sklearn/cluster/_hierarchical.pyx

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Update sklearn/cluster/_hierarchical.pyx

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Update sklearn/cluster/_hierarchical.pyx

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Update sklearn/cluster/tests/test_hierarchical.py

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Fixes as per @NicolasHug suggestions.

* Update renaming of params in test_hierarchical

* Relative import?

* Ah, it got renamed in master...

* A bad merge on my part.

* In principle this is in sklearn.neighbors now...

* No; not that way...

* Declare dim before use.

* Update sklearn/cluster/tests/test_hierarchical.py

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Remaining fixes per Nicolas Hug.

* Update sklearn/cluster/tests/test_hierarchical.py

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Fix flake8 issues.

* Switch from stable to mergesort per jnotham

* Update sklearn/cluster/_hierarchical.py

Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>

* Skip checks that are already validated.

* Update docstring per Gael's suggestion

* Add a benchmark script for agglomerative clustering

* Fix some flake8 issues

* No flake8 on the one line

* Update parameters and output for benchmark hierarchical

* Switch to 2D plotting for hierarchical benchmark

* Wrong colormap name

* Formatting fpr bench hierarchical

* Add an item to WhatsNew
Pseudomanifold pushed a commit to BorgwardtLab/scikit-learn that referenced this pull request Apr 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Single linkage clustering fails for large datasets
6 participants