Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Low memory version of single linkage clustering #9031

Open
lmcinnes opened this issue Jul 14, 2018 · 4 comments
Open

Low memory version of single linkage clustering #9031

lmcinnes opened this issue Jul 14, 2018 · 4 comments
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected enhancement A new feature or improvement scipy.cluster

Comments

@lmcinnes
Copy link

Currently in scipy single linkage clustering on vector data starts by calling pdist to generate the full distance matrix, and then runs an MST calculation on that. While this is efficient it is very memory intensive for large datasets. It is relatively straightforward to to calculate the MST generating distances only as they are needed through the algorithm, allowing efficient computation without filling memory. It would be beneficial if such an option was available in scipy.

Reproducing code example:

import numpy as np
import scipy.cluster.hierarchy

data = np.random.normal(size=(64000,2))
tree = scipy.cluster.hierarchy.linkage(data, method='single')

Error message:

On my mac laptop this simply freezes the whole machine. On Linux a MemoryError with no traceback results. If the (lack of) error is not reproducible on your machine simply make data a larger array.

Scipy/Numpy/Python version information:

1.0.0 1.11.3 sys.version_info(major=3, minor=5, micro=4, releaselevel='final', serial=0)
@rgommers rgommers added defect A clear bug or issue that prevents SciPy from being installed or used as expected enhancement A new feature or improvement scipy.cluster labels Jul 14, 2018
@rgommers
Copy link
Member

rgommers commented Jul 14, 2018

This is both a defect (no MemoryError on macOS) and an enhancement request. Discussion with @lmcinnes at the SciPy'18 sprint - changing to his single linkage implementation seems feasible. Does require a Cython interface to the distance functions in spatial though (gh-9032), so cannot easily be integrated right now.

@Mortal
Copy link

Mortal commented Jul 2, 2023

Actually, it turned out that it's rather easy to combine scipy's qhull bindings with scipy's MST code to implement single-linkage Euclidean clustering in N log N time and linear memory. I have a proof of concept in https://github.com/Mortal/singlelinkage - it easily handles 64000 points in 2d per the reproduction of @lmcinnes. It uses the edges of the Delaunay triangulation to limit the size of the input to the MST algorithm.

I would be happy to implement this into the scipy library as a replacement for the current Euclidean single-linkage implementation. Any pointers where to start?

EDIT TO ADD: Based on the comment from @lmcinnes I have added to the linked repo a quadratic-time, linear-memory algorithm that doesn't use scipy's Delaunay triangulation or MST algorithm, but instead implements Prim's algorithm directly.

@j-bowhay
Copy link
Member

j-bowhay commented Jul 2, 2023

@Mortal it is first best to put your proposal to the mailing list to see if there is support for it

@lmcinnes
Copy link
Author

lmcinnes commented Jul 4, 2023

This looks like a good implementation. The biggest catches are that it only works for Euclidean distance, and since it relies on Delaunay triangulation it only works well for low dimensional data (Delaunay in high dimensions becomes expensive very quickly).

Since we ultimately just want the MST, and assuming we use Prims algorithm for that, it is easy enough to only generate the distances required for each step of Prims (i.e. distances from the last added point to all remaining points) which is linear in memory and quadratic in time complexity.

If we are only interested in low(ish) dimensional data then Dual-tree Boruvka will scale better than Delaunay triangulation with dimension, and assuming we use Ball-Trees from scipy, it should support a wide range of distances beyond just Euclidean. Unfortunately Dual-tree Boruvka is rather more complicated to implement than the solution using Delaunay triangulation. Let's see what the mailing list thinks of things ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected enhancement A new feature or improvement scipy.cluster
Projects
None yet
Development

No branches or pull requests

4 participants