Low memory version of single linkage clustering #9031
Actually, it turned out that it's rather easy to combine scipy's qhull bindings with scipy's MST code to implement single-linkage Euclidean clustering in N log N time and linear memory. I have a proof of concept in https://github.com/Mortal/singlelinkage; it easily handles 64000 points in 2d per the reproduction of @lmcinnes. It uses the edges of the Delaunay triangulation to limit the size of the input to the MST algorithm. I would be happy to implement this into the scipy library as a replacement for the current Euclidean single-linkage implementation. Any pointers where to start?

EDIT TO ADD: Based on the comment from @lmcinnes, I have added to the linked repo a quadratic-time, linear-memory algorithm that doesn't use scipy's Delaunay triangulation or MST algorithm, but instead implements Prim's algorithm directly.
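The Delaunay-plus-MST approach described above can be sketched with pieces that already exist in scipy. This is an illustrative proof of concept under my own assumptions (function name and structure are mine, not taken from the linked repo); it relies on the fact that the Euclidean MST is a subgraph of the Delaunay triangulation, so the full O(n^2) distance matrix is never materialized:

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def delaunay_mst(points):
    """Sketch: Euclidean MST built from Delaunay edges only.

    The Euclidean MST is a subgraph of the Delaunay triangulation,
    so it suffices to weight the O(n) triangulation edges and run
    a sparse MST on those, instead of on all n*(n-1)/2 pairs.
    """
    tri = Delaunay(points)
    # Collect the unique edges of every simplex (pairs of vertex indices).
    edges = set()
    for simplex in tri.simplices:
        for i in range(len(simplex)):
            for j in range(i + 1, len(simplex)):
                a, b = sorted((simplex[i], simplex[j]))
                edges.add((a, b))
    rows, cols = map(np.array, zip(*edges))
    # Compute distances only for the triangulation edges.
    weights = np.linalg.norm(points[rows] - points[cols], axis=1)
    n = len(points)
    graph = coo_matrix((weights, (rows, cols)), shape=(n, n))
    return minimum_spanning_tree(graph)  # sparse matrix, n-1 edges
```

From the MST edges, the single-linkage dendrogram can then be recovered by sorting the edges and merging clusters with a union-find structure (not shown here).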
@Mortal it would be best to first put your proposal to the mailing list to see if there is support for it.
This looks like a good implementation. The biggest catches are that it only works for Euclidean distance, and since it relies on Delaunay triangulation it only works well for low-dimensional data (Delaunay in high dimensions becomes expensive very quickly).

Since we ultimately just want the MST, and assuming we use Prim's algorithm for that, it is easy enough to only generate the distances required for each step of Prim's (i.e. distances from the last added point to all remaining points), which is linear in memory and quadratic in time complexity.

If we are only interested in low(ish) dimensional data then Dual-tree Boruvka will scale better than Delaunay triangulation with dimension, and assuming we use Ball-Trees from scipy, it should support a wide range of distances beyond just Euclidean. Unfortunately Dual-tree Boruvka is rather more complicated to implement than the solution using Delaunay triangulation.

Let's see what the mailing list thinks of things ...
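The on-the-fly Prim's variant suggested above can be sketched as follows. This is an illustrative, unoptimized version (names and structure are my own); at each step it computes only one row of distances, so memory stays O(n) while time is O(n^2):

```python
import numpy as np

def prim_mst_edges(points):
    """Sketch of Prim's algorithm with distances generated on the fly.

    Only the distances from the most recently added point to all
    remaining points are computed each step: O(n) memory, O(n^2) time.
    Returns a list of (parent, child, distance) MST edges.
    """
    n = len(points)
    in_tree = np.zeros(n, dtype=bool)
    best = np.full(n, np.inf)    # best[i]: shortest distance from i to the tree
    parent = np.full(n, -1)      # parent[i]: tree vertex realizing best[i]
    current = 0
    in_tree[0] = True
    edges = []
    for _ in range(n - 1):
        # Distances from the point just added to every other point.
        d = np.linalg.norm(points - points[current], axis=1)
        closer = ~in_tree & (d < best)
        best[closer] = d[closer]
        parent[closer] = current
        # Next vertex: the non-tree point closest to the tree.
        current = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((int(parent[current]), current, float(best[current])))
        in_tree[current] = True
    return edges
```

Sorting the returned edges by distance gives the single-linkage merge order directly.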
Currently in scipy, single-linkage clustering on vector data starts by calling `pdist` to generate the full distance matrix, and then runs an MST calculation on that. While this is efficient, it is very memory intensive for large datasets. It is relatively straightforward to calculate the MST generating distances only as they are needed by the algorithm, allowing efficient computation without filling memory. It would be beneficial if such an option were available in scipy.

Reproducing code example:
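(The original snippet was not captured above; a reproduction along the following lines matches the discussion. The sizes here are illustrative assumptions, scaled down so the snippet runs; raising the point count toward the 64000 mentioned in the thread makes `pdist` materialize billions of pairwise distances and exhausts memory.)

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Small size so this runs; at e.g. (64000, 2) the implicit pdist call
# needs roughly 64000 * 63999 / 2 * 8 bytes (~16 GB) and fails.
data = np.random.random((2000, 2))
result = linkage(data, method='single')
```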
Error message:
On my mac laptop this simply freezes the whole machine. On Linux, a MemoryError with no traceback results. If the (lack of) error is not reproducible on your machine, simply make `data` a larger array.

Scipy/Numpy/Python version information: