
[FEA] [algorithms/hdbscan] Reduce Memory footprint while performing soft clustering using HDBSCAN #6551

@abhijithp135

Description


Is your feature request related to a problem? Please describe.
It is not related to a specific problem. I would like to generate soft clustering output for a dataset of 1.5 million samples, which produces about 15,000 clusters after clustering.
To run soft clustering, I need at least 1.5 × 10^6 × 15,000 × 4 bytes = 90 GB of memory just to store the result.
This is clearly not scalable if we want to run clustering on even larger data.
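A minimal sketch of the current workflow, assuming the cuML HDBSCAN API (`cuml.cluster.hdbscan.HDBSCAN` with `prediction_data=True` and the module-level `all_points_membership_vectors` helper); the data shape, `min_cluster_size`, and dtypes are illustrative only:

```python
import numpy as np
from cuml.cluster.hdbscan import HDBSCAN, all_points_membership_vectors

# Illustrative data: 1.5M points, 32 features, float32.
X = np.random.rand(1_500_000, 32).astype(np.float32)

clusterer = HDBSCAN(min_cluster_size=25, prediction_data=True)
clusterer.fit(X)

# Returns an (n_points, n_clusters) float32 matrix. With ~15,000 clusters
# this is 1.5e6 * 1.5e4 * 4 bytes ≈ 90 GB, which does not fit in memory.
membership = all_points_membership_vectors(clusterer)
```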

Describe the solution you'd like
Instead of storing the probability of a point belonging to every cluster, expose a parameter in the constructor and store only that many of the highest probabilities per point (after sorting). Also store the probability of the point being noise in a separate array (see the sketch below).
At least for my use case, I don't need more than the top 10 probabilities and the noise probability.
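A rough sketch of the intended output. The `k` parameter and the `top_k_membership` helper are hypothetical, and the reduction is shown here as post-processing on the full matrix only to illustrate the shape of the result; in the requested feature the selection would happen inside HDBSCAN so the full matrix is never materialized. The noise-probability formula is also just illustrative:

```python
import numpy as np

def top_k_membership(membership, k=10):
    """Keep only the k largest cluster probabilities per point, plus a
    separate per-point noise probability, instead of the full
    (n_points, n_clusters) matrix."""
    # Indices of the k largest probabilities in each row (unsorted).
    top_idx = np.argpartition(membership, -k, axis=1)[:, -k:]
    top_prob = np.take_along_axis(membership, top_idx, axis=1)

    # Sort the kept entries in descending order of probability.
    order = np.argsort(-top_prob, axis=1)
    top_idx = np.take_along_axis(top_idx, order, axis=1)
    top_prob = np.take_along_axis(top_prob, order, axis=1)

    # Illustrative noise probability, stored in its own array.
    noise_prob = 1.0 - membership.sum(axis=1)
    return top_idx, top_prob, noise_prob

# For 1.5M points and k=10 this is roughly
# 1.5e6 * 10 * (4 + 4) bytes + 1.5e6 * 4 bytes ≈ 126 MB instead of ~90 GB.
```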

