
[FEA] [algorithms/hdbscan] Reduce Memory footprint while performing soft clustering using HDBSCAN #6551

@abhijithp135

Description


Is your feature request related to a problem? Please describe.
It is not related to a specific problem. I would like to generate soft clustering output for a dataset of 1.5 million samples, which produces about 15,000 clusters after clustering.
To run soft clustering, I need at least 1.5 × 10^6 × 15,000 × 4 bytes = 90 GB of memory just to store the result.
This is clearly not scalable if we want to run clustering on even larger data.
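A minimal sketch of the current workflow, assuming the cuML HDBSCAN API (`cuml.cluster.hdbscan.HDBSCAN` with `prediction_data=True` and the module-level `all_points_membership_vectors` helper); the data shape, `min_cluster_size`, and dtypes are illustrative only:

```python
import numpy as np
from cuml.cluster.hdbscan import HDBSCAN, all_points_membership_vectors

# Illustrative data: 1.5M points, 32 features, float32.
X = np.random.rand(1_500_000, 32).astype(np.float32)

clusterer = HDBSCAN(min_cluster_size=25, prediction_data=True)
clusterer.fit(X)

# Returns an (n_points, n_clusters) float32 matrix. With ~15,000 clusters
# this is 1.5e6 * 1.5e4 * 4 bytes ≈ 90 GB, which does not fit in memory.
membership = all_points_membership_vectors(clusterer)
```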

Describe the solution you'd like
Instead of storing the probability of a point belonging to every cluster, expose a parameter in the constructor and store only that many of the highest probabilities per point (after sorting). Also store the probability of the point being noise in a separate array (see the sketch below).
At least for my use case, I don't need more than the top 10 probabilities and the noise probability.
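A rough sketch of the intended output. The `k` parameter and the `top_k_membership` helper are hypothetical, and the reduction is shown here as post-processing on the full matrix only to illustrate the shape of the result; in the requested feature the selection would happen inside HDBSCAN so the full matrix is never materialized. The noise-probability formula is also just illustrative:

```python
import numpy as np

def top_k_membership(membership, k=10):
    """Keep only the k largest cluster probabilities per point, plus a
    separate per-point noise probability, instead of the full
    (n_points, n_clusters) matrix."""
    # Indices of the k largest probabilities in each row (unsorted).
    top_idx = np.argpartition(membership, -k, axis=1)[:, -k:]
    top_prob = np.take_along_axis(membership, top_idx, axis=1)

    # Sort the kept entries in descending order of probability.
    order = np.argsort(-top_prob, axis=1)
    top_idx = np.take_along_axis(top_idx, order, axis=1)
    top_prob = np.take_along_axis(top_prob, order, axis=1)

    # Illustrative noise probability, stored in its own array.
    noise_prob = 1.0 - membership.sum(axis=1)
    return top_idx, top_prob, noise_prob

# For 1.5M points and k=10 this is roughly
# 1.5e6 * 10 * (4 + 4) bytes + 1.5e6 * 4 bytes ≈ 126 MB instead of ~90 GB.
```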

