[FEA] Support for HDBSCAN membership_vector and all_points_membership_vectors #4724

Closed
beckernick opened this issue May 5, 2022 · 13 comments · Fixed by #5247
Labels
CUDA / C++ (CUDA issue), Cython / Python (Cython or Python issue), feature request (New feature or request), inactive-30d

Comments

@beckernick
Member

beckernick commented May 5, 2022

I'd like to be able to use HDBSCAN to calculate membership vectors for points, like I can with the CPU library. Per the CPU library documentation, this function "produces a vector for each point in [provided data] that gives a probability that the given point is a member of a cluster for each of the selected clusters of the clusterer."

HDBSCAN provides separate methods for calculating membership vectors for new points and for all points in the original dataset.

This is useful in topic modeling workflows. Per the BERTopic documentation, "Calculating the probabilities is quite expensive and can significantly increase the computation time. Thus, only use it if you do not mind waiting a bit before the model is done running or if you have less than 50_000 documents." Accelerating this would be valuable, as these probabilities are necessary for evaluating the confidence of topic assignments to documents. The existing probabilities_ attribute we provide represents only the probability of the ultimately assigned cluster, but it would be nice to provide the full vector of per-cluster probabilities for each point, like CPU HDBSCAN does, for compatibility with tools that depend on it (like BERTopic).

import hdbscan
from sklearn.datasets import make_blobs
import numpy as np


np.random.seed(12)

data, _ = make_blobs(1000)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True)
clusterer.fit(data)

new_points = np.array([[-9,7], [-3,4]])
hdbscan.membership_vector(clusterer, new_points)
array([[0.02848655, 0.07827696, 0.46968066],
       [0.0569429 , 0.05463597, 0.03344189]])

import hdbscan
from sklearn.datasets import make_blobs
import numpy as np

np.random.seed(12)


data, _ = make_blobs(1000)
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True)
clusterer.fit(data)

hdbscan.all_points_membership_vectors(clusterer)[:5].round(3)
array([[0.   , 0.   , 1.   ],
       [0.667, 0.11 , 0.05 ],
       [0.697, 0.059, 0.024],
       [0.   , 1.   , 0.   ],
       [0.546, 0.177, 0.047]])
@beckernick beckernick added the "feature request (New feature or request)" and "? - Needs Triage (Need team to review and classify)" labels on May 5, 2022
@beckernick beckernick added the "Cython / Python (Cython or Python issue)", "libcuml", and "CUDA / C++ (CUDA issue)" labels and removed the "? - Needs Triage (Need team to review and classify)" and "libcuml" labels on May 5, 2022
@github-actions

github-actions bot commented Jun 4, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@beckernick
Member Author

With the merge of #4800, soft clustering of the original dataset with all_points_membership_vectors is now available. Please give it a try and file an issue if you run into problems or have any feedback.

The outstanding request in this issue is soft-clustering a new set of points with membership_vector.
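
For reference, a minimal sketch of the GPU path for soft clustering the training data. This assumes the module-level import discussed further down in this thread and that the cuML constructor accepts prediction_data=True like the CPU library:

import numpy as np
from sklearn.datasets import make_blobs
from cuml.cluster import HDBSCAN, all_points_membership_vectors

np.random.seed(12)
data, _ = make_blobs(1000)

# prediction_data=True retains the extra data needed for soft clustering
clusterer = HDBSCAN(min_cluster_size=10, prediction_data=True)
clusterer.fit(data)

# one row per training point, one column per selected cluster
membership = all_points_membership_vectors(clusterer)
print(membership[:5].round(3))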

@ZhangzihanGit

ZhangzihanGit commented Aug 30, 2022

Hi,

Thanks for implementing this feature!

I notice that the all_points_membership_vectors method is written outside of the HDBSCAN class; is this intended?

In that case, we need to explicitly import this method to use it:

from cuml.cluster import HDBSCAN, all_points_membership_vectors

Thanks

@beckernick
Member Author

beckernick commented Aug 30, 2022

Good question. Yes. In the canonical CPU library, this function is accessible from the module level namespace rather than as a method of the class. We've matched this user experience.

import cuml
from cuml.cluster import hdbscan
import hdbscan as cpu_hdbscan

print(cpu_hdbscan.all_points_membership_vectors)
print(cuml.cluster.hdbscan.all_points_membership_vectors, hdbscan.all_points_membership_vectors)
<function all_points_membership_vectors at 0x7fa895c611f0>
<cyfunction all_points_membership_vectors at 0x7fa897a35520> <cyfunction all_points_membership_vectors at 0x7fa897a35520>

@ldsands

ldsands commented Sep 21, 2022

> Good question. Yes. In the canonical CPU library, this function is accessible from the module level namespace rather than as a method of the class. We've matched this user experience.

I just ran the nightly yesterday, and it didn't seem to work; I had to use your workaround from here. Am I missing something?

# this doesn't work
probs = cuml.cluster.hdbscan.all_points_membership_vectors(gpu_hdbscan_model)
# this does work
probs = cuml.cluster.all_points_membership_vectors(gpu_hdbscan_model)

@beckernick
Member Author

Thanks for commenting about this namespace issue. This will be resolved in #4895.
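
A minimal sketch of what the import paths are expected to look like once #4895 is merged, assuming both namespaces end up exposing the same function and reusing the fitted gpu_hdbscan_model from the snippet above:

import cuml
from cuml.cluster import hdbscan as cu_hdbscan

# after #4895, both of these should resolve to the same function
probs = cuml.cluster.all_points_membership_vectors(gpu_hdbscan_model)
probs = cu_hdbscan.all_points_membership_vectors(gpu_hdbscan_model)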

rapids-bot bot pushed a commit that referenced this issue Sep 22, 2022
Addresses #4724.

Authors:
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #4895
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@hatemr

hatemr commented Jan 3, 2023

Thanks for implementing this feature!

Has anyone had issues using all_points_membership_vectors with a large dataset? (For me, the original embeddings are 235002 x 384.) It causes my Python kernel to fail. Apologies if this isn't enough detail.

@cjnolet
Member

cjnolet commented Jan 3, 2023

@hatemr can you provide the error you are getting when the Python kernel fails and describe your environment (type of GPU, etc...)?

@hatemr

hatemr commented Jan 3, 2023

g4dn.8xlarge

(screenshots of the error output attached)

@beckernick
Member Author

beckernick commented Jan 3, 2023

I'd recommend we take further discussion of this to #4879, which includes some potential ideas for ways to reduce it.

I will comment on that issue with this context.

@ravinani

ravinani commented Jan 4, 2023

@hatemr I tried BERTopic with GPU for almost 1M documents. My original embeddings are 1M x 384. I tried to get the probabilities of each topic for every document using all_points_membership_vectors, but the system crashes whenever I try to perform the task.

I am using Colab with a 40GB GPU. Can you please suggest a workaround to get the probabilities of each document for every topic?

(screenshot of the crash attached)

@hatemr

hatemr commented Jan 4, 2023

@ravinani see answer here

jakirkham pushed a commit to jakirkham/cuml that referenced this issue Feb 27, 2023
rapids-bot bot pushed a commit that referenced this issue Mar 31, 2023
Closes #4724

Authors:
  - Tarang Jain (https://github.com/tarang-jain)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #5247
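
With #5247 merged, a minimal sketch of the remaining piece, soft clustering a new set of points. This assumes membership_vector is exposed from the hdbscan module alongside all_points_membership_vectors and mirrors the CPU signature:

import numpy as np
from sklearn.datasets import make_blobs
from cuml.cluster.hdbscan import HDBSCAN, membership_vector

np.random.seed(12)
data, _ = make_blobs(1000)

clusterer = HDBSCAN(min_cluster_size=10, prediction_data=True)
clusterer.fit(data)

# one row of per-cluster membership probabilities for each new point
new_points = np.array([[-9, 7], [-3, 4]])
probs = membership_vector(clusterer, new_points)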