[FEA] Support for HDBSCAN membership_vector and all_points_membership_vectors #4724
Comments
With the merge of #4800, soft clustering the original dataset with `all_points_membership_vectors` is now supported. The outstanding request in this issue is soft-clustering a new set of points with `membership_vector`.
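For reference, a minimal sketch of the merged entry point (the data, parameters, and the `prediction_data=True` flag are assumptions borrowed from the CPU hdbscan convention; the import path follows the comments below):

```python
# Minimal sketch: soft-cluster the training data with the function
# merged in #4800. Data and parameters here are placeholders.
import numpy as np
from cuml.cluster import HDBSCAN, all_points_membership_vectors

X = np.random.random((1000, 32)).astype(np.float32)  # placeholder embeddings

# prediction_data=True mirrors the CPU hdbscan requirement for
# soft clustering (assumed to apply here as well).
model = HDBSCAN(min_cluster_size=10, prediction_data=True).fit(X)

# One row per training point, one column per selected cluster.
memberships = all_points_membership_vectors(model)
```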
Hi, thanks for implementing this feature! I notice that `all_points_membership_vectors` is not a method of the `HDBSCAN` class. In that case, we need to explicitly import this method to use it:

```python
from cuml.cluster import HDBSCAN, all_points_membership_vectors
```

Thanks
Good question. Yes. In the canonical CPU library, this function is accessible from the module-level namespace rather than as a method of the class. We've matched this user experience.

```python
import cuml
from cuml.cluster import hdbscan
import hdbscan as cpu_hdbscan

print(cpu_hdbscan.all_points_membership_vectors)
print(cuml.cluster.hdbscan.all_points_membership_vectors, hdbscan.all_points_membership_vectors)
```

```
<function all_points_membership_vectors at 0x7fa895c611f0>
<cyfunction all_points_membership_vectors at 0x7fa897a35520> <cyfunction all_points_membership_vectors at 0x7fa897a35520>
```
I just ran the nightly yesterday, and it didn't seem to work; I had to use your workaround from here. Am I missing something?

```python
# this doesn't work
probs = cuml.cluster.hdbscan.all_points_membership_vectors(gpu_hdbscan_model)

# this does work
probs = cuml.cluster.all_points_membership_vectors(gpu_hdbscan_model)
```
Thanks for commenting about this namespace issue. This will be resolved in #4895.
Addresses #4724.

Authors:
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Dante Gama Dessavre (https://github.com/dantegd)
- AJ Schmidt (https://github.com/ajschmidt8)

URL: #4895
Thanks for implementing this feature! Has anyone had issues using `all_points_membership_vectors`? My Python kernel fails when I call it.
@hatemr can you provide the error you are getting when the Python kernel fails, and describe your environment (type of GPU, etc.)?
I would recommend we take further discussion of this to #4879, which includes some potential ideas for ways to reduce the memory usage. I will comment on that issue with this context.
@hatemr I tried BERTopic with GPU for almost 1M documents. My original embeddings are 1M x 384. I tried to get the probabilities of each topic for every document using `all_points_membership_vectors`, but the system crashes whenever I try to perform the task. I am using Colab with a 40 GB GPU. Can you please suggest a workaround to get the probabilities of each document for every topic?
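One possible interim workaround, sketched with the CPU hdbscan library rather than cuML (the chunk size and data shapes are placeholders, and chunked `membership_vector` scoring only approximates `all_points_membership_vectors`):

```python
# Hedged sketch: bound peak memory by scoring soft memberships in
# chunks with the CPU hdbscan library's membership_vector.
import numpy as np
import hdbscan

embeddings = np.random.random((100_000, 384))  # stand-in for real embeddings

clusterer = hdbscan.HDBSCAN(min_cluster_size=100,
                            prediction_data=True).fit(embeddings)

chunk_size = 50_000  # tune to available memory
parts = []
for start in range(0, embeddings.shape[0], chunk_size):
    chunk = embeddings[start:start + chunk_size]
    # membership_vector approximates the corresponding rows of
    # all_points_membership_vectors for these points.
    parts.append(hdbscan.membership_vector(clusterer, chunk))

probs = np.vstack(parts)  # one row per document, one column per topic
```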
Closes #4724

Authors:
- Tarang Jain (https://github.com/tarang-jain)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)

URL: #5247
I'd like to be able to use HDBSCAN to calculate membership vectors for points, like I can with the CPU library. Per the CPU library documentation, this function "produces a vector for each point in [provided data] that gives a probability that the given point is a member of a cluster for each of the selected clusters of the clusterer."
HDBSCAN provides separate methods for calculating membership vectors: `membership_vector` for new points and `all_points_membership_vectors` for all points in the original dataset.
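For reference, a minimal sketch of the two CPU-library entry points being requested (names per the hdbscan documentation; the data and parameters are placeholders):

```python
import numpy as np
import hdbscan

X = np.random.random((500, 16))      # original training data (placeholder)
X_new = np.random.random((10, 16))   # new points to soft-cluster

# prediction_data=True is required by the CPU library's prediction functions.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(X)

# Soft membership for every point in the original dataset.
all_probs = hdbscan.all_points_membership_vectors(clusterer)

# Soft membership for points not seen during fit.
new_probs = hdbscan.membership_vector(clusterer, X_new)
```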
This is useful in topic modeling workflows. Per the BERTopic documentation, "Calculating the probabilities is quite expensive and can significantly increase the computation time. Thus, only use it if you do not mind waiting a bit before the model is done running or if you have less than 50_000 documents." Accelerating this would be valuable, as these probabilities are necessary for evaluating the confidence of topic assignments to documents. The existing `probabilities_` attribute we provide represents the probability of the ultimately assigned topic cluster, but it would be nice to provide the full vector of probabilities per point, like CPU HDBSCAN does, for compatibility with tools that depend on it (like BERTopic).
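For context, a hedged sketch of how BERTopic consumes these probabilities (the `calculate_probabilities` flag is from the BERTopic documentation; the corpus here is a placeholder):

```python
from bertopic import BERTopic

docs = ["a placeholder document"] * 100  # stand-in corpus

# calculate_probabilities=True triggers the expensive soft-clustering
# step this issue asks to accelerate on GPU.
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)  # probs: one row per document
```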