[FEA] Support for HDBSCAN membership_vector and all_points_membership_vectors #4724

Closed
beckernick opened this issue May 5, 2022 · 13 comments · Fixed by #5247
Labels
CUDA / C++ (CUDA issue), Cython / Python (Cython or Python issue), feature request (New feature or request), inactive-30d

Comments

@beckernick
Member

beckernick commented May 5, 2022

I'd like to be able to use HDBSCAN to calculate membership vectors for points, like I can with the CPU library. Per the CPU library documentation, this function "produces a vector for each point in [provided data] that gives a probability that the given point is a member of a cluster for each of the selected clusters of the clusterer."

HDBSCAN provides separate methods for calculating membership vectors for new points and for all points in the original dataset.

This is useful in topic modeling workflows. Per the BERTopic documentation, "Calculating the probabilities is quite expensive and can significantly increase the computation time. Thus, only use it if you do not mind waiting a bit before the model is done running or if you have less than 50_000 documents." Accelerating this would be valuable, as these probabilities are necessary for evaluating the confidence of topic assignments to documents. The existing probabilities_ attribute we provide represents only the probability of the ultimately assigned cluster, but it would be nice to provide the full vector of per-cluster probabilities for each point, like CPU HDBSCAN does, for compatibility with tools that depend on it (like BERTopic).

import hdbscan
from sklearn.datasets import make_blobs
import numpy as np


np.random.seed(12)

data, _ = make_blobs(1000)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True)
clusterer.fit(data)

new_points = np.array([[-9,7], [-3,4]])
hdbscan.membership_vector(clusterer, new_points)
array([[0.02848655, 0.07827696, 0.46968066],
       [0.0569429 , 0.05463597, 0.03344189]])

import hdbscan
from sklearn.datasets import make_blobs
import numpy as np

np.random.seed(12)


data, _ = make_blobs(1000)
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True)
clusterer.fit(data)

hdbscan.all_points_membership_vectors(clusterer)[:5].round(3)
array([[0.   , 0.   , 1.   ],
       [0.667, 0.11 , 0.05 ],
       [0.697, 0.059, 0.024],
       [0.   , 1.   , 0.   ],
       [0.546, 0.177, 0.047]])
@beckernick beckernick added the "feature request (New feature or request)" and "? - Needs Triage (Need team to review and classify)" labels on May 5, 2022
@beckernick beckernick added the "Cython / Python (Cython or Python issue)", "libcuml", and "CUDA / C++ (CUDA issue)" labels and removed the "? - Needs Triage (Need team to review and classify)" and "libcuml" labels on May 5, 2022
@github-actions

github-actions bot commented Jun 4, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@beckernick
Member Author

With the merge of #4800, soft clustering of the original dataset with all_points_membership_vectors is now available. Please give it a try and file an issue if you run into problems or have any feedback.

The outstanding request in this issue is soft-clustering a new set of points with membership_vector.
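
For reference, a minimal sketch of the GPU path for soft clustering the training data. This assumes the module-level import discussed further down in this thread and that the cuML constructor accepts prediction_data=True like the CPU library:

import numpy as np
from sklearn.datasets import make_blobs
from cuml.cluster import HDBSCAN, all_points_membership_vectors

np.random.seed(12)
data, _ = make_blobs(1000)

# prediction_data=True retains the extra data needed for soft clustering
clusterer = HDBSCAN(min_cluster_size=10, prediction_data=True)
clusterer.fit(data)

# one row per training point, one column per selected cluster
membership = all_points_membership_vectors(clusterer)
print(membership[:5].round(3))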

@ZhangzihanGit

ZhangzihanGit commented Aug 30, 2022

Hi,

Thanks for implementing this feature!

I notice that the all_points_membership_vectors method is written outside of the HDBSCAN class; is this intended?

In that case, we need to explicitly import this method to use it:

from cuml.cluster import HDBSCAN, all_points_membership_vectors

Thanks

@beckernick
Member Author

beckernick commented Aug 30, 2022

Good question. Yes. In the canonical CPU library, this function is accessible from the module level namespace rather than as a method of the class. We've matched this user experience.

import cuml
from cuml.cluster import hdbscan
import hdbscan as cpu_hdbscan

print(cpu_hdbscan.all_points_membership_vectors)
print(cuml.cluster.hdbscan.all_points_membership_vectors, hdbscan.all_points_membership_vectors)
<function all_points_membership_vectors at 0x7fa895c611f0>
<cyfunction all_points_membership_vectors at 0x7fa897a35520> <cyfunction all_points_membership_vectors at 0x7fa897a35520>

@ldsands

ldsands commented Sep 21, 2022

> Good question. Yes. In the canonical CPU library, this function is accessible from the module level namespace rather than as a method of the class. We've matched this user experience.

I just ran the nightly yesterday, and it didn't seem to work; I had to use your workaround from here. Am I missing something?

# this doesn't work
probs = cuml.cluster.hdbscan.all_points_membership_vectors(gpu_hdbscan_model)
# this does work
probs = cuml.cluster.all_points_membership_vectors(gpu_hdbscan_model)

@beckernick
Member Author

Thanks for commenting about this namespace issue. This will be resolved in #4895.
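
A minimal sketch of what the import paths are expected to look like once #4895 is merged, assuming both namespaces end up exposing the same function and reusing the fitted gpu_hdbscan_model from the snippet above:

import cuml
from cuml.cluster import hdbscan as cu_hdbscan

# after #4895, both of these should resolve to the same function
probs = cuml.cluster.all_points_membership_vectors(gpu_hdbscan_model)
probs = cu_hdbscan.all_points_membership_vectors(gpu_hdbscan_model)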

rapids-bot bot pushed a commit that referenced this issue Sep 22, 2022
Addresses #4724.

Authors:
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #4895
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@hatemr

hatemr commented Jan 3, 2023

Thanks for implementing this feature!

Has anyone had issues using all_points_membership_vectors with a large dataset? (For me, the original embeddings are 235002 x 384.) It causes my Python kernel to fail. Apologies if this isn't enough detail.

@cjnolet
Member

cjnolet commented Jan 3, 2023

@hatemr can you provide the error you are getting when the Python kernel fails and describe your environment (type of GPU, etc...)?

@hatemr

hatemr commented Jan 3, 2023

g4dn.8xlarge

(screenshots of the error output attached)

@beckernick
Member Author

beckernick commented Jan 3, 2023

I'd recommend we take further discussion of this to #4879, which includes some potential ideas for ways to reduce it.

I will comment on that issue with this context.

@ravinani

ravinani commented Jan 4, 2023

@hatemr I tried BERTopic with GPU for almost 1M documents. My original embeddings are 1M x 384. I tried to get the probabilities of each topic for every document using all_points_membership_vectors, but the system crashes whenever I try to perform the task.

I am using Colab with a 40GB GPU. Can you please suggest a workaround to get the probabilities of each document for every topic?

(screenshot of the crash attached)

@hatemr

hatemr commented Jan 4, 2023

@ravinani see answer here

jakirkham pushed a commit to jakirkham/cuml that referenced this issue Feb 27, 2023
rapids-bot bot pushed a commit that referenced this issue Mar 31, 2023
Closes #4724

Authors:
  - Tarang Jain (https://github.com/tarang-jain)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #5247
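
With #5247 merged, a minimal sketch of the remaining piece, soft clustering a new set of points. This assumes membership_vector is exposed from the hdbscan module alongside all_points_membership_vectors and mirrors the CPU signature:

import numpy as np
from sklearn.datasets import make_blobs
from cuml.cluster.hdbscan import HDBSCAN, membership_vector

np.random.seed(12)
data, _ = make_blobs(1000)

clusterer = HDBSCAN(min_cluster_size=10, prediction_data=True)
clusterer.fit(data)

# one row of per-cluster membership probabilities for each new point
new_points = np.array([[-9, 7], [-3, 4]])
probs = membership_vector(clusterer, new_points)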