[FEA] approximate_predict function for HDBSCAN #4872
Conversation
// This is temporary. Once faiss is updated, we should be able to
// pass value_idx through to knn.
rmm::device_uvector<int64_t> int64_indices(k * n_search_items, stream);
I think we might no longer need to do this. Can you try removing this and have it compute the output indices directly? I know we no longer need this in the impl detail API in RAFT, and I think we shouldn't need it in the public API anymore either.
It's not a huge deal if we need to leave it in for the meantime; we are still doing it in another spot as well. It will just be nice to remove the additional memory usage.
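To illustrate the memory cost being discussed, here is a rough NumPy stand-in for the device buffers (all names and sizes are illustrative, not the actual cuML code): faiss returns int64 neighbor indices, so a `k * n_search_items` int64 buffer is allocated and then cast down to the 32-bit `value_idx` type, transiently holding both copies.

```python
import numpy as np

# Illustrative sketch only: the extra int64 staging buffer for knn indices,
# later narrowed to the 32-bit index type. Removing the staging step would
# save the larger allocation.
k, n_search_items = 5, 1000
int64_indices = np.zeros((n_search_items, k), dtype=np.int64)   # temporary
value_idx_indices = int64_indices.astype(np.int32)              # final output
print(int64_indices.nbytes, value_idx_indices.nbytes)           # 40000 20000
```

The int64 staging buffer is twice the size of the final int32 output, which is the "additional memory usage" the comment refers to.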
Your implementation is looking great. I found a few things during this pass, mostly pytest related. We want to make sure we're getting full test coverage through the different options here to uncover any potential strange bugs that can arise from the different code paths.
cpp/test/sg/hdbscan_test.cu
Outdated
transformLabels(handle, labels.data(), label_map.data(), params.n_row);

ML::HDBSCAN::detail::Predict::approximate_predict(handle,
I would suggest calling the public API function for this, since approximate_predict has one.
python/cuml/tests/test_hdbscan.py
Outdated
@pytest.mark.parametrize('dataset', dataset_names)
@pytest.mark.parametrize('min_samples', [15])
@pytest.mark.parametrize('min_cluster_size', [10, 25])
@pytest.mark.parametrize('cluster_selection_epsilon', [0.0, 50.0])
Just curious: why the big jump here (from 0.5 to 50.0)?
python/cuml/tests/test_hdbscan.py
Outdated
@pytest.mark.parametrize('max_cluster_size', [0])
@pytest.mark.parametrize('cluster_selection_method', ['eom', 'leaf'])
@pytest.mark.parametrize('cluster_selection_method', ['eom'])
Why aren't we testing the leaf method?
@pytest.mark.parametrize('nrows', [1000, 10000])
@pytest.mark.parametrize('ncols', [10, 25])
@pytest.mark.parametrize('nclusters', [10, 15])
@pytest.mark.parametrize('allow_single_cluster', [False, True])
Why isn't allow_single_cluster being tested here?
@@ -462,30 +461,40 @@ def test_hdbscan_plots():
    assert cuml_agg.minimum_spanning_tree_ is None

@pytest.mark.parametrize('nrows', [1000, 10000])
@pytest.mark.parametrize('ncols', [10, 25])
@pytest.mark.parametrize('nclusters', [10, 15])
Why not test 10 clusters?
python/cuml/tests/test_hdbscan.py
Outdated
@pytest.mark.parametrize('n_points_to_predict', [500])
@pytest.mark.parametrize('dataset', dataset_names)
@pytest.mark.parametrize('min_samples', [15])
@pytest.mark.parametrize('cluster_selection_epsilon', [0.0])
Same question as above: why are we limiting the test cases here (no selection epsilon, only testing eom, etc.)? My main concern is that if we're not enabling all the possible paths through the code with these options, users are going to find strange bugs and unexpected behaviors when they enable them. Not testing these paths at all also has the side effect that things can break as the underlying implementation is updated and we won't know about these breakages.
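As a sketch of the coverage concern, stacked pytest parametrize decorators take the cross-product of their values, so the code-path coverage is determined directly by the parameter lists. A hypothetical grid (parameter values here are only examples, not the PR's actual test matrix) can be enumerated the same way:

```python
import itertools

# Hypothetical parameter grid: enumerate every combination so each code path
# (eom vs. leaf, epsilon off/on, single cluster allowed or not) is exercised
# at least once. Stacked @pytest.mark.parametrize decorators generate exactly
# this cross-product of test cases.
grid = {
    "cluster_selection_method": ["eom", "leaf"],
    "cluster_selection_epsilon": [0.0, 50.0],
    "allow_single_cluster": [False, True],
}
cases = list(itertools.product(*grid.values()))
print(len(cases))  # 8 distinct option combinations
```

Dropping any one of these lists to a single value halves the number of combinations and leaves the corresponding branch untested.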
rerun tests
cpp/src/hdbscan/detail/predict.cuh
Outdated
// Slice core distances (distances to kth nearest neighbor). Note that we slice the
// (min_samples+1)-th to be consistent with Scikit-learn Contrib
Reachability::core_distances<value_idx>(dists.data(),
                                        min_samples + 1,
I understand why you did this here, but it would be nice if we could find a way to keep this logic closer to the public API (non-detail-namespace) version.
Right now we have this increment of min_samples embedded in the build_linkage function, which is in the public namespace, but it's a bit inconsistent and confusing that it's buried in the private namespace for the prediction.
Can we move this out to out_of_sample_predict?
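For context on why the "+1" exists at all, here is a brute-force NumPy sketch (not the cuML implementation): in an all-points knn, each point's nearest neighbor is itself at distance zero, so the distance to the min_samples-th true neighbor sits one column further along the sorted distances.

```python
import numpy as np

def core_distances(X, min_samples):
    # Brute-force pairwise distances; after sorting, column 0 is each point's
    # zero distance to itself, so column min_samples holds the distance to the
    # min_samples-th true neighbor -- the "+1" offset discussed above.
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    d.sort(axis=1)
    return d[:, min_samples]

X = np.array([[0.0], [1.0], [2.0], [10.0]])
print(core_distances(X, min_samples=2))  # [2. 1. 2. 9.]
```

Keeping this offset in one public-namespace location, as suggested, avoids the risk of applying it twice (or not at all) on different code paths.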
from cuml.cluster.prediction import all_points_membership_vectors
from cuml.cluster.prediction import approximate_predict
Similar to #4872 (comment), I see we already have condense_hierarchy here, but I think we want these exposed not in the cuml.cluster.* namespace but in the cuml.cluster.hdbscan.* namespace, for two reasons:
- This functionality is tightly coupled to HDBSCAN. Being in the cluster namespace implicitly suggests it applies more broadly.
- Today, switching between the CPU hdbscan and cuML hdbscan backends is necessary for developers who want to support both. Having to use separate backend namespaces for the clusterer and the functionality is annoying. As an example of what's necessary in this setup, see this gist. I had to use two different if/else branches to choose a backend depending on what I was doing.
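A hypothetical sketch of that pain point (all names below are invented for illustration, not real cuml or hdbscan APIs): when the estimator and its module-level helpers live in the same namespace, a single branch selects the whole backend, instead of one if/else per function family.

```python
from types import SimpleNamespace

# Invented stand-ins for a CPU and a GPU hdbscan module that each expose the
# clusterer alongside its prediction helpers in one namespace.
cpu_hdbscan = SimpleNamespace(approximate_predict=lambda model, X: "cpu result")
gpu_hdbscan = SimpleNamespace(approximate_predict=lambda model, X: "gpu result")

def pick_backend(use_gpu: bool):
    # One decision covers the clusterer and every helper attached to it,
    # rather than separate if/else blocks per namespace.
    return gpu_hdbscan if use_gpu else cpu_hdbscan

backend = pick_backend(use_gpu=False)
print(backend.approximate_predict(None, None))  # cpu result
```

With the helpers split across namespaces, every call site needs its own backend check, which is the duplication the gist demonstrates.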
Discussed offline. We can do this in a follow-up.
cpp/test/sg/hdbscan_test.cu
Outdated
data.data(),
raft::distance::DistanceType::L2SqrtExpanded);
raft::distance::DistanceType::L2SqrtExpanded,
Reminder to document the process you followed to get the results on the CPU side, so that a future developer can reproduce your results.
rerun tests
Codecov Report: Base: 78.02% // Head: 78.04% // Increases project coverage by +0.02%.
Additional details and impacted files:

@@ Coverage Diff @@
## branch-22.10 #4872 +/- ##
================================================
+ Coverage 78.02% 78.04% +0.02%
================================================
Files 180 180
Lines 11385 11424 +39
================================================
+ Hits 8883 8916 +33
- Misses 2502 2508 +6
@gpucibot merge
PR for HDBSCAN approximate_predict

- [x] Building cluster_map
- [x] Modifying PredictionData class
- [x] Obtaining nearest neighbor in MR space
- [x] Computing probability
- [x] Tests

Closes rapidsai#4877
Closes rapidsai#4448

Authors:
- Tarang Jain (https://github.com/tarang-jain)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4872