Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] approximate_predict function for HDBSCAN #4872

Conversation

tarang-jain
Copy link
Contributor

@tarang-jain tarang-jain commented Aug 22, 2022

PR for HDBSCAN approximate_predict

  • Building cluster_map
  • Modifying PredictionData class
  • Obtaining nearest neighbor in MR space
  • Computing probability
  • Tests

Closes #4877
Closes #4448


// This is temporary. Once faiss is updated, we should be able to
// pass value_idx through to knn.
rmm::device_uvector<int64_t> int64_indices(k * n_search_items, stream);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might no longer need to do this. Can you try removing this and have it compute trhe output indices directly? I know we no longer need this in the impl detail API in RAFT and I think we shouldn't need it in the public API anymore either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a huge deal if we need to leave it in for the meantime- we are still doing it in another spot as well. Will just be nice to remove the additional memory usage.

Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your implementation is looking great. I found a few things during this pass, mostly pytest related. We want to make sure we're getting full test coverage through the different options here to uncovery any potential strange bugs that can arise from the different code paths.


// This is temporary. Once faiss is updated, we should be able to
// pass value_idx through to knn.
rmm::device_uvector<int64_t> int64_indices(k * n_search_items, stream);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a huge deal if we need to leave it in for the meantime- we are still doing it in another spot as well. Will just be nice to remove the additional memory usage.


transformLabels(handle, labels.data(), label_map.data(), params.n_row);

ML::HDBSCAN::detail::Predict::approximate_predict(handle,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest calling the public API function for this, since the approximate predict has one

python/cuml/tests/test_hdbscan.py Show resolved Hide resolved
@pytest.mark.parametrize('dataset', dataset_names)
@pytest.mark.parametrize('min_samples', [15])
@pytest.mark.parametrize('min_cluster_size', [10, 25])
@pytest.mark.parametrize('cluster_selection_epsilon', [0.0, 50.0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious- why the big jump here (from 0.5 to 50.0)?

@pytest.mark.parametrize('max_cluster_size', [0])
@pytest.mark.parametrize('cluster_selection_method', ['eom', 'leaf'])
@pytest.mark.parametrize('cluster_selection_method', ['eom'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why aren't we testing the leaf method?

@pytest.mark.parametrize('nrows', [1000, 10000])
@pytest.mark.parametrize('ncols', [10, 25])
@pytest.mark.parametrize('nclusters', [10, 15])
@pytest.mark.parametrize('allow_single_cluster', [False, True])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't allow_single_cluster being tested here?

@@ -462,30 +461,40 @@ def test_hdbscan_plots():
assert cuml_agg.minimum_spanning_tree_ is None


@pytest.mark.parametrize('nrows', [1000, 10000])
@pytest.mark.parametrize('ncols', [10, 25])
@pytest.mark.parametrize('nclusters', [10, 15])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not test 10 clusters?

@pytest.mark.parametrize('n_points_to_predict', [500])
@pytest.mark.parametrize('dataset', dataset_names)
@pytest.mark.parametrize('min_samples', [15])
@pytest.mark.parametrize('cluster_selection_epsilon', [0.0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as above- why are we limiting the test cases here (no selection epsilon, only testing eom, etc..) My main concern is that if we're not enabling all the possible paths through the code with these options, the users are going to find strange bugs and unexpected behaviors when they enable these options. Not testing these paths at all also has the side effect that things can break as the underlying implementation is updated and we won't know about these breakages.

v22.10 Release automation moved this from PR-WIP to PR-Needs review Aug 30, 2022
@dantegd dantegd added the 4 - Waiting on Author Waiting for author to respond to review label Aug 30, 2022
@cjnolet
Copy link
Member

cjnolet commented Aug 31, 2022

rerun tests

cpp/src/hdbscan/detail/reachability.cuh Show resolved Hide resolved
cpp/src/hdbscan/runner.h Show resolved Hide resolved
// Slice core distances (distances to kth nearest neighbor). Note that we slice the
// (min_samples+1)-th to be consistent with Scikit-learn Contrib
Reachability::core_distances<value_idx>(dists.data(),
min_samples + 1,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why you did this here, but it would be nice if we could find away to keep this logic closer to the public API (non-detai-namespace) version.

Right now we have this increment of min_samples embedded in the build_linkage function which is in the public namespace but it's a bit inconsistent and confusing that it's buried in the private namespace for the prediction.

Can we move this out to out_of_sample_predict?

Comment on lines +22 to +23
from cuml.cluster.prediction import all_points_membership_vectors
from cuml.cluster.prediction import approximate_predict
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to #4872 (comment) , I see we already have condense_hierarchy here but I think we want these exposed not in the cuml.cluster.* namespace but in the cuml.cluster.hdbscan.* namespace, for two reasons.

  • This functionality is tightly coupled to HDBSCAN. Being in the cluster namespace implicitly suggests it applies more broadly
  • Today, switching between the CPU hdbscan and cuML hdbscan backend is necessary for developers who want to provide both functionality. Having to use separate backend namespaces for the clusterer and the functionality is annoying. As an example of what's necessary in this setup, see this gist. I had to use two different if/else branches to choose a backend depending on what I was doing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. We can do this in a follow up

data.data(),
raft::distance::DistanceType::L2SqrtExpanded);
raft::distance::DistanceType::L2SqrtExpanded,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to document the process you followed to get the results on the cpu side so that a future developer can reproduce your results.

@tarang-jain
Copy link
Contributor Author

rerun tests

@codecov-commenter
Copy link

Codecov Report

Base: 78.02% // Head: 78.04% // Increases project coverage by +0.02% 🎉

Coverage data is based on head (e17b4c2) compared to base (7a0ab85).
Patch coverage: 85.18% of modified lines in pull request are covered.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.10    #4872      +/-   ##
================================================
+ Coverage         78.02%   78.04%   +0.02%     
================================================
  Files               180      180              
  Lines             11385    11424      +39     
================================================
+ Hits               8883     8916      +33     
- Misses             2502     2508       +6     
Flag Coverage Δ
dask 46.29% <68.51%> (+0.07%) ⬆️
non-dask 67.33% <85.18%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/cuml/benchmark/nvtx_benchmark.py 0.00% <0.00%> (ø)
python/cuml/common/array.py 95.10% <85.10%> (-2.88%) ⬇️
python/cuml/cluster/__init__.py 100.00% <100.00%> (ø)
python/cuml/metrics/__init__.py 100.00% <100.00%> (ø)
python/cuml/thirdparty_adapters/adapters.py 91.54% <100.00%> (+0.05%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

v22.10 Release automation moved this from PR-Needs review to PR-Reviewer approved Sep 3, 2022
@cjnolet
Copy link
Member

cjnolet commented Sep 3, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit cb2d681 into rapidsai:branch-22.10 Sep 3, 2022
v22.10 Release automation moved this from PR-Reviewer approved to Done Sep 3, 2022
jakirkham pushed a commit to jakirkham/cuml that referenced this pull request Feb 27, 2023
PR for HDBSCAN approximate_predict

- [x] Building cluster_map
- [x] Modifying PredictionData class
- [x] Obtaining nearest neighbor in MR space
- [x] Computing probability
- [x] Tests

Closes rapidsai#4877
Closes rapidsai#4448

Authors:
  - Tarang Jain (https://github.com/tarang-jain)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4872
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Waiting on Author Waiting for author to respond to review CMake CUDA/C++ Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
No open projects
5 participants