[FEA] approximate_predict function for HDBSCAN #4872
Conversation
// This is temporary. Once faiss is updated, we should be able to
// pass value_idx through to knn.
rmm::device_uvector<int64_t> int64_indices(k * n_search_items, stream);
I think we might no longer need to do this. Can you try removing this and have it compute the output indices directly? I know we no longer need this in the impl detail API in RAFT, and I think we shouldn't need it in the public API anymore either.
It's not a huge deal if we need to leave it in for the meantime; we are still doing it in another spot as well. It will just be nice to remove the additional memory usage.
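To illustrate the memory cost being discussed, here is a rough NumPy stand-in for the device buffers (all names and sizes are illustrative, not the actual cuML code): faiss returns int64 neighbor indices, so a `k * n_search_items` int64 buffer is allocated and then cast down to the 32-bit `value_idx` type, transiently holding both copies.

```python
import numpy as np

# Illustrative sketch only: the extra int64 staging buffer for knn indices,
# later narrowed to the 32-bit index type. Removing the staging step would
# save the larger allocation.
k, n_search_items = 5, 1000
int64_indices = np.zeros((n_search_items, k), dtype=np.int64)   # temporary
value_idx_indices = int64_indices.astype(np.int32)              # final output
print(int64_indices.nbytes, value_idx_indices.nbytes)           # 40000 20000
```

The int64 staging buffer is twice the size of the final int32 output, which is the "additional memory usage" the comment refers to.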
Your implementation is looking great. I found a few things during this pass, mostly pytest related. We want to make sure we're getting full test coverage through the different options here to uncover any potential strange bugs that can arise from the different code paths.
cpp/test/sg/hdbscan_test.cu
Outdated
transformLabels(handle, labels.data(), label_map.data(), params.n_row);

ML::HDBSCAN::detail::Predict::approximate_predict(handle,
I would suggest calling the public API function for this, since approximate_predict has one.
python/cuml/tests/test_hdbscan.py
Outdated
@pytest.mark.parametrize('dataset', dataset_names)
@pytest.mark.parametrize('min_samples', [15])
@pytest.mark.parametrize('min_cluster_size', [10, 25])
@pytest.mark.parametrize('cluster_selection_epsilon', [0.0, 50.0])
Just curious: why the big jump here (from 0.5 to 50.0)?
python/cuml/tests/test_hdbscan.py
Outdated
@pytest.mark.parametrize('max_cluster_size', [0])
@pytest.mark.parametrize('cluster_selection_method', ['eom', 'leaf'])
@pytest.mark.parametrize('cluster_selection_method', ['eom'])
Why aren't we testing the leaf method?
@pytest.mark.parametrize('nrows', [1000, 10000])
@pytest.mark.parametrize('ncols', [10, 25])
@pytest.mark.parametrize('nclusters', [10, 15])
@pytest.mark.parametrize('allow_single_cluster', [False, True])
Why isn't allow_single_cluster being tested here?
@@ -462,30 +461,40 @@ def test_hdbscan_plots():
    assert cuml_agg.minimum_spanning_tree_ is None

@pytest.mark.parametrize('nrows', [1000, 10000])
@pytest.mark.parametrize('ncols', [10, 25])
@pytest.mark.parametrize('nclusters', [10, 15])
Why not test 10 clusters?
python/cuml/tests/test_hdbscan.py
Outdated
@pytest.mark.parametrize('n_points_to_predict', [500])
@pytest.mark.parametrize('dataset', dataset_names)
@pytest.mark.parametrize('min_samples', [15])
@pytest.mark.parametrize('cluster_selection_epsilon', [0.0])
Same question as above: why are we limiting the test cases here (no selection epsilon, only testing eom, etc.)? My main concern is that if we're not enabling all the possible paths through the code with these options, users are going to find strange bugs and unexpected behaviors when they enable them. Not testing these paths at all also has the side effect that things can break as the underlying implementation is updated and we won't know about these breakages.
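As a sketch of the coverage concern, stacked pytest parametrize decorators take the cross-product of their values, so the code-path coverage is determined directly by the parameter lists. A hypothetical grid (parameter values here are only examples, not the PR's actual test matrix) can be enumerated the same way:

```python
import itertools

# Hypothetical parameter grid: enumerate every combination so each code path
# (eom vs. leaf, epsilon off/on, single cluster allowed or not) is exercised
# at least once. Stacked @pytest.mark.parametrize decorators generate exactly
# this cross-product of test cases.
grid = {
    "cluster_selection_method": ["eom", "leaf"],
    "cluster_selection_epsilon": [0.0, 50.0],
    "allow_single_cluster": [False, True],
}
cases = list(itertools.product(*grid.values()))
print(len(cases))  # 8 distinct option combinations
```

Dropping any one of these lists to a single value halves the number of combinations and leaves the corresponding branch untested.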
rerun tests
cpp/src/hdbscan/detail/predict.cuh
Outdated
// Slice core distances (distances to kth nearest neighbor). Note that we slice the
// (min_samples+1)-th to be consistent with Scikit-learn Contrib
Reachability::core_distances<value_idx>(dists.data(),
                                        min_samples + 1,
I understand why you did this here, but it would be nice if we could find a way to keep this logic closer to the public API (non-detail-namespace) version.
Right now we have this increment of min_samples embedded in the build_linkage function, which is in the public namespace, but it's a bit inconsistent and confusing that it's buried in the private namespace for the prediction.
Can we move this out to out_of_sample_predict?
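For context on why the "+1" exists at all, here is a brute-force NumPy sketch (not the cuML implementation): in an all-points knn, each point's nearest neighbor is itself at distance zero, so the distance to the min_samples-th true neighbor sits one column further along the sorted distances.

```python
import numpy as np

def core_distances(X, min_samples):
    # Brute-force pairwise distances; after sorting, column 0 is each point's
    # zero distance to itself, so column min_samples holds the distance to the
    # min_samples-th true neighbor -- the "+1" offset discussed above.
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    d.sort(axis=1)
    return d[:, min_samples]

X = np.array([[0.0], [1.0], [2.0], [10.0]])
print(core_distances(X, min_samples=2))  # [2. 1. 2. 9.]
```

Keeping this offset in one public-namespace location, as suggested, avoids the risk of applying it twice (or not at all) on different code paths.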
from cuml.cluster.prediction import all_points_membership_vectors
from cuml.cluster.prediction import approximate_predict
Similar to #4872 (comment), I see we already have condense_hierarchy here, but I think we want these exposed not in the cuml.cluster.* namespace but in the cuml.cluster.hdbscan.* namespace, for two reasons:
- This functionality is tightly coupled to HDBSCAN. Being in the cluster namespace implicitly suggests it applies more broadly.
- Today, switching between the CPU hdbscan and cuML hdbscan backends is necessary for developers who want to support both. Having to use separate backend namespaces for the clusterer and the functionality is annoying. As an example of what's necessary in this setup, see this gist. I had to use two different if/else branches to choose a backend depending on what I was doing.
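A hypothetical sketch of that pain point (all names below are invented for illustration, not real cuml or hdbscan APIs): when the estimator and its module-level helpers live in the same namespace, a single branch selects the whole backend, instead of one if/else per function family.

```python
from types import SimpleNamespace

# Invented stand-ins for a CPU and a GPU hdbscan module that each expose the
# clusterer alongside its prediction helpers in one namespace.
cpu_hdbscan = SimpleNamespace(approximate_predict=lambda model, X: "cpu result")
gpu_hdbscan = SimpleNamespace(approximate_predict=lambda model, X: "gpu result")

def pick_backend(use_gpu: bool):
    # One decision covers the clusterer and every helper attached to it,
    # rather than separate if/else blocks per namespace.
    return gpu_hdbscan if use_gpu else cpu_hdbscan

backend = pick_backend(use_gpu=False)
print(backend.approximate_predict(None, None))  # cpu result
```

With the helpers split across namespaces, every call site needs its own backend check, which is the duplication the gist demonstrates.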
Discussed offline. We can do this in a follow-up.
cpp/test/sg/hdbscan_test.cu
Outdated
data.data(),
raft::distance::DistanceType::L2SqrtExpanded);
raft::distance::DistanceType::L2SqrtExpanded,
Reminder to document the process you followed to get the results on the CPU side, so that a future developer can reproduce your results.
rerun tests
Codecov Report: Base: 78.02% // Head: 78.04% // Increases project coverage by +0.02%.
Additional details and impacted files:

@@ Coverage Diff @@
## branch-22.10 #4872 +/- ##
================================================
+ Coverage 78.02% 78.04% +0.02%
================================================
Files 180 180
Lines 11385 11424 +39
================================================
+ Hits 8883 8916 +33
- Misses 2502 2508 +6
@gpucibot merge
PR for HDBSCAN approximate_predict

- [x] Building cluster_map
- [x] Modifying PredictionData class
- [x] Obtaining nearest neighbor in MR space
- [x] Computing probability
- [x] Tests

Closes rapidsai#4877
Closes rapidsai#4448

Authors:
- Tarang Jain (https://github.com/tarang-jain)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4872