Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] All points membership vector for HDBSCAN #4800

Conversation

tarang-jain
Copy link
Contributor

@tarang-jain tarang-jain commented Jul 5, 2022

  • All points distance membership vector
  • All points outlier membership vector
  • All points probability in some cluster
  • All points membership vector
  • Tests

@tarang-jain tarang-jain changed the title All points distance membership vector for HDBSCAN All points membership vector for HDBSCAN Jul 5, 2022
@github-actions github-actions bot added CUDA/C++ Cython / Python Cython or Python issue labels Jul 5, 2022
@tarang-jain tarang-jain force-pushed the fea-all-points-membership-vector-hdbscan branch from b823a4b to dbb7d48 Compare July 5, 2022 20:29
@github-actions github-actions bot removed the Cython / Python Cython or Python issue label Jul 5, 2022
@tarang-jain tarang-jain changed the title All points membership vector for HDBSCAN [FEA] All points membership vector for HDBSCAN Jul 5, 2022
@caryr35 caryr35 added this to PR-WIP in v22.08 Release via automation Jul 12, 2022
@github-actions github-actions bot added the Cython / Python Cython or Python issue label Jul 22, 2022
*/
void hdbscan(const raft::handle_t& handle,
const float* X,
size_t m,
size_t n,
raft::distance::DistanceType metric,
HDBSCAN::Common::HDBSCANParams& params,
HDBSCAN::Common::hdbscan_output<int, float>& out);
HDBSCAN::Common::hdbscan_output<int, float>& out,
bool prediction_data,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add prediction_data to params? Ideally we'd use std::optional for prediction_data_

exemplar_label_offsets.resize(n_selected_clusters_ + 1, handle.get_stream());
deaths.resize(n_clusters, handle.get_stream());
selected_clusters.resize(n_selected_clusters, handle.get_stream());
raft::copy(exemplar_idx.begin(), exemplar_idx_, n_exemplars_, handle.get_stream());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is expensive in both memory usage and runtime. Rather than accepting raw pointers in cache, I would suggest creating an allocate() function that will resize the device_uvector members and then using the getters for the underlying pointers inside build_prediction_data to have the computations populate the values directly. That would remove the additional memory usage.

@@ -255,6 +258,15 @@ void _fit_hdbscan(const raft::handle_t& handle,
* starting at 0 even in the presence of noise (-1)
*/

if (prediction_data) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if(params.prediction_data) or you could also do if(prediction_data_.has_value()) if you use std::optional.


params = ::testing::TestWithParam<SoftClusteringInputs<T, IdxT>>::GetParam();

Logger::get().setLevel(CUML_LEVEL_DEBUG);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably remove this

v22.10 Release automation moved this from PR-WIP to PR-Needs review Aug 11, 2022
@cjnolet
Copy link
Member

cjnolet commented Aug 11, 2022

rerun tests

@tarang-jain tarang-jain requested a review from a team as a code owner August 11, 2022 23:52
@github-actions github-actions bot added the CMake label Aug 11, 2022
@codecov-commenter
Copy link

Codecov Report

Merging #4800 (00b39aa) into branch-22.10 (7a0ab85) will increase coverage by 0.01%.
The diff coverage is 86.27%.

@@               Coverage Diff                @@
##           branch-22.10    #4800      +/-   ##
================================================
+ Coverage         78.02%   78.04%   +0.01%     
================================================
  Files               180      180              
  Lines             11385    11422      +37     
================================================
+ Hits               8883     8914      +31     
- Misses             2502     2508       +6     
Flag Coverage Δ
dask 46.28% <68.62%> (+0.06%) ⬆️
non-dask 67.32% <86.27%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/cuml/common/array.py 95.10% <85.10%> (-2.88%) ⬇️
python/cuml/cluster/__init__.py 100.00% <100.00%> (ø)
python/cuml/thirdparty_adapters/adapters.py 91.54% <100.00%> (+0.05%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

v22.10 Release automation moved this from PR-Needs review to PR-Reviewer approved Aug 26, 2022
@cjnolet
Copy link
Member

cjnolet commented Aug 26, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 80621f0 into rapidsai:branch-22.10 Aug 26, 2022
v22.10 Release automation moved this from PR-Reviewer approved to Done Aug 26, 2022
divyegala pushed a commit to viclafargue/cuml that referenced this pull request Sep 2, 2022
- [x] All points distance membership vector
- [x] All points outlier membership vector
- [x] All points probability in some cluster
- [x] All points membership vector
- [x] Tests

Authors:
  - Tarang Jain (https://github.com/tarang-jain)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4800
ajschmidt8 pushed a commit that referenced this pull request Feb 13, 2023
- [x] All points distance membership vector
- [x] All points outlier membership vector
- [x] All points probability in some cluster
- [x] All points membership vector
- [x] Tests

Authors:
  - Tarang Jain (https://github.com/tarang-jain)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #4800
jakirkham pushed a commit to jakirkham/cuml that referenced this pull request Feb 27, 2023
- [x] All points distance membership vector
- [x] All points outlier membership vector
- [x] All points probability in some cluster
- [x] All points membership vector
- [x] Tests

Authors:
  - Tarang Jain (https://github.com/tarang-jain)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4800
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CUDA/C++ Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

3 participants