[FEA] membership_vector for HDBSCAN #5247

Merged (33 commits) into rapidsai:branch-23.04 on Mar 31, 2023

Conversation

@tarang-jain (Contributor) commented Feb 22, 2023:

Closes #4724

github-actions bot added the Cython / Python label on Feb 23, 2023
* @param[out] indptr CSR indptr of parents array after sort
*/
template <typename value_idx, typename value_t>
void softmax(const raft::handle_t& handle, value_t* data, value_idx n, size_t m)
@tarang-jain (Contributor, Author) commented:

We have to create a copy of the data in order to avoid inconsistent reads during the normalization (which is done for numerical stability).

@cjnolet (Member) commented:

That doesn't sound right; we shouldn't have to make a copy here. If these operations are being scheduled on the same CUDA stream, it is guaranteed that any prior operations scheduled on compute resources are complete before new operations are scheduled.

One thing we can do to test the theory that there's a data race somewhere is to use the input directly (instead of copying) and call handle.sync_stream() before and after computing the argmax. Allocating more memory will also cause a device synchronization, which would have the same effect as doing a stream sync.
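For illustration, a minimal sketch of that experiment (view names such as data_const_view and argmax_view are assumed, not from the PR):

// Drop the copy and bracket the argmax with explicit stream syncs. If the
// output becomes correct, the culprit is a data race inside a kernel, not
// stream ordering (operations queued on one stream already run in order).
handle.sync_stream();                                        // drain prior work
raft::matrix::argmax(handle, data_const_view, argmax_view);  // row-wise argmax
handle.sync_stream();                                        // results now visible
// ...then run the normalization on data in place, without data_copy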

@tarang-jain (Contributor, Author) commented:

What I meant was that if this operation:

data[idx] = exp(data_copy[idx] - data_copy[n*(idx / n) + membership_argmax[idx / n]])

were replaced by this:

data[idx] = exp(data[idx] - data[n*(idx / n) + membership_argmax[idx / n]])

then data[n*(idx / n) + membership_argmax[idx / n]] could be modified while (or before) data[idx] is being computed, since both are modified in parallel in the same thrust::for_each invocation.
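For illustration, a condensed sketch of that racy in-place variant (captures and iterator bounds assumed from the surrounding code):

// Each thread reads its row's max element from data and writes exp(...) back
// into data within the same thrust::for_each, so the element at the argmax
// position may be overwritten before other threads of the same row read it.
thrust::for_each(exec_policy, counting, counting + m * n,
                 [data, membership_argmax, n] __device__(auto idx) {
                   auto row      = idx / n;
                   auto row_best = data[n * row + membership_argmax[row]];  // racy read
                   data[idx]     = exp(data[idx] - row_best);               // in-place write
                 });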

@cjnolet (Member) commented Mar 7, 2023:

I don't think you need to copy the whole dataset to do this, though. It looks like what you want is to subtract the value in a specific column of each row from every value in data, right? Instead of copying all m*n entries from the input, why not just create an array of size m that contains the max value of each row? That way you've reduced your problem down to the number of rows. I also think it would make your arithmetic easier to follow.
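A sketch of that alternative, with hypothetical names (row_max replaces the full copy):

// Materialize only the m row maxima; reads come from row_max, which this
// kernel never writes, so the in-place update of data no longer races.
rmm::device_uvector<value_t> row_max(m, stream);
// ...fill row_max with each row's maximum (e.g. a row-wise max reduce)...
thrust::for_each(exec_policy, counting, counting + m * n,
                 [data, row_max = row_max.data(), n] __device__(auto idx) {
                   data[idx] = exp(data[idx] - row_max[idx / n]);
                 });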

@cjnolet (Member) commented:

Also, I would highly suggest using the raft::matrix::map_offset primitive in raft for this. The benefit of using the raft primitives (even if some of them end up just wrapping thrust calls) is that they provide a facade layer for future optimizations.
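Roughly, assuming map_offset takes an output mdspan plus a functor that receives the flat element offset (data_out_view here is a hypothetical view over all m * n entries of data):

// The same elementwise update expressed through the raft facade rather than
// a raw thrust::for_each. Each thread reads and writes only its own element,
// so aliasing data with the output view is safe.
raft::matrix::map_offset(handle, data_out_view,
                         [data, row_max = row_max.data(), n] __device__(auto idx) {
                           return exp(data[idx] - row_max[idx / n]);
                         });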


rmm::device_uvector<value_idx> membership_argmax(m, stream);

raft::matrix::argmax(
@cjnolet (Member) commented:

We should be calling the mdspan variants for new code. The raw pointer versions are deprecated.
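For instance, matching the views that appear later in this PR:

// mdspan-based argmax: wrap the raw pointers in views and call the
// non-deprecated overload.
auto data_const_view = raft::make_device_matrix_view<const value_t, value_idx, raft::row_major>(
  data, (value_idx)m, n);
auto argmax_view =
  raft::make_device_vector_view<value_idx, value_idx>(membership_argmax.data(), (value_idx)m);
raft::matrix::argmax(handle, data_const_view, argmax_view);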

github-actions bot added the CMake label on Feb 28, 2023
data_copy = data_copy.data(),
membership_argmax = membership_argmax.data(),
n] __device__(auto idx) {
data[idx] = exp(data_copy[idx] - data_copy[n*(idx / n) + membership_argmax[idx / n]]);
@cjnolet (Member) commented:

We should probably consolidate this division so we only have to do it once. We should also try to use the fast int division utility in raft.
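Something along these lines (raft::util::FastIntDiv is assumed to mirror the MLCommon utility referenced elsewhere in this PR):

// Hoist the division so it happens once per element, and let FastIntDiv turn
// the runtime division by n into a precomputed multiply/shift.
raft::util::FastIntDiv n_div(n);
// ...inside the device lambda:
//   auto row  = idx / n_div;  // single fast division
//   data[idx] = exp(data_copy[idx] - data_copy[n * row + membership_argmax[row]]);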

@tarang-jain marked this pull request as ready for review on March 1, 2023.
@tarang-jain requested review from a team as code owners on March 1, 2023.
auto data_const_view = raft::make_device_matrix_view<const value_t, value_idx, raft::row_major>(data, (int)m, n);
auto argmax_view = raft::make_device_vector_view<value_idx, value_idx>(argmax.data(), (int)m);

raft::matrix::argmax(
@cjnolet (Member) commented Mar 7, 2023:

Do you really need the argmax here? It looks like you are computing the argmax just to then take the actual max. Could you just do a max instead?

@cjnolet (Member) commented Mar 7, 2023:

If I'm reading the math correctly here, it looks like what you want is an Linf normalization, right (dividing each row by the max of the row)? If that's the case, there are other prims in raft::linalg that could make this easier as well. I guess this assumes log space, so it's actually a subtraction, but it's the same effect computationally.

@tarang-jain (Contributor, Author) commented Mar 7, 2023:

I was looking at the raft docs and did not find raft::matrix::max, so I used an argmax. Yes, we are dividing each row by the max of the row. The only issue is that we don't want the exponentials to be computed beforehand, due to numerical overflow. So while exp(vector) followed by Linf normalization would be mathematically the same, it can still lead to numerical overflow.

@cjnolet (Member) commented:

I think what you want here is a row-level reduce that uses a max instead of addition. There are primitives to do this in raft::linalg that should take a custom lambda (and we have a max lambda that can be used).
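As a sketch, using raft's reduce with a max functor in place of addition (arguments assumed from the raw-pointer reduce variant; raft provides max_op and identity_op functors):

// Row-wise max reduce: one output per row; init with the lowest representable
// value so any element wins the first comparison.
raft::linalg::reduce(row_max.data(), data,
                     n, (value_idx)m,                         // D columns, N rows
                     std::numeric_limits<value_t>::lowest(),  // init for max
                     true,                                    // row-major input
                     true,                                    // reduce along rows
                     stream,
                     false,                                   // not in-place
                     raft::identity_op(),                     // per-element op
                     raft::max_op());                         // reduction op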

@cjnolet (Member) left a review comment:

Looking good. Mostly minor things. I want to stress that we need to use raft calls wherever possible (mostly map_offset in this case) so that we can centralize operations we find ourselves using frequently. This helps optimizations to these functions propagate across the board.

@@ -28,6 +28,7 @@ __global__ void merge_height_kernel(value_t* heights,
value_idx* parents,
size_t m,
value_idx n_selected_clusters,
MLCommon::FastIntDiv n,
@cjnolet (Member) commented:

Can you use the version of this in raft::util please? This should really be removed from cuml altogether now that we have a version in raft.

{
auto stream = handle.get_stream();
auto exec_policy = handle.get_thrust_policy();

auto counting = thrust::make_counting_iterator<value_idx>(0);

-  rmm::device_uvector<value_t> exemplars_dense(n_exemplars * n, stream);
+  rmm::device_uvector<value_t> exemplars_dense(n_exemplars * 1000, stream);
@cjnolet (Member) commented:

Should this be hardcoded to 1k?

// hierarchy
rmm::device_uvector<value_t> nearest_cluster_max_lambda(n_prediction_points, stream);

thrust::for_each(exec_policy,
@cjnolet (Member) commented:

Please use raft::matrix::map_offset for this operation.

   prob_in_some_cluster[idx] =
-    heights[idx * n_selected_clusters + (int)height_argmax[idx]] / max_lambda;
+    heights[idx * n_selected_clusters + height_argmax[idx]] / max_lambda;
@cjnolet (Member) commented:

Please compute height_argmax[idx] once and store it off to reuse instead of having to do this load multiple times.
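I.e., something like:

// Load the argmax once into a register and reuse it.
auto best = height_argmax[idx];
prob_in_some_cluster[idx] = heights[idx * n_selected_clusters + best] / max_lambda;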

value_idx n_leaves,
size_t n_prediction_points)
{
value_idx idx = blockDim.x * blockIdx.x + threadIdx.x;
@cjnolet (Member) commented:

This looks like another candidate for map_offset, right?

return;
};

thrust::for_each(
@cjnolet (Member) commented:

raft::matrix::map_offset please.


raft::linalg::norm(handle, data_const_view, linf_norm_view, raft::linalg::LinfNorm, raft::linalg::Apply::ALONG_ROWS);

raft::linalg::matrix_vector_op(handle, data_const_view, linf_norm_const_view, data_view, raft::linalg::Apply::ALONG_COLUMNS, [] __device__(value_t mat_in, value_t vec_in) {
  return mat_in / vec_in;  // divide each entry by its row's Linf norm
});
@cjnolet (Member) commented:

much better!

@@ -122,7 +122,7 @@ if(BUILD_CUML_TESTS)
ConfigureTest(PREFIX SG NAME GENETIC_PARAM_TEST PATH sg/genetic/param_test.cu OPTIONAL ML_INCLUDE)
endif()

if("${CMAKE_CUDA_COMPILER_VERSION}" VERSION_LESS_EQUAL "11.2")
if("${CMAKE_CUDA_COMPILER_VERSION}" VERSION_GREATER_EQUAL "11.2")
@cjnolet (Member) commented:

Can you check whether we can remove this conditional altogether? I think we want to strive to have these tests pass on all CUDA versions.


#include <algorithm>

#include "../condensed_hierarchy.cu"
#include <common/fast_int_div.cuh>

#include <thrust/copy.h>
#include <thrust/execution_policy.h>
@cjnolet (Member) commented:

We should eventually move Utils::normalize into RAFT. These are very useful functions. We have functions for norms, but normalization is just as important and pops up just as frequently.

* @param[out] m number of rows
*/
template <typename value_idx, typename value_t>
void softmax(const raft::handle_t& handle, value_t* data, value_idx n, size_t m)
@cjnolet (Member) commented:

This could be moved into raft at some point too, since it's just a special case of normalizing into a multinomial distribution.
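Conceptually, a host-side sketch of the stable softmax this function special-cases (the max shift keeps exp from overflowing without changing the result):

// softmax(x)_j = exp(x_j - max(x)) / sum_k exp(x_k - max(x)), per row of the
// m x n row-major matrix data.
for (size_t i = 0; i < m; ++i) {
  value_t* row = data + i * n;
  value_t mx   = *std::max_element(row, row + n);
  value_t sum  = value_t{0};
  for (value_idx j = 0; j < n; ++j) {
    row[j] = std::exp(row[j] - mx);  // shifted exponent is <= 0: no overflow
    sum += row[j];
  }
  for (value_idx j = 0; j < n; ++j) {
    row[j] /= sum;  // each row now sums to 1 (a multinomial distribution)
  }
}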

@cjnolet (Member) left a review comment:

LGTM! Can you create RAFT issues for the few comments I made about primitives we could move over to RAFT? It would also help if you could reference those issues in the code here with a small note that we'd like to eventually move those computations over. It just helps to keep it on our radar (and lets future eyes know our plans).

@cjnolet (Member) commented Mar 31, 2023:

/merge

rapids-bot merged commit 79bfc47 into rapidsai:branch-23.04 on Mar 31, 2023.
Labels: CMake, CUDA/C++, Cython / Python, feature request, non-breaking

Successfully merging this pull request may close these issues.

[FEA] Support for HDBSCAN membership_vector and all_points_membership_vectors
3 participants