[FEA] membership_vector for HDBSCAN #5247
Conversation
```cpp
 * @param[out] indptr CSR indptr of parents array after sort
 */
template <typename value_idx, typename value_t>
void softmax(const raft::handle_t& handle, value_t* data, value_idx n, size_t m)
```
We have to create a copy of the data in order to avoid inconsistency during normalization (for numerical stability).
That doesn't sound right; we shouldn't have to make a copy here. If these operations are being scheduled on the same CUDA stream, it should be guaranteed that any prior operations scheduled on compute resources are complete before new operations are scheduled.
One thing we can do to test the theory that there's a data race somewhere is to use the input directly (instead of copying) and call `handle.sync_stream()` before and after computing the argmax. Allocating more memory will also cause a device synchronization, which would have the same effect as doing a stream sync.
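For illustration, the experiment being suggested might look like this (a sketch, not code from the PR; the view names follow the diff later in this thread):

```cpp
// Sketch: skip the copy, operate on `data` directly, and bracket the argmax
// with explicit stream syncs to test whether a data race is the real culprit.
handle.sync_stream();  // drain all prior work on the handle's stream
raft::matrix::argmax(handle, data_const_view, argmax_view);
handle.sync_stream();  // ensure the argmax result is ready before it is consumed
// ... then do the in-place exp/subtract on `data` ...
```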
What I meant was that if this operation:
```cpp
data[idx] = exp(data_copy[idx] - data_copy[n*(idx / n) + membership_argmax[idx / n]]);
```
were replaced by this:
```cpp
data[idx] = exp(data[idx] - data[n*(idx / n) + membership_argmax[idx / n]]);
```
then it is possible that `data[n*(idx / n) + membership_argmax[idx / n]]` gets modified while (or before) `data[idx]` is being computed, since both are modified in parallel in the same `thrust::for_each` invocation.
I don't think you need to copy the whole dataset to do this, though. It looks like what you want is to subtract the value in a specific column of each row from each value in `data`, right? Instead of copying all `m*n` entries from the input, why not just create an array of size `m` that contains the max value for each row? That way you've reduced the problem down to the number of rows, and I also think it would make your arithmetic easier to follow.
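A sketch of that suggestion (the `row_max` buffer is hypothetical; `counting` and `exec_policy` are as used elsewhere in this file):

```cpp
// Reduce each row to its max first: O(m) extra memory instead of an m*n copy.
// The race also disappears, because `row_max` is a separate buffer that the
// for_each below only reads.
rmm::device_uvector<value_t> row_max(m, stream);
// ... fill row_max[i] with the max of row i (see the reduce suggestion below) ...

thrust::for_each(exec_policy,
                 counting,
                 counting + (m * n),
                 [data, row_max = row_max.data(), n] __device__(auto idx) {
                   auto row  = idx / n;  // row index computed once and reused
                   data[idx] = exp(data[idx] - row_max[row]);
                 });
```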
Also, I would highly suggest using the `raft::matrix::map_offset` primitive in raft for this. The benefit of using the raft primitives (even if some of them end up just wrapping thrust calls) is that they provide a facade layer for future optimizations.
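A hedged sketch of what that could look like (in recent raft the primitive lives in `raft/linalg/map.cuh` as `raft::linalg::map_offset`, so the exact namespace and signature may differ; `row_max` is the hypothetical buffer from the sketch above):

```cpp
#include <raft/linalg/map.cuh>

// map_offset hands each flat offset to the op and writes the returned value,
// replacing the thrust::for_each + counting-iterator pattern above.
auto out = raft::make_device_vector_view<value_t, value_idx>(data, static_cast<value_idx>(m * n));
raft::linalg::map_offset(handle, out, [data, row_max = row_max.data(), n] __device__(value_idx idx) {
  auto row = idx / n;
  return exp(data[idx] - row_max[row]);
});
```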
cpp/src/hdbscan/detail/utils.h (outdated):
```cpp
rmm::device_uvector<value_idx> membership_argmax(m, stream);

raft::matrix::argmax(
```
We should be calling the mdspan variants for new code. The raw pointer versions are deprecated.
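For illustration, the mdspan form might look roughly like this (a sketch; the diff later in this thread shows the change that was actually made):

```cpp
// Wrap the raw pointers in mdspan views, then call the mdspan overload.
auto data_view   = raft::make_device_matrix_view<const value_t, value_idx, raft::row_major>(
  data, static_cast<value_idx>(m), n);
auto argmax_view = raft::make_device_vector_view<value_idx, value_idx>(
  membership_argmax.data(), static_cast<value_idx>(m));
raft::matrix::argmax(handle, data_view, argmax_view);
```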
cpp/src/hdbscan/detail/utils.h (outdated):
```cpp
 data_copy = data_copy.data(),
 membership_argmax = membership_argmax.data(),
 n] __device__(auto idx) {
   data[idx] = exp(data_copy[idx] - data_copy[n*(idx / n) + membership_argmax[idx / n]]);
```
We should probably consolidate this division so we only have to do it once. We should also try to use the fast int division utility in raft.
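A sketch of both suggestions combined (assuming raft's `FastIntDiv`, which precomputes division magic numbers on the host so the device-side divide is cheap; the operator overloads and integer widths are hedged):

```cpp
#include <raft/util/fast_int_div.cuh>

raft::util::FastIntDiv n_div(n);  // host-side setup; passed by value into the lambda
thrust::for_each(exec_policy,
                 counting,
                 counting + (m * n),
                 [data, data_copy = data_copy.data(),
                  membership_argmax = membership_argmax.data(),
                  n, n_div] __device__(auto idx) {
                   int row   = idx / n_div;  // the division happens exactly once
                   data[idx] = exp(data_copy[idx] - data_copy[row * n + membership_argmax[row]]);
                 });
```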
cpp/src/hdbscan/detail/utils.h (outdated):
```cpp
auto data_const_view = raft::make_device_matrix_view<const value_t, value_idx, raft::row_major>(data, (int)m, n);
auto argmax_view = raft::make_device_vector_view<value_idx, value_idx>(argmax.data(), (int)m);

raft::matrix::argmax(
```
Do you really need the `argmax` here? It looks like you are computing the argmax just to then take the actual max. Could you just do a max instead?
If I'm reading the math correctly here, it looks like what you want is an Linf normalization, right (dividing each row by the max of the row)? If this is the case, there are other prims in `raft::linalg` that could make this easier as well. I guess this assumes log space so it's actually a subtraction, but same effect computationally.
I was looking at the raft docs and did not find `raft::matrix::max`, so I did an argmax. Yes, we are dividing each row by the max of the row. The only issue is that we don't want the exponentials to be computed before the max is subtracted, due to numerical overflow. So while `exp(vector)` followed by Linf normalization would mathematically be the same, it can still lead to numerical overflow.
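For reference, the identity at play (standard max-subtraction stabilization, not code from the PR):

```math
\frac{e^{x_i}}{\max_j e^{x_j}} = e^{x_i - \max_j x_j}
```

The right-hand side is what the code computes: every exponent is at most zero, so nothing overflows, whereas exponentiating first and then Linf-normalizing can overflow before the division ever happens.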
I think what you want here is a row-level reduce that uses a `max` instead of addition. There are primitives to do this in `raft::linalg` that should take a custom lambda (and we have a max lambda that can be used).
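A hedged sketch of that (assuming the mdspan overload of `raft::linalg::reduce` and `raft::max_op` from `raft/core/operators.hpp`; exact signatures vary by raft version, and `row_max`/`data_const_view` are the names used earlier in this thread):

```cpp
#include <limits>

#include <raft/core/operators.hpp>
#include <raft/linalg/reduce.cuh>

// Row-wise max via the generic reduce primitive: seed with the lowest
// representable value and swap the default add_op for max_op.
auto row_max_view = raft::make_device_vector_view<value_t, value_idx>(
  row_max.data(), static_cast<value_idx>(m));
raft::linalg::reduce(handle,
                     data_const_view,
                     row_max_view,
                     std::numeric_limits<value_t>::lowest(),  // init for a max reduce
                     raft::linalg::Apply::ALONG_ROWS,
                     false,                // not in place
                     raft::identity_op(),  // per-element op
                     raft::max_op());      // reduction op
```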
Looking good. Mostly minor things. I want to stress that we need to use raft calls wherever possible (mostly `map_offset` in this case) so that we can centralize operations we find ourselves using quite frequently. This helps optimizations to these functions propagate across the board.
```diff
@@ -28,6 +28,7 @@ __global__ void merge_height_kernel(value_t* heights,
                                      value_idx* parents,
                                      size_t m,
                                      value_idx n_selected_clusters,
+                                     MLCommon::FastIntDiv n,
```
Can you use the version of this in `raft::util`, please? This should really be removed from cuml altogether now that we have a version in raft.
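Presumably just a type and include swap, something like (header path assumed):

```cpp
#include <raft/util/fast_int_div.cuh>
// ...
raft::util::FastIntDiv n,  // instead of MLCommon::FastIntDiv n,
```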
```diff
 {
   auto stream = handle.get_stream();
   auto exec_policy = handle.get_thrust_policy();

   auto counting = thrust::make_counting_iterator<value_idx>(0);

-  rmm::device_uvector<value_t> exemplars_dense(n_exemplars * n, stream);
+  rmm::device_uvector<value_t> exemplars_dense(n_exemplars * 1000, stream);
```
Should this be hardcoded to 1k?
```cpp
  // hierarchy
  rmm::device_uvector<value_t> nearest_cluster_max_lambda(n_prediction_points, stream);

  thrust::for_each(exec_policy,
```
Please use `raft::matrix::map_offset` for this operation.
```diff
   prob_in_some_cluster[idx] =
-    heights[idx * n_selected_clusters + (int)height_argmax[idx]] / max_lambda;
+    heights[idx * n_selected_clusters + height_argmax[idx]] / max_lambda;
```
Please compute `height_argmax[idx]` once and store it off to reuse instead of having to do this load multiple times.
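For illustration (a sketch; any other uses of `height_argmax[idx]` in this kernel are not shown in the excerpt):

```cpp
// Load the argmax index into a register once and reuse it below.
auto best_cluster = height_argmax[idx];
prob_in_some_cluster[idx] =
  heights[idx * n_selected_clusters + best_cluster] / max_lambda;
```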
```cpp
                   value_idx n_leaves,
                   size_t n_prediction_points)
{
  value_idx idx = blockDim.x * blockIdx.x + threadIdx.x;
```
This looks like another candidate for `map_offset`, right?
```cpp
    return;
  };

  thrust::for_each(
```
`raft::matrix::map_offset`, please.
```cpp
  raft::linalg::norm(handle, data_const_view, linf_norm_view, raft::linalg::LinfNorm, raft::linalg::Apply::ALONG_ROWS);

  raft::linalg::matrix_vector_op(handle, data_const_view, linf_norm_const_view, data_view, raft::linalg::Apply::ALONG_COLUMNS, [] __device__(value_t mat_in, value_t vec_in) {
```
much better!
```diff
@@ -122,7 +122,7 @@ if(BUILD_CUML_TESTS)
     ConfigureTest(PREFIX SG NAME GENETIC_PARAM_TEST PATH sg/genetic/param_test.cu OPTIONAL ML_INCLUDE)
   endif()

-  if("${CMAKE_CUDA_COMPILER_VERSION}" VERSION_LESS_EQUAL "11.2")
+  if("${CMAKE_CUDA_COMPILER_VERSION}" VERSION_GREATER_EQUAL "11.2")
```
Can you check whether we can remove this conditional altogether? I think we want to strive to have these tests pass on all CUDA versions.
```cpp
#include <algorithm>

#include "../condensed_hierarchy.cu"
#include <common/fast_int_div.cuh>

#include <thrust/copy.h>
#include <thrust/execution_policy.h>
```
We should eventually move `Utils::normalize` into RAFT. These are very useful functions. We have functions for norms, but normalization is just as important and pops up just as frequently.
```cpp
 * @param[out] m number of rows
 */
template <typename value_idx, typename value_t>
void softmax(const raft::handle_t& handle, value_t* data, value_idx n, size_t m)
```
This could be moved into raft at some point too, since it's just a special case of normalizing into a multinomial distribution.
LGTM! Can you create RAFT issues for the few comments I made about primitives we could move over to RAFT? It would also help if you could reference the issues in the code here with a small note that we'd like to eventually move those computations over. It just helps to keep it on our radar (and for future eyes to know our plans).
/merge |
Closes #4724