Use fusedL2NN in ANN kmeans #821

Merged · 20 commits into rapidsai:branch-22.10 · Oct 4, 2022

Conversation

@Nyrio (Contributor) commented Sep 12, 2022:

The bottleneck of ivf_flat::build is the label prediction, currently done with GEMM + argmin. This PR gives a 25-30% speedup by replacing it with the fused primitive.
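For context, a rough sketch of what the change amounts to, assuming the raft 22.10 fusedL2NNMinReduce primitive from raft/distance/fused_l2_nn.cuh (the argument order is my recollection of that header, and the wrapper name and buffer names are hypothetical): for each of the m dataset rows it returns the nearest of the n cluster centers in a single fused pass, never materializing the [m, n] distance matrix that GEMM + argmin would build.

#include <raft/distance/fused_l2_nn.cuh>
#include <cub/cub.cuh>

// Hypothetical wrapper around the fused primitive (a sketch, not this PR's code).
void predict_labels_sketch(const float* dataset,        // [m, k], row-major
                           const float* centers,        // [n, k], row-major
                           const float* dataset_norms,  // [m], squared L2 norms
                           const float* center_norms,   // [n], squared L2 norms
                           int m, int n, int k,
                           cub::KeyValuePair<int, float>* out,  // [m]: (label, distance)
                           void* workspace,              // >= m * sizeof(int) bytes
                           cudaStream_t stream)
{
  // One kernel: distances are computed tile by tile and reduced to the
  // per-row minimum on the fly.
  raft::distance::fusedL2NNMinReduce(out, dataset, centers,
                                     dataset_norms, center_norms,
                                     m, n, k, workspace,
                                     false /* sqrt */, true /* initOutBuffer */, stream);
}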

@Nyrio Nyrio requested a review from a team as a code owner September 12, 2022 15:40
@github-actions github-actions bot added the cpp label Sep 12, 2022
@Nyrio Nyrio added labels 3 - Ready for Review, non-breaking (Non-breaking change), and improvement (Improvement / enhancement to an existing function) Sep 12, 2022
@achirkin (Contributor) left a comment:

That's a very nice speedup! Two main points:

  1. Would you consider templatizing fusedL2NN to allow unsigned labels instead? It feels like more internals of ann_kmeans would still need to be changed to properly switch to LabelT if we go this way.
  2. Out of curiosity, did you check whether argmin_along_rows could be optimized to achieve the same speedup without fusing? I remember it looked awfully slow in some settings in the nsys timeline when I tested it...

(Three review threads on cpp/include/raft/spatial/knn/detail/ann_utils.cuh were marked outdated and resolved.)
inline void predict_float_core(const handle_t& handle,
const float* centers,
uint32_t n_clusters,
uint32_t dim,
const float* dataset,
const float* dataset_norm,
size_t n_rows,
Contributor:

Should n_rows and all variables/arrays referring to the dataset indices/sizes be the same type as indices, i.e. IdxT?

@Nyrio (Contributor Author):

For now, I left dim and n_clusters in uint32_t, replaced uint64_t with IdxT, and solved a few overflows (I've tested with very large datasets to debug overflows).
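As a hypothetical illustration of the kind of overflow being fixed (not code from the PR): with 32-bit index types, the element offset of a row in a large dataset wraps before any widening assignment can save it.

#include <cstddef>
#include <cstdint>

// The multiply happens in 32 bits and wraps past 2^32 for large datasets...
inline size_t row_offset_bad(uint32_t row, uint32_t dim) { return row * dim; }
// ...so one operand must be widened (e.g. to IdxT or size_t) before multiplying.
inline size_t row_offset_good(uint32_t row, uint32_t dim)
{
  return static_cast<size_t>(row) * dim;
}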

@Nyrio (Contributor Author) commented Sep 13, 2022:

> Would you consider templatizing fusedL2NN to allow unsigned labels instead? It feels like more internals of ann_kmeans would still need to be changed to properly switch to LabelT if we go this way.

Which internals of ann kmeans still need to be changed? The IVF-Flat tests are working at least. I'll look into fusedL2NN to see if I can simply add support for unsigned label types, but it doesn't hurt to have the flexibility in the ann kmeans code, even if we end up only instantiating one type (be it uint32_t or int).

> Out of curiosity, did you check whether argmin_along_rows could be optimized to achieve the same speedup without fusing? I remember it looked awfully slow in some settings in the nsys timeline when I tested it...

I'll look into this, but by my quick calculation, I would have to speed up argmin_along_rows by a factor of at least 10 to achieve similar performance.

@achirkin (Contributor):

> Which internals of ann kmeans still need to be changed?

It's things like the fine_clusters_csum variable, which are not obvious and do not raise compile-time errors. It's not always clear which element type they should have, which makes me hesitant to introduce more index types in the code. I've already had one fix for something there in the merged follow-up PR :)

@Nyrio (Contributor Author) commented Sep 13, 2022:

> It's things like the fine_clusters_csum variable, which are not obvious and do not raise compile-time errors. It's not always clear which element type they should have, which makes me hesitant to introduce more index types in the code.

Yes, it's not always clear which index types should be used, and I tried not to change all the types in this PR, only those related to the labels. fine_clusters_csum is not impacted by the current change, so I didn't touch it.

@tfeher (Contributor) left a comment:

Thanks @Nyrio for the PR! It is nice to see this improvement in the balanced kmeans implementation. I have a few comments, and I see that some items from @achirkin's review are still pending; please have a look.

@tfeher (Contributor) left a comment:

Thanks Louis for addressing the issues so far. I have a few more nitpicks; otherwise it looks good.

@Nyrio Nyrio requested a review from tfeher September 29, 2022 18:02
@tfeher (Contributor) left a comment:

Thanks Louis for addressing the issues, the PR looks good to me!

@achirkin could you also have another look at the PR?

@cjnolet (Member) left a comment:

I think this PR looks good so far. I'm just reviewing as comments for now since I have a few questions but I don't want to hold up progress.

* @param n_rows number samples in the `dataset`
* @param[out] labels output predictions [n_rows]
* @param metric
* @param stream
* @param mr (optional) memory resource to use for temporary allocations
*/

-template <typename T>
+template <typename T, typename IdxT, typename LabelT>
void predict(const handle_t& handle,
@cjnolet (Member):

What would it take at this point to move the train and predict functions from raft/spatial/knn/ over to raft/cluster, and provide the balanced kmeans either as an option in the existing kmeans or as an additional separate API (maybe raft/cluster/kmeans_balanced.cuh)? Do you think that would be a significant amount of work, or would it be pretty trivial once the two are using similar primitives?

The CuOPT team has been interested in using the balanced k-means and I'm sure there are other users that could find it very useful.

@Nyrio (Contributor Author):

Moving it out as a separate API sounds feasible. Merging the two APIs requires first using similar primitives, for which I have some ongoing work. I'm not sure which path will be preferred, but if we expose this as a separate API now, it's an API we're also going to have to maintain for a while once we merge it with the existing kmeans.

@@ -689,21 +821,40 @@ auto build_fine_clusters(const handle_t& handle,
"Number of fine clusters must be non-zero for a non-empty mesocluster");
}

-  utils::copy_selected(
-    mesocluster_sizes[i], dim, dataset_mptr, mc_trainset_ids, dim, mc_trainset, dim, stream);
+  utils::copy_selected((IdxT)mesocluster_sizes[i],
@cjnolet (Member):

I know you didn't add this function to the implementation originally, but how does copy_selected differ from raft::matrix::copy_rows, which accepts an array of selected ids? I ask because if they are similar enough, we could add a note to consolidate them.

@achirkin (Contributor) commented Sep 30, 2022:

This one is used all over the place in both new ivf-* implementations, and it uses the mapping type to transform between input and output along the way (e.g. cast uint8_t to float and then divide by 256).

However, in the case of ann_kmeans_balanced, I think we'd better use raft::matrix::copy_rows, because we copy indices here rather than values, and logically it has nothing to do with the ANN mapping of the value types.
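As a rough illustration of the mapping in question (the functor below is hypothetical, not raft's actual mapping type):

#include <cstdint>

// Sketch of the value transform copy_selected applies while gathering rows:
// quantized uint8 input read out as float in [0, 1).
struct MapUint8ToFloat {
  __host__ __device__ float operator()(uint8_t v) const
  {
    return static_cast<float>(v) / 256.0f;  // cast, then rescale
  }
};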

@Nyrio (Contributor Author) commented Sep 30, 2022:

> However, in the case of ann_kmeans_balanced, I think we'd better use raft::matrix::copy_rows, because we copy indices here rather than values, and logically it has nothing to do with the ANN mapping of the value types.

What do you mean by "we copy indices"? The two calls to copy_selected in ann_kmeans_balanced copy parts of the dataset and the norm, respectively. While the norm is float and could easily use matrix::copyRows (or a better kernel, since n_cols=1), the dataset can be int8/uint8 and definitely needs the mapping.

@Nyrio (Contributor Author):

But maybe we could adapt copyRows to accept an iterator, and then branch inside matrix::detail::copyRows to select the most efficient implementation based on whether the iterator is a simple pointer or not.
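A minimal sketch of that dispatch idea (hypothetical helper, not raft's actual matrix::detail::copyRows), assuming C++17:

#include <type_traits>
#include <cuda_runtime.h>

// Branch at compile time: raw pointers take the existing vectorized row-copy
// kernel; anything else falls back to a generic iterator-based gather.
template <typename InIt, typename IdxT>
void copy_rows_dispatch(InIt in, IdxT n_rows, IdxT n_cols,
                        const IdxT* row_ids, cudaStream_t stream)
{
  if constexpr (std::is_pointer_v<InIt>) {
    // fast path: pointer-based, vectorized copy kernel
  } else {
    // general path: dereference the iterator element by element
  }
}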

@achirkin (Contributor) commented Sep 30, 2022:

> What do you mean by "we copy indices"?

Oh, sorry, I looked the wrong way. Indeed you're right: we do copy values here and make use of the mapping T -> float, which, strictly speaking, is part of our implementation of the ANN methods so far.

@Nyrio (Contributor Author) commented Sep 30, 2022:

Yes, but in that regard it matters little whether the hardcoded mapping type is given through an iterator or directly in the kernel, as in copy_selected_kernel. I'd even argue I prefer the former, because then the primitive doesn't have to know about the mapping.

@achirkin (Contributor) commented Sep 30, 2022:

Ah, yes, I totally agree (just make sure we don't lose much in terms of performance). I meant to explain my issues with the other topic discussed above - moving ann_kmeans_balanced into the cluster namespace, whereas it currently contains ANN-implementation-specific logic for transforming the data.

@cjnolet (Member):

@Nyrio, trying to use raft primitives wherever possible: would raft::matrix::gather work in place of thrust::gather?

Also, just FYI, we are in the process of migrating the RAFT APIs to accept mdspan everywhere to make them consistent across the board, cleaner, and easier to use. Maybe we could also use raft::span (which can construct an iterator)?

@Nyrio (Contributor Author):

@cjnolet matrix::gather will be very inefficient for D=1:

for (int i = threadIdx.x; i < D; i += TPB) {  // with D == 1, only thread 0 ever copies;
  out[outRowStart + i] = in[inRowStart + i];  // the other TPB - 1 threads in the block idle
}

We could have an array/vector gather but it would essentially be a simple wrapper around thrust.

@Nyrio (Contributor Author):

Note that I used thrust::gather only for the norm; we could, however, use matrix::gather to select rows in the dataset (in another PR). Its support for iterators is convenient: I could use that to do the int-float mapping.

Don't we have some redundancy between copyRows and gather? The latter seems to be a generalization of the former, am I right?
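For reference, the thrust call for the norm has this shape (buffer names are hypothetical):

#include <thrust/gather.h>
#include <thrust/execution_policy.h>
#include <cstddef>
#include <cstdint>

// dst_norms[i] = src_norms[row_ids[i]] for i in [0, n_selected)
void gather_norms_sketch(const uint32_t* row_ids, std::size_t n_selected,
                         const float* src_norms, float* dst_norms,
                         cudaStream_t stream)
{
  thrust::gather(thrust::cuda::par.on(stream),
                 row_ids, row_ids + n_selected,  // map: selected row indices
                 src_norms,                      // source vector
                 dst_norms);                     // output [n_selected]
}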

* @param n_rows
* @param n_cols
* @param[in] a device pointer to the row-major matrix [n_rows, n_cols]
* @param[out] out device pointer to the vector of selected indices [n_rows]
* @param stream
*/
template <typename IdxT, typename LabelT>
inline void argmin_along_rows(
@cjnolet (Member):

I think this is already in the todos for future work in the ANN impls, but I want to point out again that this would be a great primitive for raft::matrix (along with most of the functions in this file). While I suppose raft/cluster/hierarchical_kmeans.cuh could still reference the raft::spatial::knn::detail namespace, it would also lend itself to better and cleaner separation if these were centralized within raft.

Contributor:

I've noted there in the docs that it would be nice to merge this into select-top-k as a special case:

* TODO: specialize select_k for the case of `k == 1` and use that one instead.

I suggest doing this as soon as the ivf-pq implementation is merged. Here's the linked issue: #853

@cjnolet (Member):

Argmin and argmax are themselves useful primitives. We do have an argmax primitive that currently uses CUB underneath. We could create an argmin from that, or use this one. Either way, I think it's generally useful and could be moved to raft::matrix::argmin.

@achirkin (Contributor) left a comment:

Just to confirm, all of the k-means changes seem to be compatible with my ivf-pq PR and hence LGTM :)

# Conflicts:
#	cpp/include/raft/spatial/knn/detail/ann_kmeans_balanced.cuh
#	cpp/include/raft/spatial/knn/detail/ann_utils.cuh
@cjnolet (Member) commented Sep 30, 2022:

@Nyrio, it looks like there are a couple of conflicts, but otherwise this is good to go. We can merge after the conflicts are resolved.

@Nyrio (Contributor Author) commented Sep 30, 2022:

@cjnolet Yes, I am just running some tests on local changes (conflicts + using the new compiled wrapper around fusedL2NN), and after I push it should be ready to merge.

@Nyrio Nyrio changed the title [REVIEW] Use fusedL2NN in ANN kmeans Use fusedL2NN in ANN kmeans Sep 30, 2022
@cjnolet (Member) commented Sep 30, 2022:

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 830cb89 into rapidsai:branch-22.10 Oct 4, 2022
rapids-bot bot pushed a commit that referenced this pull request Nov 9, 2022
…ANN kmeans (#912)

This PR follows up on a suggestion from @cjnolet in #821. The new `argmin` primitive is up to 5x faster than `argmin_along_rows` for dimensions relevant to ANN kmeans, and removes code duplication.

The reasons why it is faster are:

- `argmin_along_rows` often misses out on doing a sequential reduction before the tree reduction, especially as it uses block sizes as large as 1024.
- CUB has a better reduction algorithm than the basic shared-memory reduction used in `argmin_along_rows`.
- If we switch the `argmin` prim to the `cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY` algorithm, we get up to 30% further speedup! (I believe it's safe to use the commutative algorithm here since the offset is contained in the key-value pair, so the reduction operation is commutative; a sketch of this pattern follows below.)

The speedup that I have measured for IVF-Flat build with the `InnerProduct` metric is around 15%.
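A hedged sketch of the reduction pattern described above (illustrative names, not the actual #912 kernel): one block per row, a per-thread sequential argmin over strided columns first, then a single CUB block reduction using the raking commutative-only algorithm.

#include <cub/cub.cuh>
#include <cfloat>

template <int BlockDim, typename IdxT>
__global__ void argmin_rows_sketch(const float* in, IdxT n_cols, IdxT* out)
{
  using KVP    = cub::KeyValuePair<IdxT, float>;
  using Reduce = cub::BlockReduce<KVP, BlockDim, cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY>;
  __shared__ typename Reduce::TempStorage temp;

  const float* row = in + static_cast<size_t>(blockIdx.x) * n_cols;

  // Sequential reduction first: each thread keeps a running argmin over its
  // strided slice, so the tree reduction only sees BlockDim candidates.
  KVP best(IdxT{0}, FLT_MAX);
  for (IdxT i = threadIdx.x; i < n_cols; i += BlockDim) {
    if (row[i] < best.value) { best = KVP(i, row[i]); }
  }

  // A commutative block reduction is safe here: the column index travels
  // inside the key-value pair, so the result doesn't depend on ordering.
  KVP row_best = Reduce(temp).Reduce(best, cub::ArgMin());
  if (threadIdx.x == 0) { out[blockIdx.x] = row_best.key; }
}

// hypothetical launch: argmin_rows_sketch<128><<<n_rows, 128, 0, stream>>>(in, n_cols, out);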

Authors:
  - Louis Sugy (https://github.com/Nyrio)

Approvers:
  - Micka (https://github.com/lowener)
  - Corey J. Nolet (https://github.com/cjnolet)
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #912
Labels: 5 - Ready to Merge · cpp · improvement (Improvement / enhancement to an existing function) · non-breaking (Non-breaking change)
Participants: 4