Use fusedL2NN in ANN kmeans #821

Merged · 20 commits into rapidsai:branch-22.10 · Oct 4, 2022

Conversation

@Nyrio (Contributor) commented Sep 12, 2022:

The bottleneck of ivf_flat::build is the label prediction, currently done with GEMM + argmin. This PR gives a 25-30% speedup by replacing it with the fused primitive.
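For context, a rough sketch of what the change amounts to, assuming the raft 22.10 fusedL2NNMinReduce primitive from raft/distance/fused_l2_nn.cuh (the argument order is my recollection of that header, and the wrapper name and buffer names are hypothetical): for each of the m dataset rows it returns the nearest of the n cluster centers in a single fused pass, never materializing the [m, n] distance matrix that GEMM + argmin would build.

#include <raft/distance/fused_l2_nn.cuh>
#include <cub/cub.cuh>

// Hypothetical wrapper around the fused primitive (a sketch, not this PR's code).
void predict_labels_sketch(const float* dataset,        // [m, k], row-major
                           const float* centers,        // [n, k], row-major
                           const float* dataset_norms,  // [m], squared L2 norms
                           const float* center_norms,   // [n], squared L2 norms
                           int m, int n, int k,
                           cub::KeyValuePair<int, float>* out,  // [m]: (label, distance)
                           void* workspace,              // >= m * sizeof(int) bytes
                           cudaStream_t stream)
{
  // One kernel: distances are computed tile by tile and reduced to the
  // per-row minimum on the fly.
  raft::distance::fusedL2NNMinReduce(out, dataset, centers,
                                     dataset_norms, center_norms,
                                     m, n, k, workspace,
                                     false /* sqrt */, true /* initOutBuffer */, stream);
}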

@Nyrio Nyrio requested a review from a team as a code owner September 12, 2022 15:40
@github-actions github-actions bot added the cpp label Sep 12, 2022
@Nyrio Nyrio added labels 3 - Ready for Review, non-breaking (Non-breaking change), and improvement (Improvement / enhancement to an existing function) Sep 12, 2022
@achirkin (Contributor) left a comment:

That's a very nice speedup! Two main points:

  1. Would you consider templatizing fusedL2NN to allow unsigned labels instead? It feels like more internals of ann_kmeans would still need to be changed to properly switch to LabelT if we go this way.
  2. Out of curiosity, did you check whether argmin_along_rows could be optimized to achieve the same speedup without fusing? I remember it looked awfully slow in some settings in the nsys timeline when I tested it...

(Three review threads on cpp/include/raft/spatial/knn/detail/ann_utils.cuh were marked outdated and resolved.)
inline void predict_float_core(const handle_t& handle,
const float* centers,
uint32_t n_clusters,
uint32_t dim,
const float* dataset,
const float* dataset_norm,
size_t n_rows,
Contributor:

Should n_rows and all variables/arrays referring to the dataset indices/sizes be the same type as indices, i.e. IdxT?

@Nyrio (Contributor Author):

For now, I left dim and n_clusters in uint32_t, replaced uint64_t with IdxT, and solved a few overflows (I've tested with very large datasets to debug overflows).
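As a hypothetical illustration of the kind of overflow being fixed (not code from the PR): with 32-bit index types, the element offset of a row in a large dataset wraps before any widening assignment can save it.

#include <cstddef>
#include <cstdint>

// The multiply happens in 32 bits and wraps past 2^32 for large datasets...
inline size_t row_offset_bad(uint32_t row, uint32_t dim) { return row * dim; }
// ...so one operand must be widened (e.g. to IdxT or size_t) before multiplying.
inline size_t row_offset_good(uint32_t row, uint32_t dim)
{
  return static_cast<size_t>(row) * dim;
}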

@Nyrio (Contributor Author) commented Sep 13, 2022:

> Would you consider templatizing fusedL2NN to allow unsigned labels instead? It feels like more internals of ann_kmeans would still need to be changed to properly switch to LabelT if we go this way.

Which internals of ann kmeans still need to be changed? The IVF-Flat tests are working at least. I'll look into fusedL2NN to see if I can simply add support for unsigned label types, but it doesn't hurt to have the flexibility in the ann kmeans code, even if we end up only instantiating one type (be it uint32_t or int).

> Out of curiosity, did you check whether argmin_along_rows could be optimized to achieve the same speedup without fusing? I remember it looked awfully slow in some settings in the nsys timeline when I tested it...

I'll look into this, but by my quick calculation, I would have to speed up argmin_along_rows by a factor of at least 10 to achieve similar performance.

@achirkin (Contributor):

> Which internals of ann kmeans still need to be changed?

It's things like the fine_clusters_csum variable, which are not obvious and do not raise compile-time errors. It's not always clear which element type they should have, which makes me hesitant to introduce more index types in the code. I've already had one fix for something there in the merged follow-up PR :)

@Nyrio (Contributor Author) commented Sep 13, 2022:

> It's things like the fine_clusters_csum variable, which are not obvious and do not raise compile-time errors. It's not always clear which element type they should have, which makes me hesitant to introduce more index types in the code.

Yes, it's not always clear which index types should be used, and I tried not to change all the types in this PR, only those related to the labels. fine_clusters_csum is not impacted by the current change, so I didn't touch it.

@tfeher (Contributor) left a comment:

Thanks @Nyrio for the PR! It is nice to see this improvement in the balanced kmeans implementation. I have a few comments, and I see that some items from @achirkin's review are still pending; please have a look.

@tfeher (Contributor) left a comment:

Thanks Louis for addressing the issues so far. I have a few more nitpicks; otherwise it looks good.

@Nyrio Nyrio requested a review from tfeher September 29, 2022 18:02
@tfeher (Contributor) left a comment:

Thanks Louis for addressing the issues, the PR looks good to me!

@achirkin could you also have another look at the PR?

@cjnolet (Member) left a comment:

I think this PR looks good so far. I'm just reviewing as comments for now since I have a few questions but I don't want to hold up progress.

* @param n_rows number samples in the `dataset`
* @param[out] labels output predictions [n_rows]
* @param metric
* @param stream
* @param mr (optional) memory resource to use for temporary allocations
*/

-template <typename T>
+template <typename T, typename IdxT, typename LabelT>
void predict(const handle_t& handle,
@cjnolet (Member):

What would it take at this point to move the train and predict functions from raft/spatial/knn/ over to raft/cluster, and provide the balanced kmeans either as an option in the existing kmeans or as an additional separate API (maybe raft/cluster/kmeans_balanced.cuh)? Do you think that would be a significant amount of work, or would it be pretty trivial once the two are using similar primitives?

The CuOPT team has been interested in using the balanced k-means and I'm sure there are other users that could find it very useful.

@Nyrio (Contributor Author):

Moving it out as a separate API sounds feasible. Merging the two APIs requires first using similar primitives, for which I have some ongoing work. I'm not sure which path will be preferred, but if we expose this as a separate API now, it's an API we're also going to have to maintain for a while once we merge it with the existing kmeans.

@@ -689,21 +821,40 @@ auto build_fine_clusters(const handle_t& handle,
"Number of fine clusters must be non-zero for a non-empty mesocluster");
}

-  utils::copy_selected(
-    mesocluster_sizes[i], dim, dataset_mptr, mc_trainset_ids, dim, mc_trainset, dim, stream);
+  utils::copy_selected((IdxT)mesocluster_sizes[i],
@cjnolet (Member):

I know you didn't add this function to the implementation originally, but how does copy_selected differ from raft::matrix::copy_rows, which accepts an array of selected ids? I ask because if they are similar enough, we could add a note to consolidate them.

@achirkin (Contributor) commented Sep 30, 2022:

This one is used all over the place in both new ivf-* implementations, and it uses the mapping type to transform between input and output along the way (e.g. cast uint8_t to float and then divide by 256).

However, in the case of ann_kmeans_balanced, I think we'd better use raft::matrix::copy_rows, because we copy indices here rather than values, and logically it has nothing to do with the ANN mapping of the value types.
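As a rough illustration of the mapping in question (the functor below is hypothetical, not raft's actual mapping type):

#include <cstdint>

// Sketch of the value transform copy_selected applies while gathering rows:
// quantized uint8 input read out as float in [0, 1).
struct MapUint8ToFloat {
  __host__ __device__ float operator()(uint8_t v) const
  {
    return static_cast<float>(v) / 256.0f;  // cast, then rescale
  }
};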

@Nyrio (Contributor Author) commented Sep 30, 2022:

> However, in the case of ann_kmeans_balanced, I think we'd better use raft::matrix::copy_rows, because we copy indices here rather than values, and logically it has nothing to do with the ANN mapping of the value types.

What do you mean by "we copy indices"? The two calls to copy_selected in ann_kmeans_balanced copy parts of the dataset and the norm, respectively. While the norm is float and could easily use matrix::copyRows (or a better kernel, since n_cols=1), the dataset can be int8/uint8 and definitely needs the mapping.

@Nyrio (Contributor Author):

But maybe we could adapt copyRows to accept an iterator, and then branch inside matrix::detail::copyRows to select the most efficient implementation based on whether the iterator is a simple pointer or not.
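A minimal sketch of that dispatch idea (hypothetical helper, not raft's actual matrix::detail::copyRows), assuming C++17:

#include <type_traits>
#include <cuda_runtime.h>

// Branch at compile time: raw pointers take the existing vectorized row-copy
// kernel; anything else falls back to a generic iterator-based gather.
template <typename InIt, typename IdxT>
void copy_rows_dispatch(InIt in, IdxT n_rows, IdxT n_cols,
                        const IdxT* row_ids, cudaStream_t stream)
{
  if constexpr (std::is_pointer_v<InIt>) {
    // fast path: pointer-based, vectorized copy kernel
  } else {
    // general path: dereference the iterator element by element
  }
}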

@achirkin (Contributor) commented Sep 30, 2022:

> What do you mean by "we copy indices"?

Oh, sorry, I looked the wrong way. Indeed you're right: we do copy values here and make use of the mapping T -> float, which, strictly speaking, is part of our implementation of the ANN methods so far.

@Nyrio (Contributor Author) commented Sep 30, 2022:

Yes, but in that regard it matters little whether the hardcoded mapping type is given through an iterator or directly in the kernel, as in copy_selected_kernel. I'd even argue I prefer the former, because then the primitive doesn't have to know about the mapping.

@achirkin (Contributor) commented Sep 30, 2022:

Ah, yes, I totally agree (just make sure we don't lose much in terms of performance). I meant to explain my issues with the other topic discussed above - moving ann_kmeans_balanced into the cluster namespace, whereas it currently contains ANN-implementation-specific logic for transforming the data.

@cjnolet (Member):

@Nyrio, trying to use raft primitives wherever possible: would raft::matrix::gather work in place of thrust::gather?

Also, just FYI, we are in the process of migrating the RAFT APIs to accept mdspan everywhere to make them consistent across the board, cleaner, and easier to use. Maybe we could also use raft::span (which can construct an iterator)?

@Nyrio (Contributor Author):

@cjnolet matrix::gather will be very inefficient for D=1:

for (int i = threadIdx.x; i < D; i += TPB) {  // with D == 1, only thread 0 ever copies;
  out[outRowStart + i] = in[inRowStart + i];  // the other TPB - 1 threads in the block idle
}

We could have an array/vector gather but it would essentially be a simple wrapper around thrust.

@Nyrio (Contributor Author):

Note that I used thrust::gather only for the norm; we could, however, use matrix::gather to select rows in the dataset (in another PR). Its support for iterators is convenient: I could use that to do the int-float mapping.

Don't we have some redundancy between copyRows and gather? The latter seems to be a generalization of the former, am I right?
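For reference, the thrust call for the norm has this shape (buffer names are hypothetical):

#include <thrust/gather.h>
#include <thrust/execution_policy.h>
#include <cstddef>
#include <cstdint>

// dst_norms[i] = src_norms[row_ids[i]] for i in [0, n_selected)
void gather_norms_sketch(const uint32_t* row_ids, std::size_t n_selected,
                         const float* src_norms, float* dst_norms,
                         cudaStream_t stream)
{
  thrust::gather(thrust::cuda::par.on(stream),
                 row_ids, row_ids + n_selected,  // map: selected row indices
                 src_norms,                      // source vector
                 dst_norms);                     // output [n_selected]
}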

* @param n_rows
* @param n_cols
* @param[in] a device pointer to the row-major matrix [n_rows, n_cols]
* @param[out] out device pointer to the vector of selected indices [n_rows]
* @param stream
*/
template <typename IdxT, typename LabelT>
inline void argmin_along_rows(
@cjnolet (Member):

I think this is already in the todos for future work in the ANN impls, but I want to point out again that this would be a great primitive for raft::matrix (along with most of the functions in this file). While I suppose raft/cluster/hierarchical_kmeans.cuh could still reference the raft::spatial::knn::detail namespace, it would also lend itself to better and cleaner separation if these were centralized within raft.

Contributor:

I've noted there in the docs that it would be nice to merge this into select-top-k as a special case:

* TODO: specialize select_k for the case of `k == 1` and use that one instead.

I suggest doing this as soon as the ivf-pq implementation is merged. Here's the linked issue: #853

@cjnolet (Member):

Argmin and argmax are themselves useful primitives. We do have an argmax primitive that currently uses CUB underneath. We could create an argmin from that, or use this one. Either way, I think it's generally useful and could be moved to raft::matrix::argmin.

@achirkin (Contributor) left a comment:

Just to confirm, all of the k-means changes seem to be compatible with my ivf-pq PR and hence LGTM :)

# Conflicts:
#	cpp/include/raft/spatial/knn/detail/ann_kmeans_balanced.cuh
#	cpp/include/raft/spatial/knn/detail/ann_utils.cuh
@cjnolet (Member) commented Sep 30, 2022:

@Nyrio, it looks like there are a couple of conflicts, but otherwise this is good to go. We can merge after the conflicts are resolved.

@Nyrio (Contributor Author) commented Sep 30, 2022:

@cjnolet Yes, I am just running some tests on local changes (conflicts + using the new compiled wrapper around fusedL2NN), and after I push it should be ready to merge.

@Nyrio Nyrio changed the title [REVIEW] Use fusedL2NN in ANN kmeans Use fusedL2NN in ANN kmeans Sep 30, 2022
@cjnolet (Member) commented Sep 30, 2022:

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 830cb89 into rapidsai:branch-22.10 Oct 4, 2022
rapids-bot bot pushed a commit that referenced this pull request Nov 9, 2022
…ANN kmeans (#912)

This PR follows up on a suggestion from @cjnolet in #821. The new `argmin` primitive is up to 5x faster than `argmin_along_rows` for dimensions relevant to ANN kmeans, and removes code duplication.

The reasons why it is faster are:

- `argmin_along_rows` often misses out on doing a sequential reduction before the tree reduction, especially as it uses block sizes as large as 1024.
- CUB has a better reduction algorithm than the basic shared-memory reduction used in `argmin_along_rows`.
- If we switch the `argmin` prim to the `cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY` algorithm, we get up to 30% further speedup! (I believe it's safe to use the commutative algorithm here since the offset is contained in the key-value pair, so the reduction operation is commutative; a sketch of this pattern follows below.)

The speedup that I have measured for IVF-Flat build with the `InnerProduct` metric is around 15%.
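A hedged sketch of the reduction pattern described above (illustrative names, not the actual #912 kernel): one block per row, a per-thread sequential argmin over strided columns first, then a single CUB block reduction using the raking commutative-only algorithm.

#include <cub/cub.cuh>
#include <cfloat>

template <int BlockDim, typename IdxT>
__global__ void argmin_rows_sketch(const float* in, IdxT n_cols, IdxT* out)
{
  using KVP    = cub::KeyValuePair<IdxT, float>;
  using Reduce = cub::BlockReduce<KVP, BlockDim, cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY>;
  __shared__ typename Reduce::TempStorage temp;

  const float* row = in + static_cast<size_t>(blockIdx.x) * n_cols;

  // Sequential reduction first: each thread keeps a running argmin over its
  // strided slice, so the tree reduction only sees BlockDim candidates.
  KVP best(IdxT{0}, FLT_MAX);
  for (IdxT i = threadIdx.x; i < n_cols; i += BlockDim) {
    if (row[i] < best.value) { best = KVP(i, row[i]); }
  }

  // A commutative block reduction is safe here: the column index travels
  // inside the key-value pair, so the result doesn't depend on ordering.
  KVP row_best = Reduce(temp).Reduce(best, cub::ArgMin());
  if (threadIdx.x == 0) { out[blockIdx.x] = row_best.key; }
}

// hypothetical launch: argmin_rows_sketch<128><<<n_rows, 128, 0, stream>>>(in, n_cols, out);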

Authors:
  - Louis Sugy (https://github.com/Nyrio)

Approvers:
  - Micka (https://github.com/lowener)
  - Corey J. Nolet (https://github.com/cjnolet)
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #912
Labels: 5 - Ready to Merge · cpp · improvement (Improvement / enhancement to an existing function) · non-breaking (Non-breaking change)
Participants: 4