[ENH] Add extern template for ivfflat_interleaved_scan #1360

ahendriksen · 2023-03-21T17:56:16Z

This should cut compilation time for refine_d_int64_t_float.cu.o et al from ~900 seconds to 29 seconds.

The refine specializations contain >100 instances of the ivfflat_interleaved_scan kernel, even though these should be compiled separately by the ivfflat_search specializations.

The call to ivfflat_interleaved_scan is here.
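For context, here is a minimal sketch of the extern template mechanism this PR relies on. It is plain C++ rather than the actual CUDA code, and `interleaved_scan_sketch` and `refine_sketch` are hypothetical stand-ins, not RAFT functions. In a real build the three marked parts would live in a header, one specialization source file, and a caller's source file respectively; they are shown in one file only to keep the example short.

```cpp
#include <cassert>

// --- would live in a shared header ---
// Hypothetical stand-in for a heavily templated kernel launcher.
template <typename T, typename IdxT>
T interleaved_scan_sketch(const T* data, IdxT n)
{
  assert(n >= 0);
  T acc = T(0);
  for (IdxT i = 0; i < n; ++i) { acc += data[i]; }
  return acc;
}

// Explicit instantiation declaration: a translation unit that sees this line
// is told that the <float, long> instance is provided elsewhere, so it does
// not have to generate its own copy.
extern template float interleaved_scan_sketch<float, long>(const float*, long);

// --- would live in exactly one source file (the "specialization") ---
// Explicit instantiation definition: the single place where code is generated.
template float interleaved_scan_sketch<float, long>(const float*, long);

// --- would live in any other translation unit, e.g. the refine code ---
float refine_sketch(const float* data, long n)
{
  // Resolves to the instance defined above instead of instantiating again.
  return interleaved_scan_sketch<float, long>(data, n);
}
```

Calling `refine_sketch` on an array of three floats sums them through the single shared instance. The compile-time saving reported above comes from applying the same idea to every translation unit that calls the kernel.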

Depends on (so please merge after) PR #1307.

The calculation of the tile indices is now performed in ldgXY(). This will make it possible to remove all state related to the tile index from the class in the next commit. Note that the calculation of the tile index can depend on which overloaded constructor is called(!)

This commit moves all grid and tile indexing logic into the caller. Contractions_NT is now only responsible for *intra*-tile indexing. Due to the complexity of the epilog function, the ldgNextGridStride function is not yet called from within the main loop. That is the next goal so that we have all the grid and tile indexing localized in the loop.

This commit removes the epilog function and moves its functionality into the run loop. The next step might be to see whether the ldgNextGridStride() method has to be called at its current location, or whether performance is the same if it is called at the start of the loop.

ahendriksen · 2023-03-23T11:29:34Z

I think we should find a more sustainable way to manage the template instantiations in the specializations directories. It is very easy to call a function that does not have an extern template declaration and, as a result, recompile many duplicate kernels. Right now, these are only noticeable if you hawkishly check the ninja logs and compare the kernels contained in each translation unit (using cuobjdump).

For now, I have added comments at each declaration, definition and call of ivfflat_interleaved_scan with a tag greppable-id-specializations-ivf-flat-search.
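To illustrate how easily a call can slip past an extern template, here is a small sketch in plain C++. The names `sum_sketch`, `covered_call`, and `uncovered_call` and the tag `greppable-id-specializations-sketch` are hypothetical; only the tagging idea is taken from the comment above.

```cpp
#include <cassert>
#include <cstddef>

// greppable-id-specializations-sketch: template definition (header)
template <typename T>
T sum_sketch(const T* data, std::size_t n)
{
  assert(data != nullptr || n == 0);
  T acc = T(0);
  for (std::size_t i = 0; i < n; ++i) { acc += data[i]; }
  return acc;
}

// greppable-id-specializations-sketch: extern template declaration (header)
extern template float sum_sketch<float>(const float*, std::size_t);

// greppable-id-specializations-sketch: explicit instantiation (one source file)
template float sum_sketch<float>(const float*, std::size_t);

// greppable-id-specializations-sketch: call covered by the declaration above
inline float covered_call(const float* d, std::size_t n)
{
  return sum_sketch<float>(d, n);
}

// greppable-id-specializations-sketch: call NOT covered. No extern template
// exists for <double>, so every translation unit containing this call
// silently compiles its own copy without any warning.
inline double uncovered_call(const double* d, std::size_t n)
{
  return sum_sketch<double>(d, n);
}
```

Both calls compile and behave identically, which is why the duplication is invisible unless the build logs or object files are inspected. Grepping for the tag at least lists every site that has to stay in sync.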

ahendriksen · 2023-03-23T16:05:00Z

Marking this as ready for review since most of the builds and tests were succeeding in CI. Changes should be straightforward.

cpp/src/neighbors/specializations/ivfflat_search_float_int64_t.cu

cjnolet

Thinking about this a little further, I'm okay merging this as-is, even though we have separated all of our other source files.

I would like it if we would continue investigating why those overheads are so high, though.

cjnolet · 2023-03-25T19:09:30Z

/merge

ahendriksen added 30 commits February 8, 2023 14:56

pairwise_distance_base: Fix typo

a15d5fc

This results in subtle issues with non-square KernelPolicy, as found in fusedL2KNN.

Remove deprecated header

71c6da6

Replace lambdas by raft::void_op

4bbedf6

Use an operator for L1 distance

c3d1f6e

Add launch function

3e3478b

This is more general than just for L1. Making use of it more is work in progress.

l1: Replace run-time -> compile-time dispatch

264a9d2

pairwise matrix: move files into subdirectories

b232057

pairwise matrix: Untangle dispatching and kernel template parameters

06f6ffa

By adding yet another struct ^^

l2 unexp: Use pairwise matrix dispatch

2f41faa

l2 exp: Use pairwise matrix dispatch

7938614

This did remove support for the CUTLASS kernels. Has to be put back.

Add template for distance operator

7afe6cc

I wasted a lot of time because I had not replaced the op::core() method of the l2_exp_distance_op after I copied it from l2_unexp_distance_op... If I copy something from the template and forget to fill it in, I get a compile error.

Reenable cutlass-based kernels for CUDA 12.0

5fe3292

pairwise matrix l2: Add support for CUTLASS kernels

c623332

I am testing on CUDA 12, where it does not seem to work. Prior to my commits, the CUTLASS kernels were also not working. So not sure what's up. In any case: consider this untested.

Canberra: use dispatching mechanism

27511fc

Chebyshev: use pairwise matrix dispatch

58ce6f8

Correlation: use pairwise matrix dispatch

d397c17

Hamming: use pairwise matrix dispatch

7005a4f

Hellinger: use pairwise matrix dispatch

7831deb

Jensen-Shannon: use pairwise matrix dispatch

4dc72ce

remove old hamming code

b0d36c1

KL divergence: use pairwise matrix dispatch

e95a65b

Minkowski: use pairwise matrix dispatch

f1c105b

Russel-Rao: use pairwise matrix dispatch

ac66e3f

Cosine: use pairwise matrix dispatch

a89896a

Fix include for l1 op

16b2acd

kl_divergence: Use raft::log instead of raft::myLog

1326e34

distance_op: Add expensive_inner_loop marker

0169b26

This indicates that the operator uses expensive operations (pow, exp, log) in the inner loop. Therefore, the unrolling and/or veclen parameters should be adjusted.
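A minimal sketch of how such a marker could steer a compile-time choice, written in plain C++. All names here (`l1_op_sketch`, `minkowski_op_sketch`, `pick_veclen_sketch`, `accumulate_sketch`) are hypothetical and are not RAFT's API. The specific values 1 and 4, and the choice to shrink the vector length for expensive operators, are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical cheap operator: one subtraction and an absolute value.
struct l1_op_sketch {
  static constexpr bool expensive_inner_loop = false;
  float core(float x, float y) const { return std::fabs(x - y); }
};

// Hypothetical expensive operator: calls pow in the inner loop.
struct minkowski_op_sketch {
  static constexpr bool expensive_inner_loop = true;
  float p;
  float core(float x, float y) const { return std::pow(std::fabs(x - y), p); }
};

// Assumed policy: use a smaller vector length when the marker is set.
template <typename OpT>
constexpr int pick_veclen_sketch()
{
  return OpT::expensive_inner_loop ? 1 : 4;
}

template <typename OpT>
float accumulate_sketch(const OpT& op, const float* x, const float* y, int n)
{
  assert(n >= 0);
  constexpr int veclen = pick_veclen_sketch<OpT>();
  float acc = 0.0f;
  int i = 0;
  // Main loop: handle `veclen` elements per iteration.
  for (; i + veclen <= n; i += veclen) {
    for (int j = 0; j < veclen; ++j) { acc += op.core(x[i + j], y[i + j]); }
  }
  // Tail loop: handle the remaining elements one by one.
  for (; i < n; ++i) { acc += op.core(x[i], y[i]); }
  return acc;
}
```

Because the marker is a `static constexpr` member, the choice is resolved per operator type at compile time and adds no run-time branching.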

ahendriksen added 5 commits March 20, 2023 17:53

tune_distance: Enable changing distance op without recompile

c2970ba

Use std::declval

6c0d944

Merge remote-tracking branch 'rapids/branch-23.04' into enh-specializations-reduce-compile-times

05a4743

distance_ops: Use static shared_mem_size

1df17be

Add extern template for ivfflat_interleaved_scan

44bc742

This cuts compilation time for refine_d_int64_t_float.cu.o from ~900 seconds to 29 seconds.

github-actions bot added CMake cpp labels Mar 21, 2023

ahendriksen added the 5 - Merge After Dependencies, improvement, non-breaking, and Build Time Improvement labels and removed the CMake label Mar 21, 2023

cjnolet assigned ahendriksen Mar 21, 2023

cjnolet and others added 3 commits March 22, 2023 22:52

Merge branch 'branch-23.04' into enh-add-ivfflat_interleaved_scan-spec

6a1d7b8

Instantiate template somewhere as well...

3ad29c3

Merge remote-tracking branch 'rapids/branch-23.04' into enh-add-ivfflat_interleaved_scan-spec

1332a03

cjnolet removed the 5 - Merge After Dependencies label Mar 23, 2023

ahendriksen marked this pull request as ready for review March 23, 2023 16:03

ahendriksen requested a review from a team as a code owner March 23, 2023 16:03

Merge branch 'branch-23.04' into enh-add-ivfflat_interleaved_scan-spec

47f3dea

ahendriksen added the 3 - Ready for Review label Mar 23, 2023

cjnolet reviewed Mar 23, 2023

cpp/src/neighbors/specializations/ivfflat_search_float_int64_t.cu

Merge branch 'branch-23.04' into enh-add-ivfflat_interleaved_scan-spec

e0cdd2e

ahendriksen mentioned this pull request Mar 24, 2023

[ENH] Add specialization for coalescedReductionThinKernel? #1372

Open

cjnolet approved these changes Mar 25, 2023

rapids-bot bot merged commit 76c828d into rapidsai:branch-23.04 Mar 25, 2023

ahendriksen mentioned this pull request Apr 14, 2023

[DISCUSSION] A robust solution to precompiling function templates #1416

Closed
