[ENH] Add extern template for ivfflat_interleaved_scan #1360

Conversation

@ahendriksen ahendriksen commented Mar 21, 2023

This should cut compilation time for refine_d_int64_t_float.cu.o et al. from ~900 seconds to 29 seconds.

The refine specializations contain >100 instances of the ivfflat_interleaved_scan kernel, even though these should be compiled separately by the ivfflat_search specializations.

The call to ivf_flat_interleaved_scan is here.

Depends on (so please merge after) PR #1307.

The calculation of the tile indices is now performed in ldgXY(). This
will make it possible to move all state related to the tile index out
of the class in the next commit.

Note that the calculation of the tile index can depend on which
overloaded constructor is called(!)
This commit moves all grid and tile indexing logic into the caller.
Contractions_NT is now responsible only for *intra*-tile indexing.

Due to the complexity of the epilog function, the ldgNextGridStride
function is not yet called from within the main loop. That is the next
goal, so that all the grid and tile indexing is localized in the loop.
This commit removes the epilog function and moves its functionality into
the run loop. The next step might be to see if the ldgNextGridStride()
method has to be called at the current location, or if performance is
the same when it is called at the start of the loop.
This results in subtle issues with non-square KernelPolicy, as found in
fusedL2KNN.
This is more general than just L1 distance. Making broader use of it is
work in progress.
This removed support for the CUTLASS kernels; that has to be put back.
I wasted a lot of time because I had not replaced the op::core() method
of the l2_exp_distance_op after I copied it from l2_unexp_distance_op...

If I copy something from the template and forget to fill it in, I get a
compile error.
I am testing on CUDA 12, where it does not seem to work. Prior to my
commits, the CUTLASS kernels were also not working, so I am not sure
what's up.

In any case: consider this untested.
This indicates that the operator uses expensive operations (pow, exp,
log) in the inner loop. Therefore, the unrolling and/or veclen
parameters should be adjusted.
@ahendriksen added the labels `5 - Merge After Dependencies` (depends on another PR: do not merge out of order), `improvement` (improvement / enhancement to an existing function), `non-breaking` (non-breaking change), and `Build Time Improvement`, and removed the `CMake` label on Mar 21, 2023
@ahendriksen

I think we should find a more sustainable way to manage the template instantiations in the specializations directories. It is very easy to call a function that does not have an extern template definition and, as a result, recompile many duplicate kernels. Right now, these are only noticeable if you hawkishly check the ninja logs and compare the kernels contained in each translation unit (using cuobjdump).

For now, I have added comments at each declaration, definition, and call site of ivfflat_interleaved_scan with the tag greppable-id-specializations-ivf-flat-search.
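The tag-based workflow described above can be sketched as follows. The tag string is the real one from this PR; the directory and file name are illustrative stand-ins for the RAFT source tree.

```shell
# Create a stand-in source file carrying the tag comment.
mkdir -p /tmp/ivf_flat_demo
printf '%s\n' \
  '// greppable-id-specializations-ivf-flat-search (declaration)' \
  > /tmp/ivf_flat_demo/scan.cuh

# One grep finds every tagged declaration, definition, and call site.
grep -rl "greppable-id-specializations-ivf-flat-search" /tmp/ivf_flat_demo
```

A reviewer can then cross-check the grep hits against the extern template declarations to spot a specialization that is silently being recompiled.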

@cjnolet removed the `5 - Merge After Dependencies` (depends on another PR: do not merge out of order) label on Mar 23, 2023
@ahendriksen ahendriksen marked this pull request as ready for review March 23, 2023 16:03
@ahendriksen ahendriksen requested a review from a team as a code owner March 23, 2023 16:03
@ahendriksen

Marking this as ready for review since most of the builds and tests were succeeding in CI. The changes should be straightforward.


@cjnolet cjnolet left a comment


Thinking about this a little further, I'm okay merging this as-is, even though we have separated all of our other source files.

I would like it if we would continue investigating why those overheads are so high, though.

@cjnolet

cjnolet commented Mar 25, 2023

/merge
