Dynamically assign number of threads in innerdim scan #103435
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/103435
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 1d8cd35. This comment was automatically generated by Dr. CI and updates every 15 minutes.
You are instantiating just one template per op per dtype, as it was before, so it's fine as is (and shouldn't affect compile time), but I think it would be better to remove that template parameter altogether if possible.
```diff
     T init,
     BinaryFunction binary_op) {
-  __shared__ T sbuf[num_threads_y][2 * num_threads_x];
-  T* row_buf = sbuf[threadIdx.y];
+  __shared__ T sbuf[num_threads * 2];
```
Do you need `num_threads` as a template parameter? You can dynamically set the shared memory size depending on `num_threads` at the call site.
Yes, I can do that. Along the same lines, is there an internal function that determines the maximum number of threads automatically, rather than it being manually specified as 512?
You could query maxThreadsPerBlock (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g19b1696533069c03f646e2ce2beacc00). However, that's not necessarily the best for performance, because launching with the maximum thread count returned that way may result in lower occupancy. 512 is usually a good middle ground.
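As a side note, a device's thread limit can be read from its properties at runtime. The helper below is an illustrative sketch, not code from this PR; `pick_num_threads` is a hypothetical name.

```cuda
// Hypothetical sketch: query the per-block thread limit and clamp it,
// since launching at the maximum can reduce occupancy.
#include <cuda_runtime.h>
#include <algorithm>

int pick_num_threads(int device) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);
  // prop.maxThreadsPerBlock is typically 1024 on recent GPUs;
  // 512 is used here as the "good middle ground" mentioned above.
  return std::min(prop.maxThreadsPerBlock, 512);
}
```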
I've tried setting dynamic shared memory using `extern __shared__ T sbuf[]` and passing the size argument in `<<<...>>>`. However, I got a redeclaration error:

```
/git/pytorch/aten/src/ATen/native/cuda/ScanUtils.cuh(397): warning #20042-D: a host variable("sbuf") redeclared with __shared__
base_t sbuf[];
```

If removing the template argument is not strictly necessary, can I leave it for now?
Yeah, that's because of instantiations with the different data types. You can look at how other kernels handle that: it's basically declaring the shared memory with `char` type and doing a `reinterpret_cast`. But it's fine to leave for now.
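A minimal sketch of that pattern (assuming a simplified kernel shape, not the PR's actual code): the dynamic shared memory is declared once as raw bytes, so different template instantiations don't redeclare `extern __shared__` arrays with conflicting types.

```cuda
// Sketch only: declare dynamic shared memory as untyped bytes, then
// reinterpret_cast to the instantiation's element type. This avoids the
// "redeclared with __shared__" error across dtype instantiations.
#include <cuda_runtime.h>

template <typename T, typename BinaryFunction>
__global__ void scan_kernel(const T* in, T* out, int n, BinaryFunction op) {
  extern __shared__ char raw_smem[];         // same declaration for every T
  T* sbuf = reinterpret_cast<T*>(raw_smem);  // typed view for this instantiation
  int tid = threadIdx.x;
  if (tid < n) sbuf[tid] = in[tid];
  __syncthreads();
  // ... scan over sbuf ...
  if (tid < n) out[tid] = sbuf[tid];
}

// Call site sizes the allocation for the runtime thread count, e.g.:
// scan_kernel<<<grid, num_threads, 2 * num_threads * sizeof(float)>>>(...);
```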
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased 38688af to bc0e8dd.
Thanks for your help and education, @ngimel
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
(#103502) What this PR does (continuation from #103435):
- Applying a dynamic number of threads for the inner-dim scan with index function.
- Using dynamically allocated shared memory to get rid of the `num_threads` template arguments.

@ngimel

Pull Request resolved: #103502
Approved by: https://github.com/ngimel
This is the continuation of optimizing inner-dimension scan operations (`torch.cumsum`, `torch.cumprod`, `torch.logcumsumexp`) by dynamically setting the number of threads based on the input shape, from #103314. What I found is that simply setting the number of x-threads and y-threads following the ratio of the tensor's shape works quite well (with some clamping).
Here is the speed-up of this PR compared to `2.0.0+cu118` (not compared to #103314), measured on an A100 with 40 GB of memory (up to 23x faster). The first row is the innermost dimension; the first column is the outermost dimension (i.e. the batch size).
The floating-point numbers are the speed-ups, while the integers within the brackets are the log2 of the number of x-threads. The blank cells (the ones with dashes) were not compared because of my GPU's memory limitation.
There are some slowdowns that I observed (like `(2048, 8)` and `(4096, 32)`). The slowdown occurs because in this PR the scan loop (the one I use with Sklansky) cannot be unrolled by the compiler due to its dynamic number of iterations (it is `log2(num_threads_x)`), while in the previous version the scan loop could be unrolled and optimized by the compiler thanks to its fixed number of iterations. That's why I slightly modified the operations within the scan loop to use bit operations, to compensate for this slowdown.
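To illustrate the idea (this is a hedged sketch of a Sklansky scan step with bit-operation partner indexing, not the PR's exact code): each active lane computes its partner index with masks and ORs instead of multiplications, and the loop bound depends on the runtime `num_threads_x`, which is why the compiler cannot fully unroll it.

```cuda
// Sketch: one Sklansky inclusive-scan pass over a shared-memory row.
// Assumptions: num_threads_x is a power of two, row_buf has num_threads_x
// elements, one thread per element. Not the PR's actual implementation.
template <typename T, typename BinaryFunction>
__device__ void sklansky_scan(T* row_buf, unsigned num_threads_x,
                              BinaryFunction op) {
  unsigned i = threadIdx.x;
  // Dynamic trip count: log2(num_threads_x) iterations.
  for (unsigned stride = 1; stride < num_threads_x; stride <<= 1) {
    // Lanes in the upper half of each 2*stride block combine with the
    // last element of the lower half; the partner index is pure bit ops.
    if (i & stride) {
      unsigned partner = (i & ~(2 * stride - 1)) | (stride - 1);
      row_buf[i] = op(row_buf[partner], row_buf[i]);
    }
    __syncthreads();  // outside the branch: all threads must reach it
  }
}
```

Within one pass, only upper-half lanes write (each to its own slot) while reads target lower-half slots, so no extra synchronization is needed inside a pass.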
The most significant acceleration comes from tensors with a relatively small batch size (<= 4096) and a very long sequence. As the batch size increases, the speed-up is less significant because the previous implementation was already likely to be well optimized there.
NOTE: I haven't optimized scan dim with indices; that could come in another PR.
As for the build time, I tried not to write more templated functions than necessary. I will report the build time once I have the numbers.
UPDATE: I compared the build time when changing only ScanUtils.cuh. On the `main` branch it took 4m2s, while with this PR it took 3m39s.

What do you think, @ngimel?
branch, it took 4m2s, while in this PR, it took 3m39s.What do you think, @ngimel?