Dynamically assign number of threads in innerdim scan #103435

Closed · wants to merge 2 commits

Conversation

@mfkasim1 (Contributor) commented on Jun 12, 2023:

This is a continuation of optimizing the inner-dimension scan operations (torch.cumsum, torch.cumprod, torch.logcumsumexp) from #103314, now setting the number of threads dynamically based on the input shape.
What I found is that simply splitting the x-threads and y-threads roughly according to the ratio of the tensor's dimensions works quite well (with some clamping; a rough sketch of the idea appears after the benchmark table below).
Here is the speed-up of this PR compared to 2.0.0+cu118 (not compared to #103314), measured on an A100 with 40 GB of memory (up to 23x faster):

                2        8       32      128      512     1024     2048     4096     8096    16348    65536   262144  1048576 
       2:  1.07(4)  1.02(5)  1.01(6)  1.07(7)  2.16(8)  4.94(9)  8.71(9) 11.00(9) 12.99(9) 14.77(9) 16.41(9) 16.81(9) 16.97(9) 
       8:  1.20(4)  1.00(4)  1.01(5)  1.08(6)  2.85(7)  4.90(8)  6.34(8) 11.76(9) 13.86(9) 15.26(9) 16.96(9) 17.45(9) 19.75(9) 
      32:  1.08(4)  1.00(4)  1.00(4)  1.23(5)  2.48(6)  4.23(7)  5.04(7)  9.16(8) 10.11(8) 18.72(9) 20.64(9) 23.13(9) 23.50(9) 
     128:  1.09(4)  1.02(4)  1.03(4)  1.02(4)  1.64(5)  2.84(6)  3.08(6)  5.61(7)  5.86(7) 10.72(8) 19.22(9) 19.75(9) 19.97(9) 
     512:  1.06(4)  1.14(4)  1.01(4)  1.10(4)  1.02(4)  1.78(5)  1.85(5)  3.26(6)  3.34(6)  5.56(7)  8.56(8)  9.55(9)  9.62(9) 
    1024:  1.21(4)  1.22(4)  1.20(4)  1.06(4)  1.03(4)  1.05(4)  1.81(5)  1.86(5)  3.06(6)  3.12(6)  4.76(7)  5.20(8)  5.56(9) 
    2048:  1.04(4)  0.88(4)  1.00(4)  1.01(4)  1.02(4)  1.03(4)  1.02(4)  1.72(5)  1.73(5)  2.62(6)  2.86(7)  3.06(8) -------- 
    4096:  1.02(4)  1.12(4)  0.98(4)  1.60(4)  1.16(4)  1.09(4)  1.10(4)  1.10(4)  1.74(5)  1.75(5)  1.86(6)  2.00(7) -------- 
    8096:  1.03(4)  1.00(4)  1.00(4)  1.16(4)  1.17(4)  1.17(4)  1.18(4)  1.18(4)  1.18(4)  1.27(5)  1.43(6) -------- -------- 
   16348:  1.02(4)  1.15(4)  1.11(4)  1.17(4)  1.12(4)  1.11(4)  1.13(4)  1.12(4)  1.11(4)  1.08(4)  1.32(5) -------- -------- 
   65536:  1.17(4)  1.17(4)  1.16(4)  1.15(4)  1.12(4)  1.12(4)  1.12(4)  1.10(4)  1.10(4)  1.07(4) -------- -------- -------- 
  262144:  1.20(4)  1.20(4)  1.08(4)  1.13(4)  1.10(4)  1.09(4)  1.10(4)  1.08(4) -------- -------- -------- -------- -------- 
 1048576:  1.21(4)  1.14(4)  1.10(4)  1.13(4)  1.09(4)  1.08(4) -------- -------- -------- -------- -------- -------- --------

The first row is the size of the innermost (scanned) dimension and the first column is the size of the outermost dimension (i.e. the batch size).
The floating-point numbers are the speed-ups, and the integers in parentheses are log2 of the number of x-threads chosen.
The cells with dashes were not compared because of my GPU's memory limitation.
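
To illustrate the heuristic described above, here is a minimal sketch of how such a shape-dependent thread split could be computed. This is my own rough reconstruction for illustration, not the code in this PR; the helper names (log2_ceil, choose_num_threads_x) and the exact clamping bounds are assumptions.

// Rough sketch (not the PR's actual heuristic): split a fixed block size
// between x-threads (along the scanned dimension) and y-threads (along the
// batch dimension) according to the ratio of the two sizes, with clamping.
#include <algorithm>
#include <cstdint>

static int log2_ceil(int64_t x) {
  int p = 0;
  while ((int64_t(1) << p) < x) ++p;
  return p;
}

// Returns a power-of-two number of x-threads; y-threads = block_size / x-threads.
static int choose_num_threads_x(int64_t row_size, int64_t num_rows,
                                int block_size /* e.g. 512 */) {
  int log2_block = log2_ceil(block_size);
  // Long rows and few of them -> give more threads to the scanned dimension.
  int log2_x = log2_block / 2 + log2_ceil(row_size) - log2_ceil(num_rows);
  log2_x = std::min(std::max(log2_x, 1), log2_block);  // clamp to [2, block_size]
  return 1 << log2_x;
}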

There are some slowdowns that I observed (e.g. at (2048, 8) and (4096, 32)). They happen because in this PR the scan loop (the Sklansky loop) has a dynamic number of iterations (log2(num_threads_x)), so the compiler can no longer unroll and optimize it, whereas in the previous version the iteration count was fixed and the loop could be unrolled.
To compensate, I slightly modified the operations inside the scan loop to use bit operations (a sketch follows below).
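
For context, here is a hedged sketch of a Sklansky-style scan over one row in shared memory with the partner index computed via bit operations. It illustrates the kind of loop described above rather than reproducing the PR's kernel; the function name and signature are made up for the example.

// Illustrative Sklansky-style inclusive scan of one row held in shared memory.
// The trip count log2(num_threads_x) is only known at run time, so the body
// uses cheap bit masking instead of divisions/modulos.
// Assumes row_buf already holds one element per thread and a __syncthreads()
// has been issued after filling it.
template <typename T, typename BinaryFunction>
__device__ void sklansky_scan_row(T* row_buf, unsigned num_threads_x,
                                  BinaryFunction binary_op) {
  unsigned tid = threadIdx.x;
  for (unsigned s = 1; s < num_threads_x; s <<= 1) {
    if (tid & s) {
      // Last index of the preceding block of size s, computed with bit ops.
      unsigned src = (tid | (s - 1)) - s;
      row_buf[tid] = binary_op(row_buf[src], row_buf[tid]);
    }
    __syncthreads();
  }
}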

The most significant acceleration is for tensors with a relatively small batch size (<= 4096) and a very long scanned dimension.
As the batch size grows, the speed-up shrinks, because the previous implementation is already close to optimal in that regime.
NOTE: I haven't optimized the scan-with-indices kernels yet; that could come in another PR.

As for the build time, I tried not to instantiate more templated functions than necessary.
I will report the build time once I have the numbers.
UPDATE: I compared the build times when only ScanUtils.cuh is changed: on the main branch it took 4m2s, while with this PR it took 3m39s.

What do you think, @ngimel?

@pytorch-bot added the "release notes: cuda" (release notes category) label on Jun 12, 2023.
@pytorch-bot (bot) commented on Jun 12, 2023:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/103435

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1d8cd35:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ngimel (Collaborator) left a comment:


You are instantiating just one template per op per dtype, like it was before, so it's fine as is (and shouldn't affect compile time), but I think it would be better to remove that template parameter altogether if possible.

Review thread on aten/src/ATen/native/cuda/ScanUtils.cuh (outdated, resolved):
     T init,
     BinaryFunction binary_op) {
-  __shared__ T sbuf[num_threads_y][2 * num_threads_x];
-  T* row_buf = sbuf[threadIdx.y];
+  __shared__ T sbuf[num_threads * 2];
@ngimel (Collaborator):
Do you need num_threads as a template parameter? You could size the shared memory dynamically, based on num_threads, at the call site.

@mfkasim1 (Contributor, author):
Yes, I can do that. Related to this: is there an internal function that determines the maximum number of threads automatically, rather than specifying it manually as 512?

@ngimel (Collaborator):

You could query maxThreadsPerBlock (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html#group__CUDART__EXECUTION_1g19b1696533069c03f646e2ce2beacc00); however, that is not necessarily best for performance, because launching with the maximum thread count returned there may result in lower occupancy. 512 is usually a good middle ground.
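
For reference, the limit can be queried at run time with standard CUDA / ATen calls; whether to use that value instead of a fixed 512 is exactly the occupancy trade-off described above. A minimal sketch (the helper name is made up):

#include <cuda_runtime.h>

// Query the per-block thread limit of a device (typically 1024 on recent GPUs).
int max_threads_for_device(int device) {
  int max_threads = 0;
  cudaDeviceGetAttribute(&max_threads, cudaDevAttrMaxThreadsPerBlock, device);
  return max_threads;
}

// Inside ATen, the cached device properties can be used instead:
//   int max_threads = at::cuda::getCurrentDeviceProperties()->maxThreadsPerBlock;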

@mfkasim1 (Contributor, author):
I've tried setting the dynamic shared memory by using extern __shared__ T sbuf[] and passing the size argument in <<<...>>>. However, I got a redeclaration warning:

/git/pytorch/aten/src/ATen/native/cuda/ScanUtils.cuh(397): warning #20042-D: a host variable("sbuf") redeclared with __shared__
                     base_t sbuf[];

If removing the template argument is not strictly necessary, can I leave it as is for now?

@ngimel (Collaborator):
Yeah, that's because of the instantiations with different data types. You can look at how other kernels handle it: basically, declare the shared memory with char type and do a reinterpret_cast. It's fine to leave it for now, though.
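
A minimal sketch of the pattern being described (my own illustration, not the PR's code; the kernel name and parameters are placeholders): declare the dynamic shared memory once as raw bytes and reinterpret it to the instantiated type, which avoids the redeclaration warning across dtype instantiations, and pass the byte size as the third launch-configuration argument so no num_threads template argument is needed.

template <typename T, typename BinaryFunction>
__global__ void scan_innerdim_kernel(T* out, const T* in, int row_size,
                                     T init, BinaryFunction binary_op) {
  // Untyped dynamic shared memory, shared by all dtype instantiations.
  extern __shared__ char raw_smem[];
  // View it as T for this particular instantiation.
  T* sbuf = reinterpret_cast<T*>(raw_smem);
  // ... scan body using 2 * blockDim.x elements of sbuf per row ...
}

// Call site: the byte size is the third launch-configuration argument.
// scan_innerdim_kernel<scalar_t, Op>
//     <<<grid, block, 2 * block.x * sizeof(scalar_t)>>>(out, in, row_size, init, op);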

@mfkasim1 (Contributor, author):
@pytorchbot rebase

@pytorchmergebot (Collaborator):
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator):
Successfully rebased optscan2 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout optscan2 && git pull --rebase)

@mfkasim1 (Contributor, author):
Thanks for your help and education, @ngimel
@pytorchbot merge

@pytorch-bot added the "ciflow/trunk" (trigger trunk jobs on your pull request) label on Jun 13, 2023.
@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


pytorchmergebot pushed a commit that referenced this pull request on Jun 14, 2023 (#103502):

What this PR does (continuation of #103435):
- Apply a dynamic number of threads to the inner-dimension scan-with-indices functions.
- Use dynamically allocated shared memory to get rid of the num_threads template argument.

@ngimel
Pull Request resolved: #103502
Approved by: https://github.com/ngimel
Labels: ciflow/trunk (trigger trunk jobs on your pull request), Merged, open source, release notes: cuda (release notes category)