CUDA_KERNEL_LOOP: prevent int overflow in loop increment. #24818
Conversation
The issue: even for a valid array size, the loop variable in CUDA_KERNEL_LOOP can overflow an int when the loop increment is applied. This is a minimal reproducer:

import torch
x = torch.randn(1, 1, 1, 1073741825, dtype=torch.float16, device="cuda:0")
torch.functional.F.avg_pool2d(x, kernel_size=[1, 1], stride=[1, 1], padding=[0, 0], count_include_pad=True)
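For context, CUDA_KERNEL_LOOP is the usual grid-stride loop helper. The sketch below shows the idea of the change; the `_OLD`/`_NEW` names are only for this illustration, and the actual patch may differ in detail:

```cuda
// Before: everything is a 32-bit int. In the last iteration,
// i + blockDim.x * gridDim.x can exceed INT_MAX and overflow,
// even though n itself fits comfortably in an int.
#define CUDA_KERNEL_LOOP_OLD(i, n)                      \
  for (int i = blockIdx.x * blockDim.x + threadIdx.x;   \
       i < (n);                                         \
       i += blockDim.x * gridDim.x)

// After: the running index is accumulated in an int64_t, so the
// increment cannot wrap. The loop body still sees a plain int `i`,
// which is only assigned while the 64-bit value is < n.
#define CUDA_KERNEL_LOOP_NEW(i, n)                                  \
  int64_t _i_n_d_e_x = blockIdx.x * blockDim.x + threadIdx.x;       \
  for (int i = _i_n_d_e_x; _i_n_d_e_x < (n);                        \
       _i_n_d_e_x += blockDim.x * gridDim.x, i = _i_n_d_e_x)
```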
Performance: @ngimel has recommended caution when using 64-bit index arithmetic in CUDA kernels, since it can be slower than 32-bit indexing. In this benchmark, the timings were the same before and after the change.
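As a rough way to sanity-check that claim, one could time a simple grid-stride kernel with 32-bit versus 64-bit index arithmetic using CUDA events. This is only an illustrative harness (kernel, sizes, and launch configuration are made up), not the benchmark referenced above:

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride kernel whose index type is a template parameter, so the same
// body can be timed with 32-bit and 64-bit increments.
template <typename index_t>
__global__ void scale_kernel(float* data, index_t n, float factor) {
  const index_t stride =
      static_cast<index_t>(blockDim.x) * static_cast<index_t>(gridDim.x);
  for (index_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    data[i] *= factor;
  }
}

template <typename index_t>
float time_kernel(float* data, int64_t n) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  scale_kernel<index_t><<<4096, 256>>>(data, static_cast<index_t>(n), 1.0f);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

int main() {
  const int64_t n = 1 << 28;  // large enough to be memory-bound
  float* data = nullptr;
  cudaMalloc(&data, n * sizeof(float));
  std::printf("32-bit increment: %.3f ms\n", time_kernel<int32_t>(data, n));
  std::printf("64-bit increment: %.3f ms\n", time_kernel<int64_t>(data, n));
  cudaFree(data);
  return 0;
}
```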
@pytorchbot retest this please.
Summary: Spin-off from pytorch/pytorch#24818.
Pull Request resolved: pytorch/pytorch#24820
Differential Revision: D16890917
Pulled By: ezyang
fbshipit-source-id: 88df6d3ba98600acc810eda406daa1d850ed3320
I can see how the change you made avoids overflow, but I don't see how you get a correct result afterwards. If you always cast index to int, won't the truncated value be wrong? I think the traditional way we've solved this problem is to have two versions of the kernel, one with 32-bit indexing and one with 64-bit indexing, and switch to the slower 64-bit kernel if applicable.
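Something along those lines could look like the sketch below; the kernel, launcher, and dispatch condition are illustrative (in ATen the decision is made with helpers such as canUse32BitIndexMath), not the actual PyTorch code:

```cuda
#include <algorithm>
#include <cstdint>
#include <limits>

// Kernel templated on the index type: the 32-bit instantiation is the fast
// path, the 64-bit one is the fallback for huge tensors.
template <typename index_t, typename scalar_t>
__global__ void fill_kernel(scalar_t* out, index_t n, scalar_t value) {
  const index_t stride =
      static_cast<index_t>(blockDim.x) * static_cast<index_t>(gridDim.x);
  for (index_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    out[i] = value;
  }
}

template <typename scalar_t>
void launch_fill(scalar_t* out, int64_t n, scalar_t value) {
  const int threads = 256;
  const int64_t blocks = std::min<int64_t>((n + threads - 1) / threads, 65535);
  const int64_t stride = blocks * threads;
  // Pick 32-bit indexing only when even the final increment (i + stride)
  // stays below INT_MAX; otherwise fall back to the slower 64-bit kernel.
  if (n + stride <= std::numeric_limits<int32_t>::max()) {
    fill_kernel<int32_t><<<static_cast<int>(blocks), threads>>>(
        out, static_cast<int32_t>(n), value);
  } else {
    fill_kernel<int64_t><<<static_cast<int>(blocks), threads>>>(out, n, value);
  }
}
```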
I don't see how this works.
It works because the loop condition is evaluated on the 64-bit value, and `i` is only assigned from it while that value is still below `n`.
The overflow only matters in the last iteration, where the incremented index can exceed INT_MAX. With the 64-bit counter that value is still representable, the `< n` comparison fails, and the loop body never runs with an out-of-range index.
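To make the last-iteration point concrete with the reproducer's size (the stride here is an assumption about the launch configuration, chosen only to show the arithmetic):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const int64_t n = 1073741825;       // 2^30 + 1 elements, as in the reproducer
  const int64_t stride = 1073741824;  // assumed blockDim.x * gridDim.x

  // A thread whose index lands just below n performs one final increment.
  const int64_t last = n - 1;         // 2^30
  const int64_t next = last + stride; // 2^31 = 2147483648

  // In 32-bit arithmetic this addition exceeds INT_MAX (2147483647): signed
  // overflow, which in practice wraps to a negative value. In 64-bit
  // arithmetic the result is representable, the `< n` test fails, and the
  // loop simply terminates without ever using an out-of-range index.
  std::printf("next = %lld, INT_MAX = %d, keeps looping: %d\n",
              static_cast<long long>(next), INT32_MAX, next < n ? 1 : 0);
  return 0;
}
```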
If the macro wasn't used, the cast would just be inside the loop:
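A sketch of what that would look like with the loop written out by hand instead of through the macro (the kernel and its names are illustrative):

```cuda
#include <cstdint>

__global__ void example_kernel(const float* in, float* out, int n) {
  // Drive the loop with a 64-bit counter; the narrowing cast to int happens
  // inside the body, where the value is already known to be < n.
  for (int64_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += static_cast<int64_t>(blockDim.x) * gridDim.x) {
    const int index = static_cast<int>(i);  // safe: i < n and n fits in an int
    out[index] = in[index];
  }
}
```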
This is not possible currently due to the way the macro is used. It would arguably be cleaner, but in that case we'd need to write out the loop in full and get rid of the macro.
Correction: of course signed casts are not UB, but are implementation-defined, so it should be fine.
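Illustrating that point (host-side, nothing PyTorch-specific): converting an out-of-range int64_t to int is implementation-defined before C++20 and defined as modular wrapping since C++20, unlike signed arithmetic overflow, which is undefined behaviour.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const int64_t big = INT64_C(2147483648);            // 2^31, not representable in int32_t
  const int32_t narrowed = static_cast<int32_t>(big); // implementation-defined (pre-C++20)
  std::printf("%d\n", narrowed);                      // typically -2147483648 on two's complement
  return 0;
}
```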
@pytorchbot retest this please.
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Fixes pytorch/pytorch#24309.
Pull Request resolved: pytorch/pytorch#24818
Differential Revision: D16927215
Pulled By: ezyang
fbshipit-source-id: aeab5226fec6045941399693479975db4542c79e
Fixes #24309.