[JIT] Ensure offset is a multiple of 4 to fix "Philox" RNG in jitted kernels #50169
💊 CI failures summary and remediations
As of commit 25cdf69 (more details on the Dr. CI page):
This comment was automatically generated by Dr. CI. This comment has been revised 6 times.
Codecov Report
@@ Coverage Diff @@
## master #50169 +/- ##
==========================================
+ Coverage 80.49% 80.68% +0.18%
==========================================
Files 1900 1900
Lines 206254 206254
==========================================
+ Hits 166018 166409 +391
+ Misses 40236 39845 -391
Please add the PR description as a note somewhere in the code.
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Immediately-upstreamable part of #50148.
This PR fixes what I'm fairly sure is a subtle bug with custom Philox class usage in jitted kernels. Philox constructors in kernels take the cuda rng generator's current offset. The Philox constructor then carries out offset/4 (a uint64_t division) to compute its internal offset in its virtual Philox bitstream of 128-bit chunks. In other words, it assumes the incoming offset is a multiple of 4. But (in current code) that's not guaranteed. For example, the increments used by these eager kernels could easily make offset not divisible by 4.
I figured the easiest fix was to round all incoming increments up to the nearest multiple of 4 in CUDAGeneratorImpl itself.
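To make the arithmetic concrete, here is a minimal standalone sketch of the problem and the rounding fix. It is not the actual CUDAGeneratorImpl or Philox code: `chunk_index`, `round_up_to_multiple_of_4`, and `ToyGenerator` are hypothetical names invented for illustration, and the sketch ignores overflow near the top of the uint64_t range.

```cpp
#include <cstdint>

// Hypothetical helper: the 128-bit chunk index that a Philox-style
// constructor would derive from an incoming offset via offset/4.
// Unsigned division truncates, so offsets 4, 5, 6 and 7 all land on chunk 1.
constexpr std::uint64_t chunk_index(std::uint64_t offset) {
  return offset / 4;
}

// Hypothetical helper: round an increment up to the nearest multiple of 4.
// (Assumes increment + 3 does not overflow.)
constexpr std::uint64_t round_up_to_multiple_of_4(std::uint64_t increment) {
  return ((increment + 3) / 4) * 4;
}

// Hypothetical stand-in for the generator's offset bookkeeping.
struct ToyGenerator {
  std::uint64_t offset = 0;

  // Hands out the current offset, then advances by the rounded-up increment,
  // so every offset handed out stays a multiple of 4.
  std::uint64_t reserve(std::uint64_t increment) {
    const std::uint64_t old_offset = offset;
    offset += round_up_to_multiple_of_4(increment);
    return old_offset;
  }
};

// The bug: two distinct offsets collapse onto the same chunk.
static_assert(chunk_index(4) == chunk_index(5), "offset/4 truncates");

// The fix: increments are rounded up to 0, 4, 8, ...
static_assert(round_up_to_multiple_of_4(0) == 0, "0 stays 0");
static_assert(round_up_to_multiple_of_4(4) == 4, "multiples are unchanged");
static_assert(round_up_to_multiple_of_4(5) == 8, "5 rounds up to 8");
```

With this bookkeeping, a request for 5 values followed by a request for 3 values hands out offsets 0 and 8, so a later offset/4 never discards a remainder.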
Another option would be to round the current offset up to the next multiple of 4 at the jit point of use. But that would be a jit-specific offset jump, so jit rng kernels wouldn't have a prayer of being bitwise accurate with eager rng kernels that used non-multiple-of-4 offsets. Restricting the offset to multiples of 4 for everyone at least gives jit rng the chance to match eager rng. (Of course, there are still many other ways the numerics could diverge, like if a jit kernel launches a different number of threads than an eager kernel, or assigns threads to data elements differently.)