-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add non-deterministic alert to CUDA operations that use atomicAdd()
#40056
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add non-deterministic alert to CUDA operations that use atomicAdd()
#40056
Conversation
Turns out cuda has another source of non-determinism. Starting from cuda 10.2, when someone runs matmuls in the different streams the results are not guaranteed to be deterministic. In pytorch, this manifests itself in LSTMs #39849, there could also be user code triggering this behavior, if users are launching work on multiple streams. Here's a link to cudnn documentation https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_8.html#rel-800-Preview__section_qhc_jc1_5kb. |
Thanks for finding this @ngimel. We could set the |
2009b1a
to
02d2805
Compare
💊 CI failures summary and remediationsAs of commit b47ed4d (more details on the Dr. CI page):
ci.pytorch.org: 1 failedThis comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group. This comment has been revised 39 times. |
02d2805
to
10a4a3e
Compare
torch/testing/_internal/common_nn.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see any way to disable CPU for the non-deterministic alert tests for operations in the nn module, so I had to add this.
10a4a3e
to
a200688
Compare
I've added tests for all the operations that are supposed to throw an error, except for |
They are most likely in test_nn generated tests. |
a200688
to
5a3f615
Compare
atomicAdd()
atomicAdd()
5a3f615
to
8f204d8
Compare
The clang-tidy job is failing. This is the first error I see in the log and I'm not sure what it means: |
My guess is that there's an issue with accessing
EDIT: Oh nevermind, I had the wrong clang-tidy installed. I installed clang-tidy-8 and I can reproduce now. |
I don't think the clang-tidy failure is actually caused by my changes. I tried running it on a file that I didn't change, and I get basically the same error as far as I can tell:
|
clang-tidy is disabled again as of #40764, sorry about the trouble. |
Also add non-deterministic alert test for embedding_bag
73978a0
to
222758e
Compare
const Tensor& gradOutput, | ||
const Tensor& input) | ||
{ | ||
globalContext().alertNotDeterministic("adaptive_avg_pool2d_backward_out_cuda"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at these alerts, it would probably useful if we had "receipts" for why the are not deterministic (some sort of comment saying something like // Non-deterministic because of use of atomicAdd below
). If these ever go out of date, the receipt would help a future developer understand when the warning could be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I'll add those.
Just a thought--would it be useful to print out this information as part of the alert? I could potentially add a reason
argument to alertNotDeterministic
. Perhaps this information wouldn't be useful for the user though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
366783b
to
b47ed4d
Compare
FYI, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This PR breaks
|
Looks like it's more than just Windows. I'm guessing something related to |
…cAdd()` (#41538) Summary: Reland PR #40056 A new overload of upsample_linear1d_backward_cuda was added in a recent commit, so I had to add the nondeterministic alert to it. Pull Request resolved: #41538 Reviewed By: zou3519 Differential Revision: D22608376 Pulled By: ezyang fbshipit-source-id: 54a2aa127e069197471f1feede6ad8f8dc6a2f82
Issue #15359