
Conversation

kurtamohler
Collaborator

Issue #15359

@kurtamohler kurtamohler requested review from colesbury, ezyang and t-vi June 15, 2020 22:01
@ngimel
Collaborator

ngimel commented Jun 17, 2020

Turns out CUDA has another source of non-determinism. Starting from CUDA 10.2, when someone runs matmuls in different streams, the results are not guaranteed to be deterministic. In PyTorch this manifests itself in LSTMs (#39849), and there could also be user code triggering this behavior if users launch work on multiple streams. Here's a link to the cuDNN documentation: https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_8.html#rel-800-Preview__section_qhc_jc1_5kb.
There's a workaround with setting an environment variable to force deterministic behavior.

@kurtamohler
Collaborator Author

> There's a workaround with setting an environment variable to force deterministic behavior.

Thanks for finding this, @ngimel. We could set the CUBLAS_WORKSPACE_CONFIG variable with setenv() inside Context::setDeterministic() if CUDA >= 10.2. I guess we'd probably want to error out if the user has already set that variable, though. I can make another PR to add this feature.
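
To make that concrete, here is a minimal sketch of the idea, assuming POSIX setenv()/getenv() and the two workspace configurations the cuBLAS documentation describes as deterministic; the function name and error text below are illustrative, not the actual Context::setDeterministic() code:

```cpp
// Illustrative sketch only: force a deterministic cuBLAS workspace when
// determinism is requested on CUDA >= 10.2, erroring out if the user already
// chose a conflicting value. Not the actual PyTorch implementation.
#include <cstdlib>    // std::getenv, setenv (POSIX)
#include <stdexcept>
#include <string>

void setDeterministicCublasWorkspace() {
  // The two workspace configurations the cuBLAS docs list as deterministic.
  const std::string small = ":16:8";
  const std::string large = ":4096:8";
  const char* existing = std::getenv("CUBLAS_WORKSPACE_CONFIG");
  if (existing != nullptr && existing != small && existing != large) {
    // The user set the variable to something else; error out rather than
    // silently overriding their choice.
    throw std::runtime_error(
        "CUBLAS_WORKSPACE_CONFIG is already set to a value that does not "
        "guarantee deterministic cuBLAS behavior");
  }
  if (existing == nullptr) {
    setenv("CUBLAS_WORKSPACE_CONFIG", large.c_str(), /*overwrite=*/1);
  }
}
```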

@kurtamohler kurtamohler force-pushed the deterministic-flag-atomicadd-15359 branch from 2009b1a to 02d2805 Compare June 22, 2020 18:49
@dr-ci

dr-ci bot commented Jun 22, 2020

💊 CI failures summary and remediations

As of commit b47ed4d (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed



@kurtamohler kurtamohler marked this pull request as draft June 22, 2020 18:58
@kurtamohler kurtamohler force-pushed the deterministic-flag-atomicadd-15359 branch from 02d2805 to 10a4a3e Compare June 24, 2020 17:55
@t-vi t-vi removed their request for review June 24, 2020 17:59
@kurtamohler
Collaborator Author

I didn't see any way to disable CPU for the non-deterministic alert tests for operations in the nn module, so I had to add this.

@kurtamohler kurtamohler force-pushed the deterministic-flag-atomicadd-15359 branch from 10a4a3e to a200688 Compare June 26, 2020 23:12
@kurtamohler
Collaborator Author

I've added tests for all the operations that are supposed to throw an error, except for grid_sampler_2d_backward_cuda and grid_sampler_3d_backward_cuda. I haven't figured out yet where those get exercised in the tests.

@ngimel
Collaborator

ngimel commented Jun 26, 2020

They are most likely exercised by the generated tests in test_nn.

@kurtamohler kurtamohler force-pushed the deterministic-flag-atomicadd-15359 branch from a200688 to 5a3f615 Compare June 29, 2020 19:18
@kurtamohler kurtamohler changed the title from "[WIP] Add non-deterministic alert to CUDA operations that use atomicAdd()" to "Add non-deterministic alert to CUDA operations that use atomicAdd()" Jun 29, 2020
@kurtamohler kurtamohler marked this pull request as ready for review June 29, 2020 19:20
@kurtamohler kurtamohler force-pushed the deterministic-flag-atomicadd-15359 branch from 5a3f615 to 8f204d8 Compare June 29, 2020 23:43
@kurtamohler
Collaborator Author

The clang-tidy job is failing. This is the first error I see in the log, and I'm not sure what it means: `error: cannot find libdevice for sm_20. Provide path to different CUDA installation via --cuda-path, or pass -nocudalib to build without linking with libdevice. [clang-diagnostic-error]`

@colesbury colesbury added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jun 30, 2020
@kurtamohler
Collaborator Author

kurtamohler commented Jun 30, 2020

My guess is that there's an issue with accessing at::globalContext() from a CUDA file, but I'm having trouble debugging because I can't reproduce the issue. When I run clang-tidy, I get this:

```
$ python tools/clang_tidy.py --paths torch/csrc/ aten/src/ATen/ --diff HEAD~4 -g-torch/csrc/jit/serialization/export.cpp -g-torch/csrc/jit/serialization/import.cpp -g-torch/csrc/jit/serialization/import_legacy.cpp -g-torch/csrc/onnx/init.cpp '-g-torch/csrc/cuda/nccl.*' -g-torch/csrc/cuda/python_nccl.cpp
Skipping /home/kurtamohler/development/pytorch-deterministic-flag-callers/aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu. Compile command not found.
Skipping /home/kurtamohler/development/pytorch-deterministic-flag-callers/aten/src/ATen/native/cuda/AdaptiveAveragePooling3d.cu. Compile command not found.
```

I'm probably missing some dependency or configuration that clang-tidy needs, but I don't know yet.

EDIT: Oh, never mind, I had the wrong clang-tidy installed. I installed clang-tidy-8 and can reproduce it now.

@kurtamohler
Collaborator Author

I don't think the clang-tidy failure is actually caused by my changes. I tried running it on a file that I didn't change, and I get basically the same error as far as I can tell:

$ python tools/clang_tidy.py --paths torch/csrc/ aten/src/ATen/ -g aten/src/ATen/native/cuda/Dropout.cu -e /usr/lib/llvm-8/bin/clang-tidy 
Error while processing /home/kurtamohler/development/pytorch-deterministic-flag-callers/aten/src/ATen/native/cuda/Dropout.cu.
Found compiler error(s).
Traceback (most recent call last):
  File "tools/clang_tidy.py", line 55, in run_shell_command
    output = subprocess.check_output(arguments).decode().strip()
  File "/home/kurtamohler/miniconda3/envs/pytorch-det-flag/lib/python3.8/subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/kurtamohler/miniconda3/envs/pytorch-det-flag/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/lib/llvm-8/bin/clang-tidy', '-p', 'build', '-config', '{"Checks": "-*, bugprone-*, -bugprone-forward-declaration-namespace, -bugprone-macro-parentheses, -bugprone-lambda-function-name, cppcoreguidelines-*, -cppcoreguidelines-interfaces-global-init, -cppcoreguidelines-owning-memory, -cppcoreguidelines-pro-bounds-array-to-pointer-decay, -cppcoreguidelines-pro-bounds-constant-array-index, -cppcoreguidelines-pro-bounds-pointer-arithmetic, -cppcoreguidelines-pro-type-cstyle-cast, -cppcoreguidelines-pro-type-reinterpret-cast, -cppcoreguidelines-pro-type-static-cast-downcast, -cppcoreguidelines-pro-type-union-access, -cppcoreguidelines-pro-type-vararg, -cppcoreguidelines-special-member-functions, hicpp-exception-baseclass, hicpp-avoid-goto, modernize-*, -modernize-return-braced-init-list, -modernize-use-auto, -modernize-use-default-member-init, -modernize-use-using, -modernize-use-trailing-return-type, performance-*, -performance-noexcept-move-constructor, ", "HeaderFilterRegex": "torch/csrc/.*", "AnalyzeTemporaryDtors": false, "CheckOptions": null}', 'aten/src/ATen/native/cuda/Dropout.cu']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/clang_tidy.py", line 306, in <module>
    main()
  File "tools/clang_tidy.py", line 298, in main
    clang_tidy_output = run_clang_tidy(options, line_filters, files)
  File "tools/clang_tidy.py", line 191, in run_clang_tidy
    output = run_shell_command(command)
  File "tools/clang_tidy.py", line 59, in run_shell_command
    raise RuntimeError("Error executing {}: {}".format(" ".join(arguments), error_output))
RuntimeError: Error executing /usr/lib/llvm-8/bin/clang-tidy -p build -config {"Checks": "-*, bugprone-*, -bugprone-forward-declaration-namespace, -bugprone-macro-parentheses, -bugprone-lambda-function-name, cppcoreguidelines-*, -cppcoreguidelines-interfaces-global-init, -cppcoreguidelines-owning-memory, -cppcoreguidelines-pro-bounds-array-to-pointer-decay, -cppcoreguidelines-pro-bounds-constant-array-index, -cppcoreguidelines-pro-bounds-pointer-arithmetic, -cppcoreguidelines-pro-type-cstyle-cast, -cppcoreguidelines-pro-type-reinterpret-cast, -cppcoreguidelines-pro-type-static-cast-downcast, -cppcoreguidelines-pro-type-union-access, -cppcoreguidelines-pro-type-vararg, -cppcoreguidelines-special-member-functions, hicpp-exception-baseclass, hicpp-avoid-goto, modernize-*, -modernize-return-braced-init-list, -modernize-use-auto, -modernize-use-default-member-init, -modernize-use-using, -modernize-use-trailing-return-type, performance-*, -performance-noexcept-move-constructor, ", "HeaderFilterRegex": "torch/csrc/.*", "AnalyzeTemporaryDtors": false, "CheckOptions": null} aten/src/ATen/native/cuda/Dropout.cu: error: cannot find libdevice for sm_20. Provide path to different CUDA installation via --cuda-path, or pass -nocudalib to build without linking with libdevice. [clang-diagnostic-error]
...

@ngimel
Collaborator

ngimel commented Jun 30, 2020

clang-tidy is disabled again as of #40764, sorry about the trouble.

@kurtamohler kurtamohler force-pushed the deterministic-flag-atomicadd-15359 branch from 73978a0 to 222758e Compare July 9, 2020 21:36
const Tensor& gradOutput,
const Tensor& input)
{
globalContext().alertNotDeterministic("adaptive_avg_pool2d_backward_out_cuda");
Contributor

Looking at these alerts, it would probably be useful if we had "receipts" for why they are not deterministic (some sort of comment saying something like // Non-deterministic because of use of atomicAdd below). If these ever go out of date, the receipt would help a future developer understand when the warning could be removed.

@kurtamohler
Collaborator Author

Good idea, I'll add those.

Just a thought: would it be useful to print this information as part of the alert? I could potentially add a reason argument to alertNotDeterministic. Perhaps this information wouldn't be useful to the user, though.
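
For illustration, a standalone toy version of that reason-argument idea could look like the sketch below; the free function and the boolean flag are stand-ins for the Context API (which in this PR takes only the caller name), so treat the signature as hypothetical:

```cpp
// Toy sketch of alertNotDeterministic() with an extra "reason" argument.
// Illustrative only; not the API added by this PR.
#include <iostream>
#include <stdexcept>
#include <string>

namespace {

bool deterministic_flag = false;  // stand-in for the context's deterministic setting

void alertNotDeterministic(const std::string& caller, const std::string& reason) {
  // Only complain when the user asked for deterministic behavior.
  if (deterministic_flag) {
    throw std::runtime_error(
        caller + " does not have a deterministic implementation (" + reason + ")");
  }
}

}  // namespace

int main() {
  deterministic_flag = true;
  try {
    // Mirrors a call site like the one in the diff above, with the "receipt"
    // folded into the error message instead of living only in a source comment.
    alertNotDeterministic("adaptive_avg_pool2d_backward_out_cuda",
                          "accumulates gradients with atomicAdd");
  } catch (const std::runtime_error& e) {
    std::cout << e.what() << "\n";
  }
  return 0;
}
```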

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@kurtamohler kurtamohler force-pushed the deterministic-flag-atomicadd-15359 branch from 366783b to b47ed4d Compare July 10, 2020 18:41
@kurtamohler
Collaborator Author

FYI, the caffe2-pytorch-linux-xenial-rocm3.5.1-py3.6-test CI job passed on commit 222758e. Then I added some comments in commit b47ed4d and the job failed with "Build timed out (after 180 minutes)". It doesn't seem possible that just adding comments would cause that failure, so I think it's just an anomaly.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ezyang merged this pull request in 6ff306b.

@mrshenli
Contributor

This PR breaks the pytorch_windows_vs2017_14.13_py36_cuda10.1_test1 test on master; see the following runs. Reverting.

======================================================================
FAIL [0.024s]: test_interpolate_linear_1d_alert_nondeterministic_cuda (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 674, in efail_fn_no_device
    return efail_fn(slf, None, *args, **kwargs)
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 665, in efail_fn
    slf.fail('expected a non-deterministic error, but it was not raised')
AssertionError: expected a non-deterministic error, but it was not raised

https://app.circleci.com/pipelines/github/pytorch/pytorch/190846/workflows/9a4318d3-e490-4fe4-98ba-954d06b7beb6/jobs/6249277/steps

https://app.circleci.com/pipelines/github/pytorch/pytorch/190849/workflows/d8a49560-b286-4a96-94ab-36ab51da063c/jobs/6249804/steps

@kurtamohler
Collaborator Author

kurtamohler commented Jul 15, 2020

Looks like it's more than just Windows. I'm guessing something related to interpolate_linear_1d changed since CI was last run for this PR. I'll fix and resubmit.

facebook-github-bot pushed a commit that referenced this pull request Jul 22, 2020
…cAdd()` (#41538)

Summary:
Reland PR #40056

A new overload of upsample_linear1d_backward_cuda was added in a recent commit, so I had to add the nondeterministic alert to it.

Pull Request resolved: #41538

Reviewed By: zou3519

Differential Revision: D22608376

Pulled By: ezyang

fbshipit-source-id: 54a2aa127e069197471f1feede6ad8f8dc6a2f82

Labels

Merged, open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
