
Conversation

kurtamohler
Collaborator

Issue #15359

@kurtamohler kurtamohler requested review from colesbury, ezyang and t-vi June 15, 2020 22:01
@ngimel
Collaborator

ngimel commented Jun 17, 2020

Turns out CUDA has another source of non-determinism. Starting from CUDA 10.2, when someone runs matmuls in different streams, the results are not guaranteed to be deterministic. In PyTorch this manifests itself in LSTMs (#39849), and there could also be user code triggering this behavior if users launch work on multiple streams. Here's a link to the cuDNN documentation: https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_8.html#rel-800-Preview__section_qhc_jc1_5kb.
There's a workaround with setting an environment variable to force deterministic behavior.

@kurtamohler
Collaborator Author

> There's a workaround with setting an environment variable to force deterministic behavior.

Thanks for finding this, @ngimel. We could set the CUBLAS_WORKSPACE_CONFIG variable with setenv() inside Context::setDeterministic() if CUDA >= 10.2. I guess we'd probably want to error out if the user has already set that variable, though. I can make another PR to add this feature.
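
To make that concrete, here is a minimal sketch of the idea, assuming POSIX setenv()/getenv() and the two workspace configurations the cuBLAS documentation describes as deterministic; the function name and error text below are illustrative, not the actual Context::setDeterministic() code:

```cpp
// Illustrative sketch only: force a deterministic cuBLAS workspace when
// determinism is requested on CUDA >= 10.2, erroring out if the user already
// chose a conflicting value. Not the actual PyTorch implementation.
#include <cstdlib>    // std::getenv, setenv (POSIX)
#include <stdexcept>
#include <string>

void setDeterministicCublasWorkspace() {
  // The two workspace configurations the cuBLAS docs list as deterministic.
  const std::string small = ":16:8";
  const std::string large = ":4096:8";
  const char* existing = std::getenv("CUBLAS_WORKSPACE_CONFIG");
  if (existing != nullptr && existing != small && existing != large) {
    // The user set the variable to something else; error out rather than
    // silently overriding their choice.
    throw std::runtime_error(
        "CUBLAS_WORKSPACE_CONFIG is already set to a value that does not "
        "guarantee deterministic cuBLAS behavior");
  }
  if (existing == nullptr) {
    setenv("CUBLAS_WORKSPACE_CONFIG", large.c_str(), /*overwrite=*/1);
  }
}
```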

@kurtamohler kurtamohler force-pushed the deterministic-flag-atomicadd-15359 branch from 2009b1a to 02d2805 Compare June 22, 2020 18:49
@dr-ci

dr-ci bot commented Jun 22, 2020

💊 CI failures summary and remediations

As of commit b47ed4d (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed



@kurtamohler kurtamohler marked this pull request as draft June 22, 2020 18:58
@kurtamohler kurtamohler force-pushed the deterministic-flag-atomicadd-15359 branch from 02d2805 to 10a4a3e Compare June 24, 2020 17:55
@t-vi t-vi removed their request for review June 24, 2020 17:59
@kurtamohler
Collaborator Author

I didn't see any way to disable CPU for the non-deterministic alert tests for operations in the nn module, so I had to add this.

@kurtamohler kurtamohler force-pushed the deterministic-flag-atomicadd-15359 branch from 10a4a3e to a200688 Compare June 26, 2020 23:12
@kurtamohler
Collaborator Author

I've added tests for all the operations that are supposed to throw an error, except for grid_sampler_2d_backward_cuda and grid_sampler_3d_backward_cuda. I haven't figured out yet where those get exercised in the tests.

@ngimel
Collaborator

ngimel commented Jun 26, 2020

They are most likely exercised by the generated tests in test_nn.

@kurtamohler kurtamohler force-pushed the deterministic-flag-atomicadd-15359 branch from a200688 to 5a3f615 Compare June 29, 2020 19:18
@kurtamohler kurtamohler changed the title from "[WIP] Add non-deterministic alert to CUDA operations that use atomicAdd()" to "Add non-deterministic alert to CUDA operations that use atomicAdd()" Jun 29, 2020
@kurtamohler kurtamohler marked this pull request as ready for review June 29, 2020 19:20
@kurtamohler kurtamohler force-pushed the deterministic-flag-atomicadd-15359 branch from 5a3f615 to 8f204d8 Compare June 29, 2020 23:43
@kurtamohler
Collaborator Author

The clang-tidy job is failing. This is the first error I see in the log, and I'm not sure what it means: `error: cannot find libdevice for sm_20. Provide path to different CUDA installation via --cuda-path, or pass -nocudalib to build without linking with libdevice. [clang-diagnostic-error]`

@colesbury colesbury added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jun 30, 2020
@kurtamohler
Collaborator Author

kurtamohler commented Jun 30, 2020

My guess is that there's an issue with accessing at::globalContext() from a CUDA file, but I'm having trouble debugging because I can't reproduce the issue. When I run clang-tidy, I get this:

```
$ python tools/clang_tidy.py --paths torch/csrc/ aten/src/ATen/ --diff HEAD~4 -g-torch/csrc/jit/serialization/export.cpp -g-torch/csrc/jit/serialization/import.cpp -g-torch/csrc/jit/serialization/import_legacy.cpp -g-torch/csrc/onnx/init.cpp '-g-torch/csrc/cuda/nccl.*' -g-torch/csrc/cuda/python_nccl.cpp
Skipping /home/kurtamohler/development/pytorch-deterministic-flag-callers/aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu. Compile command not found.
Skipping /home/kurtamohler/development/pytorch-deterministic-flag-callers/aten/src/ATen/native/cuda/AdaptiveAveragePooling3d.cu. Compile command not found.
```

I'm probably missing some dependency or configuration that clang-tidy needs, but I don't know yet.

EDIT: Oh, never mind, I had the wrong clang-tidy installed. I installed clang-tidy-8 and can reproduce it now.

@kurtamohler
Collaborator Author

I don't think the clang-tidy failure is actually caused by my changes. I tried running it on a file that I didn't change, and I get basically the same error as far as I can tell:

$ python tools/clang_tidy.py --paths torch/csrc/ aten/src/ATen/ -g aten/src/ATen/native/cuda/Dropout.cu -e /usr/lib/llvm-8/bin/clang-tidy 
Error while processing /home/kurtamohler/development/pytorch-deterministic-flag-callers/aten/src/ATen/native/cuda/Dropout.cu.
Found compiler error(s).
Traceback (most recent call last):
  File "tools/clang_tidy.py", line 55, in run_shell_command
    output = subprocess.check_output(arguments).decode().strip()
  File "/home/kurtamohler/miniconda3/envs/pytorch-det-flag/lib/python3.8/subprocess.py", line 411, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/kurtamohler/miniconda3/envs/pytorch-det-flag/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/lib/llvm-8/bin/clang-tidy', '-p', 'build', '-config', '{"Checks": "-*, bugprone-*, -bugprone-forward-declaration-namespace, -bugprone-macro-parentheses, -bugprone-lambda-function-name, cppcoreguidelines-*, -cppcoreguidelines-interfaces-global-init, -cppcoreguidelines-owning-memory, -cppcoreguidelines-pro-bounds-array-to-pointer-decay, -cppcoreguidelines-pro-bounds-constant-array-index, -cppcoreguidelines-pro-bounds-pointer-arithmetic, -cppcoreguidelines-pro-type-cstyle-cast, -cppcoreguidelines-pro-type-reinterpret-cast, -cppcoreguidelines-pro-type-static-cast-downcast, -cppcoreguidelines-pro-type-union-access, -cppcoreguidelines-pro-type-vararg, -cppcoreguidelines-special-member-functions, hicpp-exception-baseclass, hicpp-avoid-goto, modernize-*, -modernize-return-braced-init-list, -modernize-use-auto, -modernize-use-default-member-init, -modernize-use-using, -modernize-use-trailing-return-type, performance-*, -performance-noexcept-move-constructor, ", "HeaderFilterRegex": "torch/csrc/.*", "AnalyzeTemporaryDtors": false, "CheckOptions": null}', 'aten/src/ATen/native/cuda/Dropout.cu']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/clang_tidy.py", line 306, in <module>
    main()
  File "tools/clang_tidy.py", line 298, in main
    clang_tidy_output = run_clang_tidy(options, line_filters, files)
  File "tools/clang_tidy.py", line 191, in run_clang_tidy
    output = run_shell_command(command)
  File "tools/clang_tidy.py", line 59, in run_shell_command
    raise RuntimeError("Error executing {}: {}".format(" ".join(arguments), error_output))
RuntimeError: Error executing /usr/lib/llvm-8/bin/clang-tidy -p build -config {"Checks": "-*, bugprone-*, -bugprone-forward-declaration-namespace, -bugprone-macro-parentheses, -bugprone-lambda-function-name, cppcoreguidelines-*, -cppcoreguidelines-interfaces-global-init, -cppcoreguidelines-owning-memory, -cppcoreguidelines-pro-bounds-array-to-pointer-decay, -cppcoreguidelines-pro-bounds-constant-array-index, -cppcoreguidelines-pro-bounds-pointer-arithmetic, -cppcoreguidelines-pro-type-cstyle-cast, -cppcoreguidelines-pro-type-reinterpret-cast, -cppcoreguidelines-pro-type-static-cast-downcast, -cppcoreguidelines-pro-type-union-access, -cppcoreguidelines-pro-type-vararg, -cppcoreguidelines-special-member-functions, hicpp-exception-baseclass, hicpp-avoid-goto, modernize-*, -modernize-return-braced-init-list, -modernize-use-auto, -modernize-use-default-member-init, -modernize-use-using, -modernize-use-trailing-return-type, performance-*, -performance-noexcept-move-constructor, ", "HeaderFilterRegex": "torch/csrc/.*", "AnalyzeTemporaryDtors": false, "CheckOptions": null} aten/src/ATen/native/cuda/Dropout.cu: error: cannot find libdevice for sm_20. Provide path to different CUDA installation via --cuda-path, or pass -nocudalib to build without linking with libdevice. [clang-diagnostic-error]
...

@ngimel
Collaborator

ngimel commented Jun 30, 2020

clang-tidy is disabled again as of #40764, sorry about the trouble.

@kurtamohler kurtamohler force-pushed the deterministic-flag-atomicadd-15359 branch from 73978a0 to 222758e Compare July 9, 2020 21:36
const Tensor& gradOutput,
const Tensor& input)
{
globalContext().alertNotDeterministic("adaptive_avg_pool2d_backward_out_cuda");
Contributor

Looking at these alerts, it would probably be useful if we had "receipts" for why they are not deterministic (some sort of comment saying something like // Non-deterministic because of use of atomicAdd below). If these ever go out of date, the receipt would help a future developer understand when the warning could be removed.

@kurtamohler
Collaborator Author

Good idea, I'll add those.

Just a thought: would it be useful to print this information as part of the alert? I could potentially add a reason argument to alertNotDeterministic. Perhaps this information wouldn't be useful to the user, though.
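
For illustration, a standalone toy version of that reason-argument idea could look like the sketch below; the free function and the boolean flag are stand-ins for the Context API (which in this PR takes only the caller name), so treat the signature as hypothetical:

```cpp
// Toy sketch of alertNotDeterministic() with an extra "reason" argument.
// Illustrative only; not the API added by this PR.
#include <iostream>
#include <stdexcept>
#include <string>

namespace {

bool deterministic_flag = false;  // stand-in for the context's deterministic setting

void alertNotDeterministic(const std::string& caller, const std::string& reason) {
  // Only complain when the user asked for deterministic behavior.
  if (deterministic_flag) {
    throw std::runtime_error(
        caller + " does not have a deterministic implementation (" + reason + ")");
  }
}

}  // namespace

int main() {
  deterministic_flag = true;
  try {
    // Mirrors a call site like the one in the diff above, with the "receipt"
    // folded into the error message instead of living only in a source comment.
    alertNotDeterministic("adaptive_avg_pool2d_backward_out_cuda",
                          "accumulates gradients with atomicAdd");
  } catch (const std::runtime_error& e) {
    std::cout << e.what() << "\n";
  }
  return 0;
}
```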

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@kurtamohler kurtamohler force-pushed the deterministic-flag-atomicadd-15359 branch from 366783b to b47ed4d Compare July 10, 2020 18:41
@kurtamohler
Collaborator Author

FYI, the caffe2-pytorch-linux-xenial-rocm3.5.1-py3.6-test CI job passed on commit 222758e. Then I added some comments in commit b47ed4d and the job failed with "Build timed out (after 180 minutes)". It doesn't seem possible that just adding comments would cause that failure, so I think it's just an anomaly.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ezyang merged this pull request in 6ff306b.

@mrshenli
Contributor

This PR breaks the pytorch_windows_vs2017_14.13_py36_cuda10.1_test1 test on master; see the following runs. Reverting.

======================================================================
FAIL [0.024s]: test_interpolate_linear_1d_alert_nondeterministic_cuda (__main__.TestNN)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 674, in efail_fn_no_device
    return efail_fn(slf, None, *args, **kwargs)
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_device_type.py", line 665, in efail_fn
    slf.fail('expected a non-deterministic error, but it was not raised')
AssertionError: expected a non-deterministic error, but it was not raised

https://app.circleci.com/pipelines/github/pytorch/pytorch/190846/workflows/9a4318d3-e490-4fe4-98ba-954d06b7beb6/jobs/6249277/steps

https://app.circleci.com/pipelines/github/pytorch/pytorch/190849/workflows/d8a49560-b286-4a96-94ab-36ab51da063c/jobs/6249804/steps

@kurtamohler
Collaborator Author

kurtamohler commented Jul 15, 2020

Looks like it's more than just Windows. I'm guessing something related to interpolate_linear_1d changed since CI was last run for this PR. I'll fix and resubmit.

facebook-github-bot pushed a commit that referenced this pull request Jul 22, 2020
…cAdd()` (#41538)

Summary:
Reland PR #40056

A new overload of upsample_linear1d_backward_cuda was added in a recent commit, so I had to add the nondeterministic alert to it.

Pull Request resolved: #41538

Reviewed By: zou3519

Differential Revision: D22608376

Pulled By: ezyang

fbshipit-source-id: 54a2aa127e069197471f1feede6ad8f8dc6a2f82

Labels

Merged, open source, triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
