
Throw error if torch.set_deterministic(True) is called with nondeterministic CuBLAS config #41377


Conversation

kurtamohler
Collaborator

For CUDA >= 10.2, the CUBLAS_WORKSPACE_CONFIG environment variable must be set to either :4096:8 or :16:8 to ensure that cuBLAS produces deterministic results when CUDA streams are used. This PR adds logic to torch.set_deterministic() to raise an error if CUDA >= 10.2 and this environment variable is not set to one of those values.

Issue #15359
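
A minimal usage sketch of the requirement described above. This is illustrative only: the precise place where the error is raised (inside torch.set_deterministic() per this summary, and/or at the CuBLAS call sites discussed later in the review) is whatever the PR ends up implementing, and the launch command is made up.

```python
import os

# cuBLAS reads this variable when it allocates its workspaces, so it must be
# visible before any CUDA/cuBLAS work -- typically set in the launching shell:
#   CUBLAS_WORKSPACE_CONFIG=:4096:8 python train.py
# or, equivalently, at the very top of the script:
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # ":16:8" also works

import torch

# With CUDA >= 10.2 and the variable set to one of the two accepted values,
# deterministic mode can be enabled without tripping the new check.
torch.set_deterministic(True)
```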

@kurtamohler kurtamohler requested a review from ngimel July 14, 2020 00:04
@dr-ci

dr-ci bot commented Jul 14, 2020

💊 CI failures summary and remediations

As of commit 3f4c77c (more details on the Dr. CI page):


  • 1/4 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)
  • 3/4 broken upstream at merge base 1b18adb on Aug 03 from 6:49pm to 11:15pm PDT (11 commits; fb56299 - ae67f4c)

🚧 3 fixed upstream failures:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

Since your merge base is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.


ci.pytorch.org: 1 failed



@kurtamohler kurtamohler force-pushed the nondeterministic-cuda-stream-error-15359 branch from 740dcac to 6f22888 on July 14, 2020 00:23
@kurtamohler kurtamohler force-pushed the nondeterministic-cuda-stream-error-15359 branch from 99a3dff to ad08b74 on July 14, 2020 22:18
@kurtamohler
Collaborator Author

The reason that pytorch-linux-xenial-rocm3.5.1-py3.6 is failing is that I had incorrectly assumed that if CUDAHooks::hasCUDA() returns true, the CUDART_VERSION preprocessor macro would always be defined.

I was only calling CUDAHooks::versionCUDART() if hasCUDA() was true. But hasCUDA() actually checks whether any CUDA devices are available, so even when the CUDA runtime is not present, as when we're using ROCm, hasCUDA() can still return true.

So I just need to add a function CUDAHooks::hasCUDART() to check before calling versionCUDART().

@kurtamohler kurtamohler force-pushed the nondeterministic-cuda-stream-error-15359 branch 2 times, most recently from 082d01c to 5a047d0 on July 15, 2020 23:57
@ezyang ezyang self-requested a review July 16, 2020 02:15
@ezyang ezyang added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jul 16, 2020
@kurtamohler kurtamohler force-pushed the nondeterministic-cuda-stream-error-15359 branch from 5a047d0 to 1ff8d99 on July 21, 2020 19:08
result.resize_({ self.size(0), mat2.size(1) });
return addmm_out_cuda_impl(result, result, self, mat2, 0, 1);
}

Tensor mm_cuda(const Tensor& self, const Tensor& mat2) {
globalContext().alertCuBLASConfigNotDeterministic();
Collaborator Author

@kurtamohler kurtamohler Jul 21, 2020

Initially, I only added alerts in the internal CuBLAS wrapper functions, in CUDABlas.cpp and THCBlas.cu. But when I created and ran tests for a handful of the torch operations that use these functions (like torch.mm, torch.dot, etc.), I was getting CUDA memory access errors when I ran the tests back to back.

The problem seemed to be that the error was being thrown halfway through some operations, so memory could sometimes be left in an unsafe state. So I had to call the alert function here instead, before the operations have a chance to touch any memory.

I think we should keep the alerts in the CuBLAS wrappers though, so that we automatically have error coverage for every operation that calls them.

But I wonder if I should continue to add alerts and tests for each existing torch operation that calls the CuBLAS wrappers. I feel like that might be overkill, but I'm not sure.

Contributor

Oof, that's not very nice. In principle it should be safe to unwind the stack at any given point if we are using proper C++ destructors; I wonder if something is still using legacy behavior. It's possible this is related specifically to the th_ wrappers. This is probably expedient for now.

Collaborator Author

I'll pin down exactly what the issue was.

Collaborator Author

It was simpler than I assumed. at::native::dot_cuda() changes the CuBLAS pointer mode with cublasSetPointerMode() before calling at::cuda::blas::dot(). It restores the previous setting after the call, but only if the call doesn't throw an error. So I just had to put the at::cuda::blas::dot() call in a try-catch; if there's an error, it now restores the pointer mode and rethrows.
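
A small, purely illustrative Python analogue of the restore-on-error pattern described above; the real fix is in C++ around cublasSetPointerMode(), and the "pointer mode" below is just a stand-in variable, not the cuBLAS API.

```python
from contextlib import contextmanager

# Stand-in for a process-global setting such as the CuBLAS pointer mode.
_pointer_mode = "HOST"

@contextmanager
def pointer_mode(new_mode):
    """Switch the stand-in pointer mode and restore the previous value even if
    the wrapped code raises -- the guarantee dot_cuda needs around
    at::cuda::blas::dot(), and what an RAII guard provides in C++."""
    global _pointer_mode
    old_mode, _pointer_mode = _pointer_mode, new_mode
    try:
        yield
    finally:
        _pointer_mode = old_mode  # runs on success *and* on exception

# Even though the body raises, the previous mode is restored afterwards.
try:
    with pointer_mode("DEVICE"):
        raise RuntimeError("simulated CuBLAS config alert")
except RuntimeError:
    pass
assert _pointer_mode == "HOST"
```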

Collaborator

It's even simpler than that: dot is always deterministic, regardless of the workspace setting.

Collaborator Author

"gemv and gemm" meaning only the non-batched versions? Or are the batched ones affected too?

Collaborator

Only the non-batched versions; the batched ones are not affected. What if the batch size is 1, you might ask? Well, I don't know, I'm just relaying the message. @ptrblck ?

Collaborator

Batched versions of gemv and gemm with a batch size of 1 can be non-deterministic, so we would need to disallow them.

Collaborator Author

Great, thanks for letting me know! So I'll remove the alerts for everything but gemv and gemm, batched and unbatched. I think there are a couple of questions to answer about how to handle the batch-size-1 case though. It seems like these are our options:

  1. Alert batched gemv and gemm always
  2. Alert batched gemv and gemm only if batch size is 1
    a. Add a message to the alert explaining that it's nondeterministic because the batch size is 1
    b. Don't add the message

I think 2.a. might be the best, but I'm not sure.

Collaborator Author

I removed the errors for ger and dot, so now only mm, mv, and batched multiplies throw errors. This is option 1 from my above comment. Please let me know if it would be better to actually check the batch size and provide more detail in the error message.
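
For context, a hedged sketch of the user-visible effect of option 1 (this assumes a CUDA >= 10.2 build with CUBLAS_WORKSPACE_CONFIG unset or set to an invalid value; the exact error text is whatever the PR emits):

```python
import torch

torch.set_deterministic(True)

a = torch.randn(4, 4, device="cuda")
b = torch.randn(4, 4, device="cuda")

# mm, mv, and the batched multiplies keep their alerts, so this is expected
# to raise a RuntimeError pointing at CUBLAS_WORKSPACE_CONFIG:
try:
    torch.mm(a, b)
except RuntimeError as e:
    print("deterministic alert:", e)

# dot and ger had their alerts removed, since they are always deterministic:
torch.dot(a[0], b[0])  # expected to run without an alert
```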

@kurtamohler kurtamohler force-pushed the nondeterministic-cuda-stream-error-15359 branch 2 times, most recently from 65dd62a to 2cc3b42 on July 21, 2020 20:44
Contributor

@facebook-github-bot facebook-github-bot left a comment

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang
Contributor

ezyang commented Jul 27, 2020

It would be good to also get a look from @ngimel.

@kurtamohler kurtamohler force-pushed the nondeterministic-cuda-stream-error-15359 branch 2 times, most recently from ea76ce6 to f2211c9 on July 28, 2020 22:23
Contributor

@facebook-github-bot facebook-github-bot left a comment

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@kurtamohler kurtamohler force-pushed the nondeterministic-cuda-stream-error-15359 branch from f29abaa to a731bc9 on July 30, 2020 16:11
@ezyang
Contributor

ezyang commented Aug 4, 2020

> Nevertheless, it feels wrong to introduce a unit test that needs to be run first in order to work. So for now I'll just disable it on Windows and make an issue summarizing the problem.

Thank you for the sleuthing, this seems like a good outcome to me. Maybe we should file an nvbug, not sure.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@kurtamohler
Collaborator Author

Looks like pr/caffe2-pytorch-linux-xenial-rocm3.5.1-py3.6-test timed out, so it's probably unrelated.

if os.environ.get(cublas_var_name) is not None:
del os.environ[cublas_var_name]
else:
os.environ[cublas_var_name] = config
Contributor

This is not the idiomatic way to set environment variables for a subprocess: manual edits to os.environ here are process-global and will affect all other tests running in this process. Instead, you can simply pass the desired environment variables directly to the subprocess call itself.
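
A hedged sketch of that suggestion: build the child's environment explicitly and pass it via the env kwarg, so the parent's os.environ is never mutated (the command and variable handling here are illustrative, not the PR's actual test code).

```python
import os
import subprocess
import sys

config = ":4096:8"  # or ":16:8", or None to leave the variable unset

# Copy the parent environment and adjust only what this test case needs.
env = dict(os.environ)
if config is not None:
    env["CUBLAS_WORKSPACE_CONFIG"] = config
else:
    env.pop("CUBLAS_WORKSPACE_CONFIG", None)

subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ.get('CUBLAS_WORKSPACE_CONFIG'))"],
    env=env,
    check=True,
)
```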

def test_case_info():
return 'function "%s", config "%s"' % (fn_name, '' if config is None else config)

# Wait for each process to finish and check for correct error behavior
Contributor

Do you... actually want to run each process in parallel? Running them serially seems a lot safer.

processes.append((p, fn_name, config, should_throw_error))

def test_case_info():
return 'function "%s", config "%s"' % (fn_name, '' if config is None else config)
Contributor

nit: use more modern string formatting here, e.g. `'function "{}"'.format(fn_name)`, or even better an f-string: `f'function "{fn_name}"'`

# It would have been preferable to use the `multiprocessing` module to avoid having
# to execute code from a string, but that caused issues in Windows
# https://github.com/pytorch/pytorch/pull/41377#issuecomment-666641223
p = subprocess.Popen(
Contributor

I was expecting to see check_output here, which would have reduced a lot of the Popen boilerplate. Here is one way you could make this happen: catch the expected error in the subprocess itself and convert it into a success condition (and raise errors otherwise). Then you only need to test for the exit code in the parent process.
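
One hedged way to apply that suggestion (the child script below is a stand-in for the PR's real test body): the child turns "the expected error was raised" into exit code 0 and anything unexpected into a non-zero exit, so the parent only needs check_output.

```python
import subprocess
import sys

# Child: succeed (exit 0) only if the expected RuntimeError is raised.
child_script = r"""
import sys
try:
    import torch
    torch.set_deterministic(True)
    a = torch.randn(2, 2, device="cuda")
    torch.mm(a, a)
except RuntimeError as e:
    if "CUBLAS_WORKSPACE_CONFIG" in str(e):  # placeholder check
        sys.exit(0)  # got the expected alert
    raise            # some other failure: propagate with a non-zero exit
sys.exit("expected a RuntimeError but none was raised")
"""

# Parent: check_output raises CalledProcessError on any non-zero exit code,
# which is all the parent-side error handling this test needs.
subprocess.check_output([sys.executable, "-c", child_script])
```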

Contributor

@ezyang ezyang left a comment

Approving to move things along

@facebook-github-bot
Contributor

@ezyang merged this pull request in df7c059.

facebook-github-bot pushed a commit that referenced this pull request Aug 6, 2020
Summary:
Adds an RAII guard for `cublasSetPointerMode()`.
Updates `dot_cuda` to use the guard, rather than exception catching.

Addresses this comment: #41377 (comment)

Pull Request resolved: #42639

Reviewed By: malfet

Differential Revision: D22969985

Pulled By: ezyang

fbshipit-source-id: b05c35d1884bb890f8767d6a4ef8b4724a329471
facebook-github-bot pushed a commit that referenced this pull request Aug 12, 2020
…st (#42627)

Summary:
Addresses some comments that were left unaddressed after PR #41377 was merged:

* Use `check_output` instead of `Popen` to run each subprocess sequentially
* Use f-strings rather than old python format string style
* Provide environment variables to subprocess through the `env` kwarg
* Check for correct error behavior inside the subprocess, and raise another error if incorrect. Then the main process fails the test if any error is raised

Pull Request resolved: #42627

Reviewed By: malfet

Differential Revision: D22969231

Pulled By: ezyang

fbshipit-source-id: 38d5f3f0d641c1590a93541a5e14d90c2e20acec
facebook-github-bot pushed a commit that referenced this pull request Oct 15, 2020
Summary:
Follow up to #41377 to update the error message to match the removed arguments

Pull Request resolved: #46397

Reviewed By: malfet

Differential Revision: D24336009

Pulled By: albanD

fbshipit-source-id: b9bf2f9ef7fd2ae622c4079384afc93e9c473f47
albanD added a commit to albanD/pytorch that referenced this pull request Oct 15, 2020
Summary:
Follow up to pytorch#41377 to update the error message to match the removed arguments

Pull Request resolved: pytorch#46397

Reviewed By: malfet

Differential Revision: D24336009

Pulled By: albanD

fbshipit-source-id: b9bf2f9ef7fd2ae622c4079384afc93e9c473f47
gchanan pushed a commit that referenced this pull request Oct 15, 2020
Summary:
Follow up to #41377 to update the error message to match the removed arguments

Pull Request resolved: #46397

Reviewed By: malfet

Differential Revision: D24336009

Pulled By: albanD

fbshipit-source-id: b9bf2f9ef7fd2ae622c4079384afc93e9c473f47