ROCm CI is intermittently failing with std::out_of_range #49652
This issue may be happening on other builds. This just popped up, for example, but the error there is different.
We do have an engineer looking into this. We started seeing this error intermittently perhaps a month ago, when our CI was on ROCm 3.9. Right before we switched to ROCm 3.10, the error became more frequent. We've had difficulty reproducing it on our local systems, but we will continue our efforts. Good to know it might not be specific to ROCm, though it certainly seems to be more frequent there.
@mruberry our engineer @jaglinux was able to produce a stack trace.
pytorch/torch/csrc/autograd/engine.cpp, lines 1040 to 1051 (at 74dcb6d)
Should the if condition be changed?
I think the invariant is that this should never be called in that state. The relevant call site is here.
I think we have a race in the engine in the case where the task is queued on a worker thread here, and the worker thread starts processing it before the next line of this function, which finishes initializing the graph_task, gets to run. You can argue that this happens because the call site that @ezyang mentioned above does not lock the mutex for the graph_task.
So this is not only a ROCm problem.
Also removing "module: flaky-tests", even though this is causing intermittent test failures, so that it will appear in our triage review meeting.
…rch#50164)

Summary: This solves a race condition where the worker thread might see a partially initialized graph_task.

Fixes pytorch#49652

I don't know how to reliably trigger the race, so I didn't add any test. But the ROCm build flakiness (it just happens to race more often on ROCm builds) should disappear after this PR.

Pull Request resolved: pytorch#50164
Reviewed By: zou3519
Differential Revision: D25824954
Pulled By: albanD
fbshipit-source-id: 6a3391753cb2afd2ab415d3fb2071a837cc565bb
See https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm3.10-py3.6-test1/126//console and https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm3.10-py3.6-test1/110//console for examples.
As best I can tell, this error always happens around the same place in the test suite (though note that in these examples the actual tests where the error occurs are different).
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @albanD @gqchen @pearu @nikitaved @soulitzer @jeffdaily @sunway513