torch.autograd.backward() fails to sync with other stream #47028
Comments
So the proper fix is just to add the synchronize in the test, right?
I'm fairly sure backward() always syncs with the default stream when it's finished. However, if you run backward in a non-default stream context, I'm not sure if it also syncs with the ambient non-default stream instead of/in addition to the default stream. I'd have to look at engine.cpp again.
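For reference, a minimal sketch of the "backward in a non-default stream context" pattern being discussed; the variable names and sizes here are mine, purely for illustration:

```python
import torch

side = torch.cuda.Stream()
x = torch.randn(1024, device="cuda", requires_grad=True)

with torch.cuda.stream(side):
    loss = (x * 2).sum()
    loss.backward()  # runs the autograd engine while `side` is the ambient stream

# The open question above: is `side` (or only the default stream) guaranteed
# to be synchronized with the engine's internal streams before the gradients
# produced by backward() are safe to consume here?
grad_view = x.grad  # potentially racy without an explicit sync
```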
We're still seeing this issue with our ROCm CI. Do we put a sync into the test to work around the issue, or fix the engine?
I would like to work towards a resolution on this issue. The test continues to be flaky on ROCm, and I'd rather fix it than skip it. Is this where the fix should be?

pytorch/torch/csrc/autograd/engine.cpp, lines 502 to 513 (at 671ee71)
Summary: Otherwise, this test will appear flaky for ROCm even though it is a generic PyTorch issue.

CC albanD

Pull Request resolved: #48405
Reviewed By: mrshenli
Differential Revision: D25183473
Pulled By: ngimel
fbshipit-source-id: 0fa19b5497a713cc6c5d251598e57cc7068604be
Sorry about the delay; this is a pretty tricky issue and I had to spend some time reading code. The relevant comment for the existing stream synchronization logic in the autograd engine is: […]
The leaf streams in this situation, however, are the streams associated with the gradient accumulators, which should all be associated with […]. @mcarilli, would you agree with this analysis?
Looking over old issues I forgot I was assigned, I saw this one and realized #57833 probably fixes it. One of the patterns #57833 wants to fix (the snippet under "Because of the inconsistency, in some cases it's hard to be safe:" in the original submission, #54227) looks like it matches the case that was breaking in ROCm CI (https://github.com/pytorch/pytorch/pull/45787/files#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R1772-R1773).
The new test written for #45787 suggests a possible failure scenario, which indeed occurs. It is a race condition, most often encountered by ROCm CI.
https://github.com/pytorch/pytorch/pull/45787/files#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R1772-R1773
Putting torch.cuda.synchronize() right after backward() fixes the problem.
Originally posted by @mcarilli in #45787 (comment)
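For context, a minimal sketch of the workaround described above. This is not the exact test from #45787; the tensor sizes and variable names are made up for illustration:

```python
import torch

x = torch.randn(10_000, device="cuda", requires_grad=True)
side = torch.cuda.Stream()

with torch.cuda.stream(side):
    loss = (x * x).sum()
    loss.backward()  # gradient kernels may still be in flight when this returns

# Workaround: synchronize right after backward() so the host (and any other
# stream) does not race with the gradient-producing kernels.
torch.cuda.synchronize()

assert x.grad is not None
print(x.grad.abs().max().item())
```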
cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved