Add nondeterministic error for scatter
#88244
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88244
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit c90f382.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@@ -500,6 +500,7 @@ def use_deterministic_algorithms(mode, *, warn_only=False):
       ``mode='max'``
     * :func:`torch.Tensor.put_` when ``accumulate=False``
     * :func:`torch.Tensor.put_` when ``accumulate=True`` and called on a CUDA tensor
+    * :func:`torch.Tensor.scatter` when ``src`` is a tensor and ``reduce=None``
What's the nondeterministic behavior?
Actually, this is a good point; the original bug report only mentions the CPU/CUDA mismatch, not that either device is internally nondeterministic.
If duplicate indices are given, each with a different corresponding value, it's unclear which value should be chosen (see the sketch below).
If each device actually chooses the value deterministically, then I suppose we shouldn't have the nondeterministic alert.
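For illustration, here is a minimal example of the ambiguity (my own sketch, not taken from the bug report):

import torch

# Both entries of `index` target output position 0, so two different
# values compete for the same output slot.
out = torch.zeros(3)
index = torch.tensor([0, 0])
src = torch.tensor([1.0, 2.0])

# Nothing in the scatter contract says which write wins, so out[0]
# could legitimately end up as 1.0 or 2.0.
print(out.scatter(0, index, src))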
That makes sense -- but do we know which it is?
I'm not sure, looking into it
Turns out it is nondeterministic at least on CUDA:
import torch

torch.set_default_tensor_type('torch.cuda.FloatTensor')
torch.use_deterministic_algorithms(True)

for n in range(1, 256):
    # Every element of the source is scattered to index 0, so the final
    # value at position 0 depends on which write happens last.
    indices = torch.zeros(n).long()
    results = []
    for _ in range(10):
        result = torch.scatter(
            torch.ones(n).long(),
            0,
            indices,
            torch.arange(n))[0].item()
        results.append(result)
    if (torch.tensor(results) != results[0]).any():
        print(f'with n = {n}:')
        print(results)
Output:
with n = 65:
[64, 16, 16, 16, 16, 16, 16, 16, 16, 16]
with n = 73:
[64, 16, 16, 16, 16, 16, 16, 16, 16, 16]
with n = 97:
[96, 80, 80, 80, 80, 80, 80, 80, 80, 80]
...
with n = 223:
[208, 144, 208, 144, 208, 144, 144, 144, 144, 144]
with n = 226:
[208, 208, 224, 208, 224, 224, 224, 208, 224, 224]
with n = 228:
[208, 224, 224, 208, 208, 208, 224, 224, 224, 224]
This seems to be because TensorIterator breaks up the operation into multiple parallelized sub-operations if the input size is greater than 64, in this case. If two sub-operations write to the same location, the order isn't guaranteed.

Looking at the code, it seems possible that CPU has this same issue, because it seems to use TensorIterator in a similar way. At least, I don't think TensorIterator on CPU has any measures to enforce a specific order of writes to the same location from different sub-operations, but I could be wrong (I don't know how something like that would be implemented without significant performance costs). However, the CPU impl is apparently much more regular than the CUDA impl, because I'm having trouble exercising the possible nondeterminism for it.

(Actually, I had an idea why the CPU impl behaves this way: I thought that as the input gets larger, the grain_size given to TensorIterator::for_each was being shrunk down to 1, meaning that each individual element gets its own thread, which seemed like an accident because the threads end up being executed as quickly as they are created. Nevermind: after double-checking, my understanding of how TensorIterator::for_each breaks things up into different threads was wrong.)
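To make the failure mode concrete, here is a toy sketch of the race, with plain Python threads standing in for TensorIterator's parallel sub-operations (the chunking and grain size are made up for illustration; this is not the real implementation):

import torch
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for TensorIterator's chunking: the scatter is split into
# fixed-size chunks, and each chunk runs on its own worker thread.
def scatter_chunk(out, index, src, start, stop):
    for i in range(start, stop):
        out[index[i]] = src[i]

n, grain = 8, 2
index = torch.zeros(n, dtype=torch.long)   # every element targets out[0]
src = torch.arange(n)
out = torch.ones(n, dtype=torch.long)

with ThreadPoolExecutor(max_workers=4) as pool:
    for start in range(0, n, grain):
        pool.submit(scatter_chunk, out, index, src, start, start + grain)

# out[0] holds whatever the last-scheduled chunk wrote (1, 3, 5, or 7
# here); the thread schedule, not the data, decides the winner.
print(out[0].item())

The same last-writer-wins structure would explain why the CUDA results above flip between a handful of chunk-boundary values rather than varying arbitrarily.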
I think this is good enough, and we should mark it nondeterministic just to be safe.
Cool!
Maybe we should elaborate slightly on what is nondeterministic about these operations in the future.
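For reference, after this change, code like the following should raise an error when deterministic mode is enabled (a sketch of the intended behavior; the exact message text isn't quoted in this thread):

import torch

torch.use_deterministic_algorithms(True)

out = torch.zeros(3, device='cuda')
index = torch.tensor([0, 0], device='cuda')
src = torch.tensor([1.0, 2.0], device='cuda')

try:
    # With this PR, scatter with a tensor `src` and reduce=None trips the
    # nondeterministic alert instead of silently picking a winner.
    out.scatter(0, index, src)
except RuntimeError as e:
    print(e)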
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes pytorch#88096 Pull Request resolved: pytorch#88244 Approved by: https://github.com/ezyang, https://github.com/mruberry
@kurtamohler Unfortunately, we need to revert this PR. It is causing multiple internal test failures.
@pytorchbot revert -m "Internal test failures" -c ghfirst
@pytorchbot successfully started a revert job. Check the current status here.
@kurtamohler your PR has been successfully reverted.
This reverts commit e940a2f. Reverted #88244 on behalf of https://github.com/mehtanirav due to Internal test failures
@mehtanirav, can you give more info about the failing tests? Why are they using `torch.use_deterministic_algorithms(True)`?
@kurtamohler Let me follow up with the test owners and get back to you.
Fixes #88096
cc @mruberry