
Add nondeterministic error for scatter #88244

Closed

Conversation

@kurtamohler (Collaborator) commented Nov 1, 2022

Fixes #88096

cc @mruberry

@pytorch-bot bot commented Nov 1, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88244

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c90f382:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@@ -500,6 +500,7 @@ def use_deterministic_algorithms(mode, *, warn_only=False):
``mode='max'``
* :func:`torch.Tensor.put_` when ``accumulate=False``
* :func:`torch.Tensor.put_` when ``accumulate=True`` and called on a CUDA tensor
+ * :func:`torch.Tensor.scatter` when ``src`` is a tensor and ``reduce=None``
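
For context: once this entry applies, a scatter whose ``src`` is a tensor and whose ``reduce`` is None raises the nondeterministic-alert error quoted later in this thread when deterministic algorithms are enabled. A minimal sketch (the tensor values are illustrative, not taken from the PR):

import torch

torch.use_deterministic_algorithms(True)

# With this change, the following scatter (src is a tensor, reduce is None)
# raises a RuntimeError saying the op has no deterministic implementation.
torch.scatter(torch.zeros(4), 0, torch.tensor([0, 0]), torch.tensor([1., 2.]))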
Collaborator:

What's the nondeterministic behavior?

Contributor:

Actually, this is a good point; the original bug report only reports a CPU/CUDA mismatch, not that either device is internally nondeterministic.

Collaborator Author:

If duplicate indices are given, each with different corresponding values, it's unclear which value should be chosen.

If each device actually chooses the value deterministically, then I suppose we shouldn't add the nondeterministic alert. A minimal sketch of the ambiguity is below (the values are arbitrary, chosen only for illustration): every write targets position 0, so the surviving value depends on which write lands last.
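
import torch

self_t = torch.zeros(3)
index = torch.tensor([0, 0, 0])   # all three writes collide at position 0
src = torch.tensor([1., 2., 3.])  # three different candidate values

# With reduce=None there is no rule for which of 1., 2., 3. should win;
# position 0 ends up holding whichever value happens to be written last.
print(torch.scatter(self_t, 0, index, src))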

Collaborator:

That makes sense -- but do we know which it is?

Collaborator Author:

I'm not sure; I'm looking into it.

Collaborator Author:

It turns out that it is nondeterministic, at least on CUDA:

import torch

# Make factory functions create CUDA tensors so scatter runs on the GPU.
torch.set_default_tensor_type('torch.cuda.FloatTensor')
torch.use_deterministic_algorithms(True)

for n in range(1, 256):
    # Every index points at position 0, so all n writes collide there.
    indices = torch.zeros(n).long()
    results = []

    for _ in range(10):
        result = torch.scatter(
            torch.ones(n).long(),
            0,
            indices,
            torch.arange(n))[0].item()

        results.append(result)

    # Report any n for which repeated runs disagree on the surviving value.
    if (torch.tensor(results) != results[0]).any():
        print(f'with n = {n}:')
        print(results)

Output:

with n = 65:
[64, 16, 16, 16, 16, 16, 16, 16, 16, 16]
with n = 73:
[64, 16, 16, 16, 16, 16, 16, 16, 16, 16]
with n = 97:
[96, 80, 80, 80, 80, 80, 80, 80, 80, 80]
...
with n = 223:
[208, 144, 208, 144, 208, 144, 144, 144, 144, 144]
with n = 226:
[208, 208, 224, 208, 224, 224, 224, 208, 224, 224]
with n = 228:
[208, 224, 224, 208, 208, 208, 224, 224, 224, 224]

This seems to be because TensorIterator breaks the operation up into multiple parallelized sub-operations once the input size is greater than 64 in this case. If two sub-operations write to the same location, the write order isn't guaranteed.

Looking at the code, it seems possible that the CPU path has the same issue, since it uses TensorIterator in a similar way. As far as I can tell, TensorIterator on CPU doesn't enforce any specific order of writes to the same location from different sub-operations, but I could be wrong (I don't know how that could be implemented without a significant performance cost). However, the CPU implementation is apparently much more regular than the CUDA one, because I'm having trouble actually exercising the possible nondeterminism there.

(I initially suspected that, as the input gets larger, the grain_size given to TensorIterator::for_each shrinks down to 1, so that each individual element gets its own thread, which seemed like an accident since the threads end up executing about as quickly as they are created. After double-checking, though, my understanding of how TensorIterator::for_each splits work across threads turned out to be wrong, so never mind that theory.)
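
To illustrate the failure mode described above, here is a toy Python analogue of unordered parallel writes to one location. It is not TensorIterator code and the function name run_once is made up for this sketch; it only shows why "last writer wins" becomes nondeterministic once ordering is not guaranteed.

import random
import threading
import time

def run_once(num_workers=8):
    slot = [None]

    def worker(value):
        time.sleep(random.random() * 1e-3)  # jitter to mimic unordered scheduling
        slot[0] = value                     # last writer wins

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return slot[0]

# Repeated runs can report different surviving values, just like the scatter
# results above vary from run to run.
print({run_once() for _ in range(100)})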

Contributor:

I think this is good enough; we should mark it nondeterministic just to be safe.

@mruberry (Collaborator) left a comment:

Cool!

Maybe we should elaborate slightly on what is nondeterministic about these operations in the future.

@kurtamohler (Collaborator Author):
@pytorchbot merge

pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Nov 4, 2022.
@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Nov 5, 2022
@mehtanirav (Contributor):
@kurtamohler Unfortunately, we need to revert this PR. It is causing multiple internal test failures like

RuntimeError: scatter with src tensor and reduce=None does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.
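
For reference, the error text above points at two escape hatches; a minimal sketch of each follows (the tensors are illustrative, not taken from the failing tests):

import torch

# Option 1: downgrade the error to a warning globally.
torch.use_deterministic_algorithms(True, warn_only=True)

# Option 2: turn determinism off just around the offending scatter call.
torch.use_deterministic_algorithms(False)
out = torch.scatter(torch.zeros(4), 0, torch.tensor([0, 0]), torch.tensor([1., 2.]))
torch.use_deterministic_algorithms(True)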

@mehtanirav (Contributor):
@pytorchbot revert -m "Internal test failures" -c ghfirst

@pytorchmergebot (Collaborator):
@pytorchbot successfully started a revert job. Check the current status here.

@pytorchmergebot (Collaborator):
@kurtamohler your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Nov 10, 2022
This reverts commit e940a2f.

Reverted #88244 on behalf of https://github.com/mehtanirav due to Internal test failures
@kurtamohler (Collaborator Author):
@mehtanirav, can you give more info about the failing tests? Why are they using torch.use_deterministic_algorithms(True), and could they be modified to avoid using scatter if they really need to be deterministic?

@mehtanirav (Contributor):
@kurtamohler Let me follow up with the test owners and get back to you.

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Successfully merging this pull request may close these issues:

Add nondeterministic alert to torch.Tensor.scatter()