
Add nondeterministic error for scatter #88244

Closed

Conversation

@kurtamohler (Collaborator) commented Nov 1, 2022

Fixes #88096

cc @mruberry

@pytorch-bot bot commented Nov 1, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88244

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c90f382:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@@ -500,6 +500,7 @@ def use_deterministic_algorithms(mode, *, warn_only=False):
``mode='max'``
* :func:`torch.Tensor.put_` when ``accumulate=False``
* :func:`torch.Tensor.put_` when ``accumulate=True`` and called on a CUDA tensor
+ * :func:`torch.Tensor.scatter` when ``src`` is a tensor and ``reduce=None``
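
For context: once this entry applies, a scatter whose ``src`` is a tensor and whose ``reduce`` is None raises the nondeterministic-alert error quoted later in this thread when deterministic algorithms are enabled. A minimal sketch (the tensor values are illustrative, not taken from the PR):

import torch

torch.use_deterministic_algorithms(True)

# With this change, the following scatter (src is a tensor, reduce is None)
# raises a RuntimeError saying the op has no deterministic implementation.
torch.scatter(torch.zeros(4), 0, torch.tensor([0, 0]), torch.tensor([1., 2.]))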
Collaborator:

What's the nondeterministic behavior?

Contributor:

Actually, this is a good point; the original bug report only reports a CPU/CUDA mismatch, not that either device is internally nondeterministic.

Collaborator Author:

If duplicate indices are given, each with different corresponding values, it's unclear which value should be chosen.

If each device actually chooses the value deterministically, then I suppose we shouldn't add the nondeterministic alert. A minimal sketch of the ambiguity is below (the values are arbitrary, chosen only for illustration): every write targets position 0, so the surviving value depends on which write lands last.
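
import torch

self_t = torch.zeros(3)
index = torch.tensor([0, 0, 0])   # all three writes collide at position 0
src = torch.tensor([1., 2., 3.])  # three different candidate values

# With reduce=None there is no rule for which of 1., 2., 3. should win;
# position 0 ends up holding whichever value happens to be written last.
print(torch.scatter(self_t, 0, index, src))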

Collaborator:

That makes sense -- but do we know which it is?

Collaborator Author:

I'm not sure; I'm looking into it.

Collaborator Author:

It turns out that it is nondeterministic, at least on CUDA:

import torch

# Make factory functions create CUDA tensors so scatter runs on the GPU.
torch.set_default_tensor_type('torch.cuda.FloatTensor')
torch.use_deterministic_algorithms(True)

for n in range(1, 256):
    # Every index points at position 0, so all n writes collide there.
    indices = torch.zeros(n).long()
    results = []

    for _ in range(10):
        result = torch.scatter(
            torch.ones(n).long(),
            0,
            indices,
            torch.arange(n))[0].item()

        results.append(result)

    # Report any n for which repeated runs disagree on the surviving value.
    if (torch.tensor(results) != results[0]).any():
        print(f'with n = {n}:')
        print(results)

Output:

with n = 65:
[64, 16, 16, 16, 16, 16, 16, 16, 16, 16]
with n = 73:
[64, 16, 16, 16, 16, 16, 16, 16, 16, 16]
with n = 97:
[96, 80, 80, 80, 80, 80, 80, 80, 80, 80]
...
with n = 223:
[208, 144, 208, 144, 208, 144, 144, 144, 144, 144]
with n = 226:
[208, 208, 224, 208, 224, 224, 224, 208, 224, 224]
with n = 228:
[208, 224, 224, 208, 208, 208, 224, 224, 224, 224]

This seems to be because TensorIterator breaks the operation up into multiple parallelized sub-operations once the input size is greater than 64 in this case. If two sub-operations write to the same location, the write order isn't guaranteed.

Looking at the code, it seems possible that the CPU path has the same issue, since it uses TensorIterator in a similar way. As far as I can tell, TensorIterator on CPU doesn't enforce any specific order of writes to the same location from different sub-operations, but I could be wrong (I don't know how that could be implemented without a significant performance cost). However, the CPU implementation is apparently much more regular than the CUDA one, because I'm having trouble actually exercising the possible nondeterminism there.

(I initially suspected that, as the input gets larger, the grain_size given to TensorIterator::for_each shrinks down to 1, so that each individual element gets its own thread, which seemed like an accident since the threads end up executing about as quickly as they are created. After double-checking, though, my understanding of how TensorIterator::for_each splits work across threads turned out to be wrong, so never mind that theory.)
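
To illustrate the failure mode described above, here is a toy Python analogue of unordered parallel writes to one location. It is not TensorIterator code and the function name run_once is made up for this sketch; it only shows why "last writer wins" becomes nondeterministic once ordering is not guaranteed.

import random
import threading
import time

def run_once(num_workers=8):
    slot = [None]

    def worker(value):
        time.sleep(random.random() * 1e-3)  # jitter to mimic unordered scheduling
        slot[0] = value                     # last writer wins

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return slot[0]

# Repeated runs can report different surviving values, just like the scatter
# results above vary from run to run.
print({run_once() for _ in range(100)})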

Contributor:

I think this is good enough; we should mark it nondeterministic just to be safe.

@mruberry (Collaborator) left a comment:

Cool!

Maybe we should elaborate slightly on what is nondeterministic about these operations in the future.

@kurtamohler (Collaborator Author):
@pytorchbot merge

pytorch-bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Nov 4, 2022.
@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Nov 5, 2022
@mehtanirav (Contributor):
@kurtamohler Unfortunately, we need to revert this PR. It is causing multiple internal test failures like

RuntimeError: scatter with src tensor and reduce=None does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this operation, or you can use the 'warn_only=True' option, if that's acceptable for your application. You can also file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation.
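
For reference, the error text above points at two escape hatches; a minimal sketch of each follows (the tensors are illustrative, not taken from the failing tests):

import torch

# Option 1: downgrade the error to a warning globally.
torch.use_deterministic_algorithms(True, warn_only=True)

# Option 2: turn determinism off just around the offending scatter call.
torch.use_deterministic_algorithms(False)
out = torch.scatter(torch.zeros(4), 0, torch.tensor([0, 0]), torch.tensor([1., 2.]))
torch.use_deterministic_algorithms(True)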

@mehtanirav (Contributor):
@pytorchbot revert -m "Internal test failures" -c ghfirst

@pytorchmergebot (Collaborator):
@pytorchbot successfully started a revert job. Check the current status here.

@pytorchmergebot (Collaborator):
@kurtamohler your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Nov 10, 2022
This reverts commit e940a2f.

Reverted #88244 on behalf of https://github.com/mehtanirav due to Internal test failures
@kurtamohler (Collaborator Author):
@mehtanirav, can you give more info about the failing tests? Why are they using torch.use_deterministic_algorithms(True), and could they be modified to avoid using scatter if they really need to be deterministic?

@mehtanirav (Contributor):
@kurtamohler Let me follow up with the test owners and get back to you.

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Successfully merging this pull request may close these issues:

Add nondeterministic alert to torch.Tensor.scatter()