
Various functions exhibit nondeterministic behavior #82004

Closed
ttanpcs opened this issue Jul 22, 2022 · 12 comments
Labels
high priority · module: determinism · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@ttanpcs
Contributor

ttanpcs commented Jul 22, 2022

🐛 Describe the bug

When testing the nondeterministic_seeded and nondeterministic_bitwise tags on operators, the following operators fail:
Nondeterministic_seeded:

  • aten.empty.memory_format
  • aten.empty_like.default
  • aten.new_empty.default
  • **aten._ctc_loss.default**
  • **aten.resize_.default**
  • **prims.uniform.default**
  • aten.empty_strided.default
  • aten.empty.SymInt

Nondeterministic_bitwise:

  • **aten.linalg_lstsq.default**

(I've bolded the ops that are interesting.) The 'empty' ops fail because they return uninitialized data.
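
For illustration, a minimal sketch of why uninitialized data makes the empty ops fail the comparison (nothing here is guaranteed; the garbage values could coincide by chance):

import torch

# torch.empty returns uninitialized memory, so two calls with the same
# shape are not guaranteed to contain the same values
a = torch.empty(3)
b = torch.empty(3)
print(torch.equal(a, b))  # may be False; the contents are arbitrary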

The following code was added to test_ops.py, and test_tags was run:

import torch
from torch.testing._internal.common_utils import TestCase
from torch.utils._python_dispatch import TorchDispatchMode

def test_nondeterministic_seeded(func, args, kwargs):
    if func is None or torch.Tag.inplace_view in func.tags:
        return
    kwargs = kwargs or {}
    # Run the op twice on identical inputs and compare the two results.
    results = [func(*args, **kwargs) for _ in range(2)]
    # A mismatch within default tolerances means the op must carry the
    # nondeterministic_seeded tag.
    try:
        TestCase().assertEqual(results[0], results[1])
    except AssertionError:
        assert torch.Tag.nondeterministic_seeded in func.tags, \
            f'{func} should be nondeterministic_seeded'
    # A mismatch at zero tolerance means the op must carry at least the
    # nondeterministic_bitwise tag (or nondeterministic_seeded).
    try:
        TestCase().assertEqual(results[0], results[1], atol=0, rtol=0)
    except AssertionError:
        has_nondeterminism_tag = (
            torch.Tag.nondeterministic_bitwise in func.tags
            or torch.Tag.nondeterministic_seeded in func.tags
        )
        assert has_nondeterminism_tag, f'{func} should be nondeterministic_bitwise'

...

class TestTagsMode(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        test_nondeterministic_seeded(func, args, kwargs)
        # Re-dispatch so callers still receive the op's real result.
        return func(*args, **(kwargs or {}))
...
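
For context, a hypothetical way to exercise the mode (not part of the original snippet): every ATen op dispatched inside the context manager is intercepted and checked against its tags.

# hypothetical usage sketch
with TestTagsMode():
    torch.ops.aten.add.Tensor(torch.ones(2), torch.ones(2))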

Versions

N/A

cc @ezyang @gchanan @zou3519 @mruberry @kurtamohler

@ezyang ezyang added the triaged and module: determinism labels on Jul 24, 2022
@ezyang
Contributor

ezyang commented Jul 24, 2022

Good call. When deterministic mode is on, we should fill empty tensors with deterministic garbage to make them deterministic.

@lezcano
Collaborator

lezcano commented Jul 25, 2022

Mandatory meme:
[meme image]

@lezcano
Collaborator

lezcano commented Jul 25, 2022

On a more serious note, lstsq is known to have a non-deterministic LAPACK implementation. There should be a few open issues where we have discussed this. There's not much that we can do on that front other than throwing an error.

@albanD
Collaborator

albanD commented May 15, 2023

From offline discussion: we should fill the values of the empty* functions with NaN in deterministic mode, to make sure we don't "hide" any bugs by filling them with OK-looking values.
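
A sketch of the intended user-visible behavior (assuming a build where the empty* fill has landed; see #101849 below):

import torch

torch.use_deterministic_algorithms(True)
t = torch.empty(3)
# with the deterministic fill in place, t is all NaN rather than garbage
print(t)  # tensor([nan, nan, nan])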

@kurtamohler
Collaborator

I can make that change

@kurtamohler
Collaborator

And I assume for integer types we should just fill with zeros?

@lezcano
Collaborator

lezcano commented May 15, 2023

more like MAX_INT, no?

@albanD
Collaborator

albanD commented May 15, 2023

I think some not-OK-looking value would be best, yes.
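
For reference, the conspicuous integer fill value being discussed can be read off per dtype with torch.iinfo:

import torch

# the maximum representable value for each integer dtype: an
# intentionally "not-OK-looking" fill value
print(torch.iinfo(torch.int32).max)  # 2147483647
print(torch.iinfo(torch.int64).max)  # 9223372036854775807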

@kurtamohler
Collaborator

kurtamohler commented May 17, 2023

I'm having trouble finding where at::empty_symint is implemented. git grep -w empty_symint only shows call sites. I see that it shows up in build/aten/src/ATen/ops/empty.h, but I'm not sure what build/aten/src/ATen/ops is or what generates it. Where can I find its implementation so I can add the deterministic fill to it?

EDIT: Never mind; I used gdb to find out that at::empty_symint just ends up calling at::native::empty_cpu.

@albanD
Collaborator

albanD commented May 22, 2023

at::empty_symint is indeed code-generated, like all other ATen native functions. You can find them at build/aten/src/ATen/ops/{op_name}.h in your local build (you can see there the magic behind the regular and _symint versions of these functions!). The implementation of ::call is in the Operators_{i}.cpp files next to it, and you can see that it calls the op via the dispatcher.
So in your case, I guess you passed CPU Tensors and thus called into the native CPU impl.
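
As a quick way to see that dispatcher path from Python (a sketch; this calls the op through the dispatcher rather than inspecting the generated C++):

import torch

# calls aten::empty.memory_format through the dispatcher; for CPU
# tensors this ends up in at::native::empty_cpu
t = torch.ops.aten.empty.memory_format([2, 3], dtype=torch.float32)
print(t.shape)  # torch.Size([2, 3])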

@kurtamohler
Collaborator

kurtamohler commented Jun 22, 2023

After #101849, the remaining functions from the issue description still need to be addressed:

  • aten._ctc_loss.default
  • aten.resize_.default
  • prims.uniform.default
  • aten.linalg_lstsq.default

For resize_, I imagine we'd want to do almost the same thing that empty does, but we'd only want to fill the new elements with NaN/MAX_INT, keeping the old elements untouched. I will submit a PR for this. (EDIT: fixed in #104300)
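
A sketch of the intended resize_ behavior (assuming a build that includes the fix):

import torch

torch.use_deterministic_algorithms(True)
t = torch.ones(2)
t.resize_(4)
# expected: t[:2] stays 1.0, t[2:] is filled with NaN
print(t)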

For linalg.lstsq, discussion is tracked in #71222.

For ctc_loss, discussion is tracked in #17798.

If I understand correctly, prims.uniform just calls into Tensor.uniform_. I would like to see a reproducer of the nondeterministic behavior for this one; it seems that it should be deterministic if the random seed is always initialized to the same value, but perhaps that's not true for some reason.
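
For what it's worth, a minimal determinism check for the seeded path (a sketch; it passes only if uniform_ is deterministic under a fixed seed):

import torch

torch.manual_seed(0)
a = torch.empty(4).uniform_()
torch.manual_seed(0)
b = torch.empty(4).uniform_()
print(torch.equal(a, b))  # True if uniform_ is deterministic given the seed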

pytorchmergebot pushed a commit that referenced this issue Jul 7, 2023
New elements added to a tensor by `torch.Tensor.resize_` are set to NaN/MAX_INT when deterministic mode is turned on.

When `torch.Tensor.resize_` is called on a quantized tensor and deterministic mode is turned on, an error alerting nondeterminism is raised.

Part of #82004

Pull Request resolved: #104300
Approved by: https://github.com/albanD
pytorchmergebot pushed a commit that referenced this issue Jul 13, 2023
…int (#104995)

Relands #101849 after #104302 reverted it.

torchrec PR pytorch/torchrec#1269 fixes the torchrec failure that caused #101849 to be reverted

Part of #82004

Pull Request resolved: #104995
Approved by: https://github.com/albanD
@kurtamohler
Collaborator

The only thing left for this issue is the nondeterminism of Tensor.uniform_, which I haven't been able to reproduce. I tried running the modification to test_ops.py mentioned in the issue summary, and test_tags_uniform_cpu_float32 passes, which is the only test_tags test that seems to be related to uniform. So I think it's probably safe to close this issue now. If Tensor.uniform_ shows any nondeterminism in the future, either this issue can be reopened or a new issue can be created.
