Bits types cannot be used under deterministic mode #109802

Open · ngimel opened this issue Sep 21, 2023 · 18 comments
Labels: module: determinism, triaged

ngimel (Collaborator) commented Sep 21, 2023

#104995 made it impossible to use bits types under deterministic mode:

import torch
torch.set_deterministic_debug_mode("warn")
x = torch.empty(4, dtype=torch.bits8)

produces RuntimeError: "fill_empty_deterministic_" not implemented for 'Bits8'
While I'm sympathetic to the goals of #104995, I think a blanket approach that forces filling of all empty tensors is too strict. In many cases, determinism mode is set in production runs (because, if only deterministic ops are used, it imposes very little overhead and provides nice guarantees), and with valid code the empty calls are not a problem. The approach taken in #104995 requires paying the filling penalty unconditionally, and it also requires implementing a fill op for all bits types, which PyTorch currently doesn't do.
Deterministic empty tensors are a valuable option to have, but in my opinion they should not be mixed with the existing deterministic mode.

cc @mruberry @kurtamohler

mikaylagawarecki added the module: determinism and triaged labels Sep 21, 2023
mikaylagawarecki (Contributor) commented

cc @kurtamohler @albanD

vadimkantorov (Contributor) commented

One way could be introducing some special arg torch.empty(..., swear_that_i_fill_it_later_deterministically=True) :)

kurtamohler self-assigned this Sep 23, 2023
kurtamohler (Collaborator) commented Sep 28, 2023

We could separate the empty fill feature into its own global setting, so that users can keep it disabled to get better performance.

We could add torch.enable_fill_empty(mode: bool) and torch.is_fill_empty_enabled() -> bool.

It seems like this feature could potentially be useful for troubleshooting even if determinism is not enabled. If we do this, we should probably mention it on the Reproducibility page.

However, I wonder whether fill empty mode should be turned on by default when the user turns on deterministic mode, just to be safe.
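
For concreteness, a minimal sketch of how the proposed toggle might look in user code. The enable_fill_empty/is_fill_empty_enabled names come from the proposal above and are not an existing PyTorch API:

import torch

torch.use_deterministic_algorithms(True)  # existing deterministic mode
torch.enable_fill_empty(False)            # proposed toggle (hypothetical)
assert not torch.is_fill_empty_enabled()  # proposed query (hypothetical)

# With the toggle off, empty would return uninitialized memory even in
# deterministic mode, avoiding the unconditional fill penalty.
x = torch.empty(4)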

albanD (Collaborator) commented Sep 28, 2023

> I wonder whether fill empty mode should be turned on by default when the user turns on deterministic mode,

I'm not sure I follow what you mean. It already is.

We just need to implement fill_empty_deterministic_ for the bits* dtypes. And filling them with 0s sounds like a good default for bits dtypes.

kurtamohler (Collaborator) commented Sep 28, 2023

> > I wonder whether fill empty mode should be turned on by default when the user turns on deterministic mode,
>
> I'm not sure I follow what you mean. It already is.

Yes, it is currently, but I was talking about what potentially should happen if we decided to separate the fill empty feature into its own global setting.

There are really two separate issues that Natalia brought up:

  • Fill empty does not support bits dtypes
  • Fill empty degrades performance, and it would be nice if the user could turn it off if they know that they are not using empty data as inputs to operations

I was only addressing the second issue in my earlier comment.

ngimel (Collaborator, Author) commented Sep 29, 2023

> I'm not sure I follow what you mean. It already is.

I agree with @kurtamohler: it already is, but I argue it shouldn't be. Non-deterministic empty is not a problem for valid programs that nonetheless want to use deterministic mode, so there's no point in paying this overhead.

albanD (Collaborator) commented Sep 29, 2023

I would still expect that, at least by default, when in deterministic mode, the function always returns the same thing. This is useful, for example, for testing :D

What about:

  • We add a new torch.utils.deterministic namespace
  • We add a torch.utils.deterministic.fill_uninitialized_memory that controls this behavior
  • Have fill_uninitialized_memory=True by default (see the sketch below)
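
A sketch of how the proposed flag might be used, assuming the names above (at the time of this comment the torch.utils.deterministic namespace did not yet exist, and the exact mechanism was still open):

import torch

torch.use_deterministic_algorithms(True)

# Proposed: opt out of the fill-on-empty overhead while keeping
# deterministic algorithms enabled (hypothetical at this point in the thread):
torch.utils.deterministic.fill_uninitialized_memory = False

x = torch.empty(4)  # uninitialized memory, no fill penalty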

albanD (Collaborator) commented Sep 29, 2023

From offline discussion with Natalia, that sounds good. @kurtamohler will you have a bit of time to add that please?

kurtamohler (Collaborator) commented

Sounds good! Yep, I can add that.

kurtamohler (Collaborator) commented

I am almost ready to submit a PR that adds torch.utils.deterministic.fill_uninitialized_memory. But I'm trying to decide how to write a test for the fill_uninitialized_memory == False case. For context, here is the test_deterministic_empty test for the existing fill_uninitialized_memory == True case: link

One idea is to just assert that the empty tensor is not full of NaN/max_int when fill_uninitialized_memory == False. For instance, self.assertFalse(res.isnan().all()). But the problem is that there is a possibility that the allocated buffer just happened to be full of NaN/max_int already.
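
For concreteness, a minimal version of that flaky check might look like this, assuming the new flag (this is the approach being argued against):

import torch

torch.use_deterministic_algorithms(True)
torch.utils.deterministic.fill_uninitialized_memory = False

res = torch.empty(10)
# Fails only if the reused allocation happened to already be all-NaN --
# unlikely, but not impossible, so the test is inherently flaky:
assert not res.isnan().all()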

I modified test_deterministic_empty to test the fill_uninitialized_memory == False case using the above assert. If I run it in a loop, the assert fails in about one out of every handful of iterations. But I found that if I zero out all of the tensors that get filled with NaN/max_int during the test before they go out of scope, then the assert does not fail over a large number of iterations. I ran test_torch.py -k fill_uninitialized_memory 140 times in a loop with no failure. I also tried adding a 10,000-iteration loop around all of the code in test_deterministic_empty, and I ran test_torch.py -k fill_uninitialized_memory 10 times without seeing any failures. However, there is still a nonzero probability that any one of these asserts can fail.

If uninitialized memory were completely random, then the probability that an int8 tensor of size 10 fails the assert would be (1 / 2**8) ** 10 = 2**(-80), which is vanishingly small. But uninitialized memory is not completely random, so the real probability of failure is unknown.

So I don't think that self.assertFalse(res.isnan().all()) is a great way to test this.

I think it would be better to somehow check if at::native::fill_empty_deterministic_ was called. I'm considering adding a toggleable warning to at::native::fill_empty_deterministic_, which is turned off by default, and cannot be toggled in the public API. The tests can turn on that warning and then assert that it was not emitted in the fill_uninitialized_memory == False case, confirming that at::native::fill_empty_deterministic_ is not called. And we can also assert that the warning does get emitted in the fill_uninitialized_memory == True case. Is that a fair solution?

albanD (Collaborator) commented Oct 9, 2023

You're calling at::fill_empty_deterministic_() right? If so, you can use something like the LoggingMode to detect if this function is being called or not.

kurtamohler (Collaborator) commented

What is LoggingMode? I searched the repo for it and didn't get any results. Are you talking about c10::LogAPIUsage?

albanD (Collaborator) commented Oct 9, 2023

Sorry, LoggingTensorMode, here is how it is used:

def test_torch_dispatch_mode_basic(self) -> None:
    with capture_logs(is_mode=True) as logs:
        with LoggingTensorMode():
            torch.empty([])
    self.assertExpectedInline('\n'.join(logs), """\
$0: f32[] = torch._ops.aten.empty.memory_format([], device=device(type='cpu'), pin_memory=False)""")

kurtamohler (Collaborator) commented

That seems like a good idea. I'm not 100% sure how to get it working though.

It seems like I'll need to move fill_empty_deterministic_ from at::native to at and add it to native_functions.yaml. That part is simple enough.

However, I'm not sure how to make the at::fill_empty_deterministic_ call show up in the logs captured by LoggingTensorMode. I tried adding a call to an already existing at operator in at::native::empty_cpu, and it doesn't show up in the logs captured by LoggingTensorMode. I just added result = at::resize(result, result.sizes()) to the deterministic case, and I ran this:

import torch

def get_logs(fn):
    from torch.testing._internal.logging_tensor import LoggingTensorMode, capture_logs
    with capture_logs(is_mode=True) as logs:
        with LoggingTensorMode():
            fn()
    return logs

torch.use_deterministic_algorithms(True)
print(get_logs(lambda: torch.empty(10)))

I confirmed in gdb that at::native::_resize does get called when I expected, but the log capture script just prints this:

["$0: f32[10] = torch._ops.aten.empty.memory_format(['10'], device=device(type='cpu'), pin_memory=False)"]

albanD (Collaborator) commented Oct 10, 2023

The mode will only capture calls that go through the dispatcher. Calls into the at::native namespace don't go through the dispatcher and thus will not show up there.
Also, thinking more about this, it might not work for this use case, as you won't be able to see "nested calls" in the logging mode: if you already see the empty() call, you won't see its constituents :/

kurtamohler (Collaborator) commented Oct 10, 2023

Ah, I see, that explains why I didn't see the call to result = at::resize(result, result.sizes()) in the log. So is it alright if I add a toggleable warning like I mentioned before?

albanD (Collaborator) commented Oct 10, 2023

The toggleable warning sounds a bit over-engineered for checking that we don't do something. Especially since we aren't doing anything in that case, it's hard to have code in there to say we're not doing it.

I'm personally fine with only testing the deterministic path (like we do today) and manually checking that we get the expected behavior when the flag is disabled.
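
For reference, a sketch of the deterministic-path test that would be kept, assuming the proposed flag (per the discussion above, floating-point tensors are filled with NaN and integer tensors with the dtype's max value):

import torch

torch.use_deterministic_algorithms(True)
torch.utils.deterministic.fill_uninitialized_memory = True  # the default

# Uninitialized memory is deterministically filled, so these checks are stable:
assert torch.empty(10, dtype=torch.float32).isnan().all()
assert (torch.empty(10, dtype=torch.int32) == torch.iinfo(torch.int32).max).all()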

kurtamohler (Collaborator) commented

I'm not sure how to add support for bits types to fill_empty_deterministic_. It doesn't seem to be supported in torch.empty even with deterministic mode turned off.

>>> import torch
>>> torch.empty(4, dtype=torch.bits8)
...
RuntimeError: "_local_scalar_dense_cpu" not implemented for 'Bits8'

andreigh pushed a commit to andreigh/pytorch that referenced this issue Oct 26, 2023
kurtamohler added a commit to kurtamohler/pytorch that referenced this issue Nov 1, 2023
xuhancn pushed a commit to xuhancn/pytorch that referenced this issue Nov 7, 2023
Skylion007 pushed a commit to Skylion007/pytorch that referenced this issue Nov 14, 2023