
FakeTensorProp assert consistency of sizes when metadata previously existed #124059

Closed
wants to merge 7 commits

Conversation

[ghstack-poisoned]

pytorch-bot bot commented Apr 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124059

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit c0977e4 with merge base 7cd7a7a:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the release notes: fx label Apr 15, 2024
ezyang added a commit that referenced this pull request Apr 15, 2024
…xisted

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: 5ab348d66aecb64337894e22fe2b3891e1d9ced5
Pull Request resolved: #124059
[ghstack-poisoned]
ezyang added a commit that referenced this pull request Apr 15, 2024
…xisted

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: 5085f0d4228808190d30adb47dcdf7613374171b
Pull Request resolved: #124059
@ezyang
Contributor Author

ezyang commented Apr 15, 2024

The test failure here is pretty interesting. The test we're failing on is

        def func2(x, y):
            x.data = y
            y.data = torch.zeros([0])
            return x

where x and y are f32[6] tensors.
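
For intuition, here is a minimal standalone sketch (not part of the test) of the metadata mutation that makes the previously recorded sizes go stale:

    import torch

    x = torch.randn(6)   # f32[6]
    y = torch.zeros(0)
    x.data = y           # metadata mutation: x now reports size [0], not [6]
    print(x.shape)       # torch.Size([0])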

The failure is:

  File "/data/users/ezyang/a/pytorch/torch/utils/_pytree.py", line 781, in unflatten                                                                    
    leaves = list(leaves)                                                   
  File "/data/users/ezyang/a/pytorch/torch/fx/passes/fake_tensor_prop.py", line 49, in check_consistent                                                 
    torch._check(i == j)                                                                                                                                
  File "/data/users/ezyang/a/pytorch/torch/__init__.py", line 1148, in _check                                                                           
    _check_with(RuntimeError, cond, message)                                
  File "/data/users/ezyang/a/pytorch/torch/__init__.py", line 1131, in _check_with                                                                      
    raise error_type(message_evaluated)                                                                                                                 
torch._dynamo.exc.BackendCompilerFailed: backend='<torch._dynamo.testing.CompileCounterWithBackend object at 0x7f2cd3b49b10>' raised:                   
RuntimeError: Expected cond to be True, but got False. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)  
                                                                                                                                                        
While executing %primals_3 : [num_users=2] = placeholder[target=primals_3]                                                                              
Original traceback:                                                                                                                                     
None    

I'm not exactly sure what happened, but because there is metadata mutation, the before and after sizes are inconsistent. It's extra annoying: while I expected this might happen at the Dynamo-to-AOTAutograd boundary, it's actually happening at the Inductor boundary.

if 'val' in n.meta:
    meta_arg = [n.meta['val']]

meta = tree_map(check_consistent, result, *meta_arg)
Contributor

TIL about the rests argument to tree_map
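
A hedged, standalone illustration of that behavior (not from this diff): tree_map accepts additional pytrees after the first and passes corresponding leaves from each tree to the function together, which is what lets old and new metadata be compared leaf by leaf.

    import torch
    from torch.utils._pytree import tree_map

    old = {"a": torch.zeros(2, 3), "b": [torch.ones(4)]}
    new = {"a": torch.zeros(2, 3), "b": [torch.ones(4)]}

    def check(o, n):
        # o and n are corresponding leaves from the two trees
        assert o.shape == n.shape
        return n

    checked = tree_map(check, old, new)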

if isinstance(new, torch.Tensor):
    if old is not None:
        assert isinstance(old, torch.Tensor)
        torch._check(old.dim() == new.dim())
Contributor

Why is this a torch._check and not a plain assert? In my head, it would be a bug in the compiler somewhere if we were re-running fake tensor prop on our graph and one of the nodes' meta tensor values had a different number of dims.

(So far I've pictured torch._check mainly as an API for simplifying dynamic shape equalities, but do we also use it in other places instead of plain assertions?)

Contributor Author

It's actually changing into rename_unbacked_to later. And yes, it is more assert-y than check-y, but I use check here because I need the other behavior it has: it's irrefutable, so if there are unbacked symints, the condition is put into the deferred runtime asserts.
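
A hedged, standalone sketch of that deferred behavior (assuming torch._dynamo.config.capture_scalar_outputs so that .item() produces an unbacked SymInt): torch._check on an unbacked value is recorded as a runtime assert rather than failing at trace time, whereas a plain assert would try to evaluate the symbolic bool eagerly.

    import torch
    import torch._dynamo

    torch._dynamo.config.capture_scalar_outputs = True

    @torch.compile(fullgraph=True)
    def f(x):
        u0 = x.sum().item()      # unbacked SymInt while tracing
        torch._check(u0 >= 0)    # irrefutable: becomes a deferred runtime assert
        return x * u0

    f(torch.tensor([3, 4]))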

else:
    # TODO: How is it possible that we get a non fake tensor? We
    # should be running under the mode...
    return snapshot_fake(self._mode.from_tensor(new, static_shapes=True))
Contributor

🤔 this does feel sketchy

Contributor

oh just realized this is code motion

Contributor Author

preexisting problem!
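
For readers unfamiliar with the API in the moved code, a hedged, standalone sketch of FakeTensorMode.from_tensor, which converts a real tensor into a fake one; static_shapes=True keeps the sizes concrete instead of allocating symbolic dimensions:

    import torch
    from torch._subclasses.fake_tensor import FakeTensorMode

    mode = FakeTensorMode()
    real = torch.randn(3, 4)
    fake = mode.from_tensor(real, static_shapes=True)  # FakeTensor: shape (3, 4), no real storage
    print(type(fake), fake.shape)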

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Apr 15, 2024
…xisted

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: 433f8b8ea0d1d3ffcba6a2534ebf08c0f371ba63
Pull Request resolved: #124059
prim_outputs = FakeTensorProp(
    graph_module, check_consistency=False
).propagate(*args, **kwargs)
Collaborator

Any reason why we wouldn't want to check consistency here?

Contributor Author

Well, it fails your unit tests, and I couldn't figure out what ONNX was doing lol. The relevant code is `python test/onnx/dynamo/test_dynamo_with_onnxruntime_backend.py -k test_elementwise_function_multiple_output_1`.

Collaborator

We will track this on #124173
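
For context on the call under discussion, a hedged, standalone sketch of the basic FakeTensorProp flow: propagate fake tensors through an FX graph so that each node ends up with a meta['val'] entry (the check_consistency flag in the diff above is specific to this PR):

    import torch
    from torch.fx import symbolic_trace
    from torch.fx.passes.fake_tensor_prop import FakeTensorProp

    def f(x):
        return x.relu() + 1

    gm = symbolic_trace(f)
    FakeTensorProp(gm).propagate(torch.empty(4, 8))  # the input is converted to a fake tensor internally
    for node in gm.graph.nodes:
        print(node.name, node.meta.get("val"))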

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Apr 17, 2024
I'm going to set up some extra behavior when we set example value, so
I need a convenient place to interpose.  I cannot easily do it on
meta itself because it's a generic dict with no interposition point.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: #124176
Approved by: https://github.com/oulgen
ghstack dependencies: #124105, #124059
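
A hedged sketch of the kind of interposition point that commit message describes; the helper name and body below are illustrative, not the exact code added by #124176:

    import torch.fx

    def set_example_value(node: torch.fx.Node, value) -> None:
        # Single choke point for writing the example value into node.meta, so
        # extra bookkeeping (e.g. recording unbacked symbol bindings) can later
        # be layered in here instead of at every call site.
        node.meta["example_value"] = value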
trieuat pushed a commit to trieuat/pytorch that referenced this pull request Apr 21, 2024
…xisted (pytorch#124059)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: pytorch#124059
Approved by: https://github.com/bdhirsh, https://github.com/thiagocrepaldi
ghstack dependencies: pytorch#124105
pytorch-bot bot pushed a commit that referenced this pull request Apr 21, 2024
Pull Request resolved: #124176
Approved by: https://github.com/oulgen
ghstack dependencies: #124105, #124059
pytorchmergebot pushed a commit that referenced this pull request Apr 21, 2024
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: #124283
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #124105, #124059, #124176
pytorchmergebot pushed a commit that referenced this pull request Apr 21, 2024
…led (#124284)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: #124284
Approved by: https://github.com/Chillee
ghstack dependencies: #124105, #124059, #124176, #124283
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
…xisted (pytorch#124059)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: pytorch#124059
Approved by: https://github.com/bdhirsh, https://github.com/thiagocrepaldi
ghstack dependencies: pytorch#124105
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
Pull Request resolved: pytorch#124176
Approved by: https://github.com/oulgen
ghstack dependencies: pytorch#124105, pytorch#124059
pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2024
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: #124283
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #124105, #124059, #124176
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
…led (pytorch#124284)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: pytorch#124284
Approved by: https://github.com/Chillee
ghstack dependencies: pytorch#124105, pytorch#124059, pytorch#124176, pytorch#124283
@eellison
Contributor

This is causing a number of failures in huggingface. https://ossci-raw-job-status.s3.amazonaws.com/log/24052568002

Should we revert or try to forward-fix?

@eellison
Contributor

cc oncall @masnesral

@ezyang
Contributor Author

ezyang commented Apr 22, 2024

A few notes:

  1. How come it wasn't caught in CI? Where is the linked job from?
  2. This is adding more assertions, and in my experience, these assertions all uncovered latent bugs. So we should fix the bugs.
  3. I actually reverted portions of this in the rework PR (FakeTensorProp works with unbacked bindings, #124310), because adding a similar assert to other parts of the stack uncovered too many bugs, so I narrowed the assert and, for consistency, narrowed it here as well. So the forward fix may already exist. The stack is waiting on code review right now.
  4. I can't tell from the CI logs what the unexpected failures are, but I see some that are GuardOnDataDependent errors? I'm happy to follow up on these, although I was under the impression we weren't enabling unbacked SymInts in the benchmark suite yet.

@eellison
Contributor

@ezyang this is on the benchmarking run, not the tests we run on each PR. I don't know if there are full-model dynamic-shape training tests for each model, or if the set includes the newly failing tests. It is a significant number of newly failing models:

[screenshot: table of newly failing models]

@eellison
Contributor

Part of this may have been because of huggingface outages. We can at least wait another day and see.

pytorchmergebot pushed a commit that referenced this pull request Apr 25, 2024
This is a partial revert of #124059

Like in #124297, profiling has revealed that testing equality on *every* output is kind of expensive. So we only test equality when we know there is an unbacked binding.  This is the same playbook as the previous PR, just on FakeTensorProp instead of PropagateUnbackedSymInts. Note that we also need to populate `unbacked_bindings` in proxy_tensor.py, since we're generating an entirely new graph in that case.

We now have enough propagation that we're able to trigger a bug related to divisibility replacement. In #113165 we allowed to replace `u0` with `u1 * c` for some constant c, when we have determined that u0 is divisible by c. However, where does the binding for u1 come from? What we will have in practice is that there is some node that is supposed to have bound u1, but which actually is getting a `u1 * c` in its output. So, to get u1, we must divide out c. Fortunately, under the divisibility condition, this is always possible (but remember, we must test divisibility at runtime!)

Because we have tightened up asserts, it is now an error to allocate unbacked SymInts and then fail to track them under unbacked_bindings. In torch/_dynamo/eval_frame.py and torch/_functorch/_aot_autograd/collect_metadata_analysis.py there are examples of benign cases where we repropagated fake tensors but then immediately threw away the results. In these cases, it's not appropriate to rebind, since we're still using the old FX graph that has all of the old symbols. So we just manually clear it. It is possible that other cases will need to be updated, so this PR is "risky" from the perspective of hitting fbcode.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: #124310
Approved by: https://github.com/lezcano
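
A hedged sketch of the "divide out c" step described in that commit message; it mirrors the idea only, not the actual shape-env code:

    def recover_unbacked_binding(expr_value: int, c: int) -> int:
        # The u0 -> u1 * c replacement was only legal because u0 was known to be
        # divisible by c, so re-check divisibility at runtime before dividing
        # the constant back out to recover the binding for u1.
        if expr_value % c != 0:
            raise RuntimeError(f"expected {expr_value} to be divisible by {c}")
        return expr_value // c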
alat-rights pushed a commit to alat-rights/pytorch that referenced this pull request Apr 26, 2024
This is a partial revert of pytorch#124059
Pull Request resolved: pytorch#124310
Approved by: https://github.com/lezcano
carmocca pushed a commit to carmocca/pytorch that referenced this pull request Apr 29, 2024
This is a partial revert of pytorch#124059
Pull Request resolved: pytorch#124310
Approved by: https://github.com/lezcano
andoorve pushed a commit to andoorve/pytorch that referenced this pull request May 1, 2024
…ch#124283)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: pytorch#124283
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: pytorch#124105, pytorch#124059, pytorch#124176
andoorve pushed a commit to andoorve/pytorch that referenced this pull request May 1, 2024
…led (pytorch#124284)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: pytorch#124284
Approved by: https://github.com/Chillee
ghstack dependencies: pytorch#124105, pytorch#124059, pytorch#124176, pytorch#124283
andoorve pushed a commit to andoorve/pytorch that referenced this pull request May 1, 2024
This is a partial revert of pytorch#124059
Pull Request resolved: pytorch#124310
Approved by: https://github.com/lezcano
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
…xisted (pytorch#124059)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: pytorch#124059
Approved by: https://github.com/bdhirsh, https://github.com/thiagocrepaldi
ghstack dependencies: pytorch#124105
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
Pull Request resolved: pytorch#124176
Approved by: https://github.com/oulgen
ghstack dependencies: pytorch#124105, pytorch#124059
pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: #124283
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #124105, #124059, #124176
pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
…led (#124284)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: #124284
Approved by: https://github.com/Chillee
ghstack dependencies: #124105, #124059, #124176, #124283
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
This is a partial revert of pytorch#124059
Pull Request resolved: pytorch#124310
Approved by: https://github.com/lezcano
Labels: ciflow/inductor, ciflow/trunk, Merged, release notes: fx