
FakeTensorProp assert consistency of sizes when metadata previously existed #124059

Closed
wants to merge 7 commits

Conversation

[ghstack-poisoned]

pytorch-bot bot commented Apr 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124059

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit c0977e4 with merge base 7cd7a7a:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the release notes: fx label Apr 15, 2024
ezyang added a commit that referenced this pull request Apr 15, 2024
…xisted

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: 5ab348d66aecb64337894e22fe2b3891e1d9ced5
Pull Request resolved: #124059
[ghstack-poisoned]
ezyang added a commit that referenced this pull request Apr 15, 2024
…xisted

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: 5085f0d4228808190d30adb47dcdf7613374171b
Pull Request resolved: #124059
@ezyang
Contributor Author

ezyang commented Apr 15, 2024

The test failure here is pretty interesting. The test we're failing on is

        def func2(x, y):
            x.data = y
            y.data = torch.zeros([0])
            return x

where x and y are f32[6] tensors.
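
For intuition, here is a minimal standalone sketch (not part of the test) of the metadata mutation that makes the previously recorded sizes go stale:

    import torch

    x = torch.randn(6)   # f32[6]
    y = torch.zeros(0)
    x.data = y           # metadata mutation: x now reports size [0], not [6]
    print(x.shape)       # torch.Size([0])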

The failure is:

  File "/data/users/ezyang/a/pytorch/torch/utils/_pytree.py", line 781, in unflatten                                                                    
    leaves = list(leaves)                                                   
  File "/data/users/ezyang/a/pytorch/torch/fx/passes/fake_tensor_prop.py", line 49, in check_consistent                                                 
    torch._check(i == j)                                                                                                                                
  File "/data/users/ezyang/a/pytorch/torch/__init__.py", line 1148, in _check                                                                           
    _check_with(RuntimeError, cond, message)                                
  File "/data/users/ezyang/a/pytorch/torch/__init__.py", line 1131, in _check_with                                                                      
    raise error_type(message_evaluated)                                                                                                                 
torch._dynamo.exc.BackendCompilerFailed: backend='<torch._dynamo.testing.CompileCounterWithBackend object at 0x7f2cd3b49b10>' raised:                   
RuntimeError: Expected cond to be True, but got False. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)  
                                                                                                                                                        
While executing %primals_3 : [num_users=2] = placeholder[target=primals_3]                                                                              
Original traceback:                                                                                                                                     
None    

I'm not exactly sure what happened, but because there is metadata mutation, the before and after sizes are inconsistent. It's extra annoying: while I expected this might happen at the Dynamo-to-AOTAutograd boundary, it's actually happening at the Inductor boundary.

if 'val' in n.meta:
    meta_arg = [n.meta['val']]

meta = tree_map(check_consistent, result, *meta_arg)
Contributor

TIL about the rests argument to tree_map
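
A hedged, standalone illustration of that behavior (not from this diff): tree_map accepts additional pytrees after the first and passes corresponding leaves from each tree to the function together, which is what lets old and new metadata be compared leaf by leaf.

    import torch
    from torch.utils._pytree import tree_map

    old = {"a": torch.zeros(2, 3), "b": [torch.ones(4)]}
    new = {"a": torch.zeros(2, 3), "b": [torch.ones(4)]}

    def check(o, n):
        # o and n are corresponding leaves from the two trees
        assert o.shape == n.shape
        return n

    checked = tree_map(check, old, new)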

if isinstance(new, torch.Tensor):
    if old is not None:
        assert isinstance(old, torch.Tensor)
        torch._check(old.dim() == new.dim())
Contributor

Why is this a torch._check and not a plain assert? In my head, it would be a bug in the compiler somewhere if we were re-running fake tensor prop on our graph and one of the nodes' meta tensor values had a different number of dims.

(So far I've pictured torch._check mainly as an API for simplifying dynamic shape equalities, but do we also use it in other places instead of plain assertions?)

Contributor Author

It's actually changing into rename_unbacked_to later. And yes, it is more assert-y than check-y, but I use check here because I need the other behavior it has: it's irrefutable, so if there are unbacked symints, the condition is put into the deferred runtime asserts.
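
A hedged, standalone sketch of that deferred behavior (assuming torch._dynamo.config.capture_scalar_outputs so that .item() produces an unbacked SymInt): torch._check on an unbacked value is recorded as a runtime assert rather than failing at trace time, whereas a plain assert would try to evaluate the symbolic bool eagerly.

    import torch
    import torch._dynamo

    torch._dynamo.config.capture_scalar_outputs = True

    @torch.compile(fullgraph=True)
    def f(x):
        u0 = x.sum().item()      # unbacked SymInt while tracing
        torch._check(u0 >= 0)    # irrefutable: becomes a deferred runtime assert
        return x * u0

    f(torch.tensor([3, 4]))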

else:
    # TODO: How is it possible that we get a non fake tensor? We
    # should be running under the mode...
    return snapshot_fake(self._mode.from_tensor(new, static_shapes=True))
Contributor

🤔 this does feel sketchy

Contributor

oh just realized this is code motion

Contributor Author

preexisting problem!
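
For readers unfamiliar with the API in the moved code, a hedged, standalone sketch of FakeTensorMode.from_tensor, which converts a real tensor into a fake one; static_shapes=True keeps the sizes concrete instead of allocating symbolic dimensions:

    import torch
    from torch._subclasses.fake_tensor import FakeTensorMode

    mode = FakeTensorMode()
    real = torch.randn(3, 4)
    fake = mode.from_tensor(real, static_shapes=True)  # FakeTensor: shape (3, 4), no real storage
    print(type(fake), fake.shape)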

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Apr 15, 2024
…xisted

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

ghstack-source-id: 433f8b8ea0d1d3ffcba6a2534ebf08c0f371ba63
Pull Request resolved: #124059
prim_outputs = FakeTensorProp(
    graph_module, check_consistency=False
).propagate(*args, **kwargs)
Collaborator

Any reason why we wouldn't want to check consistency here?

Contributor Author

Well, it fails your unit tests, and I couldn't figure out what ONNX was doing lol. The relevant code is `python test/onnx/dynamo/test_dynamo_with_onnxruntime_backend.py -k test_elementwise_function_multiple_output_1`.

Collaborator

We will track this on #124173
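
For context on the call under discussion, a hedged, standalone sketch of the basic FakeTensorProp flow: propagate fake tensors through an FX graph so that each node ends up with a meta['val'] entry (the check_consistency flag in the diff above is specific to this PR):

    import torch
    from torch.fx import symbolic_trace
    from torch.fx.passes.fake_tensor_prop import FakeTensorProp

    def f(x):
        return x.relu() + 1

    gm = symbolic_trace(f)
    FakeTensorProp(gm).propagate(torch.empty(4, 8))  # the input is converted to a fake tensor internally
    for node in gm.graph.nodes:
        print(node.name, node.meta.get("val"))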

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Apr 17, 2024
I'm going to set up some extra behavior when we set example value, so
I need a convenient place to interpose.  I cannot easily do it on
meta itself because it's a generic dict with no interposition point.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: #124176
Approved by: https://github.com/oulgen
ghstack dependencies: #124105, #124059
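
A hedged sketch of the kind of interposition point that commit message describes; the helper name and body below are illustrative, not the exact code added by #124176:

    import torch.fx

    def set_example_value(node: torch.fx.Node, value) -> None:
        # Single choke point for writing the example value into node.meta, so
        # extra bookkeeping (e.g. recording unbacked symbol bindings) can later
        # be layered in here instead of at every call site.
        node.meta["example_value"] = value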
trieuat pushed a commit to trieuat/pytorch that referenced this pull request Apr 21, 2024
…xisted (pytorch#124059)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: pytorch#124059
Approved by: https://github.com/bdhirsh, https://github.com/thiagocrepaldi
ghstack dependencies: pytorch#124105
pytorch-bot bot pushed a commit that referenced this pull request Apr 21, 2024
Pull Request resolved: #124176
Approved by: https://github.com/oulgen
ghstack dependencies: #124105, #124059
pytorchmergebot pushed a commit that referenced this pull request Apr 21, 2024
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: #124283
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #124105, #124059, #124176
pytorchmergebot pushed a commit that referenced this pull request Apr 21, 2024
…led (#124284)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: #124284
Approved by: https://github.com/Chillee
ghstack dependencies: #124105, #124059, #124176, #124283
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
…xisted (pytorch#124059)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: pytorch#124059
Approved by: https://github.com/bdhirsh, https://github.com/thiagocrepaldi
ghstack dependencies: pytorch#124105
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
Pull Request resolved: pytorch#124176
Approved by: https://github.com/oulgen
ghstack dependencies: pytorch#124105, pytorch#124059
pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2024
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: #124283
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #124105, #124059, #124176
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
…led (pytorch#124284)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: pytorch#124284
Approved by: https://github.com/Chillee
ghstack dependencies: pytorch#124105, pytorch#124059, pytorch#124176, pytorch#124283
@eellison
Contributor

This is causing a number of failures in huggingface. https://ossci-raw-job-status.s3.amazonaws.com/log/24052568002

Should we revert or try to forward-fix?

@eellison
Contributor

cc oncall @masnesral

@ezyang
Contributor Author

ezyang commented Apr 22, 2024

A few notes:

  1. How come it wasn't caught in CI? Where is the linked job from?
  2. This is adding more assertions, and in my experience, these assertions all uncovered latent bugs. So we should fix the bugs.
  3. I actually reverted portions of this in the rework PR (FakeTensorProp works with unbacked bindings, #124310), because adding a similar assert to other parts of the stack uncovered too many bugs, so I narrowed the assert and, for consistency, narrowed it here as well. So the forward fix may already exist. The stack is waiting on code review right now.
  4. I can't tell from the CI logs what the unexpected failures are, but I see some that are GuardOnDataDependent errors? I'm happy to follow up on these, although I was under the impression we weren't enabling unbacked SymInts in the benchmark suite yet.

@eellison
Contributor

@ezyang this is on the benchmarking run, not the tests we run on each PR. I don't know if there are full-model dynamic-shape training tests for each model, or if the set includes the newly failing tests. It is a significant number of newly failing models:

[screenshot: table of newly failing models]

@eellison
Contributor

Part of this may have been because of huggingface outages. We can at least wait another day and see.

pytorchmergebot pushed a commit that referenced this pull request Apr 25, 2024
This is a partial revert of #124059

Like in #124297, profiling has revealed that testing equality on *every* output is kind of expensive. So we only test equality when we know there is an unbacked binding.  This is the same playbook as the previous PR, just on FakeTensorProp instead of PropagateUnbackedSymInts. Note that we also need to populate `unbacked_bindings` in proxy_tensor.py, since we're generating an entirely new graph in that case.

We now have enough propagation that we're able to trigger a bug related to divisibility replacement. In #113165 we allowed to replace `u0` with `u1 * c` for some constant c, when we have determined that u0 is divisible by c. However, where does the binding for u1 come from? What we will have in practice is that there is some node that is supposed to have bound u1, but which actually is getting a `u1 * c` in its output. So, to get u1, we must divide out c. Fortunately, under the divisibility condition, this is always possible (but remember, we must test divisibility at runtime!)

Because we have tightened up asserts, it is now an error to allocate unbacked SymInts and then fail to track them under unbacked_bindings. In torch/_dynamo/eval_frame.py and torch/_functorch/_aot_autograd/collect_metadata_analysis.py there are examples of benign cases where we repropagated fake tensors but then immediately threw away the results. In these cases, it's not appropriate to rebind, since we're still using the old FX graph that has all of the old symbols. So we just manually clear it. It is possible that other cases will need to be updated, so this PR is "risky" from the perspective of hitting fbcode.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: #124310
Approved by: https://github.com/lezcano
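
A hedged sketch of the "divide out c" step described in that commit message; it mirrors the idea only, not the actual shape-env code:

    def recover_unbacked_binding(expr_value: int, c: int) -> int:
        # The u0 -> u1 * c replacement was only legal because u0 was known to be
        # divisible by c, so re-check divisibility at runtime before dividing
        # the constant back out to recover the binding for u1.
        if expr_value % c != 0:
            raise RuntimeError(f"expected {expr_value} to be divisible by {c}")
        return expr_value // c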
alat-rights pushed a commit to alat-rights/pytorch that referenced this pull request Apr 26, 2024
This is a partial revert of pytorch#124059
Pull Request resolved: pytorch#124310
Approved by: https://github.com/lezcano
carmocca pushed a commit to carmocca/pytorch that referenced this pull request Apr 29, 2024
This is a partial revert of pytorch#124059
Pull Request resolved: pytorch#124310
Approved by: https://github.com/lezcano
andoorve pushed a commit to andoorve/pytorch that referenced this pull request May 1, 2024
…ch#124283)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: pytorch#124283
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: pytorch#124105, pytorch#124059, pytorch#124176
andoorve pushed a commit to andoorve/pytorch that referenced this pull request May 1, 2024
…led (pytorch#124284)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: pytorch#124284
Approved by: https://github.com/Chillee
ghstack dependencies: pytorch#124105, pytorch#124059, pytorch#124176, pytorch#124283
andoorve pushed a commit to andoorve/pytorch that referenced this pull request May 1, 2024
This is a partial revert of pytorch#124059
Pull Request resolved: pytorch#124310
Approved by: https://github.com/lezcano
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
…xisted (pytorch#124059)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: pytorch#124059
Approved by: https://github.com/bdhirsh, https://github.com/thiagocrepaldi
ghstack dependencies: pytorch#124105
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
Pull Request resolved: pytorch#124176
Approved by: https://github.com/oulgen
ghstack dependencies: pytorch#124105, pytorch#124059
pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: #124283
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #124105, #124059, #124176
pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
…led (#124284)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: #124284
Approved by: https://github.com/Chillee
ghstack dependencies: #124105, #124059, #124176, #124283
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
This is a partial revert of pytorch#124059
Pull Request resolved: pytorch#124310
Approved by: https://github.com/lezcano
Labels: ciflow/inductor, ciflow/trunk, Merged, release notes: fx