FakeTensorProp assert consistency of sizes when metadata previously existed #124059
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124059
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit c0977e4 with merge base 7cd7a7a:
FLAKY - The following job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…xisted Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 5ab348d66aecb64337894e22fe2b3891e1d9ced5 Pull Request resolved: #124059
…xisted Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 5085f0d4228808190d30adb47dcdf7613374171b Pull Request resolved: #124059
The test failure here is pretty interesting. The test we're failing on is
where x and y are f32[6] tensors. The failure is:
I'm not exactly sure what happened, but because there is metadata mutation, the before and after sizes are inconsistent. It's extra annoying because, while I expected this might happen at the Dynamo to AOTAutograd boundary, it is happening at the Inductor boundary.
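(For intuition only, a hedged toy illustration of a metadata-mutating op; the actual failing test is elided above and may involve a different op:)

```python
import torch

x = torch.zeros(6)
print(x.shape)        # torch.Size([6])
x.resize_(3, 2)       # resize_ mutates the tensor's size/stride metadata in place
print(x.shape)        # torch.Size([3, 2])
# If node.meta['val'] was recorded before such a mutation, re-running fake
# tensor prop afterwards can see a different size for the "same" node.
```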
torch/fx/passes/fake_tensor_prop.py
Outdated
if 'val' in n.meta:
    meta_arg = [n.meta['val']]
...
meta = tree_map(check_consistent, result, *meta_arg)
TIL about the `rests` argument to `tree_map`
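For reference, a minimal hedged sketch of what that extra positional argument does (toy values, not the PR's code):

```python
# torch.utils._pytree.tree_map accepts extra "rests" pytrees after the first one;
# each rest must share the first tree's structure, and func gets one leaf from each.
from torch.utils._pytree import tree_map

result = {"a": 1, "b": 2}        # stands in for the freshly computed node output
meta_arg = {"a": 10, "b": 20}    # stands in for the previously recorded n.meta['val']
out = tree_map(lambda new, old: (new, old), result, meta_arg)
# out == {"a": (1, 10), "b": (2, 20)}
```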
torch/fx/passes/fake_tensor_prop.py
Outdated
if isinstance(new, torch.Tensor):
    if old is not None:
        assert isinstance(old, torch.Tensor)
        torch._check(old.dim() == new.dim())
Why is this a `torch._check` and not a plain `assert`? In my head, it would be a bug in the compiler somewhere if we were re-running fake tensor prop on our graph and one of the nodes' meta tensor values had a different number of dims.
(So far I've pictured `torch._check` mainly as an API for simplifying dynamic shape equalities, but do we also use it in other places instead of plain assertions?)
It's actually changing into `rename_unbacked_to` later. And yes, it is more assert-y than check-y, but I use check here because I need the other behavior it has (it's irrefutable; if there are unbacked symints, it goes into the deferred runtime asserts).
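For illustration, a hedged sketch (not the PR's exact code) of the behavioral difference: a plain assert has to evaluate the condition at trace time, while `torch._check` can defer a condition involving unbacked SymInts to a runtime assert.

```python
import torch

def check_consistent(new, old) -> None:
    # Hypothetical helper mirroring the diff above. With concrete (backed) sizes
    # this behaves like an assert; if the comparison involves unbacked SymInts,
    # torch._check records it as a deferred runtime assert instead of failing
    # at trace time.
    if isinstance(new, torch.Tensor) and old is not None:
        assert isinstance(old, torch.Tensor)
        torch._check(old.dim() == new.dim())
```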
torch/fx/passes/fake_tensor_prop.py
Outdated
else:
    # TODO: How is it possible that we get a non fake tensor? We
    # should be running under the mode...
    return snapshot_fake(self._mode.from_tensor(new, static_shapes=True))
🤔 this does feel sketchy
oh just realized this is code motion
preexisting problem!
…xisted Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 433f8b8ea0d1d3ffcba6a2534ebf08c0f371ba63 Pull Request resolved: #124059
)
prim_outputs = FakeTensorProp(
    graph_module, check_consistency=False
).propagate(*args, **kwargs)
Any reason why we wouldn't want to check consistency here?
Well, it fails your unit tests, and I couldn't figure out what ONNX was doing lol. The relevant test is `python test/onnx/dynamo/test_dynamo_with_onnxruntime_backend.py -k test_elementwise_function_multiple_output_1`.
We will track this on #124173
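For context, a minimal hedged sketch of driving `FakeTensorProp` over a traced graph, in the spirit of the ONNX backend call in the diff above (the toy function and inputs are made up; `check_consistency` is the flag this PR adds and is omitted here):

```python
import torch
from torch.fx import symbolic_trace
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.fx.passes.fake_tensor_prop import FakeTensorProp

def f(x, y):
    return x + y, x - y

gm = symbolic_trace(f)

# propagate() fakeifies the example inputs under the mode and records a fake
# value for each node, which later re-runs can be checked against.
mode = FakeTensorMode()
prim_outputs = FakeTensorProp(gm, mode=mode).propagate(torch.randn(6), torch.randn(6))
```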
I'm going to setup some extra behavior when we set example value, so I need a convenient place to interpose. I cannot easily do it on meta itself because its a generic dict with no interposition point. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: #124176 Approved by: https://github.com/oulgen ghstack dependencies: #124105, #124059
…xisted (pytorch#124059) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: pytorch#124059 Approved by: https://github.com/bdhirsh, https://github.com/thiagocrepaldi ghstack dependencies: pytorch#124105
Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: #124283 Approved by: https://github.com/IvanKobzarev ghstack dependencies: #124105, #124059, #124176
…led (pytorch#124284) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: pytorch#124284 Approved by: https://github.com/Chillee ghstack dependencies: pytorch#124105, pytorch#124059, pytorch#124176, pytorch#124283
This is causing a number of failures in huggingface. https://ossci-raw-job-status.s3.amazonaws.com/log/24052568002 Should we revert or try to forward fix?
cc oncall @masnesral
A few notes:
@ezyang this is on the benchmarking run, not the tests we run on each PR. I don't know if there are full-model dynamic shape training tests for each model, or if the set includes the newly failing tests. It is a significant number of newly failing models:
Part of this may have been because of huggingface outages. We can at least wait another day and see.
This is a partial revert of #124059.

Like in #124297, profiling has revealed that testing equality on *every* output is kind of expensive. So we only test equality when we know there is an unbacked binding. This is the same playbook as the previous PR, just on FakeTensorProp instead of PropagateUnbackedSymInts. Note that we also need to populate `unbacked_bindings` in proxy_tensor.py, since we're generating an entirely new graph in that case.

We now have enough propagation that we're able to trigger a bug related to divisibility replacement. In #113165 we allowed replacing `u0` with `u1 * c` for some constant c, when we have determined that u0 is divisible by c. However, where does the binding for u1 come from? What we will have in practice is that there is some node that is supposed to have bound u1, but which is actually getting a `u1 * c` in its output. So, to get u1, we must divide out c. Fortunately, under the divisibility condition, this is always possible (but remember, we must test divisibility at runtime!)

Because we have tightened up asserts, it is now an error to allocate unbacked SymInts and then fail to track them under unbacked_bindings. In torch/_dynamo/eval_frame.py and torch/_functorch/_aot_autograd/collect_metadata_analysis.py there are examples of benign cases where we repropagated fake tensors but then immediately threw away the results. In these cases, it's not appropriate to rebind, since we're still using the old FX graph that has all of the old symbols. So we just manually clear it. It is possible that other cases will need to be updated, so this PR is "risky" from the perspective of hitting fbcode.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: #124310
Approved by: https://github.com/lezcano
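A hedged numeric illustration of the divide-out step described above (toy values only):

```python
# If u0 was replaced by u1 * c because the compiler determined that c divides u0,
# the binding for u1 is recovered from an observed value by dividing out c.
# The divisibility condition itself still has to be re-checked at runtime.
c = 4
u0_observed = 12                 # the node's output actually carries u1 * c
assert u0_observed % c == 0      # runtime divisibility test
u1 = u0_observed // c            # recovered binding for u1 (here, 3)
```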
Stack from ghstack (oldest at bottom):
Signed-off-by: Edward Z. Yang <ezyang@meta.com>