Reimplement unbacked symbol bindings in Inductor #124394

ezyang · 2024-04-18T14:23:36Z

Stack from ghstack (oldest at bottom):

This PR has a lot of "draw the rest of the fucking owl" energy. Here's how to break it down.

torch/_inductor/graph.py - We start by tightening unbacked symbol invariants. Specifically, as we lower FX nodes, we check that every unbacked_binding recorded on the FX node meta actually ends up getting bound (according to get_unbacked_symbol_defs) in the buffers generated by the lowering. Hopefully this invariant is self-evident. This leads to a lot of failures.
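A minimal sketch of that invariant, assuming "bound in the buffers" means "defined by at least one buffer created while lowering the node". `ToyBuffer` and `check_bindings_bound` are hypothetical names for illustration, not Inductor APIs; the `defs` field only plays the role that get_unbacked_symbol_defs plays in the real code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBuffer:
    """Hypothetical stand-in for an Inductor buffer and the symbols it defines."""
    name: str
    defs: set = field(default_factory=set)


def check_bindings_bound(unbacked_bindings, new_buffers):
    """Raise if a symbol recorded on the FX node is defined by none of the new buffers."""
    defined = set()
    for buf in new_buffers:
        defined |= buf.defs
    missing = set(unbacked_bindings) - defined
    if missing:
        raise AssertionError(
            f"unbacked symbols never bound by lowering: {sorted(missing)}"
        )
    return defined


# Buffers produced while lowering one FX node; only buf0 defines a symbol.
buffers = [ToyBuffer("buf0", {"u0"}), ToyBuffer("buf1")]
check_bindings_bound({"u0": "some keypath"}, buffers)  # passes silently
```

Running the same check with a binding for `u1` would raise, which is the kind of failure the tightened invariant surfaces.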
torch/_inductor/ir.py - Problem 1: There is softness in how Inductor computes defs of unbacked symbols in IR nodes. Previously, we tried to infer them by looking at the output sizes/strides/etc. and seeing whether new unbacked symbols popped up that we hadn't seen in the inputs. I don't know exactly what was buggy about the old code, but sometimes we would fail to notice an unbacked symbol had been bound, or rebind an unbacked symbol multiple times. Fortunately, thanks to the earlier PRs in our stack, we now have a nice list of unbacked symbol bindings from FX, so we now just store it directly on ExternKernel and use it directly to report defs. This has to be done twice: once for FallbackKernel (e.g., nonzero) and once for DynamicScalar (e.g., item) (see also torch/_inductor/lowering.py, torch/_inductor/codegen/wrapper.py and torch/_inductor/codegen/cpp_wrapper_cpu.py for the lowering and codegen changes for item).
- process_kernel - Sidequest! It turns out that Inductor lowering can reallocate unbacked symbols. This happens specifically when we repropagate fake tensors through the operator in process_kernel. This repropagation process is necessary because Inductor may have changed the strides of input tensors, and it must now recompute the strides so that it can continue to appropriately plan the rest of the lowering process. This is fine: we just make sure we do the rebind unbacked + compute_unbacked_bindings dance we've been doing previously in the PR stack. But instead of putting unbacked_bindings on a new FX node, they go straight into our unbacked_bindings on the Inductor IR node.
- codegen_unbacked_symbol_defs - Sidequest! FallbackKernel lowering is done in two steps. First, you emit the FallbackKernel buffer. Then, you emit MultiOutput buffers which actually give access to the individual outputs of FallbackKernel, which may have been multi-output. There is a design decision here: does the FallbackKernel bind the unbacked symbols, or the MultiOutput buffer? Historically, we put the binding on the MultiOutput buffer, because it's more convenient: the FallbackKernel buffer is fake; in fact, it doesn't even get a name in C++ codegen. But it's kind of inconsistent with the keypath model that we've been tracking unbacked bindings with: if you have a multi-output node, you'd expect a keypath like [0].size()[0] representing the first output's first dimension size. That suggests that it's the FallbackKernel that should define the symbols. So that was my first implementation. Unfortunately, the C++ codegen is too cursed and I could not understand how to make it work in that case. So now we just unsoundly assume you cannot have multi-output data-dependent output, and do the codegen in MultiOutput. There are some comments explaining exactly what we are improperly assuming.
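To make the keypath idea concrete, here is a toy sketch of resolving `[0].size()[0]` against a multi-output result. The step encoding (`"index"` / `"call"` tuples), `ToyTensor`, and `resolve_keypath` are all assumptions made for this example; the real keypath objects in PyTorch are different.

```python
from dataclasses import dataclass


@dataclass
class ToyTensor:
    """Hypothetical tensor stand-in whose shape may contain symbol names."""
    shape: tuple

    def size(self):
        return self.shape


def resolve_keypath(value, keypath):
    """Walk a keypath such as [0].size()[0], encoded as (kind, arg) steps."""
    for kind, arg in keypath:
        if kind == "index":
            value = value[arg]
        elif kind == "call":
            value = getattr(value, arg)()
        else:
            raise ValueError(f"unknown keypath step: {kind}")
    return value


# A two-output result whose first output has a data-dependent leading size.
outputs = (ToyTensor(("u0", 3)), ToyTensor((5,)))
keypath = (("index", 0), ("call", "size"), ("index", 0))
bound_symbol = resolve_keypath(outputs, keypath)
```

In this model the keypath is rooted at the whole multi-output result, which is why it points at the FallbackKernel-like object as the natural binding site rather than at an individual output.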
_rename_unbacked_to in torch/fx/experimental/symbolic_shapes.py - Previously, when we renamed unbacked symbols, we clobbered any facts we previously knew about them. So for example, if we had a replacement u0 -> s0 but then we renamed u0 to u1, we would now set up the replacement u0 -> u1, clobbering the old replacement. This apparently didn't matter in earlier PRs in the stack, but with Inductor now on the ball, there were some tests that indicated this was a problem. The solution is easy: if u0 had a preexisting replacement, reapply it to u1. However...
- torch/_functorch/_aot_autograd/collect_metadata_analysis.py - When we run forward analysis, this triggers fake tensor repropagation and fresh allocations. Previously, we just cleared out the pending symbols when we finished the analysis. But with the change above, this would also migrate replacements to the new symbols... which are now dead. So now we explicitly suppress generation of these symbols with ignore_fresh_unbacked_symbols so that no rebinding happens at all.
- torch/_dynamo/eval_frame.py - Same deal; I just searched for all sites where we called clear() on the pending symbols.
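The clobbering problem and its fix can be sketched with a plain dictionary of replacements. This is a toy model with symbols as strings; `rename_clobbering`, `rename_preserving`, and `resolve` are hypothetical helpers, not the ShapeEnv implementation.

```python
def rename_clobbering(replacements, old, new):
    """Old behaviour as described above: the rename overwrites what `old` mapped to."""
    out = dict(replacements)
    out[old] = new
    return out


def rename_preserving(replacements, old, new):
    """New behaviour: carry a preexisting replacement for `old` over to `new`."""
    out = dict(replacements)
    previous = out.get(old)
    out[old] = new
    if previous is not None:
        out[new] = previous
    return out


def resolve(replacements, sym):
    """Follow replacement links until reaching a symbol with no replacement."""
    seen = set()
    while sym in replacements and sym not in seen:
        seen.add(sym)
        sym = replacements[sym]
    return sym


before = {"u0": "s0"}
lossy = rename_clobbering(before, "u0", "u1")   # the fact u0 -> s0 is gone
kept = rename_preserving(before, "u0", "u1")    # u0 -> u1 and u1 -> s0
```

With the preserving version, both `u0` and `u1` still resolve to `s0`. It also shows why the dead symbols mentioned above are a problem: migrating a replacement onto a symbol that nothing uses leaves a stale fact behind.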
The last step is fixing the long tail of extra problems that show up now that unbacked_bindings are load-bearing in Inductor:
- torch/_dynamo/eval_frame.py - Some of the exports are making copies of nodes without repropagating fake tensors, so in this case, it is important to also copy the unbacked_bindings (apparently this didn't matter before without the Inductor changes).
- torch/_export/pass_base.py - I discovered via a test suite failure that this is doing fake tensor repropagation. Follow the same playbook as AOTAutograd: PropagateUnbackedSymInts too! Actually, they have also implemented their own tracer, so follow the same playbook as proxy_tensor: record unbacked_bindings on the newly traced nodes. UGH code duplication.
- torch/_subclasses/fake_tensor.py, torch/_subclasses/fake_impls.py (with call site updates at torch/_functorch/_aot_autograd/traced_function_transforms.py and torch/fx/passes/fake_tensor_prop.py) - What's this new epoch thing? I noticed that sometimes I would be retracing, call nonzero() on a fake tensor, and not allocate a new unbacked symbol. This is actually bad, because if I don't get a new unbacked symbol, I don't know there's a binding site, and unbacked_bindings is now missing a binding. The reason for this is memoization: if I reuse the exact same fake tensor on my retrace, it will already have an unbacked SymInt memoized on it and we will short circuit allocation. Well, that's no good. So I associate the memos with a fake tensor epoch, and every time you start a new fake tensor propagation from scratch, you bump the epoch so that I clear all the memos.
- torch/_inductor/scheduler.py - I noticed in unit tests that V.current_node is not always set when we call process_kernel. So I save it into the IR node and restore it when we are running get_estimated_runtime.
- torch/fx/experimental/symbolic_shapes.py - A few things
  - rebind_unbacked (re _tensor_version). Ordinarily, when you have an unbacked SymInt, you persistently have it all the way to the end of the program. _tensor_version violates this: it generates an unbacked SymInt (for reasons I don't quite understand?) and then gets rid of it later. This triggered an assert violation. I think this op is kind of misusing unbacked SymInt, but I didn't know how to refactor it, so it gets a special case.
  - rebind_unbacked (re Simplify SymBool binding). Ugh, SymBool, what a pain in the butt. I have an assert that you can only rebind an unbacked symbol to another unbacked symbol. This assert fails when a boolean is involved, because the result of running the keypath on the result is not u1, it's sympy.Piecewise(... sympy.Eq(u1, 1) ...). This is actually just u1, but Sympy doesn't know it because it doesn't know that u1's value range is [0, 1]. So we manually implement the simplification needed to get the assert to pass.
  - compute_unbacked_bindings (re This is pretty fragile). There is a really funny disaster involving memoization and Inductor's process_kernel. Ordinarily when I retrace, if there was a memo hit in the old trace, there will be a memo hit in the new trace. However, process_kernel breaks this, because it recreates fake tensor inputs to the operator call from scratch (since they might have different strides), and obviously these tensor inputs don't have the memo from the old one. I tried a little to manually transplant the memo to the new fake tensor but it seemed hopeless, so I just let the fresh symbol ride, allocating a new unbacked symbol. However, in one of our tests, we rely on knowing that the first nonzero call is equal to the second (memoized) nonzero call. The equality test looked pretty easy to discharge, so I just went ahead and added a deferred runtime assert to this effect and it worked.
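The epoch mechanism described for fake_tensor.py can be sketched as a memo tagged with the epoch it was recorded in. This is a toy model under assumed names (`ToyFakeMode`, `ToyFakeTensor`, `nonzero_size`, `start_propagation`), not the real FakeTensor classes; it only illustrates why bumping the epoch forces a fresh symbol on a retrace.

```python
import itertools


class ToyFakeMode:
    """Hypothetical stand-in for a fake tensor mode: owns the epoch and the symbol allocator."""

    def __init__(self):
        self.epoch = 0
        self._counter = itertools.count()

    def fresh_symbol(self):
        return f"u{next(self._counter)}"

    def start_propagation(self):
        # Bumping the epoch invalidates every memo recorded under an earlier epoch.
        self.epoch += 1


class ToyFakeTensor:
    def __init__(self, mode):
        self.mode = mode
        self._nonzero_memo = None  # (epoch, symbol) or None

    def nonzero_size(self):
        memo = self._nonzero_memo
        if memo is not None and memo[0] == self.mode.epoch:
            return memo[1]  # memo hit: no allocation, so no new binding site
        sym = self.mode.fresh_symbol()
        self._nonzero_memo = (self.mode.epoch, sym)
        return sym


mode = ToyFakeMode()
t = ToyFakeTensor(mode)
first = t.nonzero_size()        # allocates a symbol
again = t.nonzero_size()        # same epoch, memo hit
mode.start_propagation()        # a retrace begins from scratch
retraced = t.nonzero_size()     # stale memo ignored, fresh symbol allocated
```

Tagging the memo with an epoch avoids having to find and clear every memoized tensor; a stale memo is simply ignored the next time it is consulted.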

Signed-off-by: Edward Z. Yang ezyang@meta.com

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang

[ghstack-poisoned]

pytorch-bot · 2024-04-18T14:23:40Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124394

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 3b4cfa8 with merge base a22847a:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

inductor / rocm6.0-py3.8-inductor / test (inductor, 1, 1, linux.rocm.gpu.2) (gh)
test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShard1DTrainingCore::test_train_parity_multi_group_cpu_offload_eager

This comment was automatically generated by Dr. CI and updates every 15 minutes.

…ings Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 7b7b7a61197c00adeedbd4b035739b829211aa47 Pull Request resolved: #124394

[ghstack-poisoned]

…ings Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 324a8a54ea7ac134b0b231e3d0bc0abe95946a51 Pull Request resolved: #124394

[ghstack-poisoned]

…ings Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: e58954774b4b3d73522d4b46e1da465f192829cb Pull Request resolved: #124394

[ghstack-poisoned]

…ings Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: bb3c8d25f5189b8e151e466c53805cdcaf505a6d Pull Request resolved: #124394

[ghstack-poisoned]

ezyang · 2024-04-24T20:44:02Z

@pytorchbot merge

pytorchmergebot · 2024-04-24T20:45:52Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2024-04-24T20:46:00Z

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

pull / linux-jammy-py3.8-gcc11 / build

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

[ghstack-poisoned]

ezyang · 2024-04-25T02:07:02Z

@pytorchbot merge -f "waiting rocm only, low risk"

pytorchmergebot · 2024-04-25T02:08:39Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

…ings Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 0afe225c64a45010348ebd6298bda637bd27292c Pull Request resolved: pytorch#124394

This is a subset of changes extracted from #124683.

This PR contains modifications to make Inductor work with unbacked symbol inputs, which can occur when a data-dependent sized tensor is saved for backwards. The problems to be fixed:

* When binding initial symbols, we unconditionally bind unbacked symbols (instead of computing if they are needed, which only looks at backed symbols)
* Benchmark generation code doesn't work with unbacked symints as we have no hints to actually feed in real values. So I pick a random number and you are expected to fix it if it doesn't work
* Need to make sure we don't install dependencies on unbacked SymInt inputs, that puts us down the "promptly deallocate the input" path, but that's pointless for unbacked SymInt

Fixes #124652

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: #124739
Approved by: https://github.com/jansel
ghstack dependencies: #124310, #124314, #124316, #124394

#124782)

Previously, unbacked SymInts would gradually get larger and larger as we kept rebinding them. Now, we do the replacement to preserve the old symbol.

Actually doing this is a bit tricky. Here’s the order things happen when retracing data dependent:

1. Run fake tensor prop: allocate new unbacked SymInt
2. Run proxy tensor mode, calculate bindings and associate them with FX node
3. Run PropagateUnbackedSymInts, rename unbacked bindings to their old ones so they are consistent

So the problem is when we calculate bindings in step (2), we don't know what the original names are yet, we only find out later at (3). But by the time (3) runs, we've already stuffed some new bindings in meta["unbacked_bindings"] and we don't know how to update them! To fix this, I introduce resolve_unbacked_bindings which post facto applies any of the renamings we discovered in (3).

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: #124782
Approved by: https://github.com/lezcano
ghstack dependencies: #124310, #124314, #124316, #124394, #124739
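The post facto renaming that #124782 describes can be sketched as a dictionary rewrite. This is a toy model: `toy_resolve_bindings` and its arguments are hypothetical and are not the signature of the real resolve_unbacked_bindings; symbols and keypaths are plain strings here.

```python
def toy_resolve_bindings(renamings, bindings):
    """Apply renamings discovered in step (3) to bindings recorded in step (2)."""
    return {renamings.get(sym, sym): keypath for sym, keypath in bindings.items()}


# Step (2) recorded a binding under the freshly allocated name u5.
recorded = {"u5": "[0].size()[0]"}
# Step (3) later discovers that u5 should keep its original name u0.
renamings = {"u5": "u0"}

resolved = toy_resolve_bindings(renamings, recorded)
```

Because the renaming is applied when the bindings are read rather than when they are written, the stale entry stored on the node in step (2) does not need to be updated in place.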

…124785) This ensures we have correct SymInts when we allocate tangents. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: #124785 Approved by: https://github.com/lezcano ghstack dependencies: #124310, #124314, #124316, #124394, #124739, #124782

…ass (#123735) This ensures that first argument to record_shapeenv_event is a ShapeEnv so we can appropriately short circuit when recording is not in progress. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: #123735 Approved by: https://github.com/ysiraichi, https://github.com/zou3519, https://github.com/albanD ghstack dependencies: #124310, #124314, #124316, #124394, #124739, #124782, #124785

Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: pytorch#124394 Approved by: https://github.com/jansel ghstack dependencies: pytorch#124310, pytorch#124314, pytorch#124316

This PR has a lot of "draw the rest of the fucking owl" energy. Here's how to break it down. 1. **torch/_inductor/graph.py** - We start by tightening unbacked symbol invariants. Specifically, as we lower FX nodes, we check whether or not every unbacked_binding recorded on the FX node meta, actually ends up getting bound (according to get_unbacked_symbol_defs) in all the buffers generated by the lowering. Hopefully this invariant is self evident. This leads to a lot of failures. 2. **torch/_inductor/ir.py** - Problem 1: There is softness in how Inductor computes defs of unbacked symbols in IR node. Previously, we tried to infer it by looking at the output sizes/strides/etc and see if new unbacked symbols popped up that we hadn't seen in the inputs. I don't know exactly what was buggy about the old code, but sometimes we would fail to notice an unbacked symbol had been bound, or rebind an unbacked symbol multiple times. Fortunately, thanks to the earlier PRs in our stack, we now have a nice list of unbacked symbol bindings from FX, so we now just store it directly on ExternKernel and use it directly to report defs. This has to be done twice: once for FallbackKernel (e.g., nonzero) and once for DynamicScalar (e.g., item) (see also **torch/_inductor/lowering.py**, **torch/_inductor/codegen/wrapper.py** and **torch/_inductor/codegen/cpp_wrapper_cpu.py** for the lowering and codegen changes for item) * **process_kernel** - Sidequest! It turns out that Inductor lowering can reallocate unbacked symbols. This happens specifically when we repropagate fake tensors through the operator in `process_kernel`. This repropagation process is necessary because Inductor may have changed the strides of input tensors, and it must now recompute the strides so that it can continue to appropriately plan the rest of the lowering process. This is fine: we just make sure we do the rebind unbacked + compute_unbacked_bindings dance we've been doing previously in the PR stack. 
But instead of putting unbacked_bindings on a new FX node, they go straight into our unbacked_bindings on the Inductor IR node. * **codegen_unbacked_symbol_defs** - Sidequest! FallbackKernel lowering is done in two steps. First, you emit the FallbackKernel buffer. Then, you emit MultiOutput buffers which actually give access to the individual outputs of FallbackKernel, which may have been multi-output. There is a design decision here: does the FallbackKernel bind the unbacked symbols, or the MultiOutput buffer? Historically, we put the binding on MultiOutput buffer, because it's more convenient: the FallbackKernel buffer is fake, in fact, it doesn't even get a name in C++ codegen. But it's kind of inconsistent with the keypath model that we've been tracking unbacked bindings with: if you have a multi-output node, you'd expect a keypath like `[0].size()[0]` representing the first output's first dimension size. That suggests that it's the FallbackKernel that should define the things. So that was my first implementation. Unfortunately, the C++ codegen is too cursed and I could not understand how to make it work in that case. So now we just unsoundly assume you cannot have multi-output data dependent output, and do the codegen in MultiOutput. There are some comments explaining exactly what we are improperly assuming. 3. **_rename_unbacked_to** in **torch/fx/experimental/symbolic_shapes.py** - Previously, when we renamed unbacked symbols, we clobbered any facts we previously knew about them. So for example, if we had a replacement `u0 -> s0` but then we renamed u0 to u1, we would now setup the replacement `u0 -> u1`, clobbering the old replacement. This apparently didn't matter in earlier PRs in the stack, but with Inductor now on the ball, there were some tests that indicated this was a problem. The solution is easy: if u0 had a preexisting replacement, reapply it to u1. However... 
* **torch/_functorch/_aot_autograd/collect_metadata_analysis.py** - When we run forward analysis, this triggers fake tensor repropagation and fresh allocations. Previously, we just cleared out the pending symbols when we finished the analysis. But with the change above, this would also migrate replacements to the new symbols... which are now dead. So now we explicitly suppress generation of these symbols with `ignore_fresh_unbacked_symbols` so that no rebinding happens at all.
* **torch/_dynamo/eval_frame.py** - same deal; I just searched for all sites where we called clear() on pending

4. The last step is fixing the long tail of extra problems that show up, now that unbacked_bindings are load bearing into Inductor

* **torch/_dynamo/eval_frame.py** - Some of the exports are making copies of nodes without repropagating fake tensors, so in this case, it is important to also copy the `unbacked_bindings` (apparently this didn't matter before without the Inductor changes)
* **torch/_export/pass_base.py** - I discovered via a test suite failure that this is doing fake tensor repropagation. Do the same playbook as AOTAutograd: PropagateUnbackedSymInts too! Actually, they have also implemented their own tracer, so do the same playbook as proxy_tensor: record unbacked_bindings on the newly traced nodes. UGH code duplication.
* **torch/_subclasses/fake_tensor.py**, **torch/_subclasses/fake_impls.py** (with call site updates at **torch/_functorch/_aot_autograd/traced_function_transforms.py** and **torch/fx/passes/fake_tensor_prop.py**) - What's this new epoch thing? I noticed that sometimes I would be retracing, call nonzero() on a fake tensor, and not allocate a new unbacked symbol. This is actually bad, because if I don't get a new unbacked symbol, I don't know there's a binding site, and `unbacked_bindings` is now missing a binding. The reason for this is memoization: if I reuse the exact same fake tensor on my retrace, it will already have an unbacked symint memoized on it and we will short circuit allocation. Well, that's no good. So I associate the memos with a fake tensor epoch, and every time you start a new fake tensor propagation from scratch, you bump the epoch so that I clear all the memos.
* **torch/_inductor/scheduler.py** - I noticed in unit tests that V.current_node is not always set when we call process_kernel. So I save it into the IR node and restore it when we are running `get_estimated_runtime`.
* **torch/fx/experimental/symbolic_shapes.py** - A few things
  * **rebind_unbacked** (re **_tensor_version**). Ordinarily, when you have an unbacked SymInt, you persistently have it all the way to the end of the program. `_tensor_version` violates this: it generates an unbacked SymInt (for reasons I don't quite understand?) and then gets rid of it later. This triggered an assert violation. I think this op is kind of misusing unbacked SymInt, but I didn't know how to refactor it, so it gets a special case.
  * **rebind_unbacked** (re **Simplify SymBool binding**). Ugh, SymBool, what a pain in the butt. I have an assert that you can only rebind an unbacked symbol to another unbacked symbol. This assert fails when a boolean is involved, because the result of running keypath on the result is not `u1`, it's `sympy.Piecewise(... sympy.Eq(u1, 1) ...)`. This is actually just `u1`, but Sympy doesn't know it because it doesn't know that the value range of `u1` is `[0, 1]`. So we manually implement the simplification needed to get the assert to pass.
  * **compute_unbacked_bindings** (re **This is pretty fragile**). There is a really funny disaster involving memoization and Inductor process_kernel. Ordinarily when I retrace, if there was a memo hit in the old trace, there will be a memo hit in the new trace. However, Inductor process_kernel breaks this, because it recreates fake tensor inputs to the operator call from scratch (since they might have different strides), and obviously these tensor inputs don't have the memo from the old one. I tried a little bit to manually transplant the memo to the new fake tensor but it seemed hopeless, so I just let the fresh symbol ride, allocating a new unbacked symbol. However, in one of our tests, we rely on knowing that the first nonzero call is equal to the second (memoized) nonzero call. The equality test looked pretty easy to discharge, so I just went ahead and added a deferred runtime assert to this effect and it worked.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: pytorch#124394
Approved by: https://github.com/jansel
ghstack dependencies: pytorch#124310, pytorch#124314, pytorch#124316
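The epoch-based memo invalidation described above for the fake tensor changes can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyFakeMode`, `ToyFakeTensor`, `start_propagation`, and `nonzero_size` are hypothetical names invented for illustration, not the actual PyTorch classes or methods, and symbols are plain strings rather than SymInts.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


class ToyFakeMode:
    """Hypothetical stand-in for a fake tensor mode that allocates unbacked symbols."""

    def __init__(self) -> None:
        self.epoch = 0
        self._ids = count()

    def fresh_symbol(self) -> str:
        # Allocate the next unbacked symbol name: u0, u1, ...
        return f"u{next(self._ids)}"

    def start_propagation(self) -> None:
        # Starting a propagation from scratch bumps the epoch, which
        # implicitly invalidates every memo recorded in an earlier epoch.
        self.epoch += 1


@dataclass
class ToyFakeTensor:
    mode: ToyFakeMode
    nonzero_memo: Optional[str] = None
    memo_epoch: Optional[int] = None

    def nonzero_size(self) -> str:
        # Reuse the memo only if it was recorded in the current epoch.
        if self.nonzero_memo is not None and self.memo_epoch == self.mode.epoch:
            return self.nonzero_memo
        # Otherwise allocate a fresh symbol so the retrace sees a binding site.
        self.nonzero_memo = self.mode.fresh_symbol()
        self.memo_epoch = self.mode.epoch
        return self.nonzero_memo


mode = ToyFakeMode()
t = ToyFakeTensor(mode)
first = t.nonzero_size()       # allocates a symbol
same_epoch = t.nonzero_size()  # memo hit within the same epoch
mode.start_propagation()       # new propagation: old memos are stale
after_bump = t.nonzero_size()  # allocates a new symbol
```

Within one epoch the memo still short circuits allocation, so repeated calls agree; only a fresh propagation forces a new symbol, which is what lets the retrace record a binding.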

This is a subset of changes extracted from pytorch#124683

This PR contains modifications to make Inductor work with unbacked symbol inputs, which can occur when a data-dependent sized tensor is saved for backwards. The problems to be fixed:

* When binding initial symbols, we unconditionally bind unbacked symbols (instead of computing if they are needed, which only looks at backed symbols)
* Benchmark generation code doesn't work with unbacked symints as we have no hints to actually feed in real values. So I pick a random number and you are expected to fix it if it doesn't work
* Need to make sure we don't install dependencies on unbacked SymInt inputs, that puts us down the "promptly deallocate the input" path, but that's pointless for unbacked SymInt

Fixes pytorch#124652

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: pytorch#124739
Approved by: https://github.com/jansel
ghstack dependencies: pytorch#124310, pytorch#124314, pytorch#124316, pytorch#124394

pytorch#124782)

Previously, unbacked SymInts would gradually get larger and larger as we kept rebinding them. Now, we do the replacement to preserve the old symbol.

Actually doing this is a bit tricky. Here’s the order things happen when retracing data dependent:

1. Run fake tensor prop: allocate new unbacked SymInt
2. Run proxy tensor mode, calculate bindings and associate them with FX node
3. Run PropagateUnbackedSymInts, rename unbacked bindings to their old ones so they are consistent

So the problem is when we calculate bindings in step (2), we don't know what the original names are yet, we only find out later at (3). But by the time (3) runs, we've already stuffed some new bindings in meta["unbacked_bindings"] and we don't know how to update them! To fix this, I introduce resolve_unbacked_bindings which post facto applies any of the renamings we discovered in (3).

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: pytorch#124782
Approved by: https://github.com/lezcano
ghstack dependencies: pytorch#124310, pytorch#124314, pytorch#124316, pytorch#124394, pytorch#124739
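The post facto renaming in the #124782 commit message above can be sketched as a dictionary rewrite. This is a simplified illustration, not the actual `resolve_unbacked_bindings` implementation: the function name `resolve_bindings_sketch`, the use of strings for symbols, and the tuple representation of key paths are all assumptions made for the example.

```python
from typing import Dict, Tuple

# Assumption for illustration: a key path is a tuple of accessor names.
KeyPath = Tuple[str, ...]


def resolve_bindings_sketch(
    bindings: Dict[str, KeyPath],
    renamings: Dict[str, str],
) -> Dict[str, KeyPath]:
    """Apply renamings (new symbol -> old symbol) to the keys of bindings.

    Symbols without a recorded renaming are kept unchanged.
    """
    return {renamings.get(sym, sym): path for sym, path in bindings.items()}


# Step (2): bindings recorded on the FX node use the freshly allocated names.
recorded = {"u5": ("size", "0"), "u6": ("item",)}

# Step (3): we later learn that fresh symbol u5 is really the old symbol u0.
renamings = {"u5": "u0"}

# Post facto resolution: read the recorded bindings through the renamings.
resolved = resolve_bindings_sketch(recorded, renamings)
```

Resolving at read time means the bindings stored in step (2) never need to be mutated; any consumer that looks them up after step (3) sees the old, stable names.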

…ytorch#124785)

This ensures we have correct SymInts when we allocate tangents.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: pytorch#124785
Approved by: https://github.com/lezcano
ghstack dependencies: pytorch#124310, pytorch#124314, pytorch#124316, pytorch#124394, pytorch#124739, pytorch#124782

…ass (pytorch#123735)

This ensures that first argument to record_shapeenv_event is a ShapeEnv so we can appropriately short circuit when recording is not in progress.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: pytorch#123735
Approved by: https://github.com/ysiraichi, https://github.com/zou3519, https://github.com/albanD
ghstack dependencies: pytorch#124310, pytorch#124314, pytorch#124316, pytorch#124394, pytorch#124739, pytorch#124782, pytorch#124785
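The short-circuit idea in the #123735 commit message above can be sketched with a toy decorator. Everything here is hypothetical and for illustration only: `ToyShapeEnv`, its `recording` flag, and `record_event_sketch` are invented names, and the real `record_shapeenv_event` may differ in both signature and behavior.

```python
import functools
from typing import Any, Callable, List


class ToyShapeEnv:
    """Hypothetical stand-in for a shape environment with optional event recording."""

    def __init__(self, recording: bool) -> None:
        self.recording = recording
        self.events: List[str] = []


def record_event_sketch(fn: Callable[..., Any]) -> Callable[..., Any]:
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        # Requiring the environment as the first argument means we can always
        # find the recording flag without inspecting the other arguments.
        assert args and isinstance(args[0], ToyShapeEnv), (
            "first argument must be a ToyShapeEnv"
        )
        env = args[0]
        if env.recording:
            env.events.append(fn.__name__)
        # When recording is off we fall straight through to the function.
        return fn(*args, **kwargs)

    return wrapper


@record_event_sketch
def create_symbol(env: ToyShapeEnv, name: str) -> str:
    return f"sym_{name}"


env_on = ToyShapeEnv(recording=True)
env_off = ToyShapeEnv(recording=False)
result_on = create_symbol(env_on, "a")
result_off = create_symbol(env_off, "b")
```

With the environment guaranteed to be in a fixed position, the wrapper can decide whether to record with a single attribute check.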

Update

68284dc

[ghstack-poisoned]

This was referenced Apr 18, 2024

Factor var_to_range assignments to _update_var_to_range helper #124283

Closed

Assert that TracingContext is available when set_example_value is called #124284

Closed

Add pending_fresh_unbacked_symbols, populate unbacked_bindings for Dynamo #124290

Closed

ezyang mentioned this pull request Apr 18, 2024

Handle Tensor returns in PropagateUnbackedSymInts #124297

Closed

pytorch-bot bot added ci-td-distributed ciflow/inductor module: inductor labels Apr 18, 2024

This was referenced Apr 18, 2024

FakeTensorProp works with unbacked bindings #124310

Closed

Rebind and refresh unbacked bindings in FakeTensorUpdater #124314

Closed

Ban replacements with unbacked SymInt on both sides #124316

Closed

github-actions bot requested review from albanD, antoniojkim, bdhirsh, miladm and SherlockNoMad April 18, 2024 14:23

albanD removed their request for review April 18, 2024 15:27

Update

3cd2c81

[ghstack-poisoned]

pytorch-bot bot added module: dynamo release notes: fx release notes category labels Apr 18, 2024

Update

af77534

[ghstack-poisoned]

Update

a02facb

[ghstack-poisoned]

Update

27d8ddc

[ghstack-poisoned]

ezyang requested review from avikchaudhuri and gmagogsfm as code owners April 19, 2024 13:52

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Apr 24, 2024

pytorchmergebot added the merging label Apr 24, 2024

pytorchmergebot removed the merging label Apr 24, 2024

Update

3b4cfa8

[ghstack-poisoned]

pytorchmergebot added the merging label Apr 25, 2024

pytorchmergebot added the Merged label Apr 25, 2024

pytorchmergebot closed this in 13ab24f Apr 25, 2024

pytorchmergebot removed the merging label Apr 25, 2024

ezyang commented Apr 18, 2024 • edited

pytorch-bot bot commented Apr 18, 2024 • edited

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124394

✅ You can merge normally! (1 Unrelated Failure)

ezyang commented Apr 24, 2024

pytorchmergebot commented Apr 24, 2024

Merge started

pytorchmergebot commented Apr 24, 2024

Merge failed

ezyang commented Apr 25, 2024

pytorchmergebot commented Apr 25, 2024

Merge started
