torch.compile should auto-functionalize certain mutable ops #114955
Conversation
Users may wish to torch.compile custom ops that mutate their inputs and return nothing (this is a common class of operators). torch.compile will automatically support this op without anyone needing to provide a functionalization kernel for it. Here's how.

Let's say we have a hypothetical mylib::sin_(Tensor(a!) x) -> () op. First, when FakeTensor sees this op, it can just return None. This is the case because custom ops are not allowed to mutate input metadata, so the FakeTensor rule for one that returns nothing is trivial.

Next, when Python FunctionalTensor sees the op, it will functionalize it by emitting a call to an auto_functionalize(op, ["x"], {"x": ...}) HOP and replacing the mutated inputs with the outputs of this HOP. This HOP effectively runs the functional version of the op when called: it clones inputs that will be mutated, runs the op, and then returns Tensors with the new values.

In the future we can teach Inductor how to do re-inplacing when it sees this HOP (like how triton kernels do it), but this isn't urgent (and is more of a performance problem).

Test Plan:
- new tests
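For concreteness, here is a minimal sketch of the scenario described above, using the hypothetical mylib::sin_ op from the summary. The registration calls mirror the test code discussed in the review below; the exact names are illustrative only.

```python
import torch

# Hypothetical op from the PR summary: mutates its input in place, returns nothing.
lib = torch.library.Library("mylib", "FRAGMENT")
lib.define("sin_(Tensor(a!) x) -> ()")

@torch.library.impl("mylib::sin_", "cpu", lib=lib)
@torch._dynamo.disable  # see the review discussion below for why this is needed
def sin_impl(x):
    x.sin_()  # in-place mutation, no return value

def f(x):
    torch.ops.mylib.sin_(x)
    return x + 1

x = torch.randn(3)
expected = x.sin() + 1
# No FakeTensor or functionalization kernel was written for mylib::sin_;
# torch.compile handles it via the auto_functionalize HOP described above.
out = torch.compile(f, backend="aot_eager")(x)
torch.testing.assert_close(out, expected)
```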
@torch.library.impl("mylib::foo", "cpu", lib=lib)
@torch._dynamo.disable
curiosity: why do we need the dynamo.disable here? (Dynamo never ends up seeing foo_impl anyway, right?)
Dynamo interposes on all frames. The PyTorch dispatcher invoking this function is a frame, so Dynamo ends up trying to compile it. Dynamo errors out loudly here: it tries to construct FakeTensors, but the Python key has been disabled at this point. The workaround is to disable Dynamo here; a longer-term solution is to get Dynamo to reset the dispatcher state temporarily.
> The PyTorch dispatcher invoking this function is a frame, so Dynamo ends up trying to compile it. Dynamo errors out loudly here: it tries to construct FakeTensors, but the Python key has been disabled at this point.

Hmm, I'm sure this is a noob question, but: this is a Python function getting invoked by C++ code (the PyTorch dispatcher). I had always just assumed that any Python getting called by C++ code was "invisible" to Dynamo. If Dynamo will indeed go ahead and try to compile this Python frame (and fakeify the inputs) even though it was called from C++, then why doesn't e.g. every tensor subclass's __torch_dispatch__ also need to be wrapped in a @dynamo.disable?
> This is a Python function getting invoked by C++ code (the PyTorch dispatcher). I had always just assumed that any Python getting called by C++ code was "invisible" to Dynamo.

My understanding is that Python functions called by C++ code still get seen by Dynamo.

> If Dynamo will indeed go ahead and try to compile this Python frame (and fakeify the inputs) even though it was called from C++, then why doesn't e.g. every tensor subclass's __torch_dispatch__ also need to be wrapped in a @dynamo.disable?

We execute the __torch_dispatch__ as part of AOTAutograd, right? And Dynamo is turned off during AOTAutograd.
test/dynamo/test_misc.py
orig_args = (x, y, z, n)
compiled_args = pytree.tree_map_only(torch.Tensor, torch.clone, orig_args)
torch.compile(f, backend="aot_eager")(*compiled_args)
Just a thought: aot_eager_decomp_partition might be a better test backend here: the aot_eager backend actually does not try to keep input mutations in the graph (so the other backend will be a little closer to the "default" torch.compile behavior). We could maybe update aot_eager to keep input mutations in the graph, but I thought it would be a nice way to debug compiler issues specific to input mutations if we keep aot_eager as is.
I didn't know about this backend, let me try it out
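For reference, a small sketch of the two backends being discussed, using a toy function with an input mutation (the behavioral notes in the comments simply restate the reviewer's point above):

```python
import torch

# Toy function that mutates its input.
def f(x):
    x.add_(1)
    return x * 2

# aot_eager: per the comment above, it does not keep input mutations in the graph.
torch.compile(f, backend="aot_eager")(torch.zeros(3))

# aot_eager_decomp_partition: the suggested backend, closer to the default
# torch.compile handling of input mutations.
torch.compile(f, backend="aot_eager_decomp_partition")(torch.zeros(3))
```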
auto_functionalized = AutoFunctionalized()


def can_auto_functionalize(op: torch._ops.OperatorBase) -> bool:
Do you think it's worth returning False for ops in the ATen namespace? Just so we don't accidentally auto-functionalize them (even though I think, in the way you use this in fake_tensor.py today, that will never happen, since you only try to run this during the fallback). My thought was that this would be silently wrong for metadata-mutation ops - and even once you add op_tests to error on metadata mutation, it's still valid to have aten ops that mutate metadata.
Sure, we could
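A rough sketch of what the check plus the suggested ATen-namespace guard could look like (an illustration only, not the PR's actual implementation; the helper name is made up):

```python
import torch

def can_auto_functionalize_sketch(op) -> bool:
    # Only consider resolved OpOverloads.
    if not isinstance(op, torch._ops.OpOverload):
        return False
    # Reviewer's suggestion: never auto-functionalize ATen ops, since some of
    # them legitimately mutate metadata and would be handled silently wrong.
    if op.namespace == "aten":
        return False
    schema = op._schema
    # This PR only handles ops that return nothing...
    if len(schema.returns) != 0:
        return False
    # ...and that mutate at least one of their inputs.
    return any(
        arg.alias_info is not None and arg.alias_info.is_write
        for arg in schema.arguments
    )
```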
) -> Tuple[Tensor, ...]:
    new_kwargs = dict(**kwargs)
    result = []
    for name in mutated_args_names:
Do you think it's worth a warn_once() here as a perf warning? (counterargument == too noisy, and it's not very actionable)
Probably not for now, as you mentioned that it is not very actionable. Maybe we can do this in the future when the re-inplacing pass exists (and we can log instead of warn so someone staring at the log might be able to surface this to us)
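For context, a minimal sketch of the clone-run-return behavior under review, reconstructed from the snippet above and the PR summary (the real HOP implementation may differ in detail):

```python
from typing import Any, Dict, List, Tuple

import torch
from torch import Tensor

def auto_functionalized_sketch(
    op: torch._ops.OpOverload,
    mutated_args_names: List[str],
    kwargs: Dict[str, Any],
) -> Tuple[Tensor, ...]:
    # Clone every input the op would mutate, so the originals stay intact.
    new_kwargs = dict(**kwargs)
    result = []
    for name in mutated_args_names:
        new_kwargs[name] = kwargs[name].clone()
        result.append(new_kwargs[name])
    # Run the mutating op on the clones; for this PR the op returns nothing.
    op(**new_kwargs)
    # Return the new values of the would-be-mutated inputs.
    return tuple(result)
```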
left some comments but lgtm
How does this work if there are aliased inputs to an op? In that case, won't cloning before a mutation potentially be incorrect?
@eellison that's a good test case, I'll add one. We expect this to just work automatically (https://gist.github.com/zou3519/373634e2be660026252969882fa7fe84). The TL;DR is that when we replace the original input with the new output during functionalization, if the input to be replaced is a view, we do some heavy lifting to ensure that each view gets updated.
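A sketch of what such an aliasing test might look like, mutating through a view of the input (the mylib2::sin_ op here is hypothetical, set up the same way as the sketch near the top of this thread; the expected behavior follows the linked gist):

```python
import torch

# Hypothetical mutating op, registered only for this sketch.
lib = torch.library.Library("mylib2", "FRAGMENT")
lib.define("sin_(Tensor(a!) x) -> ()")

@torch.library.impl("mylib2::sin_", "cpu", lib=lib)
@torch._dynamo.disable
def sin_impl(x):
    x.sin_()

def f(x):
    view = x[0]                  # an alias of the input
    torch.ops.mylib2.sin_(view)  # mutate through the view
    return x.clone()

x = torch.randn(2, 3)
expected = x.clone()
expected[0] = expected[0].sin()
torch.compile(f, backend="aot_eager")(x)
# Per the gist above, functionalization propagates the update on the view back
# to its base, so the caller-visible x should now match `expected`.
torch.testing.assert_close(x, expected)
```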
@pytorchbot merge
This should deflake some of the tests

Pull Request resolved: #114956
Approved by: https://github.com/williamwen42
ghstack dependencies: #114955
In preparation for the next PR up in the stack, which is going to update "can_auto_functionalize" to support more operators than just ones that return nothing. We are unable to auto-generate FakeTensor kernels for operators that return something, but we are able to generate functionalization kernels for them.

Test Plan: Existing tests

Pull Request resolved: #115134
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114955, #114956
We can auto-functionalize operators that mutate their inputs as long as the outputs of the operator do not alias their inputs. The user needs to provide an abstract impl for the operator if it has non-trivial returns.
- We update can_auto_functionalize(op) to include ops that return (but do not alias) Tensors
- We update auto_functionalized(op, mutated_args_names, kwargs) to return (out, mutated_args), where `out = op(**kwargs)` and `mutated_args` are the new values of the inputs that would have been mutated.

Test Plan:
- new test

Pull Request resolved: #115135
Approved by: https://github.com/bdhirsh
ghstack dependencies: #114955, #114956, #115134
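A rough sketch of the extended calling convention that follow-up describes (assumed from the description above; the actual HOP signature may differ):

```python
from typing import Any, Dict, List, Tuple

import torch

def auto_functionalized_with_outputs_sketch(
    op: torch._ops.OpOverload,
    mutated_args_names: List[str],
    kwargs: Dict[str, Any],
) -> Tuple[Any, Tuple[torch.Tensor, ...]]:
    # Same clone-run-return idea as before, but the op may now also return
    # (non-aliasing) outputs, which are passed through as `out`.
    new_kwargs = dict(**kwargs)
    mutated_args = []
    for name in mutated_args_names:
        new_kwargs[name] = kwargs[name].clone()
        mutated_args.append(new_kwargs[name])
    out = op(**new_kwargs)
    return out, tuple(mutated_args)
```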