Update partitioner's is_fusible heuristic to respect auto_functionalized #134490
Conversation
We say Node a is fusible into node b if node b is an auto_functionalized node that may reinplace node a later on.

Fixes #134468

Test Plan:
- new test

[ghstack-poisoned]
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/134490. Note: links to docs will display an error until the docs builds have been completed.

✅ No failures as of commit f071437 with merge base 2553278.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
…functionalized"

We say Node a is fusible into node b if node b is an auto_functionalized node that may reinplace node a later on.

This PR also changes aten.empty to be recomputable w.r.t. the Partitioner (it is, like aten.zeros, cheap to recompute and fusible into other ops).

Fixes #134468

Test Plan:
- new test

[ghstack-poisoned]
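The heuristic described above can be sketched roughly as follows. This is a hypothetical, simplified version: the `Node` class, the `is_fusible` name, and the way mutated-argument names are recorded are all illustrative, not the actual partitioner implementation.

```python
# Hypothetical sketch of the fusibility heuristic described above.
# The simplified Node model and helper names are illustrative only.

class Node:
    def __init__(self, target, kwargs=None):
        self.target = target
        self.kwargs = kwargs or {}

AUTO_FUNCTIONALIZED = "higher_order.auto_functionalized"

def mutated_arg_names(node):
    # In the real HOP, the mutated argument names come from the schema of
    # the wrapped op; here we assume they are recorded on the node.
    return node.kwargs.get("_mutated_args_names", [])

def is_fusible(a, b):
    # Node `a` is fusible into node `b` if `b` is an auto_functionalized
    # node that may later reinplace `a`, i.e. `a` feeds one of `b`'s
    # mutated arguments (directly or via a TensorList).
    if b.target != AUTO_FUNCTIONALIZED:
        return False
    for name in mutated_arg_names(b):
        arg = b.kwargs.get(name)
        if arg is a:
            return True
        if isinstance(arg, list) and any(x is a for x in arg):
            return True
    return False
```

The key design point is that the check only needs to look at the mutated arguments: an auto_functionalized node is itself functional, but reinplacing may later turn it back into an in-place write into `a`'s buffer, which is what makes fusing `a` into it profitable.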
torch/_inductor/pattern_matcher.py
Outdated

```py
if node.target is torch.ops.higher_order.auto_functionalized:
    return False
```
auto_functionalized nodes can have `out=blah` arguments. Those are not mutable (the auto_functionalized node is always functional), so we fix that here.
This is unrelated?
Fair enough, I'll split this out into its own PR
```py
arg = b.kwargs[name]
if a is arg:
    return True
if isinstance(arg, list):
```
should this be a tree_map or something?
We only support TensorList args as inputs to operators, so the list is fine
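The point being made here can be illustrated with a minimal sketch: because operator inputs are at most a Tensor or a flat TensorList, never an arbitrarily nested pytree, a flat `isinstance(arg, list)` check is sufficient and a `tree_map` would be overkill. The helper name below is hypothetical.

```python
# Hypothetical helper illustrating why a flat list check suffices:
# operator inputs are either a single value or a flat TensorList,
# so no recursive pytree traversal is needed.

def arg_contains(arg, node):
    # Direct match: the kwarg value is the node itself.
    if arg is node:
        return True
    # TensorList match: the node appears in a flat list/tuple argument.
    if isinstance(arg, (list, tuple)):
        return any(x is node for x in arg)
    # Deeper nesting does not occur for operator inputs.
    return False
```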
```py
        subtest(torch.empty_like, name="empty_like"),
    ],
)
def test_partitioner_recomputes_factory(self, factory_op):
```
I personally like test_perf tests, since they're "property-based" (i.e. they measure the amount of bytes read and written). It's possible we don't support HOPs yet or something though.
I'll add some test_perf tests too (if possible)
```py
aten.full,
aten.as_strided,
aten.zeros,
aten.empty,
```
I think plausibly aten.empty should be treated even more specially than this, since aten.empty is always free to recompute, regardless of whether or not it's fusible into a downstream op.
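The suggestion above could be sketched as follows. This is a toy illustration under stated assumptions: the function and set names are hypothetical, and the real recomputability logic lives in the min-cut partitioner, not in a standalone predicate like this.

```python
# Hypothetical sketch of the suggestion: most cheap factory ops are only
# worth recomputing when they fuse into a downstream consumer, but
# aten.empty's output is uninitialized, so recomputing it is just an
# allocation and is free regardless of fusibility.

CHEAP_FACTORY_OPS = {"aten.full", "aten.as_strided", "aten.zeros", "aten.empty"}
ALWAYS_FREE_TO_RECOMPUTE = {"aten.empty"}

def is_recomputable(op_name, fusible_into_consumer):
    if op_name in ALWAYS_FREE_TO_RECOMPUTE:
        # No values to recompute; re-running is only an allocation.
        return True
    return op_name in CHEAP_FACTORY_OPS and fusible_into_consumer
```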
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
aten.empty is almost always fusible into its consumer, so we never CSE it. This fixes a bug that looks like the following:

```py
@torch.library.custom_op("_reinplacing::sin_cos", mutates_args={"out_sin", "out_cos"})
def sin_cos(x: torch.Tensor, out_sin: torch.Tensor, out_cos: torch.Tensor) -> None:
    out_sin.copy_(x.sin())
    out_cos.copy_(x.cos())

@torch.compile
def f(x):
    out0 = torch.empty_like(x)
    out1 = torch.empty_like(x)
    sin_cos(x, out0, out1)
    return x.clone(), out0, out1

x = torch.randn(3, requires_grad=True)
f(x)
```

- CSE would de-duplicate the empty nodes
- reinplacing would add an additional clone (because it can't write to both tensors at the same time)
- the clone lowers into a new buffer + a copy_ kernel
- the copy_ kernel is unnecessary because "empty" is special: all reinplacing needed was an additional buffer, it doesn't matter what the values are

We could attempt to fix this on the reinplacing side, but this seemed better as a partitioner heuristic, and the reinplacing fix is a bit more tricky (we'd need to identify that the op never reads from the empty node).

Test Plan:
- new test (the old number was 27, the new number is 21, so this PR helped)

Pull Request resolved: #134703
Approved by: https://github.com/yf225
ghstack dependencies: #134466, #134490, #134491
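The CSE exemption described above can be illustrated with a toy pass. This is a simplified, hypothetical model (Inductor's actual CSE operates on its own IR, not on `(op, args)` tuples): identical pure calls are de-duplicated, but `aten.empty` is exempt because its uninitialized outputs must stay distinct for reinplacing to use them as separate scratch buffers.

```python
# Toy CSE pass over a hypothetical flat IR of (op_name, args) tuples.
# Ops in CSE_EXEMPT are never de-duplicated, mirroring the "never CSE
# aten.empty" heuristic described above.

CSE_EXEMPT = {"aten.empty", "aten.empty_like"}

def cse(nodes):
    # Returns (deduplicated nodes, mapping from old index -> new index).
    seen = {}
    out, remap = [], {}
    for i, (op, args) in enumerate(nodes):
        key = (op, args)
        if op not in CSE_EXEMPT and key in seen:
            # Pure duplicate: reuse the earlier node.
            remap[i] = seen[key]
            continue
        seen[key] = len(out)
        remap[i] = len(out)
        out.append((op, args))
    return out, remap
```

Running this on two identical `aten.empty` calls keeps both nodes, while two identical `aten.add` calls collapse into one, which is exactly the behavior the bug above required.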
…zed (pytorch#134490)

We say Node a is fusible into node b if node b is an auto_functionalized node that may reinplace node a later on. This PR also changes aten.empty to be recomputable w.r.t. the Partitioner (it is, like aten.zeros, cheap to recompute and fusible into other ops).

Fixes pytorch#134468

Test Plan:
- new test

Pull Request resolved: pytorch#134490
Approved by: https://github.com/Chillee
ghstack dependencies: pytorch#134364, pytorch#134466

…ytorch#134491)

Mutated arguments to triton kernels are fusible into the triton kernel.

Test Plan:
- new test

Pull Request resolved: pytorch#134491
Approved by: https://github.com/Chillee
ghstack dependencies: pytorch#134364, pytorch#134466, pytorch#134490

ROCm doesn't trigger the layout optimization that makes the test case valid, so we're going to skip the checks. Should fix the following (I'll close them later):
- pytorch#134481
- pytorch#134519

Pull Request resolved: pytorch#134690
Approved by: https://github.com/FindHao
ghstack dependencies: pytorch#134466, pytorch#134490, pytorch#134491

Fixes pytorch#134119

From user feedback, it's difficult to understand what the tests do. We clarify the docs more.

Pull Request resolved: pytorch#134692
Approved by: https://github.com/albanD
ghstack dependencies: pytorch#134466, pytorch#134490, pytorch#134491, pytorch#134690

Fixes pytorch#134278

Test Plan:
- tested locally

Pull Request resolved: pytorch#134688
Approved by: https://github.com/yushangdi
ghstack dependencies: pytorch#134466, pytorch#134490, pytorch#134491, pytorch#134690, pytorch#134692
Stack from ghstack (oldest at bottom):
We say Node a is fusible into node b if node b is an auto_functionalized
node that may reinplace node a later on.
This PR also changes aten.empty to be recomputable w.r.t the Partitioner
(it is, like aten.zeros, cheap to recompute and fusible into other ops).
Fixes #134468
Test Plan:
- new test