[AOTDispatch] Return mutated inputs directly when keeping mutations #120514
Conversation
Fixes #120242

The example from the issue now results in the graph

```python
def forward(self, arg0_1, arg1_1):
    sin = torch.ops.aten.sin.default(arg0_1); arg0_1 = None
    copy_ = torch.ops.aten.copy_.default(arg1_1, sin); arg1_1 = sin = None
    return (copy_,)
```

and the corresponding inductor kernel eliminates the intermediate buffer completely

```python
def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    assert_size_stride(arg0_1, (5, ), (1, ))
    assert_size_stride(arg1_1, (5, ), (1, ))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        # Source Nodes: [sin], Original ATen: [aten.sin]
        stream0 = get_raw_stream(0)
        triton_poi_fused_sin_0.run(arg0_1, arg1_1, 5, grid=grid(5), stream=stream0)
        del arg0_1
    return (arg1_1, )
```
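For orientation, here is a small sketch of the kind of program this change targets (an approximation in the spirit of #120242, not necessarily the exact snippet from the issue; the function and tensor names are made up):

```python
# Hypothetical repro sketch: `out` is a graph input that is mutated in place,
# so with input mutations kept in the graph the compiled function can return
# the mutated input directly instead of materializing an intermediate buffer.
import torch

@torch.compile
def f(x, out):
    out.copy_(torch.sin(x))
    return out

x = torch.randn(5)
out = torch.empty_like(x)
f(x, out)
# On CUDA this should lower to a single fused kernel along the lines of the
# one above, writing sin(x) straight into `out`.
```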
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/120514
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 39c69f3 with merge base 953c6c3.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```python
# necessary.
if get_node_storage(node) in output_storages and (
    get_node_storage(src) in input_storages
    or get_node_storage(src) in output_storages
```
This previous logic banned all no-op eliminations where the `src` and `node` storages are inputs or outputs, but this is only problematic if the storages weren't expected to alias. In the failing test I saw, we had `node = aten.slice(argn, ...)` where `argn` was both an input and an output of the graph because of this change. The slice op itself was not returned, so eliminating the view is not an issue.

This also generalizes further: we might have a view of a view where the second view is returned, but it is still safe to eliminate the first view op because that tensor is not returned directly.
Just to check, `node_storage != src_storage` is equivalent to the op not being a view, right? If so, could you either leave a comment or use a variable `node_is_view = node_storage == src_storage`?
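For reference, a rough sketch of what the relaxed check could look like with that suggested variable (names are taken from the hunk above; this is an illustration, not the exact code in the PR):

```python
# Illustrative only (not the exact remove-noop pass): skip elimination only
# when removing the op could change which storages the outputs alias.
for node, src in noop_candidates:  # hypothetical iteration over (op, source)
    node_storage = get_node_storage(node)
    src_storage = get_node_storage(src)
    node_is_view = node_storage == src_storage  # same storage <=> pure view
    if (
        not node_is_view
        and node_storage in output_storages
        and (src_storage in input_storages or src_storage in output_storages)
    ):
        continue  # a non-view whose removal would change output aliasing
    node.replace_all_uses_with(src)  # hypothetical elimination step
```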
```python
if copy_node == user:
# Ignore uses after the copy_ epilogue node, where the input
# has already been mutated anyway
if copy_node_loc is not None and copy_node_loc <= user_loc:
```
This change is required because the output node of the graph counts as a user, so it was preventing reinplacing on mutated inputs.
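Schematically, the user scan now behaves something like the sketch below (a simplified illustration using the names from the hunk; not the actual reinplacing pass):

```python
# Simplified illustration: a use of the mutated argument only blocks
# reinplacing if it runs *before* the copy_ epilogue; the graph's output node
# (and anything else at or after copy_) already sees the mutated value.
def blocks_reinplacing(mutated_arg, mutating_node, copy_node, node_loc):
    copy_node_loc = node_loc.get(copy_node) if copy_node is not None else None
    for user in mutated_arg.users:
        if user is mutating_node:
            continue  # the op performing the mutation itself
        user_loc = node_loc[user]
        if copy_node_loc is not None and copy_node_loc <= user_loc:
            continue  # at/after the copy_ epilogue, input already mutated
        return True  # an earlier reader still needs the old value
    return False
```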
```python
copy_node = copy_args_to_copy_nodes.get((mutated_arg, node))
if copy_node is not None:
    graph.erase_node(copy_node)
    replace_dict[copy_node] = copy_node.args[0]
```
This change is needed because when we have an in-place mutation on a tensor `x`, `make_fx` replaces all future references to `x` with the in-place node. This means the `aten.copy_` op now has a user that needs to be updated.
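A tiny runnable illustration of that `make_fx` behavior (the function `f` below is made up for demonstration; it only assumes a standard PyTorch install):

```python
# After an in-place op, make_fx traces later uses of the tensor against the
# in-place node's output, so the copy_ node acquires users that must be
# redirected (e.g. via replace_dict) rather than simply erased.
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(x, y):
    y.copy_(x.sin())  # in-place mutation of the input y
    return y * 2      # traced as a use of the copy_ node, not of y's placeholder

gm = make_fx(f)(torch.randn(5), torch.randn(5))
print(gm.graph)
```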
ping @bdhirsh
@bdhirsh is on vacation for two weeks.
Let's just wait for @bdhirsh to be back then, as this issue is not blocking anything.
reinplace changes look good to me
@pytorchbot merge -r
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Rebase failed due to a failed command.
Raised by https://github.com/pytorch/pytorch/actions/runs/8203733237
…mutations" Fixes #120242 The example from the issue now results in the graph ```python def forward(self, arg0_1, arg1_1): sin = torch.ops.aten.sin.default(arg0_1); arg0_1 = None copy_ = torch.ops.aten.copy_.default(arg1_1, sin); arg1_1 = sin = None return (copy_,) ``` and the corresponding inductor kernel eliminates the intermediate buffer completely ```python def call(args): arg0_1, arg1_1 = args args.clear() assert_size_stride(arg0_1, (5, ), (1, )) assert_size_stride(arg1_1, (5, ), (1, )) with torch.cuda._DeviceGuard(0): torch.cuda.set_device(0) # Source Nodes: [sin], Original ATen: [aten.sin] stream0 = get_raw_stream(0) triton_poi_fused_sin_0.run(arg0_1, arg1_1, 5, grid=grid(5), stream=stream0) del arg0_1 return (arg1_1, ) ``` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
Fixes #120242 The example from the issue now results in the graph ```python def forward(self, arg0_1, arg1_1): sin = torch.ops.aten.sin.default(arg0_1); arg0_1 = None copy_ = torch.ops.aten.copy_.default(arg1_1, sin); arg1_1 = sin = None return (copy_,) ``` and the corresponding inductor kernel eliminates the intermediate buffer completely ```python def call(args): arg0_1, arg1_1 = args args.clear() assert_size_stride(arg0_1, (5, ), (1, )) assert_size_stride(arg1_1, (5, ), (1, )) with torch.cuda._DeviceGuard(0): torch.cuda.set_device(0) # Source Nodes: [sin], Original ATen: [aten.sin] stream0 = get_raw_stream(0) triton_poi_fused_sin_0.run(arg0_1, arg1_1, 5, grid=grid(5), stream=stream0) del arg0_1 return (arg1_1, ) ``` ghstack-source-id: eee04a3 Pull Request resolved: #120514
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
Fixes #120242
The example from the issue now results in the graph shown above, and the corresponding inductor kernel eliminates the intermediate buffer completely.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang