
Conversation

@BoyuanFeng (Contributor) commented Sep 18, 2025

For a custom op with multiple outputs, we will see the following generated code:

```
buf1 = op1(arg0)
buf3 = buf1[0]
buf4 = buf1[1]
del buf1  # <--- if buf1 is not accessed in the future
```

If `buf1` is not accessed later, it is good to deallocate it early rather than delaying the `del` until both `buf3` and `buf4` are dead. Note that `buf3` and `buf4` hold their own references to the underlying data, so `del buf1` does not prevent their use.
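A minimal Python sketch of why the early `del` is safe, with plain tensors standing in for Inductor buffers:

```python
import torch

def op1(x):
    x = x + 1
    return (x + 1, x + 2)

buf1 = op1(torch.randn(4))  # a tuple of two tensors
buf3 = buf1[0]              # buf3 and buf4 each hold their own reference
buf4 = buf1[1]
del buf1                    # drops only the tuple's reference to the pair
print(buf3 + buf4)          # still valid: the tensors outlive the tuple
```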

However, when there are mutating args, we don't see del buf1 immediately.

```python
@torch.library.custom_op(
    "mylib::op1",
    mutates_args=["x"],
    schema="(Tensor(a!)?  x) -> (Tensor, Tensor)",
    device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
    x = x + 1
    return (x + 1, x + 2)
```

[screenshot: generated code comparison](https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d)
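A hypothetical repro (the function name and shapes are illustrative, and a CUDA device is assumed), compiled so that Inductor emits wrapper code like the above:

```python
import torch

@torch.compile
def f(x):
    a, b = torch.ops.mylib.op1(x)  # custom op registered above
    return a + b

# Run with TORCH_LOGS="output_code" to inspect the generated wrapper code.
f(torch.randn(8, device="cuda"))
```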

Why? Because `buf3` is a MultiOutput with `buf1` as input, and it believes `buf1` (an output of the FallbackKernel for op1) has inputs that alias its output:

pytorch/torch/_inductor/ir.py, lines 7976 to 7982 at 72fedf0:

```python
def get_inputs_that_alias_output(self) -> Sequence[str]:
    return [
        inp.get_name()
        for inp in self.inputs
        if isinstance(inp, FallbackKernel)
        and len(inp.get_inputs_that_alias_output()) > 0
    ]
```

According to [NOTE: FallbackKernel supported operators], a mutating op that is auto-functionalizable must NOT alias any of its inputs in its outputs, so buf1 should report no such aliases. This PR improves `get_inputs_that_alias_output` of FallbackKernel accordingly.
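A minimal sketch of the improved check, assuming the four conditions quoted in the review thread below; the actual diff may structure this differently:

```python
# Sketch only: a mutable op that can be auto-functionalized cannot
# alias its inputs in its outputs, so it contributes no aliasing inputs.
def get_inputs_that_alias_output(self) -> Sequence[str]:
    if (
        not isinstance(self.op_overload, torch._ops.HigherOrderOperator)
        and "_c10d_functional" not in self.op_overload.name()
        and self.op_overload._schema.is_mutable
        and can_auto_functionalize(self.op_overload)
    ):
        return []
    return self.alias_names
```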

Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064)

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @zou3519

@BoyuanFeng added the ciflow/trunk, topic: not user facing, and ci-no-td labels on Sep 18, 2025
@pytorch-bot bot commented Sep 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163227

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit f329f14 with merge base 0f46274:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Comment on the diff context:

```python
)

# Check that we are allocating the minimum number of intermediate buffers
# Check that we are not allocate intermediate buffers
```
@BoyuanFeng (Contributor, Author) commented Sep 18, 2025:

The intermediate buffer buf3 is reused for buf8, so buf8 is never allocated; see the code comparison (left: after this PR).

@eellison (Contributor) left a comment:

I understand that we want to del multi output as soon as possible, but it's not clear to me how that relates to the changes you've made here.

Comment on lines +7564 to +7567:

```python
not isinstance(self.op_overload, torch._ops.HigherOrderOperator)
and "_c10d_functional" not in self.op_overload.name()
and self.op_overload._schema.is_mutable
and can_auto_functionalize(self.op_overload)
```
@eellison (Contributor) commented Sep 18, 2025:

Can you say more about this? Can you also add tests for collective operators?

@BoyuanFeng (Contributor, Author) replied:

For the 1st and 2nd conditions:

```python
not isinstance(self.op_overload, torch._ops.HigherOrderOperator)
and "_c10d_functional" not in self.op_overload.name()
```
They are added because FallbackKernel simply early-returns, before recording any aliasing, when creating an ir.FallbackKernel for these cases:

pytorch/torch/_inductor/ir.py, lines 7395 to 7406 at f329f14:

```python
if isinstance(self.op_overload, torch._ops.HigherOrderOperator):
    # We assume here that HOPs with FallbackKernel are functional.
    # This may not always be true! HOPs must individually opt-in to
    # FallbackKernel, so please check this if you opt-in.
    return
if "_c10d_functional" in self.op_overload.name():
    # _c10d_functional kernels are lowered into _CollectiveKernel which
    # derives from FallbackKernel for the cpp codegen. The kernels
    # don't pass the can_auto_functionalize check, but their mutation
    # is handled properly by _CollectiveKernel.
    return
```

For the 3rd and 4th conditions:

```python
and self.op_overload._schema.is_mutable
and can_auto_functionalize(self.op_overload)
```

If these conditions hold, the op is a mutating op that is auto-functionalizable. [Note: FallbackKernel supported operators] says its outputs do not alias any of its inputs, so get_inputs_that_alias_output should return an empty list.
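A quick, hypothetical check of these two conditions on the example op (assumes the mylib::op1 registration from the PR description; treat the import path as an assumption):

```python
import torch
from torch._higher_order_ops.auto_functionalize import can_auto_functionalize

op = torch.ops.mylib.op1.default
print(op._schema.is_mutable)       # True: the schema marks x as Tensor(a!)
print(can_auto_functionalize(op))  # True: the mutation can be auto-functionalized
```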

f"{type(self.op_overload)} not supported"
)

# See [Note: FallbackKernel supported operators]: for a mutating
@eellison (Contributor) commented:

Shouldn't we just update this so that we do not create incorrect alias_names in the first place?

@BoyuanFeng (Contributor, Author) replied Sep 19, 2025:

self.alias_names lists the arguments that are aliased; that part seems right.

The issue is that def get_inputs_that_alias_output(self) -> Sequence[str] treats every aliased input argument as an input that aliases the output. This is wrong for a mutating op that is auto-functionalizable.

@yushangdi (Contributor) commented Sep 18, 2025:

The "[ FAILED ] AotInductorTest.RuntimeUpdateConstantsCuda" failure looks unrelated to the PR, might be flaky CI?

I think it comes from this line: https://github.com/pytorch/pytorch/blame/159c2140f7489a1806b88019ec1ed2a8ed012942/test/cpp/aoti_inference/test.cpp#L201

and it should be included in data.pt when generated here: https://github.com/pytorch/pytorch/blob/main/test/cpp/aoti_inference/test.py#L85

cc @desertfire @muchulee8 Have you seen this error before?

"C++ exception with description "torch.Serializer (of Python compilation unit at: 0x55bb8f62b440) does not have a field with name 'w_add_cuda_use_runtime_constant_folding'" thrown in the test body."

@BoyuanFeng (Contributor, Author) commented Sep 19, 2025:

@pytorchbot merge -f "skip unrelated AOTI failure"

1 similar comment from @BoyuanFeng.

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@BoyuanFeng BoyuanFeng added this to the 2.9.0 milestone Sep 19, 2025
@BoyuanFeng (Contributor, Author) commented:

@pytorchbot cherry-pick --onto release/2.9 -c fixnewfeature

pytorchbot pushed a commit that referenced this pull request Sep 19, 2025
Pull Request resolved: #163227
Approved by: https://github.com/zou3519

(cherry picked from commit 4967ad8)
@pytorchbot (Collaborator) commented:

Cherry picking #163227

The cherry pick PR is at #163380 and it is recommended to link a fixnewfeature cherry pick PR with an issue. The following tracker issues are updated:


huydhn pushed a commit that referenced this pull request Sep 19, 2025
[Graph Partition] improve custom op output alias (#163227)

(cherry picked from commit 4967ad8)

Co-authored-by: Boyuan Feng <boyuan@meta.com>
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
@github-actions github-actions bot deleted the bf/partition-custom-op-alias branch October 20, 2025 02:17

Labels

ci-no-td, ciflow/inductor, ciflow/trunk, Merged, module: inductor, topic: not user facing, vllm-compile
