Conversation

@pytorchbot

For a custom op with multiple outputs, we will see the following generated code:

```
buf1 = op1(arg0)
buf3 = buf1[0]
buf4 = buf1[1]
del buf1  # <--- if buf1 is not accessed in the future
```

If `buf1` is not accessed in the future, it's good to deallocate it early, so we don't delay the `del` until both `buf3` and `buf4` are no longer used. Note that `buf3` and `buf4` hold their own references to the underlying data, so `del buf1` does not prevent their use.
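To see why this is safe, here is a minimal sketch in plain Python/PyTorch (not Inductor-generated code, just an analogy for the reference counting involved):

```python
import torch

# Analogue of: buf1 = op1(arg0); buf3 = buf1[0]; buf4 = buf1[1]
outs = (torch.ones(4), torch.zeros(4))
first, second = outs[0], outs[1]

del outs  # analogous to `del buf1`

# `first` and `second` still hold their own references to the storage,
# so deleting the containing name does not invalidate them.
print(first.sum().item())   # 4.0
print(second.sum().item())  # 0.0
```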

However, when there are mutating args, we don't see `del buf1` immediately.

```python
import torch

# Custom op with a mutable arg (`x`) and two outputs.
@torch.library.custom_op(
    "mylib::op1",
    mutates_args=["x"],
    schema="(Tensor(a!)? x) -> (Tensor, Tensor)",
    device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
    x = x + 1
    return (x + 1, x + 2)
```
<img width="661" height="821" alt="image" src="https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d" />
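For context, a hedged sketch of how this op might be exercised under `torch.compile` to reproduce the generated code above (assumes a CUDA device; the fake-impl registration is an addition here, not part of the original snippet):

```python
# Register a fake (meta) impl so torch.compile can trace the op.
@torch.library.register_fake("mylib::op1")
def _(x):
    return (torch.empty_like(x), torch.empty_like(x))

def f(x):
    a, b = torch.ops.mylib.op1(x)
    return a.relu() + b.relu()

x = torch.randn(8, device="cuda")
out = torch.compile(f)(x)
# Running with TORCH_LOGS="output_code" prints the Inductor wrapper code,
# where the placement of `del` for the fallback's output can be inspected.
```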

Why? Because `buf3` is a MultiOutput with `buf1` as input, and it believes that `buf1` (an output of the FallbackKernel for op1) has inputs that alias its output.

pytorch/torch/_inductor/ir.py, lines 7976 to 7982 at 72fedf0:
https://github.com/pytorch/pytorch/blob/72fedf05752069c9e8b97c64397aedf6ee2bf5ec/torch/_inductor/ir.py#L7976-L7982

```python
def get_inputs_that_alias_output(self) -> Sequence[str]:
    return [
        inp.get_name()
        for inp in self.inputs
        if isinstance(inp, FallbackKernel)
        and len(inp.get_inputs_that_alias_output()) > 0
    ]
```

According to `[NOTE: FallbackKernel supported operators]`, as a mutating op that is auto-functionalizable, `buf1`'s outputs should NOT alias any of the inputs. This PR improves `get_inputs_that_alias_output` of FallbackKernel.
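A hedged sketch (not the actual PR diff; the helper name is made up) of the distinction the fix relies on: an output aliases an input only when a *return* in the schema carries alias info, whereas an auto-functionalizable mutating op only annotates its *arguments*:

```python
import torch

def schema_declares_aliasing_output(schema: torch._C.FunctionSchema) -> bool:
    # Hypothetical helper, not FallbackKernel's code: report aliasing only when
    # a return value is annotated with alias info (e.g. "-> Tensor(a)"),
    # not merely because some argument is mutable (e.g. "Tensor(a!) x").
    return any(ret.alias_info is not None for ret in schema.returns)

# Example: mylib::op1 (defined above) mutates `x` but returns fresh tensors,
# so this prints False.
print(schema_declares_aliasing_output(torch.ops.mylib.op1.default._schema))
```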

Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064)

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

Pull Request resolved: #163227
Approved by: https://github.com/zou3519

(cherry picked from commit 4967ad8)
@pytorch-bot

pytorch-bot bot commented Sep 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163380

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 45 Pending

As of commit 0387893 with merge base 4840a1a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@huydhn huydhn merged commit 35c55da into release/2.9 Sep 19, 2025
70 of 117 checks passed
@github-actions github-actions bot deleted the cherry-pick-163227-by-pytorch_bot_bot_ branch October 20, 2025 02:18