Conversation

@pytorchbot

For a custom op with multiple outputs, we will see the following generated code:

```
buf1 = op1(arg0)
buf3 = buf1[0]
buf4 = buf1[1]
del buf1  # <--- if buf1 is not accessed in the future
```

If `buf1` is not accessed in the future, it's good to deallocate it early, so we don't delay the `del` until both `buf3` and `buf4` are no longer used. Note that `buf3` and `buf4` hold their own references to the underlying data, so `del buf1` does not prevent their use.
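To see why this is safe, here is a minimal sketch in plain Python/PyTorch (not Inductor-generated code, just an analogy for the reference counting involved):

```python
import torch

# Analogue of: buf1 = op1(arg0); buf3 = buf1[0]; buf4 = buf1[1]
outs = (torch.ones(4), torch.zeros(4))
first, second = outs[0], outs[1]

del outs  # analogous to `del buf1`

# `first` and `second` still hold their own references to the storage,
# so deleting the containing name does not invalidate them.
print(first.sum().item())   # 4.0
print(second.sum().item())  # 0.0
```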

However, when there are mutating args, we don't see `del buf1` immediately.

```python
import torch

# Custom op with a mutable arg (`x`) and two outputs.
@torch.library.custom_op(
    "mylib::op1",
    mutates_args=["x"],
    schema="(Tensor(a!)? x) -> (Tensor, Tensor)",
    device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
    x = x + 1
    return (x + 1, x + 2)
```
<img width="661" height="821" alt="image" src="https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d" />
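For context, a hedged sketch of how this op might be exercised under `torch.compile` to reproduce the generated code above (assumes a CUDA device; the fake-impl registration is an addition here, not part of the original snippet):

```python
# Register a fake (meta) impl so torch.compile can trace the op.
@torch.library.register_fake("mylib::op1")
def _(x):
    return (torch.empty_like(x), torch.empty_like(x))

def f(x):
    a, b = torch.ops.mylib.op1(x)
    return a.relu() + b.relu()

x = torch.randn(8, device="cuda")
out = torch.compile(f)(x)
# Running with TORCH_LOGS="output_code" prints the Inductor wrapper code,
# where the placement of `del` for the fallback's output can be inspected.
```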

Why? Because `buf3` is a MultiOutput with `buf1` as input, and it believes that `buf1` (an output of the FallbackKernel for op1) has inputs that alias its output.

pytorch/torch/_inductor/ir.py, lines 7976 to 7982 at 72fedf0:
https://github.com/pytorch/pytorch/blob/72fedf05752069c9e8b97c64397aedf6ee2bf5ec/torch/_inductor/ir.py#L7976-L7982

```python
def get_inputs_that_alias_output(self) -> Sequence[str]:
    return [
        inp.get_name()
        for inp in self.inputs
        if isinstance(inp, FallbackKernel)
        and len(inp.get_inputs_that_alias_output()) > 0
    ]
```

According to `[NOTE: FallbackKernel supported operators]`, as a mutating op that is auto-functionalizable, `buf1`'s outputs should NOT alias any of the inputs. This PR improves `get_inputs_that_alias_output` of FallbackKernel.
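A hedged sketch (not the actual PR diff; the helper name is made up) of the distinction the fix relies on: an output aliases an input only when a *return* in the schema carries alias info, whereas an auto-functionalizable mutating op only annotates its *arguments*:

```python
import torch

def schema_declares_aliasing_output(schema: torch._C.FunctionSchema) -> bool:
    # Hypothetical helper, not FallbackKernel's code: report aliasing only when
    # a return value is annotated with alias info (e.g. "-> Tensor(a)"),
    # not merely because some argument is mutable (e.g. "Tensor(a!) x").
    return any(ret.alias_info is not None for ret in schema.returns)

# Example: mylib::op1 (defined above) mutates `x` but returns fresh tensors,
# so this prints False.
print(schema_declares_aliasing_output(torch.ops.mylib.op1.default._schema))
```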

Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064)

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

Pull Request resolved: #163227
Approved by: https://github.com/zou3519

(cherry picked from commit 4967ad8)
@pytorch-bot

pytorch-bot bot commented Sep 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163380

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 45 Pending

As of commit 0387893 with merge base 4840a1a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@huydhn huydhn merged commit 35c55da into release/2.9 Sep 19, 2025
70 of 117 checks passed
@github-actions github-actions bot deleted the cherry-pick-163227-by-pytorch_bot_bot_ branch October 20, 2025 02:18