
Conversation

@BoyuanFeng (Contributor) commented Sep 18, 2025

For a custom op with multiple outputs, we will see the following generated code:

```
buf1 = op1(arg0)
buf3 = buf1[0]
buf4 = buf1[1]
del buf1  # <--- if buf1 is not accessed in the future
```

If `buf1` is not accessed later, it is good to deallocate it early rather than delaying the `del` until both `buf3` and `buf4` are dead. Note that `buf3` and `buf4` hold their own references to the underlying data, so `del buf1` does not prevent their use.
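A minimal Python sketch of why the early `del` is safe, with plain tensors standing in for Inductor buffers:

```python
import torch

def op1(x):
    x = x + 1
    return (x + 1, x + 2)

buf1 = op1(torch.randn(4))  # a tuple of two tensors
buf3 = buf1[0]              # buf3 and buf4 each hold their own reference
buf4 = buf1[1]
del buf1                    # drops only the tuple's reference to the pair
print(buf3 + buf4)          # still valid: the tensors outlive the tuple
```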

However, when there are mutating args, we don't see del buf1 immediately.

```python
@torch.library.custom_op(
    "mylib::op1",
    mutates_args=["x"],
    schema="(Tensor(a!)?  x) -> (Tensor, Tensor)",
    device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
    x = x + 1
    return (x + 1, x + 2)
```

[screenshot: generated code comparison](https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d)
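A hypothetical repro (the function name and shapes are illustrative, and a CUDA device is assumed), compiled so that Inductor emits wrapper code like the above:

```python
import torch

@torch.compile
def f(x):
    a, b = torch.ops.mylib.op1(x)  # custom op registered above
    return a + b

# Run with TORCH_LOGS="output_code" to inspect the generated wrapper code.
f(torch.randn(8, device="cuda"))
```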

Why? Because `buf3` is a MultiOutput with `buf1` as input, and it believes `buf1` (an output of the FallbackKernel for op1) has inputs that alias its output:

pytorch/torch/_inductor/ir.py, lines 7976 to 7982 at 72fedf0:

```python
def get_inputs_that_alias_output(self) -> Sequence[str]:
    return [
        inp.get_name()
        for inp in self.inputs
        if isinstance(inp, FallbackKernel)
        and len(inp.get_inputs_that_alias_output()) > 0
    ]
```

According to [NOTE: FallbackKernel supported operators], a mutating op that is auto-functionalizable must NOT alias any of its inputs in its outputs, so buf1 should report no such aliases. This PR improves `get_inputs_that_alias_output` of FallbackKernel accordingly.
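A minimal sketch of the improved check, assuming the four conditions quoted in the review thread below; the actual diff may structure this differently:

```python
# Sketch only: a mutable op that can be auto-functionalized cannot
# alias its inputs in its outputs, so it contributes no aliasing inputs.
def get_inputs_that_alias_output(self) -> Sequence[str]:
    if (
        not isinstance(self.op_overload, torch._ops.HigherOrderOperator)
        and "_c10d_functional" not in self.op_overload.name()
        and self.op_overload._schema.is_mutable
        and can_auto_functionalize(self.op_overload)
    ):
        return []
    return self.alias_names
```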

Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064)

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @zou3519

@BoyuanFeng added the ciflow/trunk, topic: not user facing, and ci-no-td labels on Sep 18, 2025
@pytorch-bot bot commented Sep 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163227

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit f329f14 with merge base 0f46274:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Comment on the diff context:

```python
)

# Check that we are allocating the minimum number of intermediate buffers
# Check that we are not allocate intermediate buffers
```
@BoyuanFeng (Contributor, Author) commented Sep 18, 2025:

The intermediate buffer buf3 is reused for buf8, so buf8 is never allocated; see the code comparison (left: after this PR).

@eellison (Contributor) left a comment:

I understand that we want to del multi output as soon as possible, but it's not clear to me how that relates to the changes you've made here.

Comment on lines +7564 to +7567:

```python
not isinstance(self.op_overload, torch._ops.HigherOrderOperator)
and "_c10d_functional" not in self.op_overload.name()
and self.op_overload._schema.is_mutable
and can_auto_functionalize(self.op_overload)
```
@eellison (Contributor) commented Sep 18, 2025:

Can you say more about this? Can you also add tests for collective operators?

@BoyuanFeng (Contributor, Author) replied:

For the 1st and 2nd conditions:

```python
not isinstance(self.op_overload, torch._ops.HigherOrderOperator)
and "_c10d_functional" not in self.op_overload.name()
```
They are added because FallbackKernel simply early-returns, before recording any aliasing, when creating an ir.FallbackKernel for these cases:

pytorch/torch/_inductor/ir.py, lines 7395 to 7406 at f329f14:

```python
if isinstance(self.op_overload, torch._ops.HigherOrderOperator):
    # We assume here that HOPs with FallbackKernel are functional.
    # This may not always be true! HOPs must individually opt-in to
    # FallbackKernel, so please check this if you opt-in.
    return
if "_c10d_functional" in self.op_overload.name():
    # _c10d_functional kernels are lowered into _CollectiveKernel which
    # derives from FallbackKernel for the cpp codegen. The kernels
    # don't pass the can_auto_functionalize check, but their mutation
    # is handled properly by _CollectiveKernel.
    return
```

For the 3rd and 4th conditions:

```python
and self.op_overload._schema.is_mutable
and can_auto_functionalize(self.op_overload)
```

If these conditions hold, the op is a mutating op that is auto-functionalizable. [Note: FallbackKernel supported operators] says its outputs do not alias any of its inputs, so get_inputs_that_alias_output should return an empty list.
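A quick, hypothetical check of these two conditions on the example op (assumes the mylib::op1 registration from the PR description; treat the import path as an assumption):

```python
import torch
from torch._higher_order_ops.auto_functionalize import can_auto_functionalize

op = torch.ops.mylib.op1.default
print(op._schema.is_mutable)       # True: the schema marks x as Tensor(a!)
print(can_auto_functionalize(op))  # True: the mutation can be auto-functionalized
```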

f"{type(self.op_overload)} not supported"
)

# See [Note: FallbackKernel supported operators]: for a mutating
@eellison (Contributor) commented:

Shouldn't we just update this so that we do not create incorrect alias_names in the first place?

@BoyuanFeng (Contributor, Author) replied Sep 19, 2025:

self.alias_names lists the arguments that are aliased; that part seems right.

The issue is that def get_inputs_that_alias_output(self) -> Sequence[str] treats every aliased input argument as an input that aliases the output. This is wrong for a mutating op that is auto-functionalizable.

@yushangdi (Contributor) commented Sep 18, 2025:

The "[ FAILED ] AotInductorTest.RuntimeUpdateConstantsCuda" failure looks unrelated to the PR, might be flaky CI?

I think it comes from this line: https://github.com/pytorch/pytorch/blame/159c2140f7489a1806b88019ec1ed2a8ed012942/test/cpp/aoti_inference/test.cpp#L201

and it should be included in data.pt when generated here: https://github.com/pytorch/pytorch/blob/main/test/cpp/aoti_inference/test.py#L85

cc @desertfire @muchulee8 Have you seen this error before?

"C++ exception with description "torch.Serializer (of Python compilation unit at: 0x55bb8f62b440) does not have a field with name 'w_add_cuda_use_runtime_constant_folding'" thrown in the test body."

@BoyuanFeng (Contributor, Author) commented Sep 19, 2025:

@pytorchbot merge -f "skip unrelated AOTI failure"

1 similar comment from @BoyuanFeng.

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@BoyuanFeng BoyuanFeng added this to the 2.9.0 milestone Sep 19, 2025
@BoyuanFeng (Contributor, Author) commented:

@pytorchbot cherry-pick --onto release/2.9 -c fixnewfeature

pytorchbot pushed a commit that referenced this pull request Sep 19, 2025
Pull Request resolved: #163227
Approved by: https://github.com/zou3519

(cherry picked from commit 4967ad8)
@pytorchbot (Collaborator) commented:

Cherry picking #163227

The cherry pick PR is at #163380 and it is recommended to link a fixnewfeature cherry pick PR with an issue. The following tracker issues are updated:


huydhn pushed a commit that referenced this pull request Sep 19, 2025
[Graph Partition] improve custom op output alias (#163227)

(cherry picked from commit 4967ad8)

Co-authored-by: Boyuan Feng <boyuan@meta.com>
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
@github-actions github-actions bot deleted the bf/partition-custom-op-alias branch October 20, 2025 02:17

Labels

ci-no-td, ciflow/inductor, ciflow/trunk, Merged, module: inductor, topic: not user facing, vllm-compile
