[Graph Partition] improve custom op output alias #163227

Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163227
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit f329f14 with merge base 0f46274. The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
# Check that we are allocating the minimum number of intermediate buffers
# Check that we are not allocate intermediate buffers
The intermediate buffer `buf3` is reused for `buf8`, so we don't allocate `buf8`. Code comparison (left: after this PR):
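For readers unfamiliar with how such allocation checks are typically asserted: below is a minimal sketch, assuming a CUDA device. `run_and_get_code` is an existing helper in `torch._inductor.utils`; the function `f` and the expected count of `1` are illustrative, not taken from this PR's test.

```python
import torch
from torch._inductor.utils import run_and_get_code

def f(x):
    # two pointwise ops that Inductor fuses into a single kernel,
    # so only one intermediate buffer needs to be allocated
    return x.sin().cos()

x = torch.randn(8, device="cuda")
_, (code,) = run_and_get_code(torch.compile(f), x)
# count buffer allocations in the generated wrapper code
assert code.count("empty_strided_cuda") == 1
```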
I understand that we want to `del` the multi-output buffer as soon as possible, but it's not clear to me how that relates to the changes you've made here.
not isinstance(self.op_overload, torch._ops.HigherOrderOperator)
and "_c10d_functional" not in self.op_overload.name()
and self.op_overload._schema.is_mutable
and can_auto_functionalize(self.op_overload)
Can you say more about this? Can you also add tests for collective operators?
For the 1st & 2nd conditions:
not isinstance(self.op_overload, torch._ops.HigherOrderOperator)
and "_c10d_functional" not in self.op_overload.name()
They are added since FallbackKernel just early-returns for these ops when creating an `ir.FallbackKernel`:
Lines 7395 to 7406 in f329f14
if isinstance(self.op_overload, torch._ops.HigherOrderOperator):
    # We assume here that HOPs with FallbackKernel are functional.
    # This may not always be true! HOPs must individually opt-in to
    # FallbackKernel, so please check this if you opt-in.
    return
if "_c10d_functional" in self.op_overload.name():
    # _c10d_functional kernels are lowered into _CollectiveKernel which
    # derives from FallbackKernel for the cpp codegen. The kernels
    # don't pass the can_auto_functionalize check, but their mutation
    # is handled properly by _CollectiveKernel.
    return
For the 3rd & 4th conditions:
and self.op_overload._schema.is_mutable
and can_auto_functionalize(self.op_overload)
If these conditions hold, the op is a "mutating op that is auto-functionalizable". [Note: FallbackKernel supported operators] says its outputs do not alias any of the inputs, so `get_inputs_that_alias_output` should return an empty list.
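Putting the four conditions together, here is a minimal sketch of the resulting method (not the exact PR diff); `can_auto_functionalize` is the existing helper from `torch._higher_order_ops.auto_functionalize`:

```python
import torch
from torch._higher_order_ops.auto_functionalize import can_auto_functionalize

def get_inputs_that_alias_output(self):
    if (
        not isinstance(self.op_overload, torch._ops.HigherOrderOperator)
        and "_c10d_functional" not in self.op_overload.name()
        and self.op_overload._schema.is_mutable
        and can_auto_functionalize(self.op_overload)
    ):
        # [Note: FallbackKernel supported operators]: outputs of a mutating,
        # auto-functionalizable op never alias its inputs
        return []
    # otherwise keep the previous conservative behavior: every aliased
    # argument is treated as potentially aliasing an output
    return self.alias_names
```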
f"{type(self.op_overload)} not supported" | ||
) | ||
|
||
# See [Note: FallbackKernel supported operators]: for a mutating |
Shouldn't we just update this so that we do not create incorrect `alias_names` in the first place?
`self.alias_names` are arguments that are aliased, so it seems to be right. The issue is that `def get_inputs_that_alias_output(self) -> Sequence[str]` treats all input arguments that are aliased as input arguments that alias an output. This is wrong for a mutating op that is auto-functionalizable.
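To make the distinction concrete, here is a runnable illustration (the op name `mylib::mutating_op` and the shapes are hypothetical, not from the PR): the input `x` is an aliased (mutated) argument, yet neither output aliases it.

```python
import torch

# hypothetical op: mutates x in place, returns two fresh tensors
@torch.library.custom_op(
    "mylib::mutating_op",
    mutates_args=["x"],
    schema="(Tensor(a!) x) -> (Tensor, Tensor)",
)
def mutating_op(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    x.add_(1)                # the input is mutated...
    return (x + 1, x + 2)    # ...but the outputs live in fresh storage

x = torch.randn(4)
a, b = torch.ops.mylib.mutating_op(x)
# x belongs in alias_names (it is mutated), yet no output aliases it,
# so get_inputs_that_alias_output should return []
assert a.data_ptr() != x.data_ptr() and b.data_ptr() != x.data_ptr()
```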
The "[ FAILED ] AotInductorTest.RuntimeUpdateConstantsCuda" failure looks unrelated to the PR, might be flaky CI? I think it comes from this line: https://github.com/pytorch/pytorch/blame/159c2140f7489a1806b88019ec1ed2a8ed012942/test/cpp/aoti_inference/test.cpp#L201 and it should be included in cc @desertfire @muchulee8 Have you seen this error before? "C++ exception with description "torch.Serializer (of Python compilation unit at: 0x55bb8f62b440) does not have a field with name 'w_add_cuda_use_runtime_constant_folding'" thrown in the test body." |
@pytorchbot merge -f "skip unrelated AOTI failure"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot cherry-pick --onto release/2.9 -c fixnewfeature
Cherry picking #163227. The cherry pick PR is at #163380 and it is recommended to link a fixnewfeature cherry pick PR with an issue. The following tracker issues are updated:
Details for Dev Infra team: raised by workflow job
For a custom op with multiple outputs, we will see the following generated code:

```
buf1 = op1(arg0)
buf3 = buf1[0]
buf4 = buf1[1]
del buf1  # <--- if buf1 is not accessed in the future
```

If `buf1` is not accessed in the future, it's good to deallocate it early, so we don't delay the `del` until both buf3 and buf4 are no longer used. Note that buf3 and buf4 hold references to the data, so `del buf1` does not prevent their use.

However, when there are mutating args, we don't see `del buf1` immediately:

```python
@torch.library.custom_op(
    "mylib::op1",
    mutates_args=["x"],
    schema="(Tensor(a!)? x) -> (Tensor, Tensor)",
    device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
    x = x + 1
    return (x + 1, x + 2)
```

<img width="661" height="821" alt="image" src="https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d" />

Why? Because `buf3` is a MultiOutput with `buf1` as input, and it believes `buf1` (an output of FallbackKernel op1) has inputs that alias its output:

https://github.com/pytorch/pytorch/blob/72fedf05752069c9e8b97c64397aedf6ee2bf5ec/torch/_inductor/ir.py#L7976-L7982

According to `[Note: FallbackKernel supported operators]`, as a mutating op that is auto-functionalizable, buf1's outputs should NOT alias any of the inputs. This PR improves `get_inputs_that_alias_output` of FallbackKernel accordingly.

Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064)

Pull Request resolved: #163227
Approved by: https://github.com/zou3519

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @zou3519
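As a closing illustration, a hedged sketch of how the earlier `del` could be observed end to end. It assumes a CUDA device and the `mylib::op1` definition from the description above; `run_and_get_code` is an existing Inductor test helper, and the final assertion is deliberately loose since exact buffer names vary between runs.

```python
import torch
from torch._inductor.utils import run_and_get_code

def f(x):
    a, b = torch.ops.mylib.op1(x)
    return a + b

x = torch.randn(8, device="cuda")
_, (code,) = run_and_get_code(torch.compile(f), x)
# with this PR, the tuple buffer returned by op1 is deleted right after
# its two elements are extracted, instead of after the last use of a and b
assert "del" in code
```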