Fix redundant kernel generations #102104
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/102104
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit dc1e187. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
Some dynamic shape tests failed because they count the number of generated kernels, which this PR changes. I'll fix them in the next commit.
torch/_inductor/codegen/common.py (outdated)
def seed_offset(self, name, value):
    if "load_seed_offset" in self.sizevars.values():
        name = "%s%d" % (
Can you use f-string formatting here, i.e. f"{name}{expr}"?
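For illustration, the suggested change would look roughly like this (the right-hand side of the %-format call is truncated in the diff above, so its arguments are assumed here):

# Before: printf-style formatting (arguments assumed for illustration)
name = "%s%d" % (name, expr)
# After: the equivalent f-string the reviewer suggests
name = f"{name}{expr}"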
Fixed by e6268cf.
torch/_inductor/codegen/common.py (outdated)
self.inplace_buffers[output_name] = buf

def seed_offset(self, name, value):
    if "load_seed_offset" in self.sizevars.values():
Do we have any test cases for multiple load_seed_offset args?
Also, this wouldn't work if you change name at the call site.
I changed the hardcoded string to the name in e6268cf. A simple example with multiple load_seed_offset values could be the following:
import torch

def fn():
    random_tensor1 = torch.randint(10, [1024], device="cuda")
    random_tensor2 = torch.randint(11, [1024], device="cuda")
    random_tensor3 = torch.randint(10, [1024], device="cuda")
    tensor4 = random_tensor1 + random_tensor2 + random_tensor3
    return tensor4
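As a sketch of how one might inspect the kernels Inductor generates for this example (TORCH_COMPILE_DEBUG is an existing Inductor debug switch; the usage pattern below is assumed, not part of this PR):

import torch

# Run with the environment variable TORCH_COMPILE_DEBUG=1 set so that the
# generated output_code.py is dumped to disk, where the Triton kernels can
# be inspected and counted.
compiled_fn = torch.compile(fn)
result = compiled_fn()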
I'm thinking there is another hardcoded string in the divisible_by_16 check; it would be better to replace that too. Do you know where I should put this string as a global constant?
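A minimal sketch of the kind of shared constant being discussed (the name and placement below are hypothetical, not necessarily what the PR ended up doing):

# Hypothetical module-level constant, e.g. in torch/_inductor/codegen/common.py,
# that both seed_offset and the divisible_by_16 check could reference instead
# of each hardcoding the "load_seed_offset" string literal.
LOAD_SEED_OFFSET_PREFIX = "load_seed_offset"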
Yeah, I understand that examples with multiple load_seed_offset values exist; I'm asking whether we have a test for them.
I've only tested several models and haven't found multiple load_seed_offset cases so far.
Do you mean adding a unit test?
Yes, please add a unit test for this if it doesn't exist.
Added a unit test in dc1e187.
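A minimal sketch of what such a test can look like, assuming the run_and_get_code helper from torch._inductor.utils that the Inductor test suite uses elsewhere; the actual test added in dc1e187 may differ:

import torch
from torch._inductor.utils import run_and_get_code

def test_multiple_load_seed_offset_kernels():
    def fn():
        a = torch.randint(10, [1024], device="cuda")
        b = torch.randint(11, [1024], device="cuda")
        c = torch.randint(10, [1024], device="cuda")
        return a + b + c

    # run_and_get_code returns the result plus the generated source strings.
    _, (code,) = run_and_get_code(torch.compile(fn))
    # With the seed offset passed in as a kernel argument, kernels that
    # previously differed only in a hardcoded offset constant are shared,
    # so fewer @triton.jit definitions should appear in the output code.
    assert code.count("@triton.jit") < 3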
torch/_inductor/codegen/common.py (outdated)
def seed_offset(self, name, value):
    if name in self.sizevars.values():
        name = f"{name}{sum(1 for value in self.sizevars.values() if value.startswith('load_seed_offset'))}"
You still have load_seed_offset hardcoded here.
Fixed by b84276f.
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, the first few of them are: trunk / win-vs2019-cpu-py3 / build. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, the first few of them are: trunk / win-vs2019-cpu-py3 / test (default, 2, 3, windows.4xlarge.nonephemeral). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -f "test failure unrelated"
Merge started: Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Issue description
PR #100064 introduced a new RNG operation process. However, it causes every randint to load a separate random seed by default. TorchInductor generates a buffer to store all the necessary random seeds and places the offsets as constant values in the subsequent compute buffers. In the ir_pre_fusion output generated by TorchInductor, some buffers differ only by one line: the load of the random seed at the corresponding offset. The codegen then generates Triton kernels following the same rule, so in output_code.py some Triton kernels differ only by that one line, meaning redundant kernels are generated.
Solution
This PR captures the seed offset and adds it to the existing self.sizevars structure. It generates variable names as placeholders, allowing the code wrapper to pass the offset as an argument to the kernels. I've also modified the divisible_by_16 check to exclude this argument. This PR reduces the number of generated kernels from 50 to 17 for the BertForMaskedLM forward pass.
In tests on my own environment, the compilation time of attention_is_all_you_need_pytorch dropped from 94s to 66s. The speedup remains largely unchanged, at 1.37x.
The following is a comparison for a simple example.
Before:
After:
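In sketch form, the difference is roughly the following (kernel and variable names are illustrative, not the actual generated code):

# Before: two Triton kernels that differ only in the hardcoded offset used
# to load from the shared seed buffer.
#   triton_0: tmp0 = tl.load(in_ptr0 + 0)
#   triton_1: tmp0 = tl.load(in_ptr0 + 1)
#
# After: a single kernel that receives the offset as an argument, so both
# call sites can reuse it.
#   triton_0(in_ptr0, ..., load_seed_offset):
#       tmp0 = tl.load(in_ptr0 + load_seed_offset)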
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10