[aoti] clear precomputed symbol replacements before cpp wrapper compilation #122882
Conversation
After we codegen a triton kernel in the triton codegen backend, we cache the generated triton source code in the wrapper to avoid producing multiple triton kernels with the same content.

In the AOTI compilation flow, this caching mechanism imposes a strong requirement on the codegen: we must generate the same triton source code for the same schedule node in both the python and cpp codegen phases. Otherwise, we end up with a mismatch between the kernel name formed in the cpp codegen and the cuda kernel key produced by the python codegen, and consequently hit a missing-cuda-kernel error.

The precomputed symbol replacements saved in V.graph.sizevars can cause exactly this kind of source-code inconsistency in indexing code. For example, say the python codegen phase produces "ks2*48" as part of indexing an input for schedule node A, while recording the replacement pair "ks0 -> ks2*48" in the precomputed replacements. In the second, cpp codegen phase, we would then produce "ks0" for the same indexing code of schedule node A because of that replacement pair.

This PR fixes the issue by clearing precomputed_replacements and inv_precomputed_replacements before cpp wrapper codegen.
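The failure mode above can be sketched as a toy reproduction. This is a hypothetical, simplified model (plain Python strings instead of Inductor's sympy expressions; `src_to_kernel`, `codegen_index`, and `get_or_create_kernel` are illustrative names, not the real Inductor API) of how a content-keyed kernel cache plus a replacement table recorded during the first codegen pass can desynchronize the second pass:

```python
# Toy model of the two-phase codegen inconsistency described above.
# All names here are illustrative; only the mechanism mirrors the PR text.

# Wrapper-level cache: dedupes kernels by their generated source code.
src_to_kernel = {}

def get_or_create_kernel(src):
    """Return a stable kernel name for identical source; mint a new one otherwise."""
    if src not in src_to_kernel:
        src_to_kernel[src] = f"triton_kernel_{len(src_to_kernel)}"
    return src_to_kernel[src]

# Stand-in for the replacement table kept on V.graph.sizevars.
inv_precomputed_replacements = {}

def codegen_index(raw_index):
    # A later pass sees replacements recorded by an earlier pass, so the
    # "same" index expression can be printed differently.
    return inv_precomputed_replacements.get(raw_index, raw_index)

# Phase 1 (python wrapper codegen): prints "ks2*48" and, as a side effect,
# records the replacement pair ks0 -> ks2*48.
src_py = f"index = {codegen_index('ks2*48')}"
inv_precomputed_replacements["ks2*48"] = "ks0"
kernel_name = get_or_create_kernel(src_py)

# Phase 2 (cpp wrapper codegen): the recorded pair rewrites the index to
# "ks0", so the regenerated source no longer matches the cached key and a
# *new* kernel name is minted -- the missing-cuda-kernel mismatch.
src_cpp = f"index = {codegen_index('ks2*48')}"
assert src_cpp != src_py
assert get_or_create_kernel(src_cpp) != kernel_name

# The fix, conceptually: clear the replacements before the cpp wrapper
# codegen so both phases print identical source.
inv_precomputed_replacements.clear()
assert f"index = {codegen_index('ks2*48')}" == src_py
```

With the table cleared, both phases print `ks2*48` and hit the same cache entry, which is the invariant the caching scheme relies on.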
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122882
Note: Links to docs will display an error until the docs builds have been completed.
❌ 7 New Failures, 2 Unrelated Failures
As of commit d3fcb17 with merge base 8c8e4e3:
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed. Reason: 2 mandatory check(s) failed. The first few are:
Dig deeper by viewing the failures on hud
@pytorchbot merge -f "pre-existing failures"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot revert -m "broke ROCm CI" -c "nosignal"
@pytorchbot successfully started a revert job. Check the current status here.
…er compilation (#122882)" This reverts commit 384de46. Reverted #122882 on behalf of https://github.com/jithunnair-amd due to broke ROCm CI
@chenyang78 your PR has been successfully reverted.
@@ -1138,6 +1138,70 @@ def forward(self, x, y):
exactly=True,
).run(src_code)

def test_reuse_kernel_dynamic(self):
@chenyang78 Suggest adding `@skipIfRocm` here to skip these new unit tests for ROCm and keep CI green. We will take a look at these unit tests and try to enable them later.
Alternately, and preferably, you can add an entry `"test_reuse_kernel_dynamic": fail_cuda(is_skip=True)` here: https://github.com/pytorch/pytorch/blob/d3fcb1717c17a3c3541b67f829c0699e60eb0f3b/test/inductor/test_aot_inductor.py#L2421C13-L2421C67 to skip these unit tests only for the "cuda" device_type on ROCm.
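For illustration, here is a minimal self-contained sketch of both suggestions. The real helpers live in PyTorch's test suite (`skipIfRocm` in `torch.testing._internal.common_utils`, and the `fail_cuda`/failure-table pattern in `test/inductor/test_aot_inductor.py`); the mock versions below only show the shape of each approach and are not the actual PyTorch implementations:

```python
import os
import unittest

# Assumed ROCm detection for this sketch; PyTorch's test suite has its own flag.
TEST_WITH_ROCM = os.environ.get("PYTORCH_TEST_WITH_ROCM") == "1"

# Mock of a skipIfRocm-style decorator: skip the decorated test
# whenever the suite runs under ROCm.
def skipIfRocm(fn):
    return unittest.skipIf(TEST_WITH_ROCM, "test doesn't currently work on ROCm")(fn)

class AOTInductorTests(unittest.TestCase):
    @skipIfRocm
    def test_reuse_kernel_dynamic(self):
        self.assertTrue(True)  # placeholder body for the sketch

# Mock of the table-driven alternative: mark a single test as a
# skip/expected failure for one device type only.
def fail_cuda(is_skip=False):
    return {"device_type": "cuda", "is_skip": is_skip}

CUDA_TEST_FAILURES = {
    "test_reuse_kernel_dynamic": fail_cuda(is_skip=True),
}
```

The table-driven form is the narrower tool: it disables the test only for the "cuda" device type, whereas the decorator skips it on ROCm for every device the test class is instantiated with.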
Closing this as the fix was already re-landed in #123136
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang