[aoti] clear precomputed symbol replacements before cpp wrapper compilation #122882
Conversation
After we codegen a triton kernel in the triton codegen backend, we cache the generated triton source code in the wrapper to avoid producing multiple triton kernels with the same content.

In the AOTI compilation flow, this caching mechanism imposes a strong requirement on the codegen: we must generate the same triton source code for the same schedule node in both the python and cpp codegen phases. Otherwise, we end up with a mismatch between the kernel name formed in the cpp codegen and the cuda kernel key produced by the python codegen, and consequently hit a missing-cuda-kernel error.

The precomputed symbol replacements saved in V.graph.sizevars can cause exactly this kind of source-code inconsistency in indexing code. For example, say the python codegen phase produces "ks2*48" as part of indexing an input for schedule node A, while recording the replacement pair "ks0 -> ks2*48" in the precomputed replacements. In the second, cpp codegen phase, we would then produce "ks0" for the same indexing code of schedule node A because of that replacement pair.

This PR fixes the issue by clearing precomputed_replacements and inv_precomputed_replacements before cpp wrapper codegen.
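The failure mode above can be sketched as a toy reproduction. This is a hypothetical, simplified model (plain Python strings instead of Inductor's sympy expressions; `src_to_kernel`, `codegen_index`, and `get_or_create_kernel` are illustrative names, not the real Inductor API) of how a content-keyed kernel cache plus a replacement table recorded during the first codegen pass can desynchronize the second pass:

```python
# Toy model of the two-phase codegen inconsistency described above.
# All names here are illustrative; only the mechanism mirrors the PR text.

# Wrapper-level cache: dedupes kernels by their generated source code.
src_to_kernel = {}

def get_or_create_kernel(src):
    """Return a stable kernel name for identical source; mint a new one otherwise."""
    if src not in src_to_kernel:
        src_to_kernel[src] = f"triton_kernel_{len(src_to_kernel)}"
    return src_to_kernel[src]

# Stand-in for the replacement table kept on V.graph.sizevars.
inv_precomputed_replacements = {}

def codegen_index(raw_index):
    # A later pass sees replacements recorded by an earlier pass, so the
    # "same" index expression can be printed differently.
    return inv_precomputed_replacements.get(raw_index, raw_index)

# Phase 1 (python wrapper codegen): prints "ks2*48" and, as a side effect,
# records the replacement pair ks0 -> ks2*48.
src_py = f"index = {codegen_index('ks2*48')}"
inv_precomputed_replacements["ks2*48"] = "ks0"
kernel_name = get_or_create_kernel(src_py)

# Phase 2 (cpp wrapper codegen): the recorded pair rewrites the index to
# "ks0", so the regenerated source no longer matches the cached key and a
# *new* kernel name is minted -- the missing-cuda-kernel mismatch.
src_cpp = f"index = {codegen_index('ks2*48')}"
assert src_cpp != src_py
assert get_or_create_kernel(src_cpp) != kernel_name

# The fix, conceptually: clear the replacements before the cpp wrapper
# codegen so both phases print identical source.
inv_precomputed_replacements.clear()
assert f"index = {codegen_index('ks2*48')}" == src_py
```

With the table cleared, both phases print `ks2*48` and hit the same cache entry, which is the invariant the caching scheme relies on.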
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122882
Note: Links to docs will display an error until the docs builds have been completed.
❌ 7 New Failures, 2 Unrelated Failures
As of commit d3fcb17 with merge base 8c8e4e3:
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed. Reason: 2 mandatory check(s) failed. The first few are:
Dig deeper by viewing the failures on hud
@pytorchbot merge -f "pre-existing failures"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@pytorchbot revert -m "broke ROCm CI" -c "nosignal"
@pytorchbot successfully started a revert job. Check the current status here.
…er compilation (#122882)" This reverts commit 384de46. Reverted #122882 on behalf of https://github.com/jithunnair-amd due to broke ROCm CI
@chenyang78 your PR has been successfully reverted.
@@ -1138,6 +1138,70 @@ def forward(self, x, y):
exactly=True,
).run(src_code)

def test_reuse_kernel_dynamic(self):
@chenyang78 Suggest adding `@skipIfRocm` here to skip these new unit tests for ROCm and keep CI green. We will take a look at these unit tests and try to enable them later.
Alternately, and preferably, you can add an entry `"test_reuse_kernel_dynamic": fail_cuda(is_skip=True)` here: https://github.com/pytorch/pytorch/blob/d3fcb1717c17a3c3541b67f829c0699e60eb0f3b/test/inductor/test_aot_inductor.py#L2421C13-L2421C67 to skip these unit tests only for the "cuda" device_type on ROCm.
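For illustration, here is a minimal self-contained sketch of both suggestions. The real helpers live in PyTorch's test suite (`skipIfRocm` in `torch.testing._internal.common_utils`, and the `fail_cuda`/failure-table pattern in `test/inductor/test_aot_inductor.py`); the mock versions below only show the shape of each approach and are not the actual PyTorch implementations:

```python
import os
import unittest

# Assumed ROCm detection for this sketch; PyTorch's test suite has its own flag.
TEST_WITH_ROCM = os.environ.get("PYTORCH_TEST_WITH_ROCM") == "1"

# Mock of a skipIfRocm-style decorator: skip the decorated test
# whenever the suite runs under ROCm.
def skipIfRocm(fn):
    return unittest.skipIf(TEST_WITH_ROCM, "test doesn't currently work on ROCm")(fn)

class AOTInductorTests(unittest.TestCase):
    @skipIfRocm
    def test_reuse_kernel_dynamic(self):
        self.assertTrue(True)  # placeholder body for the sketch

# Mock of the table-driven alternative: mark a single test as a
# skip/expected failure for one device type only.
def fail_cuda(is_skip=False):
    return {"device_type": "cuda", "is_skip": is_skip}

CUDA_TEST_FAILURES = {
    "test_reuse_kernel_dynamic": fail_cuda(is_skip=True),
}
```

The table-driven form is the narrower tool: it disables the test only for the "cuda" device type, whereas the decorator skips it on ROCm for every device the test class is instantiated with.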
Closing this as the fix was already re-landed in #123136
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang