
[aoti] clear precomputed symbol replacements before cpp wrapper compilation #122882

Closed
wants to merge 2 commits

Conversation

chenyang78
Contributor

@chenyang78 chenyang78 commented Mar 28, 2024

Stack from ghstack (oldest at bottom):

After we codegen a triton kernel in the triton codegen backend,
we cache the generated triton source code in the wrapper to avoid
producing multiple triton kernels with the same content.
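This deduplication can be sketched as follows (a minimal illustration with invented names, not the actual inductor implementation): the wrapper keeps a map from generated source code to kernel name, so identical source is defined only once.

```python
# Minimal sketch (names hypothetical): the wrapper caches generated triton
# source code so that identical kernels are defined only once.
class WrapperKernelCache:
    def __init__(self):
        self._src_to_name = {}  # generated source code -> kernel name

    def define_kernel(self, src: str) -> str:
        # Reuse the existing kernel name if we already generated
        # source code with exactly the same content.
        if src not in self._src_to_name:
            self._src_to_name[src] = f"triton_kernel_{len(self._src_to_name)}"
        return self._src_to_name[src]

cache = WrapperKernelCache()
a = cache.define_kernel("tl.store(out, tl.load(x) + 1)")
b = cache.define_kernel("tl.store(out, tl.load(x) + 1)")  # same content -> same name
c = cache.define_kernel("tl.store(out, tl.load(x) * 2)")  # new content -> new name
```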

In the AOTI compilation flow, this caching mechanism imposes a strong
requirement on the codegen: we must generate the same triton source code
for the same schedule node in both the python and cpp codegen phases.
Otherwise, we end up with a mismatch between the kernel name
formed in the cpp codegen and the cuda kernel key produced by
the python codegen, and consequently hit a missing-cuda-kernel
error.

The precomputed symbol replacements saved in V.graph.sizevars
can cause exactly this kind of source-code inconsistency in
tensor-indexing code. For example, suppose that in the python codegen phase
we produce "ks2*48" as part of indexing an input for schedule
node A, while recording the replacement pair "ks0 -> ks2*48" in
the precomputed replacements. In the subsequent cpp codegen phase,
we would instead produce "ks0" for the same indexing code of schedule
node A, because of that "ks0 -> ks2*48" replacement pair.
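The inconsistency can be reproduced with a small sympy sketch (the helper below is hypothetical, not the inductor code): once the replacement pair is recorded, the second phase prints a different string for the same symbolic expression.

```python
import sympy

ks0, ks2 = sympy.symbols("ks0 ks2")
precomputed_replacements = {}  # precomputed symbol -> replaced expression

def codegen_index(expr):
    # Any recorded replacement rewrites the expression back to the
    # precomputed symbol, changing the emitted source text.
    for sym, val in precomputed_replacements.items():
        expr = expr.subs(val, sym)
    return str(expr)

# Phase 1 (python wrapper): emit the raw expression, then record a replacement.
first = codegen_index(ks2 * 48)
precomputed_replacements[ks0] = ks2 * 48

# Phase 2 (cpp wrapper): the same expression now renders as "ks0", so the
# two phases produce different kernel source for the same schedule node.
second = codegen_index(ks2 * 48)
```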

This PR fixes the issue by clearing precomputed_replacements
and inv_precomputed_replacements before cpp wrapper codegen.
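A minimal sketch of the fix (class and function names are simplified stand-ins for the inductor code, so treat them as illustrative): reset both replacement caches right before the cpp wrapper pass, forcing it to re-derive the same expressions as the python pass.

```python
# Sketch only: a stand-in for V.graph.sizevars with the two caches
# named in this PR.
class SizeVarAllocator:
    def __init__(self):
        self.precomputed_replacements = {}
        self.inv_precomputed_replacements = {}

def run_cpp_wrapper_codegen(sizevars):
    # Clear replacements accumulated during the python-wrapper phase so
    # the cpp phase generates identical indexing expressions.
    sizevars.precomputed_replacements.clear()
    sizevars.inv_precomputed_replacements.clear()
    # ... proceed with cpp wrapper code generation ...

sv = SizeVarAllocator()
sv.precomputed_replacements["ks0"] = "ks2*48"
sv.inv_precomputed_replacements["ks2*48"] = "ks0"
run_cpp_wrapper_codegen(sv)
```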

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot bot commented Mar 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122882

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 2 Unrelated Failures

As of commit d3fcb17 with merge base 8c8e4e3:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

chenyang78 added a commit that referenced this pull request Mar 28, 2024
ghstack-source-id: d2e53ecf2c418940595e45dec7f70f05f1c698fc
Pull Request resolved: #122882
chenyang78 added a commit that referenced this pull request Mar 28, 2024
ghstack-source-id: a9e12d7e3d839120c8cfed7d50b24924880215ba
Pull Request resolved: #122882
@chenyang78 chenyang78 added the topic: not user facing topic category label Mar 28, 2024
@chenyang78
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 28, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@chenyang78
Contributor Author

@pytorchbot merge -f "pre-existing failures"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@jithunnair-amd
Collaborator

@pytorchbot revert -m "broke ROCm CI" -c "nosignal"

@jithunnair-amd
Collaborator

jithunnair-amd commented Mar 29, 2024

Added ciflow/rocm label to surface the ROCm CI failures observed in the HUD: https://hud.pytorch.org/pytorch/pytorch/commit/384de46395234e793a319325e5c9d20a60407a64

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Mar 29, 2024
…er compilation (#122882)"

This reverts commit 384de46.

Reverted #122882 on behalf of https://github.com/jithunnair-amd due to broke ROCm CI.
@pytorchmergebot
Collaborator

@chenyang78 your PR has been successfully reverted.

@@ -1138,6 +1138,70 @@ def forward(self, x, y):
exactly=True,
).run(src_code)

def test_reuse_kernel_dynamic(self):
Collaborator


@chenyang78 Suggest adding @skipIfRocm here to skip these new unit tests for ROCm and keep CI green. We will take a look at these unit tests and try to enable them later.

Collaborator


Alternately, and preferably, you can add an entry "test_reuse_kernel_dynamic ": fail_cuda(is_skip=True), here: https://github.com/pytorch/pytorch/blob/d3fcb1717c17a3c3541b67f829c0699e60eb0f3b/test/inductor/test_aot_inductor.py#L2421C13-L2421C67
to skip these unit tests only for "cuda" device_type on ROCm.

@chenyang78
Contributor Author

Closing this as the fix was already re-landed in #123136

@chenyang78 chenyang78 closed this Apr 10, 2024
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
Pull Request resolved: pytorch#122882
Approved by: https://github.com/desertfire
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
@github-actions github-actions bot deleted the gh/chenyang78/18/head branch May 11, 2024 01:52