
[aoti][reland] clear precomputed symbol replacements before cpp wrapper compilation #123136

Closed
wants to merge 5 commits

Conversation

chenyang78 (Contributor) commented Apr 1, 2024

Stack from ghstack (oldest at bottom):

After we codegen a triton kernel in the triton codegen backend,
we cache the generated triton source code in the wrapper to avoid
producing multiple triton kernels with the same content.

In the AOTI compilation flow, this caching mechanism imposes a strong requirement
on the codegen: we must generate the same triton source code
for the same schedule node in both the python and cpp codegen phases.
Otherwise, we would end up with a mismatch between the kernel name
formed in the cpp codegen and the cuda kernel key produced from
the python codegen. Consequently, we would hit a missing-cuda-kernel
error.

The precomputed symbol replacements saved in V.graph.sizevars
can cause such a source-code inconsistency in the code for indexing
tensors. For example, let's say that in the python codegen phase
we produce "ks2*48" as part of indexing an input for schedule
node A, while also recording the replacement pair "ks0 -> ks2*48" in
the precomputed replacements. In the second, cpp codegen phase,
we would then produce "ks0" for the same indexing code of schedule
node A because of the "ks0 -> ks2*48" replacement pair.
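
To make the inconsistency concrete, here is a small, self-contained sympy sketch. It is illustrative only: the two dicts merely mirror the precomputed_replacements / inv_precomputed_replacements pair, and none of this is Inductor's actual code.

```python
import sympy

ks0, ks2 = sympy.symbols("ks0 ks2", integer=True, positive=True)

# Pass 1 (python wrapper codegen): the indexing expression is emitted as the
# product, and a replacement pair "ks0 -> ks2*48" is recorded as a side effect.
index_expr = ks2 * 48
precomputed = {ks0: ks2 * 48}      # forward mapping (illustrative)
inv_precomputed = {ks2 * 48: ks0}  # inverse mapping consulted during codegen

print(index_expr)                  # 48*ks2 -> appears in the triton source

# Pass 2 (cpp wrapper codegen): the same expression now matches the cached
# inverse replacement, so the substituted symbol is emitted instead.
print(index_expr.xreplace(inv_precomputed))  # ks0 -> different triton source
```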

This PR fixes the issue by clearing precomputed_replacements
and inv_precomputed_replacements before cpp wrapper codegen.
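
A minimal sketch of the idea, assuming a simplified driver: graph.codegen_wrapper is a hypothetical stand-in for Inductor's real entry points, and only the two attribute names on the sizevars object come from this PR.

```python
def aoti_codegen(graph):
    # Pass 1: python wrapper codegen. Simplifying indexing expressions may
    # populate graph.sizevars.precomputed_replacements (e.g. ks0 -> ks2*48)
    # together with its inverse mapping.
    python_wrapper = graph.codegen_wrapper(cpp_wrapper=False)  # hypothetical call

    # Clear the cached replacements so the second pass re-derives the same
    # indexing expressions ("ks2*48") instead of the substituted symbol ("ks0"),
    # keeping the generated triton source, and hence the kernel keys, identical.
    graph.sizevars.precomputed_replacements.clear()
    graph.sizevars.inv_precomputed_replacements.clear()

    # Pass 2: cpp wrapper codegen. Kernel names formed here now match the cuda
    # kernel keys recorded during pass 1, avoiding the missing-cuda-kernel error.
    cpp_wrapper = graph.codegen_wrapper(cpp_wrapper=True)  # hypothetical call
    return python_wrapper, cpp_wrapper
```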

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot bot commented Apr 1, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/123136

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit a408d2a with merge base 26bf05c:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

chenyang78 added a commit that referenced this pull request Apr 1, 2024
…er compilation

chenyang78 added a commit that referenced this pull request Apr 1, 2024
…er compilation

chenyang78 (Contributor, Author) commented:

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Apr 2, 2024
pytorchmergebot (Collaborator) commented:

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


chenyang78 (Contributor, Author) commented:

@pytorchbot merge

pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

albanD (Collaborator) commented Apr 2, 2024

@pytorchbot revert -m "broke ROCm CI" -c "nosignal"

pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot (Collaborator) commented:

@chenyang78 your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Apr 2, 2024
…pp wrapper compilation (#123136)"

This reverts commit 7eadb15.

Reverted #123136 on behalf of https://github.com/albanD due to "broke ROCm CI"
jithunnair-amd (Collaborator) commented:

@chenyang78 I'm not sure why this PR was filed; the original PR was reopened and I commented on it suggesting how to avoid the ROCm CI breakages: #122882 (comment). Please update the original PR and reland it.

chenyang78 (Contributor, Author) commented:

@chenyang78 I'm not sure why this PR was filed; the original PR was reopened and I commented on it suggesting how to avoid the ROCm CI breakages: #122882 (comment). Please update the original PR and reland it.

@jithunnair-amd I am really sorry that I missed your comment! I made the suggested changes in this PR and will close the original one. BTW, the original PR was reverted because it caused some internal issue.

chenyang78 added a commit that referenced this pull request Apr 4, 2024
…er compilation

chenyang78 added a commit that referenced this pull request Apr 4, 2024
…er compilation

chenyang78 (Contributor, Author) commented:

@jithunnair-amd Hmm, it seems the ROCm CI is currently broken. Just want to check whether you are aware of this. I see the following failure, which is unlikely to be related to my changes in this PR:

2024-04-04T20:54:07.8252520Z FAILED: caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/sparse/hip/torch_hip_generated_SparseSemiStructuredTile.hip.o /var/lib/jenkins/workspace/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/sparse/hip/torch_hip_generated_SparseSemiStructuredTile.hip.o
2024-04-04T20:54:07.8259039Z cd /var/lib/jenkins/workspace/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/sparse/hip && /opt/conda/envs/py_3.8/bin/cmake -E make_directory /var/lib/jenkins/workspace/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/sparse/hip/. && /opt/conda/envs/py_3.8/bin/cmake -D verbose:BOOL=OFF -D build_configuration:STRING=RELEASE -D generated_file:STRING=/var/lib/jenkins/workspace/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/sparse/hip/./torch_hip_generated_SparseSemiStructuredTile.hip.o -P /var/lib/jenkins/workspace/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/sparse/hip/torch_hip_generated_SparseSemiStructuredTile.hip.o.cmake
2024-04-04T20:54:07.8264043Z In file included from /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/hip/SparseSemiStructuredTile.hip:6:
2024-04-04T20:54:07.8265380Z In file included from /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/hip/ComputeSparseTile.h:4:
2024-04-04T20:54:07.8266695Z In file included from /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/hip/SparseSemiStructuredPack.h:4:
2024-04-04T20:54:07.8268355Z /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/hip/StaticSort.h:3:10: fatal error: 'cutlass/cutlass.h' file not found
2024-04-04T20:54:07.8269378Z #include <cutlass/cutlass.h>
2024-04-04T20:54:07.8269722Z          ^~~~~~~~~~~~~~~~~~~
2024-04-04T20:54:07.8270145Z 1 error generated when compiling for host.
2024-04-04T20:54:07.8270845Z CMake Error at torch_hip_generated_SparseSemiStructuredTile.hip.o.cmake:146 (message):
2024-04-04T20:54:07.8271521Z   Error generating
2024-04-04T20:54:07.8272598Z   /var/lib/jenkins/workspace/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/sparse/hip/./torch_hip_generated_SparseSemiStructuredTile.hip.o

chenyang78 added a commit that referenced this pull request Apr 8, 2024
…er compilation

chenyang78 (Contributor, Author) commented:

@pytorchbot merge

pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
…er compilation (pytorch#123136)

Pull Request resolved: pytorch#123136
Approved by: https://github.com/desertfire
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
…pp wrapper compilation (pytorch#123136)"

This reverts commit 7eadb15.

Reverted pytorch#123136 on behalf of https://github.com/albanD due to "broke ROCm CI"
sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
…er compilation (pytorch#123136)

Pull Request resolved: pytorch#123136
Approved by: https://github.com/desertfire
github-actions bot deleted the gh/chenyang78/20/head branch May 9, 2024 02:15
5 participants