[AOTI] Introduce DeferredCudaKernelLine for cuda cpp wrapper #129135

desertfire · 2024-06-20T13:49:51Z

Stack from ghstack (oldest at bottom):

-> [AOTI] Introduce DeferredCudaKernelLine for cuda cpp wrapper #129135

Summary: When generating the CUDA kernel load and launch code, certain Triton kernel metadata is needed, but that metadata only exists after kernel auto-tuning is done. DeferredCudaKernelLine is a deferred line that backfills a string template once auto-tuning has finished. This prepares for the one-pass AOTI codegen implementation.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @chauhang

Differential Revision: D61018114
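
The deferred-line idea in the summary can be illustrated with a small standalone sketch. This is not the code in this PR: `DeferredTemplateLine`, `finalize`, the metadata field names, and the `loadKernel(...)` template string are all hypothetical and chosen only for illustration. The sketch assumes the wrapper is held as a list of lines, where a deferred entry stores a `%`-style template plus the metadata keys it needs, and is turned into a plain string only after the tuned metadata is available.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class DeferredTemplateLine:
    """Hypothetical stand-in for a deferred wrapper line (not PyTorch's class)."""

    kernel_name: str        # key used to look up the tuned kernel's metadata
    template: str           # %-style template; avoids clashing with C++ braces
    keys: Tuple[str, ...]   # metadata fields to substitute, in template order

    def resolve(self, tuned_metadata: Dict[str, Dict[str, object]]) -> str:
        # Backfill the template with values that only exist after auto-tuning.
        meta = tuned_metadata[self.kernel_name]
        return self.template % tuple(meta[k] for k in self.keys)


Line = Union[str, DeferredTemplateLine]


def finalize(lines: List[Line], tuned_metadata: Dict[str, Dict[str, object]]) -> List[str]:
    """Replace every deferred entry with its resolved string; keep plain lines."""
    return [
        line.resolve(tuned_metadata) if isinstance(line, DeferredTemplateLine) else line
        for line in lines
    ]


# Wrapper lines are emitted before tuning; the kernel-load line is deferred.
lines: List[Line] = [
    "// kernel load",
    DeferredTemplateLine(
        kernel_name="triton_poi_fused_add_0",
        template='kernel = loadKernel("%s", "%s", %s);',
        keys=("cubin_path", "mangled_name", "shared_mem"),
    ),
]

# Illustrative metadata, as if produced by an auto-tuning step.
tuned = {
    "triton_poi_fused_add_0": {
        "cubin_path": "/tmp/add.cubin",
        "mangled_name": "triton_add_0d1d2",
        "shared_mem": 0,
    }
}

code = finalize(lines, tuned)
```

Keeping the placeholder in the line buffer lets the wrapper be written in a single pass, with only the deferred entries revisited after tuning, which is the property the summary attributes to the deferred line.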

pytorch-bot · 2024-06-20T13:49:54Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129135

❌ 1 New Failure

As of commit dc0e943 with merge base 92151c8:

NEW FAILURE - The following job has failed:

linux-aarch64 / linux-jammy-aarch64-py3.10 / test (default, 1, 4, linux.arm64.2xlarge) (gh)
'test/test_transformers.py::TestSDPACpuOnlyCPU::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_1030_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_1_train_False_casual_True_set_attn_mask_False_cpu_bfloat16'

This comment was automatically generated by Dr. CI and updates every 15 minutes.

desertfire · 2024-08-14T13:19:30Z

@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

eellison · 2024-08-14T22:49:28Z

I can review if not, but maybe one of the other people working on AOT Inductor from ae gpu can review?

torch/_inductor/codegen/cpp_wrapper_cuda.py

Summary: When generating CUDA kernel load and launch, certain Triton kernel meta data are needed, but those meta data only exist after kernel auto-tune is done. DeferredCudaKernelLine is a deferred line which can backfill a string template after kernel auto-tune. ghstack-source-id: f35f762 Pull Request resolved: #129135

desertfire · 2024-08-19T18:31:20Z

@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-08-20T02:14:07Z

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

pytorchmergebot · 2024-08-20T02:15:36Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Summary: Similar to #129135, use DeferredCudaGridLine to create a deferred grid computation line when generating cpp wrapper. Differential Revision: [D61800622](https://our.internmc.facebook.com/intern/diff/D61800622) Pull Request resolved: #129268 Approved by: https://github.com/angelayi

desertfire mentioned this pull request Jun 20, 2024

[AOTI] Auto-tune Triton kernels in a seperate block #129057

Closed

desertfire mentioned this pull request Jun 20, 2024

[AOTI] Remove the epilogue for generating non-triggered kernels #129134

Closed

pytorch-bot bot added ciflow/inductor module: inductor labels Jun 20, 2024

desertfire added the topic: not user facing topic category label Jun 20, 2024

desertfire mentioned this pull request Jun 20, 2024

[AOTI][refactor] Remove GridExprCppPrinter #129142

Closed

desertfire added a commit that referenced this pull request Jun 21, 2024

[AOTI] Introduce DeferredCudaGridLine

89f1e4f

Summary: Similar to #129135, use DeferredCudaGridLine to create a deferred grid computation line when generating cpp wrapper. [ghstack-poisoned]

This was referenced Jun 21, 2024

[AOTI] Introduce DeferredCudaGridLine for cuda cpp wrapper #129268

Closed

[AOTI][refactor] Move generate_user_defined_triton_kernel #129267

Closed

This was referenced Jun 24, 2024

[AOTI] Switch the CUDA codegen to one-pass #129342

Closed

[AOTI][not for review] Test cpp_wrapper mode #129345

Closed

desertfire mentioned this pull request Jun 25, 2024

[AOTI][refactor] Unify UserDefinedTritonKernel.codegen #129378

Closed

desertfire requested a review from eellison August 14, 2024 13:08

eellison removed their request for review August 14, 2024 22:49

desertfire requested review from ColinPeppler and aakhundov August 15, 2024 15:49

angelayi approved these changes Aug 19, 2024

torch/_inductor/codegen/cpp_wrapper_cuda.py (outdated, resolved)

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 19, 2024

desertfire added the ciflow/linux-aarch64 linux aarch64 CI workflow label Aug 19, 2024

pytorchmergebot added the merging label Aug 20, 2024

pytorchmergebot added the Merged label Aug 20, 2024

pytorchmergebot closed this in 6c82a1c Aug 20, 2024

pytorchmergebot removed the merging label Aug 20, 2024

github-actions bot deleted the gh/desertfire/415/head branch September 29, 2024 02:11
