
Conversation

@muchulee8 (Contributor) commented Sep 20, 2024

Stack from ghstack (oldest at bottom):

Summary:
We skip save_gpu_kernel if the kernel has already been saved.
This gives us a more accurate Triton profiling result. The following traces show before/after the change for benchmarking a trivial addmm:

Before:
[screenshot: profiler trace before the change]

After:
[screenshot: profiler trace after the change]

We can see that before the change, the benchmarking includes two parts:
(1) the overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation;
(2) the actual computation of the Triton kernel.

We see that (1) accounts for >50% of the time, which makes kernel selection for profiling often choose aten kernels over Triton kernels.
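
Conceptually, the change is just a memoization guard around the save path in the benchmarking hook. A minimal sketch of the idea (illustrative names only, not the actual CachingAutotuner code in torch/_inductor):

```python
# Illustrative sketch only: names are hypothetical, not the actual
# torch/_inductor CachingAutotuner implementation.
import hashlib


class AutotunerSketch:
    def __init__(self) -> None:
        self._kernel_saved = False  # flips to True after the first save

    def _save_gpu_kernel(self, kernel_src: str) -> None:
        # Stand-in for the real save path: hashing the kernel source and
        # storing the cubin/metadata is the overhead that dominated the
        # benchmarked time before this change.
        _ = hashlib.sha256(kernel_src.encode()).hexdigest()

    def benchmark_once(self, kernel_src: str) -> None:
        # The fix: skip the save if this kernel was already saved, so repeated
        # benchmark calls measure (mostly) the kernel itself.
        if not self._kernel_saved:
            self._save_gpu_kernel(kernel_src)
            self._kernel_saved = True
        # ... launch and time the Triton kernel here ...
```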

Test Plan:
Existing OSS CI
[Redacted; some internal model results in D63441430]

Reviewers:

Subscribers:

Tasks:

Tags:

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @ColinPeppler @amjames @desertfire @chauhang @aakhundov @chuanhaozhuge


pytorch-bot bot commented Sep 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136389

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 2dd8d48 with merge base 08dba25:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

muchulee8 added a commit that referenced this pull request Sep 20, 2024
Summary: Skip
Test Plan: TBD
ghstack-source-id: c849f36
Pull Request resolved: #136389
muchulee8 added a commit that referenced this pull request Sep 23, 2024
Summary: Skip
Test Plan: TBD
ghstack-source-id: 884b31c
Pull Request resolved: #136389
@muchulee8 (Contributor, Author)

@muchulee8 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@muchulee8 (Contributor, Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased gh/muchulee8/36/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/136389)

pytorchmergebot pushed a commit that referenced this pull request Sep 24, 2024
Summary: Skip
Test Plan: TBD
ghstack-source-id: 608b503
Pull Request resolved: #136389
@muchulee8 changed the title from "[WIP] Skip kernel saving if already existed." to "Skip kernel saving if already existed." on Sep 26, 2024
@desertfire (Contributor) left a comment


This should only matter for AOTI (or JIT+cpp_wrapper) where store_cubin is True?

@muchulee8 (Contributor, Author)

> This should only matter for AOTI (or JIT+cpp_wrapper) where store_cubin is True?

Yes, this diff only matters when store_cubin is True. This is intentional, to keep the blast radius small. We should refactor triton_heuristic if we want a really accurate result; there is actually one more non-trivial overhead, the grid_fn computation, which can reach ~15% of the actual kernel runtime, but that is not as bad as this save_gpu_kernel (which is more than 100%).
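
To make the scope concrete, here is a hedged sketch of the gating described above; store_cubin and grid_fn are real Inductor concepts, but the helper below and its signature are illustrative, not the actual triton_heuristics code:

```python
# Illustrative sketch only: the helper name and signature are hypothetical;
# the condition mirrors the discussion above.
def maybe_save_gpu_kernel(store_cubin: bool, already_saved: bool, save_fn) -> bool:
    """Save the compiled GPU kernel at most once, and only when AOTI or
    JIT+cpp_wrapper needs the cubin persisted (store_cubin=True)."""
    if store_cubin and not already_saved:
        save_fn()          # the expensive hash + store path
        return True        # caller records that the kernel is now saved
    return already_saved
```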

@muchulee8 (Contributor, Author)

@pytorchbot merge

pytorch-bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Sep 26, 2024
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

pytorch-bot bot pushed a commit that referenced this pull request Sep 27, 2024
Summary:
Pull Request resolved: #136389

We skip save_gpu_kernel if the kernel has already been saved.
This gives us a more accurate Triton profiling result. The following traces show before/after the change for benchmarking a trivial addmm:

Before:
 {F1889997034}

After:
 {F1889997398}

We can see that before the change, the benchmarking includes two parts:
(1) the overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation;
(2) the actual computation of the Triton kernel.

We see that (1) accounts for >50% of the time, which makes kernel selection for profiling often choose aten kernels over Triton kernels.

Test Plan:
Existing OSS CI

IG_CTR (model id: 637300633) run:
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MEMORY_PLANNING=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=a100  caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/637300633/0/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend="AOT_INDUCTOR"
```
Before:
```
== Benchmark Result for: Configuration(batch_iter=100, batch_size=2048, name='Test: AOTInductor', trt=False, ait=False, eager=False, jit=False, lower_dtype=torch.float16, accuracy_rtol=0.01, explicit_batch_dimension=True, report_aibench=False, verbose_profile=False, time_tensor_and_align=0, fx_time_tensor=False, use_cuda_graph=False, remove_passes=None, ait_profile=False, inductor=False, aot_inductor=True, aot_inductor_ep=False, num_threads=1, gpu_trace=True, op_level_profiling=False, additional_batch_size=[1, 512, 1024])
BS: 2048, MFLOPS/BS: 729.04, TFLOP/s: 68.53, Time per iter: 21.79ms, Threads: 1, QPS: 94000.50, Accuracy: True (rtol=0.01), AOT_INDUCTOR lowering duration: 545.33s
```
After:
```
== Benchmark Result for: Configuration(batch_iter=100, batch_size=2048, name='Test: AOTInductor', trt=False, ait=False, eager=False, jit=False, lower_dtype=torch.float16, accuracy_rtol=0.01, explicit_batch_dimension=True, report_aibench=False, verbose_profile=False, time_tensor_and_align=0, fx_time_tensor=False, use_cuda_graph=False, remove_passes=None, ait_profile=False, inductor=False, aot_inductor=True, aot_inductor_ep=False, num_threads=1, gpu_trace=True, op_level_profiling=False, additional_batch_size=[1, 512, 1024])
BS: 2048, MFLOPS/BS: 729.04, TFLOP/s: 72.40, Time per iter: 20.62ms, Threads: 1, QPS: 99303.27, Accuracy: True (rtol=0.01), AOT_INDUCTOR lowering duration: 528.94s
```

CMF (model id: 642218919) run:
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MEMORY_PLANNING=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=a100  caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/642218919/58/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend="AOT_INDUCTOR"
```
Before:
```
== Benchmark Result for: Configuration(batch_iter=100, batch_size=2048, name='Test: AOTInductor', trt=False, ait=False, eager=False, jit=False, lower_dtype=torch.float16, accuracy_rtol=0.01, explicit_batch_dimension=True, report_aibench=False, verbose_profile=False, time_tensor_and_align=0, fx_time_tensor=False, use_cuda_graph=False, remove_passes=None, ait_profile=False, inductor=False, aot_inductor=True, aot_inductor_ep=False, num_threads=1, gpu_trace=True, op_level_profiling=False, additional_batch_size=[1, 512, 1024])
BS: 2048, MFLOPS/BS: 658.92, TFLOP/s: 82.71, Time per iter: 16.32ms, Threads: 1, QPS: 125515.74, Accuracy: True (rtol=0.01), AOT_INDUCTOR lowering duration: 1027.37s
```
After:
```
== Benchmark Result for: Configuration(batch_iter=100, batch_size=2048, name='Test: AOTInductor', trt=False, ait=False, eager=False, jit=False, lower_dtype=torch.float16, accuracy_rtol=0.01, explicit_batch_dimension=True, report_aibench=False, verbose_profile=False, time_tensor_and_align=0, fx_time_tensor=False, use_cuda_graph=False, remove_passes=None, ait_profile=False, inductor=False, aot_inductor=True, aot_inductor_ep=False, num_threads=1, gpu_trace=True, op_level_profiling=False, additional_batch_size=[1, 512, 1024])
BS: 2048, MFLOPS/BS: 658.92, TFLOP/s: 88.50, Time per iter: 15.25ms, Threads: 1, QPS: 134303.07, Accuracy: True (rtol=0.01), AOT_INDUCTOR lowering duration: 996.76s

```

Reviewed By: frank-wei

Differential Revision: D63441430
@muchulee8 (Contributor, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

muchulee8 added a commit that referenced this pull request Oct 1, 2024
muchulee8 added a commit that referenced this pull request Oct 1, 2024
muchulee8 added a commit that referenced this pull request Oct 1, 2024
ghstack-source-id: b920462
Pull Request resolved: #137073
@muchulee8 (Contributor, Author)

@pytorchbot revert -m Issue #136940


pytorch-bot bot commented Oct 1, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@muchulee8 (Contributor, Author)

@pytorchbot revert -m Issue #136940 -c nosignal


pytorch-bot bot commented Oct 1, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: https://github.com/pytorch/pytorch/issues/136940

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

@muchulee8 (Contributor, Author)

@pytorchbot revert -m "Issue #136940 " -c nosignal

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Oct 1, 2024
@pytorchmergebot (Collaborator)

@muchulee8 your PR has been successfully reverted.

pytorchmergebot pushed a commit that referenced this pull request Oct 2, 2024
ghstack-source-id: b182949
Pull Request resolved: #137073
pytorchmergebot pushed a commit that referenced this pull request Oct 2, 2024
Summary:
We skip save_gpu_kernel if the kernel has already been saved.
This gives us a more accurate Triton profiling result. The following traces show before/after the change for benchmarking a trivial addmm:

Before:
[screenshot: https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a]

After:
[screenshot: https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118]

We can see that before the change, the benchmarking includes two parts:
(1) the overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation;
(2) the actual computation of the Triton kernel.

We see that (1) accounts for >50% of the time, which makes kernel selection for profiling often choose aten kernels over Triton kernels.

Test Plan:
Existing OSS CI
python test/inductor/test_cuda_cpp_wrapper.py

Pull Request resolved: #137073
Approved by: https://github.com/desertfire
AnantGulati pushed a commit to AnantGulati/pytorch that referenced this pull request Oct 2, 2024
@eellison removed their request for review on October 3, 2024 21:40

github-actions bot commented Dec 2, 2024

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

github-actions bot added the Stale label on Dec 2, 2024
@aakhundov (Contributor)

@muchulee8 are we going to land this? My understanding was that the excessive save_gpu_kernel calls lead to a substantial distortion in autotuning results in AOTI scenarios.

github-actions bot closed this on Jan 8, 2025
github-actions bot deleted the gh/muchulee8/36/head branch on February 9, 2025
