Skip kernel saving if already existed. #136389
Conversation
Summary: Skip
Test Plan: TBD
Reviewers:
Subscribers:
Tasks:
Tags:
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136389
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures
As of commit 2dd8d48 with merge base 08dba25.
NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Summary: Skip
Test Plan: TBD
Reviewers:
Subscribers:
Tasks:
Tags:

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
@muchulee8 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased
This should only matter for AOTI (or JIT+cpp_wrapper) where store_cubin is True?
Yes, this diff only matters when store_cubin is True. This is intentional, to keep the blast radius small. We should refactor triton_heuristic if we want a really accurate result; there is one more non-trivial overhead, the grid_fn computation, which can reach ~15% of the exact kernel runtime, but that is not as bad as this save_gpu_kernel (which is more than 100%).
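A minimal sketch of the guard described above, under stated assumptions: only store_cubin and save_gpu_kernel are named in this discussion; the class name, the _gpu_kernel_saved flag, and the launcher signature below are hypothetical and this is not the actual Inductor implementation.

```python
# Hedged sketch, not the real Inductor code.
class CachingLauncherSketch:
    def __init__(self, store_cubin: bool):
        self.store_cubin = store_cubin      # True for AOTI / JIT+cpp_wrapper
        self._gpu_kernel_saved = False      # hypothetical "already saved" guard

    def save_gpu_kernel(self, grid, stream, launcher):
        # Expensive path: hash computation plus recording the kernel for packaging.
        self._gpu_kernel_saved = True

    def run(self, *args, grid, stream, launcher):
        # Skip the save once it has happened, so repeated profiling calls
        # measure the kernel itself rather than the save/hash overhead.
        if self.store_cubin and not self._gpu_kernel_saved:
            self.save_gpu_kernel(grid, stream, launcher)
        return launcher(*args, grid=grid, stream=stream)
```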
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
Summary:
Pull Request resolved: #136389

We skip save_gpu_kernel if the kernel has already been saved. This gives us a more accurate Triton profiling result. The following trace shows before/after the change when benchmarking a trivial addmm:

Before: {F1889997034}
After: {F1889997398}

We can see that before the change, the benchmarking includes two parts:
(1) The overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation.
(2) The actual computation of the Triton kernel.
We see that (1) accounts for >50% of the time, which often makes kernel selection for profiling choose aten kernels over Triton kernels.

Test Plan:
Existing OSS CI

IG_CTR (model id: 637300633) run:
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MEMORY_PLANNING=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=a100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/637300633/0/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend="AOT_INDUCTOR"
```
Before:
```
== Benchmark Result for: Configuration(batch_iter=100, batch_size=2048, name='Test: AOTInductor', trt=False, ait=False, eager=False, jit=False, lower_dtype=torch.float16, accuracy_rtol=0.01, explicit_batch_dimension=True, report_aibench=False, verbose_profile=False, time_tensor_and_align=0, fx_time_tensor=False, use_cuda_graph=False, remove_passes=None, ait_profile=False, inductor=False, aot_inductor=True, aot_inductor_ep=False, num_threads=1, gpu_trace=True, op_level_profiling=False, additional_batch_size=[1, 512, 1024])
BS: 2048, MFLOPS/BS: 729.04, TFLOP/s: 68.53, Time per iter: 21.79ms, Threads: 1, QPS: 94000.50, Accuracy: True (rtol=0.01), AOT_INDUCTOR lowering duration: 545.33s
```
After:
```
== Benchmark Result for: Configuration(batch_iter=100, batch_size=2048, name='Test: AOTInductor', trt=False, ait=False, eager=False, jit=False, lower_dtype=torch.float16, accuracy_rtol=0.01, explicit_batch_dimension=True, report_aibench=False, verbose_profile=False, time_tensor_and_align=0, fx_time_tensor=False, use_cuda_graph=False, remove_passes=None, ait_profile=False, inductor=False, aot_inductor=True, aot_inductor_ep=False, num_threads=1, gpu_trace=True, op_level_profiling=False, additional_batch_size=[1, 512, 1024])
BS: 2048, MFLOPS/BS: 729.04, TFLOP/s: 72.40, Time per iter: 20.62ms, Threads: 1, QPS: 99303.27, Accuracy: True (rtol=0.01), AOT_INDUCTOR lowering duration: 528.94s
```

CMF (model id: 642218919) run:
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MEMORY_PLANNING=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=a100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/642218919/58/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend="AOT_INDUCTOR"
```
Before:
```
== Benchmark Result for: Configuration(batch_iter=100, batch_size=2048, name='Test: AOTInductor', trt=False, ait=False, eager=False, jit=False, lower_dtype=torch.float16, accuracy_rtol=0.01, explicit_batch_dimension=True, report_aibench=False, verbose_profile=False, time_tensor_and_align=0, fx_time_tensor=False, use_cuda_graph=False, remove_passes=None, ait_profile=False, inductor=False, aot_inductor=True, aot_inductor_ep=False, num_threads=1, gpu_trace=True, op_level_profiling=False, additional_batch_size=[1, 512, 1024])
BS: 2048, MFLOPS/BS: 658.92, TFLOP/s: 82.71, Time per iter: 16.32ms, Threads: 1, QPS: 125515.74, Accuracy: True (rtol=0.01), AOT_INDUCTOR lowering duration: 1027.37s
```
After:
```
== Benchmark Result for: Configuration(batch_iter=100, batch_size=2048, name='Test: AOTInductor', trt=False, ait=False, eager=False, jit=False, lower_dtype=torch.float16, accuracy_rtol=0.01, explicit_batch_dimension=True, report_aibench=False, verbose_profile=False, time_tensor_and_align=0, fx_time_tensor=False, use_cuda_graph=False, remove_passes=None, ait_profile=False, inductor=False, aot_inductor=True, aot_inductor_ep=False, num_threads=1, gpu_trace=True, op_level_profiling=False, additional_batch_size=[1, 512, 1024])
BS: 2048, MFLOPS/BS: 658.92, TFLOP/s: 88.50, Time per iter: 15.25ms, Threads: 1, QPS: 134303.07, Accuracy: True (rtol=0.01), AOT_INDUCTOR lowering duration: 996.76s
```

Reviewed By: frank-wei

Differential Revision: D63441430
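For quick reference, a small Python snippet deriving the end-to-end QPS gains from the figures quoted above; the percentages are computed here and are not part of the original report.

```python
# Derived arithmetic only: QPS values copied from the benchmark output above.
results = {
    "IG_CTR": (94000.50, 99303.27),
    "CMF": (125515.74, 134303.07),
}
for name, (before, after) in results.items():
    print(f"{name}: {(after - before) / before * 100:.1f}% QPS improvement")
# IG_CTR: 5.6% QPS improvement
# CMF: 7.0% QPS improvement
```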
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Summary: We skip save_gpu_kernel if the kernel has already been saved. This gives us a more accurate Triton profiling result. The following trace shows before/after the change when benchmarking a trivial addmm:

Before:
After:

We can see that before the change, the benchmarking includes two parts:
(1) The overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation.
(2) The actual computation of the Triton kernel.
We see that (1) accounts for >50% of the time, which makes kernel selection for profiling choose aten kernels over Triton kernels.

Test Plan: Existing OSS CI

Reviewers:
Subscribers:
Tasks:
Tags:
[ghstack-poisoned]
Summary: We skip save_gpu_kernel if the kernel has already been saved. This gives us a more accurate Triton profiling result. The following trace shows before/after the change when benchmarking a trivial addmm:

Before:
![Screenshot 2024-09-23 at 10 26 53 AM](https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a)

After:
![Screenshot 2024-09-23 at 10 27 03 AM](https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118)

We can see that before the change, the benchmarking includes two parts:
(1) The overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation.
(2) The actual computation of the Triton kernel.
We see that (1) accounts for >50% of the time, which makes kernel selection for profiling choose aten kernels over Triton kernels.

Test Plan:
Existing OSS CI
python test/inductor/test_cuda_cpp_wrapper.py

Reviewers:
Subscribers:
Tasks:
Tags:

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
Summary: We skip save_gpu_kernel if the kernel has already been saved. This gives us a more accurate Triton profiling result. The following trace shows before/after the change when benchmarking a trivial addmm:

Before:
After:

We can see that before the change, the benchmarking includes two parts:
(1) The overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation.
(2) The actual computation of the Triton kernel.
We see that (1) accounts for >50% of the time, which makes kernel selection for profiling choose aten kernels over Triton kernels.

Test Plan: Existing OSS CI

Reviewers:
Subscribers:
Tasks:
Tags:

ghstack-source-id: b920462
Pull Request resolved: #137073
@pytorchbot revert -m Issue #136940
❌ 🤖 pytorchbot command failed:
Try
@pytorchbot revert -m Issue #136940 -c nosignal
❌ 🤖 pytorchbot command failed:
Try
@pytorchbot revert -m "Issue #136940 " -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit 2521cd3. Reverted #136389 on behalf of https://github.com/muchulee8 due to Issue #136940.
@muchulee8 your PR has been successfully reverted.
Summary: We skip save_gpu_kernel if the kernel has already been saved. This gives us a more accurate Triton profiling result. The following trace shows before/after the change when benchmarking a trivial addmm:

Before:
After:

We can see that before the change, the benchmarking includes two parts:
(1) The overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation.
(2) The actual computation of the Triton kernel.
We see that (1) accounts for >50% of the time, which makes kernel selection for profiling choose aten kernels over Triton kernels.

Test Plan: Existing OSS CI

Reviewers:
Subscribers:
Tasks:
Tags:

ghstack-source-id: b182949
Pull Request resolved: #137073
Summary: We skip save_gpu_kernel if the kernel has already been saved. This gives us a more accurate Triton profiling result. The following trace shows before/after the change when benchmarking a trivial addmm:

Before:
![Screenshot 2024-09-23 at 10 26 53 AM](https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a)

After:
![Screenshot 2024-09-23 at 10 27 03 AM](https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118)

We can see that before the change, the benchmarking includes two parts:
(1) The overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation.
(2) The actual computation of the Triton kernel.
We see that (1) accounts for >50% of the time, which makes kernel selection for profiling choose aten kernels over Triton kernels.

Test Plan:
Existing OSS CI
python test/inductor/test_cuda_cpp_wrapper.py

Reviewers:
Subscribers:
Tasks:
Tags:

Pull Request resolved: #137073
Approved by: https://github.com/desertfire
This reverts commit 2521cd3. Reverted pytorch#136389 on behalf of https://github.com/muchulee8 due to Issue pytorch#136940.
…h#137073)

Summary: We skip save_gpu_kernel if the kernel has already been saved. This gives us a more accurate Triton profiling result. The following trace shows before/after the change when benchmarking a trivial addmm:

Before:
![Screenshot 2024-09-23 at 10 26 53 AM](https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a)

After:
![Screenshot 2024-09-23 at 10 27 03 AM](https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118)

We can see that before the change, the benchmarking includes two parts:
(1) The overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation.
(2) The actual computation of the Triton kernel.
We see that (1) accounts for >50% of the time, which makes kernel selection for profiling choose aten kernels over Triton kernels.

Test Plan:
Existing OSS CI
python test/inductor/test_cuda_cpp_wrapper.py

Reviewers:
Subscribers:
Tasks:
Tags:

Pull Request resolved: pytorch#137073
Approved by: https://github.com/desertfire
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as
@muchulee8 are we going to land this? My understanding was that excessive
Stack from ghstack (oldest at bottom):
Summary:
We skip save_gpu_kernel if the kernel has already been saved.
This gives us a more accurate Triton profiling result. The following trace shows before/after the change when benchmarking a trivial addmm:
Before:

After:

We can see that before the change, the benchmarking includes two parts:
(1) The overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation.
(2) The actual computation of the Triton kernel.
We see that (1) accounts for >50% of the time, which often makes kernel selection for profiling choose aten kernels over Triton kernels.
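To make the selection effect concrete, here is an illustrative timing model with made-up numbers chosen to mirror the ">50% overhead" observation above; they are not measurements from this PR.

```python
# Illustrative only: how per-call overhead can flip autotuning's choice.
triton_kernel_us = 10.0           # hypothetical pure Triton kernel time
save_and_hash_overhead_us = 12.0  # hypothetical save_gpu_kernel + hash cost per call
aten_kernel_us = 15.0             # hypothetical aten baseline

measured_triton_us = triton_kernel_us + save_and_hash_overhead_us  # 22.0 us
# With the overhead included, the Triton candidate appears slower than aten
# (22.0 > 15.0) and loses the selection, even though the kernel itself is
# faster (10.0 < 15.0). Skipping the redundant save removes this distortion.
assert triton_kernel_us < aten_kernel_us < measured_triton_us
```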
Test Plan:
Existing OSS CI
[Redacted, Some internal model results in D63441430]
Reviewers:
Subscribers:
Tasks:
Tags:
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @ColinPeppler @amjames @desertfire @chauhang @aakhundov @chuanhaozhuge