
Conversation

@muchulee8 (Contributor) commented Sep 20, 2024

Stack from ghstack (oldest at bottom):

Summary:
We skip save_gpu_kernel if the kernel has already been saved.
This gives us a more accurate Triton profiling result. The following traces show before/after the change for benchmarking a trivial addmm:

Before:
[screenshot: profiler trace before the change]

After:
[screenshot: profiler trace after the change]

We can see that before the change, the benchmarking includes two parts:
(1) the overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation;
(2) the actual computation of the Triton kernel.

We see that (1) accounts for >50% of the time, which makes kernel selection for profiling often choose aten kernels over Triton kernels.
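
Conceptually, the change is just a memoization guard around the save path in the benchmarking hook. A minimal sketch of the idea (illustrative names only, not the actual CachingAutotuner code in torch/_inductor):

```python
# Illustrative sketch only: names are hypothetical, not the actual
# torch/_inductor CachingAutotuner implementation.
import hashlib


class AutotunerSketch:
    def __init__(self) -> None:
        self._kernel_saved = False  # flips to True after the first save

    def _save_gpu_kernel(self, kernel_src: str) -> None:
        # Stand-in for the real save path: hashing the kernel source and
        # storing the cubin/metadata is the overhead that dominated the
        # benchmarked time before this change.
        _ = hashlib.sha256(kernel_src.encode()).hexdigest()

    def benchmark_once(self, kernel_src: str) -> None:
        # The fix: skip the save if this kernel was already saved, so repeated
        # benchmark calls measure (mostly) the kernel itself.
        if not self._kernel_saved:
            self._save_gpu_kernel(kernel_src)
            self._kernel_saved = True
        # ... launch and time the Triton kernel here ...
```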

Test Plan:
Existing OSS CI
[Redacted; some internal model results in D63441430]

Reviewers:

Subscribers:

Tasks:

Tags:

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @ColinPeppler @amjames @desertfire @chauhang @aakhundov @chuanhaozhuge


pytorch-bot bot commented Sep 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136389

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 2dd8d48 with merge base 08dba25:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

muchulee8 added a commit that referenced this pull request Sep 20, 2024
Summary: Skip
Test Plan: TBD
ghstack-source-id: c849f36
Pull Request resolved: #136389
muchulee8 added a commit that referenced this pull request Sep 23, 2024
Summary: Skip
Test Plan: TBD
ghstack-source-id: 884b31c
Pull Request resolved: #136389
@muchulee8 (Contributor, Author)

@muchulee8 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@muchulee8 (Contributor, Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased gh/muchulee8/36/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/136389)

pytorchmergebot pushed a commit that referenced this pull request Sep 24, 2024
Summary: Skip
Test Plan: TBD
ghstack-source-id: 608b503
Pull Request resolved: #136389
@muchulee8 changed the title from "[WIP] Skip kernel saving if already existed." to "Skip kernel saving if already existed." on Sep 26, 2024
@desertfire (Contributor) left a comment


This should only matter for AOTI (or JIT+cpp_wrapper) where store_cubin is True?

@muchulee8 (Contributor, Author)

> This should only matter for AOTI (or JIT+cpp_wrapper) where store_cubin is True?

Yes, this diff only matters when store_cubin is True. This is intentional, to keep the blast radius small. We should refactor triton_heuristic if we want a really accurate result; there is actually one more non-trivial overhead, the grid_fn computation, which can reach ~15% of the actual kernel runtime, but that is not as bad as this save_gpu_kernel (which is more than 100%).
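
To make the scope concrete, here is a hedged sketch of the gating described above; store_cubin and grid_fn are real Inductor concepts, but the helper below and its signature are illustrative, not the actual triton_heuristics code:

```python
# Illustrative sketch only: the helper name and signature are hypothetical;
# the condition mirrors the discussion above.
def maybe_save_gpu_kernel(store_cubin: bool, already_saved: bool, save_fn) -> bool:
    """Save the compiled GPU kernel at most once, and only when AOTI or
    JIT+cpp_wrapper needs the cubin persisted (store_cubin=True)."""
    if store_cubin and not already_saved:
        save_fn()          # the expensive hash + store path
        return True        # caller records that the kernel is now saved
    return already_saved
```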

@muchulee8 (Contributor, Author)

@pytorchbot merge

pytorch-bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Sep 26, 2024
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

pytorch-bot bot pushed a commit that referenced this pull request Sep 27, 2024
Summary:
Pull Request resolved: #136389

We skip save_gpu_kernel if the kernel has already been saved.
This gives us a more accurate Triton profiling result. The following traces show before/after the change for benchmarking a trivial addmm:

Before:
 {F1889997034}

After:
 {F1889997398}

We can see that before the change, the benchmarking includes two parts:
(1) the overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation;
(2) the actual computation of the Triton kernel.

We see that (1) accounts for >50% of the time, which makes kernel selection for profiling often choose aten kernels over Triton kernels.

Test Plan:
Existing OSS CI

IG_CTR (model id: 637300633) run:
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MEMORY_PLANNING=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=a100  caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/637300633/0/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend="AOT_INDUCTOR"
```
Before:
```
== Benchmark Result for: Configuration(batch_iter=100, batch_size=2048, name='Test: AOTInductor', trt=False, ait=False, eager=False, jit=False, lower_dtype=torch.float16, accuracy_rtol=0.01, explicit_batch_dimension=True, report_aibench=False, verbose_profile=False, time_tensor_and_align=0, fx_time_tensor=False, use_cuda_graph=False, remove_passes=None, ait_profile=False, inductor=False, aot_inductor=True, aot_inductor_ep=False, num_threads=1, gpu_trace=True, op_level_profiling=False, additional_batch_size=[1, 512, 1024])
BS: 2048, MFLOPS/BS: 729.04, TFLOP/s: 68.53, Time per iter: 21.79ms, Threads: 1, QPS: 94000.50, Accuracy: True (rtol=0.01), AOT_INDUCTOR lowering duration: 545.33s
```
After:
```
== Benchmark Result for: Configuration(batch_iter=100, batch_size=2048, name='Test: AOTInductor', trt=False, ait=False, eager=False, jit=False, lower_dtype=torch.float16, accuracy_rtol=0.01, explicit_batch_dimension=True, report_aibench=False, verbose_profile=False, time_tensor_and_align=0, fx_time_tensor=False, use_cuda_graph=False, remove_passes=None, ait_profile=False, inductor=False, aot_inductor=True, aot_inductor_ep=False, num_threads=1, gpu_trace=True, op_level_profiling=False, additional_batch_size=[1, 512, 1024])
BS: 2048, MFLOPS/BS: 729.04, TFLOP/s: 72.40, Time per iter: 20.62ms, Threads: 1, QPS: 99303.27, Accuracy: True (rtol=0.01), AOT_INDUCTOR lowering duration: 528.94s
```

CMF (model id: 642218919) run:
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MEMORY_PLANNING=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 buck2 run mode/opt -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=a100  caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/642218919/58/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend="AOT_INDUCTOR"
```
Before:
```
== Benchmark Result for: Configuration(batch_iter=100, batch_size=2048, name='Test: AOTInductor', trt=False, ait=False, eager=False, jit=False, lower_dtype=torch.float16, accuracy_rtol=0.01, explicit_batch_dimension=True, report_aibench=False, verbose_profile=False, time_tensor_and_align=0, fx_time_tensor=False, use_cuda_graph=False, remove_passes=None, ait_profile=False, inductor=False, aot_inductor=True, aot_inductor_ep=False, num_threads=1, gpu_trace=True, op_level_profiling=False, additional_batch_size=[1, 512, 1024])
BS: 2048, MFLOPS/BS: 658.92, TFLOP/s: 82.71, Time per iter: 16.32ms, Threads: 1, QPS: 125515.74, Accuracy: True (rtol=0.01), AOT_INDUCTOR lowering duration: 1027.37s
```
After:
```
== Benchmark Result for: Configuration(batch_iter=100, batch_size=2048, name='Test: AOTInductor', trt=False, ait=False, eager=False, jit=False, lower_dtype=torch.float16, accuracy_rtol=0.01, explicit_batch_dimension=True, report_aibench=False, verbose_profile=False, time_tensor_and_align=0, fx_time_tensor=False, use_cuda_graph=False, remove_passes=None, ait_profile=False, inductor=False, aot_inductor=True, aot_inductor_ep=False, num_threads=1, gpu_trace=True, op_level_profiling=False, additional_batch_size=[1, 512, 1024])
BS: 2048, MFLOPS/BS: 658.92, TFLOP/s: 88.50, Time per iter: 15.25ms, Threads: 1, QPS: 134303.07, Accuracy: True (rtol=0.01), AOT_INDUCTOR lowering duration: 996.76s

```

Reviewed By: frank-wei

Differential Revision: D63441430
@muchulee8 (Contributor, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

muchulee8 added a commit that referenced this pull request Oct 1, 2024
muchulee8 added a commit that referenced this pull request Oct 1, 2024
muchulee8 added a commit that referenced this pull request Oct 1, 2024
ghstack-source-id: b920462
Pull Request resolved: #137073
@muchulee8 (Contributor, Author)

@pytorchbot revert -m Issue #136940


pytorch-bot bot commented Oct 1, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@muchulee8 (Contributor, Author)

@pytorchbot revert -m Issue #136940 -c nosignal


pytorch-bot bot commented Oct 1, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: https://github.com/pytorch/pytorch/issues/136940

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

@muchulee8 (Contributor, Author)

@pytorchbot revert -m "Issue #136940 " -c nosignal

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Oct 1, 2024
@pytorchmergebot (Collaborator)

@muchulee8 your PR has been successfully reverted.

pytorchmergebot pushed a commit that referenced this pull request Oct 2, 2024
ghstack-source-id: b182949
Pull Request resolved: #137073
pytorchmergebot pushed a commit that referenced this pull request Oct 2, 2024
Summary:
We skip save_gpu_kernel if the kernel has already been saved.
This gives us a more accurate Triton profiling result. The following traces show before/after the change for benchmarking a trivial addmm:

Before:
[screenshot: https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a]

After:
[screenshot: https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118]

We can see that before the change, the benchmarking includes two parts:
(1) the overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation;
(2) the actual computation of the Triton kernel.

We see that (1) accounts for >50% of the time, which makes kernel selection for profiling often choose aten kernels over Triton kernels.

Test Plan:
Existing OSS CI
python test/inductor/test_cuda_cpp_wrapper.py

Pull Request resolved: #137073
Approved by: https://github.com/desertfire
AnantGulati pushed a commit to AnantGulati/pytorch that referenced this pull request Oct 2, 2024
@eellison removed their request for review on October 3, 2024 21:40

github-actions bot commented Dec 2, 2024

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

github-actions bot added the Stale label on Dec 2, 2024
@aakhundov (Contributor)

@muchulee8 are we going to land this? My understanding was that the excessive save_gpu_kernel calls lead to a substantial distortion in autotuning results in AOTI scenarios.

github-actions bot closed this on Jan 8, 2025
github-actions bot deleted the gh/muchulee8/36/head branch on February 9, 2025
