
Conversation

shunting314
Contributor

@shunting314 shunting314 commented Sep 9, 2024

Stack from ghstack (oldest at bottom):

Fix #134768 .

When we benchmark the latency of a fused node set, we benchmark twice:

  1. benchmark the latency of the kernel including cloning mutated args
  2. benchmark the latency of cloning mutated args without running the kernel

We subtract result 2 from result 1 to get the latency of the kernel itself.
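
A minimal sketch of this clone-and-subtract scheme (`run_kernel` and `clone_mutated_args` are hypothetical stand-ins for the actual Inductor hooks, not names from this PR):

```python
from triton.testing import do_bench

def benchmark_fused_node_set(run_kernel, clone_mutated_args):
    # 1. latency of the kernel, including the cost of cloning mutated args
    ms_kernel_and_clone = do_bench(lambda: run_kernel(clone_mutated_args()))
    # 2. latency of cloning mutated args alone
    ms_clone_only = do_bench(lambda: clone_mutated_args())
    # kernel latency = (1) - (2)
    return ms_kernel_and_clone - ms_clone_only
```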

But when the tensors are not on CUDA device 0, result 1 and result 2 come out equal no matter how much work the kernel does. The root cause is that in `triton.testing.do_bench`, the `torch.cuda.synchronize` call syncs the current CUDA device (device 0 unless it is overridden). Since the tensors and kernels live on another device, the sync effectively does nothing (unless other kernels happen to be running on device 0).
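
For illustration (not code from the PR): with the current device left at its default, a plain `torch.cuda.synchronize()` returns without waiting for work queued on another device:

```python
import torch

x = torch.randn(8192, 8192, device="cuda:1")
y = x @ x                          # kernel runs on cuda:1

torch.cuda.synchronize()           # syncs the current device (cuda:0); does not wait for the matmul
torch.cuda.synchronize(x.device)   # syncs cuda:1, where the work actually runs
```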

The fix is to set the correct current device in our benchmarking code.
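
A minimal sketch of that fix, assuming `args` holds the tensors the kernel reads and writes (the names are illustrative, not the actual Inductor code): make the tensors' device the current device, so the `torch.cuda.synchronize()` calls inside `do_bench` wait for the right GPU.

```python
import torch
from triton.testing import do_bench

def bench_on_correct_device(fn, args):
    device = next(a.device for a in args if isinstance(a, torch.Tensor))
    # With e.g. cuda:1 as the current device, do_bench's synchronize()
    # now actually waits for the kernel instead of returning immediately.
    with torch.cuda.device(device):
        return do_bench(fn)
```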

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot bot commented Sep 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135531

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (5 Unrelated Failures)

As of commit e5e1d1d with merge base cbc6b30:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]
@shunting314 shunting314 added the topic: not user facing topic category label Sep 10, 2024
pytorchmergebot pushed a commit that referenced this pull request Sep 10, 2024
…135533)

When a kernel has no mutated args (which is probably quite common), benchmarking the cost of cloning actually benchmarks a no-op. This still takes >100 ms, since `triton.testing.do_bench` allocates a 100 ms budget to run the kernel.
Skipping this benchmarking can save a noticeable amount of compilation time when the code path is hit many times; for example, if it is hit 100 times for a large graph, we save >10 s.
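
A hedged sketch of that optimization (`mutated_args` and `clone_baseline_ms` are illustrative names, not the actual Inductor identifiers): when there is nothing to clone, return 0 instead of spending ~100 ms benchmarking a no-op.

```python
from triton.testing import do_bench

def clone_baseline_ms(mutated_args, clone_mutated_args):
    if not mutated_args:
        # Nothing to clone: the baseline is a no-op, so skip do_bench entirely.
        return 0.0
    return do_bench(lambda: clone_mutated_args())
```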

Pull Request resolved: #135533
Approved by: https://github.com/jansel
ghstack dependencies: #135531
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
…35531)

Pull Request resolved: pytorch#135531
Approved by: https://github.com/jansel
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
…ytorch#135533)

Pull Request resolved: pytorch#135533
Approved by: https://github.com/jansel
ghstack dependencies: pytorch#135531
@github-actions github-actions bot deleted the gh/shunting314/174/head branch October 12, 2024 02:06