
Conversation

shunting314
Contributor

@shunting314 shunting314 commented Sep 9, 2024

Stack from ghstack (oldest at bottom):

Fix #134768 .

When we benchmark the latency of a fused node set, we benchmark twice:

  1. benchmark the latency of the kernel including cloning mutated args
  2. benchmark the latency of cloning mutated args without running the kernel

We subtract result 2 from result 1 to get the latency of the kernel itself.
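
A minimal sketch of this clone-and-subtract scheme (`run_kernel` and `clone_mutated_args` are hypothetical stand-ins for the actual Inductor hooks, not names from this PR):

```python
from triton.testing import do_bench

def benchmark_fused_node_set(run_kernel, clone_mutated_args):
    # 1. latency of the kernel, including the cost of cloning mutated args
    ms_kernel_and_clone = do_bench(lambda: run_kernel(clone_mutated_args()))
    # 2. latency of cloning mutated args alone
    ms_clone_only = do_bench(lambda: clone_mutated_args())
    # kernel latency = (1) - (2)
    return ms_kernel_and_clone - ms_clone_only
```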

But when the tensors are not on CUDA device 0, result 1 and result 2 come out equal no matter how much work the kernel does. The root cause is that in `triton.testing.do_bench`, the `torch.cuda.synchronize` call syncs the current CUDA device (device 0 unless it is overridden). Since the tensors and kernels live on another device, the sync effectively does nothing (unless other kernels happen to be running on device 0).
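
For illustration (not code from the PR): with the current device left at its default, a plain `torch.cuda.synchronize()` returns without waiting for work queued on another device:

```python
import torch

x = torch.randn(8192, 8192, device="cuda:1")
y = x @ x                          # kernel runs on cuda:1

torch.cuda.synchronize()           # syncs the current device (cuda:0); does not wait for the matmul
torch.cuda.synchronize(x.device)   # syncs cuda:1, where the work actually runs
```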

The fix is to set the correct current device in our benchmarking code.
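
A minimal sketch of that fix, assuming `args` holds the tensors the kernel reads and writes (the names are illustrative, not the actual Inductor code): make the tensors' device the current device, so the `torch.cuda.synchronize()` calls inside `do_bench` wait for the right GPU.

```python
import torch
from triton.testing import do_bench

def bench_on_correct_device(fn, args):
    device = next(a.device for a in args if isinstance(a, torch.Tensor))
    # With e.g. cuda:1 as the current device, do_bench's synchronize()
    # now actually waits for the kernel instead of returning immediately.
    with torch.cuda.device(device):
        return do_bench(fn)
```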

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot bot commented Sep 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135531

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (5 Unrelated Failures)

As of commit e5e1d1d with merge base cbc6b30:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]
@shunting314 shunting314 added the topic: not user facing topic category label Sep 10, 2024
pytorchmergebot pushed a commit that referenced this pull request Sep 10, 2024
…135533)

When a kernel has no mutated args (which is probably quite common), benchmarking the cost of cloning actually benchmarks a no-op. This still takes >100 ms, since `triton.testing.do_bench` allocates a 100 ms budget to run the kernel.
Skipping this benchmarking can save a noticeable amount of compilation time when the code path is hit many times; for example, if it is hit 100 times for a large graph, we save >10 s.
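
A hedged sketch of that optimization (`mutated_args` and `clone_baseline_ms` are illustrative names, not the actual Inductor identifiers): when there is nothing to clone, return 0 instead of spending ~100 ms benchmarking a no-op.

```python
from triton.testing import do_bench

def clone_baseline_ms(mutated_args, clone_mutated_args):
    if not mutated_args:
        # Nothing to clone: the baseline is a no-op, so skip do_bench entirely.
        return 0.0
    return do_bench(lambda: clone_mutated_args())
```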

Pull Request resolved: #135533
Approved by: https://github.com/jansel
ghstack dependencies: #135531
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
…35531)

Pull Request resolved: pytorch#135531
Approved by: https://github.com/jansel
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
…ytorch#135533)

Pull Request resolved: pytorch#135533
Approved by: https://github.com/jansel
ghstack dependencies: pytorch#135531
@github-actions github-actions bot deleted the gh/shunting314/174/head branch October 12, 2024 02:06