Make Inductor benchmarker more compatible with Triton do_bench #160921
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160921. Note: links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 75bc584 with merge base 82c7a1e. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Force-pushed from de03957 to 93e10f3 (Compare)
```diff
@@ -183,7 +183,7 @@ def L2_cache_size(self: Self) -> int:

     def get_event_pairs(
         self: Self, iters: int
-    ) -> list[tuple[torch.cuda.Event, torch.cuda.Event]]:
+    ) -> List[tuple[torch.cuda.Event, torch.cuda.Event]]:
```
please use `list` instead of `List`
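(For context, a minimal sketch of how such event pairs typically drive GPU timing; the `benchmark_gpu_sketch` wrapper and its min-of-iterations reduction are illustrative assumptions, not the benchmarker's exact internals. The annotation below uses the builtin `list`, per the review comment.)

```python
import torch

def get_event_pairs(iters: int) -> list[tuple[torch.cuda.Event, torch.cuda.Event]]:
    # One (start, end) CUDA event pair per timed iteration.
    return [
        (
            torch.cuda.Event(enable_timing=True),
            torch.cuda.Event(enable_timing=True),
        )
        for _ in range(iters)
    ]

def benchmark_gpu_sketch(fn, iters: int = 10) -> float:
    event_pairs = get_event_pairs(iters)
    for start_event, end_event in event_pairs:
        start_event.record()
        fn()
        end_event.record()
    # Wait for all recorded events before reading elapsed times.
    torch.cuda.synchronize()
    # Report the fastest iteration, in milliseconds.
    return min(
        start_event.elapsed_time(end_event)
        for start_event, end_event in event_pairs
    )
```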
Force-pushed from 93e10f3 to d3d38c9 (Compare)
Force-pushed from d3d38c9 to 75bc584 (Compare)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Make Inductor benchmarker more compatible with Triton do_bench (pytorch#160921)
Pull Request resolved: pytorch#160921
Approved by: https://github.com/BoyuanFeng
Common benchmark suites like TritonBench use `triton.testing.do_bench` for kernel timing measurement, which is not always fair to all backends. E.g., it includes torch.compile Dynamo invocation overhead and hence doesn't reflect real-world model use cases, where Dynamo overhead is usually hidden.

I also opened a PR to use this timing measurement function on the TritonBench side: meta-pytorch/tritonbench#333. But regardless of whether that PR lands, I think we should enhance Inductor `benchmark_gpu` to match `do_bench` features, to make it easier for people to migrate.
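A rough usage sketch of the two entry points follows (assumptions: a CUDA device is available, `f` and the tensor shape are invented for illustration, and `benchmarker` is the object exported from `torch._inductor.runtime.benchmarking`, an internal API that may change):

```python
import torch
import triton.testing
from torch._inductor.runtime.benchmarking import benchmarker  # internal API

def f(x):
    return torch.sin(x) + torch.cos(x)

compiled = torch.compile(f)
x = torch.randn(4096, 4096, device="cuda")
compiled(x)  # warm up so compilation itself is not measured

# do_bench times the whole Python callable, so CPU-side Dynamo overhead
# (guard evaluation, frame handling) can leak into the measurement.
ms_do_bench = triton.testing.do_bench(lambda: compiled(x))

# Inductor's own GPU benchmarker; this PR aligns its behavior with
# do_bench so suites like TritonBench can migrate more easily.
ms_inductor = benchmarker.benchmark_gpu(lambda: compiled(x))

print(f"do_bench: {ms_do_bench:.4f} ms, benchmark_gpu: {ms_inductor:.4f} ms")
```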
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben