[Profiler] Add speedup estimate for FP32 pattern and Extra CUDA Copy Pattern #81501
Conversation
✅ No failures (0 pending) as of commit 364e9a0. Looks good so far! There are no failures yet. This comment was automatically generated by Dr. CI; please report bugs/suggestions to the (internal) Dr. CI Users group.
@davidchencsl has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Overall looks good. The only thing I would say is that `report_all_anti_patterns` should take `should_benchmark: bool = False` as an argument and plumb it through. Benchmarking can add a lot of time to the analysis, so we want users to opt into it. (At some point TorchTidy might be sophisticated enough to pick an appropriate subset to benchmark, but that's a long way off.)
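For illustration, here is a minimal sketch of how that opt-in flag could be plumbed through. The pattern class and its methods below are simplified stand-ins, not the actual torch.profiler internals:

```python
# Illustrative sketch only: the pattern class, its method names, and the
# reporting logic are simplified assumptions, not the real implementation.
from typing import List


class ExtraCUDACopyPattern:  # hypothetical stand-in for the real pattern class
    def __init__(self, prof):
        self.prof = prof

    def matched_events(self) -> List[object]:
        # The real implementation walks the profiler's event tree.
        return []

    def benchmark(self, events) -> float:
        # The real implementation times a small baseline op on this machine,
        # so the speedup estimate reflects local performance.
        return 1.0

    def report(self, events) -> str:
        return f"Found {len(events)} extra CUDA copies"


def report_all_anti_patterns(prof, should_benchmark: bool = False) -> None:
    """Match anti-patterns; benchmarking is opt-in because it adds analysis time."""
    for pattern in (ExtraCUDACopyPattern(prof),):
        events = pattern.matched_events()
        if not events:
            continue
        message = pattern.report(events)
        if should_benchmark:
            message += f" (estimated speedup: {pattern.benchmark(events):.2f}x)"
        print(message)
```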
LGTM
@pytorchbot merge

@pytorchbot successfully started a merge job. Check the current status here.

Hey @davidchencsl.
…Pattern (#81501)

Summary: The main idea is that we can run some baseline benchmarks after we are done matching the events. This gives us the ability to accurately measure the speedup, since system performance varies from machine to machine.

Pull Request resolved: #81501
Approved by: https://github.com/robieta
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/64c6387c0ff82d49a5bfdcae579b522ae830c2c8
Test plan from GitHub: I did some manual testing on all the models in torchbench, and added a simple test in test_profiler.py.
Reviewed By: robieta
Differential Revision: D37894566
Pulled By: davidchencsl
fbshipit-source-id: 3e7adcf9b647d02cfad28772cf72fe08da2c6f93
Stack from ghstack (oldest at bottom):
Summary: The main idea is that we can run some baseline benchmarks after we are done matching the events. This gives us the ability to accurately measure the speedup, since system performance varies from machine to machine.
Test Plan: I did some manual testing on all the models in torchbench, and added a simple test in test_profiler.py.
Differential Revision: D37894566
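As a rough illustration of the intended usage (a sketch only: the import path `torch.profiler._pattern_matcher` and the `should_benchmark` keyword are assumptions based on this PR's discussion, not a documented public API):

```python
# Hedged usage sketch. The _pattern_matcher module path and the
# should_benchmark keyword are assumptions from this PR's discussion.
import torch
from torch.profiler import profile, ProfilerActivity
from torch.profiler._pattern_matcher import report_all_anti_patterns  # assumed module path

model = torch.nn.Linear(512, 512).cuda()        # requires a CUDA device
inputs = torch.randn(64, 512, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(inputs)

# With should_benchmark=True, small baseline benchmarks run on this machine
# after the events are matched, so the reported speedup estimate reflects
# local hardware rather than a hard-coded constant.
report_all_anti_patterns(prof, should_benchmark=True)
```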