use legacy unrolled kernel for non-trivial offset calc cases #71710
Conversation
@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Is there a longer-term plan? What will be the future of these two unrolled kernels? Is the previous unrolled kernel still used anywhere?
The previous unrolled kernel is still used with the trivial offset calculator, i.e. the contiguous type-casting case and the contiguous 1-alignment case (the latter should be pretty rare).
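To make the dispatch described above concrete, here is a hedged CPU-side sketch; the real logic lives in PyTorch's CUDA elementwise loop headers, and the `Kernel` enum and `pick_kernel` function below are invented names for illustration only.

```cpp
#include <cassert>

// Hypothetical sketch of the kernel selection discussed above.
// "Trivial" offset calculation means all operands are contiguous, so the
// flat element index is also the memory offset.
enum class Kernel { Vectorized, UnrolledTrivial, LegacyUnrolled };

Kernel pick_kernel(bool contiguous, bool needs_cast, int alignment_bytes) {
  if (!contiguous)
    return Kernel::LegacyUnrolled;   // broadcasting etc. -> non-trivial offsets
  if (!needs_cast && alignment_bytes >= 4)
    return Kernel::Vectorized;       // common contiguous fast path
  // Contiguous but casting dtypes, or contiguous with 1-byte alignment
  // (the rare case mentioned above): the trivial-offset unrolled kernel.
  return Kernel::UnrolledTrivial;
}
```

The point is that the new loop shape only replaces the non-trivial-offset path; contiguous casting and mis-aligned contiguous inputs keep the old unrolled kernel.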
Summary: This leads to across-the-board improvements on Pascal GPUs and big perf improvements for some broadcasting patterns and datatypes on V100 (along with 3-5% regressions for some other patterns). The most common improving pattern on V100 is half-precision x+bias, which improves by ~5%. Full V100 results: https://docs.google.com/spreadsheets/d/1K67x-6_TPT9Yt6533NfECEhUyfbqBxLH9M5Z3gymzXE/edit#gid=1218963246; benchmarking script: https://gist.github.com/ngimel/986ee84a1dd234a0485e99544e0fc8b6. Most importantly, it reduces context size by 40 MB.
Pull Request resolved: #71710
Reviewed By: mruberry
Differential Revision: D33769330
Pulled By: ngimel
fbshipit-source-id: 5a7942261e06003ca79bfa3b071106aab1a8a4bc
(cherry picked from commit f9b51b4)
Hi @ngimel @zasdfgbnm, may I know why the code change below gives a perf gain on some platforms? It looks like this PR changes the pipeline of a single CUDA thread from load->load->execute->execute->store->store to load->execute->store->load->execute->store. We are wondering why that is faster. Thank you!
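The two per-thread schedules being contrasted can be sketched on the CPU side like this. This is only an illustration, not the actual kernel code: `kWork` stands in for the thread work size, the operation is a placeholder `+1`, and both shapes compute identical results; on the GPU only the instruction schedule (and, e.g., the number of simultaneously live registers) differs.

```cpp
#include <array>

constexpr int kWork = 2;  // stand-in for the per-thread work size

// Old shape: batch all loads, then all computes, then all stores.
// All kWork values are live in registers at once.
void batched(const float* in, float* out) {
  std::array<float, kWork> reg;
  for (int i = 0; i < kWork; ++i) reg[i] = in[i];         // load, load
  for (int i = 0; i < kWork; ++i) reg[i] = reg[i] + 1.f;  // execute, execute
  for (int i = 0; i < kWork; ++i) out[i] = reg[i];        // store, store
}

// New shape: one fused loop, load->execute->store per element.
void interleaved(const float* in, float* out) {
  for (int i = 0; i < kWork; ++i)
    out[i] = in[i] + 1.f;  // load, execute, store
}
```

The PR description itself does not explain the mechanism of the speedup, only the measured results, so any explanation (e.g. reduced register pressure allowing higher occupancy) would be speculation on my part.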
This leads to across-the-board improvements on Pascal GPUs and big perf improvements for some broadcasting patterns and datatypes on V100 (along with 3-5% regressions for some other patterns). The most common improving pattern on V100 is half-precision x+bias, which improves by ~5%. Full V100, 3090, and A100 results: https://docs.google.com/spreadsheets/d/1K67x-6_TPT9Yt6533NfECEhUyfbqBxLH9M5Z3gymzXE/edit#gid=1218963246; benchmarking script: https://gist.github.com/ngimel/986ee84a1dd234a0485e99544e0fc8b6
Most importantly, it reduces context size by 40 MB.