-
Notifications
You must be signed in to change notification settings - Fork 21.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Inductor cutlass backend] Enable StreamK and nonzero workspace sizes #119005
Conversation
This change introduces temporary workspace buffers, which can be used to provide workspace memory to Inductor ops. This is a required feature in order to enable the Cutlass StreamK tile scheduler, which should bring performance benefits and more reliable GEMM performance for many tile shapes. See StreamK paper: https://arxiv.org/abs/2301.03598 [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119005
Note: Links to docs will display an error until the docs builds have been completed. ❌ 4 New FailuresAs of commit 8019b3a with merge base 520771d (): NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
…space sizes" This change introduces temporary workspace buffers, which can be used to provide workspace memory to Inductor ops. This is a required feature in order to enable the Cutlass StreamK tile scheduler, which should bring performance benefits and more reliable GEMM performance for many tile shapes. See StreamK paper: https://arxiv.org/abs/2301.03598 [ghstack-poisoned]
…space sizes" This change introduces temporary workspace buffers, which can be used to provide workspace memory to Inductor ops. This is a required feature in order to enable the Cutlass StreamK tile scheduler, which should bring performance benefits and more reliable GEMM performance for many tile shapes. See StreamK paper: https://arxiv.org/abs/2301.03598 [ghstack-poisoned]
…space sizes" This change introduces temporary workspace buffers, which can be used to provide workspace memory to Inductor ops. This is a required feature in order to enable the Cutlass StreamK tile scheduler, which should bring performance benefits and more reliable GEMM performance for many tile shapes. See StreamK paper: https://arxiv.org/abs/2301.03598 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler [ghstack-poisoned]
…space sizes" This change introduces temporary workspace buffers, which can be used to provide workspace memory to Inductor ops. This is a required feature in order to enable the Cutlass StreamK tile scheduler, which should bring performance benefits and more reliable GEMM performance for many tile shapes. See StreamK paper: https://arxiv.org/abs/2301.03598 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler [ghstack-poisoned]
…space sizes" This change introduces temporary workspace buffers, which can be used to provide workspace memory to Inductor ops. This is a required feature in order to enable the Cutlass StreamK tile scheduler, which should bring performance benefits and more reliable GEMM performance for many tile shapes. See StreamK paper: https://arxiv.org/abs/2301.03598 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler [ghstack-poisoned]
…space sizes" This change introduces temporary workspace buffers, which can be used to provide workspace memory to Inductor ops. This is a required feature in order to enable the Cutlass StreamK tile scheduler, which should bring performance benefits and more reliable GEMM performance for many tile shapes. See StreamK paper: https://arxiv.org/abs/2301.03598 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler [ghstack-poisoned]
…space sizes" This change introduces temporary workspace buffers, which can be used to provide workspace memory to Inductor ops. This is a required feature in order to enable the Cutlass StreamK tile scheduler, which should bring performance benefits and more reliable GEMM performance for many tile shapes. See StreamK paper: https://arxiv.org/abs/2301.03598 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler [ghstack-poisoned]
@@ -3028,6 +3028,41 @@ class InputBuffer(Buffer): | |||
pass | |||
|
|||
|
|||
@dataclasses.dataclass | |||
class WorkspaceBuffer(Buffer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In #117992 I also add workspace allocations but take a slightly different approach. The workspace size is an implementation detail of the code generation, so I don't represent it in the inductor IR and instead put it in KernelArgs
and have the Kernel
method create the args.
wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@peterbell10 - Just looked at it, looks good to me. If the PR you mention is blocked through this ROCm issue (Is that blocking?),maybe you could split the workspace allocation code off into a separate PR, so we can get it merged quickly and I can adapt this to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ROCm issues are fixed and I'm just waiting on perf results before merging.
Had to be squashed into #119007 |
Enable nonzero workspace and Cutlass StreamK for Inductor Cutlass GEMM ops. This is a simpler rewrite of my original version of pytorch#119005 using peterbell10 's workspace allocation mechanism from pytorch#117992 Test Plan: - Additional unit test in test_cutlass_backend.py which specifically tests StreamK GEMM with workspace requirement - CI ghstack-source-id: 24d06299f90a1e31af6b097316b76689e4944df2 Pull Request resolved: pytorch#125406
Enable nonzero workspace and Cutlass StreamK for Inductor Cutlass GEMM ops. This is a simpler rewrite of my original version of pytorch#119005 using peterbell10 's workspace allocation mechanism from pytorch#117992 Test Plan: - Additional unit test in test_cutlass_backend.py which specifically tests StreamK GEMM with workspace requirement - CI ghstack-source-id: 6b2a29b3d2754b1981b503939f79f7bc5889216e Pull Request resolved: pytorch#125406
…amK (#125406) Enable nonzero workspace and Cutlass StreamK for Inductor Cutlass GEMM ops. This is a simpler rewrite of my original version of #119005 using @peterbell10 's workspace allocation mechanism from #117992 Test Plan: - Additional unit test in test_cutlass_backend.py which specifically tests StreamK GEMM with workspace requirement - CI Pull Request resolved: #125406 Approved by: https://github.com/jansel
Enable nonzero workspace and Cutlass StreamK for Inductor Cutlass GEMM ops. This is a simpler rewrite of my original version of #119005 using peterbell10 's workspace allocation mechanism from #117992 Test Plan: - Additional unit test in test_cutlass_backend.py which specifically tests StreamK GEMM with workspace requirement - CI ghstack-source-id: eccd80aedce633b345a88eb25ca1d1149bb12756 Pull Request resolved: #125406
Stack from ghstack (oldest at bottom):
This change introduces temporary workspace buffers, which can be used to
provide workspace memory to Inductor ops. This is a required feature
in order to enable the Cutlass StreamK tile scheduler, which should
bring performance benefits and more reliable GEMM performance for
many tile shapes.
See StreamK paper: https://arxiv.org/abs/2301.03598
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @muchulee8 @aakhundov @ColinPeppler