[cutlass backend] forward fix of standalone runner for fbcode #147158
Conversation
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147158
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit aeb6482 with merge base 302f56a.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@desertfire actually it would be good if we could just remove this: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codecache.py#L2844
…ode" Also only doing mixed mm for A100. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
…ode" Also only doing mixed mm for A100. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
```python
# error while loading shared libraries: IX}: invalid mode for dlopen(): Invalid argument
platform_path = sysconfig.get_config_var("LIBDIR")
link_str = " ".join(
    [f"-L{platform_path}", "-Xlinker", f"-rpath={platform_path}"]
```
Hmm, does this only affect the standalone runner for the unittests? What about the cases where we compile a model with cutlass backend in fbcode?
@chenyang78 my guess is that we are building an executable this time, whereas all the previous times we were building a .so file...
@unittest.skipIf(not SM80OrLater, "need sm_80") | ||
@unittest.skipIf(not SM80OrLater or SM90OrLater, "need sm_8x exactly") |
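To make the guard change explicit: the old decorator only skipped pre-Ampere GPUs, while the new one also skips SM90 and newer, so the test runs on sm_8x only. A small sketch of the equivalent condition, with the capability flags re-derived locally rather than imported from the test utilities:

```python
# Sketch only: SM80OrLater / SM90OrLater stand-ins derived from the current
# device, to show what "need sm_8x exactly" means as a skip condition.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
else:
    major, minor = 0, 0

SM80OrLater = (major, minor) >= (8, 0)
SM90OrLater = (major, minor) >= (9, 0)

skip_old = not SM80OrLater                  # old guard: skip only pre-sm_80
skip_new = not SM80OrLater or SM90OrLater   # new guard: also skip sm_90+
print(f"sm_{major}{minor}: old skip={skip_old}, new skip={skip_new}")
```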
What's the reason for disabling the test for SM90 and higher? I tested on an H100 machine, and it worked fine.
@alexsamardzic good question, not 100% sure.
I think I broke it in #146877. Forward fix prepping in #147185.
EDIT: on the other hand, I saw that the mixed mm test is for A100 only. Is that true?
I believe it should work on Hopper too, as CUTLASS 3.x kernels are also put in the mix when auto-tuning for mixed MM, see here. As mentioned above, it worked when I tested on H100, i.e. it generated a standalone runner with a CUTLASS 3.x kernel instantiated. Alternatively, we could use regular MM for SM90 and above; the point of this test is just to have some kind of C++ program, generated with the standalone runner, that compiles and runs without errors.
For Ampere, auto-tuning is implemented only for mixed MM and 2:4 sparse MM (this can be tracked by searching for CUTLASS2xGemmTemplate.add_cutlass_gemm_choices() calls in torch/_inductor/kernel/mm.py), which is why I put mixed MM in this test. You asked previously about the rationale for supporting Ampere at all: the main motivation for supporting these two operations was that they were not supported by Triton. I think 2:4 sparse MM is still not supported, and if I remember correctly mixed MM was not supported back at the time; also, CUTLASS would oftentimes be picked by the auto-tuner over Triton, i.e. the CUTLASS mixed MM kernel would oftentimes be faster.
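As a point of reference, a hedged sketch of what a mixed MM routed through the CUTLASS backend roughly looks like under max-autotune; the shapes, dtypes, and config values here are assumptions for illustration and may differ from the actual test in the suite:

```python
# Illustrative sketch of a mixed MM (fp16 activations x int8 weights) under
# max-autotune with CUTLASS as the GEMM backend. Not the actual unittest;
# shapes and settings are placeholders.
import torch
import torch._inductor.config as inductor_config

inductor_config.max_autotune_gemm_backends = "CUTLASS"

def mixed_mm(a, b):
    # The int8 operand is upcast to the activation dtype inside the graph,
    # which is what makes this a "mixed" MM for the autotuner.
    return torch.mm(a, b.to(a.dtype))

a = torch.randn(256, 512, dtype=torch.float16, device="cuda")
b = torch.randint(-8, 8, (512, 128), dtype=torch.int8, device="cuda")

compiled = torch.compile(mixed_mm, mode="max-autotune")
out = compiled(a, b)
```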
Beware that CUTLASS2xGemmTemplate does not support bias for 2:4 sparse MM; instead it has metadata tensor handling hacked in (this tensor contains information about the sparsity pattern of the 2:4 sparse weight matrix). I think it should not affect your changes, but it's something to keep in mind.
In general, this whole CUTLASS "back-end" for auto-tuning is rather fragile, and hard to extend, in its current state. Since CUTLASS has in the meantime provided solid support for generating C++ kernels from a plain Python description of the intended operation (see here), I was planning to try to completely rewrite this stuff, but at the moment I don't know if or when that could happen. Still, if that path is approached, either by myself or someone else, it's very important that we keep the tests as complete as possible.
re: mixed mm test on H100: when I run it on H100 with TORCH_LOGS="+inductor" and force-disable the cache, I see "No suitable Cutlass GEMM configs found, fallbacks used". That is without #146877. I also suspect autotune_fallback_to_aten=False doesn't work for mixed mm.
EDIT: I think it's related to the alignment. If things are divisible by 16, they can work.
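For anyone trying to reproduce this, a sketch of the kind of run being described: inductor logging turned on and codecache reuse disabled so autotuning actually re-runs. The exact knobs used in the original repro are an assumption here.

```python
# Sketch of a repro setup for the "No suitable Cutlass GEMM configs found"
# message: set these before importing torch so the logging/cache settings
# take effect, then run the mixed MM sketch above and inspect the logs.
# Per the alignment observation, shapes divisible by 16 are more likely to
# produce eligible CUTLASS configs.
import os

os.environ["TORCH_LOGS"] = "+inductor"                   # verbose inductor logs
os.environ["TORCHINDUCTOR_FORCE_DISABLE_CACHES"] = "1"   # force re-autotuning

import torch  # noqa: E402  (imported after env vars so they are picked up)
```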
re "In general, this whole CUTLASS "back-end" for auto-tuning is rather fragile, and hard to extend, in its current state. " I agree. Moving to python would be nice. I think this particular test can remain in A100 just for repro purpose.
> I also suspect autotune_fallback_to_aten=False doesn't work for mixed mm.

Mixed MM has rather convoluted logic for selecting which auto-tuning back-ends to use, see here. I don't know why it's this way, but apparently it goes against some of the config settings.
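To make the config interaction concrete, a tiny sketch of the two settings being discussed; whether mixed MM actually honors them is exactly the open question above:

```python
# The two knobs referenced in this thread. The concern is that mixed MM's
# backend-selection logic may ignore them in some cases, so setting them is
# not a guarantee that only CUTLASS choices get autotuned.
import torch._inductor.config as inductor_config

inductor_config.max_autotune_gemm_backends = "CUTLASS"
inductor_config.autotune_fallback_to_aten = False
```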
…ode" Also only doing mixed mm for A100. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
…ode" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
…ode" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
@alexsamardzic I updated it; now it should run on both A100 and H100. A few small action items:
…ode" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
…ode" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
…ode" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 amjames chauhang aakhundov [ghstack-poisoned]
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 0 checks.
…147409) Summary: I think the path is not needed anymore. It was added in #126408, but it has been a while since then. See if CI complains. Differential Revision: D69573185 See also #147158 Pull Request resolved: #147409 Approved by: https://github.com/chenyang78
Pull Request resolved: #147158 Approved by: https://github.com/chenyang78
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov