[cutlass backend] forward fix of standalone runner for fbcode #147158


Closed
wants to merge 9 commits

Conversation

henrylhtsang
Contributor

@henrylhtsang henrylhtsang commented Feb 14, 2025


pytorch-bot bot commented Feb 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147158

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit aeb6482 with merge base 302f56a (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@henrylhtsang
Contributor Author

@desertfire actually it would be good if we could just remove this: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codecache.py#L2844

…ode"


Also only doing mixed mm for A100.


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 amjames chauhang aakhundov

[ghstack-poisoned]
# error while loading shared libraries: IX}: invalid mode for dlopen(): Invalid argument
platform_path = sysconfig.get_config_var("LIBDIR")
link_str = " ".join(
    [f"-L{platform_path}", "-Xlinker", f"-rpath={platform_path}"]
)
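The fix above can be reproduced in isolation. A minimal sketch (assuming a standard CPython install where `LIBDIR` is set; `build_link_flags` is a hypothetical helper name, the PR builds the flags inline):

```python
import sysconfig

def build_link_flags() -> str:
    # Point the linker (and the runtime loader, via -rpath) at the Python
    # library directory so the standalone runner executable can resolve its
    # shared libraries at load time instead of failing in dlopen().
    platform_path = sysconfig.get_config_var("LIBDIR")
    return " ".join(
        [f"-L{platform_path}", "-Xlinker", f"-rpath={platform_path}"]
    )

print(build_link_flags())
```

Embedding the rpath matters because a `.so` loaded into the Python process inherits its loader environment, while a standalone executable does not.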
Contributor


Hmm, does this only affect the standalone runner for the unittests? What about the cases where we compile a model with cutlass backend in fbcode?

Contributor Author


@chenyang78 my guess is that we are building an executable this time, whereas all the previous times we were building a .so file...

Comment on lines 958 to 959
@unittest.skipIf(not SM80OrLater, "need sm_80")
@unittest.skipIf(not SM80OrLater or SM90OrLater, "need sm_8x exactly")
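The changed decorator tightens the gate from "SM80 or later" to "exactly sm_8x". A minimal sketch of the new skip condition as a plain boolean, where `sm80_or_later` / `sm90_or_later` stand in for the `SM80OrLater` / `SM90OrLater` flags used in the test file:

```python
def should_skip(sm80_or_later: bool, sm90_or_later: bool) -> bool:
    # Skip unless the device is at least SM80 *and* not SM90 or newer,
    # i.e. run only on sm_8x devices such as A100.
    return not sm80_or_later or sm90_or_later

# A100 (sm_80): runs.
assert should_skip(sm80_or_later=True, sm90_or_later=False) is False
# H100 (sm_90): skipped.
assert should_skip(sm80_or_later=True, sm90_or_later=True) is True
# V100 (sm_70): skipped.
assert should_skip(sm80_or_later=False, sm90_or_later=False) is True
```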
Collaborator


What's the reason for disabling test for SM90 and higher? I tested on H100 machine, and it worked fine.

Contributor Author

@henrylhtsang henrylhtsang Feb 14, 2025


@alexsamardzic good question, not 100% sure.

I think I broke it in #146877. Forward fix prepping in #147185.

EDIT: on the other hand, I saw that the mixed mm test is for A100 only. Is that true?

Collaborator


I believe it should work on Hopper too, as CUTLASS 3.x kernels are also put in the mix when auto-tuning for mixed MM, see here. As mentioned above, it worked when I tested on H100, i.e. it generated a standalone runner with a CUTLASS 3.x kernel instantiated. Alternatively, we could use regular MM for SM90 and above; the point of this test is just to have some kind of C++ program, generated with the standalone runner, that compiles and runs without errors.

For Ampere, auto-tuning is implemented only for mixed MM and 2:4 sparse MM (this can be tracked by searching for CUTLASS2xGemmTemplate.add_cutlass_gemm_choices() calls in torch/_inductor/kernel/mm.py), which is why I put mixed MM in this test. You asked previously about the rationale for supporting Ampere at all: the main motivation for supporting these two operations was that they were not supported by Triton. I think 2:4 sparse MM is still not supported, while, if I remember correctly, mixed MM was not supported at the time; also, CUTLASS would oftentimes be picked by the auto-tuner over Triton, i.e. the CUTLASS mixed MM kernel would often be faster.

Beware that CUTLASS2xGemmTemplate does not support bias for 2:4 sparse MM; instead, it has metadata tensor handling hacked in (this tensor encodes the sparsity pattern of the 2:4 sparse weight matrix). I think it should not affect your changes, but it's something to keep in mind.

In general, this whole CUTLASS "back-end" for auto-tuning is rather fragile, and hard to extend, in its current state. As in the meantime CUTLASS has provided solid support for generating C++ kernels from a plain Python description of the intended operation, see here, I was planning to try to completely rewrite this stuff, but at the moment I don't know if/when that could happen. Still, if that path is approached, either by myself or someone else, it's very important that we keep the tests as complete as possible.

Contributor Author

@henrylhtsang henrylhtsang Feb 14, 2025


re: the mixed mm test on H100: when I run it on H100 with TORCH_LOGS="+inductor" and the cache force-disabled, I see "No suitable Cutlass GEMM configs found, fallbacks used". That is without #146877. I also suspect autotune_fallback_to_aten=False doesn't work for mixed mm.

EDIT: I think it's related to the alignment. If the dimensions are divisible by 16, it can work.
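If the alignment guess is right, the failing shapes can be predicted with a simple divisibility check. A hypothetical sketch (`has_aligned_dims` is an invented helper, and the 16-element alignment is an assumption, not something verified against the CUTLASS config filter):

```python
def has_aligned_dims(*dims: int, alignment: int = 16) -> bool:
    # Hypothetical check: assume CUTLASS GEMM configs require the problem
    # dimensions to be divisible by the vectorized-access alignment; if not,
    # no config survives filtering and Inductor falls back, logging
    # "No suitable Cutlass GEMM configs found, fallbacks used".
    return all(d % alignment == 0 for d in dims)

print(has_aligned_dims(128, 16, 128))   # True: all divisible by 16
print(has_aligned_dims(128, 100, 128))  # False: 100 % 16 != 0
```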

Contributor Author


re "In general, this whole CUTLASS 'back-end' for auto-tuning is rather fragile, and hard to extend, in its current state": I agree. Moving to Python would be nice. I think this particular test can remain A100-only just for repro purposes.

Collaborator


I also suspect autotune_fallback_to_aten=False doesn't work for mixed mm.

The mixed MM has a rather convoluted logic for selection of auto-tuning back-ends to use, see here. I don't know why it's this way, but apparently it goes against some of the config settings.

@henrylhtsang
Contributor Author

henrylhtsang commented Feb 14, 2025

@alexsamardzic I updated it, now it should run on both A100 and H100.

A few small action items:

  1. Update the mixed_mm test to use maybe (128, 16, 128) so it can run on H100
  2. Add fallback-to-aten logic to tuned_mixed_mm
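For item 1, a shape-only sketch of what a mixed mm with (128, 16, 128) looks like numerically (plain NumPy reference semantics, not the PR's actual test; the M/K/N assignment is my assumption):

```python
import numpy as np

M, K, N = 128, 16, 128  # every dimension divisible by 16

# Mixed mm: fp16 activations times int8 weights, upcast before the matmul.
a = np.random.randn(M, K).astype(np.float16)
b = np.random.randint(-8, 8, size=(K, N), dtype=np.int8)
out = a.astype(np.float32) @ b.astype(np.float32)

print(out.shape)  # (128, 128)
```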

henrylhtsang added a commit that referenced this pull request Feb 18, 2025
@henrylhtsang
Contributor Author

@pytorchbot merge -i

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 18, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 0 checks:

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Feb 19, 2025
…147409)

Summary: I think the path is not needed anymore. It was added in #126408, but it has been a while since then. See if CI complains.

Differential Revision: D69573185

See also #147158

Pull Request resolved: #147409
Approved by: https://github.com/chenyang78
Raymo111 pushed a commit that referenced this pull request Feb 20, 2025
pytorch-bot bot pushed a commit that referenced this pull request Feb 24, 2025
majing921201 pushed a commit to majing921201/pytorch that referenced this pull request Mar 4, 2025
@github-actions github-actions bot deleted the gh/henrylhtsang/10/head branch March 27, 2025 02:10