[inductor] Adding a way to force fusion of int_mm with mul #111125
Conversation
Summary: Unsure if there's a more general way to set it up rather than forcing the order of the muls based on their shape. Test Plan: Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
Summary: When doing quantization, `int_mm -> mul` or `int_mm -> mul -> to(dtype)` is an extremely common op pattern that is currently not handled well by inductor. Ideally, since the output of `int_mm` has dtype int32, we'd prefer to realize only a smaller dtype like bf16 or float16. Currently inductor has no way to force this; in many cases the mul gets fused with a bunch of subsequent pointwise ops from the dequant, increasing memory overhead and causing a general slowdown compared to the fused version. Test Plan: python test/inductor/test_pattern_matcher.py -k "int_mm_mul" Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
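For context, a minimal sketch of the pattern being targeted (the names, shapes, and CUDA device below are illustrative, not taken from the PR):

```python
import torch

# Sketch of the int_mm -> mul pattern described above. Shapes are
# chosen to satisfy torch._int_mm's size/alignment checks; a CUDA
# device is assumed.
a = torch.randint(-128, 128, (32, 32), dtype=torch.int8, device="cuda")
b = torch.randint(-128, 128, (32, 32), dtype=torch.int8, device="cuda")
scale = torch.rand((32, 32), dtype=torch.float16, device="cuda")

def int_mm_mul(a, b, scale):
    # int8 @ int8 -> int32 accumulator, then a dequant-style mul.
    # Without the fusion this PR adds, inductor may realize the int32
    # buffer and fuse the mul with later pointwise ops instead.
    return torch._int_mm(a, b) * scale

out = torch.compile(int_mm_mul)(a, b, scale)  # float16 output
```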
I don't understand why we need a pattern for this. Why don't we just fix it so that it prioritizes this fusion? |
@Chillee - I agree with you. This kernel also uses block ptrs. It's a starting point. This isn't supposed to be the final version. |
Yeah, this would ideally be temporary until we fix it so that the fusion is prioritized. The reason we needed a pattern for this is mostly that I knew how to do that but couldn't figure out how to fix the fusion. |
Review comment on `tuned_fused_int_mm_mul`:

```python
def tuned_fused_int_mm_mul(mat1, mat2, mat3, out_dtype, *, layout=None):
    out_dtype = (
        torch.promote_types(mat3.get_dtype(), torch.int32)
```

Suggested change, replacing `torch.promote_types(mat3.get_dtype(), torch.int32)` with:

```suggestion
        functools.reduce(torch.promote_types, [
            mat1.get_dtype(),
            mat2.get_dtype(),
            mat3.get_dtype(),
        ])
```
I am assuming this will match the semantics of aten._int_mm, but you should check to make sure. Will aten._int_mm promote to torch.int32 even if none of the inputs are int32? What about int64?
int_mm only does int8 x int8 -> int32 matmuls, so I don't think those other cases are possible.
With `torch.promote_types(mat3.get_dtype(), torch.int32)` I was trying to set the output type to what you'd naturally get from int32 * float16, so that it matches the non-torch.compiled model. What is the benefit of including the mat1 and mat2 dtypes, if what we really want is the promotion from the int_mm output, whose dtype we already know?
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team. |
@pytorchbot revert -m 'Sorry for reverting your change, but it fails on ROCm https://hud.pytorch.org/pytorch/pytorch/commit/f4297576e63e4110f6bdf2522ae6a5fb4c7f3816' -c nosignals

The error is … |
❌ 🤖 pytorchbot command failed. Try …
@pytorchbot revert -m 'Sorry for reverting your change, but it fails on ROCm https://hud.pytorch.org/pytorch/pytorch/commit/f4297576e63e4110f6bdf2522ae6a5fb4c7f3816' -c nosignal |
@pytorchbot successfully started a revert job. Check the current status here. |
@HDCharles your PR has been successfully reverted. |
…111125)" This reverts commit f429757. Reverted #111125 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it fails on ROCm https://hud.pytorch.org/pytorch/pytorch/commit/f4297576e63e4110f6bdf2522ae6a5fb4c7f3816 ([comment](#111125 (comment)))
Stack from ghstack (oldest at bottom):

Summary: When doing quantization, `int_mm -> mul` or `int_mm -> mul -> to(dtype)` is an extremely common op pattern that is currently not handled well by inductor. Ideally, since the output of `int_mm` has dtype int32, we'd prefer to realize only a smaller dtype like bf16 or float16. Currently inductor has no way to force this; in many cases the mul gets fused with a bunch of subsequent pointwise ops from the dequant, increasing memory overhead and causing a general slowdown compared to the fused version.

Theoretically, with better control of / smarter inductor fusion, this could be something we get for free, at which point these changes can be removed.

Test Plan: python test/inductor/test_pattern_matcher.py -k "int_mm_mul"

Reviewers:
Subscribers:
Tasks:
Tags:
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler