
Conversation

HDCharles
Contributor

@HDCharles HDCharles commented Oct 12, 2023

Stack from ghstack (oldest at bottom):

Summary: When doing quantization, int_mm -> mul or int_mm -> mul -> to(dtype) is an extremely common op pattern that is currently not handled well by inductor. Ideally, since the output of int_mm has dtype int32, we'd prefer to realize only a smaller dtype like bf16 or float16. Currently inductor doesn't have a way to force this; in many cases the mul gets fused with a bunch of subsequent pointwise ops from the dequant, creating an increase in memory overhead and a general slowdown compared to the fused version.

Theoretically, with better control of (or smarter) inductor fusion, this could be something we get for free, at which point these changes can be removed.
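
For context, a minimal sketch of the pattern in question (the function name, shapes, and per-channel scale layout here are illustrative only, not taken from this PR): the int8 matmul produces an int32 accumulator, and the dequant scale plus optional downcast are the ops we'd like realized together with the matmul rather than as a separate pass over the int32 output.

import torch

def dequant_linear(x_int8, w_int8, scales, out_dtype=torch.bfloat16):
    # int8 @ int8 -> int32 accumulator
    acc = torch._int_mm(x_int8, w_int8)
    # the "mul" (dequant scale) and "to(dtype)" that we'd like fused into the
    # matmul kernel instead of first realizing the int32 output
    return (acc * scales).to(out_dtype)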

Test Plan: python test/inductor/test_pattern_matcher.py -k
"int_mm_mul"

Reviewers:

Subscribers:

Tasks:

Tags:

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler

Summary: unsure if there's a more general way to set it up rather than
forcing the order of the muls based on their shape.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Oct 12, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/111125

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (1 Unrelated Failure)

As of commit 1f49cab with merge base 898482f:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

HDCharles added a commit that referenced this pull request Oct 12, 2023
ghstack-source-id: 56aff64
Pull Request resolved: #111125
@HDCharles HDCharles changed the title from "WIP Adding int_mm_mul (int_mm fused with two muls)" to "Adding a way to force fusion of int_mm with mul" Oct 12, 2023
@HDCharles HDCharles requested review from cpuhrsch and jansel October 12, 2023 21:24
HDCharles added a commit that referenced this pull request Oct 12, 2023
ghstack-source-id: 5a534a9
Pull Request resolved: #111125
@HDCharles HDCharles changed the title from "Adding a way to force fusion of int_mm with mul" to "[inductor] Adding a way to force fusion of int_mm with mul" Oct 12, 2023
@HDCharles HDCharles requested review from Chillee and bdhirsh October 12, 2023 21:28
@Chillee
Collaborator

Chillee commented Oct 12, 2023

I don't understand why we need a pattern for this. Why don't we just fix it so that it prioritizes this fusion?

@cpuhrsch
Contributor

@Chillee - I agree with you. This kernel also uses block ptrs. It's a starting point. This isn't supposed to be the final version.

HDCharles added a commit that referenced this pull request Oct 12, 2023
ghstack-source-id: c3de3db
Pull Request resolved: #111125
@HDCharles
Contributor Author

I don't understand why we need a pattern for this. Why don't we just fix it so that it prioritizes this fusion?

Yeah, this would ideally be temporary until we fix it so it prioritizes the fusion. The reason we needed a pattern for this is mostly that I knew how to do that but couldn't figure out how to fix the fusion.

@HDCharles HDCharles requested a review from jansel October 13, 2023 05:59
HDCharles added a commit that referenced this pull request Oct 13, 2023
ghstack-source-id: 7016102
Pull Request resolved: #111125

def tuned_fused_int_mm_mul(mat1, mat2, mat3, out_dtype, *, layout=None):
    out_dtype = (
        torch.promote_types(mat3.get_dtype(), torch.int32)
Contributor
Suggested change
-        torch.promote_types(mat3.get_dtype(), torch.int32)
+        functools.reduce(torch.promote_types, [
+            mat1.get_dtype(),
+            mat2.get_dtype(),
+            mat3.get_dtype()
+        ])

I am assuming this will match the semantics of aten._int_mm, but you should check to make sure.

Will aten._int_mm promote to torch.int32 even if none of the inputs are int32? What about int64?

Contributor Author

@HDCharles HDCharles Oct 13, 2023

int_mm only does int8 x int8 -> int32 matmuls, so I don't think those others are possible.

With torch.promote_types(mat3.get_dtype(), torch.int32) I was trying to set the output dtype to what you'd naturally get from an int32 * float16, so that it matches the non-torch.compile'd model. What is the benefit of including the mat1 and mat2 dtypes if what we really want is the promotion from the int_mm output, whose dtype we already know?
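
For what it's worth, the dtype semantics being discussed can be checked directly in eager. The snippet below is only an illustration (shapes are arbitrary; _int_mm requires int8 CUDA inputs and has minimum-size/alignment constraints), not code from this PR:

import torch

a = torch.randint(-128, 127, (32, 32), dtype=torch.int8, device="cuda")
b = torch.randint(-128, 127, (32, 32), dtype=torch.int8, device="cuda")
scale = torch.randn(32, dtype=torch.float16, device="cuda")

acc = torch._int_mm(a, b)      # output dtype is always int32
out = acc * scale              # int32 * float16 promotes to float16
assert acc.dtype == torch.int32
assert out.dtype == torch.promote_types(torch.int32, torch.float16)  # float16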

@HDCharles HDCharles requested a review from jansel October 13, 2023 17:29
HDCharles added a commit that referenced this pull request Oct 13, 2023
ghstack-source-id: 120dd57
Pull Request resolved: #111125
@cpuhrsch
Contributor

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 13, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@huydhn
Contributor

huydhn commented Oct 16, 2023

@pytorchbot revert -m 'Sorry for reverting your change, but it fails on ROCm https://hud.pytorch.org/pytorch/pytorch/commit/f4297576e63e4110f6bdf2522ae6a5fb4c7f3816' -c nosignals

The error is RuntimeError: _int_mm_out_cuda not compiled for this platform. ROCm is currently having some stability issues, so its tests don't block merge at the moment. However, the issue here looks legit.
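
For reference, a common way to handle this kind of platform gap is to skip the test on ROCm. The sketch below only illustrates that approach (the test class, name, and body are hypothetical; the real test lives in test/inductor/test_pattern_matcher.py) and is not the actual follow-up fix:

import unittest
import torch
from torch.testing._internal.common_utils import TEST_WITH_ROCM, TestCase, run_tests

class TestFusedIntMmMul(TestCase):
    @unittest.skipIf(TEST_WITH_ROCM, "_int_mm_out_cuda is not compiled for ROCm")
    def test_fused_int_mm_mul(self):
        a = torch.randint(-128, 127, (32, 32), dtype=torch.int8, device="cuda")
        b = torch.randint(-128, 127, (32, 32), dtype=torch.int8, device="cuda")
        self.assertEqual(torch._int_mm(a, b).dtype, torch.int32)

if __name__ == "__main__":
    run_tests()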

@pytorch-bot

pytorch-bot bot commented Oct 16, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: argument -c/--classification: invalid choice: 'nosignals' (choose from 'nosignal', 'ignoredsignal', 'landrace', 'weird', 'ghfirst')

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@huydhn
Contributor

huydhn commented Oct 16, 2023

@pytorchbot revert -m 'Sorry for reverting your change, but it fails on ROCm https://hud.pytorch.org/pytorch/pytorch/commit/f4297576e63e4110f6bdf2522ae6a5fb4c7f3816' -c nosignal

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@HDCharles your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Oct 16, 2023
tmavima2251 pushed a commit to tmavima2251/Pytorch that referenced this pull request Dec 8, 2023
ghstack-source-id: 938093a
Pull Request resolved: pytorch/pytorch#111125