
[Inductor] GEMM shape padding improvements #118522

Closed
wants to merge 1 commit

Conversation

kadeng
Contributor

@kadeng kadeng commented Jan 29, 2024

Improvements to shape padding logic in torch/_inductor/pad_mm.py

These changes could lead to up to a 14% perf improvement for certain Meta-internal models in experiments.

Most notably:

  • Use the aten.const_pad_nd operation to pad tensors in a single op instead of using multiple steps involving intermediate buffers. This appears to be more performant than the previous logic, confirmed by profiling & benchmarking results (Meta-internal); see the illustrative sketch below this list.
  • Make many paddings unnecessary by using an explicitly transposed GEMM when either the M or N dimension is properly aligned but the other is not, configurable via config.shape_pad_use_transpose (default: True).
  • Enable shape padding for the Inductor CUDA / Cutlass backend for all GEMM ops where Cutlass would be enabled, without benchmarking in that case.
  • Add a config flag to always pad shapes (without benchmarking first), configurable via config.force_shape_pad (default: False).
  • Added several new unit tests to ensure tensors are padded such that they meet all alignment requirements after padding.
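As a rough, self-contained illustration of the single-op padding idea (not the actual pad_mm.py code; the helper name, the fixed alignment of 8, and the example shapes are assumptions made for this sketch), the snippet below pads both dimensions of each operand with one torch.nn.functional.pad call, which lowers to aten.constant_pad_nd, runs the GEMM, and slices the result back to the original shape:

```python
import torch
import torch.nn.functional as F

ALIGN = 8  # illustrative alignment; the real requirement depends on dtype and backend

def pad_mat(x: torch.Tensor, align: int = ALIGN) -> torch.Tensor:
    """Pad both dimensions of a 2-D tensor up to multiples of `align` with a
    single constant-pad call (F.pad lowers to aten.constant_pad_nd)."""
    pad_rows = -x.shape[0] % align
    pad_cols = -x.shape[1] % align
    # F.pad takes padding pairs starting from the last dimension:
    # (cols_before, cols_after, rows_before, rows_after)
    return F.pad(x, (0, pad_cols, 0, pad_rows))

a = torch.randn(127, 1000)  # M=127 (unaligned), K=1000 (aligned)
b = torch.randn(1000, 513)  # K=1000 (aligned), N=513 (unaligned)
c = (pad_mat(a) @ pad_mat(b))[:127, :513]  # GEMM on padded operands, then slice back to M x N
```

In the actual pass, padding is applied selectively per dimension; the point here is just the single constant-pad call in place of multi-step padding through intermediate buffers.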

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot bot commented Jan 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/118522

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit baf3b9a with merge base 68c3cb7:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@kadeng kadeng force-pushed the kadeng/inductor-cutlass-backend-3-PR1 branch from eac9bc0 to d9c820f on January 29, 2024 19:00
@kadeng kadeng self-assigned this Jan 30, 2024
@kadeng kadeng marked this pull request as ready for review January 30, 2024 10:20
@facebook-github-bot
Contributor

@kadeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor

@eellison eellison left a comment

Cool! Would you add a few tests?

torch/_inductor/fx_passes/pad_mm.py (review thread resolved)
torch/_inductor/fx_passes/pad_mm.py (outdated review thread resolved)
size=[batchsize, m, n],
stride=[n * m, n, 1],
)
if use_cutlass_template(fake_layout):
Contributor

Do you still want to pad bandwidth-bound mms in Cutlass? I found that for small matmuls it wasn't profitable to pad.

Contributor Author

@kadeng kadeng Jan 31, 2024

We're not going to use Cutlass for small matmuls; in practice there will be a size threshold (in terms of MNK) below which it won't be used. This is how I did it here (Meta-internal link: https://fb.workplace.com/notes/347669491400529/) and arrived at whole-model speedups of up to 14%.

And large, bandwidth-bound matmuls are definitely what I want the padding to be applied to; these can be sped up considerably by Cutlass. This has a benefit on average because the padding can often be fused with other (previous) ops by Triton (such that no memory-IO overhead remains from the padding) or be done via no-op memory reinterpretation (e.g. transpose, etc.). In addition, the Cutlass speedup can be so large that it would speed things up even if the above were not the case.

See also this (really large) PR where the threshold is introduced: #118416

Contributor

@eellison eellison Feb 1, 2024

This has a benefit on average because the padding can often be fused with other (previous) ops by Triton (such that no memory-IO overhead remains from the padding) or be done via no-op memory reinterpretation (e.g. transpose, etc.).

For both of these cases we should be able to analyze the graph to know that they are going to fire... A general TODO in this file is to skip padding a tensor during benchmarking when it comes from a fusable operator.
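As a hedged, standalone illustration of the fusion argument (whether the pad actually fuses depends on the Inductor version and its heuristics; the shapes and the relu producer are assumptions for this sketch), one can compile a small producer-pad-matmul chain and inspect the generated code to check whether the pad lands in the same kernel as the pointwise producer:

```python
import torch
import torch.nn.functional as F

@torch.compile
def f(x, w):
    y = torch.relu(x)        # pointwise producer
    y = F.pad(y, (0, 1))     # pad K from 127 to 128; Inductor may fuse this into the relu kernel
    return y @ w

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(128, 127, device=device)
w = torch.randn(128, 256, device=device)
out = f(x, w)  # run with TORCH_LOGS="output_code" to inspect the generated kernels
```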

torch/_inductor/fx_passes/pad_mm.py (review thread resolved)
Comment on lines +296 to +302
elif n_padded_length == 0 and m_padded_length != 0:
    m_padded_length = 0
Contributor

For my own understanding:

If A[M, K] @ B[K, N] and only dimension M needs padding, do we consider that this matmul does not need padding at all?

Or do you actually mean that padding dimension M is not worth it?

Contributor Author

@kadeng kadeng Jan 31, 2024

Right, because A[M,K] @ B[K,N] -> [M,N] is fine: N will be the last dimension of the output, so M does not have to meet alignment requirements (assuming a row-major output layout, which the matmul backend will certainly pick if M is not aligned).

So, in other words, we only need K and either M or N to be aligned.
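A minimal sketch of that rule (a hypothetical helper, not the code in pad_mm.py): K must always be aligned, but of M and N only one has to be, since the already-aligned dimension can serve as the contiguous last dimension of the output, either directly for N or, presumably, via an explicitly transposed GEMM (using A @ B = (Bᵀ @ Aᵀ)ᵀ) for M:

```python
def drop_redundant_padding(m_pad: int, n_pad: int, k_pad: int) -> tuple[int, int, int]:
    """K padding is always kept; of M and N, padding one of them is enough,
    because the already-aligned one can be made the last dimension of the
    output (row-major output, or an explicitly transposed GEMM)."""
    if n_pad == 0 and m_pad != 0:
        m_pad = 0   # N already aligned: row-major output, no need to pad M
    elif m_pad == 0 and n_pad != 0:
        n_pad = 0   # M already aligned: transposed (column-major) output, no need to pad N
    return m_pad, n_pad, k_pad
```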

@kadeng kadeng force-pushed the kadeng/inductor-cutlass-backend-3-PR1 branch from d9c820f to 64fb65a on January 31, 2024 19:26
@facebook-github-bot
Contributor

@kadeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@kadeng kadeng force-pushed the kadeng/inductor-cutlass-backend-3-PR1 branch from 64fb65a to 4b5cd88 on January 31, 2024 19:30
@facebook-github-bot
Contributor

@kadeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@kadeng
Contributor Author

kadeng commented Jan 31, 2024

Cool! Would you add a few tests?

Just did that. I focused on testing the padding logic itself, not the pattern matching etc., which is unchanged.

torch/_inductor/config.py (outdated review thread resolved)
@kadeng kadeng force-pushed the kadeng/inductor-cutlass-backend-3-PR1 branch from 4b5cd88 to a9ba985 on February 1, 2024 11:12
Improvements to shape padding logic in torch/_inductor/pad_mm.py

Most notably:

  * Enable shape padding for Cutlass
  * Add flag to always pad shapes
  * Use aten.const_pad_nd operation to pad Tensors in a single op instead of using multiple steps involving intermediate buffers.
  * Make many paddings unnecessary when either the M or N dimension is properly aligned but the other is not (configurable, on by default)

Updates:

 * Addressed reviewer comments
 * Removed config setting to only pad K dimension
 * Added detailed unit tests in test/inductor/test_pad_mm.py
@kadeng kadeng force-pushed the kadeng/inductor-cutlass-backend-3-PR1 branch from a9ba985 to baf3b9a on February 1, 2024 11:17
@facebook-github-bot
Contributor

@kadeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

pytorch-bot bot pushed a commit that referenced this pull request Feb 8, 2024
Improvements to shape padding logic in torch/_inductor/pad_mm.py

These changes could lead to up to a 14% perf improvement for certain Meta-internal models in experiments.

Most notably:
  * Use the aten.const_pad_nd operation to pad tensors in a single op instead of using multiple steps involving intermediate buffers. This appears to be more performant than the previous logic, confirmed by profiling & benchmarking results (Meta-internal).
  * Make many paddings unnecessary by using an explicitly transposed GEMM when either the M or N dimension is properly aligned but the other is not, configurable via config.shape_pad_use_transpose (default: True).
  * Enable shape padding for the Inductor CUDA / Cutlass backend for all GEMM ops where Cutlass would be enabled, without benchmarking in that case.
  * Add a config flag to always pad shapes (without benchmarking first), configurable via config.force_shape_pad (default: False).
  * Added several new unit tests to ensure tensors are padded such that they meet all alignment requirements after padding.

Pull Request resolved: #118522
Approved by: https://github.com/jansel, https://github.com/eellison
pytorch-bot bot pushed a commit that referenced this pull request Feb 8, 2024
This reverts commit cc46829.

Reverted #118522 on behalf of https://github.com/eellison because it regresses HF by ~4-5% (comment)
vfdev-5 pushed a commit to vfdev-5/pytorch that referenced this pull request Feb 9, 2024
clee2000 pushed a commit that referenced this pull request Feb 14, 2024
This reverts commit cc46829.

Reverted #118522 on behalf of https://github.com/eellison because it regresses HF by ~4-5% (comment)

github-actions bot commented Apr 7, 2024

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Apr 7, 2024
@kadeng
Contributor Author

kadeng commented May 6, 2024

Superseded by #119578

@kadeng kadeng closed this May 6, 2024
eellison added a commit that referenced this pull request May 8, 2024
Relanding just the pad-in-a-single-pass portion of the PR, not including the transpose logic.

[ghstack-poisoned]
eellison added a commit that referenced this pull request May 8, 2024
Relanding just the pad-in-a-single-pass portion of the PR, not including the transpose logic.

ghstack-source-id: 710d916730270cc59da0e4410ecf12be434cebc0
Pull Request resolved: #125773
eellison added a commit that referenced this pull request May 9, 2024
…ovements (#118522)'"

Relanding just the pad-in-a-single-pass portion of [the pr](#118522), not including the transpose logic.

This was previously accepted and reviewed.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request May 15, 2024
Relanding just the pad-in-a-single-pass portion of [the pr](#118522), not including the transpose logic.

This was previously accepted and reviewed.

Pull Request resolved: #125773
Approved by: https://github.com/shunting314
ghstack dependencies: #125772
ZelboK pushed a commit to ZelboK/pytorch that referenced this pull request May 19, 2024
…ytorch#125773)

Relanding just the pad-in-a-single-pass portion of [the pr](pytorch#118522), not including the transpose logic.

This was previously accepted and reviewed.

Pull Request resolved: pytorch#125773
Approved by: https://github.com/shunting314
ghstack dependencies: pytorch#125772
@github-actions github-actions bot deleted the kadeng/inductor-cutlass-backend-3-PR1 branch June 6, 2024 01:53