Conversation

kadeng (Contributor) commented Feb 9, 2024

Stack from ghstack (oldest at bottom):

This diff enables flexible EVT-based matmul fusions that may require one tensor input
in addition to the matmul operands (A and B), passed via the same input operand (C)
that is also used for the bias in the case of linear epilogues.
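
Below is a minimal sketch (illustrative only, not code from this diff) of the fusion pattern this enables: a matmul whose pointwise epilogue consumes one extra tensor input. Shapes, dtypes, and the `max-autotune` mode are assumptions for illustration.

```python
# Hedged sketch: matmul + pointwise epilogue with one auxiliary tensor input.
import torch

def fn(a, b, aux):
    # The epilogue (add + relu) consumes `aux` in addition to the matmul result.
    return torch.relu(a @ b + aux)

compiled = torch.compile(fn, mode="max-autotune")

a = torch.randn(128, 64, device="cuda", dtype=torch.float16)
b = torch.randn(64, 256, device="cuda", dtype=torch.float16)
aux = torch.randn(128, 256, device="cuda", dtype=torch.float16)
out = compiled(a, b, aux)
```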

Test Plan:
* Additional unit tests in test/inductor/test_max_autotune.py
* Manual inspection of the generated code
* CI

[Inductor max autotune] Robust Autotuning

This change makes autotuning robust against failures: autotuning no longer serves
only for performance comparison, but also filters out kernels with bugs
(segfaults, CUDA illegal memory accesses, infinite loops, ...).

Also, if no suitable kernels are found, it allows falling back to
alternative GEMM backends (e.g. ATen).
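
Conceptually, the tolerant autotuning flow looks roughly like the following sketch (names and structure are illustrative, not the actual Inductor code):

```python
# Illustrative sketch: benchmark each candidate, drop candidates that crash
# or misbehave, and fall back to an alternative backend if none survive.
def pick_kernel(candidates, fallback, benchmark):
    timings = {}
    for kernel in candidates:
        try:
            # In practice this runs in a guarded/isolated fashion so that
            # segfaults, CUDA IMAs, or hangs don't take down the compiler.
            timings[kernel] = benchmark(kernel)
        except Exception:
            continue  # buggy kernel: filter it out rather than failing
    if not timings:
        return fallback  # e.g. an ATen-backed GEMM choice
    return min(timings, key=timings.get)
```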

[Inductor max autotune] Flexible GEMM layout autotuning

This diff introduces memory layout autotuning and relaxes which memory
layouts the CUTLASS GEMM kernels accept and write.

If CUTLASS GEMM kernels have inputs with flexible memory layouts, all
possible combinations of row-major and column-major layouts are tried
during autotuning.
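
The layout search amounts to enumerating layout combinations, roughly as in this sketch (assuming each flexible input is independently row- or column-major; not the actual Inductor code):

```python
# Hedged sketch of the layout enumeration.
import itertools

LAYOUTS = ("row_major", "column_major")

def layout_combinations(num_flexible_inputs):
    # Each combination becomes a separate autotuning candidate.
    return itertools.product(LAYOUTS, repeat=num_flexible_inputs)

for combo in layout_combinations(2):
    print(combo)  # 4 candidates: (row, row), (row, col), (col, row), (col, col)
```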

Test Plan:

* Additional unit test(s)
* CI

[Inductor Cutlass backend] Improvements to CUTLASS Kernels

* Standalone runner (CUTLASS kernels may be compiled as standalone executables for debugging & profiling)
* Retuning (after fusion, another round of autotuning is done to determine the best kernel to pick given the fusion)
* Support for auxiliary inputs (additional GEMM inputs beyond operands A, B and C)
* Support for tanh and sigmoid in EVT epilogue expressions (see the sketch after this list)
* More unit tests
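
As referenced in the list above, epilogues such as these can now be expressed in EVT (illustrative torch.compile usage, not code from this diff):

```python
# Hedged sketch: tanh and sigmoid epilogues over a matmul.
import torch

def tanh_mm(a, b):
    return torch.tanh(a @ b)

def sigmoid_mm(a, b):
    return torch.sigmoid(a @ b)

compiled_tanh = torch.compile(tanh_mm, mode="max-autotune")
compiled_sigmoid = torch.compile(sigmoid_mm, mode="max-autotune")
```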

[Inductor Cutlass backend] Prevent problems with symbolic shapes

Without this change, GEMM inputs with symbolic dimensions could lead
to runtime exceptions. This change ensures that, while the CUTLASS
backend will not be picked in such cases, at least no exceptions occur.
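
For example, a dynamically shaped GEMM like the following should now compile cleanly, with the CUTLASS backend skipped rather than raising (sketch under assumed shapes/dtypes):

```python
# Hedged sketch: symbolic (dynamic) dims no longer cause exceptions; the
# CUTLASS backend is simply not selected for such inputs.
import torch

def fn(a, b):
    return a @ b

compiled = torch.compile(fn, mode="max-autotune", dynamic=True)
a = torch.randn(128, 64, device="cuda", dtype=torch.float16)
b = torch.randn(64, 256, device="cuda", dtype=torch.float16)
out = compiled(a, b)  # other GEMM backends handle it; no exception
```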

[Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

Notable changes:

* Smaller refactorings (notably extract-method refactorings to make the code more readable)
* Several new unit tests
* Improved, more robust parsing of load index expressions when interpreting Pointwise.inner_fn nodes (see the sketch after this list)
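
As referenced in the last item above, "parsing a load index expression" here means recovering strides and an offset from a sympy index; roughly (illustrative sketch, not the actual parsing code):

```python
# Hedged sketch: recover per-dimension strides and the offset from a
# sympy load index expression such as i0*256 + i1 + 128.
import sympy

i0, i1 = sympy.symbols("i0 i1")
index = i0 * 256 + i1 + 128

strides = [index.coeff(v) for v in (i0, i1)]  # [256, 1]
offset = index.subs({i0: 0, i1: 0})           # 128
print(strides, offset)
```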

[Inductor Cutlass backend] Log CUDA compilation times

Log the time taken by CUDA compilations in the debug log.
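
The pattern is roughly the following (illustrative, not the exact Inductor code):

```python
# Hedged sketch: time an arbitrary compile step and record it at DEBUG level.
import logging
import time

log = logging.getLogger(__name__)

def timed_compile(compile_fn, source):
    start = time.time()
    result = compile_fn(source)
    log.debug("CUDA compilation took %.3f seconds", time.time() - start)
    return result
```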

[Inductor Cutlass backend] Minor improvements to unit tests

As the title says: minor improvements and changes in the unit
tests of the Inductor CUTLASS backend.

[Inductor Cutlass backend] Add support for aux inputs without requiring shared memory

A common problem when fusing epilogues is that additional (auxiliary) inputs require
shared memory. When all shared memory is already claimed by the GEMM op, as is
commonly the case for TMA ops, compilation of the fused epilogue fails.

This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without going through shared memory.

In experiments, this proved crucial for enabling nontrivial epilogue fusions.

[Inductor Cutlass backend] Support more edge cases involving strided memory layouts

This commit fixes several issues encountered with edge cases from a real-world
model involving complex strided inputs with offsets.

[Inductor Cutlass backend] Workaround for flaky tests caused by nondeterministic numerical differences in Pingpong kernels

Some tests of the Inductor CUTLASS backend exhibited flakiness, i.e. they failed nondeterministically. Even with
identical inputs, zeroed-out buffers etc., the results could differ somewhat from invocation to invocation,
which could lead to test failures due to numerical differences.

This only happened when CUTLASS' SM90 TMA warp-specialized "Pingpong" kernels were used.

This diff introduces config options that allow whitelisting/blacklisting CUTLASS kernels,
and by default filters out these Pingpong kernels.

They can easily be re-enabled via a configuration setting if desired (see the sketch below).
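
A sketch of re-enabling them; the exact option names are assumptions based on this description, not verified against the diff:

```python
# Hedged sketch: lift the default filtering of "Pingpong" kernels.
# `cutlass_op_allowlist_regex` / `cutlass_op_denylist_regex` are assumed names.
import torch._inductor.config as inductor_config

inductor_config.cuda.cutlass_op_denylist_regex = None   # default filters Pingpong kernels
inductor_config.cuda.cutlass_op_allowlist_regex = None  # no allowlist restriction
```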

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @muchulee8 @aakhundov @ColinPeppler

pytorch-bot commented Feb 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119601

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d2d5baa with merge base 86063b4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow whitelisting/blacklisting CUTLASS kernels,
    and by default filters out these "pingpong" kernels.

    They can be easily re-enabled via a configuration setting if desired.
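
    The filter is essentially a pair of regular expressions applied to the
    generated CUTLASS operation names; a sketch with hypothetical config names
    (check torch._inductor.config for the exact spelling in a given version):

    ```python
    import re

    # Hypothetical config values; the pingpong schedule is excluded by default.
    cutlass_op_allowlist_regex = ""         # empty -> allow everything
    cutlass_op_denylist_regex = "pingpong"  # drop warp-specialized pingpong kernels

    def keep_op(op_name: str) -> bool:
        if cutlass_op_allowlist_regex and not re.search(cutlass_op_allowlist_regex, op_name):
            return False
        if cutlass_op_denylist_regex and re.search(cutlass_op_denylist_regex, op_name):
            return False
        return True

    # e.g. keep_op("cutlass3x_sm90_..._warpspecialized_pingpong_...") -> False
    ```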

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
@kadeng kadeng requested review from desertfire and ipiszy February 27, 2024 10:45
Contributor

@jansel jansel left a comment

Please break this up into logical chunks, each including tests that are enabled (I believe the previous PR disabled all tests).

@kadeng
Contributor Author

kadeng commented Mar 11, 2024

Closing. The PR stack is being restructured here: #121492

@kadeng kadeng closed this Mar 11, 2024
@github-actions github-actions bot deleted the gh/kadeng/35/head branch April 11, 2024 02:04