[Inductor Cutlass backend] 2 of 2 - Enabling flexible EVT-based pointwise fusions with additional tensor input #119601
This diff enables flexible EVT-based matmul fusions that may require one tensor input in addition to the matmul operands (A and B). The additional tensor is passed via the same input operand (C) that is also used for the bias in linear epilogues.
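For illustration, here is a minimal sketch (not taken from this PR, names made up) of the kind of pattern this targets: a GEMM whose pointwise epilogue consumes one extra tensor input.

```python
import torch

# Sketch: a matmul followed by a pointwise epilogue that reads one extra
# tensor ("aux"); with the CUTLASS backend and max-autotune enabled, this is
# the shape of fusion the EVT-based epilogue support is meant to cover.
@torch.compile(mode="max-autotune")
def gemm_with_aux(a, b, aux):
    return torch.relu(a @ b + aux)

a = torch.randn(256, 128, device="cuda", dtype=torch.float16)
b = torch.randn(128, 512, device="cuda", dtype=torch.float16)
aux = torch.randn(256, 512, device="cuda", dtype=torch.float16)
out = gemm_with_aux(a, b, aux)
```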
Test Plan:
* Additional unit tests in test/inductor/test_max_autotune.py
* Manual inspection of the generated code
* CI
[Inductor max autotune] Robust Autotuning
This change makes autotuning robust against failures: autotuning is no longer used only for performance comparison, but also to filter out kernels with bugs (segfaults, CUDA illegal memory accesses, infinite loops, ...).
If no suitable kernels are found, it also allows falling back to alternative GEMM backends (e.g. ATen), as sketched below.
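A rough sketch of how a user might opt into CUTLASS GEMMs while keeping ATen available as a fallback; the option names reflect the Inductor config as I understand it and may differ between releases.

```python
import torch
import torch._inductor.config as inductor_config

# Enable GEMM autotuning and list both backends; keeping ATen in the list
# lets autotuning fall back to an ATen GEMM when no CUTLASS kernel compiles,
# runs correctly, or wins the benchmark.
inductor_config.max_autotune = True
inductor_config.max_autotune_gemm_backends = "CUTLASS,ATen"

@torch.compile(mode="max-autotune")
def mm(a, b):
    return a @ b
```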
[Inductor max autotune] Flexible GEMM layout autotuning
This diff introduces memory-layout autotuning and relaxes the memory layouts that the CUTLASS GEMM kernels accept and write. During autotuning, if CUTLASS GEMM kernels have inputs with flexible memory layouts, all combinations of row-major and column-major layouts are tried.
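Illustration only (my own example, not from the PR): the same compiled GEMM fed a row-major operand and a column-major (transposed-view) operand, the kind of layout variation the autotuner now explores instead of assuming a fixed layout.

```python
import torch

@torch.compile(mode="max-autotune")
def mm(a, b):
    return a @ b

a = torch.randn(512, 256, device="cuda", dtype=torch.float16)
b_row_major = torch.randn(256, 512, device="cuda", dtype=torch.float16)
# A transposed view of a contiguous tensor is column-major in memory.
b_col_major = torch.randn(512, 256, device="cuda", dtype=torch.float16).t()

out_row = mm(a, b_row_major)
out_col = mm(a, b_col_major)
```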
Test Plan:
* Additional Unit test(s)
* CI
[Inductor Cutlass backend] Improvements to CUTLASS Kernels
* Standalone runner (CUTLASS kernels may be compiled as standalone executables for debugging and profiling)
* Retuning (after fusion, another round of autotuning is performed to pick the best kernel for the fused operation)
* Support for auxiliary inputs (additional GEMM inputs beyond operands A, B, and C)
* Support for tanh and sigmoid in EVT epilogue expressions (see the sketch below this list)
* More unit tests
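As a hedged illustration of the tanh/sigmoid support (my own example, function names made up), epilogues like the following are the kind of expressions the EVT translation can now cover:

```python
import torch

@torch.compile(mode="max-autotune")
def gemm_tanh_epilogue(a, b, c):
    # tanh applied on top of the GEMM plus a bias-like addend
    return torch.tanh(a @ b + c)

@torch.compile(mode="max-autotune")
def gemm_sigmoid_epilogue(a, b, c):
    return torch.sigmoid(a @ b + c)
```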
[Inductor Cutlass backend] Prevent problems with symbolic shapes
Without this change, GEMM inputs with symbolic dimensions could lead to runtime exceptions. With it, the CUTLASS backend is simply not selected for such inputs, and no exceptions occur.
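For context, a minimal sketch of a compile path that exercises symbolic (dynamic) dimensions; per the description above, the expectation is that the CUTLASS backend quietly bows out here rather than raising.

```python
import torch

@torch.compile(mode="max-autotune", dynamic=True)
def mm(a, b):
    return a @ b

# The first dimension varies across calls, so it becomes a symbolic size.
for m in (64, 96, 128):
    a = torch.randn(m, 256, device="cuda", dtype=torch.float16)
    b = torch.randn(256, 512, device="cuda", dtype=torch.float16)
    mm(a, b)
```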
[Inductor Cutlass backend] Broadcasting fixes & Code quality improvements
Notable changes:
* Smaller refactorings (in particular method extraction to make the code more readable)
* Several new unit tests
* Improved, more robust parsing of load index expressions when interpreting Pointwise.inner_fn nodes
[Inductor cutlass backend] Log CUDA compilation times
Log the time taken by CUDA compilations in the debug log.
[Inductor cutlass backend] Minor improvements to unit tests
As the title says: minor improvements and changes to the unit tests of the Inductor CUTLASS backend.
[Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory
A common problem when fusing epilogues is that additional (auxiliary) inputs require shared memory. When all shared memory is already consumed by the GEMM op itself, as is commonly the case for TMA-based ops, compilation of the fused epilogue fails.
This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without using shared memory.
In experiments, this proved crucial for enabling nontrivial epilogue fusions.
[Inductor cutlass backend] Support more edge cases involving strided memory layouts
This commit fixes several issues encountered with edge cases from a real-world model, where complex strided inputs with offsets appear.
[Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences
Some tests of the Inductor CUTLASS backend were flaky, i.e. they failed nondeterministically. Even with identical inputs, zeroed-out buffers, etc., results could differ slightly from invocation to invocation, which led to test failures due to numerical differences.
This only happened when CUTLASS's SM90 TMA warp-specialized "Pingpong" kernels were used.
This diff introduces config options to allowlist/denylist CUTLASS kernels and filters out these Pingpong kernels by default.
They can easily be re-enabled via a configuration setting if desired.
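As a hedged sketch of how such filtering might be driven: the option names below are placeholders for the allowlist/denylist regex settings this stack adds under torch._inductor.config.cuda; the real names may differ, so the assignments are left commented out.

```python
import torch._inductor.config as inductor_config

# Placeholder option names (assumption): regexes matched against CUTLASS
# operation names to keep or drop candidates during autotuning. By default
# the deny pattern filters out "pingpong" kernels; clearing it (or widening
# the allow pattern) would re-enable them.
# inductor_config.cuda.cutlass_op_denylist_regex = None
# inductor_config.cuda.cutlass_op_allowlist_regex = ".*pingpong.*"
```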
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119601
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit d2d5baa with merge base 86063b4 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
…-based pointwise fusions with additional tensor input"
This diff enables flexible EVT based Matmul fusions which may require one tensor input
in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
also used for the Bias in the case of linear epilogues.
Test Plan:
* Additional unit tests in test/inductor/test_max_autotune.py
* Manual inspection of the generated code
* CI
[Inductor max autotune] Robust Autotuning
This change provides many changes which make autotuning robust against failures,
e.g. Autotuning is not just for performance comparison anymore, but also to filter
out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).
Also, if no suitable Kernels are found, it allows to fall back to
alternative GEMM backends ( e.g. ATen )
[inductor max autotune] Flexible GEMM layout autotuning
This diff introduces memory layout autotuning and flexibilizes
memory layouts that are accepted and written by the Cutlass GEMM Kernels.
During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
layouts, all possible combinations of row-major or column major layouts are
tried during autotuning.
Test Plan:
* Additional Unit test(s)
* CI
[Inductor Cutlass backend] Improvements to CUTLASS Kernels
* Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
* Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
* Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
* Support for tanh and sigmoid in EVT epilogue expressions
* More unit tests
[Inductor Cutlass backend] prevent problems with symbolic shapes
Without this change, GEMM inputs with symbolic dimensions could lead
to runtime exceptions. This change makes sure that, while the CUTLASS
backend will not be picked, at least no exceptions occur.
[Inductor Cutlass backend] Broadcasting fixes & Code quality improvements
Notable changes:
* Smaller refactorings ( extract method specifically to make code more readable )
* Several new unit tests
* Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node
[Inductor cutlass backend] Log CUDA compilation times
Log the time that CUDA compilations take in the debug log
[Inductor cutlass backend] Minor improvements to unit tests
As the title says, minor improvements / changes in the unit
tests of the Inductor CUTLASS backend
[Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory
A common problem when fusing epilogues is that additional (auxiliary) inputs require
shared memory. But when all shared memory is already required by the GEMM op, like
is commonly the case for TMA ops, the compilation of the fused epilogue will fail.
This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.
In experiments, this proved crucial to enable nontrivial epilogue fusions.
[Inductor cutlass backend] Support more edge cases involving strided memory layouts
This commit fixes several issues which were encountered when dealing with edge cases
from a real world model, where complex strided inputs with offsets were encountered.
[Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences
Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
which could lead to test failures due to numerical differences.
This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.
This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
and by default filters these "Pingpong" Kernels.
They can be easily re-enabled via a configuration setting if desired.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"
This diff enables flexible EVT based Matmul fusions which may require one tensor input
in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
also used for the Bias in the case of linear epilogues.
Test Plan:
* Additional unit tests in test/inductor/test_max_autotune.py
* Manual inspection of the generated code
* CI
[Inductor max autotune] Robust Autotuning
This change provides many changes which make autotuning robust against failures,
e.g. Autotuning is not just for performance comparison anymore, but also to filter
out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).
Also, if no suitable Kernels are found, it allows to fall back to
alternative GEMM backends ( e.g. ATen )
[inductor max autotune] Flexible GEMM layout autotuning
This diff introduces memory layout autotuning and flexibilizes
memory layouts that are accepted and written by the Cutlass GEMM Kernels.
During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
layouts, all possible combinations of row-major or column major layouts are
tried during autotuning.
Test Plan:
* Additional Unit test(s)
* CI
[Inductor Cutlass backend] Improvements to CUTLASS Kernels
* Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
* Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
* Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
* Support for tanh and sigmoid in EVT epilogue expressions
* More unit tests
[Inductor Cutlass backend] prevent problems with symbolic shapes
Without this change, GEMM inputs with symbolic dimensions could lead
to runtime exceptions. This change makes sure that, while the CUTLASS
backend will not be picked, at least no exceptions occur.
[Inductor Cutlass backend] Broadcasting fixes & Code quality improvements
Notable changes:
* Smaller refactorings ( extract method specifically to make code more readable )
* Several new unit tests
* Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node
[Inductor cutlass backend] Log CUDA compilation times
Log the time that CUDA compilations take in the debug log
[Inductor cutlass backend] Minor improvements to unit tests
As the title says, minor improvements / changes in the unit
tests of the Inductor CUTLASS backend
[Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory
A common problem when fusing epilogues is that additional (auxiliary) inputs require
shared memory. But when all shared memory is already required by the GEMM op, like
is commonly the case for TMA ops, the compilation of the fused epilogue will fail.
This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.
In experiments, this proved crucial to enable nontrivial epilogue fusions.
[Inductor cutlass backend] Support more edge cases involving strided memory layouts
This commit fixes several issues which were encountered when dealing with edge cases
from a real world model, where complex strided inputs with offsets were encountered.
[Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences
Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
which could lead to test failures due to numerical differences.
This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.
This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
and by default filters these "Pingpong" Kernels.
They can be easily re-enabled via a configuration setting if desired.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"
This diff enables flexible EVT based Matmul fusions which may require one tensor input
in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
also used for the Bias in the case of linear epilogues.
Test Plan:
* Additional unit tests in test/inductor/test_max_autotune.py
* Manual inspection of the generated code
* CI
[Inductor max autotune] Robust Autotuning
This change provides many changes which make autotuning robust against failures,
e.g. Autotuning is not just for performance comparison anymore, but also to filter
out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).
Also, if no suitable Kernels are found, it allows to fall back to
alternative GEMM backends ( e.g. ATen )
[inductor max autotune] Flexible GEMM layout autotuning
This diff introduces memory layout autotuning and flexibilizes
memory layouts that are accepted and written by the Cutlass GEMM Kernels.
During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
layouts, all possible combinations of row-major or column major layouts are
tried during autotuning.
Test Plan:
* Additional Unit test(s)
* CI
[Inductor Cutlass backend] Improvements to CUTLASS Kernels
* Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
* Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
* Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
* Support for tanh and sigmoid in EVT epilogue expressions
* More unit tests
[Inductor Cutlass backend] prevent problems with symbolic shapes
Without this change, GEMM inputs with symbolic dimensions could lead
to runtime exceptions. This change makes sure that, while the CUTLASS
backend will not be picked, at least no exceptions occur.
[Inductor Cutlass backend] Broadcasting fixes & Code quality improvements
Notable changes:
* Smaller refactorings ( extract method specifically to make code more readable )
* Several new unit tests
* Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node
[Inductor cutlass backend] Log CUDA compilation times
Log the time that CUDA compilations take in the debug log
[Inductor cutlass backend] Minor improvements to unit tests
As the title says, minor improvements / changes in the unit
tests of the Inductor CUTLASS backend
[Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory
A common problem when fusing epilogues is that additional (auxiliary) inputs require
shared memory. But when all shared memory is already required by the GEMM op, like
is commonly the case for TMA ops, the compilation of the fused epilogue will fail.
This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.
In experiments, this proved crucial to enable nontrivial epilogue fusions.
[Inductor cutlass backend] Support more edge cases involving strided memory layouts
This commit fixes several issues which were encountered when dealing with edge cases
from a real world model, where complex strided inputs with offsets were encountered.
[Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences
Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
which could lead to test failures due to numerical differences.
This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.
This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
and by default filters these "Pingpong" Kernels.
They can be easily re-enabled via a configuration setting if desired.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"
This diff enables flexible EVT based Matmul fusions which may require one tensor input
in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
also used for the Bias in the case of linear epilogues.
Test Plan:
* Additional unit tests in test/inductor/test_max_autotune.py
* Manual inspection of the generated code
* CI
[Inductor max autotune] Robust Autotuning
This change provides many changes which make autotuning robust against failures,
e.g. Autotuning is not just for performance comparison anymore, but also to filter
out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).
Also, if no suitable Kernels are found, it allows to fall back to
alternative GEMM backends ( e.g. ATen )
[inductor max autotune] Flexible GEMM layout autotuning
This diff introduces memory layout autotuning and flexibilizes
memory layouts that are accepted and written by the Cutlass GEMM Kernels.
During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
layouts, all possible combinations of row-major or column major layouts are
tried during autotuning.
Test Plan:
* Additional Unit test(s)
* CI
[Inductor Cutlass backend] Improvements to CUTLASS Kernels
* Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
* Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
* Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
* Support for tanh and sigmoid in EVT epilogue expressions
* More unit tests
[Inductor Cutlass backend] prevent problems with symbolic shapes
Without this change, GEMM inputs with symbolic dimensions could lead
to runtime exceptions. This change makes sure that, while the CUTLASS
backend will not be picked, at least no exceptions occur.
[Inductor Cutlass backend] Broadcasting fixes & Code quality improvements
Notable changes:
* Smaller refactorings ( extract method specifically to make code more readable )
* Several new unit tests
* Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node
[Inductor cutlass backend] Log CUDA compilation times
Log the time that CUDA compilations take in the debug log
[Inductor cutlass backend] Minor improvements to unit tests
As the title says, minor improvements / changes in the unit
tests of the Inductor CUTLASS backend
[Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory
A common problem when fusing epilogues is that additional (auxiliary) inputs require
shared memory. But when all shared memory is already required by the GEMM op, like
is commonly the case for TMA ops, the compilation of the fused epilogue will fail.
This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.
In experiments, this proved crucial to enable nontrivial epilogue fusions.
[Inductor cutlass backend] Support more edge cases involving strided memory layouts
This commit fixes several issues which were encountered when dealing with edge cases
from a real world model, where complex strided inputs with offsets were encountered.
[Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences
Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
which could lead to test failures due to numerical differences.
This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.
This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
and by default filters these "Pingpong" Kernels.
They can be easily re-enabled via a configuration setting if desired.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"
This diff enables flexible EVT based Matmul fusions which may require one tensor input
in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
also used for the Bias in the case of linear epilogues.
Test Plan:
* Additional unit tests in test/inductor/test_max_autotune.py
* Manual inspection of the generated code
* CI
[Inductor max autotune] Robust Autotuning
This change provides many changes which make autotuning robust against failures,
e.g. Autotuning is not just for performance comparison anymore, but also to filter
out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).
Also, if no suitable Kernels are found, it allows to fall back to
alternative GEMM backends ( e.g. ATen )
[inductor max autotune] Flexible GEMM layout autotuning
This diff introduces memory layout autotuning and flexibilizes
memory layouts that are accepted and written by the Cutlass GEMM Kernels.
During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
layouts, all possible combinations of row-major or column major layouts are
tried during autotuning.
Test Plan:
* Additional Unit test(s)
* CI
[Inductor Cutlass backend] Improvements to CUTLASS Kernels
* Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
* Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
* Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
* Support for tanh and sigmoid in EVT epilogue expressions
* More unit tests
[Inductor Cutlass backend] prevent problems with symbolic shapes
Without this change, GEMM inputs with symbolic dimensions could lead
to runtime exceptions. This change makes sure that, while the CUTLASS
backend will not be picked, at least no exceptions occur.
[Inductor Cutlass backend] Broadcasting fixes & Code quality improvements
Notable changes:
* Smaller refactorings ( extract method specifically to make code more readable )
* Several new unit tests
* Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node
[Inductor cutlass backend] Log CUDA compilation times
Log the time that CUDA compilations take in the debug log
[Inductor cutlass backend] Minor improvements to unit tests
As the title says, minor improvements / changes in the unit
tests of the Inductor CUTLASS backend
[Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory
A common problem when fusing epilogues is that additional (auxiliary) inputs require
shared memory. But when all shared memory is already required by the GEMM op, like
is commonly the case for TMA ops, the compilation of the fused epilogue will fail.
This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.
In experiments, this proved crucial to enable nontrivial epilogue fusions.
[Inductor cutlass backend] Support more edge cases involving strided memory layouts
This commit fixes several issues which were encountered when dealing with edge cases
from a real world model, where complex strided inputs with offsets were encountered.
[Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences
Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
which could lead to test failures due to numerical differences.
This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.
This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
and by default filters these "Pingpong" Kernels.
They can be easily re-enabled via a configuration setting if desired.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"
This diff enables flexible EVT based Matmul fusions which may require one tensor input
in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
also used for the Bias in the case of linear epilogues.
Test Plan:
* Additional unit tests in test/inductor/test_max_autotune.py
* Manual inspection of the generated code
* CI
[Inductor max autotune] Robust Autotuning
This change provides many changes which make autotuning robust against failures,
e.g. Autotuning is not just for performance comparison anymore, but also to filter
out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).
Also, if no suitable Kernels are found, it allows to fall back to
alternative GEMM backends ( e.g. ATen )
[inductor max autotune] Flexible GEMM layout autotuning
This diff introduces memory layout autotuning and flexibilizes
memory layouts that are accepted and written by the Cutlass GEMM Kernels.
During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
layouts, all possible combinations of row-major or column major layouts are
tried during autotuning.
Test Plan:
* Additional Unit test(s)
* CI
[Inductor Cutlass backend] Improvements to CUTLASS Kernels
* Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
* Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
* Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
* Support for tanh and sigmoid in EVT epilogue expressions
* More unit tests
[Inductor Cutlass backend] prevent problems with symbolic shapes
Without this change, GEMM inputs with symbolic dimensions could lead
to runtime exceptions. This change makes sure that, while the CUTLASS
backend will not be picked, at least no exceptions occur.
[Inductor Cutlass backend] Broadcasting fixes & Code quality improvements
Notable changes:
* Smaller refactorings ( extract method specifically to make code more readable )
* Several new unit tests
* Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node
[Inductor cutlass backend] Log CUDA compilation times
Log the time that CUDA compilations take in the debug log
[Inductor cutlass backend] Minor improvements to unit tests
As the title says, minor improvements / changes in the unit
tests of the Inductor CUTLASS backend
[Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory
A common problem when fusing epilogues is that additional (auxiliary) inputs require
shared memory. But when all shared memory is already required by the GEMM op, like
is commonly the case for TMA ops, the compilation of the fused epilogue will fail.
This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.
In experiments, this proved crucial to enable nontrivial epilogue fusions.
[Inductor cutlass backend] Support more edge cases involving strided memory layouts
This commit fixes several issues which were encountered when dealing with edge cases
from a real world model, where complex strided inputs with offsets were encountered.
[Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences
Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
which could lead to test failures due to numerical differences.
This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.
This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
and by default filters these "Pingpong" Kernels.
They can be easily re-enabled via a configuration setting if desired.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"
This diff enables flexible EVT based Matmul fusions which may require one tensor input
in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
also used for the Bias in the case of linear epilogues.
Test Plan:
* Additional unit tests in test/inductor/test_max_autotune.py
* Manual inspection of the generated code
* CI
[Inductor max autotune] Robust Autotuning
This change provides many changes which make autotuning robust against failures,
e.g. Autotuning is not just for performance comparison anymore, but also to filter
out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).
Also, if no suitable Kernels are found, it allows to fall back to
alternative GEMM backends ( e.g. ATen )
[inductor max autotune] Flexible GEMM layout autotuning
This diff introduces memory layout autotuning and flexibilizes
memory layouts that are accepted and written by the Cutlass GEMM Kernels.
During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
layouts, all possible combinations of row-major or column major layouts are
tried during autotuning.
Test Plan:
* Additional Unit test(s)
* CI
[Inductor Cutlass backend] Improvements to CUTLASS Kernels
* Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
* Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
* Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
* Support for tanh and sigmoid in EVT epilogue expressions
* More unit tests
[Inductor Cutlass backend] prevent problems with symbolic shapes
Without this change, GEMM inputs with symbolic dimensions could lead
to runtime exceptions. This change makes sure that, while the CUTLASS
backend will not be picked, at least no exceptions occur.
[Inductor Cutlass backend] Broadcasting fixes & Code quality improvements
Notable changes:
* Smaller refactorings ( extract method specifically to make code more readable )
* Several new unit tests
* Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node
[Inductor cutlass backend] Log CUDA compilation times
Log the time that CUDA compilations take in the debug log
[Inductor cutlass backend] Minor improvements to unit tests
As the title says, minor improvements / changes in the unit
tests of the Inductor CUTLASS backend
[Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory
A common problem when fusing epilogues is that additional (auxiliary) inputs require
shared memory. But when all shared memory is already required by the GEMM op, like
is commonly the case for TMA ops, the compilation of the fused epilogue will fail.
This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.
In experiments, this proved crucial to enable nontrivial epilogue fusions.
[Inductor cutlass backend] Support more edge cases involving strided memory layouts
This commit fixes several issues which were encountered when dealing with edge cases
from a real world model, where complex strided inputs with offsets were encountered.
[Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences
Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
which could lead to test failures due to numerical differences.
This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.
This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
and by default filters these "Pingpong" Kernels.
They can be easily re-enabled via a configuration setting if desired.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"
This diff enables flexible EVT based Matmul fusions which may require one tensor input
in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
also used for the Bias in the case of linear epilogues.
Test Plan:
* Additional unit tests in test/inductor/test_max_autotune.py
* Manual inspection of the generated code
* CI
[Inductor max autotune] Robust Autotuning
This change provides many changes which make autotuning robust against failures,
e.g. Autotuning is not just for performance comparison anymore, but also to filter
out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).
Also, if no suitable Kernels are found, it allows to fall back to
alternative GEMM backends ( e.g. ATen )
[inductor max autotune] Flexible GEMM layout autotuning
This diff introduces memory layout autotuning and flexibilizes
memory layouts that are accepted and written by the Cutlass GEMM Kernels.
During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
layouts, all possible combinations of row-major or column major layouts are
tried during autotuning.
Test Plan:
* Additional Unit test(s)
* CI
[Inductor Cutlass backend] Improvements to CUTLASS Kernels
* Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
* Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
* Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
* Support for tanh and sigmoid in EVT epilogue expressions
* More unit tests
[Inductor Cutlass backend] prevent problems with symbolic shapes
Without this change, GEMM inputs with symbolic dimensions could lead
to runtime exceptions. This change makes sure that, while the CUTLASS
backend will not be picked, at least no exceptions occur.
[Inductor Cutlass backend] Broadcasting fixes & Code quality improvements
Notable changes:
* Smaller refactorings ( extract method specifically to make code more readable )
* Several new unit tests
* Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node
[Inductor cutlass backend] Log CUDA compilation times
Log the time that CUDA compilations take in the debug log
[Inductor cutlass backend] Minor improvements to unit tests
As the title says, minor improvements / changes in the unit
tests of the Inductor CUTLASS backend
[Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory
A common problem when fusing epilogues is that additional (auxiliary) inputs require
shared memory. But when all shared memory is already required by the GEMM op, like
is commonly the case for TMA ops, the compilation of the fused epilogue will fail.
This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.
In experiments, this proved crucial to enable nontrivial epilogue fusions.
[Inductor cutlass backend] Support more edge cases involving strided memory layouts
This commit fixes several issues which were encountered when dealing with edge cases
from a real world model, where complex strided inputs with offsets were encountered.
[Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences
Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
which could lead to test failures due to numerical differences.
This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.
This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
and by default filters these "Pingpong" Kernels.
They can be easily re-enabled via a configuration setting if desired.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"
This diff enables flexible EVT based Matmul fusions which may require one tensor input
in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
also used for the Bias in the case of linear epilogues.
Test Plan:
* Additional unit tests in test/inductor/test_max_autotune.py
* Manual inspection of the generated code
* CI
[Inductor max autotune] Robust Autotuning
This change provides many changes which make autotuning robust against failures,
e.g. Autotuning is not just for performance comparison anymore, but also to filter
out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).
Also, if no suitable Kernels are found, it allows to fall back to
alternative GEMM backends ( e.g. ATen )
[inductor max autotune] Flexible GEMM layout autotuning
This diff introduces memory layout autotuning and flexibilizes
memory layouts that are accepted and written by the Cutlass GEMM Kernels.
During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
layouts, all possible combinations of row-major or column major layouts are
tried during autotuning.
Test Plan:
* Additional Unit test(s)
* CI
[Inductor Cutlass backend] Improvements to CUTLASS Kernels
* Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
* Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
* Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
* Support for tanh and sigmoid in EVT epilogue expressions
* More unit tests
[Inductor Cutlass backend] prevent problems with symbolic shapes
Without this change, GEMM inputs with symbolic dimensions could lead
to runtime exceptions. This change makes sure that, while the CUTLASS
backend will not be picked, at least no exceptions occur.
[Inductor Cutlass backend] Broadcasting fixes & Code quality improvements
Notable changes:
* Smaller refactorings ( extract method specifically to make code more readable )
* Several new unit tests
* Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node
[Inductor cutlass backend] Log CUDA compilation times
Log the time that CUDA compilations take in the debug log
[Inductor cutlass backend] Minor improvements to unit tests
As the title says, minor improvements / changes in the unit
tests of the Inductor CUTLASS backend
[Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory
A common problem when fusing epilogues is that additional (auxiliary) inputs require
shared memory. But when all shared memory is already required by the GEMM op, like
is commonly the case for TMA ops, the compilation of the fused epilogue will fail.
This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.
In experiments, this proved crucial to enable nontrivial epilogue fusions.
[Inductor cutlass backend] Support more edge cases involving strided memory layouts
This commit fixes several issues which were encountered when dealing with edge cases
from a real world model, where complex strided inputs with offsets were encountered.
[Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences
Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
which could lead to test failures due to numerical differences.
This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.
This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
and by default filters these "Pingpong" Kernels.
They can be easily re-enabled via a configuration setting if desired.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"
This diff enables flexible EVT based Matmul fusions which may require one tensor input
in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
also used for the Bias in the case of linear epilogues.
Test Plan:
* Additional unit tests in test/inductor/test_max_autotune.py
* Manual inspection of the generated code
* CI
[Inductor max autotune] Robust Autotuning
This change provides many changes which make autotuning robust against failures,
e.g. Autotuning is not just for performance comparison anymore, but also to filter
out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).
Also, if no suitable Kernels are found, it allows to fall back to
alternative GEMM backends ( e.g. ATen )
[inductor max autotune] Flexible GEMM layout autotuning
This diff introduces memory layout autotuning and flexibilizes
memory layouts that are accepted and written by the Cutlass GEMM Kernels.
During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
layouts, all possible combinations of row-major or column major layouts are
tried during autotuning.
Test Plan:
* Additional Unit test(s)
* CI
[Inductor Cutlass backend] Improvements to CUTLASS Kernels
* Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
* Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
* Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
* Support for tanh and sigmoid in EVT epilogue expressions
* More unit tests
[Inductor Cutlass backend] prevent problems with symbolic shapes
Without this change, GEMM inputs with symbolic dimensions could lead
to runtime exceptions. This change makes sure that, while the CUTLASS
backend will not be picked, at least no exceptions occur.
[Inductor Cutlass backend] Broadcasting fixes & Code quality improvements
Notable changes:
* Smaller refactorings ( extract method specifically to make code more readable )
* Several new unit tests
* Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node
[Inductor cutlass backend] Log CUDA compilation times
Log the time that CUDA compilations take in the debug log
[Inductor cutlass backend] Minor improvements to unit tests
As the title says, minor improvements / changes in the unit
tests of the Inductor CUTLASS backend
[Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory
A common problem when fusing epilogues is that additional (auxiliary) inputs require
shared memory. But when all shared memory is already required by the GEMM op, like
is commonly the case for TMA ops, the compilation of the fused epilogue will fail.
This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.
In experiments, this proved crucial to enable nontrivial epilogue fusions.
[Inductor cutlass backend] Support more edge cases involving strided memory layouts
This commit fixes several issues which were encountered when dealing with edge cases
from a real world model, where complex strided inputs with offsets were encountered.
[Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences
Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
which could lead to test failures due to numerical differences.
This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.
This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
and by default filters these "Pingpong" Kernels.
They can be easily re-enabled via a configuration setting if desired.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"
This diff enables flexible EVT based Matmul fusions which may require one tensor input
in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
also used for the Bias in the case of linear epilogues.
Test Plan:
* Additional unit tests in test/inductor/test_max_autotune.py
* Manual inspection of the generated code
* CI
[Inductor max autotune] Robust Autotuning
This change provides many changes which make autotuning robust against failures,
e.g. Autotuning is not just for performance comparison anymore, but also to filter
out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).
Also, if no suitable Kernels are found, it allows to fall back to
alternative GEMM backends ( e.g. ATen )
[inductor max autotune] Flexible GEMM layout autotuning
This diff introduces memory layout autotuning and flexibilizes
memory layouts that are accepted and written by the Cutlass GEMM Kernels.
During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
layouts, all possible combinations of row-major or column major layouts are
tried during autotuning.
Test Plan:
* Additional Unit test(s)
* CI
[Inductor Cutlass backend] Improvements to CUTLASS Kernels
* Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
* Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
* Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
* Support for tanh and sigmoid in EVT epilogue expressions
* More unit tests
[Inductor Cutlass backend] prevent problems with symbolic shapes
Without this change, GEMM inputs with symbolic dimensions could lead
to runtime exceptions. This change makes sure that, while the CUTLASS
backend will not be picked, at least no exceptions occur.
[Inductor Cutlass backend] Broadcasting fixes & Code quality improvements
Notable changes:
* Smaller refactorings ( extract method specifically to make code more readable )
* Several new unit tests
* Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node
[Inductor cutlass backend] Log CUDA compilation times
Log the time that CUDA compilations take in the debug log
[Inductor cutlass backend] Minor improvements to unit tests
As the title says, minor improvements / changes in the unit
tests of the Inductor CUTLASS backend
[Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory
A common problem when fusing epilogues is that additional (auxiliary) inputs require
shared memory. But when all shared memory is already required by the GEMM op, like
is commonly the case for TMA ops, the compilation of the fused epilogue will fail.
This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.
In experiments, this proved crucial to enable nontrivial epilogue fusions.
[Inductor cutlass backend] Support more edge cases involving strided memory layouts
This commit fixes several issues which were encountered when dealing with edge cases
from a real world model, where complex strided inputs with offsets were encountered.
[Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences
Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
which could lead to test failures due to numerical differences.
This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.
This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
and by default filters these "Pingpong" Kernels.
They can be easily re-enabled via a configuration setting if desired.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"
This diff enables flexible EVT based Matmul fusions which may require one tensor input
in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
also used for the Bias in the case of linear epilogues.
Test Plan:
* Additional unit tests in test/inductor/test_max_autotune.py
* Manual inspection of the generated code
* CI
[Inductor max autotune] Robust Autotuning
This change provides many changes which make autotuning robust against failures,
e.g. Autotuning is not just for performance comparison anymore, but also to filter
out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).
Also, if no suitable Kernels are found, it allows to fall back to
alternative GEMM backends ( e.g. ATen )
[inductor max autotune] Flexible GEMM layout autotuning
This diff introduces memory layout autotuning and flexibilizes
memory layouts that are accepted and written by the Cutlass GEMM Kernels.
During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
layouts, all possible combinations of row-major or column major layouts are
tried during autotuning.
Test Plan:
* Additional Unit test(s)
* CI
[Inductor Cutlass backend] Improvements to CUTLASS Kernels
* Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
* Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
* Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
* Support for tanh and sigmoid in EVT epilogue expressions
* More unit tests
[Inductor Cutlass backend] prevent problems with symbolic shapes
Without this change, GEMM inputs with symbolic dimensions could lead
to runtime exceptions. This change makes sure that, while the CUTLASS
backend will not be picked, at least no exceptions occur.
[Inductor Cutlass backend] Broadcasting fixes & Code quality improvements
Notable changes:
* Smaller refactorings ( extract method specifically to make code more readable )
* Several new unit tests
* Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node
[Inductor cutlass backend] Log CUDA compilation times
Log the time that CUDA compilations take in the debug log
[Inductor cutlass backend] Minor improvements to unit tests
As the title says, minor improvements / changes in the unit
tests of the Inductor CUTLASS backend
[Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory
A common problem when fusing epilogues is that additional (auxiliary) inputs require
shared memory. But when all shared memory is already required by the GEMM op, like
is commonly the case for TMA ops, the compilation of the fused epilogue will fail.
This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.
In experiments, this proved crucial to enable nontrivial epilogue fusions.
[Inductor cutlass backend] Support more edge cases involving strided memory layouts
This commit fixes several issues which were encountered when dealing with edge cases
from a real world model, where complex strided inputs with offsets were encountered.
[Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences
Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
which could lead to test failures due to numerical differences.
This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.
This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
and by default filters these "Pingpong" Kernels.
They can be easily re-enabled via a configuration setting if desired.
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
Please break this up into logical chunks, each including tests that are enabled (I believe the previous PR disabled all tests).
Closing. The PR stack is being restructured here: #121492
Stack from ghstack (oldest at bottom):
This diff enables flexible EVT-based matmul fusions that may require one tensor input in addition to the matmul operands (A and B), passed via the same input operand (C) that is also used for the bias in the case of linear epilogues.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @muchulee8 @aakhundov @ColinPeppler
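For context, here is a minimal sketch of the kind of pattern this enables: a GEMM whose epilogue consumes one additional tensor, so the pointwise add and activation can be fused into the GEMM epilogue via EVT. The shapes and the choice of sigmoid (one of the supported epilogue activations) are illustrative assumptions, and the backend string format is likewise assumed.

```python
import torch
import torch._inductor.config as inductor_config

# Let GEMM autotuning consider the CUTLASS backend (value format is an assumption).
inductor_config.max_autotune = True
inductor_config.max_autotune_gemm_backends = "ATEN,CUTLASS"

@torch.compile(mode="max-autotune")
def mm_plus_aux(a, b, aux):
    # `aux` is the additional pointwise input; it can be routed through the
    # GEMM's C operand so that the add and sigmoid fuse into the epilogue.
    return torch.sigmoid(a @ b + aux)

a = torch.randn(1024, 512, device="cuda", dtype=torch.float16)
b = torch.randn(512, 256, device="cuda", dtype=torch.float16)
aux = torch.randn(1024, 256, device="cuda", dtype=torch.float16)
out = mm_plus_aux(a, b, aux)
```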