Conversation

kadeng (Contributor) commented Feb 9, 2024

Stack from ghstack (oldest at bottom):

This diff enables flexible EVT-based matmul fusions that may require one tensor input
in addition to the matmul operands (A and B), passed via the same input operand (C)
that is also used for the bias in the case of linear epilogues.
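
Below is a minimal sketch (illustrative only, not code from this diff) of the fusion pattern this enables: a matmul whose pointwise epilogue consumes one extra tensor input. Shapes, dtypes, and the `max-autotune` mode are assumptions for illustration.

```python
# Hedged sketch: matmul + pointwise epilogue with one auxiliary tensor input.
import torch

def fn(a, b, aux):
    # The epilogue (add + relu) consumes `aux` in addition to the matmul result.
    return torch.relu(a @ b + aux)

compiled = torch.compile(fn, mode="max-autotune")

a = torch.randn(128, 64, device="cuda", dtype=torch.float16)
b = torch.randn(64, 256, device="cuda", dtype=torch.float16)
aux = torch.randn(128, 256, device="cuda", dtype=torch.float16)
out = compiled(a, b, aux)
```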

Test Plan:
* Additional unit tests in test/inductor/test_max_autotune.py
* Manual inspection of the generated code
* CI

[Inductor max autotune] Robust Autotuning

This change makes autotuning robust against failures: autotuning no longer serves
only for performance comparison, but also filters out kernels with bugs
(segfaults, CUDA illegal memory accesses, infinite loops, ...).

Also, if no suitable kernels are found, it allows falling back to
alternative GEMM backends (e.g. ATen).
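
Conceptually, the tolerant autotuning flow looks roughly like the following sketch (names and structure are illustrative, not the actual Inductor code):

```python
# Illustrative sketch: benchmark each candidate, drop candidates that crash
# or misbehave, and fall back to an alternative backend if none survive.
def pick_kernel(candidates, fallback, benchmark):
    timings = {}
    for kernel in candidates:
        try:
            # In practice this runs in a guarded/isolated fashion so that
            # segfaults, CUDA IMAs, or hangs don't take down the compiler.
            timings[kernel] = benchmark(kernel)
        except Exception:
            continue  # buggy kernel: filter it out rather than failing
    if not timings:
        return fallback  # e.g. an ATen-backed GEMM choice
    return min(timings, key=timings.get)
```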

[Inductor max autotune] Flexible GEMM layout autotuning

This diff introduces memory layout autotuning and relaxes which memory
layouts the CUTLASS GEMM kernels accept and write.

If CUTLASS GEMM kernels have inputs with flexible memory layouts, all
possible combinations of row-major and column-major layouts are tried
during autotuning.
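
The layout search amounts to enumerating layout combinations, roughly as in this sketch (assuming each flexible input is independently row- or column-major; not the actual Inductor code):

```python
# Hedged sketch of the layout enumeration.
import itertools

LAYOUTS = ("row_major", "column_major")

def layout_combinations(num_flexible_inputs):
    # Each combination becomes a separate autotuning candidate.
    return itertools.product(LAYOUTS, repeat=num_flexible_inputs)

for combo in layout_combinations(2):
    print(combo)  # 4 candidates: (row, row), (row, col), (col, row), (col, col)
```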

Test Plan:

* Additional unit test(s)
* CI

[Inductor Cutlass backend] Improvements to CUTLASS Kernels

* Standalone runner (CUTLASS kernels may be compiled as standalone executables for debugging & profiling)
* Retuning (after fusion, another round of autotuning is done to determine the best kernel to pick given the fusion)
* Support for auxiliary inputs (additional GEMM inputs beyond operands A, B and C)
* Support for tanh and sigmoid in EVT epilogue expressions (see the sketch after this list)
* More unit tests
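
As referenced in the list above, epilogues such as these can now be expressed in EVT (illustrative torch.compile usage, not code from this diff):

```python
# Hedged sketch: tanh and sigmoid epilogues over a matmul.
import torch

def tanh_mm(a, b):
    return torch.tanh(a @ b)

def sigmoid_mm(a, b):
    return torch.sigmoid(a @ b)

compiled_tanh = torch.compile(tanh_mm, mode="max-autotune")
compiled_sigmoid = torch.compile(sigmoid_mm, mode="max-autotune")
```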

[Inductor Cutlass backend] Prevent problems with symbolic shapes

Without this change, GEMM inputs with symbolic dimensions could lead
to runtime exceptions. This change ensures that, while the CUTLASS
backend will not be picked in such cases, at least no exceptions occur.
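
For example, a dynamically shaped GEMM like the following should now compile cleanly, with the CUTLASS backend skipped rather than raising (sketch under assumed shapes/dtypes):

```python
# Hedged sketch: symbolic (dynamic) dims no longer cause exceptions; the
# CUTLASS backend is simply not selected for such inputs.
import torch

def fn(a, b):
    return a @ b

compiled = torch.compile(fn, mode="max-autotune", dynamic=True)
a = torch.randn(128, 64, device="cuda", dtype=torch.float16)
b = torch.randn(64, 256, device="cuda", dtype=torch.float16)
out = compiled(a, b)  # other GEMM backends handle it; no exception
```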

[Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

Notable changes:

* Smaller refactorings (notably extract-method refactorings to make the code more readable)
* Several new unit tests
* Improved, more robust parsing of load index expressions when interpreting Pointwise.inner_fn nodes (see the sketch after this list)
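
As referenced in the last item above, "parsing a load index expression" here means recovering strides and an offset from a sympy index; roughly (illustrative sketch, not the actual parsing code):

```python
# Hedged sketch: recover per-dimension strides and the offset from a
# sympy load index expression such as i0*256 + i1 + 128.
import sympy

i0, i1 = sympy.symbols("i0 i1")
index = i0 * 256 + i1 + 128

strides = [index.coeff(v) for v in (i0, i1)]  # [256, 1]
offset = index.subs({i0: 0, i1: 0})           # 128
print(strides, offset)
```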

[Inductor Cutlass backend] Log CUDA compilation times

Log the time taken by CUDA compilations in the debug log.
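
The pattern is roughly the following (illustrative, not the exact Inductor code):

```python
# Hedged sketch: time an arbitrary compile step and record it at DEBUG level.
import logging
import time

log = logging.getLogger(__name__)

def timed_compile(compile_fn, source):
    start = time.time()
    result = compile_fn(source)
    log.debug("CUDA compilation took %.3f seconds", time.time() - start)
    return result
```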

[Inductor Cutlass backend] Minor improvements to unit tests

As the title says: minor improvements and changes in the unit
tests of the Inductor CUTLASS backend.

[Inductor Cutlass backend] Add support for aux inputs without requiring shared memory

A common problem when fusing epilogues is that additional (auxiliary) inputs require
shared memory. When all shared memory is already claimed by the GEMM op, as is
commonly the case for TMA ops, compilation of the fused epilogue fails.

This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without going through shared memory.

In experiments, this proved crucial for enabling nontrivial epilogue fusions.

[Inductor Cutlass backend] Support more edge cases involving strided memory layouts

This commit fixes several issues encountered with edge cases from a real-world
model involving complex strided inputs with offsets.

[Inductor Cutlass backend] Workaround for flaky tests caused by nondeterministic numerical differences in Pingpong kernels

Some tests of the Inductor CUTLASS backend exhibited flakiness, i.e. they failed nondeterministically. Even with
identical inputs, zeroed-out buffers etc., the results could differ somewhat from invocation to invocation,
which could lead to test failures due to numerical differences.

This only happened when CUTLASS' SM90 TMA warp-specialized "Pingpong" kernels were used.

This diff introduces config options that allow whitelisting/blacklisting CUTLASS kernels,
and by default filters out these Pingpong kernels.

They can easily be re-enabled via a configuration setting if desired (see the sketch below).
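
A sketch of re-enabling them; the exact option names are assumptions based on this description, not verified against the diff:

```python
# Hedged sketch: lift the default filtering of "Pingpong" kernels.
# `cutlass_op_allowlist_regex` / `cutlass_op_denylist_regex` are assumed names.
import torch._inductor.config as inductor_config

inductor_config.cuda.cutlass_op_denylist_regex = None   # default filters Pingpong kernels
inductor_config.cuda.cutlass_op_allowlist_regex = None  # no allowlist restriction
```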

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @muchulee8 @aakhundov @ColinPeppler

pytorch-bot commented Feb 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119601

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d2d5baa with merge base 86063b4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow to whitelist/blacklist Cutlass Kernels,
    and by default filters these "Pingpong" Kernels.

    They can be easily re-enabled via a configuration setting if desired.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
…-based pointwise fusions with additional tensor input"

This diff enables flexible EVT based Matmul fusions which may require one tensor input
    in addition to the Matmul operands ( A and B ) via the same input operand (C) that is
    also used for the Bias in the case of linear epilogues.

    Test Plan:
    * Additional unit tests in test/inductor/test_max_autotune.py
    * Manual inspection of the generated code
    * CI

    [Inductor max autotune] Robust Autotuning

    This change provides many changes which make autotuning robust against failures,
    e.g. Autotuning is not just for performance comparison anymore, but also to filter
    out Kernels with bugs ( Segfaults / CUDA IMA, Infinite Loops ..).

    Also, if no suitable Kernels are found, it allows to fall back to
    alternative GEMM backends ( e.g. ATen )

    [inductor max autotune] Flexible GEMM layout autotuning

    This diff introduces memory layout autotuning and flexibilizes
    memory layouts that are accepted and written by the Cutlass GEMM Kernels.

    During autotuning, if Cutlass GEMM Kernels have inputs with flexible memory
    layouts, all possible combinations of row-major or column major layouts are
    tried during autotuning.

    Test Plan:

    * Additional Unit test(s)
    * CI

    [Inductor Cutlass backend] Improvements to CUTLASS Kernels

    * Standalone runner ( CUTLASS Kernels may be compiled as standalone executables for debugging & profiling )
    * Retuning ( After Fusion, another round of Autotuning is done to determine what's the best Kernel to pick given the fusion)
    * Support for auxiliary inputs ( additional GEMM inputs beyond operands A,B and C )
    * Support for tanh and sigmoid in EVT epilogue expressions
    * More unit tests

    [Inductor Cutlass backend] prevent problems with symbolic shapes

    Without this change, GEMM inputs with symbolic dimensions could lead
    to runtime exceptions. This change makes sure that, while the CUTLASS
    backend will not be picked, at least no exceptions occur.

    [Inductor Cutlass backend] Broadcasting fixes & Code quality improvements

    Notable changes:

    * Smaller refactorings ( extract method specifically to make code more readable )
    * Several new unit tests
    * Improved / more robust parsing of load index expressions when interpreting Pointwise.inner_fn node

    [Inductor cutlass backend] Log CUDA compilation times

    Log the time that CUDA compilations take in the debug log

    [Inductor cutlass backend] Minor improvements to unit tests

    As the title says, minor improvements / changes in the unit
    tests of the Inductor CUTLASS backend

    [Inductor cutlass backend] Add support for Aux Inputs without requiring shared memory

    A common problem when fusing epilogues is that additional (auxiliary) inputs require
    shared memory. But when all shared memory is already required by the GEMM op, like
    is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

    This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared mem.

    In experiments, this proved crucial to enable nontrivial epilogue fusions.

    [Inductor cutlass backend] Support more edge cases involving strided memory layouts

    This commit fixes several issues which were encountered when dealing with edge cases
    from a real world model, where complex strided inputs with offsets were encountered.

    [Inductor cutlass backend] Workaround for flaky tests caused by Pingpong Kernel nondeterministic numerical differences

    Some tests of the Inductor Cutlass backend exhibited flakiness, e.g. were failing nondeterministically. Even when
    ensuring identical inputs, zeroed-out buffers etc. the results could differ somewhat from invocation to invocation,
    which could lead to test failures due to numerical differences.

    This only happened when Cutlass' SM90 TMA Warpspecialized Pingpong Kernels were used.

    This diff introduces config options that allow whitelisting/blacklisting CUTLASS kernels,
    and by default filters out these "pingpong" kernels.

    They can be easily re-enabled via a configuration setting if desired.
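
    The filter is essentially a pair of regular expressions applied to the
    generated CUTLASS operation names; a sketch with hypothetical config names
    (check torch._inductor.config for the exact spelling in a given version):

    ```python
    import re

    # Hypothetical config values; the pingpong schedule is excluded by default.
    cutlass_op_allowlist_regex = ""         # empty -> allow everything
    cutlass_op_denylist_regex = "pingpong"  # drop warp-specialized pingpong kernels

    def keep_op(op_name: str) -> bool:
        if cutlass_op_allowlist_regex and not re.search(cutlass_op_allowlist_regex, op_name):
            return False
        if cutlass_op_denylist_regex and re.search(cutlass_op_denylist_regex, op_name):
            return False
        return True

    # e.g. keep_op("cutlass3x_sm90_..._warpspecialized_pingpong_...") -> False
    ```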

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 muchulee8 aakhundov ColinPeppler

[ghstack-poisoned]
@kadeng kadeng requested review from desertfire and ipiszy February 27, 2024 10:45
Contributor

@jansel jansel left a comment

Please break this up into logical chunks, each including tests that are enabled (I believe the previous PR disabled all tests).

@kadeng
Contributor Author

kadeng commented Mar 11, 2024

Closing. The PR stack is being restructured here: #121492

@kadeng kadeng closed this Mar 11, 2024
@github-actions github-actions bot deleted the gh/kadeng/35/head branch April 11, 2024 02:04