[Inductor cutlass backend] Cutlass GEMM epilogue fusion phase 2 #115919

Closed

Conversation

kadeng (Contributor) commented on Dec 15, 2023

This is an experimental feature branch PR for the Inductor CUTLASS backend. Please do not review. Features from this branch will enter PyTorch through separate PRs.

In order to ensure reproducibility of benchmarking experiments, this feature branch will no longer be rebased on main. The separated PRs, of course, will be.

Already merged

Noteworthy additional features / changes

  • Switch to Cutlass 3.3.0 in third_party/cutlass in order to leverage important fixes to Cutlass EVT
  • Enabling EVT-based epilogue fusions with many strided inputs and broadcasting in real-world scenarios (tested on a Meta-internal model)
  • Many improvements to Cutlass backend stability, performance, operator coverage and max-autotune robustness
  • Parallel pre-compilation during max-autotune, often cutting the time required for Cutlass-backend kernel selection 20-fold
  • Enabling GEMM ops requiring workspace memory, including StreamK-enabled kernels with more consistent performance across problem sizes
  • Generated Cutlass kernels (including their fused epilogues) can be compiled into native standalone executables for testing, verification, debugging or profiling purposes.
  • Shape padding improvements: aggressively leveraging transpose instead of padding when possible and using aten.constant_pad_nd leads to significant speedups, including for models which are not enabled to use the Cutlass backend.
  • Logging mechanism & corresponding log parser for analysis of auto-tuning benchmark results

Calling debug_str on FusedSchedulerNode, which may happen when certain debug configuration options are enabled, results in an exception, since self.node does not exist on FusedSchedulerNode.

This is a small fix to address that.
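For illustration, a minimal sketch of the kind of guard that avoids this (attribute and accessor names here are assumptions, not necessarily the actual fix in the linked PR):

```python
# Hedged sketch: avoid touching a `self.node` attribute that FusedSchedulerNode
# does not have. `get_nodes()` / `get_name()` are assumed accessor names.
def debug_str(self) -> str:
    node = getattr(self, "node", None)
    if node is None:
        # FusedSchedulerNode wraps several sub-nodes instead of a single IR node.
        return f"{type(self).__name__}: {[n.get_name() for n in self.get_nodes()]}"
    return f"{type(self).__name__}: {node.get_name()}"
```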

ghstack-source-id: 9f2e72e769d137feca87939d9b088944dc7086d5
Pull Request resolved: #113365
…between autotuning and CUTLASSGEMMTemplate.render

ghstack-source-id: d679b215c08c2e64c4d529f75fcc9c54aaf46fd9
Pull Request resolved: #113366
…dable )

This diff introduces a new separate logging of autotuning results,
with the intention of making the results analyzable, specifically
those for the new experimental Cutlass backend.

Results are logged as text files with one JSON document corresponding to a single benchmark result per line.
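Since every line is a standalone JSON document, a minimal reader is enough to analyze the results; a sketch (file name and field names are assumptions):

```python
# Hedged sketch: read an autotuning log that stores one JSON document per line.
import json

def read_autotune_log(path: str) -> list[dict]:
    results = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                results.append(json.loads(line))
    return results

# Example (field name assumed): pick the fastest benchmarked choice.
# best = min(read_autotune_log("autotune_results.log"), key=lambda r: r["benchmark_time"])
```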

ghstack-source-id: 832bec36b804004be637101e3b2f3a4637097b22
Pull Request resolved: #113399
When using the Cutlass backend, the compilation of CUDA source files can completely dominate the time required for the benchmarking done as part of autotuning.

This change adds a multithreaded precompilation phase, which serves to pre-populate the compilation cache (both the in-memory one and a possible on-disk sccache).

It also ensures that no unnecessary compilation and benchmarking steps are performed, as was previously the case.
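Conceptually, the precompilation phase boils down to compiling all candidates in a thread pool before any benchmarking, so later compilations are cache hits; a rough sketch (not the actual Inductor code, `choice.precompile()` is a placeholder):

```python
# Hedged sketch of a multithreaded precompilation phase that warms the
# compilation cache before benchmarking. `choices` and `precompile` are
# placeholders for the real autotuner objects.
from concurrent.futures import ThreadPoolExecutor, as_completed

def precompile_all(choices, max_workers: int = 8) -> None:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(c.precompile): c for c in choices}
        for fut in as_completed(futures):
            try:
                fut.result()  # result lands in the in-memory / sccache compilation cache
            except Exception as e:
                # A failing compile should not abort precompilation of the others.
                print(f"precompilation failed for {futures[fut]}: {e}")
```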

ghstack-source-id: 64439f08398148f92108eeff2e62766bc7a841c6
Pull Request resolved: #113558
Cutlass backend GEMMs are comparatively expensive to compile, so they should only be applied to sufficiently large GEMMs. This small diff introduces a new torch._inductor.config option called "cuda.cutlass_backend_min_gemm_size", which sets a size threshold below which the Cutlass backend will not be considered for a GEMM problem.
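For illustration, the option would be set like other Inductor config knobs; a sketch (the threshold value and the exact notion of "size" are assumptions based on the description above):

```python
# Hedged sketch: only let the (compile-heavy) Cutlass backend compete for large GEMMs.
import torch
import torch._inductor.config as inductor_config

inductor_config.max_autotune_gemm_backends = "CUTLASS,ATen"
inductor_config.cuda.cutlass_backend_min_gemm_size = 32 * 32 * 32  # assumed semantics

@torch.compile(mode="max-autotune")
def mm(a, b):
    return a @ b
```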

ghstack-source-id: eaf2410c1c40ab4ebcd9590b8595013b1454a6b0
Pull Request resolved: #113569
…ce issues

We observed some Cutlass GEMM ops with StreamK enabled to take forever during autotuning. Disabling them for now to unblock.

We should investigate this later.

ghstack-source-id: f5ba1e72b39ee4d65084a8317e8597294d5b85cd
Pull Request resolved: #113570
This adds support for torch.bmm and torch.baddbmm operations through the Cutlass backend. A comparatively simple addition.

ghstack-source-id: ae7355631ccf4e8e959ef1bffb0fd2dd7784e37d
Pull Request resolved: #113890
…ons with additional tensor input

This diff enables flexible EVT-based matmul fusions which may require one tensor input in addition to the matmul operands (A and B).

Test Plan:
 * Additional unit tests in test/inductor/test_max_autotune.py
 * Manual inspection of the generated code
 * CI

ghstack-source-id: d6a714e399bed3bf3a8b44a0d827b985ac293953
Pull Request resolved: #113959
…benchmark

If one of several choices within max-autotune fails with a compilation error or runtime error, the entire model compilation fails.

This changes the behavior such that the error is logged, but model compilation continues as long as valid choices remain.
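The change essentially amounts to catching per-choice failures and only giving up when nothing valid is left; a rough sketch of that pattern (not the actual Inductor code):

```python
# Hedged sketch: benchmark all autotune choices, log failures, and only raise
# if no choice could be benchmarked successfully.
import logging

log = logging.getLogger(__name__)

def benchmark_choices(choices, benchmark):
    timings = {}
    for choice in choices:
        try:
            timings[choice] = benchmark(choice)
        except Exception as e:  # compilation or runtime error for this candidate
            log.warning("Autotune choice %s failed: %s", choice, e)
    if not timings:
        raise RuntimeError("All autotuning choices failed")
    return min(timings, key=timings.get)
```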

ghstack-source-id: 691b9f56126adb20927eba6e34543514d1829716
Pull Request resolved: #113891
So far we have relied on the "generator.py" from
third_party/cutlass/python/cutlass_library/ to generate Cutlass op
configurations to be tried in autotuning. These op configs are not
sufficient to ensure good performance and dtype coverage in all cases.

This diff introduces an extended variant of that generator.py, provided both as a source file and as a diff (so that it may be applied to future versions of cutlass_library.generator).

Test Plan:
 * CI
 * python test/inductor/test_max_autotune.py

ghstack-source-id: 0502aad2c53740fe00eeea24f06bc6116339b4c8
Pull Request resolved: #113932
So far, when only the Cutlass GEMM backend was enabled and the config.cuda.cutlass_only_evt_capable_ops option was set, it could happen that for some input combinations there was no Cutlass op that could handle them, since, for example, there are no EVT-capable fp32 GEMM ops when both operands A and B are row-major.

This diff renames the mentioned config option to config.cuda.cutlass_prefer_evt_capable_ops and changes the behavior to fall back to non-EVT-capable ops when not even a single EVT-capable one can be found.

If no GEMM op of the selected backend can be found at all, the ATen backend is used as a fallback.
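Summarized as pseudocode-like Python, the selection and fallback logic could look like this (a sketch; helper names are assumptions):

```python
# Hedged sketch of the op-selection fallback described above:
# prefer EVT-capable Cutlass ops, fall back to non-EVT-capable ones,
# and finally to the ATen backend if no Cutlass op fits at all.
def select_gemm_ops(cutlass_ops, is_evt_capable, prefer_evt_capable: bool):
    if prefer_evt_capable:
        evt_ops = [op for op in cutlass_ops if is_evt_capable(op)]
        if evt_ops:
            return evt_ops
    if cutlass_ops:
        return cutlass_ops
    return ["aten_fallback"]  # placeholder for falling back to the ATen choice
```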

ghstack-source-id: 284333b3f1e3e4575e8d6f8a7d25c547536d99c8
Pull Request resolved: #114075
This diff introduces memory layout autotuning and makes the memory layouts that are accepted and written by the Cutlass GEMM kernels more flexible.

During autotuning, if Cutlass GEMM kernels have inputs with flexible layouts, all possible combinations of row-major and column-major layouts are tried.

Note: Flexible input layouts occur in practice in certain internal production models, which made these changes necessary.
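Enumerating the candidate layouts is essentially a Cartesian product over the flexible inputs; a small illustrative sketch (not the actual Inductor code):

```python
# Hedged sketch: enumerate all row-major / column-major combinations for GEMM
# inputs whose layout is still flexible, as candidates for autotuning.
from itertools import product

def layout_candidates(num_flexible_inputs: int):
    # Each flexible input can independently be row-major or column-major.
    return list(product(["row_major", "column_major"], repeat=num_flexible_inputs))

# e.g. for two flexible inputs: 4 candidate layout combinations to benchmark.
print(layout_candidates(2))
```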

Test Plan:

 * Additional Unit test(s) (more tbd)
 * CI

ghstack-source-id: 5dcfc8eb1712ec40672e9cf2b1a878cae1ee2311
Pull Request resolved: #114319
…aux loads and activations

A simple change adding support for Sigmoid and Tanh activations, plus several improvements to EVT codegen, specifically to make broadcasting of aux loads possible.

Test Plan:

 * CI
 * Additional unit test

ghstack-source-id: 84192a220b722127eca5fec152218d70985d52e2
Pull Request resolved: #114606
…ass GEMMs

For debugging and code validation purposes, it is often helpful to be able to run the generated GEMM kernels in a standalone manner, without Python and PyTorch.

This diff adds a bit of code to the GEMM kernels which allows compiling and running them as standalone executables, easing debugging, profiling and memory-checking with CUDA Toolkit based tools.

ghstack-source-id: 8129003946bf6807a1cebd2c6d0471e5424f7dfa
Pull Request resolved: #115072
Cutlass 3.x kernels take an optional hardware info struct as argument, which tells them how many SMs (CUDA streaming multiprocessors) are available per device. This small diff provides this info to ensure better kernel params are selected and no re-querying has to happen at runtime.
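For reference, the SM count in question can be obtained once on the host from the CUDA device properties, e.g.:

```python
# Sketch: query the number of streaming multiprocessors for the current device,
# so it can be passed to the kernel's hardware-info argument instead of being
# re-queried at runtime.
import torch

device = torch.cuda.current_device()
sm_count = torch.cuda.get_device_properties(device).multi_processor_count
print(f"SMs available on device {device}: {sm_count}")
```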

ghstack-source-id: 2aac874a6e80ee1da9235134c323787dd2e2e1d3
Pull Request resolved: #115174
… auxiliary inputs

So far, auxiliary inputs which required row or column broadcast could run into CUDA errors due to conflicting memory accesses. This small fix resolves that by using the right broadcasting operator in such cases.

ghstack-source-id: 57c8953307fa62d609c44253a078219a9b55f52d
Pull Request resolved: #115270
… sizes

The Inductor Cutlass backend so far did not support GEMM ops which require CUDA workspace memory to run. This diff enables non-zero workspace sizes and at the same time enables support for GEMMs using the StreamK tile scheduler, which requires non-zero workspace.
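On the caller side this roughly means querying the required workspace bytes and allocating a scratch buffer for the kernel; a sketch (the `workspace_size_bytes` attribute is a placeholder, not a real API):

```python
# Hedged sketch: allocate a CUDA workspace buffer of the size a GEMM op reports
# it needs (e.g. for the StreamK tile scheduler). `op.workspace_size_bytes` is
# a placeholder for however the generated kernel exposes this number.
import torch

def allocate_workspace(op, device="cuda"):
    size = getattr(op, "workspace_size_bytes", 0)
    if size == 0:
        return None
    return torch.empty(size, dtype=torch.uint8, device=device)
```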

ghstack-source-id: 6067844d5894f0befb094d251f33d3a0778b96af
Pull Request resolved: #114687
Cutlass 3.3 offers the following improvements:

  • Adds support for mixed-precision GEMMs on Hopper and Ampere
  • Adds support for < 16B aligned GEMMs on Hopper
  • Enhancements to EVT
  • Enhancements to the Python interface
  • Enhancements to sub-byte type handling in CuTe
  • Several other bug fixes and performance improvements

Minor doc update.
Test Plan:

 * CI (ciflow/trunk, ciflow/inductor)
 * pytest test/inductor/test_max_autotune.py

ghstack-source-id: 46b0d3c156ef707f7e75a25f99d3995e7383c2be
Pull Request resolved: #112861
…ing in subprocesses

Makes autotuning in subprocesses more robust, specifically against long-running or crashing functions being benchmarked, which could otherwise completely corrupt the CUDA context of the entire process.

This diff introduces changes to ensure that precompilation works well with autotuning in subprocesses, and ensures that autotuning subprocesses have robust timeouts after which they will be killed.
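One way to picture the timeout handling: run each benchmark in its own process and kill it if it does not finish in time, so a hung or crashing candidate cannot corrupt the parent's CUDA context. A rough sketch (not the actual implementation):

```python
# Hedged sketch: benchmark a candidate in a subprocess with a hard timeout.
# A hung or crashing candidate only costs its own process, not the parent's CUDA context.
import multiprocessing as mp

def _bench_entry(fn, queue):
    queue.put(fn())

def benchmark_in_subprocess(fn, timeout_s: float = 60.0):
    ctx = mp.get_context("spawn")  # avoid inheriting a CUDA context via fork; fn must be picklable
    queue = ctx.Queue()
    proc = ctx.Process(target=_bench_entry, args=(fn, queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        proc.kill()  # hard timeout: the candidate is treated as failed
        proc.join()
        return None
    return queue.get() if not queue.empty() else None
```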

ghstack-source-id: 2b5117cbd10b35e27a54a9781dd2ba4510b2842f
Pull Request resolved: #115654
…ge cases

There are some edge cases concerning the broadcasting of auxiliary inputs which need special treatment. The Cutlass operators that broadcast auxiliary inputs require these inputs to be contiguous in the non-broadcast dimension.

Most importantly, though, Pointwise nodes can implicitly reinterpret the memory layout of the buffers they read from. In order to fuse an additionally loaded buffer as a "Bias" argument in a reliable way, it is therefore necessary to parse a memory layout (strides, offset and mapping to GEMM output dims) out of that information.
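As a small illustration of the contiguity requirement (a sketch under assumed 2-D stride conventions, not the actual fusion code):

```python
# Hedged sketch: check whether a 2-D auxiliary input that is broadcast along one
# dimension is contiguous in the other (non-broadcast) dimension, as the Cutlass
# broadcast operators require. Strides are in elements; illustrative only.
def aux_broadcast_ok(strides: tuple[int, int], broadcast_dim: int) -> bool:
    other_dim = 1 - broadcast_dim
    # The broadcast dimension must not advance memory; the other one must be dense.
    return strides[broadcast_dim] == 0 and strides[other_dim] == 1

# Row broadcast of a (1, N) bias over an (M, N) output: strides (0, 1) -> OK.
assert aux_broadcast_ok((0, 1), broadcast_dim=0)
```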

Added several tests to cover cases related to broadcasting of bias / aux inputs.

ghstack-source-id: eb1903d5ab20f88e8993086a3e2fb19e30316891
Pull Request resolved: #115655
…ions

Improved the coverage of doc comments and type annotations. No functional changes.

ghstack-source-id: e2a340159b7e630ab6b0bc665c4c589c7dde0664
Pull Request resolved: #115813
Log the time that CUDA compilations take, as requested by the Cutlass team in order to justify efforts to improve compilation times.

ghstack-source-id: cde7e230fa69adb4b6cf3bab135ef2fae9ba9c8a
Pull Request resolved: #115814
When the workspace size is changed via retuning, the corresponding buffer allocation in the wrapper needs to use the updated size. This did not work properly before; this diff fixes that.

ghstack-source-id: 3b1533af8742c28575a4e5c13741883ad49bcd60
Pull Request resolved: #115877
pytorch-bot commented on Dec 15, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/115919

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 12 Unrelated Failures

As of commit fb5af8e with merge base afe6d27: 5 jobs newly failed; the remaining failing jobs were likely due to flakiness present on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.


This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

…ng shared memory

A common problem when fusing epilogues is that additional (auxiliary) inputs require shared memory. But when all shared memory is already required by the GEMM op, as is commonly the case for TMA ops, the compilation of the fused epilogue will fail.

This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared memory.
kadeng force-pushed the kadeng/inductor-cutlass-backend-phase2-pre-diff-collapse-ok branch from f4a74e3 to da7cf89 on December 26, 2023 20:42
…tain modified functions

In order to bring down the number of lines in this PR considerably
… decisions, reduce CPU overhead

These are optimizations based on partial profiling results. Retuning and fusion decisions can be improved such that fewer fusion errors are encountered and more valid fusions result.

CPU overhead can be reduced by making the validation of arguments to GEMM kernels optional, and by ensuring that buffers are not unnecessarily allocated.
kadeng force-pushed the kadeng/inductor-cutlass-backend-phase2-pre-diff-collapse-ok branch 2 times, most recently from 029d302 to 5b27c74 on December 30, 2023 22:26
kadeng force-pushed the kadeng/inductor-cutlass-backend-phase2-pre-diff-collapse-ok branch from 5b27c74 to 8a92e2a on December 30, 2023 22:28
This is a small change which adds logging info about the CUDA architecture level (SM80, SM90 etc.) and increases the default precompilation timeout so that precompilation works properly and speeds up autotuning even when a large number of kernels is being chosen from.
kadeng force-pushed the kadeng/inductor-cutlass-backend-phase2-pre-diff-collapse-ok branch from 587be87 to fb5af8e on January 7, 2024 12:18
kadeng (Contributor, Author) commented on Jan 23, 2024

This PR is frozen in order to ensure reproducibility of experiments. Continued in #117745

kadeng closed this on Jan 30, 2024
The github-actions bot deleted the kadeng/inductor-cutlass-backend-phase2-pre-diff-collapse-ok branch on March 1, 2024 01:53