[Inductor cutlass backend] Cutlass GEMM epilogue fusion phase 2 #115919
Conversation
Calling debug_str on FusedSchedulerNode, which may happen when certain debug configuration options are enabled, results in an Exception since self.node does not exist on FusedSchedulerNode. This is a small fix to address that. ghstack-source-id: 9f2e72e769d137feca87939d9b088944dc7086d5 Pull Request resolved: #113365
…between autotuning and CUTLASSGEMMTemplate.render ghstack-source-id: d679b215c08c2e64c4d529f75fcc9c54aaf46fd9 Pull Request resolved: #113366
…dable ) This diff introduces a new separate logging of autotuning results, with the intention of making the results analyzable, specifically those for the new experimental Cutlass backend. Results are logged as text files with one JSON document corresponding to a single benchmark result per line. ghstack-source-id: 832bec36b804004be637101e3b2f3a4637097b22 Pull Request resolved: #113399
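The one-JSON-document-per-line format described above can be consumed with a few lines of standard Python. A minimal sketch (the function name and log path are hypothetical, not part of the PR):

```python
import json

def load_autotune_results(path):
    """Parse an autotuning log where each non-empty line is one JSON benchmark result."""
    results = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                results.append(json.loads(line))
    return results
```

This layout (JSON Lines) keeps the log appendable from concurrent benchmark runs while remaining trivially loadable into analysis tools.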
When using the Cutlass backend, the compilation of CUDA source files can totally dominate the runtime required for the benchmarking done as part of autotuning. This change adds a multithreaded precompilation phase, which serves to pre-populate the compilation cache (both in-memory, and a possible on-disk sccache). It also ensures that no unnecessary compilation and benchmarking steps are performed, which was previously the case. ghstack-source-id: 64439f08398148f92108eeff2e62766bc7a841c6 Pull Request resolved: #113558
Cutlass backend GEMMs are comparatively expensive to compile. So they should only be applied to sufficiently large GEMMs. This small diff introduces a new torch._inductor.config option called "cuda.cutlass_backend_min_gemm_size" which introduces a threshold for the size of GEMM problems that the Cutlass backend will be considered for. ghstack-source-id: eaf2410c1c40ab4ebcd9590b8595013b1454a6b0 Pull Request resolved: #113569
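The threshold logic amounts to a simple size gate. A sketch of the heuristic; the exact definition of "GEMM problem size" inside Inductor may differ (m*n*k and the default value here are assumptions for illustration):

```python
# Illustrative stand-in for the cuda.cutlass_backend_min_gemm_size config option.
CUTLASS_BACKEND_MIN_GEMM_SIZE = 32 * 32 * 32

def consider_cutlass_backend(m, n, k, min_size=CUTLASS_BACKEND_MIN_GEMM_SIZE):
    """Only consider the (expensive-to-compile) Cutlass backend for large GEMMs."""
    return m * n * k >= min_size
```

Small problems fall through to the cheaper backends, so the high Cutlass compile cost is only paid where a large GEMM can amortize it.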
…ce issues We observed some Cutlass GEMM ops with StreamK enabled taking extremely long during autotuning. Disabling them for now to unblock; we should investigate this later. ghstack-source-id: f5ba1e72b39ee4d65084a8317e8597294d5b85cd Pull Request resolved: #113570
This adds support for torch.bmm and torch.baddbmm operations through Cutlass backend. A comparatively simple addition. ghstack-source-id: ae7355631ccf4e8e959ef1bffb0fd2dd7784e37d Pull Request resolved: #113890
…ons with additional tensor input This diff enables flexible EVT based Matmul fusions which may require one tensor input in addition to the Matmul operands ( A and B ). Test Plan: * Additional unit tests in test/inductor/test_max_autotune.py * Manual inspection of the generated code * CI ghstack-source-id: d6a714e399bed3bf3a8b44a0d827b985ac293953 Pull Request resolved: #113959
…benchmark If one of several choices within max autotune fails with a compilation error or runtime error, the entire model compilation fails. This changes the behavior such that an error is logged, but the model compilation may continue as long as valid choices remain. ghstack-source-id: 691b9f56126adb20927eba6e34543514d1829716 Pull Request resolved: #113891
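The fault-tolerant selection loop described above can be sketched as follows (names hypothetical; the real implementation lives in Inductor's autotuning machinery):

```python
import logging

def pick_best_choice(choices, benchmark):
    """Benchmark each choice; log failures instead of aborting the whole compile."""
    timings = {}
    for choice in choices:
        try:
            timings[choice] = benchmark(choice)
        except Exception as exc:
            # One broken kernel no longer kills model compilation.
            logging.warning("choice %r failed during autotuning: %s", choice, exc)
    if not timings:
        raise RuntimeError("no valid autotune choices remain")
    return min(timings, key=timings.get)
```

Only when every choice fails is an error propagated, matching the "as long as valid choices remain" behavior.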
So far we have relied on the "generator.py" from third_party/cutlass/python/cutlass_library/ to generate Cutlass op configurations to be tried in autotuning. These op configs are not sufficient to ensure good performance and dtype coverage in all cases. This diff introduces an extended variant of that generator.py, provided both as a source file and as a diff ( so that it may be applied to future versions of cutlass_library.generator ) Test Plan: * CI * python test/inductor/test_max_autotune.py ghstack-source-id: 0502aad2c53740fe00eeea24f06bc6116339b4c8 Pull Request resolved: #113932
So far, when only the Cutlass GEMM backend was enabled together with the config.cuda.cutlass_only_evt_capable_ops option, some input combinations had no Cutlass ops that could handle them; for example, there are no EVT-capable fp32 GEMM ops when both operands A and B are row-major. This diff renames the option to config.cuda.cutlass_prefer_evt_capable_ops and changes the behavior to fall back to non-EVT-capable ops when not even a single EVT-capable one can be found. If no GEMM op of the selected backend can be found at all, the ATen backend is used as a fallback. ghstack-source-id: 284333b3f1e3e4575e8d6f8a7d25c547536d99c8 Pull Request resolved: #114075
This diff introduces memory layout autotuning and relaxes the memory layouts that are accepted and written by the Cutlass GEMM kernels. During autotuning, if Cutlass GEMM kernels have inputs with flexible layouts, all possible combinations of row-major and column-major layouts are tried. Note: flexible input layouts are practically relevant in certain internal production models, which made these changes necessary. Test Plan: * Additional unit test(s) (more tbd) * CI ghstack-source-id: 5dcfc8eb1712ec40672e9cf2b1a878cae1ee2311 Pull Request resolved: #114319
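The "all possible combinations" enumeration is a cartesian product over the flexible inputs. A small sketch (function name and layout encoding are assumptions, not the PR's actual API):

```python
from itertools import product

def layout_combinations(input_layouts):
    """Enumerate row-/column-major choices for every input marked flexible.

    Fixed layouts contribute exactly one option; flexible ones contribute two,
    so the number of candidates doubles per flexible input.
    """
    options = [("row", "col") if layout == "flexible" else (layout,)
               for layout in input_layouts]
    return list(product(*options))
```

Each resulting tuple is one layout assignment to benchmark during autotuning.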
…aux loads and activations Simple change adding support for Sigmoid and Tanh activations. Several improvements to EVT codegen, specifically to make broadcasting of aux loads possible. Test Plan: * CI * Additional unit test ghstack-source-id: 84192a220b722127eca5fec152218d70985d52e2 Pull Request resolved: #114606
…ass GEMMs For debugging and code validation purposes, it's often helpful to be able to run the generated GEMM kernels in a standalone manner, without Python and PyTorch. This diff adds a bit of code to GEMM kernels which allows compiling and running them as standalone executables, easing debugging, profiling and memory-checking with CUDA Toolkit based tools. ghstack-source-id: 8129003946bf6807a1cebd2c6d0471e5424f7dfa Pull Request resolved: #115072
Cutlass 3.x Kernels take an optional Hardware info struct as argument, which tells them how many SMs ( CUDA Streaming Multiprocessors) are available per Device. This small diff provides this info to ensure better Kernel params are selected, and no re-querying has to happen at runtime. ghstack-source-id: 2aac874a6e80ee1da9235134c323787dd2e2e1d3 Pull Request resolved: #115174
… auxiliary inputs So far, it could happen that auxiliary inputs which required row or column broadcast ran into CUDA errors due to conflicting memory accesses. This small fix resolves that by using the right broadcasting operator in such cases. ghstack-source-id: 57c8953307fa62d609c44253a078219a9b55f52d Pull Request resolved: #115270
… sizes The Inductor Cutlass backend so far did not support GEMM ops which require CUDA workspace memory to run. This diff enables non-zero workspace sizes, and at the same time enables support for GEMMs using the StreamK tile scheduler, which requires a non-zero workspace. ghstack-source-id: 6067844d5894f0befb094d251f33d3a0778b96af Pull Request resolved: #114687
Cutlass 3.3 offers the following improvements:
* Adds support for mixed precision GEMMs on Hopper and Ampere
* Adds support for < 16B aligned GEMMs on Hopper
* Enhancements to EVT
* Enhancements to the Python interface
* Enhancements to sub-byte type handling in CuTe
* Several other bug fixes and performance improvements
* Minor doc update
Test Plan: CI ( ciflow/trunk, ciflow/inductor ) pytest test/inductor/test_max_autotune.py ghstack-source-id: 46b0d3c156ef707f7e75a25f99d3995e7383c2be Pull Request resolved: #112861
…ing in subprocesses Makes autotuning in subprocesses more robust, specifically against long running or crashing functions being benchmarked, which could also completely corrupt the CUDA Context of the entire process. This diff introduces changes to ensure that precompilation works well with autotuning in subprocesses, and ensures that autotuning subprocesses have robust timeouts after which they will be killed. ghstack-source-id: 2b5117cbd10b35e27a54a9781dd2ba4510b2842f Pull Request resolved: #115654
…ge cases There are some edge cases concerning broadcasting of auxiliary inputs which need special treatment. The Cutlass operators that broadcast auxiliary inputs require these inputs to be contiguous in the non-broadcasted dimension. Most importantly, though, Pointwise nodes are able to implicitly reinterpret the memory layout of the buffers they read from. In order to reliably fuse an additionally loaded buffer as the "Bias" argument, it is therefore necessary to parse a memory layout (strides, offset and mapping to GEMM output dims) out of that information. Added several tests to cover cases related to broadcasting of bias / aux inputs. ghstack-source-id: eb1903d5ab20f88e8993086a3e2fb19e30316891 Pull Request resolved: #115655
…ions Improved the coverage of doc comments and type annotations. No functional changes. ghstack-source-id: e2a340159b7e630ab6b0bc665c4c589c7dde0664 Pull Request resolved: #115813
Log the time that CUDA compilations take, as requested by Cutlass team in order to justify efforts to improve compilation times. ghstack-source-id: cde7e230fa69adb4b6cf3bab135ef2fae9ba9c8a Pull Request resolved: #115814
When the workspace size is changed via retuning, the corresponding buffer allocation in the wrapper needs to use the updated size. This did not work properly before; this diff fixes that. ghstack-source-id: 3b1533af8742c28575a4e5c13741883ad49bcd60 Pull Request resolved: #115877
…ng shared memory A common problem when fusing epilogues is that additional (auxiliary) inputs require shared memory. But when all shared memory is already used by the GEMM op, as is commonly the case for TMA ops, the compilation of the fused epilogue will fail. This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared memory.
…tain modified functions In order to bring down the number of lines in this PR considerably
…emory input/output layouts
… decisions, reduce CPU overhead These are optimizations based on partial profiling results. Retuning and fusion decisions can be improved, such that fewer fusion errors are encountered and more valid fusions result. CPU overhead can be reduced by making validation of arguments to GEMM kernels optional, and by ensuring that buffers are not unnecessarily allocated.
This is a small change which adds logging info about the CUDA architecture level (SM80, SM90 etc.) and increases a default precompilation timeout, so that precompilation works properly and speeds up autotuning even when a large number of kernels is being chosen from.
This PR is frozen in order to ensure reproducibility of experiments. Continued in #117745
This is an experimental feature branch PR for the Inductor CUTLASS backend. Please do not review. Features from this branch will enter PyTorch through separate PRs.
In order to ensure reproducibility of benchmarking experiments, this feature branch will no longer be rebased on main. The separated PRs will, of course, continue to be rebased.
Already merged
Noteworthy additional features / changes
of padding when possible and using aten.constant_cat_nd leads to significant speedup, including for models which are not enabled to use the Cutlass backend.