[Inductor cutlass backend] Cutlass GEMM epilogue fusion phase 2 #115919
Conversation
Calling debug_str on FusedSchedulerNode, which may happen when certain debug configuration options are enabled, results in an Exception since self.node does not exist on FusedSchedulerNode. This is a small fix to address that. ghstack-source-id: 9f2e72e769d137feca87939d9b088944dc7086d5 Pull Request resolved: #113365
…between autotuning and CUTLASSGEMMTemplate.render ghstack-source-id: d679b215c08c2e64c4d529f75fcc9c54aaf46fd9 Pull Request resolved: #113366
…dable ) This diff introduces a new separate logging of autotuning results, with the intention of making the results analyzable, specifically those for the new experimental Cutlass backend. Results are logged as text files with one JSON document corresponding to a single benchmark result per line. ghstack-source-id: 832bec36b804004be637101e3b2f3a4637097b22 Pull Request resolved: #113399
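The one-JSON-document-per-line format described above can be consumed with a few lines of standard Python. A minimal sketch (the function name and log path are hypothetical, not part of the PR):

```python
import json

def load_autotune_results(path):
    """Parse an autotuning log where each non-empty line is one JSON benchmark result."""
    results = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                results.append(json.loads(line))
    return results
```

This layout (JSON Lines) keeps the log appendable from concurrent benchmark runs while remaining trivially loadable into analysis tools.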
When using the Cutlass backend, the compilation of CUDA source files can totally dominate the runtime required for the benchmarking done as part of autotuning. This change adds a multithreaded precompilation phase, which serves to pre-populate the compilation cache (both in-memory, and a possible on-disk sccache). It also ensures that no unnecessary compilation and benchmarking steps are performed, which was previously the case. ghstack-source-id: 64439f08398148f92108eeff2e62766bc7a841c6 Pull Request resolved: #113558
Cutlass backend GEMMs are comparatively expensive to compile. So they should only be applied to sufficiently large GEMMs. This small diff introduces a new torch._inductor.config option called "cuda.cutlass_backend_min_gemm_size" which introduces a threshold for the size of GEMM problems that the Cutlass backend will be considered for. ghstack-source-id: eaf2410c1c40ab4ebcd9590b8595013b1454a6b0 Pull Request resolved: #113569
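The threshold logic amounts to a simple size gate. A sketch of the heuristic; the exact definition of "GEMM problem size" inside Inductor may differ (m*n*k and the default value here are assumptions for illustration):

```python
# Illustrative stand-in for the cuda.cutlass_backend_min_gemm_size config option.
CUTLASS_BACKEND_MIN_GEMM_SIZE = 32 * 32 * 32

def consider_cutlass_backend(m, n, k, min_size=CUTLASS_BACKEND_MIN_GEMM_SIZE):
    """Only consider the (expensive-to-compile) Cutlass backend for large GEMMs."""
    return m * n * k >= min_size
```

Small problems fall through to the cheaper backends, so the high Cutlass compile cost is only paid where a large GEMM can amortize it.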
…ce issues We observed some Cutlass GEMM ops with StreamK enabled taking extremely long during autotuning. Disabling them for now to unblock; we should investigate this later. ghstack-source-id: f5ba1e72b39ee4d65084a8317e8597294d5b85cd Pull Request resolved: #113570
This adds support for torch.bmm and torch.baddbmm operations through Cutlass backend. A comparatively simple addition. ghstack-source-id: ae7355631ccf4e8e959ef1bffb0fd2dd7784e37d Pull Request resolved: #113890
…ons with additional tensor input This diff enables flexible EVT based Matmul fusions which may require one tensor input in addition to the Matmul operands ( A and B ). Test Plan: * Additional unit tests in test/inductor/test_max_autotune.py * Manual inspection of the generated code * CI ghstack-source-id: d6a714e399bed3bf3a8b44a0d827b985ac293953 Pull Request resolved: #113959
…benchmark If one of several choices within max autotune fails with a compilation error or runtime error, the entire model compilation fails. This changes the behavior such that an error is logged, but the model compilation may continue as long as valid choices remain. ghstack-source-id: 691b9f56126adb20927eba6e34543514d1829716 Pull Request resolved: #113891
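The fault-tolerant selection loop described above can be sketched as follows (names hypothetical; the real implementation lives in Inductor's autotuning machinery):

```python
import logging

def pick_best_choice(choices, benchmark):
    """Benchmark each choice; log failures instead of aborting the whole compile."""
    timings = {}
    for choice in choices:
        try:
            timings[choice] = benchmark(choice)
        except Exception as exc:
            # One broken kernel no longer kills model compilation.
            logging.warning("choice %r failed during autotuning: %s", choice, exc)
    if not timings:
        raise RuntimeError("no valid autotune choices remain")
    return min(timings, key=timings.get)
```

Only when every choice fails is an error propagated, matching the "as long as valid choices remain" behavior.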
So far we have relied on the "generator.py" from third_party/cutlass/python/cutlass_library/ to generate Cutlass op configurations to be tried in autotuning. These op configs are not sufficient to ensure good performance and dtype coverage in all cases. This diff introduces an extended variant of that generator.py, provided both as a source file and as a diff ( so that it may be applied to future versions of cutlass_library.generator ) Test Plan: * CI * python test/inductor/test_max_autotune.py ghstack-source-id: 0502aad2c53740fe00eeea24f06bc6116339b4c8 Pull Request resolved: #113932
So far, when only the Cutlass GEMM backend was enabled together with the config.cuda.cutlass_only_evt_capable_ops option, some input combinations had no Cutlass ops that could handle them; for example, there are no EVT-capable fp32 GEMM ops when both operands A and B are row-major. This diff renames the option to config.cuda.cutlass_prefer_evt_capable_ops and changes the behavior to fall back to non-EVT-capable ops when not even a single EVT-capable one can be found. If no GEMM op of the selected backend can be found at all, the ATen backend is used as a fallback. ghstack-source-id: 284333b3f1e3e4575e8d6f8a7d25c547536d99c8 Pull Request resolved: #114075
This diff introduces memory layout autotuning and relaxes the memory layouts that are accepted and written by the Cutlass GEMM kernels. During autotuning, if Cutlass GEMM kernels have inputs with flexible layouts, all possible combinations of row-major and column-major layouts are tried. Note: flexible input layouts are practically relevant in certain internal production models, which made these changes necessary. Test Plan: * Additional unit test(s) (more tbd) * CI ghstack-source-id: 5dcfc8eb1712ec40672e9cf2b1a878cae1ee2311 Pull Request resolved: #114319
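The "all possible combinations" enumeration is a cartesian product over the flexible inputs. A small sketch (function name and layout encoding are assumptions, not the PR's actual API):

```python
from itertools import product

def layout_combinations(input_layouts):
    """Enumerate row-/column-major choices for every input marked flexible.

    Fixed layouts contribute exactly one option; flexible ones contribute two,
    so the number of candidates doubles per flexible input.
    """
    options = [("row", "col") if layout == "flexible" else (layout,)
               for layout in input_layouts]
    return list(product(*options))
```

Each resulting tuple is one layout assignment to benchmark during autotuning.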
…aux loads and activations Simple change adding support for Sigmoid and Tanh activations. Several improvements to EVT codegen, specifically to make broadcasting of aux loads possible. Test Plan: * CI * Additional unit test ghstack-source-id: 84192a220b722127eca5fec152218d70985d52e2 Pull Request resolved: #114606
…ass GEMMs For debugging and code validation purposes, it's often helpful to be able to run the generated GEMM kernels in a standalone manner, without Python and PyTorch. This diff adds a bit of code to GEMM kernels which allows compiling and running them as standalone executables, easing debugging, profiling and memory-checking with CUDA Toolkit based tools. ghstack-source-id: 8129003946bf6807a1cebd2c6d0471e5424f7dfa Pull Request resolved: #115072
Cutlass 3.x Kernels take an optional Hardware info struct as argument, which tells them how many SMs ( CUDA Streaming Multiprocessors) are available per Device. This small diff provides this info to ensure better Kernel params are selected, and no re-querying has to happen at runtime. ghstack-source-id: 2aac874a6e80ee1da9235134c323787dd2e2e1d3 Pull Request resolved: #115174
… auxiliary inputs So far, it could happen that auxiliary inputs which required row or column broadcast ran into CUDA errors due to conflicting memory accesses. This small fix resolves that by using the right broadcasting operator in such cases. ghstack-source-id: 57c8953307fa62d609c44253a078219a9b55f52d Pull Request resolved: #115270
… sizes The Inductor Cutlass backend so far did not support GEMM ops which require CUDA workspace memory to run. This diff enables non-zero workspace sizes, and at the same time enables support for GEMMs using the StreamK tile scheduler, which requires a non-zero workspace. ghstack-source-id: 6067844d5894f0befb094d251f33d3a0778b96af Pull Request resolved: #114687
Cutlass 3.3 offers the following improvements:
* Adds support for mixed precision GEMMs on Hopper and Ampere
* Adds support for < 16B aligned GEMMs on Hopper
* Enhancements to EVT
* Enhancements to the Python interface
* Enhancements to sub-byte type handling in CuTe
* Several other bug fixes and performance improvements
* Minor doc update
Test Plan: CI ( ciflow/trunk, ciflow/inductor ) pytest test/inductor/test_max_autotune.py ghstack-source-id: 46b0d3c156ef707f7e75a25f99d3995e7383c2be Pull Request resolved: #112861
…ing in subprocesses Makes autotuning in subprocesses more robust, specifically against long running or crashing functions being benchmarked, which could also completely corrupt the CUDA Context of the entire process. This diff introduces changes to ensure that precompilation works well with autotuning in subprocesses, and ensures that autotuning subprocesses have robust timeouts after which they will be killed. ghstack-source-id: 2b5117cbd10b35e27a54a9781dd2ba4510b2842f Pull Request resolved: #115654
…ge cases There are some edge cases concerning broadcasting of auxiliary inputs which need special treatment. The Cutlass operators that broadcast auxiliary inputs require these inputs to be contiguous in the non-broadcasted dimension. Most importantly, though, Pointwise nodes are able to implicitly reinterpret the memory layout of the buffers they read from. In order to reliably fuse an additionally loaded buffer as the "Bias" argument, it is therefore necessary to parse a memory layout (strides, offset and mapping to GEMM output dims) out of that information. Added several tests to cover cases related to broadcasting of bias / aux inputs. ghstack-source-id: eb1903d5ab20f88e8993086a3e2fb19e30316891 Pull Request resolved: #115655
…ions Improved the coverage of doc comments and type annotations. No functional changes. ghstack-source-id: e2a340159b7e630ab6b0bc665c4c589c7dde0664 Pull Request resolved: #115813
Log the time that CUDA compilations take, as requested by Cutlass team in order to justify efforts to improve compilation times. ghstack-source-id: cde7e230fa69adb4b6cf3bab135ef2fae9ba9c8a Pull Request resolved: #115814
When the workspace size is changed via retuning, the corresponding buffer allocation in the wrapper needs to use the updated size. This did not work properly before; this diff fixes that. ghstack-source-id: 3b1533af8742c28575a4e5c13741883ad49bcd60 Pull Request resolved: #115877
…ng shared memory A common problem when fusing epilogues is that additional (auxiliary) inputs require shared memory. But when all shared memory is already used by the GEMM op, as is commonly the case for TMA ops, the compilation of the fused epilogue will fail. This adds an Sm90AuxLoadDirect operator which loads auxiliary inputs without shared memory.
…tain modified functions In order to bring down the number of lines in this PR considerably
…emory input/output layouts
… decisions, reduce CPU overhead These are optimizations based on partial profiling results. Retuning and fusion decisions can be improved, such that fewer fusion errors are encountered and more valid fusions result. CPU overhead can be reduced by making validation of arguments to GEMM kernels optional, and by ensuring that buffers are not unnecessarily allocated.
This is a small change which adds logging info about the CUDA architecture level (SM80, SM90 etc.) and increases a default precompilation timeout, so that precompilation works properly and speeds up autotuning even when a large number of kernels is being chosen from.
This PR is frozen in order to ensure reproducibility of experiments. Continued in #117745
This is an experimental feature branch PR for the Inductor CUTLASS backend. Please do not review. Features from this branch will enter PyTorch through separate PRs.
In order to ensure reproducibility of benchmarking experiments, this feature branch will no longer be rebased on main. The separated PRs will, of course, continue to be rebased.
Already merged
Noteworthy additional features / changes
of padding when possible and using aten.constant_cat_nd leads to significant speedup, including for models which are not enabled to use the Cutlass backend.