
[Cutlass inductor backend] Cutlass GEMM size threshold #113569

Closed
wants to merge 24 commits

Conversation

@kadeng (Contributor) commented Nov 13, 2023

Stack from ghstack (oldest at bottom):

Cutlass backend GEMMs are comparatively expensive to compile, so they should only be applied to sufficiently large GEMMs. This small diff introduces a new torch._inductor.config option called "cuda.cutlass_backend_min_gemm_size", which sets a size threshold below which GEMM problems will not be considered for the Cutlass backend.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @muchulee8 @aakhundov @ColinPeppler
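The gating described above can be sketched as a simple predicate. This is a minimal sketch, not the PR's actual code: the helper name is hypothetical, and the real check lives inside Inductor's CUTLASS backend selection.

```python
def should_consider_cutlass(m: int, n: int, k: int, min_gemm_size: int = 1) -> bool:
    """Return True when the GEMM problem volume M*N*K meets the configured
    threshold (cuda.cutlass_backend_min_gemm_size). Hypothetical helper
    mirroring the gating this PR adds; the default of 1 keeps all GEMMs
    eligible, matching the option's default."""
    return m * n * k >= min_gemm_size

# With the default threshold of 1, even a tiny GEMM passes the gate.
print(should_consider_cutlass(8, 8, 8))                           # True
# With a larger threshold, small GEMMs are filtered out before autotuning.
print(should_consider_cutlass(8, 8, 8, min_gemm_size=1_000_000))  # False
```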


pytorch-bot bot commented Nov 13, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113569

Note: Links to docs will display an error until the docs builds have been completed.

❌ 13 New Failures, 13 Unrelated Failures

As of commit a9edf4f with merge base afe6d27:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

kadeng added a commit that referenced this pull request Nov 13, 2023

ghstack-source-id: d89d08edb3b05dc6a11308c1273453eba52d82ba
Pull Request resolved: #113569
@@ -577,6 +577,9 @@ class cuda:
# are enabled for the CUTLASS backend.
cutlass_only_evt_capable_ops: bool = False

# Minimum of M*N*K to consider the CUTLASS backend for GEMM ops.
cutlass_backend_min_gemm_size: int = 1

Specifying the GEMM size threshold with a single number sounds a bit too simplistic. E.g., there can be tall or thin GEMMs where CUTLASS may outperform Triton, but the M * N * K would be as small as a moderately-sized square GEMM. Can't suggest an immediate alternative, though, as no total ordering in GEMM sizes.
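The reviewer's concern can be made concrete: a square GEMM and a tall/skinny GEMM can have exactly the same M*N*K volume, so a single scalar threshold cannot distinguish them. The shapes below are illustrative, not taken from the PR.

```python
def gemm_volume(shape):
    """M*N*K problem volume for a GEMM shape (m, n, k)."""
    m, n, k = shape
    return m * n * k

# Two very different GEMM shapes with identical M*N*K volume.
square = (512, 512, 512)        # M = N = K
tall_skinny = (8192, 16, 1024)  # tall in M, very narrow in N

print(gemm_volume(square) == gemm_volume(tall_skinny))  # True: both 134_217_728
```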

@@ -577,6 +577,9 @@ class cuda:
# are enabled for the CUTLASS backend.
cutlass_only_evt_capable_ops: bool = False

# Minimum of M*N*N to consider the CUTLASS backend for GEMM ops.
Nit: should be "MNK".
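For reference, a user-side sketch of how the new option would be set alongside backend selection. This is a config fragment under stated assumptions: `max_autotune_gemm_backends` is a pre-existing Inductor option, `cuda.cutlass_backend_min_gemm_size` is the option added by this PR, and the threshold value shown is purely illustrative (the PR's default is 1, i.e. no filtering).

```python
import torch._inductor.config as inductor_config

# Opt in to CUTLASS among the autotuned GEMM backends (existing option).
inductor_config.max_autotune_gemm_backends = "ATEN,TRITON,CUTLASS"

# Only consider the CUTLASS backend for GEMMs whose M*N*K volume meets
# this threshold (illustrative value).
inductor_config.cuda.cutlass_backend_min_gemm_size = 256 * 256 * 256
```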

@kadeng (Contributor, Author) commented Dec 15, 2023

Moved to a (draft) feature branch, see #115919

@kadeng kadeng closed this Dec 15, 2023
@facebook-github-bot facebook-github-bot deleted the gh/kadeng/23/head branch January 14, 2024 15:23