[Inductor cutlass GEMM backend] Fix for operand memory layout change between autotuning and CUTLASSGEMMTemplate.render #113366

kadeng · 2023-11-09T16:31:47Z

Stack from ghstack (oldest at bottom):

When compiling a Meta-internal model with torch.compile and the Cutlass GEMM backend enabled, compilation failed because the input memory layouts changed between autotuning and the call to CutlassGEMMTemplate.render.

Note:

Neither fixing the layout directly after autotuning nor this fix leads to optimal results in every case. Open to suggestions on what the best fix would be (freezing the layout?).
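To make the failure mode concrete, here is a minimal toy sketch of an operand whose strides are still allowed to change after autotuning has looked at them, next to the "freeze the layout" alternative raised in the note. `ToyBuffer`, `snapshot` and `layouts_changed` are hypothetical names invented for this illustration; they are not Inductor classes or functions, and the real layout machinery is more involved.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyBuffer:
    """Hypothetical stand-in for an Inductor buffer; not a real Inductor class."""

    size: Tuple[int, int]
    stride: Tuple[int, int]
    frozen: bool = False  # once frozen, the strides may no longer change

    def restride(self, stride: Tuple[int, int]) -> None:
        if self.frozen:
            raise RuntimeError("layout is frozen and cannot change")
        self.stride = stride

    def freeze(self) -> None:
        self.frozen = True


def snapshot(buffers: List[ToyBuffer]) -> List[Tuple[int, int]]:
    """Record the strides that autotuning saw for each operand."""
    return [b.stride for b in buffers]


def layouts_changed(buffers: List[ToyBuffer], seen: List[Tuple[int, int]]) -> bool:
    """True if any operand's strides differ from the autotuning snapshot."""
    return any(b.stride != s for b, s in zip(buffers, seen))


# Autotuning sees two row-major operands.
x = ToyBuffer(size=(4, 8), stride=(8, 1))
w = ToyBuffer(size=(8, 16), stride=(16, 1))
seen = snapshot([x, w])

# A later pass re-strides W to column-major before the kernel is rendered.
w.restride((1, 8))
changed_without_freeze = layouts_changed([x, w], seen)

# Alternative: freeze right after autotuning so later passes cannot re-stride.
w2 = ToyBuffer(size=(8, 16), stride=(16, 1))
w2.freeze()
try:
    w2.restride((1, 8))
    restride_blocked = False
except RuntimeError:
    restride_blocked = True
```

In this toy, freezing trades flexibility for consistency: the kernel chosen during autotuning stays valid, but a later pass can no longer pick a layout that might have been better overall.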

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @muchulee8 @aakhundov @ColinPeppler

pytorch-bot · 2023-11-09T16:31:51Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113366


❌ 1 New Failure

As of commit bea32f3 with merge base afe6d27:

NEW FAILURE - The following job has failed:

pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu) (gh)
inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_cutlass_backend_mm_bias_dynamic_False_max_autotune_gemm_backends_CUTLASS_only_evt_capable_True

This comment was automatically generated by Dr. CI and updates every 15 minutes.

aakhundov · 2023-11-10T14:50:15Z

torch/_inductor/codegen/cuda/gemm_template.py

-
+        # The layouts might have changed between autotuning and this call if they were FlexibleLayout
+        # we need to adapt, which might lead to suboptimal performance.
+        # @TODO kadeng: Find a way to solve this better


I believe this TODO is sufficient :)

Just FYI: This is addressed later in the diff stack.
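As a rough illustration of what "adapting" to a changed layout could look like at render time, the sketch below re-derives a row-major or column-major tag from each operand's strides and overwrites the tags on the selected op. Everything here is an assumption made for the example: `ToyLayout`, `layout_from_strides`, `adapt_op_layouts` and `make_op` are hypothetical names, and the op is a plain namespace rather than a real CUTLASS operation object.

```python
from enum import Enum
from types import SimpleNamespace
from typing import Optional, Sequence


class ToyLayout(Enum):
    """Hypothetical layout tags standing in for the CUTLASS layout types."""

    ROW_MAJOR = "row_major"
    COLUMN_MAJOR = "column_major"


def layout_from_strides(stride: Sequence[int]) -> Optional[ToyLayout]:
    """Classify an operand by which of its last two dims is contiguous."""
    if stride[-1] == 1:
        return ToyLayout.ROW_MAJOR
    if stride[-2] == 1:
        return ToyLayout.COLUMN_MAJOR
    return None  # neither dim is contiguous; this toy cannot express that


def adapt_op_layouts(op, strides) -> bool:
    """Overwrite the A/B/C/D layout tags to match the strides seen at render time.

    Returns True if any tag had to change. Operands given as None (for
    example a missing bias) are skipped.
    """
    changed = False
    for operand, stride in zip((op.A, op.B, op.C, op.D), strides):
        if stride is None:
            continue
        actual = layout_from_strides(stride)
        if actual is None:
            raise ValueError(f"unsupported strides: {tuple(stride)}")
        if operand.layout != actual:
            operand.layout = actual
            changed = True
    return changed


def make_op():
    """Build a toy op whose four operands were all tuned as row-major."""

    def tag():
        return SimpleNamespace(layout=ToyLayout.ROW_MAJOR)

    return SimpleNamespace(A=tag(), B=tag(), C=tag(), D=tag())


op = make_op()
# W became column-major after autotuning; there is no bias operand.
render_time_strides = [(8, 1), (1, 8), None, (16, 1)]
changed = adapt_op_layouts(op, render_time_strides)
```

Patching the tags keeps compilation from failing, but the kernel configuration was benchmarked against the old layouts, which is why the code comment warns about suboptimal performance.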

Skylion007 · 2023-12-09T18:36:49Z

torch/_inductor/codegen/cuda/gemm_template.py

+                [X, W, Bias, Y],
+                [op.A.layout, op.B.layout, op.C.layout, op.D.layout],


Suggested change

-                [X, W, Bias, Y],
-                [op.A.layout, op.B.layout, op.C.layout, op.D.layout],
+                (X, W, Bias, Y),
+                (op.A.layout, op.B.layout, op.C.layout, op.D.layout),
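A small sketch of why this suggestion is behavior-preserving, assuming the two sequences are only iterated over in parallel (for example with `zip`), which the quoted lines do not show. The string values are placeholders for the real operand and layout objects.

```python
# Placeholder values standing in for the real operands and layout tags.
operands = ("X", "W", "Bias", "Y")
layouts = ("row", "column", "row", "row")

# Pairing the two sequences gives the same result for tuples and lists.
pairs_from_tuples = list(zip(operands, layouts))
pairs_from_lists = list(zip(list(operands), list(layouts)))

# Tuples additionally cannot be modified by accident after construction.
try:
    operands[0] = "Z"
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True
```

For fixed-size literals that are never mutated, a tuple states that intent directly.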

kadeng · 2023-12-15T10:11:29Z

Moved to a (draft) feature branch, see #115919

[Inductor cutlass GEMM backend] Fix for operand memory layout change between autotuning and CUTLASSGEMMTemplate.render [ghstack-poisoned] (commit 1eaeda8)

This was referenced Nov 9, 2023

[Cutlass 3.3 submodule upgrade] #112861

Closed

[Inductor] Fix debug_str method of FusedSchedulerNode #113365

Closed

github-actions bot added module: inductor ciflow/inductor labels Nov 9, 2023

This was referenced Nov 9, 2023

[Inductor CUTLASS backend] Cutlass EVT Addmm support for Inductor Cutlass backend #113367

Closed

[inductor max autotune] Detailed autotuning result logs ( machine-readable ) #113399

Closed

kadeng mentioned this pull request Nov 10, 2023

[Inductor cutlass backend] Shuffling selection of GEMM ops for autotuning before applying limit #113442

Closed

aakhundov reviewed Nov 10, 2023

aakhundov reviewed Nov 10, 2023

This was referenced Nov 13, 2023

[Inductor max autotune] Multithreaded Precompilation #113558

Closed

[Cutlass inductor backend] Cutlass GEMM size threshold #113569

Closed

[Inductor cutlass backend] disable StreamK for GEMMs due to performance issues #113570

Closed

This was referenced Nov 16, 2023

Minor fix in Unit Test test_max_autotune.py #113889

Closed

[Inductor cutlass backend] Enable torch.bmm and torch.baddbmm support #113890

Closed

[inductor max autotune] Log but continue on errors during autotuning benchmark #113891

Closed

kadeng marked this pull request as ready for review November 16, 2023 20:16

kadeng requested a review from ipiszy November 16, 2023 20:17

This was referenced Nov 17, 2023

[Inductor cutlass backend] Experimental extended Cutlass op generator #113932

Closed

[Inductor Cutlass backend] Enabling flexible EVT-based pointwise fusions with additional tensor input #113959

Closed

kadeng added 3 commits November 17, 2023 18:41

kadeng marked this pull request as draft November 17, 2023 18:28

kadeng mentioned this pull request Nov 19, 2023

[Inductor cutlass backend] Allow to fall back to non-fuseable ops #114075

Closed

This was referenced Dec 5, 2023

[Inductor cutlass backend] Retuning after fusion #115173

Closed

[Inductor cutlass backend] Provide CUDA SM count to kernels #115174

Closed

kadeng mentioned this pull request Dec 6, 2023

[Inductor cutlass backend] Add support for row/column broadcasting of auxiliary inputs #115270

Closed

kadeng added 7 commits December 6, 2023 23:16

kadeng marked this pull request as ready for review December 9, 2023 17:45

Skylion007 reviewed Dec 9, 2023

kadeng added 2 commits December 10, 2023 11:19

kadeng mentioned this pull request Dec 14, 2023

[Inductor cutlass backend] Fixed workspace resize issue #115877

Closed

kadeng closed this Dec 15, 2023

facebook-github-bot deleted the gh/kadeng/18/head branch January 14, 2024 15:23
