[Inductor CUTLASS backend] Step 4: CUDA (template) kernels #107931

ipiszy · 2023-08-25T05:40:00Z

This is step 4 of adding CUTLASS as an alternative Inductor backend.
Full tests can be found in the last PR in the stack.

Feature request: #106991.

Stack from ghstack (oldest at bottom):

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov

[ghstack-poisoned]

pytorch-bot · 2023-08-25T05:40:03Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/107931

✅ You can merge normally! (2 Unrelated Failures)

As of commit e62717f with merge base f9a250c:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

lintrunner / linux-job (gh)

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

cuda12.1-py3.10-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu, unstable) (gh)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: db3e505fbfd679038c8446f035d8426221426d02 Pull Request resolved: #107931

kadeng · 2023-08-25T11:12:06Z

torch/_inductor/codegen/common.py

+        return None
+
+
+class ChoiceCaller:


A small doc comment would be great: What's the purpose of this class / which problem does it solve? Is it supposed to have subclasses?

kadeng · 2023-08-25T11:13:30Z

torch/_inductor/codegen/common.py

+
+class KernelTemplate:
+    """
+    Base class for defining kernel templates.


Which kind of kernel templates? ( e.g. Triton / C++ / Cutlass / any involving Jinja templates )

kadeng · 2023-08-25T11:14:56Z

torch/_inductor/codegen/common.py

+    def __init__(self, name: str):
+        self.name = name
+
+    def maybe_append_choice(self, choices, **kwargs):


What is the "choices" argument here? ( e.g. datatype and intended usage). A small doc comment would help clarify I think.

kadeng · 2023-08-25T11:16:50Z

torch/_inductor/codegen/cuda/cuda_kernel.py

+from ..cpp import CppPrinter, DTYPE_TO_CPP
+
+
+cexpr = CppPrinter().doprint


This might benefit from a type annotation and/or a small comment what it is, so we don't need to follow the link to CppPrinter when reading the code.

kadeng · 2023-08-25T11:23:58Z

torch/_inductor/codegen/cuda/cuda_kernel.py

+    """
+    Kernels defined by the CUDA language.
+    """
+    overrides = OpOverrides


Which consequences does this line have? It's a bit hard to read out of the inductor codegen, how exactly these overrides fields are used.

kadeng · 2023-08-25T11:30:42Z

torch/_inductor/codegen/cuda/cuda_kernel.py

+            return "0"
+        return str(node.get_layout().offset)
+
+    def ptr(self, node: IRNode, default_node: IRNode = None) -> str:


A small doc comment maybe? Same for other methods here, like dtype, offset, call_kernel etc.

kadeng · 2023-08-25T11:36:01Z

torch/_inductor/codegen/cuda/cuda_scheduling.py

+from typing import List
+
+from ... import config
+from ...codecache import code_hash, get_path


It would be better to use absolute imports here, as that eases code navigation tools ( both in IDEs and github code views for example )

kadeng · 2023-08-25T11:36:27Z

torch/_inductor/codegen/cuda/cuda_template.py

+# import cutlass libs
+import scripts as cutlass_lib
+
+from ...autotune_process import CUDABenchmarkRequest, TensorMeta


Same here, absolute imports would be preferable for most tools

ipiszy

Thanks @kadeng !

ipiszy · 2023-08-27T00:22:40Z

torch/_inductor/codegen/common.py

+        return None
+
+
+class ChoiceCaller:


This is the original code moved from select_algorithm.py. Let me add some comments.

ipiszy · 2023-08-27T00:34:41Z

torch/_inductor/codegen/common.py

+    def __init__(self, name: str):
+        self.name = name
+
+    def maybe_append_choice(self, choices, **kwargs):


Added some comments.

ipiszy · 2023-08-27T00:36:45Z

torch/_inductor/codegen/cuda/cuda_scheduling.py

+from typing import List
+
+from ... import config
+from ...codecache import code_hash, get_path


Most existing Inductor code uses relative imports. I feel it's easier to just follow existing coding style.

ipiszy · 2023-08-27T00:39:51Z

torch/_inductor/codegen/cuda/cuda_kernel.py

+from ..cpp import CppPrinter, DTYPE_TO_CPP
+
+
+cexpr = CppPrinter().doprint


Added a comment, basically it's a print function.
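
To illustrate what "basically it's a print function" means here, the following is a minimal sketch of the pattern of binding a printer object's `doprint` method to a module-level name. `TinyCppPrinter` and its tuple-based expression format are hypothetical stand-ins invented for this example; Inductor's actual `CppPrinter` works on sympy expressions and is not reproduced here.

```python
from typing import Tuple, Union

# Hypothetical expression format for this sketch: an int, a variable name,
# or a tuple (op, lhs, rhs). The real CppPrinter works on sympy expressions.
Expr = Union[int, str, Tuple[str, "Expr", "Expr"]]


class TinyCppPrinter:
    """Toy printer that renders an expression tree as a C++ expression string."""

    _BINARY_OPS = {"add": "+", "sub": "-", "mul": "*"}

    def doprint(self, expr: Expr) -> str:
        if isinstance(expr, int):
            return f"{expr}L"  # emit integers as C++ long literals
        if isinstance(expr, str):
            return expr  # variable names are passed through unchanged
        op, lhs, rhs = expr
        symbol = self._BINARY_OPS[op]
        return f"({self.doprint(lhs)} {symbol} {self.doprint(rhs)})"


# Same shape as the line under review: bind the method once, then call it
# like a plain function wherever a C++ expression string is needed.
cexpr = TinyCppPrinter().doprint

offset_expr = cexpr(("add", ("mul", "s0", 4), "s1"))
print(offset_expr)
```

Binding the method once keeps call sites short (`cexpr(expr)` instead of constructing a printer each time), which is presumably why the alias exists.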

ipiszy · 2023-08-27T00:45:49Z

torch/_inductor/codegen/cuda/cuda_kernel.py

+    """
+    Kernels defined by the CUDA language.
+    """
+    overrides = OpOverrides


This is related to Inductor's "define-by-run" IR design, I think. If you check lowering.py, you will find that the lowered Loops contain code like ops.some_method(). These ops.some_method() calls are overridden by the OpOverrides of each kernel, e.g. TritonKernel has its Triton overrides and CPPKernel has its C++ overrides.

The CUDAKernel here is not a general backend, since it only supports templates for now, so OpOverrides doesn't matter yet. However, with flexible epilogue fusion, it could be used to generate CUTLASS epilogue visitor tree code, so it could become relevant here.
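
The define-by-run idea described in this reply can be sketched roughly as follows. This is a toy model, not Inductor's API: all class and function names are invented for illustration, and the handler is passed in explicitly, whereas the reply only says that `ops.some_method()` calls are overridden per kernel.

```python
class TritonLikeOverrides:
    """Toy handler: renders ops as Triton-flavoured source strings."""

    @staticmethod
    def add(a: str, b: str) -> str:
        return f"({a} + {b})"

    @staticmethod
    def relu(a: str) -> str:
        return f"tl.maximum({a}, 0)"


class CppLikeOverrides:
    """Toy handler: renders the same ops as C++-flavoured source strings."""

    @staticmethod
    def add(a: str, b: str) -> str:
        return f"({a} + {b})"

    @staticmethod
    def relu(a: str) -> str:
        return f"std::max({a}, 0.0f)"


def lowered_body(ops) -> str:
    """Stand-in for a lowered loop body: written once against `ops`."""
    return ops.relu(ops.add("x", "bias"))


class ToyKernel:
    """A kernel class selects its handler through the `overrides` attribute."""

    overrides = TritonLikeOverrides

    def codegen(self, body) -> str:
        # Running the body against the handler "defines" the generated code.
        return body(self.overrides)


class ToyCppKernel(ToyKernel):
    overrides = CppLikeOverrides


triton_src = ToyKernel().codegen(lowered_body)
cpp_src = ToyCppKernel().codegen(lowered_body)
print(triton_src)
print(cpp_src)
```

The same body yields different source text depending on which `overrides` class the kernel carries, which is the sense in which the attribute would matter once a template kernel has to generate epilogue code.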

ipiszy · 2023-08-27T01:02:57Z

torch/_inductor/codegen/cuda/cuda_kernel.py

+            return "0"
+        return str(node.get_layout().offset)
+
+    def ptr(self, node: IRNode, default_node: IRNode = None) -> str:


Added some comments.

ipiszy · 2023-08-27T01:05:49Z

torch/_inductor/codegen/common.py

+
+class KernelTemplate:
+    """
+    Base class for defining kernel templates.


Added some comments.

ipiszy

Thanks @aakhundov ! Updated the PR, ptal.

ipiszy · 2023-09-06T22:38:46Z

torch/_inductor/codegen/cuda/cuda_kernel.py

+                self.named_nodes[name] = node
+                self.args.input_buffers[node.get_name()] = name


Yes, dict order is useful when generating func args.
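
As a small illustration of why insertion order matters here: Python dicts preserve insertion order (guaranteed since Python 3.7), so a signature and a call site generated from the same dict stay aligned. The buffer names, the `float*` parameter type, and both helper functions below are assumptions made for this sketch, not code from the PR.

```python
from typing import Dict, List

# Hypothetical (buffer name -> kernel argument name) registrations, made in
# the order the kernel signature should list them.
input_buffers: Dict[str, str] = {}
for buffer_name, arg_name in [("buf3", "Bias"), ("buf0", "X"), ("buf1", "W")]:
    input_buffers[buffer_name] = arg_name


def render_signature(kernel_name: str, buffers: Dict[str, str]) -> str:
    """Emit a C-style signature whose parameters follow dict insertion order."""
    params: List[str] = [f"const float* {arg}" for arg in buffers.values()]
    return f"void {kernel_name}({', '.join(params)})"


def render_call(kernel_name: str, buffers: Dict[str, str]) -> str:
    """Emit the matching call; iterating the same dict keeps arguments aligned."""
    return f"{kernel_name}({', '.join(buffers.keys())});"


signature = render_signature("toy_gemm", input_buffers)
call = render_call("toy_gemm", input_buffers)
print(signature)
print(call)
```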

ipiszy · 2023-09-06T22:41:54Z

torch/_inductor/codegen/cuda/cuda_template.py

+            else list(range(len(self.input_nodes)))
+        )
+        expected_args = list(
+            unique(self.input_nodes[idx].get_name() for idx in input_reorder)


I think it's fine. We want to dedup args in the function definition. It doesn't affect function implementation codegen. Let me add a unittest to verify.
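
The deduplication discussed here can be sketched as below. `unique_in_order` is a hypothetical stand-in for the `unique` helper seen in the diff (whose implementation is not shown in this thread), and `expected_arg_names` is an invented wrapper that mirrors the shape of the quoted expression.

```python
from typing import Hashable, Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T", bound=Hashable)


def unique_in_order(items: Iterable[T]) -> Iterator[T]:
    """Yield each item once, keeping the position of its first occurrence."""
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


def expected_arg_names(
    input_names: List[str], input_reorder: Optional[List[int]] = None
) -> List[str]:
    """Argument names for the function definition, with duplicates removed."""
    order = input_reorder if input_reorder is not None else list(range(len(input_names)))
    return list(unique_in_order(input_names[idx] for idx in order))


# The same buffer ("buf0") feeds two template inputs, so it should appear
# only once in the generated function definition.
names = ["buf0", "buf0", "buf2"]
deduped = expected_arg_names(names)
deduped_reordered = expected_arg_names(names, input_reorder=[2, 0, 1])
print(deduped, deduped_reordered)
```

Only the parameter list is deduplicated in this sketch; as the reply notes, the body of the generated function is unaffected.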

ipiszy · 2023-09-06T22:58:28Z

torch/_inductor/codegen/cuda/cuda_kernel.py

+
+class CUDATemplateKernel(CUDAKernel):
+    """
+    Template kernels defined by the CUDA language.


Well, I think it depends on how you define this. Maybe "language extension" is more accurate. I'd like to distinguish it from the "CUDA platform", which also contains things like PTX, cubin, etc. Let me change it to "C++ CUDA".

ipiszy · 2023-09-06T23:02:26Z

torch/_inductor/scheduler.py

+                if isinstance(node.node, ir.CUDATemplateBuffer):
+                    from .codegen.cuda.cuda_scheduling import CUDAScheduling
+
+                    CUDAScheduling(self).codegen_template(node, epilogue)


can_fuse() always returns False for CUDATemplateBuffer (in triton.py). It invokes Triton codegen_template() here to codegen only the non-epilogue part.

ipiszy · 2023-09-06T23:09:36Z

torch/_inductor/codegen/cuda/cuda_kernel.py

+
+        if node is None:
+            return None
+        return {**self.args.input_buffers, **self.args.output_buffers}.get(


The OSS linter formats it this way.

aakhundov

@ipiszy there are some CI jobs failing, worth checking as not filtered out as flaky / broken trunk. As the 3 PRs below this one in the stack are green and the one above is also red, my guess is that the root cause may be in this one.

ipiszy · 2023-09-07T23:22:29Z

> @ipiszy there are some CI jobs failing, worth checking as not filtered out as flaky / broken trunk. As the 3 PRs below this one in the stack are green and the one above is also red, my guess is that the root cause may be in this one.

Yes, I'm on it. There were a bunch of test failures. I've fixed some, and now there is something new. It's an iterative process.

ipiszy · 2023-09-09T18:33:52Z

@pytorchbot label "topic: not user facing"

chenyang78 · 2023-09-10T08:59:22Z

torch/_inductor/codegen/cuda/cuda_kernel.py

+        input_reorder: The actual order of input nodes.
+                       e.g. The template might have input argument defined as [X, W, Bias],
+                       and the actual input passed into this template could be [Bias, X, W].
+                       In this case, the `input_reorder` would be [2, 0, 1].


Wondering why we allow different orders for the actual input and the template declaration in the first place? Could we enforce the actual input to have the same order specified by the template?

I assume, for template front-end similarity with the existing Triton templates?
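
The `input_reorder` example from the quoted docstring can be worked through as follows. This sketch assumes one reading of the docstring, namely that `input_reorder[i]` is the template-declaration index of the i-th actual argument; `call_order` is a hypothetical helper, not a function from the PR.

```python
from typing import List, Optional, Sequence


def call_order(
    template_inputs: Sequence[str], input_reorder: Optional[Sequence[int]] = None
) -> List[str]:
    """Return the template inputs in the order they are actually passed.

    Assumed convention: input_reorder[i] is the index, in the template
    declaration, of the i-th actual argument.
    """
    if input_reorder is None:
        input_reorder = range(len(template_inputs))
    if sorted(input_reorder) != list(range(len(template_inputs))):
        raise ValueError("input_reorder must be a permutation of the input indices")
    return [template_inputs[idx] for idx in input_reorder]


declared = ["X", "W", "Bias"]  # order in the template declaration
actual = call_order(declared, input_reorder=[2, 0, 1])
print(actual)
```

Under this reading, `[2, 0, 1]` maps the declared `[X, W, Bias]` to the passed `[Bias, X, W]`, matching the docstring's example.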

This is the step 5 to add cutlass as an alternative inductor backend. Feature request: #106991. Pull Request resolved: #108015 Approved by: https://github.com/kadeng, https://github.com/jansel, https://github.com/aakhundov ghstack dependencies: #107802, #107847, #107901, #107931

[Inductor CUTLASS backend] Step 4: CUDA (template) kernels

75cfaa0

[ghstack-poisoned]

ipiszy mentioned this pull request Aug 25, 2023

[Inductor CUTLASS backend] Step 1: Inductor config for cuda / cutlass, util functions. #107802

Closed

This was referenced Aug 25, 2023

[Inductor CUTLASS backend] Step 2: CUDACodeCache #107847

Closed

[Inductor CUTLASS backend] Step 3: autotune_process, and CUDABenchmarkRequest #107901

Closed

github-actions bot added module: inductor ciflow/inductor labels Aug 25, 2023

ipiszy added a commit that referenced this pull request Aug 25, 2023

[Inductor CUTLASS backend] Step 4: CUDA (template) kernels

4c78e30

ghstack-source-id: db3e505fbfd679038c8446f035d8426221426d02 Pull Request resolved: #107931

ipiszy mentioned this pull request Aug 25, 2023

[Inductor CUTLASS backend] Step 5: CUTLASS gemm kernels #107933

Closed

kadeng reviewed Aug 25, 2023

ipiszy added 10 commits August 25, 2023 11:00

Update on "[Inductor CUTLASS backend] Step 4: CUDA (template) kernels" · 0c6a52d
Update on "[Inductor CUTLASS backend] Step 4: CUDA (template) kernels" · c0b885d
Update on "[Inductor CUTLASS backend] Step 4: CUDA (template) kernels" · a96e751
Update on "[Inductor CUTLASS backend] Step 4: CUDA (template) kernels" · 5ad9fee
Update on "[Inductor CUTLASS backend] Step 4: CUDA (template) kernels" · 51fa22c
Update on "[Inductor CUTLASS backend] Step 4: CUDA (template) kernels" · 701bbee
Update on "[Inductor CUTLASS backend] Step 4: CUDA (template) kernels" · 9949e3f
Update on "[Inductor CUTLASS backend] Step 4: CUDA (template) kernels" · 46da92b
Update on "[Inductor CUTLASS backend] Step 4: CUDA (template) kernels" · d4008e5
Update on "[Inductor CUTLASS backend] Step 4: CUDA (template) kernels" · 44bee1c

ipiszy commented Aug 27, 2023

ipiszy marked this pull request as ready for review August 27, 2023 01:39

ipiszy requested a review from jansel August 27, 2023 01:39

ipiszy commented Sep 6, 2023

ipiszy added 4 commits September 6, 2023 17:13

ipiszy requested review from aakhundov and kadeng September 7, 2023 05:48

aakhundov requested changes Sep 7, 2023

ipiszy requested a review from aakhundov September 8, 2023 03:13

jansel approved these changes Sep 8, 2023

pytorch-bot bot added the topic: not user facing topic category label Sep 9, 2023

chenyang78 reviewed Sep 10, 2023

aakhundov approved these changes Sep 10, 2023

kadeng approved these changes Sep 11, 2023

ipiszy added 4 commits September 11, 2023 12:04

pytorchmergebot added the Merged label Sep 12, 2023

pytorchmergebot closed this in 097fd43 Sep 12, 2023

facebook-github-bot deleted the gh/ipiszy@gmail.com/4/head branch September 16, 2023 14:23
