[nvidia] Support passing TMA descriptors by-value #4498

embg · 2024-08-09T22:56:58Z

Motivation

Currently, Triton passes TMA descriptors by-ref through global memory. This has a number of problems:

- Significant launch overhead (5-10us) for the host-to-device memcpy.
- Users must insert fences for TMA descriptor cache flush (see "Short-term solution for TMA descriptor cache management", #4342). When users don't insert these fences correctly, they run into very strange bugs: "run into dead loop when tuning the tma persistent kernel" (#4332).
- The memcpy makes it nearly impossible to use cudagraphs.

There are two possible solutions: create the descriptor on-device, or create it on the host and pass it by-value.

Because of the tricky memory model for TMA descriptors on H100, creating a descriptor on-device requires moving data back and forth from L2 cache. This is relatively expensive (100s of cycles at least) and requires the user or compiler to correctly insert release/acquire fences.

In some cases, there is no way to avoid creating the descriptor on-device. But for many use-cases, it's perfectly fine to set up the descriptor on the host and pass by-value, avoiding both performance and correctness issues. This PR implements the by-value functionality.

User-level API

Whenever the user provides a kernel param which implements the method `tma_desc_cpu_ptr()`, Triton will lower that argument to a `__grid_constant__` by-value param. The existing helper methods `create_[1d/2d]_tma_descriptor` were modified to return such a type, so existing code does not need any changes to take advantage of the new feature.
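The duck-typed detection described above can be sketched roughly as follows. This is a minimal illustration, not Triton's actual code: `FakeTmaDesc` and `classify_kernel_arg` are hypothetical names, and the 128-byte buffer merely stands in for a host-side `CUtensorMap`. Only the method name `tma_desc_cpu_ptr()` comes from the PR description.

```python
import ctypes


class FakeTmaDesc:
    """Hypothetical stand-in for a host-side TMA descriptor object."""

    def __init__(self):
        # 128 bytes of host memory standing in for a CUtensorMap.
        self._buf = ctypes.create_string_buffer(128)

    def tma_desc_cpu_ptr(self):
        # Host address of the descriptor bytes.
        return ctypes.addressof(self._buf)


def classify_kernel_arg(arg):
    """Return "byval_tma" for args exposing tma_desc_cpu_ptr(), else "regular"."""
    method = getattr(arg, "tma_desc_cpu_ptr", None)
    return "byval_tma" if callable(method) else "regular"


# Only the first argument implements the method, so only it is
# treated as a by-value TMA descriptor.
kinds = [classify_kernel_arg(a) for a in (FakeTmaDesc(), 42, 3.14)]
print(kinds)  # ['byval_tma', 'regular', 'regular']
```

Keying on a method rather than on a concrete class means any user-defined wrapper can opt in, which is presumably why existing callers of the helper functions need no changes.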

Implementation details

When a kernel param with `tma_desc_cpu_ptr()` is detected, we attach an attribute to that param at the TTIR level. The attribute is passed through to TTGIR. When lowering TTGIR to LLIR, we use code ported from Mosaic (jax-ml/jax#22175) to set up the correct LLVM attributes. The runtime is also modified to pass by-value TMA descriptors properly.

Limitations

This feature is currently broken when compiling an `IRSource` directly (which is useful for editing IR and re-compiling). Fixing that would require updating some regexes which infer the function signature from the IR. `IRSource` compilation still works fine for kernels which do not use the new feature.

Once the approach I'm taking here is reviewed, I plan to fix that limitation, either in this PR or in a follow-up PR.

python/test/unit/hopper/test_experimental_tma.py

ThomasRaoux · 2024-08-09T23:04:20Z

python/triton/runtime/build.py

@@ -42,7 +42,8 @@ def _build(name, src, srcdir, library_dirs, include_dirs, libraries):
    py_include_dir = sysconfig.get_paths(scheme=scheme)["include"]
    custom_backend_dirs = set(os.getenv(var) for var in ('TRITON_CUDACRT_PATH', 'TRITON_CUDART_PATH'))
    include_dirs = include_dirs + [srcdir, py_include_dir, *custom_backend_dirs]
-    cc_cmd = [cc, src, "-O3", "-shared", "-fPIC", "-o", so]
+    # for -Wno-psabi, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111047
+    cc_cmd = [cc, src, "-O3", "-shared", "-fPIC", "-Wno-psabi", "-o", so]


what is causing the extra warning?

embg · 2024-08-13

GCC doesn't like the `CUtensorMap` struct. This is called out in the CUDA C++ Programming Guide as a false warning:

When passing the tensor map as a parameter, some versions of the GCC C++ compiler issue the warning “the ABI for passing parameters with 64-byte alignment has changed in GCC 4.6”. This warning can be ignored.

I don't think it can be suppressed inline via pragma; it has to be suppressed on the command line: https://godbolt.org/z/f5n5crhjG

htyu

Thanks for the good work. LGTM in general. Left a couple of minor comments.

htyu · 2024-08-14T02:30:56Z

lib/Conversion/TritonGPUToLLVM/FuncOpToLLVM.cpp

+          llvmFuncOp.setArgAttr(i, "nvvm.grid_constant",
+                                mlir::UnitAttr::get(llvmFuncOp.getContext()));
+          llvmFuncOp.setArgAttr(i, "llvm.align",
+                                mlir::IntegerAttr::get(i32_type, 64));


Is 64 a required alignment value?

embg · 2024-08-15

Yes. Here is the definition of `CUtensorMap` in `<cuda.h>`:

typedef struct CUtensorMap_st { alignas(64) unsigned long long opaque[16]; } CUtensorMap;
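To make the size and alignment implied by that definition concrete, here is a small host-side sketch using `ctypes`. It is an illustration only: `TensorMapBytes` and `alloc_aligned_tensor_map` are hypothetical names, and the manual over-allocation is just one way to obtain a 64-byte-aligned buffer in Python, not what the Triton driver does.

```python
import ctypes

TENSOR_MAP_QWORDS = 16  # from the quoted definition: opaque[16]
TENSOR_MAP_ALIGN = 64   # from the quoted definition: alignas(64)


class TensorMapBytes(ctypes.Structure):
    """Hypothetical mirror of the opaque payload: 16 x 8 bytes."""
    _fields_ = [("opaque", ctypes.c_uint64 * TENSOR_MAP_QWORDS)]


def alloc_aligned_tensor_map():
    """Over-allocate, then place the struct at the first 64-byte boundary."""
    size = ctypes.sizeof(TensorMapBytes)
    raw = ctypes.create_string_buffer(size + TENSOR_MAP_ALIGN - 1)
    base = ctypes.addressof(raw)
    offset = (-base) % TENSOR_MAP_ALIGN  # bytes to skip to reach alignment
    desc = TensorMapBytes.from_buffer(raw, offset)
    return raw, desc  # return raw as well so the backing memory stays alive


raw, desc = alloc_aligned_tensor_map()
print(ctypes.sizeof(TensorMapBytes))                # 128
print(ctypes.addressof(desc) % TENSOR_MAP_ALIGN)    # 0
```

The 128-byte size is simply 16 eight-byte words. The 64-byte alignment is the value mirrored by the `llvm.align` attribute in the diff above.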

third_party/nvidia/backend/driver.py

python/triton/tools/experimental_descriptor.py

ThomasRaoux

LGTM once the other comments are addressed

Summary: This PR follows [a recent PR in Triton](triton-lang/triton#4498) that supports passing TMA descriptors by-value using `__grid_constant__`. In this PR, we update the kernel `_attn_fwd_inner` to support this new Triton feature. To support auto-tune, we implement a helper class that wraps the TMA operations used during auto-tune and during computation in the kernel, respectively. In addition, the benchmark program now checks whether the Triton version supports this new feature. If it doesn't, the helper class falls back to the old way of handling TMA.

The change has been tested on Triton from the standard installation of pytorch on conda, as well as on a recent Triton that includes the above PR.

Command for testing and experiment results:

Before removing fences: P1541573348
After removing fences: P1541736645

1) CUDA_VISIBLE_DEVICES=5, old tma: 138.476
2) CUDA_VISIBLE_DEVICES=5, new tma, with fences: 152 - 164
3) CUDA_VISIBLE_DEVICES=5, new tma, after removing fences: 168.0
4) CUDA_VISIBLE_DEVICES=5, no tma: 187.881

The result is still behind no TMA and we can investigate further.

Pull Request resolved: #2428
Reviewed By: embg
Differential Revision: D61668142
Pulled By: sfzhu93
fbshipit-source-id: d08bab147c6b2197f73447ee8f30ede877e712ca

This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:

- Here we assume the availability of the host-side TMA API added to upstream Triton in triton-lang/triton#4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024.
- To capture the chain of calls `t.data_ptr() --> create_{1d,2d}_tma_descriptor(ptr, ...) --> kernel[grid](tma_desc, ...)`, we add three new variable trackers: `DataPtrVariable`, `CreateTMADescriptorVariable` (for the function), `TMADescriptorVariable` (for TMA descriptor object). This is to maintain the path back from the Triton kernel to the Tensor from which the TMA descriptor has been created.
- The newly introduced variables have `reconstruct` methods used in case of graph breaks.
- The `tma_descriptor_metadata` extracted from the captured `create_{1d,2d}_tma_descriptor` calls is propagated through the HOPs in Dynamo and AOTAutograd to be used by the downstream compiler (e.g., Inductor). See the unit tests for what the captured HOP arguments look like.
- In the Dynamo-captured fx graph, we replace the TMA descriptor arguments of the Triton kernel by the underlying Tensors, to be able to track the input/output relationships in terms of Tensors.
- In the Triton kernel mutation analysis pass (in AOTAutograd), we use the `tt.experimental_descriptor_store` TTIR op to detect mutations of the underlying tensors via TMA descriptors, so that downstream AOTAutograd can perform functionalizations as required.
- JIT Inductor and AOT Inductor support will be implemented in follow-up PRs.
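The idea of swapping descriptor arguments for their underlying tensors while carrying the descriptor parameters along as metadata can be sketched in plain Python. This is a toy model under stated assumptions, not the Dynamo implementation: `ToyTensor`, `ToyTMADescriptor`, and `swap_descriptors_for_tensors` are hypothetical names, and the `(dims, block_dims, element_size)` tuple layout is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyTensor:
    """Hypothetical stand-in for a tensor."""
    name: str


@dataclass
class ToyTMADescriptor:
    """Hypothetical descriptor that remembers the tensor it was built from."""
    source: ToyTensor
    dims: tuple
    block_dims: tuple
    element_size: int


def swap_descriptors_for_tensors(kernel_kwargs):
    """Replace descriptor args by their source tensors; collect metadata by arg name."""
    swapped, metadata = {}, {}
    for key, value in kernel_kwargs.items():
        if isinstance(value, ToyTMADescriptor):
            swapped[key] = value.source
            metadata[key] = (value.dims, value.block_dims, value.element_size)
        else:
            swapped[key] = value
    return swapped, metadata


a = ToyTensor("a")
desc = ToyTMADescriptor(a, dims=(1024, 512), block_dims=(128, 64), element_size=2)
swapped, metadata = swap_descriptors_for_tensors({"a_desc": desc, "n": 7})
print(swapped)
print(metadata)
```

After the swap, the kernel arguments refer only to tensors and plain values, so input/output relationships can be tracked per tensor, while the metadata dict retains what a later stage would need to recreate each descriptor.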

Differential Revision: [D64404928](https://our.internmc.facebook.com/intern/diff/D64404928)
Pull Request resolved: #137677
Approved by: https://github.com/zou3519

X-link: pytorch/pytorch#137677
Approved by: https://github.com/zou3519
Reviewed By: clee2000
Differential Revision: D64404928
Pulled By: aakhundov
fbshipit-source-id: c812cea3867c55800d5fe213bf07bf21292345e3

This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:

- Here we assume the availability of the host-side TMA API added to upstream Triton in triton-lang/triton#4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024.
- Thanks to the Dynamo support implemented in the previous PR, the `tma_descriptor_metadata` dict is delivered to the `triton_kernel_wrap_` lowering and passed to the `ir.UserDefinedTritonKernel` as an additional argument.
- Looking into the `tma_descriptor_metadata`, `ir.UserDefinedTritonKernel` substitutes the corresponding `TensorBox` arguments of the kernel (swapped upstream in Dynamo) by the new `ir.TMADescriptor` nodes implementing TMA descriptors in Inductor IR.
- `ir.TMADescriptor.__init__` provides the wiring between the upstream underlying `ir.TensorBox` and the downstream `ir.UserDefinedTritonKernel` kernel. In particular, we use `ir.NonOwningLayout` wrapping `ir.ReinterpretView` to avoid the upstream tensor's buffer being deleted prematurely (before the TMA descriptor is used in the Triton kernel).
- Via `ir.TMADescriptor.codegen`, Triton's `create_{1d,2d}_tma_descriptor` function call is codegened in the wrapper (in the host code).
- New `TMADescriptorArg` dataclass is added to handle the Triton kernel metadata pertinent to host-side TMA.
- AOT Inductor support will be implemented in a follow-up PR.

Pull Request resolved: pytorch#137950
Approved by: https://github.com/eellison
ghstack dependencies: pytorch#137677

embg requested review from bertmaher, htyu and manman-ren August 9, 2024 22:57

embg added 12 commits August 11, 2024 22:30

- byval tma desc working prototype (8bb3c1d)
- nits for driver code (f14b9cb)
- TmaDescKernelParam class (1e010f7)
- refactor FuncOpConversion (1ed44a3)
- bugfix for null argAttrDict (98d7191)
- add lit test (4a8e66b)
- format (e3d4032)
- update unit tests for byval tma (20d63a8)
- format (8a3b707)
- check PTX in unit tests (b5b420f)
- remove local test script (cd03b7a)
- small bugfix (35dde67)

embg force-pushed the grid_const_dev branch from 5d8e86c to 35dde67 Compare August 12, 2024 04:30

ThomasRaoux reviewed Aug 12, 2024

- fence when byval_tma is false (84d0173)

htyu reviewed Aug 14, 2024

ThomasRaoux approved these changes Aug 14, 2024

embg marked this pull request as ready for review August 15, 2024 03:49

embg requested a review from ptillet as a code owner August 15, 2024 03:49

embg added 2 commits August 15, 2024 12:19

- nits (24079ae)
- Merge branch 'main' into grid_const_dev (6bec85b)

embg merged commit c25f684 into triton-lang:main Aug 19, 2024
6 checks passed

embg deleted the grid_const_dev branch August 19, 2024 18:26

- Aug 22, 2024: sfzhu93 mentioned this pull request in "add support for auto-tune TMA grid constant" (pytorch/benchmark#2428, Closed)
- Sep 3, 2024: jlebar mentioned this pull request in "Build LLVMAarch64CodeGen if CMAKE_OSX_ARCHITECTURES is arm64." (#4637, Merged)
- Sep 19, 2024: gflegar mentioned this pull request in "Refactor the C code template in third_party/nvidia/backend/driver.py" (#4722, Open, 5 tasks)
- Oct 12, 2024: aakhundov mentioned this pull request in "Add host-side Triton TMA support to Dynamo" (pytorch/pytorch#137677, Closed)
- Oct 15, 2024: aakhundov mentioned this pull request in "Add host-side Triton TMA support to Inductor" (pytorch/pytorch#137950, Closed)
