
run into dead loop when tuning the tma persistent kernel #4332

Closed
sleepwalker2017 opened this issue Jul 16, 2024 · 6 comments

Comments

@sleepwalker2017

I'm trying to run the matmul_kernel_tma_persistent kernel with various configs, but it often gets stuck in a dead loop with the GPU at 100% utilization.

Is that a bug?
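Roughly, the tuning sweep looks like this (a simplified sketch only; the parameter grid and the driver loop are illustrative, not my exact script):

```python
import itertools

block_m = [64, 128, 256]
block_n = [64, 128, 256]
block_k = [64, 128]
num_stages = [2, 3, 4]
num_warps = [4, 8]

for BM, BN, BK, s, w in itertools.product(block_m, block_n, block_k,
                                          num_stages, num_warps):
    cfg = dict(BLOCK_SIZE_M=BM, BLOCK_SIZE_N=BN, BLOCK_SIZE_K=BK,
               num_stages=s, num_warps=w)
    # Each config is used to launch matmul_kernel_tma_persistent; for some of
    # them the launch never returns and the GPU sits at 100% utilization.
    print("trying", cfg)
```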

@embg
Contributor

embg commented Jul 17, 2024

You probably need to add a TMA fence. Can you try adding this at the top of the kernel body and let me know if it solves your issue?

tl.inline_asm_elementwise("fence.proxy.tensormap::generic.acquire.gpu [$1], 128; // $0 dummy reg", "=r, l",
                        [a_desc_ptr], dtype=tl.int32, is_pure=False, pack=1)
tl.inline_asm_elementwise("fence.proxy.tensormap::generic.acquire.gpu [$1], 128; // $0 dummy reg", "=r, l",
                        [b_desc_ptr], dtype=tl.int32, is_pure=False, pack=1)
tl.inline_asm_elementwise("fence.proxy.tensormap::generic.acquire.gpu [$1], 128; // $0 dummy reg", "=r, l",
                        [c_desc_ptr], dtype=tl.int32, is_pure=False, pack=1)
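
For placement, something like this minimal sketch (the toy kernel and the `tl._experimental_descriptor_load` call are just for illustration, in the style of the TMA tutorials; only the fence lines are the important part):

```python
import triton
import triton.language as tl

@triton.jit
def tma_fence_then_load(a_desc_ptr, out_ptr, BLOCK_M: tl.constexpr, BLOCK_K: tl.constexpr):
    # Acquire fence first, so this SM observes the descriptor bytes written by
    # the host before any TMA instruction consumes them.
    tl.inline_asm_elementwise("fence.proxy.tensormap::generic.acquire.gpu [$1], 128; // $0 dummy reg", "=r, l",
                              [a_desc_ptr], dtype=tl.int32, is_pure=False, pack=1)
    # The first descriptor load happens only after the fence.
    a = tl._experimental_descriptor_load(a_desc_ptr, [0, 0], [BLOCK_M, BLOCK_K], tl.float16)
    offs = tl.arange(0, BLOCK_M)[:, None] * BLOCK_K + tl.arange(0, BLOCK_K)[None, :]
    tl.store(out_ptr + offs, a)
```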

ThomasRaoux pushed a commit that referenced this issue Jul 17, 2024
This PR fixes the bug demonstrated
[here](https://github.com/embg/triton/blob/ed125e4a44e397e9a40e691bb7ce40c698120a1a/tma_repro.py),
which is the probable root cause of
#4332.

## The problem
NVIDIA docs
[recommend](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#using-tma-to-transfer-multi-dimensional-arrays)
that TMA descriptors should be passed through immutable device memory,
but Triton currently passes them through mutable device memory. This is
unsafe unless the TMA descriptor cache is flushed *on every SM*.

The current implementation attempts to flush the cache by launching a
[dedicated TMA flush
kernel](https://github.com/triton-lang/triton/blob/3aeb223819d632303dd2b45f4dc533d6af90dc46/python/triton/tools/experimental_descriptor.py#L34).
Unfortunately, this kernel does not run on all SMs. As a result, Triton
TMA kernels may hang or return incorrect results.

According to @ThomasRaoux, it isn't possible to guarantee a kernel will
run on every SM (as there may be another workload on a different CUDA
stream). So flushing in a separate kernel is not possible.

## Proposed solution
* Add fences to all example code via inline assembly.
* Add documentation to inform users about the fence issue.
* Remove the existing cache flush code since it is incorrect.

## Why this solution?
Since each kernel needs to issue its own fence instruction, we have
three options:
* Inline assembly
* Add a new op, like `tl._experimental_tma_acquire_fence(addr)`
* Use compiler analysis to insert the fence automatically

I believe we should not add a new op or analysis pass until both
`__grid_constant__` and on-device descriptor mutation are landed. Once
host-side descriptors switch to `__grid_constant__`, the fence will only
be needed for on-device mutation, which won't require a separate op or
analysis pass (simply add a fence while lowering the mutation op).

If I'm wrong and we do end up needing a separate op or analysis pass, it
will be trivial to clean up 6 lines of inline assembly.
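
For reference, the "new op" option could look roughly like a thin wrapper around the same inline assembly. The helper below is hypothetical, not an existing API:

```python
import triton
import triton.language as tl

@triton.jit
def tma_acquire_fence(desc_ptr):
    # Same fence as the inline assembly above, packaged as a reusable helper
    # that kernels could call before their first use of a descriptor.
    tl.inline_asm_elementwise("fence.proxy.tensormap::generic.acquire.gpu [$1], 128; // $0 dummy reg", "=r, l",
                              [desc_ptr], dtype=tl.int32, is_pure=False, pack=1)
```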

## Checklist

- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [x] I have added tests.

- Select one of the following.
  - [x] I have not added any `lit` tests.
@embg
Contributor

embg commented Jul 17, 2024

@sleepwalker2017 Please let me know if #4342 solves your issue :)

@sleepwalker2017
Author

> #4342

OK, I'll try that!

@sleepwalker2017
Author

> @sleepwalker2017 Please let me know if #4342 solves your issue :)

Hi, I have another question. Is this a bug?

loc("/data/triton/python/tutorials/tma_test.py":110:32): error: 'tt.trans' op inferred type(s) '!tt.memdesc<128x128xf8E4M3FNUZ, #triton_gpu.shared<{vec = 16, perPhase = 1, maxPhase = 8, order = [0, 1], hasLeadingOffset = true}>, #triton_gpu.shared_memory>' are incompatible with return type(s) of operation '!tt.memdesc<128x128xf8E4M3FNUZ, #triton_gpu.shared<{vec = 16, perPhase = 1, maxPhase = 8, order = [0, 1], hasLeadingOffset = false}>, #triton_gpu.shared_memory>'
loc("/data/triton/python/tutorials/tma_test.py":110:32): error: 'tt.trans' op failed to infer returned types

When I pass in some specific configs, it reports this error. Are these configs illegal?

@sleepwalker2017
Author

> @sleepwalker2017 Please let me know if #4342 solves your issue :)

I tried my tuning code, and it seems the dead-loop issue is solved!

@embg
Contributor

embg commented Jul 18, 2024

> When I pass in some specific configs, it reports this error. Are these configs illegal?

Can you share the exact code and configs you are using? For example, you could fork the Triton repo, push a branch with the exact code you are running, and share the CLI command you use to reproduce the error.

I would like to close this issue, since the dead-loop problem was solved by adding the TMA fence. Once you have prepared the repro information for the `'tt.trans' op failed to infer returned types` error, perhaps you can open a new issue?

embg closed this as completed Jul 18, 2024
embg added a commit that referenced this issue Aug 19, 2024
## Motivation
Currently, Triton passes TMA descriptors by-ref through global memory.
This has a number of problems:
* Significant launch overhead (5-10us) for the host-to-device memcpy
* Users must insert fences for TMA descriptor cache flush (see
#4342). When users don't
insert these fences correctly, they run into very strange bugs:
#4332
* The memcpy makes it nearly impossible to use cudagraphs

There are two possible solutions:
* [Pass the descriptor
by-value](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#using-tma-to-transfer-multi-dimensional-arrays)
* [Create the descriptor
on-device](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#encoding-a-tensor-map-on-device)

Because of the tricky memory model for TMA descriptors on H100, creating
a descriptor on-device requires moving data back and forth from L2
cache. This is relatively expensive (100s of cycles at least) and
requires the user or compiler to correctly insert release/acquire
fences.

In some cases, there is no way to avoid creating the descriptor
on-device. But for many use-cases, it's perfectly fine to set up the
descriptor on the host and pass by-value, avoiding both performance and
correctness issues. This PR implements the by-value functionality.

## User-level API
Whenever the user provides a kernel param which implements the method
`tma_desc_cpu_ptr()`, Triton will lower that argument to a
`__grid_constant__` by-value param. The existing helper methods
`create_[1d/2d]_tma_descriptor` were modified to return such a type, so
existing code does not need any changes to take advantage of the new
feature.
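
A rough usage sketch (the helper signature is assumed from `python/triton/tools/experimental_descriptor.py` of this era, and the kernel launch is elided):

```python
import torch
from triton.tools.experimental_descriptor import create_2d_tma_descriptor

M, K = 1024, 512
BLOCK_M, BLOCK_K = 128, 64
a = torch.randn(M, K, device="cuda", dtype=torch.float16)

# After this PR the returned object implements tma_desc_cpu_ptr(), so the
# runtime lowers it to a __grid_constant__ by-value kernel parameter instead
# of copying the descriptor through mutable global memory.
a_desc = create_2d_tma_descriptor(a.data_ptr(), M, K, BLOCK_M, BLOCK_K, a.element_size())

# The descriptor is then passed like any other kernel argument, e.g.:
# matmul_kernel_tma_persistent[grid](a_desc, ...)
```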

## Implementation details
When a kernel param with `tma_desc_cpu_ptr()` is detected, we attach an
attribute to that param at the TTIR level. The attribute is passed
through to TTGIR. When lowering TTGIR to LLIR, we use code ported from
Mosaic (jax-ml/jax#22175) to set up the correct
LLVM attributes. The runtime is also modified to pass by-value TMA
descriptors properly.
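
Schematically, the runtime-side detection amounts to something like this (illustrative only, not the actual runtime code):

```python
def classify_kernel_arg(arg):
    # Anything exposing tma_desc_cpu_ptr() is treated as a by-value TMA
    # descriptor and lowered to a __grid_constant__ param; everything else
    # keeps the ordinary argument path.
    if hasattr(arg, "tma_desc_cpu_ptr"):
        return ("grid_constant_tma_desc", arg.tma_desc_cpu_ptr())
    return ("regular", arg)
```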

## Limitations
This feature is currently broken when compiling an `IRSource` directly
(which is useful for editing IR and re-compiling). That would require
updating some
[regexes](https://github.com/triton-lang/triton/blob/edcc2bcb8dd2e9224c94b689df9cbb7d2986ebea/python/triton/compiler/compiler.py#L52-L53)
which infer the function signature from the IR. `IRSource` compilation
still works fine for kernels which do not use the new feature.

Once the approach I'm taking here is reviewed, I plan to fix that
limitation, either in this PR or in a follow-up PR.