# Run into dead loop when tuning the TMA persistent kernel #4332
You probably need to add a TMA fence. Can you try adding this at the top of the kernel body and let me know if it solves your issue?
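The exact snippet referenced above is not quoted in this thread. Below is a sketch of the kind of fence being suggested (the inline-assembly acquire fence that #4342 adds to the example code), assuming the TMA descriptor reaches the kernel as a pointer argument here called `desc_ptr` and that your Triton build exposes `tl.inline_asm_elementwise`:

```python
import triton
import triton.language as tl


@triton.jit
def kernel_with_tma_fence(desc_ptr):
    # Acquire fence on the tensormap proxy so this SM re-reads the (possibly
    # stale) TMA descriptor from global memory before any TMA load/store.
    tl.inline_asm_elementwise(
        "fence.proxy.tensormap::generic.acquire.gpu [$1], 128;",
        "=r, l",  # dummy 32-bit output register, 64-bit descriptor address input
        [desc_ptr],
        dtype=tl.int32,
        is_pure=False,
        pack=1,
    )
    # ... the rest of the kernel body uses desc_ptr as before ...
```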
This PR fixes the bug demonstrated [here](https://github.com/embg/triton/blob/ed125e4a44e397e9a40e691bb7ce40c698120a1a/tma_repro.py), which is the probable root cause of #4332.

## The problem

NVIDIA docs [recommend](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#using-tma-to-transfer-multi-dimensional-arrays) that TMA descriptors should be passed through immutable device memory, but Triton currently passes them through mutable device memory. This is unsafe unless the TMA descriptor cache is flushed *on every SM*.

The current implementation attempts to flush the cache by launching a [dedicated TMA flush kernel](https://github.com/triton-lang/triton/blob/3aeb223819d632303dd2b45f4dc533d6af90dc46/python/triton/tools/experimental_descriptor.py#L34). Unfortunately, this kernel does not run on all SMs. As a result, Triton TMA kernels may hang or return incorrect results.

According to @ThomasRaoux, it isn't possible to guarantee a kernel will run on every SM (as there may be another workload on a different CUDA stream). So flushing in a separate kernel is not possible.

## Proposed solution

* Add fences to all example code via inline assembly.
* Add documentation to inform users about the fence issue.
* Remove the existing cache flush code since it is incorrect.

## Why this solution?

Since each kernel needs to issue its own fence instruction, we have three options:

* Inline assembly
* Add a new op, like `tl._experimental_tma_acquire_fence(addr)`
* Use compiler analysis to insert the fence automatically

I believe we should not add a new op or analysis pass until both `__grid_constant__` and on-device descriptor mutation are landed. Once host-side descriptors switch to `__grid_constant__`, the fence will only be needed for on-device mutation, which won't require a separate op or analysis pass (simply add a fence while lowering the mutation op). If I'm wrong and we do end up needing a separate op or analysis pass, it will be trivial to clean up 6 lines of inline assembly.

## Checklist

- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [x] I have added tests.
- Select one of the following.
  - [x] I have not added any `lit` tests.
@sleepwalker2017 Please let me know if #4342 solves your issue :)
OK I'll try that!
Hi, I have another question: is this a bug?
When I pass in some specific configs, it reports this error. Are these configs illegal?
I tried my tuning code, and it seems the dead loop is solved!
Can you share the exact code and configs you are using? For example, could you fork the Triton repo, push a branch with the exact code you are running, and share the CLI command you use to repro the error? I would like to close this issue, as the dead loop problem was solved by adding the TMA fence. After you prepare the repro information for the config error, please open a new issue.
## Motivation

Currently, Triton passes TMA descriptors by-ref through global memory. This has a number of problems:

* Significant launch overhead (5-10us) for the host-to-device memcpy
* Users must insert fences for the TMA descriptor cache flush (see #4342). When users don't insert these fences correctly, they run into very strange bugs: #4332
* The memcpy makes it nearly impossible to use cudagraphs

There are two possible solutions:

* [Pass the descriptor by-value](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#using-tma-to-transfer-multi-dimensional-arrays)
* [Create the descriptor on-device](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#encoding-a-tensor-map-on-device)

Because of the tricky memory model for TMA descriptors on H100, creating a descriptor on-device requires moving data back and forth from L2 cache. This is relatively expensive (100s of cycles at least) and requires the user or compiler to correctly insert release/acquire fences. In some cases, there is no way to avoid creating the descriptor on-device. But for many use-cases, it's perfectly fine to set up the descriptor on the host and pass it by-value, avoiding both the performance and correctness issues. This PR implements the by-value functionality.

## User-level API

Whenever the user provides a kernel param which implements the method `tma_desc_cpu_ptr()`, Triton will lower that argument to a `__grid_constant__` by-value param. The existing helper methods `create_[1d/2d]_tma_descriptor` were modified to return such a type, so existing code does not need any changes to take advantage of the new feature.

## Implementation details

When a kernel param with `tma_desc_cpu_ptr()` is detected, we attach an attribute to that param at the TTIR level. The attribute is passed through to TTGIR. When lowering TTGIR to LLIR, we use code ported from Mosaic (jax-ml/jax#22175) to set up the correct LLVM attributes. The runtime is also modified to pass by-value TMA descriptors properly.

## Limitations

This feature is currently broken when compiling an `IRSource` directly (which is useful for editing IR and re-compiling). Supporting it would require updating some [regexes](https://github.com/triton-lang/triton/blob/edcc2bcb8dd2e9224c94b689df9cbb7d2986ebea/python/triton/compiler/compiler.py#L52-L53) which infer the function signature from the IR. `IRSource` compilation still works fine for kernels which do not use the new feature. Once the approach I'm taking here is reviewed, I plan to fix that limitation, either in this PR or in a follow-up PR.
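For reference, here is a hedged sketch of the user-level flow described above; it is not code from the PR itself. It assumes a `create_2d_tma_descriptor(ptr, rows, cols, block_rows, block_cols, element_size)` helper in `triton.tools.experimental_descriptor` and a kernel named `matmul_kernel_tma_persistent`; exact signatures may differ across Triton versions since this API is experimental.

```python
import torch
import triton.tools.experimental_descriptor as ted

M, K = 1024, 512
BLOCK_M, BLOCK_K = 128, 64

a = torch.randn(M, K, device="cuda", dtype=torch.float16)

# With this PR, the returned helper object implements tma_desc_cpu_ptr(), so
# Triton lowers it to a __grid_constant__ by-value kernel parameter instead of
# copying the descriptor into mutable global memory on every launch.
a_desc = ted.create_2d_tma_descriptor(
    a.data_ptr(), M, K, BLOCK_M, BLOCK_K, a.element_size()
)

# The descriptor is then passed like any other kernel argument, e.g.:
# matmul_kernel_tma_persistent[grid](a_desc, b_desc, c_desc, ...)
```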
I'm trying to run the `matmul_kernel_tma_persistent` kernel with various configs, but it often gets into a dead loop with 100% GPU utilization. Is that a bug?