
run into dead loop when tuning the tma persistent kernel #4332

Closed
sleepwalker2017 opened this issue Jul 16, 2024 · 6 comments

Comments

@sleepwalker2017

I'm trying to run the matmul_kernel_tma_persistent kernel with various configs, but it often gets stuck in a dead loop with the GPU at 100% utilization.

Is that a bug?
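Roughly, the tuning sweep looks like this (a simplified sketch only; the parameter grid and the driver loop are illustrative, not my exact script):

```python
import itertools

block_m = [64, 128, 256]
block_n = [64, 128, 256]
block_k = [64, 128]
num_stages = [2, 3, 4]
num_warps = [4, 8]

for BM, BN, BK, s, w in itertools.product(block_m, block_n, block_k,
                                          num_stages, num_warps):
    cfg = dict(BLOCK_SIZE_M=BM, BLOCK_SIZE_N=BN, BLOCK_SIZE_K=BK,
               num_stages=s, num_warps=w)
    # Each config is used to launch matmul_kernel_tma_persistent; for some of
    # them the launch never returns and the GPU sits at 100% utilization.
    print("trying", cfg)
```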

@embg
Contributor

embg commented Jul 17, 2024

You probably need to add a TMA fence. Can you try adding this at the top of the kernel body and let me know if it solves your issue?

tl.inline_asm_elementwise("fence.proxy.tensormap::generic.acquire.gpu [$1], 128; // $0 dummy reg", "=r, l",
                        [a_desc_ptr], dtype=tl.int32, is_pure=False, pack=1)
tl.inline_asm_elementwise("fence.proxy.tensormap::generic.acquire.gpu [$1], 128; // $0 dummy reg", "=r, l",
                        [b_desc_ptr], dtype=tl.int32, is_pure=False, pack=1)
tl.inline_asm_elementwise("fence.proxy.tensormap::generic.acquire.gpu [$1], 128; // $0 dummy reg", "=r, l",
                        [c_desc_ptr], dtype=tl.int32, is_pure=False, pack=1)
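
For placement, something like this minimal sketch (the toy kernel and the `tl._experimental_descriptor_load` call are just for illustration, in the style of the TMA tutorials; only the fence lines are the important part):

```python
import triton
import triton.language as tl

@triton.jit
def tma_fence_then_load(a_desc_ptr, out_ptr, BLOCK_M: tl.constexpr, BLOCK_K: tl.constexpr):
    # Acquire fence first, so this SM observes the descriptor bytes written by
    # the host before any TMA instruction consumes them.
    tl.inline_asm_elementwise("fence.proxy.tensormap::generic.acquire.gpu [$1], 128; // $0 dummy reg", "=r, l",
                              [a_desc_ptr], dtype=tl.int32, is_pure=False, pack=1)
    # The first descriptor load happens only after the fence.
    a = tl._experimental_descriptor_load(a_desc_ptr, [0, 0], [BLOCK_M, BLOCK_K], tl.float16)
    offs = tl.arange(0, BLOCK_M)[:, None] * BLOCK_K + tl.arange(0, BLOCK_K)[None, :]
    tl.store(out_ptr + offs, a)
```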

ThomasRaoux pushed a commit that referenced this issue Jul 17, 2024
This PR fixes the bug demonstrated
[here](https://github.com/embg/triton/blob/ed125e4a44e397e9a40e691bb7ce40c698120a1a/tma_repro.py),
which is the probable root cause of
#4332.

## The problem
NVIDIA docs
[recommend](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#using-tma-to-transfer-multi-dimensional-arrays)
that TMA descriptors should be passed through immutable device memory,
but Triton currently passes them through mutable device memory. This is
unsafe unless the TMA descriptor cache is flushed *on every SM*.

The current implementation attempts to flush the cache by launching a
[dedicated TMA flush
kernel](https://github.com/triton-lang/triton/blob/3aeb223819d632303dd2b45f4dc533d6af90dc46/python/triton/tools/experimental_descriptor.py#L34).
Unfortunately, this kernel does not run on all SMs. As a result, Triton
TMA kernels may hang or return incorrect results.

According to @ThomasRaoux, it isn't possible to guarantee a kernel will
run on every SM (as there may be another workload on a different CUDA
stream). So flushing in a separate kernel is not possible.

## Proposed solution
* Add fences to all example code via inline assembly.
* Add documentation to inform users about the fence issue.
* Remove the existing cache flush code since it is incorrect.

## Why this solution?
Since each kernel needs to issue its own fence instruction, we have
three options:
* Inline assembly
* Add a new op, like `tl._experimental_tma_acquire_fence(addr)`
* Use compiler analysis to insert the fence automatically

I believe we should not add a new op or analysis pass until both
`__grid_constant__` and on-device descriptor mutation are landed. Once
host-side descriptors switch to `__grid_constant__`, the fence will only
be needed for on-device mutation, which won't require a separate op or
analysis pass (simply add a fence while lowering the mutation op).

If I'm wrong and we do end up needing a separate op or analysis pass, it
will be trivial to clean up 6 lines of inline assembly.
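
For reference, the "new op" option could look roughly like a thin wrapper around the same inline assembly. The helper below is hypothetical, not an existing API:

```python
import triton
import triton.language as tl

@triton.jit
def tma_acquire_fence(desc_ptr):
    # Same fence as the inline assembly above, packaged as a reusable helper
    # that kernels could call before their first use of a descriptor.
    tl.inline_asm_elementwise("fence.proxy.tensormap::generic.acquire.gpu [$1], 128; // $0 dummy reg", "=r, l",
                              [desc_ptr], dtype=tl.int32, is_pure=False, pack=1)
```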

## Checklist

- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [x] I have added tests.

- Select one of the following.
  - [x] I have not added any `lit` tests.
@embg
Contributor

embg commented Jul 17, 2024

@sleepwalker2017 Please let me know if #4342 solves your issue :)

@sleepwalker2017
Author

> #4342

OK, I'll try that!

@sleepwalker2017
Author

> @sleepwalker2017 Please let me know if #4342 solves your issue :)

Hi, I have another question. Is this a bug?

loc("/data/triton/python/tutorials/tma_test.py":110:32): error: 'tt.trans' op inferred type(s) '!tt.memdesc<128x128xf8E4M3FNUZ, #triton_gpu.shared<{vec = 16, perPhase = 1, maxPhase = 8, order = [0, 1], hasLeadingOffset = true}>, #triton_gpu.shared_memory>' are incompatible with return type(s) of operation '!tt.memdesc<128x128xf8E4M3FNUZ, #triton_gpu.shared<{vec = 16, perPhase = 1, maxPhase = 8, order = [0, 1], hasLeadingOffset = false}>, #triton_gpu.shared_memory>'
loc("/data/triton/python/tutorials/tma_test.py":110:32): error: 'tt.trans' op failed to infer returned types

When I pass in some specific configs, it reports this error. Are these configs illegal?

@sleepwalker2017
Author

> @sleepwalker2017 Please let me know if #4342 solves your issue :)

I tried my tuning code, and it seems the dead-loop issue is solved!

@embg
Contributor

embg commented Jul 18, 2024

> When I pass in some specific configs, it reports this error. Are these configs illegal?

Can you share the exact code and configs you are using? For example, you could fork the Triton repo, push a branch with the exact code you are running, and share the CLI command you use to reproduce the error.

I would like to close this issue, since the dead-loop problem was solved by adding the TMA fence. Once you have prepared the repro information for the `'tt.trans' op failed to infer returned types` error, perhaps you can open a new issue?

embg closed this as completed Jul 18, 2024
embg added a commit that referenced this issue Aug 19, 2024
## Motivation
Currently, Triton passes TMA descriptors by-ref through global memory.
This has a number of problems:
* Significant launch overhead (5-10us) for the host-to-device memcpy
* Users must insert fences for TMA descriptor cache flush (see
#4342). When users don't
insert these fences correctly, they run into very strange bugs:
#4332
* The memcpy makes it nearly impossible to use cudagraphs

There are two possible solutions:
* [Pass the descriptor
by-value](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#using-tma-to-transfer-multi-dimensional-arrays)
* [Create the descriptor
on-device](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#encoding-a-tensor-map-on-device)

Because of the tricky memory model for TMA descriptors on H100, creating
a descriptor on-device requires moving data back and forth from L2
cache. This is relatively expensive (100s of cycles at least) and
requires the user or compiler to correctly insert release/acquire
fences.

In some cases, there is no way to avoid creating the descriptor
on-device. But for many use-cases, it's perfectly fine to set up the
descriptor on the host and pass by-value, avoiding both performance and
correctness issues. This PR implements the by-value functionality.

## User-level API
Whenever the user provides a kernel param which implements the method
`tma_desc_cpu_ptr()`, Triton will lower that argument to a
`__grid_constant__` by-value param. The existing helper methods
`create_[1d/2d]_tma_descriptor` were modified to return such a type, so
existing code does not need any changes to take advantage of the new
feature.
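
A rough usage sketch (the helper signature is assumed from `python/triton/tools/experimental_descriptor.py` of this era, and the kernel launch is elided):

```python
import torch
from triton.tools.experimental_descriptor import create_2d_tma_descriptor

M, K = 1024, 512
BLOCK_M, BLOCK_K = 128, 64
a = torch.randn(M, K, device="cuda", dtype=torch.float16)

# After this PR the returned object implements tma_desc_cpu_ptr(), so the
# runtime lowers it to a __grid_constant__ by-value kernel parameter instead
# of copying the descriptor through mutable global memory.
a_desc = create_2d_tma_descriptor(a.data_ptr(), M, K, BLOCK_M, BLOCK_K, a.element_size())

# The descriptor is then passed like any other kernel argument, e.g.:
# matmul_kernel_tma_persistent[grid](a_desc, ...)
```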

## Implementation details
When a kernel param with `tma_desc_cpu_ptr()` is detected, we attach an
attribute to that param at the TTIR level. The attribute is passed
through to TTGIR. When lowering TTGIR to LLIR, we use code ported from
Mosaic (jax-ml/jax#22175) to set up the correct
LLVM attributes. The runtime is also modified to pass by-value TMA
descriptors properly.
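
Schematically, the runtime-side detection amounts to something like this (illustrative only, not the actual runtime code):

```python
def classify_kernel_arg(arg):
    # Anything exposing tma_desc_cpu_ptr() is treated as a by-value TMA
    # descriptor and lowered to a __grid_constant__ param; everything else
    # keeps the ordinary argument path.
    if hasattr(arg, "tma_desc_cpu_ptr"):
        return ("grid_constant_tma_desc", arg.tma_desc_cpu_ptr())
    return ("regular", arg)
```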

## Limitations
This feature is currently broken when compiling an `IRSource` directly
(which is useful for editing IR and re-compiling). That would require
updating some
[regexes](https://github.com/triton-lang/triton/blob/edcc2bcb8dd2e9224c94b689df9cbb7d2986ebea/python/triton/compiler/compiler.py#L52-L53)
which infer the function signature from the IR. `IRSource` compilation
still works fine for kernels which do not use the new feature.

Once the approach I'm taking here is reviewed, I plan to fix that
limitation, either in this PR or in a follow-up PR.