
[AMD] OptimizeLDSUsage pass #3730

Draft · wants to merge 19 commits into base: main

Conversation

binarman (Contributor):

This PR introduces the OptimizeLDSUsage pass, which generalizes the LDS optimization that was previously part of the DecomposeUnsupportedLayouts pass.

int tmpCvtLDS = getCvtOpLDSUsage(tmpCvt);
int newCvtLDS = getCvtOpLDSUsage(newEpilogueCvt);
if (tmpCvtLDS <= LDSSize && newCvtLDS <= LDSSize) {
  int LDSUsage = std::max(tmpCvtLDS, newCvtLDS);

@binarman (Contributor Author):

To clarify what this PR is doing:

At the moment we have an optimization in the DecomposeUnsupportedLayouts pass that looks for convert_layout operations requiring more shared memory than is available. The optimization tries to decompose such a convert_layout into two converts with some intermediate layout; in many cases this helps to reduce LDS usage.

The current approach cannot optimize the convert_layout in the Hopper flash attention test, so LDS overflows.
This PR introduces two things:

  1. adding more intermediate layout variants
  2. doing a global analysis to catch convert_layout operations that do not overflow LDS on their own, but do overflow when other shared-memory tensors are live at the same time.

The first item is needed because the old set of intermediate layouts was not able to optimize the conversions found in the Hopper FA kernel.

The second item is needed to generalize the optimization. For example, take a look at this IR:

 %1 = triton_gpu.local_alloc %arg0 : (tensor<128x128xf16, #blocked>) -> !tt.memdesc<128x128xf16, #shared>
 %2 = triton_gpu.convert_layout %arg1 : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
 %3 = triton_gpu.local_load %1 : !tt.memdesc<128x128xf16, #shared> -> tensor<128x128xf16, #triton_gpu.dot_op<{opIdx = 0, parent = #mma}>>

%1 consumes 16 KB of LDS, while %2 requires ~64 KB of LDS for a scratch buffer.
If there were no padding, %2 would be exactly 64 KB, which fits into LDS, but %1 and %2 together do not.
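
(For the f32 scratch buffer the back-of-the-envelope arithmetic is 128 × 128 × 4 bytes = 65,536 bytes = 64 KB before any padding, which matches the ~64 KB figure above.)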

P.S. I had some concerns that the new optimization could affect existing benchmarks. I had an offline conversation with the author of the original optimization (@oplavsic), and we decided it is best to leave the old optimization functionally the same, but move some functions to a common place and make them parameterizable.

* ->
* %1 = cvtOp %0 (srcLayout -> dstLayout)
* %2 = cvtOp %0 (srcLayout -> tmpLayout)
* %3 = cvtOp %1 (tmpLayout -> dstLayout)
@zhanglx13 (Collaborator) Apr 24, 2024:

Should this be %3 = cvtOp %2?

Collaborator:

This function creates two cvtOps based on a given cvtOp. Could you be more specific in the comment about which cvtOp is the new one and which is the old one?
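
(For context, a minimal sketch of what such a decompose-into-two-converts helper might look like; the names tmpCvt and newEpilogueCvt follow the hunks quoted in this PR, but this is illustrative rather than the PR's exact code, and it assumes the standard MLIR builder APIs:)

std::pair<triton::gpu::ConvertLayoutOp, triton::gpu::ConvertLayoutOp>
createNewConvertOps(OpBuilder &builder, triton::gpu::ConvertLayoutOp cvtOp,
                    Attribute tmpLayout) {
  auto srcType = mlir::cast<RankedTensorType>(cvtOp.getSrc().getType());
  auto dstType = mlir::cast<RankedTensorType>(cvtOp.getType());
  // Same shape and element type as the source, but with the temporary layout.
  auto tmpType = RankedTensorType::get(srcType.getShape(),
                                       srcType.getElementType(), tmpLayout);
  // New op #1: srcLayout -> tmpLayout.
  auto tmpCvt = builder.create<triton::gpu::ConvertLayoutOp>(
      cvtOp.getLoc(), tmpType, cvtOp.getSrc());
  // New op #2: tmpLayout -> original dstLayout; together they replace cvtOp.
  auto newEpilogueCvt = builder.create<triton::gpu::ConvertLayoutOp>(
      cvtOp.getLoc(), dstType, tmpCvt);
  return {tmpCvt, newEpilogueCvt};
}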

// LDS reduction is possible by changing the shape of WarpsPerCta attribute in
// mfma layout. The implicit LDS usage of cvt(mfma->blocked) op depends on the
// number of warps per CTA that mfma layout uses along x dimension and block
// layout uses across y dimension.
Collaborator:

It's a little confusing whether x refers to the row or column. We can use dim 0 and dim 1 instead.

// LDS usage of this op is roughly calculated as:
// LDS_USAGE = getShapePerCTA(mfma_layout)[0] * getShapePerCTA(blocked_layout)[1] * sizeof(data_type)
// LDS_USAGE = warpsPerCTA(mfma_layout)[0] * warpsPerCta(blocked_layout)[1] * C,
// where C = 32 * sizePerWarp(blocked_layout)[1] * threadsPerWarp(blocked_layout)[1] * sizeof(data_type)
Collaborator:

Why is 32 hardcoded? Is it assuming mfma32 is used?

binarman (Contributor Author):

To be honest, I did not look deeply into this comment; I just copied it from the original algorithm.
It was implemented a long time ago, and we probably had only mfma32 at the time.

I'll take a closer look and adjust.

for (int i = 0; i < tmpLayouts.size(); i++) {
  auto tmpLayout = tmpLayouts[i];
  std::tie(tmpCvt, newEpilogueCvt) =
      createNewConvertOps(builder, cvtOp, tmpLayout);
Collaborator:

In this loop, we only want to know the index of the tmpLayout that gives us the min LDS usage. Do we really need to create the cvtOps and erase them at the end of each iteration?

binarman (Contributor Author):

This creation/deletion is needed because the algorithm uses the getScratchConfigForCvtLayout(ConvertLayoutOp, unsigned &, unsigned &) function from Allocation.cpp to estimate LDS usage.

I can introduce a new interface so we can avoid this redundant work.
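
(For illustration, the create-measure-erase pattern being discussed looks roughly like the following sketch; variable and helper names follow the hunks quoted in this review, not necessarily the PR's exact code:)

int minLDSUsage = std::numeric_limits<int>::max();
int minIdx = -1;
for (int i = 0; i < static_cast<int>(tmpLayouts.size()); ++i) {
  auto tmpLayout = tmpLayouts[i];
  std::tie(tmpCvt, newEpilogueCvt) =
      createNewConvertOps(builder, cvtOp, tmpLayout);
  // The temporary ops exist only so the Allocation analysis can size their
  // scratch buffers; both are erased before trying the next candidate layout.
  int lds = std::max(getCvtOpLDSUsage(tmpCvt), getCvtOpLDSUsage(newEpilogueCvt));
  if (lds < minLDSUsage) {
    minLDSUsage = lds;
    minIdx = i;
  }
  newEpilogueCvt->erase();
  tmpCvt->erase();
}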

* @return mapping from operation to list of live LDS buffers
*/
std::map<mlir::Operation *, SmallVector<Allocation::BufferId>>
analyzeBufferLiveness(FunctionOpInterface func, const Allocation *allocations) {
Collaborator:

This is not AMD specific. Maybe we should put it in Analysis/Allocation.cpp?
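
(A rough sketch of one way such an op-to-live-buffers map could be computed, for readers following along: it assumes Allocation exposes getBufferId(Value) and InvalidBufferId as in Analysis/Allocation.h, uses a simple same-block approximation instead of a full liveness analysis, and is not the PR's implementation:)

std::map<mlir::Operation *, SmallVector<Allocation::BufferId>>
analyzeBufferLiveness(FunctionOpInterface func, const Allocation *allocation) {
  std::map<mlir::Operation *, SmallVector<Allocation::BufferId>> liveBuffers;
  func.walk([&](mlir::Operation *op) {
    func.walk([&](mlir::Operation *defOp) {
      if (defOp == op || defOp->getBlock() != op->getBlock())
        return;
      for (mlir::Value result : defOp->getResults()) {
        Allocation::BufferId id = allocation->getBufferId(result);
        if (id == Allocation::InvalidBufferId)
          continue;
        // A buffer is considered live at `op` if it is created before `op`
        // and still has a user at or after `op` in the same block.
        bool definedBefore = defOp->isBeforeInBlock(op);
        bool usedAtOrAfter =
            llvm::any_of(result.getUsers(), [&](mlir::Operation *user) {
              return user == op || (user->getBlock() == op->getBlock() &&
                                    op->isBeforeInBlock(user));
            });
        if (definedBefore && usedAtOrAfter)
          liveBuffers[op].push_back(id);
      }
    });
  });
  return liveBuffers;
}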

}

SmallVector<triton::gpu::ConvertLayoutOp>
findLDSBottleneck(ModuleAllocation &allocAnalysis, FunctionOpInterface func) {
Collaborator:

We could also put this in the common part since it could benefit the NV path. But then again, NV GPUs have pretty large shared memory ...

@zhanglx13 (Collaborator):

@binarman I have a question regarding tryMinimizeLDS.

%1 consumes 16 KB of LDS, while %2 requires ~64 KB of LDS for a scratch buffer.
If there were no padding, %2 would be exactly 64 KB, which fits into LDS, but %1 and %2 together do not.

In this example, %2 will be a candidate from findLDSBottleneck, and tryMinimizeLDS is called on it. However, tryMinimizeLDS will return early since currLDSUsage <= LDSSize. I think the problem is that tryMinimizeLDS should not take LDSSize as the target; instead it should take LDSSize - offset as the target, where offset can be recorded when we look for candidates in findLDSBottleneck.


namespace {

constexpr int LDSSize = 65536;
Collaborator:

Could we not hardcode it but pass it from the front end?

@binarman (Contributor Author):

@zhanglx13 about tryMinimizeLDS:

The condition filters out cases that will definitely overflow LDS, and there is no early exit.
We can actually remove this condition altogether, because we are looking for the smallest LDS usage anyway.

@zhanglx13 (Collaborator):

Yes, at least the early return condition needs to be removed.
And when you find the minLDSUsage, it could still be larger than LDSSize - offset, so tryMinimizeLDS should also return nothing in this case.
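
(A minimal sketch of the suggested adjustment, with hypothetical variable names; here offset stands for the LDS already occupied by buffers live at this convert:)

// Budget for this convert's scratch buffer once already-live buffers are accounted for.
int targetLDSSize = LDSSize - offset;
// Give up if even the best intermediate layout still does not fit the budget.
if (minLDSUsage > targetLDSSize)
  return;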

@binarman (Contributor Author):

the early return condition needs to be removed

Now I see; I had missed this early return, thank you!
At first I thought you were talking about the early exit from the loop.

module attributes {"triton_gpu.num-warps" = 8 : i32, "triton_gpu.threads-per-warp" = 64 : i32} {
tt.func public @alloc_convert_load(%arg0: tensor<128x128xf16, #blocked>, %arg1: tensor<128x128xf32, #blocked>) attributes {noinline = false} {
%1 = triton_gpu.local_alloc %arg0 : (tensor<128x128xf16, #blocked>) -> !tt.memdesc<128x128xf16, #shared>
%2 = triton_gpu.convert_layout %arg1 : tensor<128x128xf32, #blocked> -> tensor<128x128xf32, #mma>
Collaborator:

Sorry, I forgot to mention that I think this cvtOp is decomposed just because it uses more than 64 KB of LDS once padding is applied. Therefore, this test does not test the functionality that a cvtOp could still be decomposed even if it uses less than 64 KB of LDS.

@binarman (Contributor Author) Apr 26, 2024:

Added a new test: it uses fp16 instead of fp32, so the cvt scratch buffer is 2x smaller.
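
(With fp16 the back-of-the-envelope scratch size is 128 × 128 × 2 bytes = 32 KB before padding, so the conversion fits into the 64 KB LDS on its own and only overflows together with the other live buffer, which is exactly the situation the reviewer asked the test to cover.)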

@@ -147,6 +147,8 @@ def make_llir(src, metadata, options):
    pm = ir.pass_manager(mod.context)
    pm.enable_debug()
    amd.passes.ttgpuir.add_decompose_unsupported_conversions(pm)
    lds_size = 65536
binarman (Contributor Author):

I am not sure where to place the code that chooses the LDS size, so it is a plain constant at this point.
Let's introduce some interface in a later PR.

Collaborator:

It should be convenient to rebase onto Lei's PR #3808

@antiagainst marked this pull request as draft on April 30, 2024 at 22:15.
@antiagainst (Collaborator):

(Converting to draft as we chatted; we need to first get all issues addressed from the AMD side before making it open again.)

@binarman (Contributor Author) commented May 1, 2024:

@antiagainst @zhanglx13
This PR is ready for review, PTAL 🙂

namespace triton {
namespace AMD {

constexpr int kPtrBitWidth = 64;
Collaborator:

Do we really need to hardcode the pointer bitwidth? Can we just use an inline constant?

binarman (Contributor Author):

This part is copied from Allocation.cpp (it is not part of the public interface).
Maybe I can actually move this part into some public interface, for example into the Analysis/Utility module.

binarman (Contributor Author):

This is what I was talking about: binarman#6

res.LDS = std::numeric_limits<typeof(res.LDS)>::max();

triton::gpu::ConvertLayoutOp tmpCvt;
triton::gpu::ConvertLayoutOp newEpilogueCvt;
Collaborator:

The above three lines are not used.

threadsPerWarp[rank - 2] = warpSize / threadsPerWarp[rank - 1];
auto order = triton::gpu::getOrder(srcEnc);
auto layoutCTA = triton::gpu::getCTALayout(srcEnc);
auto fallbackLayout = triton::gpu::BlockedEncodingAttr::get(
Collaborator:

  1. For this fallbackLayout, all the components except warpsPerCTA are loop invariants. Maybe we can create a base BlockLayout outside the loop and use createTmpLayout(blockEnc, warpsPerCTA) inside the loop to update only the warpsPerCTA?
  2. Why is 8 chosen in warpSize / 8?
  3. In general, why do we need this fallbackLayout? Is it covered by either srcEnc or dstEnc?

binarman (Contributor Author):

  2. Why is 8 chosen in warpSize / 8?

For wave64 it will be [8, 8], and for wave32 it will be [4, 8]. This is done to make the layout tile "square", so that no dimension of the minimal tile dominates.

  3. In general, why do we need this fallbackLayout? Is it covered by either srcEnc or dstEnc?

In some cases a different warpsPerCTA for the src or dst layout is not enough to reduce LDS usage, but some other layout can be appropriate. These fallback layouts are designed to have as compact a tile as possible, i.e. elementsPerThread = [1, ... 1], with threadsPerWarp as "square" as possible.

I believe that in most cases the fallback layout will be chosen as the temporary layout. This could be suboptimal in terms of performance, but that is fine, because without this transformation the kernel would not compile at all.
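
(A rough 2D sketch of the fallback layout described above; the BlockedEncodingAttr::get argument order is assumed from the triton_gpu dialect, and rank, warpSize, srcEnc, warpsPerCTA, and ctx stand for values already available in the pass:)

// elementsPerThread = [1, 1]: the most compact per-thread tile.
SmallVector<unsigned> sizePerThread(rank, 1);
// threadsPerWarp as "square" as possible: [8, 8] for wave64, [4, 8] for wave32.
SmallVector<unsigned> threadsPerWarp(rank, 1);
threadsPerWarp[rank - 1] = 8;
threadsPerWarp[rank - 2] = warpSize / threadsPerWarp[rank - 1];
auto order = triton::gpu::getOrder(srcEnc);
auto ctaLayout = triton::gpu::getCTALayout(srcEnc);
auto fallbackLayout = triton::gpu::BlockedEncodingAttr::get(
    ctx, sizePerThread, threadsPerWarp, warpsPerCTA, order, ctaLayout);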

return;
}

triton::gpu::ConvertLayoutOp tmpCvt;
Collaborator:

Are we using this tmpCvt?

binarman (Contributor Author):

Nope, I will rewrite this part as is done in the DecomposeUnsupportedConversions pass.

if (offset + size > LDSSize) {
  auto maxScratchBufferSize = computeMaxScratchBufferSize(
      cvtOp, funcAnalysis, liveBuffers[cvtOp]);
  candidates.push_back({cvtOp, maxScratchBufferSize});
Collaborator:

This function is very confusing to me.

  1. Why do we need opBuffer? Just to check it's valid?
  2. Does liveBuffers[cvtOp] include opBuffer? To put it another way, is one of the bufIds the scratch buffer allocated for this cvtOp?
  3. It seems to me that this function assumes that there is at most one extra buffer that can overlap with the buffer for this cvtOp. If there are more live buffers that overlap with this cvtOp, we should still push cvtOp into candidates only once, but compute maxScratchBufferSize based on all overlapping live buffers.

binarman (Contributor Author):

Why do we need opBuffer? Just to check it's valid?

Sorry, this is a leftover from refactoring: I used to pass it to computeMaxScratchBufferSize, but then started computing it inside the function.

Does liveBuffers[cvtOp] include opBuffer? To put it another way, is one of the bufIds the scratch buffer allocated for this cvtOp?

Yes, the scratch buffer is treated the same as the "long-living" buffers; the only difference is that its lifetime is limited to one operation.

It seems to me that this function assumes that there is at most one extra buffer that can overlap with the buffer for this cvtOp. If there are more live buffers that overlap with this cvtOp, we should still push cvtOp into candidates only once, but compute maxScratchBufferSize based on all overlapping live buffers.

No, there can be any number of buffers whose lifetimes overlap with the scratch buffer.

Let me remove the loop from this function; it should make the algorithm clearer.

int64_t scratchBufferSize = allocation->getAllocatedSize(scratchBufferId);
size_t totalLDSConsumption = 0;
for (auto buf : liveBuffers)
  totalLDSConsumption = std::max(
Collaborator:

If all liveBuffers are live at this cvtOp, should we use sum instead of max here?

binarman (Contributor Author):

Max is the more conservative metric in this sense. Let's consider that we have "holes" in memory:

[Screenshot from 2024-05-08 16-12-09]

Let's say the green buffer is the scratch buffer we want to optimize, and the violet and blue buffers are long-living buffers in shared layout.

The hole is created because the pink tensor is allocated on tick 1 and reallocated on tick 2, while the previously allocated violet tensor continues to live.

Summing the buffer sizes would tell us that we have 20 KB (3 * 8 KB) for the scratch buffer, but in reality we probably want to make it smaller.
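
(A sketch of the max-based computation being discussed: the scratch budget is everything above the highest end offset of any buffer live at the convert, which conservatively ignores holes between live buffers. getOffset is assumed to return a buffer's base offset, just as getAllocatedSize returns its size in the quoted hunk:)

size_t totalLDSConsumption = 0;
for (Allocation::BufferId buf : liveBuffers)
  totalLDSConsumption = std::max<size_t>(
      totalLDSConsumption,
      allocation->getOffset(buf) + allocation->getAllocatedSize(buf));
// Whatever lies above the highest live buffer end is the scratch buffer budget.
int64_t maxScratchBufferSize =
    LDSSize - static_cast<int64_t>(totalLDSConsumption);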

* space available for scratch buffer.
*/
int64_t
computeMaxScratchBufferSize(triton::gpu::ConvertLayoutOp op,
Collaborator:

Maybe computeTargetBufferSize? I feel like "target" or "desired" is more accurate about what we want to do here.

@binarman (Contributor Author) commented May 8, 2024:

@zhanglx13 @antiagainst PTAL

@antiagainst (Collaborator) commented Jun 25, 2024:

@binarman @zhanglx13 what's the status on this pull request? Do we still need it?

@binarman (Contributor Author):

what's the status on this pull request? Do we still need it?

I don't think we should focus on this at the moment, because it is not blocking anything and no test/kernel requires this change.
But I still want to land this change at some point.

I have used this change a few times during debugging: adding device prints increases LDS consumption, and a normally working test can overflow LDS.

auto srcType = cvtOp.getSrc().getType();
auto bytes =
    isa<triton::PointerType>(srcType.getElementType())
        ? elems * kPtrBitWidth / 8
Contributor:

Where is kPtrBitWidth defined?

binarman (Contributor Author):

It is defined here: https://github.com/triton-lang/triton/pull/3730/files#diff-69efd7149b566a254eabbb7b7808df841b5fb3e78f82d074bc26aa9369d4e4bfR19

I agree that it is not the cleanest solution; feel free to propose another place.

@binarman (Contributor Author) commented Jul 5, 2024:

I've moved the refactoring of DecomposeUnsupportedConversions.cpp to a separate PR, #4262, so this PR now contains only the changes related to the new pass. I hope this makes review slightly easier.
