
[WIP][AMDGPU] Improve the handling of inreg arguments #133614

Draft · wants to merge 1 commit into base: main from users/shiltian/inreg-vgpr-improvement
Conversation

shiltian (Contributor)

When the SGPRs available for `inreg` argument passing run out, the compiler silently
falls back to passing those arguments in whole VGPRs. Ideally, instead of
consuming a whole VGPR per argument, we should pack `inreg` arguments into
individual lanes of VGPRs.

This PR introduces `InregVGPRSpiller`, which handles this packing. It uses
`v_writelane` at the call site to place `inreg` arguments into specific VGPR
lanes, and then extracts them in the callee using `v_readlane`.

Fixes #130443 and #129071.

shiltian (Contributor, Author)

This stack of pull requests is managed by Graphite.

@llvmbot (Member) commented Mar 30, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)



Full diff: https://github.com/llvm/llvm-project/pull/133614.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+97-2)
  • (added) llvm/test/CodeGen/AMDGPU/inreg-vgpr-spill.ll (+28)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index c8645850fe111..628ea2515482d 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -2841,6 +2841,86 @@ void SITargetLowering::insertCopiesSplitCSR(
   }
 }
 
+class InregVGPRSpiller {
+  CCState &State;
+  const unsigned WaveFrontSize;
+
+  Register CurReg;
+  unsigned CurLane = 0;
+
+protected:
+  SelectionDAG &DAG;
+  MachineFunction &MF;
+
+  Register getCurReg() const { return CurReg; }
+  unsigned getCurLane() const { return CurLane; }
+
+  InregVGPRSpiller(SelectionDAG &DAG, MachineFunction &MF, CCState &State)
+      : State(State),
+        WaveFrontSize(MF.getSubtarget<GCNSubtarget>().getWavefrontSize()),
+        DAG(DAG), MF(MF) {}
+
+  void setReg(Register &Reg) {
+    if (CurReg.isValid()) {
+      State.DeallocateReg(Reg);
+      Reg = CurReg;
+    } else {
+      CurReg = Reg;
+    }
+  }
+
+  void forward() {
+    // All lanes of the current VGPR have been used, so reset the lane index
+    // and pick up a fresh VGPR on the next access.
+    if (++CurLane == WaveFrontSize) {
+      CurLane = 0;
+      CurReg = Register();
+    }
+  }
+};
+
+class InregVGPRSpillerCallee final : private InregVGPRSpiller {
+public:
+  InregVGPRSpillerCallee(SelectionDAG &DAG, MachineFunction &MF, CCState &State)
+      : InregVGPRSpiller(DAG, MF, State) {}
+
+  SDValue read(SDValue Chain, const SDLoc &SL, Register &Reg, EVT VT) {
+    setReg(Reg);
+
+    MF.addLiveIn(getCurReg(), &AMDGPU::VGPR_32RegClass);
+
+    // TODO: Do we need the chain here?
+    SmallVector<SDValue, 4> Operands{
+        DAG.getTargetConstant(Intrinsic::amdgcn_readlane, SL, MVT::i32),
+        DAG.getRegister(getCurReg(), VT),
+        DAG.getTargetConstant(getCurLane(), SL, MVT::i32)};
+    SDValue Res = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, SL, VT, Operands);
+
+    forward();
+
+    return Res;
+  }
+};
+
+class InregVGPRSpillerCallSite final : private InregVGPRSpiller {
+public:
+  InregVGPRSpillerCallSite(SelectionDAG &DAG, MachineFunction &MF,
+                           CCState &State)
+      : InregVGPRSpiller(DAG, MF, State) {}
+
+  SDValue write(const SDLoc &SL, Register &Reg, SDValue V, EVT VT) {
+    setReg(Reg);
+
+    SmallVector<SDValue, 4> Operands{
+        DAG.getTargetConstant(Intrinsic::amdgcn_writelane, SL, MVT::i32),
+        DAG.getRegister(getCurReg(), VT), V,
+        DAG.getTargetConstant(getCurLane(), SL, MVT::i32)};
+    SDValue Res = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, SL, VT, Operands);
+
+    forward();
+
+    return Res;
+  }
+};
+
 SDValue SITargetLowering::LowerFormalArguments(
     SDValue Chain, CallingConv::ID CallConv, bool isVarArg,
     const SmallVectorImpl<ISD::InputArg> &Ins, const SDLoc &DL,
@@ -2963,6 +3043,7 @@ SDValue SITargetLowering::LowerFormalArguments(
   // FIXME: Alignment of explicit arguments totally broken with non-0 explicit
   // kern arg offset.
   const Align KernelArgBaseAlign = Align(16);
+  InregVPGRSpillerCallee Spiller(DAG, MF, CCInfo);
 
   for (unsigned i = 0, e = Ins.size(), ArgIdx = 0; i != e; ++i) {
     const ISD::InputArg &Arg = Ins[i];
@@ -3130,8 +3211,17 @@ SDValue SITargetLowering::LowerFormalArguments(
       llvm_unreachable("Unexpected register class in LowerFormalArguments!");
     EVT ValVT = VA.getValVT();
 
-    Reg = MF.addLiveIn(Reg, RC);
-    SDValue Val = DAG.getCopyFromReg(Chain, DL, Reg, VT);
+    SDValue Val;
+    // If an argument is marked inreg but gets pushed to a VGPR, it indicates
+    // we've run out of SGPRs for argument passing. In such cases, we'd prefer
+    // to start packing inreg arguments into individual lanes of VGPRs, rather
+    // than placing them directly into VGPRs.
+    if (RC == &AMDGPU::VGPR_32RegClass && Arg.Flags.isInReg()) {
+      Val = Spiller.read(Chain, DL, Reg, VT);
+    } else {
+      Reg = MF.addLiveIn(Reg, RC);
+      Val = DAG.getCopyFromReg(Chain, DL, Reg, VT);
+    }
 
     if (Arg.Flags.isSRet()) {
       // The return object should be reasonably addressable.
@@ -3875,6 +3965,8 @@ SDValue SITargetLowering::LowerCall(CallLoweringInfo &CLI,
 
   MVT PtrVT = MVT::i32;
 
+  InregVGPRSpillerCallSite Spiller(DAG, MF, CCInfo);
+
   // Walk the register/memloc assignments, inserting copies/loads.
   for (unsigned i = 0, e = ArgLocs.size(); i != e; ++i) {
     CCValAssign &VA = ArgLocs[i];
@@ -3904,6 +3996,9 @@ SDValue SITargetLowering::LowerCall(CallLoweringInfo &CLI,
     }
 
     if (VA.isRegLoc()) {
+      Register Reg = VA.getLocReg();
+      if (Outs[i].Flags.isInReg() && AMDGPU::VGPR_32RegClass.contains(Reg))
+        Arg = Spiller.write(DL, Reg, Arg, VA.getLocVT());
       RegsToPass.push_back(std::pair(VA.getLocReg(), Arg));
     } else {
       assert(VA.isMemLoc());
diff --git a/llvm/test/CodeGen/AMDGPU/inreg-vgpr-spill.ll b/llvm/test/CodeGen/AMDGPU/inreg-vgpr-spill.ll
new file mode 100644
index 0000000000000..3ec4c86fa87ad
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/inreg-vgpr-spill.ll
@@ -0,0 +1,28 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx950 -o - %s | FileCheck %s
+
+; arg3 is in v0 and arg4 is in v1. These should be packed into lanes of a
+; single VGPR and extracted with v_readlane.
+define i32 @callee(<8 x i32> inreg %arg0, <8 x i32> inreg %arg1, <2 x i32> inreg %arg2, i32 inreg %arg3, i32 inreg %arg4) {
+; CHECK-LABEL: callee:
+; CHECK:       ; %bb.0:
+; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CHECK-NEXT:    v_readlane_b32 s0, v0, 1
+; CHECK-NEXT:    v_readlane_b32 s1, v0, 0
+; CHECK-NEXT:    s_add_i32 s1, s1, s0
+; CHECK-NEXT:    s_nop 0
+; CHECK-NEXT:    v_mov_b32_e32 v0, s1
+; CHECK-NEXT:    s_setpc_b64 s[30:31]
+  %add = add i32 %arg3, %arg4
+  ret i32 %add
+}
+
+define amdgpu_kernel void @kernel(ptr %p0, ptr %p1, ptr %p2, ptr %p3, ptr %p4, ptr %p) {
+  %arg0 = load <8 x i32>, ptr %p0
+  %arg1 = load <8 x i32>, ptr %p1
+  %arg2 = load <2 x i32>, ptr %p2
+  %arg3 = load i32, ptr %p3
+  %arg4 = load i32, ptr %p4
+  %ret = call i32 @callee(<8 x i32> %arg0, <8 x i32> %arg1, <2 x i32> %arg2, i32 %arg3, i32 %arg4)
+  store i32 %ret, ptr %p
+  ret void
+}

@shiltian (Contributor, Author)

I'm still working on this, but I'd like to get some early feedback. The newly added test case currently fails the machine verifier with the following error:

error: Illegal instruction detected: Illegal immediate value for operand.
renamable $vgpr20 = V_WRITELANE_B32 killed $sgpr0, killed $sgpr1, 0
error: Illegal instruction detected: Illegal immediate value for operand.
renamable $vgpr1 = V_WRITELANE_B32 killed $sgpr0, killed $sgpr1, 1

Looks like I can't just use an immediate operand for the lane selection. I'll fix that in a later update.

@shiltian force-pushed the users/shiltian/inreg-vgpr-improvement branch 2 times, most recently from d823879 to ff4fc41 on April 3, 2025 at 05:14

github-actions bot commented Apr 3, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions h,cpp -- llvm/lib/Target/AMDGPU/SIISelLowering.cpp llvm/lib/Target/AMDGPU/SIISelLowering.h
View the diff from clang-format here.
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index a12ff8d2e..9d1e2a0e7 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -4114,7 +4114,7 @@ SDValue SITargetLowering::LowerCall(CallLoweringInfo &CLI,
         Outs[ArgIdx - NumSpecialInputs].Flags.isInReg() &&
         AMDGPU::VGPR_32RegClass.contains(Reg)) {
       Spiller.writeLane(DL, Reg, Val,
-                    ArgLocs[ArgIdx - NumSpecialInputs].getLocVT());
+                        ArgLocs[ArgIdx - NumSpecialInputs].getLocVT());
     } else {
       Chain = DAG.getCopyToReg(Chain, DL, Reg, Val, InGlue);
       InGlue = Chain.getValue(1);
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h
index 0990d8187..07348611d 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -402,13 +402,11 @@ public:
                       const SmallVectorImpl<SDValue> &OutVals, const SDLoc &DL,
                       SelectionDAG &DAG) const override;
 
-  void passSpecialInputs(
-    CallLoweringInfo &CLI,
-    CCState &CCInfo,
-    const SIMachineFunctionInfo &Info,
-    SmallVectorImpl<std::pair<Register, SDValue>> &RegsToPass,
-    SmallVectorImpl<SDValue> &MemOpChains,
-    SDValue Chain) const;
+  void
+  passSpecialInputs(CallLoweringInfo &CLI, CCState &CCInfo,
+                    const SIMachineFunctionInfo &Info,
+                    SmallVectorImpl<std::pair<Register, SDValue>> &RegsToPass,
+                    SmallVectorImpl<SDValue> &MemOpChains, SDValue Chain) const;
 
   SDValue LowerCallResult(SDValue Chain, SDValue InGlue,
                           CallingConv::ID CallConv, bool isVarArg,

@shiltian force-pushed the users/shiltian/inreg-vgpr-improvement branch 2 times, most recently from 1a27c5f to bf0af25 on April 3, 2025 at 21:57
}

define i32 @tail_caller(<8 x i32> inreg %arg0, <8 x i32> inreg %arg1, <2 x i32> inreg %arg2, i32 inreg %arg3, i32 inreg %arg4) {
%ret = tail call i32 @callee(<8 x i32> %arg0, <8 x i32> %arg1, <2 x i32> %arg2, i32 %arg3, i32 %arg4)
@shiltian (Contributor, Author)

For some reason, it emits the error `error: <unknown>:0:0: ran out of registers during register allocation in function 'tail_caller'` for this tail-call version.

; CHECK-NEXT: s_addc_u32 s9, s5, 0
; CHECK-NEXT: v_mov_b32_e32 v1, v0
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: v_writelane_b32 v1, s30, 0
@shiltian (Contributor, Author)

There is a mismatch here: it writes to v1 at the call site, but reads v0 in the callee.

@shiltian force-pushed the users/shiltian/inreg-vgpr-improvement branch from bf0af25 to bd81ac3 on April 4, 2025 at 04:04
DAG.getTargetConstant(Intrinsic::amdgcn_writelane, SL, MVT::i32), Val,
DAG.getTargetConstant(CurLane++, SL, MVT::i32)};
if (!LastWrite) {
Register VReg = MF.getRegInfo().getLiveInVirtReg(DstReg);
@shiltian (Contributor, Author) commented Apr 4, 2025

The mismatch comes from here: the tied input operand of the chain of writelane intrinsics. Basically, we want the intrinsic to write directly to a specific register. The input Reg here is actually an MCRegister, but the intrinsic needs a virtual register. To achieve this, we first call getLiveInVirtReg to get the virtual register corresponding to the MCRegister, and then call getRegister to obtain an SDValue. This generates MIR similar to the following:

%0:vgpr_32(s32) = COPY $vgpr0
...
%33:vgpr_32 = V_WRITELANE_B32 killed %28:sreg_32, 0, %0:vgpr_32(tied-def 0)(s32)
%34:vgpr_32 = V_WRITELANE_B32 killed %29:sreg_32, 1, %33:vgpr_32(tied-def 0)

Later, this sequence is lowered to something like:

%34:vgpr_32 = COPY %0:vgpr_32(s32)
%34:vgpr_32 = V_WRITELANE_B32 $sgpr30, 0, %34:vgpr_32(tied-def 0)
%34:vgpr_32 = V_WRITELANE_B32 $sgpr31, 1, %34:vgpr_32(tied-def 0)

And after register allocation, it becomes:

renamable $vgpr1 = COPY renamable $vgpr0
renamable $vgpr1 = V_WRITELANE_B32 $sgpr30, 0, killed $vgpr1(tied-def 0)
renamable $vgpr1 = V_WRITELANE_B32 $sgpr31, 1, killed $vgpr1(tied-def 0)

As we can see, the intended behavior is writing directly into $vgpr0. However, due to the COPY, the intrinsic instead writes to $vgpr1.

Ideally we wouldn't use a virtual register as the intrinsic's operand directly, but using the physical register instead crashes the two-address instruction pass later, which expects its operands to be virtual registers.

Successfully merging this pull request may close these issues.

[AMDGPU] Illegal VGPR to SGPR Copy When Argument Passing Has SGPR to VGPR Spill
3 participants