-
Notifications
You must be signed in to change notification settings - Fork 13.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AMDGPU] Skip register uses in AMDGPUResourceUsageAnalysis #133242
base: main
Are you sure you want to change the base?
Conversation
When using the amdgcn.init.whole.wave intrinsic, we add dummy VGPR arguments with the purpose of preserving their inactive lanes. The pattern may look something like this: ``` entry: call amdgcn.init.whole.wave brand to shader or tail shader: $vInactive = IMPLICIT_DEF ; Tells regalloc it's safe to use the active lanes actual code... tail: call amdgcn.cs.chain [...], implicit $vInactive ``` We should not report these VGPRs in the .vgpr_count metadata. This patch achieves that goal by ignoring IMPLICIT_DEFs and SI_TCRETURNs in functions that use the amdgcn.init.whole.wave intrinsic. All other VGPRs are counted as usual.
@llvm/pr-subscribers-backend-amdgpu Author: Diana Picus (rovka) ChangesWhen using the amdgcn.init.whole.wave intrinsic, we add dummy VGPR arguments with the purpose of preserving their inactive lanes. The pattern may look something like this:
We should not report these VGPRs in the .vgpr_count metadata. This patch achieves that goal by ignoring IMPLICIT_DEFs and SI_TCRETURNs in functions that use the amdgcn.init.whole.wave intrinsic. All other VGPRs are counted as usual. Patch is 24.60 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/133242.diff 6 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
index 800e2b9c0e657..7769bc5d74ebd 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
@@ -990,7 +990,10 @@ void AMDGPUAsmPrinter::getSIProgramInfo(SIProgramInfo &ProgInfo,
// dispatch registers are function args.
unsigned WaveDispatchNumSGPR = 0, WaveDispatchNumVGPR = 0;
- if (isShader(F.getCallingConv())) {
+ // Shaders that use the init.whole.wave intrinsic sometimes have VGPR
+ // arguments that are only added for the purpose of preserving their inactive
+ // lanes. Skip including them in the VGPR count.
+ if (isShader(F.getCallingConv()) && !MFI->hasInitWholeWave()) {
bool IsPixelShader =
F.getCallingConv() == CallingConv::AMDGPU_PS && !STM.isAmdHsaOS();
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUResourceUsageAnalysis.cpp b/llvm/lib/Target/AMDGPU/AMDGPUResourceUsageAnalysis.cpp
index 9a609a1752de0..05d1aa38d4a25 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUResourceUsageAnalysis.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUResourceUsageAnalysis.cpp
@@ -156,8 +156,14 @@ AMDGPUResourceUsageAnalysis::analyzeResourceUsage(
int32_t MaxSGPR = -1;
Info.CalleeSegmentSize = 0;
+ bool IsIWWFunction = MFI->hasInitWholeWave();
+
for (const MachineBasicBlock &MBB : MF) {
for (const MachineInstr &MI : MBB) {
+ // At this point, the chain call pseudos are already expanded.
+ bool IsChainCall = MI.getOpcode() == AMDGPU::SI_TCRETURN;
+ bool IsImplicitDef = MI.isImplicitDef();
+
// TODO: Check regmasks? Do they occur anywhere except calls?
for (const MachineOperand &MO : MI.operands()) {
unsigned Width = 0;
@@ -239,6 +245,16 @@ AMDGPUResourceUsageAnalysis::analyzeResourceUsage(
break;
}
+ // For functions that use the llvm.amdgcn.init.whole.wave intrinsic, we
+ // often add artificial VGPR arguments for the purpose of preserving
+ // their inactive lanes. These should not be reported as part of our
+ // VGPR usage. We can identify them easily because they're only used in
+ // the chain call, and possibly in an IMPLICIT_DEF coming from an
+ // llvm.amdgcn.dead intrinsic.
+ if (IsIWWFunction && (IsChainCall || IsImplicitDef) &&
+ TRI.isVectorRegister(MRI, Reg))
+ continue;
+
if (AMDGPU::SGPR_32RegClass.contains(Reg) ||
AMDGPU::SGPR_LO16RegClass.contains(Reg) ||
AMDGPU::SGPR_HI16RegClass.contains(Reg)) {
diff --git a/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-large.ll b/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-large.ll
new file mode 100644
index 0000000000000..e47f5e25ead3a
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-large.ll
@@ -0,0 +1,76 @@
+; RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx1200 < %s | FileCheck %s
+
+; CHECK-LABEL: .shader_functions:
+
+; Use VGPRs above the input arguments.
+; CHECK-LABEL: _miss_1:
+; CHECK: .vgpr_count:{{.*}}0x1d{{$}}
+
+define amdgpu_cs_chain void @_miss_1(ptr inreg %next.callee, i32 inreg %global.table, i32 inreg %max.outgoing.vgpr.count,
+ i32 %vcr, { i32 } %system.data,
+ i32 %inactive.vgpr, i32 %inactive.vgpr1, i32 %inactive.vgpr2, i32 %inactive.vgpr3,
+ i32 %inactive.vgpr4, i32 %inactive.vgpr5, i32 %inactive.vgpr6, i32 %inactive.vgpr7,
+ i32 %inactive.vgpr8, i32 %inactive.vgpr9)
+ local_unnamed_addr {
+entry:
+ %system.data.value = extractvalue { i32 } %system.data, 0
+ %dead.val = call i32 @llvm.amdgcn.dead.i32()
+ %is.whole.wave = call i1 @llvm.amdgcn.init.whole.wave()
+ br i1 %is.whole.wave, label %shader, label %tail
+
+shader:
+ %system.data.extract = extractvalue { i32 } %system.data, 0
+ %data.mul = mul i32 %system.data.extract, 2
+ %data.add = add i32 %data.mul, 1
+ call void asm sideeffect "; clobber v28", "~{v28}"()
+ br label %tail
+
+tail:
+ %final.vcr = phi i32 [ %vcr, %entry ], [ %data.mul, %shader ]
+ %final.sys.data = phi i32 [ %system.data.value, %entry ], [ %data.add, %shader ]
+ %final.inactive0 = phi i32 [ %inactive.vgpr, %entry ], [ %dead.val, %shader ]
+ %final.inactive1 = phi i32 [ %inactive.vgpr1, %entry ], [ %dead.val, %shader ]
+ %final.inactive2 = phi i32 [ %inactive.vgpr2, %entry ], [ %dead.val, %shader ]
+ %final.inactive3 = phi i32 [ %inactive.vgpr3, %entry ], [ %dead.val, %shader ]
+ %final.inactive4 = phi i32 [ %inactive.vgpr4, %entry ], [ %dead.val, %shader ]
+ %final.inactive5 = phi i32 [ %inactive.vgpr5, %entry ], [ %dead.val, %shader ]
+ %final.inactive6 = phi i32 [ %inactive.vgpr6, %entry ], [ %dead.val, %shader ]
+ %final.inactive7 = phi i32 [ %inactive.vgpr7, %entry ], [ %dead.val, %shader ]
+ %final.inactive8 = phi i32 [ %inactive.vgpr8, %entry ], [ %dead.val, %shader ]
+ %final.inactive9 = phi i32 [ %inactive.vgpr9, %entry ], [ %dead.val, %shader ]
+
+ %struct.init = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } poison, i32 %final.vcr, 0
+ %struct.with.data = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.init, i32 %final.sys.data, 1
+ %struct.with.inactive0 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.data, i32 %final.inactive0, 2
+ %struct.with.inactive1 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive0, i32 %final.inactive1, 3
+ %struct.with.inactive2 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive1, i32 %final.inactive2, 4
+ %struct.with.inactive3 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive2, i32 %final.inactive3, 5
+ %struct.with.inactive4 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive3, i32 %final.inactive4, 6
+ %struct.with.inactive5 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive4, i32 %final.inactive5, 7
+ %struct.with.inactive6 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive5, i32 %final.inactive6, 8
+ %struct.with.inactive7 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive6, i32 %final.inactive7, 9
+ %struct.with.inactive8 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive7, i32 %final.inactive8, 10
+ %final.struct = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive8, i32 %final.inactive9, 11
+
+ %vec.global = insertelement <4 x i32> poison, i32 %global.table, i64 0
+ %vec.max.vgpr = insertelement <4 x i32> %vec.global, i32 %max.outgoing.vgpr.count, i64 1
+ %vec.sys.data = insertelement <4 x i32> %vec.max.vgpr, i32 %final.sys.data, i64 2
+ %final.vec = insertelement <4 x i32> %vec.sys.data, i32 0, i64 3
+
+ call void (ptr, i32, <4 x i32>, { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }, i32, ...)
+ @llvm.amdgcn.cs.chain.p0.i32.v4i32.sl_i32i32i32i32i32i32i32i32i32i32i32i32s(
+ ptr %next.callee, i32 0, <4 x i32> inreg %final.vec,
+ { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %final.struct,
+ i32 1, i32 %max.outgoing.vgpr.count, i32 -1, ptr @retry_vgpr_alloc.v4i32)
+ unreachable
+}
+
+declare i32 @llvm.amdgcn.dead.i32()
+declare i1 @llvm.amdgcn.init.whole.wave()
+declare void @llvm.amdgcn.cs.chain.p0.i32.v4i32.sl_i32i32i32i32i32i32i32i32i32i32i32i32s(ptr, i32, <4 x i32>, { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }, i32 immarg, ...)
+
+declare amdgpu_cs_chain void @retry_vgpr_alloc.v4i32(<4 x i32> inreg)
+
+!amdgpu.pal.metadata.msgpack = !{!0}
+
+!0 = !{!"\82\B0amdpal.pipelines\91\8B\A4.api\A6Vulkan\B2.compute_registers\85\AB.tg_size_en\C3\AA.tgid_x_en\C3\AA.tgid_y_en\C3\AA.tgid_z_en\C3\AF.tidig_comp_cnt\00\B0.hardware_stages\81\A3.cs\8D\AF.checksum_value\00\AB.debug_mode\00\AB.float_mode\CC\C0\A9.image_op\C2\AC.mem_ordered\C3\AB.sgpr_limitj\B7.threadgroup_dimensions\93 \01\01\AD.trap_present\00\B2.user_data_reg_map\90\AB.user_sgprs\10\AB.vgpr_limit\CD\01\00\AF.wavefront_size \AF.wg_round_robin\C2\B7.internal_pipeline_hash\92\CF|{2&\DCC\85M\CFep\8A\EDR\DE\D6\E1\B1.shader_functions\81\A7_miss_1\82\B4.frontend_stack_size\00\B4.outgoing_vgpr_countP\A8.shaders\81\A8.compute\82\B0.api_shader_hash\92\00\00\B1.hardware_mapping\91\A3.cs\B0.spill_threshold\CD\FF\FF\A5.type\A2Cs\B0.user_data_limit\01\A9.uses_cps\C3\AF.xgl_cache_info\82\B3.128_bit_cache_hash\92\CF\B4\AF\9D\0B\07\88\03\02\CF\01o\C9\CAf?)\DA\AD.llpc_version\A476.0\AEamdpal.version\92\03\00"}
\ No newline at end of file
diff --git a/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-leaf.ll b/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-leaf.ll
new file mode 100644
index 0000000000000..5d7472fd3c56e
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-leaf.ll
@@ -0,0 +1,50 @@
+; RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx1200 < %s | FileCheck %s
+
+; CHECK-LABEL: .shader_functions:
+
+; Make sure that .vgpr_count doesn't include the %inactive.vgpr registers.
+; CHECK-LABEL: leaf_shader:
+; CHECK: .vgpr_count:{{.*}}0xc{{$}}
+
+; Function without calls.
+define amdgpu_cs_chain void @_leaf_shader(ptr %output.ptr, i32 inreg %input.value,
+ i32 %active.vgpr1, i32 %active.vgpr2,
+ i32 %inactive.vgpr1, i32 %inactive.vgpr2, i32 %inactive.vgpr3,
+ i32 %inactive.vgpr4, i32 %inactive.vgpr5, i32 %inactive.vgpr6)
+ local_unnamed_addr {
+entry:
+ %dead.val = call i32 @llvm.amdgcn.dead.i32()
+ %is.whole.wave = call i1 @llvm.amdgcn.init.whole.wave()
+ br i1 %is.whole.wave, label %compute, label %merge
+
+compute:
+ ; Perform a more complex computation using active VGPRs
+ %square = mul i32 %active.vgpr1, %active.vgpr1
+ %product = mul i32 %square, %active.vgpr2
+ %sum = add i32 %product, %input.value
+ %result = add i32 %sum, 42
+ br label %merge
+
+merge:
+ %final.result = phi i32 [ 0, %entry ], [ %result, %compute ]
+ %final.inactive1 = phi i32 [ %inactive.vgpr1, %entry ], [ %dead.val, %compute ]
+ %final.inactive2 = phi i32 [ %inactive.vgpr2, %entry ], [ %dead.val, %compute ]
+ %final.inactive3 = phi i32 [ %inactive.vgpr3, %entry ], [ %dead.val, %compute ]
+ %final.inactive4 = phi i32 [ %inactive.vgpr4, %entry ], [ %dead.val, %compute ]
+ %final.inactive5 = phi i32 [ %inactive.vgpr5, %entry ], [ %dead.val, %compute ]
+ %final.inactive6 = phi i32 [ %inactive.vgpr6, %entry ], [ %dead.val, %compute ]
+
+ store i32 %final.result, ptr %output.ptr, align 4
+
+ ret void
+}
+
+declare i32 @llvm.amdgcn.dead.i32()
+declare i1 @llvm.amdgcn.init.whole.wave()
+declare void @llvm.amdgcn.cs.chain.p0.i32.v4i32.sl_i32i32i32i32i32i32i32i32i32i32i32i32s(ptr, i32, <4 x i32>, { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }, i32 immarg, ...)
+
+declare amdgpu_cs_chain void @retry_vgpr_alloc.v4i32(<4 x i32> inreg)
+
+!amdgpu.pal.metadata.msgpack = !{!0}
+
+!0 = !{!"\82\B0amdpal.pipelines\91\8B\A4.api\A6Vulkan\B2.compute_registers\85\AB.tg_size_en\C3\AA.tgid_x_en\C3\AA.tgid_y_en\C3\AA.tgid_z_en\C3\AF.tidig_comp_cnt\00\B0.hardware_stages\81\A3.cs\8D\AF.checksum_value\00\AB.debug_mode\00\AB.float_mode\CC\C0\A9.image_op\C2\AC.mem_ordered\C3\AB.sgpr_limitj\B7.threadgroup_dimensions\93 \01\01\AD.trap_present\00\B2.user_data_reg_map\90\AB.user_sgprs\10\AB.vgpr_limit\CD\01\00\AF.wavefront_size \AF.wg_round_robin\C2\B7.internal_pipeline_hash\92\CF|{2&\DCC\85M\CFep\8A\EDR\DE\D6\E1\B1.shader_functions\81\A7_miss_1\82\B4.frontend_stack_size\00\B4.outgoing_vgpr_countP\A8.shaders\81\A8.compute\82\B0.api_shader_hash\92\00\00\B1.hardware_mapping\91\A3.cs\B0.spill_threshold\CD\FF\FF\A5.type\A2Cs\B0.user_data_limit\01\A9.uses_cps\C3\AF.xgl_cache_info\82\B3.128_bit_cache_hash\92\CF\B4\AF\9D\0B\07\88\03\02\CF\01o\C9\CAf?)\DA\AD.llpc_version\A476.0\AEamdpal.version\92\03\00"}
diff --git a/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-use-inactive.ll b/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-use-inactive.ll
new file mode 100644
index 0000000000000..0c699a07cb3fd
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-use-inactive.ll
@@ -0,0 +1,78 @@
+; RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx1200 < %s | FileCheck %s
+
+; CHECK-LABEL: .shader_functions:
+
+; Make sure that .vgpr_count doesn't include the %inactive.vgpr registers.
+; The shader is free to use any of the VGPRs mapped to a %inactive.vpgr as long as it only touches its active lanes.
+; In that case, the VGPR should be included in the .vgpr_count
+; CHECK-LABEL: _miss_1:
+; CHECK: .vgpr_count:{{.*}}0xd{{$}}
+
+define amdgpu_cs_chain void @_miss_1(ptr inreg %next.callee, i32 inreg %global.table, i32 inreg %max.outgoing.vgpr.count,
+ i32 %vcr, { i32 } %system.data,
+ i32 %inactive.vgpr, i32 %inactive.vgpr1, i32 %inactive.vgpr2, i32 %inactive.vgpr3,
+ i32 %inactive.vgpr4, i32 %inactive.vgpr5, i32 %inactive.vgpr6, i32 %inactive.vgpr7,
+ i32 %inactive.vgpr8, i32 %inactive.vgpr9)
+ local_unnamed_addr {
+entry:
+ %system.data.value = extractvalue { i32 } %system.data, 0
+ %dead.val = call i32 @llvm.amdgcn.dead.i32()
+ %is.whole.wave = call i1 @llvm.amdgcn.init.whole.wave()
+ br i1 %is.whole.wave, label %shader, label %tail
+
+shader:
+ %system.data.extract = extractvalue { i32 } %system.data, 0
+ %data.mul = mul i32 %system.data.extract, 2
+ %data.add = add i32 %data.mul, 1
+ call void asm sideeffect "; use VGPR for %inactive.vgpr2", "~{v12}"()
+ br label %tail
+
+tail:
+ %final.vcr = phi i32 [ %vcr, %entry ], [ %data.mul, %shader ]
+ %final.sys.data = phi i32 [ %system.data.value, %entry ], [ %data.add, %shader ]
+ %final.inactive0 = phi i32 [ %inactive.vgpr, %entry ], [ %dead.val, %shader ]
+ %final.inactive1 = phi i32 [ %inactive.vgpr1, %entry ], [ %dead.val, %shader ]
+ %final.inactive2 = phi i32 [ %inactive.vgpr2, %entry ], [ %dead.val, %shader ]
+ %final.inactive3 = phi i32 [ %inactive.vgpr3, %entry ], [ %dead.val, %shader ]
+ %final.inactive4 = phi i32 [ %inactive.vgpr4, %entry ], [ %dead.val, %shader ]
+ %final.inactive5 = phi i32 [ %inactive.vgpr5, %entry ], [ %dead.val, %shader ]
+ %final.inactive6 = phi i32 [ %inactive.vgpr6, %entry ], [ %dead.val, %shader ]
+ %final.inactive7 = phi i32 [ %inactive.vgpr7, %entry ], [ %dead.val, %shader ]
+ %final.inactive8 = phi i32 [ %inactive.vgpr8, %entry ], [ %dead.val, %shader ]
+ %final.inactive9 = phi i32 [ %inactive.vgpr9, %entry ], [ %dead.val, %shader ]
+
+ %struct.init = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } poison, i32 %final.vcr, 0
+ %struct.with.data = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.init, i32 %final.sys.data, 1
+ %struct.with.inactive0 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.data, i32 %final.inactive0, 2
+ %struct.with.inactive1 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive0, i32 %final.inactive1, 3
+ %struct.with.inactive2 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive1, i32 %final.inactive2, 4
+ %struct.with.inactive3 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive2, i32 %final.inactive3, 5
+ %struct.with.inactive4 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive3, i32 %final.inactive4, 6
+ %struct.with.inactive5 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive4, i32 %final.inactive5, 7
+ %struct.with.inactive6 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive5, i32 %final.inactive6, 8
+ %struct.with.inactive7 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive6, i32 %final.inactive7, 9
+ %struct.with.inactive8 = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive7, i32 %final.inactive8, 10
+ %final.struct = insertvalue { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %struct.with.inactive8, i32 %final.inactive9, 11
+
+ %vec.global = insertelement <4 x i32> poison, i32 %global.table, i64 0
+ %vec.max.vgpr = insertelement <4 x i32> %vec.global, i32 %max.outgoing.vgpr.count, i64 1
+ %vec.sys.data = insertelement <4 x i32> %vec.max.vgpr, i32 %final.sys.data, i64 2
+ %final.vec = insertelement <4 x i32> %vec.sys.data, i32 0, i64 3
+
+ call void (ptr, i32, <4 x i32>, { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }, i32, ...)
+ @llvm.amdgcn.cs.chain.p0.i32.v4i32.sl_i32i32i32i32i32i32i32i32i32i32i32i32s(
+ ptr %next.callee, i32 0, <4 x i32> inreg %final.vec,
+ { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 } %final.struct,
+ i32 1, i32 %max.outgoing.vgpr.count, i32 -1, ptr @retry_vgpr_alloc.v4i32)
+ unreachable
+}
+
+declare i32 @llvm.amdgcn.dead.i32()
+declare i1 @llvm.amdgcn.init.whole.wave()
+declare void @llvm.amdgcn.cs.chain.p0.i32.v4i32.sl_i32i32i32i32i32i32i32i32i32i32i32i32s(ptr, i32, <4 x i32>, { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }, i32 immarg, ...)
+
+declare amdgpu_cs_chain void @retry_vgpr_alloc.v4i32(<4 x i32> inreg)
+
+!amdgpu.pal.metadata.msgpack = !{!0}
+
+!0 = !{!"\82\B0amdpal.pipelines\91\8B\A4.api\A6Vulkan\B2.compute_registers\85\AB.tg_size_en\C3\AA.tgid_x_en\C3\AA.tgid_y_en\C3\AA.tgid_z_en\C3\AF.tidig_comp_cnt\00\B0.hardware_stages\81\A3.cs\8D\AF.checksum_value\00\AB.debug_mode\00\AB.float_mode\CC\C0\A9.image_op\C2\AC.mem_ordered\C3\AB.sgpr_limitj\B7.threadgroup_dimensions\93 \01\01\AD.trap_present\00\B2.user_data_reg_map\90\AB.user_sgprs\10\AB.vgpr_limit\CD\01\00\AF.wavefront_size \AF.wg_round_robin\C2\B7.internal_pipeline_hash\92\CF|{2&\DCC\85M\CFep\8A\EDR\DE\D6\E1\B1.shader_functions\81\A7_miss_1\82\B4.frontend_stack_size\00\B4.outgoing_vgpr_countP\A8.shaders\81\A8.compute\82\B0.api_shader_hash\92\00\00\B1.hardware_mapping\91\A3.cs\B0.spill_threshold\CD\FF\FF\A5.type\A2Cs\B0.user_data_limit\01\A9.uses_cps\C3\AF.xgl_cache_info\82\B3.128_bit_cache_hash\92\CF\B4\AF\9D\0B\07\88\03\02\CF\01o\C9\CAf?)\DA\AD.llpc_version\A476.0\AEamdpal.version\92\03\00"}
diff --git a/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count.ll b/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count.ll
new file mode 100644
index 0000000000000..b9130dd1b7ed4
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count.ll
@@ -0,0 +1,75 @@
+; RUN: llc -mtriple=amdgcn--amdpal -mcpu=gfx1200 < %s | FileCheck %s
+
+; CHECK-LABEL: .shader_functions:
+
+; Make sure that .vgpr_count doesn't include the %inactive.vgpr registers.
+; CHECK-LABEL: _miss_1:
+; CHECK: .vgpr_count:{{.*}}0xa{{$}}
+
+define amdgpu_cs_chain void @_miss_1(ptr inreg %next.callee, i32 inreg %global.table, i32 inreg %max.outgoing.vgpr.count,
+ i32 %vcr, { i32 } %system.data,
+ i32 %inactive.vgpr, i32 %inactive.vgpr1, i32 %inactive.vgpr2, i32 %inactive.vgpr3,
+ i32 %inactive.vgpr4, i32 %inactive.vgpr5, i32 %inactive.vgpr6, i32 %inactive.vgpr7,
+ i32 %inactive.vgpr8, i32 %inactive.vgpr9)
+ local_unnamed_addr {
+entry:
+ %system.data.value = extractvalue { i32 } %system.data, 0
+ %dead.val = call i32 @llvm.amdgcn.dead.i32()
+ %is.whole.wave = call i1 @llvm.amdgcn.init.whole.wave()
+ br i1 %is.whole.wave, label %shader, label %tail
+
+shader:
+ %system.data.extract = extractvalue { i32 } %system.data, 0
+ %data.mul = mul i32 %system.data.extract, 2
+ %data.add = add i32 %data.mul, 1
+ br label %tail
+
+tail:
+ %final.vcr = phi i32 [ %vcr, %entry ], [ %data.mul, %shader ]
+ %final.sys.data = phi i...
[truncated]
|
This could be much more generic. AMDGPUResourceUsageAnalysis could unconditionally ignore IMPLICIT_DEF and all use operands. |
if (IsIWWFunction && (IsChainCall || IsImplicitDef) && | ||
TRI.isVectorRegister(MRI, Reg)) | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't matter if it's a vector register. I thought we already had a helper somewhere to skip synthetic uses
I don't think it's safe to ignore all implicit def operands (but you can ignore the def operand of IMPLICIT_DEF). Do you need this to get the optimization you're hoping for? Is ignoring all use operands not enough? |
// For functions that use the llvm.amdgcn.init.whole.wave intrinsic, we | ||
// often add artificial VGPR arguments for the purpose of preserving | ||
// their inactive lanes. These should not be reported as part of our | ||
// VGPR usage. We can identify them easily because they're only used in | ||
// the chain call, and possibly in an IMPLICIT_DEF coming from an | ||
// llvm.amdgcn.dead intrinsic. | ||
if (IsIWWFunction && IsChainCall && TRI.isVectorRegister(MRI, Reg)) | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extend to any implicit argument that isn't in the MCInstrDesc, ignoring variadic instructions in the variadic section
Don't count register uses when determining the maximum number of registers used by a function. Count only the defs. This is really an underestimate of the true register usage, but in practice that's not a problem because if a function uses a register, then it has either defined it earlier, or some other function that executed before has defined it. In particular, the register counts are used: 1. When launching an entry function - in which case we're safe because the register counts of the entry function will include the register counts of all callees. 2. At function boundaries in dynamic VGPR mode. In this case it's safe because whenever we set the new VGPR allocation we take into account the outgoing_vgpr_count set by the middle-end. The main advantage of doing this is that the artificial VGPR arguments used only for preserving the inactive lanes when using the llvm.amdgcn.init.whole.wave intrinsic are no longer counted. This enables us to allocate only the registers we need in dynamic VGPR mode.
The code was only ignoring IMPLICIT_DEF instructions, not implicit def operands :) Sorry about the confusing commit message. |
Ok, I switched to that. @arsenm I think this makes your other comments obsolete. But anyway, let me know what you think of this approach :) |
llvm/test/CodeGen/AMDGPU/init-whole-wave-vgpr-count-use-inactive.ll
Outdated
Show resolved
Hide resolved
Co-authored-by: Thomas Symalla <5754458+tsymalla@users.noreply.github.com>
I created some small tests, to make sure this works as intended in all cases. Probably makes sense to add them here. PAL tests: ; RUN: llc -mcpu=gfx1200 -o - < %s | FileCheck %s
; Check that reads of a VGPR in kernels counts towards VGPR count, but in functions, only writes of VGPRs count towards VGPR count.
target triple = "amdgcn--amdpal"
@global = addrspace(1) global i32 poison, align 4
; CHECK-LABEL: amdpal.pipelines:
; Neither uses not writes a VGPR, but the hardware initializes the VGPRs that the kernel receives, so they count as used.
; CHECK-LABEL: .entry_point_symbol: kernel_use
; CHECK: .vgpr_count: 0x20
define amdgpu_cs void @kernel_use([32 x i32] %args) {
entry:
%a = extractvalue [32 x i32] %args, 14
store i32 %a, ptr addrspace(1) @global
ret void
}
; Neither uses not writes a VGPR
; CHECK-LABEL: gfx_func:
; CHECK: .vgpr_count: 0x20
define amdgpu_gfx [32 x i32] @gfx_func([32 x i32] %args) {
entry:
ret [32 x i32] %args
}
; Neither uses not writes a VGPR
; CHECK-LABEL: chain_func:
; CHECK: .vgpr_count: 0x1
define amdgpu_cs_chain void @chain_func([32 x i32] %args) {
entry:
call void (ptr, i32, {}, [32 x i32], i32, ...) @llvm.amdgcn.cs.chain.p0.i32.s.a(
ptr @chain_func, i32 0, {} inreg {}, [32 x i32] %args, i32 0)
unreachable
} The (to me) surprising one is This one works as expected: ; RUN: llc -mcpu=gfx1200 -o - < %s | FileCheck %s
target triple = "amdgcn--amdpal"
declare amdgpu_gfx void @gfx_dummy([32 x i32] %args)
; CHECK-LABEL: .entry_point_symbol: kernel_call
; CHECK: .vgpr_count: 0x20
define amdgpu_cs void @kernel_call([32 x i32] %args) {
entry:
call amdgpu_gfx void @gfx_dummy([32 x i32] %args)
ret void
} Carefully crafted compute test (the hw initializes at most one VGPR, so the test needs to ensure, no VGPR is ever written from any instruction). Also works as expected (correctly marks one VGPR as used). ; RUN: llc -mcpu=gfx1200 -o - < %s | FileCheck %s
target triple = "amdgcn-amd-amdhsa"
@global = addrspace(1) global i32 poison, align 4
; Carefully crafted kernel that uses v0 but never writes a VGPR or reads another VGPR.
; Only hardware-initialized VGPRs (v0) are read in this kernel.
; CHECK-LABEL: amdhsa.kernels:
; CHECK: .vgpr_count: 1
define amdgpu_kernel void @kernel(ptr addrspace(8) %rsrc) #0 {
entry:
%id = call i32 @llvm.amdgcn.workitem.id.x()
call void @llvm.amdgcn.raw.ptr.buffer.store.i32(i32 %id, ptr addrspace(8) %rsrc, i32 0, i32 0, i32 0)
ret void
}
attributes #0 = { "amdgpu-no-workitem-id-y" "amdgpu-no-workitem-id-z" } |
Hi Sebastian, thanks for looking into this! I don't think the existing test coverage is really that bad. The testcase that you pointed out (amgpu_gfx leaf function) is not conceptually that different from the many amdgpu_gfx testcases we have here, or even from the leaf function test I added. All of these have in common the fact that they follow the code path for leaf functions, which gets the VGPR/SGPR/AGPR usage by just checking TRI. I tried updating this code path to look only at register defs, so we're consistent with non-leaf functions, but that produces different results in a lot of tests, including for things like |
Don't count register uses when determining the maximum number of
registers used by a function. Count only the defs. This is really an
underestimate of the true register usage, but in practice that's not
a problem because if a function uses a register, then it has either
defined it earlier, or some other function that executed before has
defined it.
In particular, the register counts are used:
the register counts of the entry function will include the register
counts of all callees.
because whenever we set the new VGPR allocation we take into account
the outgoing_vgpr_count set by the middle-end.
The main advantage of doing this is that the artificial VGPR arguments
used only for preserving the inactive lanes when using the
llvm.amdgcn.init.whole.wave intrinsic are no longer counted. This
enables us to allocate only the registers we need in dynamic VGPR mode.