[AMDGPU][SIInsertWaitcnts] drop OldWaitcntInstr only when it is processed #145720
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Sameer Sahasrabuddhe (ssahasra)

Changes: When iterating over instructions, the pass keeps track of pending soft waitcnt instructions, but some visits return early and drop the recorded pointer before it is used, which produces spurious waitcnt. To fix that, the recorded pointer should be dropped iff the pending soft waitcnt instructions have been processed.

Patch is 39.95 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/145720.diff

9 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index f43831016952a..d620789806a5f 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -777,12 +777,12 @@ class SIInsertWaitcnts {
bool isVmemAccess(const MachineInstr &MI) const;
bool generateWaitcntInstBefore(MachineInstr &MI,
WaitcntBrackets &ScoreBrackets,
- MachineInstr *OldWaitcntInstr,
+ MachineInstr *&OldWaitcntInstr,
bool FlushVmCnt);
bool generateWaitcnt(AMDGPU::Waitcnt Wait,
MachineBasicBlock::instr_iterator It,
MachineBasicBlock &Block, WaitcntBrackets &ScoreBrackets,
- MachineInstr *OldWaitcntInstr);
+ MachineInstr *&OldWaitcntInstr);
void updateEventWaitcntAfter(MachineInstr &Inst,
WaitcntBrackets *ScoreBrackets);
bool isNextENDPGM(MachineBasicBlock::instr_iterator It,
@@ -1781,7 +1781,7 @@ static bool callWaitsOnFunctionReturn(const MachineInstr &MI) { return true; }
/// flush the vmcnt counter here.
bool SIInsertWaitcnts::generateWaitcntInstBefore(MachineInstr &MI,
WaitcntBrackets &ScoreBrackets,
- MachineInstr *OldWaitcntInstr,
+ MachineInstr *&OldWaitcntInstr,
bool FlushVmCnt) {
setForceEmitWaitcnt();
@@ -2047,14 +2047,16 @@ bool SIInsertWaitcnts::generateWaitcnt(AMDGPU::Waitcnt Wait,
MachineBasicBlock::instr_iterator It,
MachineBasicBlock &Block,
WaitcntBrackets &ScoreBrackets,
- MachineInstr *OldWaitcntInstr) {
+ MachineInstr *&OldWaitcntInstr) {
bool Modified = false;
- if (OldWaitcntInstr)
+ if (OldWaitcntInstr) {
// Try to merge the required wait with preexisting waitcnt instructions.
// Also erase redundant waitcnt.
Modified =
WCG->applyPreexistingWaitcnt(ScoreBrackets, *OldWaitcntInstr, Wait, It);
+ OldWaitcntInstr = nullptr;
+ }
// Any counts that could have been applied to any existing waitcnt
// instructions will have been done so, now deal with any remaining.
@@ -2224,6 +2226,7 @@ bool SIInsertWaitcnts::insertForcedWaitAfter(MachineInstr &Inst,
WaitcntBrackets &ScoreBrackets) {
AMDGPU::Waitcnt Wait;
bool NeedsEndPGMCheck = false;
+ MachineInstr *IgnoredOldWaitcnt = nullptr;
if (ST->isPreciseMemoryEnabled() && Inst.mayLoadOrStore())
Wait = WCG->getAllZeroWaitcnt(Inst.mayStore() &&
@@ -2238,7 +2241,7 @@ bool SIInsertWaitcnts::insertForcedWaitAfter(MachineInstr &Inst,
auto SuccessorIt = std::next(Inst.getIterator());
bool Result = generateWaitcnt(Wait, SuccessorIt, Block, ScoreBrackets,
- /*OldWaitcntInstr=*/nullptr);
+ IgnoredOldWaitcnt);
if (Result && NeedsEndPGMCheck && isNextENDPGM(SuccessorIt, &Block)) {
BuildMI(Block, SuccessorIt, Inst.getDebugLoc(), TII->get(AMDGPU::S_NOP))
@@ -2489,7 +2492,6 @@ bool SIInsertWaitcnts::insertWaitcntInBlock(MachineFunction &MF,
// Generate an s_waitcnt instruction to be placed before Inst, if needed.
Modified |= generateWaitcntInstBefore(Inst, ScoreBrackets, OldWaitcntInstr,
FlushVmCnt);
- OldWaitcntInstr = nullptr;
// Restore vccz if it's not known to be correct already.
bool RestoreVCCZ = !VCCZCorrect && readsVCCZ(Inst);
diff --git a/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll b/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
index 9ae7c4aaa1e95..b0439b1f7968f 100644
--- a/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
+++ b/llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll
@@ -2250,7 +2250,6 @@ define void @test_dynamic_stackalloc_device_control_flow(i32 %n, i32 %m) {
; GFX9-SDAG-NEXT: s_mov_b32 s32, s34
; GFX9-SDAG-NEXT: s_mov_b32 s34, s12
; GFX9-SDAG-NEXT: s_mov_b32 s33, s11
-; GFX9-SDAG-NEXT: s_waitcnt vmcnt(0)
; GFX9-SDAG-NEXT: s_setpc_b64 s[30:31]
;
; GFX9-GISEL-LABEL: test_dynamic_stackalloc_device_control_flow:
@@ -2317,7 +2316,6 @@ define void @test_dynamic_stackalloc_device_control_flow(i32 %n, i32 %m) {
; GFX9-GISEL-NEXT: s_mov_b32 s32, s34
; GFX9-GISEL-NEXT: s_mov_b32 s34, s12
; GFX9-GISEL-NEXT: s_mov_b32 s33, s11
-; GFX9-GISEL-NEXT: s_waitcnt vmcnt(0)
; GFX9-GISEL-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-SDAG-LABEL: test_dynamic_stackalloc_device_control_flow:
diff --git a/llvm/test/CodeGen/AMDGPU/extract-subvector-16bit.ll b/llvm/test/CodeGen/AMDGPU/extract-subvector-16bit.ll
index bb66bb319d481..a07f1d8a02941 100644
--- a/llvm/test/CodeGen/AMDGPU/extract-subvector-16bit.ll
+++ b/llvm/test/CodeGen/AMDGPU/extract-subvector-16bit.ll
@@ -731,7 +731,6 @@ define <4 x i16> @vec_16xi16_extract_4xi16(ptr addrspace(1) %p0, ptr addrspace(1
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: ; kill: killed $vgpr0 killed $vgpr1
; GFX9-NEXT: .LBB3_3: ; %exit
-; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: v_pk_ashrrev_i16 v0, 15, v5 op_sel_hi:[0,0]
; GFX9-NEXT: s_movk_i32 s4, 0x8000
; GFX9-NEXT: v_or_b32_e32 v1, 0xffff8000, v0
@@ -973,7 +972,6 @@ define <4 x i16> @vec_16xi16_extract_4xi16_2(ptr addrspace(1) %p0, ptr addrspace
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: ; kill: killed $vgpr0 killed $vgpr1
; GFX9-NEXT: .LBB4_3: ; %exit
-; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: v_pk_ashrrev_i16 v0, 15, v7 op_sel_hi:[0,1]
; GFX9-NEXT: s_movk_i32 s4, 0x8000
; GFX9-NEXT: v_or_b32_e32 v1, 0xffff8000, v0
@@ -1217,7 +1215,6 @@ define <4 x half> @vec_16xf16_extract_4xf16(ptr addrspace(1) %p0, ptr addrspace(
; GFX9-NEXT: .LBB5_3: ; %exit
; GFX9-NEXT: v_mov_b32_e32 v0, 0x3900
; GFX9-NEXT: v_mov_b32_e32 v1, 0x3d00
-; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: v_cmp_ge_f16_e32 vcc, 0.5, v4
; GFX9-NEXT: v_mov_b32_e32 v3, 0x3800
; GFX9-NEXT: v_cndmask_b32_e32 v2, v0, v1, vcc
@@ -1595,7 +1592,6 @@ define amdgpu_gfx <8 x i16> @vec_16xi16_extract_8xi16_0(i1 inreg %cond, ptr addr
; GFX9-NEXT: s_movk_i32 s34, 0x3800
; GFX9-NEXT: v_mov_b32_e32 v0, 0x3900
; GFX9-NEXT: v_mov_b32_e32 v1, 0x3d00
-; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: v_cmp_gt_u16_e32 vcc, s35, v7
; GFX9-NEXT: v_cndmask_b32_e32 v3, v0, v1, vcc
; GFX9-NEXT: v_cmp_gt_u16_sdwa vcc, v7, s34 src0_sel:WORD_1 src1_sel:DWORD
@@ -1933,7 +1929,6 @@ define amdgpu_gfx <8 x half> @vec_16xf16_extract_8xf16_0(i1 inreg %cond, ptr add
; GFX9-NEXT: v_mov_b32_e32 v0, 0x3800
; GFX9-NEXT: v_mov_b32_e32 v1, 0x3900
; GFX9-NEXT: v_mov_b32_e32 v2, 0x3d00
-; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: v_cmp_ge_f16_e32 vcc, 0.5, v7
; GFX9-NEXT: v_cndmask_b32_e32 v3, v1, v2, vcc
; GFX9-NEXT: v_cmp_nle_f16_sdwa vcc, v7, v0 src0_sel:WORD_1 src1_sel:DWORD
diff --git a/llvm/test/CodeGen/AMDGPU/extract-subvector.ll b/llvm/test/CodeGen/AMDGPU/extract-subvector.ll
index 41082821bafe3..a8d94146195f4 100644
--- a/llvm/test/CodeGen/AMDGPU/extract-subvector.ll
+++ b/llvm/test/CodeGen/AMDGPU/extract-subvector.ll
@@ -127,7 +127,6 @@ define <2 x i64> @extract_2xi64(ptr addrspace(1) %p0, ptr addrspace(1) %p1, i1 %
; GCN-NEXT: s_mov_b32 s11, 0xf000
; GCN-NEXT: s_mov_b32 s8, s10
; GCN-NEXT: s_mov_b32 s9, s10
-; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: buffer_load_dwordx4 v[4:7], v[0:1], s[8:11], 0 addr64 glc
; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: buffer_load_dwordx4 v[8:11], v[0:1], s[8:11], 0 addr64 offset:16 glc
@@ -138,7 +137,6 @@ define <2 x i64> @extract_2xi64(ptr addrspace(1) %p0, ptr addrspace(1) %p1, i1 %
; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: .LBB1_4: ; %exit
; GCN-NEXT: s_or_b64 exec, exec, s[4:5]
-; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_mov_b32_e32 v1, 0xffff8000
; GCN-NEXT: v_cmp_lt_i64_e32 vcc, -1, v[4:5]
; GCN-NEXT: v_cndmask_b32_e32 v0, -1, v1, vcc
@@ -197,7 +195,6 @@ define <4 x i64> @extract_4xi64(ptr addrspace(1) %p0, ptr addrspace(1) %p1, i1 %
; GCN-NEXT: s_mov_b32 s11, 0xf000
; GCN-NEXT: s_mov_b32 s8, s10
; GCN-NEXT: s_mov_b32 s9, s10
-; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: buffer_load_dwordx4 v[4:7], v[0:1], s[8:11], 0 addr64 glc
; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: buffer_load_dwordx4 v[8:11], v[0:1], s[8:11], 0 addr64 offset:16 glc
@@ -208,7 +205,6 @@ define <4 x i64> @extract_4xi64(ptr addrspace(1) %p0, ptr addrspace(1) %p1, i1 %
; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: .LBB2_4: ; %exit
; GCN-NEXT: s_or_b64 exec, exec, s[4:5]
-; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_mov_b32_e32 v1, 0xffff8000
; GCN-NEXT: v_cmp_gt_i64_e32 vcc, 0, v[4:5]
; GCN-NEXT: v_cndmask_b32_e64 v0, v1, -1, vcc
@@ -305,7 +301,6 @@ define <8 x i64> @extract_8xi64(ptr addrspace(1) %p0, ptr addrspace(1) %p1, i1 %
; GCN-NEXT: v_cmp_gt_i64_e64 s[6:7], 0, v[10:11]
; GCN-NEXT: v_cmp_gt_i64_e64 s[8:9], 0, v[12:13]
; GCN-NEXT: v_cmp_gt_i64_e64 s[10:11], 0, v[14:15]
-; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_cmp_gt_i64_e64 s[12:13], 0, v[16:17]
; GCN-NEXT: v_cmp_gt_i64_e64 s[14:15], 0, v[18:19]
; GCN-NEXT: v_cmp_gt_i64_e64 s[16:17], 0, v[4:5]
@@ -376,7 +371,6 @@ define <2 x double> @extract_2xf64(ptr addrspace(1) %p0, ptr addrspace(1) %p1, i
; GCN-NEXT: s_mov_b32 s11, 0xf000
; GCN-NEXT: s_mov_b32 s8, s10
; GCN-NEXT: s_mov_b32 s9, s10
-; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: buffer_load_dwordx4 v[4:7], v[0:1], s[8:11], 0 addr64 glc
; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: buffer_load_dwordx4 v[8:11], v[0:1], s[8:11], 0 addr64 offset:16 glc
@@ -387,7 +381,6 @@ define <2 x double> @extract_2xf64(ptr addrspace(1) %p0, ptr addrspace(1) %p1, i
; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: .LBB4_4: ; %exit
; GCN-NEXT: s_or_b64 exec, exec, s[4:5]
-; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_mov_b32_e32 v0, 0xbff00000
; GCN-NEXT: v_cmp_lt_f64_e32 vcc, -1.0, v[4:5]
; GCN-NEXT: v_cndmask_b32_e64 v1, v0, -2.0, vcc
@@ -446,7 +439,6 @@ define <4 x double> @extract_4xf64(ptr addrspace(1) %p0, ptr addrspace(1) %p1, i
; GCN-NEXT: s_mov_b32 s11, 0xf000
; GCN-NEXT: s_mov_b32 s8, s10
; GCN-NEXT: s_mov_b32 s9, s10
-; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: buffer_load_dwordx4 v[4:7], v[0:1], s[8:11], 0 addr64 glc
; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: buffer_load_dwordx4 v[8:11], v[0:1], s[8:11], 0 addr64 offset:16 glc
@@ -457,7 +449,6 @@ define <4 x double> @extract_4xf64(ptr addrspace(1) %p0, ptr addrspace(1) %p1, i
; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: .LBB5_4: ; %exit
; GCN-NEXT: s_or_b64 exec, exec, s[4:5]
-; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_mov_b32_e32 v0, 0xbff00000
; GCN-NEXT: v_cmp_nlt_f64_e32 vcc, -1.0, v[4:5]
; GCN-NEXT: v_cndmask_b32_e32 v1, -2.0, v0, vcc
@@ -554,7 +545,6 @@ define <8 x double> @extract_8xf64(ptr addrspace(1) %p0, ptr addrspace(1) %p1, i
; GCN-NEXT: v_cmp_nlt_f64_e64 s[6:7], -1.0, v[10:11]
; GCN-NEXT: v_cmp_nlt_f64_e64 s[8:9], -1.0, v[12:13]
; GCN-NEXT: v_cmp_nlt_f64_e64 s[10:11], -1.0, v[14:15]
-; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_cmp_nlt_f64_e64 s[12:13], -1.0, v[16:17]
; GCN-NEXT: v_cmp_nlt_f64_e64 s[14:15], -1.0, v[18:19]
; GCN-NEXT: v_cmp_nlt_f64_e64 s[16:17], -1.0, v[4:5]
diff --git a/llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll b/llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll
index b5665835eaf7a..17a5f520ff41e 100644
--- a/llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll
+++ b/llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll
@@ -6178,13 +6178,12 @@ define amdgpu_kernel void @insert_vgpr_offset_multiple_in_block(ptr addrspace(1)
; NOOPT-NEXT: v_mov_b32_e32 v11, v14
; NOOPT-NEXT: v_mov_b32_e32 v12, v13
; NOOPT-NEXT: buffer_store_dwordx4 v[9:12], off, s[4:7], 0 offset:32
-; NOOPT-NEXT: s_waitcnt vmcnt(0)
+; NOOPT-NEXT: s_waitcnt vmcnt(0) expcnt(0)
; NOOPT-NEXT: ; implicit-def: $sgpr1
; NOOPT-NEXT: ; implicit-def: $sgpr1
; NOOPT-NEXT: ; implicit-def: $sgpr1
; NOOPT-NEXT: ; implicit-def: $sgpr1
; NOOPT-NEXT: ; kill: def $vgpr8 killed $vgpr8 def $vgpr8_vgpr9_vgpr10_vgpr11 killed $exec
-; NOOPT-NEXT: s_waitcnt expcnt(0)
; NOOPT-NEXT: v_mov_b32_e32 v9, v4
; NOOPT-NEXT: v_mov_b32_e32 v10, v3
; NOOPT-NEXT: v_mov_b32_e32 v11, v2
@@ -7297,7 +7296,6 @@ define amdgpu_kernel void @extract_adjacent_blocks(i32 %arg) {
; NOOPT-NEXT: buffer_load_dwordx4 v[0:3], off, s[0:3], 0 glc
; NOOPT-NEXT: s_waitcnt vmcnt(0)
; NOOPT-NEXT: ; implicit-def: $sgpr0
-; NOOPT-NEXT: s_waitcnt vmcnt(0)
; NOOPT-NEXT: ;;#ASMSTART
; NOOPT-NEXT: ; reg use v[0:3]
; NOOPT-NEXT: ;;#ASMEND
@@ -7320,7 +7318,6 @@ define amdgpu_kernel void @extract_adjacent_blocks(i32 %arg) {
; NOOPT-NEXT: buffer_load_dwordx4 v[0:3], off, s[0:3], 0 glc
; NOOPT-NEXT: s_waitcnt vmcnt(0)
; NOOPT-NEXT: ; implicit-def: $sgpr0
-; NOOPT-NEXT: s_waitcnt vmcnt(0)
; NOOPT-NEXT: ;;#ASMSTART
; NOOPT-NEXT: ; reg use v[0:3]
; NOOPT-NEXT: ;;#ASMEND
@@ -7541,7 +7538,6 @@ define amdgpu_kernel void @insert_adjacent_blocks(i32 %arg, float %val0) {
; NOOPT-NEXT: s_waitcnt vmcnt(0)
; NOOPT-NEXT: ; implicit-def: $sgpr0_sgpr1_sgpr2_sgpr3
; NOOPT-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
-; NOOPT-NEXT: s_waitcnt vmcnt(0)
; NOOPT-NEXT: ;;#ASMSTART
; NOOPT-NEXT: ; reg use v[0:3]
; NOOPT-NEXT: ;;#ASMEND
@@ -7565,7 +7561,6 @@ define amdgpu_kernel void @insert_adjacent_blocks(i32 %arg, float %val0) {
; NOOPT-NEXT: s_waitcnt vmcnt(0)
; NOOPT-NEXT: ; implicit-def: $sgpr0_sgpr1_sgpr2_sgpr3
; NOOPT-NEXT: ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
-; NOOPT-NEXT: s_waitcnt vmcnt(0)
; NOOPT-NEXT: ;;#ASMSTART
; NOOPT-NEXT: ; reg use v[0:3]
; NOOPT-NEXT: ;;#ASMEND
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-agent.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-agent.ll
index 8b600c835a160..74a72e04fa4ae 100644
--- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-agent.ll
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-agent.ll
@@ -7367,7 +7367,6 @@ define amdgpu_kernel void @global_agent_acquire_monotonic_ret_cmpxchg(
; SKIP-CACHE-INV-NEXT: buffer_atomic_cmpswap v[0:1], off, s[0:3], 0 offset:16 glc
; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: ; kill: def $vgpr0 killed $vgpr0 killed $vgpr0_vgpr1 killed $exec
-; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: buffer_store_dword v0, off, s[0:3], 0
; SKIP-CACHE-INV-NEXT: s_endpgm
;
@@ -7918,7 +7917,6 @@ define amdgpu_kernel void @global_agent_acq_rel_monotonic_ret_cmpxchg(
; SKIP-CACHE-INV-NEXT: buffer_atomic_cmpswap v[0:1], off, s[0:3], 0 offset:16 glc
; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: ; kill: def $vgpr0 killed $vgpr0 killed $vgpr0_vgpr1 killed $exec
-; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: buffer_store_dword v0, off, s[0:3], 0
; SKIP-CACHE-INV-NEXT: s_endpgm
;
@@ -8215,7 +8213,6 @@ define amdgpu_kernel void @global_agent_seq_cst_monotonic_ret_cmpxchg(
; SKIP-CACHE-INV-NEXT: buffer_atomic_cmpswap v[0:1], off, s[0:3], 0 offset:16 glc
; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: ; kill: def $vgpr0 killed $vgpr0 killed $vgpr0_vgpr1 killed $exec
-; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: buffer_store_dword v0, off, s[0:3], 0
; SKIP-CACHE-INV-NEXT: s_endpgm
;
@@ -8505,7 +8502,6 @@ define amdgpu_kernel void @global_agent_monotonic_acquire_ret_cmpxchg(
; SKIP-CACHE-INV-NEXT: buffer_atomic_cmpswap v[0:1], off, s[0:3], 0 offset:16 glc
; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: ; kill: def $vgpr0 killed $vgpr0 killed $vgpr0_vgpr1 killed $exec
-; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: buffer_store_dword v0, off, s[0:3], 0
; SKIP-CACHE-INV-NEXT: s_endpgm
;
@@ -8777,7 +8773,6 @@ define amdgpu_kernel void @global_agent_acquire_acquire_ret_cmpxchg(
; SKIP-CACHE-INV-NEXT: buffer_atomic_cmpswap v[0:1], off, s[0:3], 0 offset:16 glc
; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: ; kill: def $vgpr0 killed $vgpr0 killed $vgpr0_vgpr1 killed $exec
-; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: buffer_store_dword v0, off, s[0:3], 0
; SKIP-CACHE-INV-NEXT: s_endpgm
;
@@ -9052,7 +9047,6 @@ define amdgpu_kernel void @global_agent_release_acquire_ret_cmpxchg(
; SKIP-CACHE-INV-NEXT: buffer_atomic_cmpswap v[0:1], off, s[0:3], 0 offset:16 glc
; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: ; kill: def $vgpr0 killed $vgpr0 killed $vgpr0_vgpr1 killed $exec
-; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: buffer_store_dword v0, off, s[0:3], 0
; SKIP-CACHE-INV-NEXT: s_endpgm
;
@@ -9349,7 +9343,6 @@ define amdgpu_kernel void @global_agent_acq_rel_acquire_ret_cmpxchg(
; SKIP-CACHE-INV-NEXT: buffer_atomic_cmpswap v[0:1], off, s[0:3], 0 offset:16 glc
; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: ; kill: def $vgpr0 killed $vgpr0 killed $vgpr0_vgpr1 killed $exec
-; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: buffer_store_dword v0, off, s[0:3], 0
; SKIP-CACHE-INV-NEXT: s_endpgm
;
@@ -9646,7 +9639,6 @@ define amdgpu_kernel void @global_agent_seq_cst_acquire_ret_cmpxchg(
; SKIP-CACHE-INV-NEXT: buffer_atomic_cmpswap v[0:1], off, s[0:3], 0 offset:16 glc
; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: ; kill: def $vgpr0 killed $vgpr0 killed $vgpr0_vgpr1 killed $exec
-; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: buffer_store_dword v0, off, s[0:3], 0
; SKIP-CACHE-INV-NEXT: s_endpgm
;
@@ -9943,7 +9935,6 @@ define amdgpu_kernel void @global_agent_monotonic_seq_cst_ret_cmpxchg(
; SKIP-CACHE-INV-NEXT: buffer_atomic_cmpswap v[0:1], off, s[0:3], 0 offset:16 glc
; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: ; kill: def $vgpr0 killed $vgpr0 killed $vgpr0_vgpr1 killed $exec
-; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: buffer_store_dword v0, off, s[0:3], 0
; SKIP-CACHE-INV-NEXT: s_endpgm
;
@@ -10240,7 +10231,6 @@ define amdgpu_kernel void @global_agent_acquire_seq_cst_ret_cmpxchg(
; SKIP-CACHE-INV-NEXT: buffer_atomic_cmpswap v[0:1], off, s[0:3], 0 offset:16 glc
; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: ; kill: def $vgpr0 killed $vgpr0 killed $vgpr0_vgpr1 killed $exec
-; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: buffer_store_dword v0, off, s[0:3], 0
; SKIP-CACHE-INV-NEXT: s_endpgm
;
@@ -10533,7 +10523,6 @@ define amdgpu_kernel void @global_agent_release_seq_cst_ret_cmpxchg(
; SKIP-CACHE-INV-NEXT: buffer_atomic_cmpswap v[0:1], off, s[0:3], 0 offset:16 glc
; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: ; kill: def $vgpr0 killed $vgpr0 killed $vgpr0_vgpr1 killed $exec
-; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: buffer_store_dword v0, off, s[0:3], 0
; SKIP-CACHE-INV-NEXT: s_endpgm
;
@@ -10830,7 +10819,6 @@ define amdgpu_kernel void @global_agent_acq_rel_seq_cst_ret_cmpxchg(
; SKIP-CACHE-INV-NEXT: buffer_atomic_cmpswap v[0:1], off, s[0:3], 0 offset:16 glc
; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: ; kill: def $vgpr0 killed $vgpr0 killed $vgpr0_vgpr1 killed $exec
-; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: buffer_store_dword v0, off, s[0:3], 0
; SKIP-CACHE-INV-NEXT: s_endpgm
;
@@ -11127,7 +11115,6 @@ define amdgpu_kernel void @global_agent_seq_cst_seq_cst_ret_cmpxchg(
; SKIP-CACHE-INV-NEXT: buffer_atomic_cmpswap v[0:1], off, s[0:3], 0 offset:16 glc
; SKIP-CACHE-INV-NEXT: s_waitcnt vmcnt(0)
; SKIP-CACHE-INV-NEXT: ...
[truncated]
The affected tests need a careful review. I do believe that in each of these cases, the
This is the motivating testcase for this change. The first commit contains the test including the spurious s_waitcnt, and the second commit shows how it is optimized away.
This makes it easier to fix the failure in #145720
[AMDGPU][SIInsertWaitcnts] drop OldWaitcntInstr only when it is processed

When iterating over instructions, the pass keeps track of pending soft waitcnt instructions. These are usually processed when a non-waitcnt instruction is visited, but some visit may return early. Dropping the recorded pointer at this early exit produces spurious waitcnt. This does not affect correctness, but the waitcnt is unnecessary. To fix that, the recorded pointer should be dropped iff pending soft waitcnt instructions have been processed.
Force-pushed from 5abc838 to 7f69f65.
Bump!
When iterating over a block, meta instructions have no effect on wait counts, but their presence drops the reference to earlier waitcnt instructions before they are processed. This results in spurious wait counts, which do not affect correctness, but are also not required in the resulting program. Skipping meta instructions as soon as they are seen cleans this up.
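The effect described above can be sketched outside LLVM (hypothetical stand-in types and names; a minimal model of the block walk, not the real MachineBasicBlock iteration): meta entries are bypassed before any waitcnt bookkeeping runs, so the reference to an earlier waitcnt survives until a real instruction consumes it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model: "soft" is a pending waitcnt, "meta" carries no
// machine code, and "op" is a real instruction that merges with the
// pending waitcnt if one is still recorded.
int countSpuriousWaits(const std::vector<std::string> &block, bool skipMeta) {
  int spurious = 0;
  bool pending = false;
  for (const std::string &inst : block) {
    if (skipMeta && inst == "meta")
      continue; // new behavior: skipped up front, bookkeeping never runs
    if (inst == "soft") {
      pending = true; // remember the existing waitcnt
      continue;
    }
    if (inst == "meta") {
      pending = false; // old behavior: reference dropped unprocessed
      continue;
    }
    // "op": merge with the recorded waitcnt, or count a spurious new one.
    if (!pending)
      ++spurious;
    pending = false; // processed here, so clearing is correct
  }
  return spurious;
}
```

With the sequence soft, meta, op, the old walk loses the recorded waitcnt at the meta entry and needs one spurious wait; skipping meta entries needs none.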
Thanks! Everything else looks good to me.
@@ -2474,6 +2474,10 @@ bool SIInsertWaitcnts::insertWaitcntInBlock(MachineFunction &MF,
E = Block.instr_end();
Iter != E;) {
MachineInstr &Inst = *Iter; | |||
if (Inst.isMetaInstruction()) { |
Does this mean generateWaitcntInstBefore no longer has to worry about meta instructions, or could it assert that it does not see them?
No, that's the part that I meant about duplicating the handling of meta instructions. We just skipped the meta instruction at the top level. But when we eventually do visit a subsequent instruction, OldWaitcntInstr could still be pointing to something earlier than the skipped instruction. In that case, the range of instructions that generateWaitcntInstBefore visits will contain that skipped instruction.

EDIT: Oops, I spoke too soon from memory. You're right ... generateWaitcntInstBefore should be able to assert that it's not a meta instruction! Will go fix that.
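The assertion discussed in this thread could take roughly the following shape (hypothetical FakeInst type standing in for MachineInstr; only the shape of the check is shown, not the pass's actual code):

```cpp
#include <cassert>

// Hypothetical stand-in for a machine instruction.
struct FakeInst {
  bool meta;
  bool isMetaInstruction() const { return meta; }
};

// Once callers skip meta instructions before calling in, the helper can
// assert the invariant instead of silently tolerating violations.
bool generateWaitBefore(const FakeInst &inst) {
  assert(!inst.isMetaInstruction() &&
         "meta instructions must be skipped by the caller");
  return true; // pretend wait generation succeeded
}
```

Turning the tolerated case into an assertion documents the new caller contract and catches any future call site that forgets to skip meta instructions.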
When iterating over instructions, the pass keeps track of pending soft waitcnt
instructions. These are usually processed when a non-waitcnt instruction is
visited, but some visit may return early. Dropping the recorded pointer at this
early exit produces spurious waitcnt. This does not affect correctness, but the
waitcnt is unnecessary.
To fix that, the recorded pointer should be dropped iff pending soft waitcnt
instructions have been processed.
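The mechanical part of the fix, passing the record by reference so the callee can clear it exactly when it consumes it, can be sketched generically (hypothetical Record type and applyPending function; a minimal model of the MachineInstr * to MachineInstr *& change, not the pass itself):

```cpp
#include <cassert>
#include <string>

// Hypothetical record of a pending instruction.
struct Record {
  std::string text;
};

// The callee takes the pointer by reference and clears it only after
// actually consuming it, so the caller's copy cannot go stale.
bool applyPending(Record *&pending, std::string &out) {
  if (!pending)
    return false;       // nothing to merge
  out += pending->text; // consume the recorded instruction
  pending = nullptr;    // cleared exactly when processed
  return true;
}
```

After a call that consumes the record, the caller's pointer is already null, so there is no separate clearing statement for the caller to misplace on an early-exit path.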