Description
Repro Repo:
https://github.com/damageboy/coreclr-redundant-mov-around-pdep
Relevant piece of code:
var p0 = ParallelBitDeposit((uint) e0, 0x0707070707070707);
var p1 = ParallelBitDeposit((uint) e1, 0x0707070707070707);
var p2 = ParallelBitDeposit((uint) e2, 0x0707070707070707);
var p3 = ParallelBitDeposit((uint) e3, 0x0707070707070707);
var p4 = ParallelBitDeposit((uint) e4, 0x0707070707070707);
var p5 = ParallelBitDeposit((uint) e5, 0x0707070707070707);
var p6 = ParallelBitDeposit((uint) e6, 0x0707070707070707);
var p7 = ParallelBitDeposit((uint) e7, 0x0707070707070707);
Generated asm:
; var p0 = ParallelBitDeposit((uint) e0, 0x0707070707070707);
00007F06A0BA0833 8BFF mov edi,edi
00007F06A0BA0835 49BB0707070707070707 mov r11,707070707070707h
00007F06A0BA083F C4C2C3F5FB pdep rdi,rdi,r11
; var p1 = ParallelBitDeposit((uint) e1, 0x0707070707070707);
00007F06A0BA0844 8BC0 mov eax,eax
00007F06A0BA0846 C4C2FBF5C3 pdep rax,rax,r11
; var p2 = ParallelBitDeposit((uint) e2, 0x0707070707070707);
00007F06A0BA084B 8BD2 mov edx,edx
00007F06A0BA084D C4C2EBF5D3 pdep rdx,rdx,r11
; var p3 = ParallelBitDeposit((uint) e3, 0x0707070707070707);
00007F06A0BA0852 8BF6 mov esi,esi
00007F06A0BA0854 C4C2CBF5F3 pdep rsi,rsi,r11
; var p4 = ParallelBitDeposit((uint) e4, 0x0707070707070707);
00007F06A0BA0859 458BC0 mov r8d,r8d
00007F06A0BA085C C442BBF5C3 pdep r8,r8,r11
; var p5 = ParallelBitDeposit((uint) e5, 0x0707070707070707);
00007F06A0BA0861 8BC9 mov ecx,ecx
00007F06A0BA0863 C4C2F3F5CB pdep rcx,rcx,r11
; var p6 = ParallelBitDeposit((uint) e6, 0x0707070707070707);
00007F06A0BA0868 458BD2 mov r10d,r10d
00007F06A0BA086B C442ABF5D3 pdep r10,r10,r11
; var p7 = ParallelBitDeposit((uint) e7, 0x0707070707070707);
00007F06A0BA0870 458BC9 mov r9d,r9d
00007F06A0BA0873 C442B3F5CB pdep r9,r9,r11
Issue
In the asm listing, we can clearly see that the upper 32 bits of the source register for PDEP are cleared (the mov edi,edi, mov eax,eax, ... instructions) as part of the implicit cast to ulong.
This is sensible behaviour in general, since PDEP might end up reading from those bits. However, in cases such as the above, where PDEP is supplied with a constant mask that has so few set bits that no source bit above the low 32 is ever read, clearing the upper bits is redundant.
From the Intel docs:
PDEP uses a mask in the second source operand (the third operand) to transfer/scatter contiguous low order bits in
the first source operand (the second operand) into the destination (the first operand). PDEP takes the low bits from
the first source operand and deposit them in the destination operand at the corresponding bit locations that are set
in the second source operand (mask). All other bits (bits not set in mask) in destination are set to zero.
In other words, the number of set bits in the mask controls how many low-order bits are read from the source.
As such, a constant mask with 24 set bits, as in this case (0x0707070707070707 -> each 0x07 byte is 0b111, and 3 bits x 8 bytes == 24 bits), means that only the low 24 bits of the source are read. Not a single bit from the upper 32 bits will ever be read by this instruction, so clearing those bits as part of the cast has no effect on the result.
For this case, eliding the zero-extension would shave off the 8 mov reg,reg instructions that are currently emitted to clear those bits.
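To illustrate the argument, here is a minimal C# sketch. It is not JIT code: `SoftwarePdep` is a bit-by-bit reference model written from the Intel description quoted above, and `ZeroExtensionIsRedundant` is a hypothetical name for the condition the JIT could check on a constant mask. The sketch shows that when the mask has at most 32 set bits, garbage in the upper 32 bits of the source does not change the result.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics.X86;

public static class PdepUpperBitsDemo
{
    // Software reference model of PDEP: the k-th low-order source bit is
    // deposited at the position of the k-th set bit of the mask.
    public static ulong SoftwarePdep(ulong source, ulong mask)
    {
        ulong result = 0;
        int k = 0; // index of the next low-order source bit to consume
        for (int i = 0; i < 64; i++)
        {
            if (((mask >> i) & 1UL) != 0)
            {
                result |= ((source >> k) & 1UL) << i;
                k++;
            }
        }
        return result;
    }

    // Hypothetical predicate (not an existing JIT API): PDEP reads only the
    // low popcount(mask) source bits, so a uint -> ulong zero-extension of
    // the source cannot affect the result when that count is <= 32.
    public static bool ZeroExtensionIsRedundant(ulong constantMask)
        => BitOperations.PopCount(constantMask) <= 32;

    public static void Main()
    {
        const ulong mask = 0x0707070707070707;
        ulong clean = 0x00ABCDEF;             // upper 32 bits already zero
        ulong dirty = 0xDEADBEEF_00ABCDEFUL;  // same low 32 bits, garbage above

        Console.WriteLine(BitOperations.PopCount(mask));      // 24
        Console.WriteLine(ZeroExtensionIsRedundant(mask));    // True
        Console.WriteLine(SoftwarePdep(clean, mask) == SoftwarePdep(dirty, mask)); // True

        // Cross-check against the hardware instruction where BMI2 is available.
        if (Bmi2.X64.IsSupported)
        {
            Console.WriteLine(
                Bmi2.X64.ParallelBitDeposit(dirty, mask) == SoftwarePdep(clean, mask));
        }
    }
}
```

The same check would apply to any constant mask, not just this one; for a non-constant mask the zero-extension still has to be kept, since the number of set bits is unknown at JIT time.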
category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium