Redundant clearing of top-32 bit before certain calls to PDEP

## Repro Repo:
https://github.com/damageboy/coreclr-redundant-mov-around-pdep


## Relevant piece of code:

https://github.com/damageboy/coreclr-redundant-mov-around-pdep/blob/2f87e88ccd419f26d2b5afcfe09cfadd7a220996/Program.cs#L40-L47
```csharp
            var p0 = ParallelBitDeposit((uint) e0, 0x0707070707070707);
            var p1 = ParallelBitDeposit((uint) e1, 0x0707070707070707);
            var p2 = ParallelBitDeposit((uint) e2, 0x0707070707070707);
            var p3 = ParallelBitDeposit((uint) e3, 0x0707070707070707);
            var p4 = ParallelBitDeposit((uint) e4, 0x0707070707070707);
            var p5 = ParallelBitDeposit((uint) e5, 0x0707070707070707);
            var p6 = ParallelBitDeposit((uint) e6, 0x0707070707070707);
            var p7 = ParallelBitDeposit((uint) e7, 0x0707070707070707);
```

## Generated asm:
https://github.com/damageboy/coreclr-redundant-mov-around-pdep/blob/2f87e88ccd419f26d2b5afcfe09cfadd7a220996/listing.asm#L56-L87

```nasm
;             var p0 = ParallelBitDeposit((uint) e0, 0x0707070707070707);
00007F06A0BA0833 8BFF                 mov     edi,edi
00007F06A0BA0835 49BB0707070707070707 mov     r11,707070707070707h
00007F06A0BA083F C4C2C3F5FB           pdep    rdi,rdi,r11
;             var p1 = ParallelBitDeposit((uint) e1, 0x0707070707070707);
00007F06A0BA0844 8BC0                 mov     eax,eax
00007F06A0BA0846 C4C2FBF5C3           pdep    rax,rax,r11
;             var p2 = ParallelBitDeposit((uint) e2, 0x0707070707070707);
00007F06A0BA084B 8BD2                 mov     edx,edx
00007F06A0BA084D C4C2EBF5D3           pdep    rdx,rdx,r11
;             var p3 = ParallelBitDeposit((uint) e3, 0x0707070707070707);
00007F06A0BA0852 8BF6                 mov     esi,esi
00007F06A0BA0854 C4C2CBF5F3           pdep    rsi,rsi,r11
;             var p4 = ParallelBitDeposit((uint) e4, 0x0707070707070707);
00007F06A0BA0859 458BC0               mov     r8d,r8d
00007F06A0BA085C C442BBF5C3           pdep    r8,r8,r11
;             var p5 = ParallelBitDeposit((uint) e5, 0x0707070707070707);
00007F06A0BA0861 8BC9                 mov     ecx,ecx
00007F06A0BA0863 C4C2F3F5CB           pdep    rcx,rcx,r11
;             var p6 = ParallelBitDeposit((uint) e6, 0x0707070707070707);
00007F06A0BA0868 458BD2               mov     r10d,r10d
00007F06A0BA086B C442ABF5D3           pdep    r10,r10,r11
;             var p7 = ParallelBitDeposit((uint) e7, 0x0707070707070707);
00007F06A0BA0870 458BC9               mov     r9d,r9d
00007F06A0BA0873 C442B3F5CB           pdep    r9,r9,r11
```

## Issue
In the asm listing, we can clearly see that the upper 32 bits of the source register for PDEP is cleared as part of the implicit cast to `ulong`.
While this is a sensible behaviour, as PDEP might end up reading from those bits, for cases such as the above, where PDEP is supplied with a constant mask, which "happens" to never read a single bit past the first 32 bits anyway, clearing these top bits seems redundant:

From the Intel docs:
> PDEP uses a mask in the second source operand (the third operand) to transfer/scatter contiguous low order bits in
the first source operand (the second operand) into the destination (the first operand). PDEP takes the low bits from
the first source operand and deposit them in the destination operand at the corresponding bit locations that are set
in the second source operand (mask). All other bits (bits not set in mask) in destination are set to zero.

In other words, the amount of toggled bits in the mask controls how many low-order bits are read from the source.

As such, a constant mask with 24 bits toggled in this case: (0x0707070707070707 -> each 0x7 is 0b111 x 8 == 24 bits) means that not a single bit from the upper 32 bits will ever be read by this instruction, so clearing those bits as part of the cast is meaningless.

For this case, it would shave off the 8 `mov reg,reg` instructions that are currently emmited to clear those bits.

category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Redundant clearing of top-32 bit before certain calls to PDEP #442

Repro Repo:

Relevant piece of code:

Generated asm:

Issue

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Redundant clearing of top-32 bit before certain calls to PDEP #442

Description

Repro Repo:

Relevant piece of code:

Generated asm:

Issue

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions