Needless clearing of top 32 bits after vmovmskps #431

Open
@damageboy

Repro Repo:

https://github.com/damageboy/coreclr-jit-why-mov-now

Relevant piece of code:

https://github.com/damageboy/coreclr-jit-why-mov-now/blob/09e990e4d87190e33ca9ea1a794ec045a6651f97/Program.cs#L37-L44

            maskyMcMaskFace  = (ulong) (uint) MoveMask(CompareGreaterThan(L0, P).AsSingle()) << 00;
            maskyMcMaskFace |= (ulong) (uint) MoveMask(CompareGreaterThan(L1, P).AsSingle()) << 08;
            maskyMcMaskFace |= (ulong) (uint) MoveMask(CompareGreaterThan(L2, P).AsSingle()) << 16;
            maskyMcMaskFace |= (ulong) (uint) MoveMask(CompareGreaterThan(L3, P).AsSingle()) << 24;
            maskyMcMaskFace |= (ulong) (uint) MoveMask(CompareGreaterThan(L4, P).AsSingle()) << 32;
            maskyMcMaskFace |= (ulong) (uint) MoveMask(CompareGreaterThan(L5, P).AsSingle()) << 40;
            maskyMcMaskFace |= (ulong) (uint) MoveMask(CompareGreaterThan(L6, P).AsSingle()) << 48;
            maskyMcMaskFace |= (ulong) (uint) MoveMask(CompareGreaterThan(L7, P).AsSingle()) << 56;

Generated asm:

https://github.com/damageboy/coreclr-jit-why-mov-now/blob/09e990e4d87190e33ca9ea1a794ec045a6651f97/listing.asm#L57-L115

;             maskyMcMaskFace  = (ulong) (uint) MoveMask(CompareGreaterThan(L0, P).AsSingle()) << 00;
00007F05AE6B097A C5F566C8             vpcmpgtd ymm1,ymm1,ymm0
00007F05AE6B097E C5FC50F9             vmovmskps edi,ymm1
00007F05AE6B0982 8BFF                 mov     edi,edi


;             maskyMcMaskFace |= (ulong) (uint) MoveMask(CompareGreaterThan(L1, P).AsSingle()) << 08;
00007F05AE6B0984 C5ED66C8             vpcmpgtd ymm1,ymm2,ymm0
00007F05AE6B0988 C5FC50C1             vmovmskps eax,ymm1
00007F05AE6B098C 8BC0                 mov     eax,eax
00007F05AE6B098E 48C1E008             shl     rax,8
00007F05AE6B0992 480BF8               or      rdi,rax


;             maskyMcMaskFace |= (ulong) (uint) MoveMask(CompareGreaterThan(L2, P).AsSingle()) << 16;
00007F05AE6B0995 C5E566C8             vpcmpgtd ymm1,ymm3,ymm0
00007F05AE6B0999 C5FC50C1             vmovmskps eax,ymm1
00007F05AE6B099D 8BC0                 mov     eax,eax
00007F05AE6B099F 48C1E010             shl     rax,10h
00007F05AE6B09A3 480BC7               or      rax,rdi
00007F05AE6B09A6 488BF8               mov     rdi,rax


;             maskyMcMaskFace |= (ulong) (uint) MoveMask(CompareGreaterThan(L3, P).AsSingle()) << 24;
00007F05AE6B09A9 C5DD66C8             vpcmpgtd ymm1,ymm4,ymm0
00007F05AE6B09AD C5FC50C1             vmovmskps eax,ymm1
00007F05AE6B09B1 8BC0                 mov     eax,eax
00007F05AE6B09B3 48C1E018             shl     rax,18h
00007F05AE6B09B7 480BC7               or      rax,rdi
00007F05AE6B09BA 488BF8               mov     rdi,rax


;             maskyMcMaskFace |= (ulong) (uint) MoveMask(CompareGreaterThan(L4, P).AsSingle()) << 32;
00007F05AE6B09BD C5D566C8             vpcmpgtd ymm1,ymm5,ymm0
00007F05AE6B09C1 C5FC50C1             vmovmskps eax,ymm1
00007F05AE6B09C5 8BC0                 mov     eax,eax
00007F05AE6B09C7 48C1E020             shl     rax,20h
00007F05AE6B09CB 480BC7               or      rax,rdi
00007F05AE6B09CE 488BF8               mov     rdi,rax


;             maskyMcMaskFace |= (ulong) (uint) MoveMask(CompareGreaterThan(L5, P).AsSingle()) << 40;
00007F05AE6B09D1 C5CD66C8             vpcmpgtd ymm1,ymm6,ymm0
00007F05AE6B09D5 C5FC50C1             vmovmskps eax,ymm1
00007F05AE6B09D9 8BC0                 mov     eax,eax
00007F05AE6B09DB 48C1E028             shl     rax,28h
00007F05AE6B09DF 480BC7               or      rax,rdi
00007F05AE6B09E2 488BF8               mov     rdi,rax


;             maskyMcMaskFace |= (ulong) (uint) MoveMask(CompareGreaterThan(L6, P).AsSingle()) << 48;
00007F05AE6B09E5 C5C566C8             vpcmpgtd ymm1,ymm7,ymm0
00007F05AE6B09E9 C5FC50C1             vmovmskps eax,ymm1
00007F05AE6B09ED 8BC0                 mov     eax,eax
00007F05AE6B09EF 48C1E030             shl     rax,30h
00007F05AE6B09F3 480BC7               or      rax,rdi
00007F05AE6B09F6 488BF8               mov     rdi,rax


;             maskyMcMaskFace |= (ulong) (uint) MoveMask(CompareGreaterThan(L7, P).AsSingle()) << 56;
00007F05AE6B09F9 C5BD66C0             vpcmpgtd ymm0,ymm8,ymm0
00007F05AE6B09FD C5FC50C0             vmovmskps eax,ymm0
00007F05AE6B0A01 8BC0                 mov     eax,eax
00007F05AE6B0A03 48C1E038             shl     rax,38h
00007F05AE6B0A07 480BC7               or      rax,rdi
00007F05AE6B0A0A 488BF8               mov     rdi,rax

Issue

In the asm listing, there is a mov eax,eax between each vmovmskps
and the following shl (and a mov edi,edi after the first vmovmskps).

Unless I'm gravely mistaken, this is superfluous?

The Intel docs are a bit fuzzy on this, I admit:
VMOVMSKPS (VEX.256 encoded version)

DEST[0] 🡐 SRC[31]
DEST[1] 🡐 SRC[63]
DEST[2] 🡐 SRC[95]
DEST[3] 🡐 SRC[127]
DEST[4] 🡐 SRC[159]
DEST[5] 🡐 SRC[191]
DEST[6] 🡐 SRC[223]
DEST[7] 🡐 SRC[255]
IF DEST = r32
    THEN DEST[31:8] 🡐 0;
    ELSE DEST[63:8] 🡐 0;
FI

But it would seem that the entire register is cleared in 64-bit mode.
My reasoning is that I could not get the assembler to produce distinct 32-bit and 64-bit variants of
vmovmskps {e,r}ax,ymm0

That is to say, whenever I try to encode both:

vmovmskps rax,ymm0
vmovmskps eax,ymm0

I always get the same instruction stream:

$ cat x.asm
vmovmskps rax,ymm0
vmovmskps eax,ymm0

$ nasm -f elf64 x.asm

$ objdump -D -Mintel  x.o
x.o:     file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <.text>:
   0:   c5 fc 50 c0             vmovmskps eax,ymm0
   4:   c5 fc 50 c0             vmovmskps eax,ymm0

This leads me to understand that there is no distinction between using a 32-bit and a 64-bit register in this case, so the top 56 bits should already be cleared and the mov eax,eax is not required?
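
To make the argument concrete, here is a minimal sketch in scalar C# of the VMOVMSKPS (VEX.256) pseudocode quoted above, assuming the ELSE branch (DEST[63:8] cleared) is the one that applies in 64-bit mode. MoveMaskModel, MoveMask256 and ZeroExtendIsNoOp are hypothetical names made up for this illustration; they are not part of System.Runtime.Intrinsics. The code only models the documented behavior and does not execute the instruction, so it illustrates the reasoning rather than proving what the hardware does.

```csharp
using System;

public static class MoveMaskModel
{
    // Hypothetical scalar model of the VMOVMSKPS (VEX.256) pseudocode:
    // bit i of the result is the sign bit (bit 31) of 32-bit lane i,
    // and every other destination bit, up to bit 63, is zero.
    public static ulong MoveMask256(ReadOnlySpan<uint> lanes)
    {
        if (lanes.Length != 8)
            throw new ArgumentException("expected eight 32-bit lanes", nameof(lanes));

        ulong dest = 0;                           // DEST[63:8] <- 0
        for (int i = 0; i < 8; i++)
            dest |= (ulong)(lanes[i] >> 31) << i; // DEST[i] <- SRC[32*i + 31]
        return dest;
    }

    // True when zero-extending the low 32 bits changes nothing, i.e. when the
    // (ulong)(uint) cast (emitted as mov eax,eax in the listing) is redundant.
    public static bool ZeroExtendIsNoOp(ulong dest) => (ulong)(uint)dest == dest;
}
```

Under this model, ZeroExtendIsNoOp(MoveMask256(lanes)) holds for any eight input lanes, because only bits 0 through 7 of the result can ever be set. That is the property the JIT would need to know about the MoveMask result in order to drop the extra mov.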

category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium
impact:small

Labels: area-CodeGen-coreclr, optimization