Generate better code for successive VectorNNN<T>.GetElement() calls, or alternatively provide a method returning 2-4 elements in one call #437

Open
@damageboy

The code currently generated for successive calls to VectorNNN.GetElement() leaves free performance on the floor. Consider this short fragment:

    var e0 = P.GetElement(4);
    var e1 = P.GetElement(5);
    var e2 = P.GetElement(6);
    var e3 = P.GetElement(7);

This currently generates the following assembly listing:

;             var e0 = P.GetElement(4);
00007F2A8D8E07A4 C4E37D19C101         vextractf128 xmm1,ymm0,1
00007F2A8D8E07AA C5F97ECB             vmovd   ebx,xmm1

;             var e1 = P.GetElement(5);
00007F2A8D8E07AE C4E37D19C101         vextractf128 xmm1,ymm0,1
00007F2A8D8E07B4 C4C37916CE01         vpextrd r14d,xmm1,1

;             var e2 = P.GetElement(6);
00007F2A8D8E07BA C4E37D19C101         vextractf128 xmm1,ymm0,1
00007F2A8D8E07C0 C4C37916CF02         vpextrd r15d,xmm1,2

;             var e3 = P.GetElement(7);
00007F2A8D8E07C6 C4E37D19C001         vextractf128 xmm0,ymm0,1
00007F2A8D8E07CC C4C37916C403         vpextrd r12d,xmm0,3

Modern compilers usually see through this pattern and emit the 128-bit extract instruction only once for all four calls. For example, clang 9.0 on godbolt.org generates:

        vextracti128    xmm0, ymm0, 1
        vpextrd ecx, xmm0, 1
        vmovd   eax, xmm0
        vpextrd ebx, xmm0, 2
        vpextrd edx, xmm0, 3

This sort of optimization removes three redundant extracts, saving 18 bytes of machine code and the associated cycles (vextracti128 is a 6-byte instruction with 3 cycles of latency).

I don't know whether the right approach is to detect this pattern in the JIT and generate efficient code in all cases, or alternatively to provide a direct path through a GetElements() method that returns a tuple of elements. The latter would be handled in much the same way as far as the machine code is concerned, but with less hackery around detecting this usage.
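To make the second option concrete, here is a minimal sketch of what such a helper might look like when written by hand today. The name GetUpperElements and its tuple shape are hypothetical, and nothing like it exists in System.Runtime.Intrinsics. It relies only on the existing GetUpper() and GetElement() extension methods, so the 128-bit extract appears once in the source. Whether the JIT then emits exactly one vextractf128 is not guaranteed.

```csharp
using System.Runtime.Intrinsics;

public static class VectorElementExtensions
{
    // Hypothetical helper (not a real .NET API): returns elements 4..7 of a
    // Vector256<int> as a tuple, taking the upper 128-bit half only once.
    public static (int E4, int E5, int E6, int E7) GetUpperElements(this Vector256<int> v)
    {
        // One extract of the upper half, written explicitly in the source.
        Vector128<int> hi = v.GetUpper();

        // The four per-element reads now operate on the 128-bit value.
        return (hi.GetElement(0), hi.GetElement(1), hi.GetElement(2), hi.GetElement(3));
    }
}
```

With this in place, the fragment from the top of the issue could be written as `var (e0, e1, e2, e3) = P.GetUpperElements();`.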

Further optimizations in the same vein are possible. For example, when two adjacent elements such as 0 and 1 are extracted, it makes more sense to read them as a single 64-bit value and split the two 32-bit halves in ordinary scalar code, which can execute on different ports. For now, though, I think handling the more common cases is sufficient.
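The adjacent-element idea can be sketched by hand as follows. The helper name GetLowerPair is hypothetical, and the code assumes a little-endian lane layout (as on x86), where element 0 occupies the low 32 bits of the 64-bit value. It illustrates the intended shape of the code, not what the JIT currently does.

```csharp
using System.Runtime.Intrinsics;

public static class VectorPairExtensions
{
    // Hypothetical helper (not a real .NET API): returns elements 0 and 1 of a
    // Vector128<int> using a single 64-bit read plus scalar shifts.
    public static (int E0, int E1) GetLowerPair(this Vector128<int> v)
    {
        // Reinterpret the vector as two 64-bit lanes and read the low one.
        ulong packed = v.AsUInt64().ToScalar();

        // Assumes little-endian lanes: element 0 is the low half, element 1 the high half.
        int e0 = unchecked((int)packed);
        int e1 = unchecked((int)(packed >> 32));
        return (e0, e1);
    }
}
```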

category:cq
theme:cse
skill-level:intermediate
cost:medium
impact:medium

Labels: area-CodeGen-coreclr, optimization
