Generate better code for successive VectorNNN<T>.GetElement() calls, or alternatively provide a method returning 2-4 elements in one call

The code currently generated for successive calls to `VectorNNN<T>.GetElement()` leaves performance on the table. Consider this short fragment:

```csharp
    var e0 = P.GetElement(4);
    var e1 = P.GetElement(5);
    var e2 = P.GetElement(6);
    var e3 = P.GetElement(7);
```
This currently generates the following assembly listing:

```nasm
;             var e0 = P.GetElement(4);
00007F2A8D8E07A4 C4E37D19C101         vextractf128 xmm1,ymm0,1
00007F2A8D8E07AA C5F97ECB             vmovd   ebx,xmm1

;             var e1 = P.GetElement(5);
00007F2A8D8E07AE C4E37D19C101         vextractf128 xmm1,ymm0,1
00007F2A8D8E07B4 C4C37916CE01         vpextrd r14d,xmm1,1

;             var e2 = P.GetElement(6);
00007F2A8D8E07BA C4E37D19C101         vextractf128 xmm1,ymm0,1
00007F2A8D8E07C0 C4C37916CF02         vpextrd r15d,xmm1,2

;             var e3 = P.GetElement(7);
00007F2A8D8E07C6 C4E37D19C001         vextractf128 xmm0,ymm0,1
00007F2A8D8E07CC C4C37916C403         vpextrd r12d,xmm0,3
```

Modern compilers usually see through this pattern and execute the 128-bit lane extraction (`vextractf128`/`vextracti128`) only once for all 4 calls.

For example, [clang 9.0 from godbolt.org](https://godbolt.org/z/cyA55x):
```nasm
        vextracti128    xmm0, ymm0, 1
        vpextrd ecx, xmm0, 1
        vmovd   eax, xmm0
        vpextrd ebx, xmm0, 2
        vpextrd edx, xmm0, 3
```

This sort of optimization saves redundant machine code and extra cycles: `vextracti128` is a 6-byte instruction with 3 cycles of latency, and the listing above repeats the equivalent `vextractf128` four times where one would do.

I don't know whether the right approach is to detect this pattern in the JIT and generate efficient code in all cases, or to provide a direct path through a `GetElements()` method that returns a tuple of elements. The latter would be handled in much the same way as far as the machine code is concerned, but with less hackery around detecting this usage.
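To make the second option concrete, here is a minimal sketch of what such a helper could look like for the upper half of a `Vector256<int>`. The class name `VectorElementsSketch` and the method `GetUpperElements` are hypothetical and are not part of `System.Runtime.Intrinsics`; only `GetUpper()` and `GetElement()` are existing APIs. Whether the JIT actually emits a single lane extraction for this shape is an assumption that would need to be checked against the disassembly.

```csharp
using System.Runtime.Intrinsics;

public static class VectorElementsSketch
{
    // Hypothetical helper (not an existing .NET API): returns elements 4..7
    // of a Vector256<int> as a tuple.
    public static (int E0, int E1, int E2, int E3) GetUpperElements(this Vector256<int> vector)
    {
        // Take the upper 128-bit lane once, in source, instead of relying on
        // the JIT to share the extraction between four GetElement calls.
        Vector128<int> upper = vector.GetUpper();

        // Each element is then read from the same 128-bit value.
        return (upper.GetElement(0), upper.GetElement(1),
                upper.GetElement(2), upper.GetElement(3));
    }
}
```

With this in place, the fragment above could be written as `var (e0, e1, e2, e3) = P.GetUpperElements();`.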

There are further optimizations in the same vein. For example, when two adjacent elements such as 0 and 1 are extracted, it makes more sense to read them as one 64-bit value and split the two 32-bit halves in normal scalar code, which can execute on different ports. For now, though, I think handling the more common cases is sufficient.
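As a rough illustration of the adjacent-pair idea, the sketch below does the same thing by hand in C#. The class `PairExtractSketch` and the method `GetElementPair01` are hypothetical names; `AsUInt64()` and `GetElement()` are existing APIs. It assumes a little-endian element layout (as on x86), where element 0 occupies the low 32 bits of the first 64-bit lane.

```csharp
using System.Runtime.Intrinsics;

public static class PairExtractSketch
{
    // Hypothetical helper (not an existing .NET API): returns elements 0 and 1
    // of a Vector256<int> using a single 64-bit read.
    public static (int E0, int E1) GetElementPair01(Vector256<int> vector)
    {
        // Reinterpret the vector as 64-bit lanes and read lane 0, which
        // covers the two 32-bit elements 0 and 1.
        ulong packed = vector.AsUInt64().GetElement(0);

        // Split the halves in scalar code. Assumes little-endian layout:
        // element 0 is the low half, element 1 is the high half.
        int e0 = unchecked((int)packed);
        int e1 = unchecked((int)(packed >> 32));
        return (e0, e1);
    }
}
```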



category:cq
theme:cse
skill-level:intermediate
cost:medium
impact:medium
