Description
The code currently generated for successive calls to VectorNNN.GetElement() leaves free performance on the floor. Consider this short fragment:
var e0 = P.GetElement(4);
var e1 = P.GetElement(5);
var e2 = P.GetElement(6);
var e3 = P.GetElement(7);
This currently generates the following assembly listing:
; var e0 = P.GetElement(4);
00007F2A8D8E07A4 C4E37D19C101 vextractf128 xmm1,ymm0,1
00007F2A8D8E07AA C5F97ECB vmovd ebx,xmm1
; var e1 = P.GetElement(5);
00007F2A8D8E07AE C4E37D19C101 vextractf128 xmm1,ymm0,1
00007F2A8D8E07B4 C4C37916CE01 vpextrd r14d,xmm1,1
; var e2 = P.GetElement(6);
00007F2A8D8E07BA C4E37D19C101 vextractf128 xmm1,ymm0,1
00007F2A8D8E07C0 C4C37916CF02 vpextrd r15d,xmm1,2
; var e3 = P.GetElement(7);
00007F2A8D8E07C6 C4E37D19C001 vextractf128 xmm0,ymm0,1
00007F2A8D8E07CC C4C37916C403 vpextrd r12d,xmm0,3
Modern compilers usually see through this pattern and emit the 128-bit lane extract (vextractf128 / vextracti128) only once for all 4 calls.
For example, clang 9.0 on godbolt.org generates:
vextracti128 xmm0, ymm0, 1
vpextrd ecx, xmm0, 1
vmovd eax, xmm0
vpextrd ebx, xmm0, 2
vpextrd edx, xmm0, 3
This sort of optimization saves redundant machine code and extra cycles: vextracti128 is a 6-byte instruction with 3 cycles of latency, and three of the four extracts in the listing above are redundant.
I don't know if the right way to approach this is to detect the pattern in the JIT and generate efficient code in all cases, or alternatively to provide a direct path through a GetElements() method that returns a tuple of elements. The latter would be handled in much the same way as far as the machine code is concerned, but with less hackery around detecting this usage.
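To make the second option concrete, here is a minimal sketch of what such a helper could look like if written by hand today. The names `VectorElementSketch` and `GetUpperElements` are hypothetical and not part of the .NET API; the sketch only uses the existing `GetUpper()` and `GetElement()` extension methods so that the upper-half extract appears once in the source. Whether the JIT then emits exactly the clang sequence above has not been verified here.

```csharp
using System;
using System.Runtime.Intrinsics;

static class VectorElementSketch
{
    // Hypothetical helper (not a real .NET API): returns elements 4..7
    // of a Vector256<int> as a tuple.
    public static (int E4, int E5, int E6, int E7) GetUpperElements(Vector256<int> v)
    {
        // Extract the upper 128 bits once...
        Vector128<int> upper = v.GetUpper();

        // ...then read all four 32-bit elements from that single 128-bit value.
        return (upper.GetElement(0),
                upper.GetElement(1),
                upper.GetElement(2),
                upper.GetElement(3));
    }
}
```

With this, the fragment at the top would become `var (e0, e1, e2, e3) = VectorElementSketch.GetUpperElements(P);`.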
There are additional optimizations of the same kind that can be performed. For example, when two successive elements such as 0,1 are extracted, it makes more sense to read them as one 64-bit value and deal with the two 32-bit halves in normal scalar code, which can execute on different ports. For now, though, I think handling the more common cases is sufficient...
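The 64-bit read mentioned above can be sketched by hand as follows. `PairExtractSketch` and `GetLowPair` are hypothetical names used only for illustration, and the sketch assumes a little-endian lane layout (as on x86), where element 0 occupies the low 32 bits of the 64-bit value.

```csharp
using System;
using System.Runtime.Intrinsics;

static class PairExtractSketch
{
    // Hypothetical helper (not a real .NET API): returns elements 0 and 1
    // of a Vector128<int> using a single 64-bit scalar read.
    public static (int E0, int E1) GetLowPair(Vector128<int> v)
    {
        // Reinterpret the vector as two 64-bit lanes and read the lowest one.
        ulong pair = v.AsUInt64().ToScalar();

        // Split the halves in scalar code. Assumes little-endian layout:
        // element 0 is the low 32 bits, element 1 the high 32 bits.
        int e0 = unchecked((int)(uint)pair);
        int e1 = unchecked((int)(uint)(pair >> 32));
        return (e0, e1);
    }
}
```

The same idea applies to the upper pairs of a 256-bit vector once the relevant 128-bit half has been extracted.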
category:cq
theme:cse
skill-level:intermediate
cost:medium
impact:medium