Open
Description
Repro Repo:
https://github.com/damageboy/coreclr-redundant-vmovaps
Relevant piece of code:
var e0 = P.GetElement(0);
var e1 = P.GetElement(1);
var e2 = P.GetElement(2);
var e3 = P.GetElement(3);
Generated asm:
; var e0 = P.GetElement(0);
00007F67249407A4 C5FC28C8 vmovaps ymm1,ymm0
00007F67249407A8 C5F97ECB vmovd ebx,xmm1
; var e1 = P.GetElement(1);
00007F67249407AC C5FC28C8 vmovaps ymm1,ymm0
00007F67249407B0 C4C37916CE01 vpextrd r14d,xmm1,1
; var e2 = P.GetElement(2);
00007F67249407B6 C5FC28C8 vmovaps ymm1,ymm0
00007F67249407BA C4C37916CF02 vpextrd r15d,xmm1,2
; var e3 = P.GetElement(3);
00007F67249407C0 C4C37916C403 vpextrd r12d,xmm0,3
Issue
In the asm listing, you can see that the first 3 GetElement() calls each generate a superfluous vmovaps to copy ymm0 to ymm1 before issuing vmovd for element 0 or vpextrd for elements 1 and 2.
For some reason, each of the first 3 calls generates this extra copy.
The 4th call is "doing the right thing", in that it simply extracts directly from xmm0 (the lower 128 bits of ymm0) without extra fanfare.
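For readers who want to try this without cloning the repro repo, here is a minimal standalone sketch of the same access pattern. It rests on assumptions that are not stated in the issue: the element type is taken to be int (guessed from the 32-bit destination registers in the listing), and the class and method names are hypothetical. It may not reproduce the exact register allocation shown above.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public static class GetElementRepro
{
    // Hypothetical helper: reads the four lowest elements, mirroring the
    // snippet in the issue. NoInlining keeps the method as its own unit so
    // its disassembly is easy to locate.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int SumLowerFour(Vector256<int> p)
    {
        // Assumption: P in the original code is a Vector256<int>.
        var e0 = p.GetElement(0);
        var e1 = p.GetElement(1);
        var e2 = p.GetElement(2);
        var e3 = p.GetElement(3);

        // Use all four values so none of the extractions is dead code.
        return e0 + e1 + e2 + e3;
    }

    public static void Main()
    {
        var p = Vector256.Create(1, 2, 3, 4, 5, 6, 7, 8);
        Console.WriteLine(SumLowerFour(p)); // prints 10
    }
}
```

Inspecting the disassembly of SumLowerFour in a Release build should show whether the copies into a scratch register still appear before the extractions.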
category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium
impact:small