Closed
Labels: A-codegen (Area: Code generation), T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)
Description
#![crate_type = "lib"]
#![feature(tuple_indexing)]
use std::simd::f32x4;
pub fn foo(x: f32x4) -> f32x4 {
    f32x4(x.0, x.2, x.3, x.1)
}
becomes, with no optimisations:
define <4 x float> @_ZN3foo20h2254f602671f886ceaaE(<4 x float>) unnamed_addr #0 {
entry-block:
  %sret_slot = alloca <4 x float>
  %x = alloca <4 x float>
  store <4 x float> %0, <4 x float>* %x
  %1 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 0
  %2 = getelementptr inbounds <4 x float>* %x, i32 0, i32 0
  %3 = load float* %2
  store float %3, float* %1
  %4 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 1
  %5 = getelementptr inbounds <4 x float>* %x, i32 0, i32 2
  %6 = load float* %5
  store float %6, float* %4
  %7 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 2
  %8 = getelementptr inbounds <4 x float>* %x, i32 0, i32 3
  %9 = load float* %8
  store float %9, float* %7
  %10 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 3
  %11 = getelementptr inbounds <4 x float>* %x, i32 0, i32 1
  %12 = load float* %11
  store float %12, float* %10
  %13 = load <4 x float>* %sret_slot
  ret <4 x float> %13
}
with optimisations it becomes
define <4 x float> @_ZN3foo20h2254f602671f886ceaaE(<4 x float>) unnamed_addr #0 {
entry-block:
  %sret_slot.12.vec.insert = shufflevector <4 x float> %0, <4 x float> undef, <4 x i32> <i32 0, i32 2, i32 3, i32 1>
  ret <4 x float> %sret_slot.12.vec.insert
}
We could detect when a SIMD vector is being created directly from elements of another (pair of*) SIMD vector(s) and translate the construction directly into the appropriate shuffle instruction. This would avoid emitting the allocas and the per-element loads and stores that LLVM then has to clean up, and it would probably produce the shuffle more reliably than depending on LLVM's optimiser to recover it. This should even work for vectors of different lengths, as long as the element types are the same.
(This is filed as a plain bug rather than anything more formal, since it's purely an implementation detail.)
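The detection described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not how the compiler implements it: `Arg` and `shuffle_mask` are hypothetical names, constructor arguments are reduced to "lane `lane` of input vector number `vector`" or "anything else", and both inputs are assumed to have the same lane count, as `shufflevector` requires.

```rust
/// Simplified view of one constructor argument (hypothetical type):
/// either a read of lane `lane` from input vector number `vector`,
/// or any other expression.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Arg {
    Lane { vector: usize, lane: u32 },
    Other,
}

/// If every argument reads a lane of one of at most two input vectors,
/// returns the operands (in order of first use) together with the shuffle
/// mask. Mask entries index the concatenation of the two operands, so a
/// lane of the second operand is offset by `lanes`.
/// Returns `None` when the construction cannot be a single shuffle.
pub fn shuffle_mask(args: &[Arg], lanes: u32) -> Option<(Vec<usize>, Vec<u32>)> {
    let mut operands: Vec<usize> = Vec::new();
    let mut mask = Vec::with_capacity(args.len());
    for arg in args {
        // Anything that is not an in-range lane read rules out a shuffle.
        let (vector, lane) = match *arg {
            Arg::Lane { vector, lane } if lane < lanes => (vector, lane),
            _ => return None,
        };
        // Find this vector among the operands seen so far, or add it.
        let pos = match operands.iter().position(|&v| v == vector) {
            Some(p) => p,
            None => {
                if operands.len() == 2 {
                    return None; // a third distinct vector: not one shuffle
                }
                operands.push(vector);
                operands.len() - 1
            }
        };
        mask.push(pos as u32 * lanes + lane);
    }
    Some((operands, mask))
}
```

For the `foo` example, the arguments `x.0, x.2, x.3, x.1` with `lanes = 4` give a single operand and the mask `[0, 2, 3, 1]`, matching the mask in the optimised IR.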
*shufflevector actually takes two operands, so f32x4(x.0, y.0, x.1, y.1) can also directly become a shuffle.
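To make the two-operand semantics concrete, here is a small model of a shuffle on plain arrays. It is a sketch for illustration only: `shuffle` is a hypothetical helper, and arrays stand in for SIMD vectors so that it compiles on stable Rust without the nightly features used in the snippet above. As with `shufflevector`, mask indices below `N` select from the first operand and indices from `N` up to `2N - 1` select from the second.

```rust
/// Array model of a two-operand shuffle (hypothetical helper).
/// Mask index `i` selects `a[i]` when `i < N`, otherwise `b[i - N]`.
/// Panics if a mask index is `2 * N` or larger.
pub fn shuffle<const N: usize, const M: usize>(
    a: [f32; N],
    b: [f32; N],
    mask: [usize; M],
) -> [f32; M] {
    let mut out = [0.0f32; M];
    for (slot, &i) in out.iter_mut().zip(mask.iter()) {
        *slot = if i < N { a[i] } else { b[i - N] };
    }
    out
}
```

With this model, `shuffle([1.0, 2.0, 3.0, 4.0], [0.0; 4], [0, 2, 3, 1])` reorders a single vector like `foo` does, and `shuffle([1.0, 2.0], [10.0, 20.0], [0, 2, 1, 3])` interleaves two 2-lane inputs into a 4-lane result, which is the different-lengths case mentioned above.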