Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider compiling SIMD-from-SIMD initialisation directly to a shuffle #18148

Closed
huonw opened this Issue Oct 19, 2014 · 4 comments

Comments

Projects
None yet
4 participants
@huonw
Copy link
Member

huonw commented Oct 19, 2014

#![crate_type = "lib"]
#![feature(tuple_indexing)]

use std::simd::f32x4;

pub fn foo(x: f32x4) -> f32x4 {
    f32x4(x.0, x.2, x.3, x.1)
}

becomes, with no optimisations:

define <4 x float> @_ZN3foo20h2254f602671f886ceaaE(<4 x float>) unnamed_addr #0 {
entry-block:
  %sret_slot = alloca <4 x float>
  %x = alloca <4 x float>
  store <4 x float> %0, <4 x float>* %x
  %1 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 0
  %2 = getelementptr inbounds <4 x float>* %x, i32 0, i32 0
  %3 = load float* %2
  store float %3, float* %1
  %4 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 1
  %5 = getelementptr inbounds <4 x float>* %x, i32 0, i32 2
  %6 = load float* %5
  store float %6, float* %4
  %7 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 2
  %8 = getelementptr inbounds <4 x float>* %x, i32 0, i32 3
  %9 = load float* %8
  store float %9, float* %7
  %10 = getelementptr inbounds <4 x float>* %sret_slot, i32 0, i32 3
  %11 = getelementptr inbounds <4 x float>* %x, i32 0, i32 1
  %12 = load float* %11
  store float %12, float* %10
  %13 = load <4 x float>* %sret_slot
  ret <4 x float> %13
}

with optimisations it becomes

define <4 x float> @_ZN3foo20h2254f602671f886ceaaE(<4 x float>) unnamed_addr #0 {
entry-block:
  %sret_slot.12.vec.insert = shufflevector <4 x float> %0, <4 x float> undef, <4 x i32> <i32 0, i32 2, i32 3, i32 1>
  ret <4 x float> %sret_slot.12.vec.insert
}

We could detect when a SIMD vector is being created directly from elements of another (pair of*) SIMD vector(s) and convert it directly into the appropriate shuffle instruction. This will save allocas and LLVM doing work, and probably guarantees it more than LLVM currently does. This should even work for vectors of different lengths, as long as the elements are the same.

(This is just a bug since it's an implementation detail.)

*shufflevector actually takes two operands, so f32x2(x.0, y.0, x.1, y.1) can also directly become a shuffle.

@eddyb

This comment has been minimized.

Copy link
Member

eddyb commented Oct 19, 2014

If we would stop using allocas in that case, it would be much easier for LLVM to combine extractelement and insertelement into shufflevector.
I have somewhat of a plan for that, and @luqmana has shown interest in helping, but I can't promise anything concrete (I've wanted to do it for more than half an year now, with #12424 and #13004 as my first attempts).

@steveklabnik

This comment has been minimized.

Copy link
Member

steveklabnik commented Oct 27, 2015

Two things:

  1. simd has been moved out, and @huonw, you've been doing a bunch of work on stuff, so I don't know how much of a bug this still is
  2. the unoptimized version seems to remove one getelementptr now:
define <4 x float> @_ZN3foo20h88e8b14ab8916c70faaE(<4 x float>) unnamed_addr #0 {
entry-block:
  %dropflag_hint_7 = alloca i8
  %x = alloca <4 x float>
  store i8 61, i8* %dropflag_hint_7
  store <4 x float> %0, <4 x float>* %x, align 16
  %1 = getelementptr inbounds <4 x float>, <4 x float>* %x, i32 0, i32 0
  %2 = load float, float* %1, align 4
  %3 = insertelement <4 x float> undef, float %2, i64 0
  %4 = getelementptr inbounds <4 x float>, <4 x float>* %x, i32 0, i32 2
  %5 = load float, float* %4, align 4
  %6 = insertelement <4 x float> %3, float %5, i64 1
  %7 = getelementptr inbounds <4 x float>, <4 x float>* %x, i32 0, i32 3
  %8 = load float, float* %7, align 4
  %9 = insertelement <4 x float> %6, float %8, i64 2
  %10 = getelementptr inbounds <4 x float>, <4 x float>* %x, i32 0, i32 1
  %11 = load float, float* %10, align 4
  %12 = insertelement <4 x float> %9, float %11, i64 3
  ret <4 x float> %12
}

updated code:

#![crate_type = "lib"]
#![feature(core_simd)]

use std::simd::f32x4;

pub fn foo(x: f32x4) -> f32x4 {
        f32x4(x.0, x.2, x.3, x.1)
}
@huonw

This comment has been minimized.

Copy link
Member Author

huonw commented Oct 27, 2015

simd has been moved out, and @huonw, you've been doing a bunch of work on stuff, so I don't know how much of a bug this still is

This is a code-generation thing, so external SIMD would still benefit.

the unoptimized version seems to remove one getelementptr now:

There's still 4 that could theoretically be removed.

(Thanks for prompting for updates!)

@brson

This comment has been minimized.

Copy link
Contributor

brson commented Mar 23, 2017

Closing wishlist.

@brson brson closed this Mar 23, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.