
Introduce "dynamic swizzling" into LLVMIR and Rust intrinsics #242

Open

workingjubilee opened this issue Feb 6, 2022 · 9 comments
Labels: A-LLVM (Area: LLVM) · C-feature-request (Category: a feature request, i.e. not implemented / a PR) · C-tracking-issue (Ongoing issue with checkboxes for partial progress and such) · E-hard (Call for participation. Experience needed: Hard.)

Comments

workingjubilee (Contributor) commented Feb 6, 2022

There is a common instruction that performs what we refer to as a "swizzle" (a variable, runtime-determined lookup-table indexing into another vector, also known as a "shuffle"), available on almost all the architectures we support. However, there is no way to express this portably in LLVMIR.

Nonetheless, the logic for lowering this to target-specific instructions should already exist upstream in LLVM, in the form of the lowering for the wasm "dynamic swizzle". As we would like to use it in our API directly, that lowering should be generalized and made available for all platforms, since functionally all platforms have a reasonable equivalent (including x86, once you consider SSSE3 and pshufb: x86 Macs have it inherent in their baseline target, as would e.g. an x86-64-v3 target). Unfortunately, working in C++ is challenging to begin with, and LLVM's dialect of it is even more arcane.

But we can also potentially introduce this on our own side, before any movement is seen in LLVM, by choosing our own lowerings to LLVMIR using target-specific intrinsics or a generic scalar LUT pattern (sketched below). This is the worst answer for x86 compilation, however; ideally we would just use an LLVMIR intrinsic. But at least Cranelift should find adding this logic easy, as it is tilted towards serving wasm JIT compilation, and this IS a wasm instruction.
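
To pin down the semantics, here is a minimal scalar sketch of the "generic scalar LUT pattern" mentioned above, assuming the wasm-style rule that an out-of-range index produces zero (the function name is illustrative):

fn swizzle_dyn_scalar(v: [u8; 16], idx: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for i in 0..16 {
        // each output lane indexes `v` with a runtime-determined byte;
        // wasm's i8x16.swizzle defines out-of-range indices to produce 0
        out[i] = *v.get(idx[i] as usize).unwrap_or(&0);
    }
    out
}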

There was a relevant Zulip conversation here.

LLVM-side

  • Propose a new LLVMIR intrinsic that generalizes the wasm swizzle-lowering mechanism
  • PR it and get it merged with at least a generic desugaring
  • From there, reexport the lowering for llvm.wasm.swizzle to that intrinsic on x86-64 (the trick involved is sketched after this list)...
  • ...AArch64...
  • ...PowerPC...
  • ...anything else.
  • ...Make sure it can lower back to llvm.wasm.swizzle for wasm targets.
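
For a concrete picture of the x86-64 item above, here is a hedged library-level sketch of the well-known trick, written in Rust rather than as LLVM's actual C++ lowering: a saturating unsigned add pushes every index of 16 or more up past 0x80, and pshufb zeroes any lane whose index byte has its high bit set.

#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

// Sketch: wasm-swizzle semantics (out-of-range index => 0) on top of SSSE3.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn wasm_style_swizzle(v: __m128i, idx: __m128i) -> __m128i {
    // saturating add: indices 0..=15 keep their low nibble intact,
    // indices 16..=255 saturate upwards and gain the high bit,
    // which pshufb interprets as "write zero to this lane"
    let idx = _mm_adds_epu8(idx, _mm_set1_epi8(0x70));
    _mm_shuffle_epi8(v, idx)
}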

Rust-side

  • Introduce a platform intrinsic into the backend, plus a generic LLVMIR lowering
  • Pipe it through portable-simd and thus core::simd (a hypothetical surface is sketched after this list)
  • Introduce target-specific optimizations for AArch64 (optional)
  • Introduce target-specific optimizations for x86-64 + SSSE3 (optional)
  • Introduce target-specific optimizations for x86-64 + AVX(2?) (optional)
  • Introduce target-specific optimizations for PowerPC (optional)
  • Introduce target-specific optimizations for wasm (optional)
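
As a sketch of what the core::simd item above might eventually look like for users (the method name swizzle_dyn and its exact signature are hypothetical here, not a settled API):

#![feature(portable_simd)]
use core::simd::Simd;

fn reverse_bytes(bytes: Simd<u8, 16>) -> Simd<u8, 16> {
    let idxs = Simd::<u8, 16>::from_array([15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]);
    // hypothetical: lane i of the result is bytes[idxs[i]],
    // or 0 wherever idxs[i] is out of range
    bytes.swizzle_dyn(idxs)
}
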
workingjubilee added the C-feature-request, A-LLVM, E-hard, and C-tracking-issue labels on Feb 6, 2022
workingjubilee (Contributor, Author) commented Feb 6, 2022

It should be noted this can also be seen as a weakening of the shufflevector instruction to accept a non-constant ("register") argument. However, an instruction is more deeply embedded in the logic of LLVM, and altering an instruction may involve a change to the LLVM "bitcode" format, so alterations to an instruction are less likely to be accepted.

Thus, this is more likely to be accepted if defined as an LLVM intrinsic function, but the distinction isn't terribly important from our perspective.

Arguably, it is also an instance of llvm.masked.gather.* but loading from a register instead of memory. However, using that would involve storing, gather-loading, and then hoping mem2reg magically has an optimization that cleans up after us and turns it into pshufb or vtbl. That's... quite a bit more magical than I would like.
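
Concretely, the gather route amounts to this memory round trip, modeled here in scalar Rust (a sketch; the function name is illustrative):

fn swizzle_via_memory(v: [u8; 16], idx: [u8; 16]) -> [u8; 16] {
    // 1. store the vector to a stack slot
    let spilled: [u8; 16] = v;
    let mut out = [0u8; 16];
    // 2. gather-load each lane back through its runtime index
    for i in 0..16 {
        out[i] = *spilled.get(idx[i] as usize).unwrap_or(&0);
    }
    // 3. ...and hope the optimizer folds the round trip back into
    //    a single register permute (pshufb, vtbl, ...)
    out
}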

workingjubilee (Contributor, Author) commented

It seems the GCC backend can already do this essentially "as-is", so we might as well aim to implement the intrinsic first on the Rust side so that cg_gccjit can implement it as well. We also ought to start drawing intrinsics into cg_ssa.

programmerjake (Member) commented

One important subset of dynamic swizzling that we should probably have separate operations for is compress/expand: their requirements of not duplicating and not reordering elements make them generally quite hard for a compiler to detect, afaict. They can use more efficient instructions on some architectures (RISC-V has a reg->reg vcompress instruction, and for SimpleV compress/expand can be done as part of most unary instructions), and they take their element selection as a mask rather than a vector of indexes. A scalar model of both is sketched below.
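
A scalar model of the two operations, to make the semantics concrete (a sketch, following the AVX-512-style convention of zeroing unselected lanes; mask bit i selects lane i):

// compress: mask-selected lanes of `v`, packed towards lane 0; the rest zeroed
fn compress16(v: [u32; 16], mask: u16) -> [u32; 16] {
    let mut out = [0u32; 16];
    let mut j = 0;
    for i in 0..16 {
        if (mask >> i) & 1 != 0 {
            out[j] = v[i];
            j += 1;
        }
    }
    out
}

// expand: the inverse, spreading the first popcount(mask) lanes of `v`
// out to the mask-selected positions; the rest zeroed
fn expand16(v: [u32; 16], mask: u16) -> [u32; 16] {
    let mut out = [0u32; 16];
    let mut j = 0;
    for i in 0..16 {
        if (mask >> i) & 1 != 0 {
            out[i] = v[j];
            j += 1;
        }
    }
    out
}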

programmerjake (Member) commented

llvm has intrinsics for compress/expand combined with a load/store (llvm.masked.compressstore.* / llvm.masked.expandload.*), but doesn't yet have compress/expand as separate register-to-register ops.

jhorstmann commented

> One important subset of dynamic swizzling that we should probably have separate operations for is compress/expand: their requirements of not duplicating and not reordering elements make them generally quite hard for a compiler to detect, afaict.

I was working on a prefix sum algorithm yesterday and was surprised that LLVM was actually turning some of my permutes into expand instructions.

The pattern that was optimized looked like:

// lanes 0 and 1 are zeroed by the maskz; lane i (for i >= 2) becomes input[i - 2],
// i.e. a shift up by two lanes with zero fill, which is exactly an expand pattern
_mm512_maskz_permutexvar_epi32(
    0b1111_1111_1111_1100,
    _mm512_set_epi32(13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 0, 0),
    input)

The optimization is probably x86-specific, and it didn't happen for all permutes following that pattern. That might be because expand has a somewhat higher latency on x86.

Regarding the "portable" branding, I'm wondering how portable swizzles on vectors wider than 128 bits actually are. AVX2 AFAIK has only a 256-bit swizzle for i32/f32 lanes (vpermd/vpermps, which can also be used to emulate i64/f64 swizzles), and even with AVX512 you need the Cannon Lake/Ice Lake generation for >128-bit byte and word swizzles. ARM NEON is AFAIK also limited to 128-bit swizzles; I don't know the support status of SVE.

With that in mind, would it be reasonable to only "portably" support swizzles on 128-bit vectors?

workingjubilee (Contributor, Author) commented May 25, 2022

> With that in mind, would it be reasonable to only "portably" support swizzles on 128-bit vectors?

That's not necessarily what is best, actually. If an "LLVM vector" is larger than what a "machine vector" can handle effectively, LLVM is allowed to use that information to improve its scheduling as it interlaces multiple machine instructions to satisfy the request. The limit only makes sense if you assume a 1:1 mapping between LLVM instructions and machine instructions, but that was never the case.

And from the Rust perspective, such a limit just adds another painful predicate that needs to be guaranteed in the source, with not much benefit if the programmer was going to repeat the operation over multiple 128-bit segments anyways.

The size limits we currently have in place on vectors are more a product of LLVM producing compilation errors at larger sizes, and of rustc not yet having the full generics capability we would like for expressing a more fluent boundary.

programmerjake (Member) commented

Imho the object of portable-simd isn't to support just what's widely available as a single instruction, but closer to what's available on at least a few CPUs (or what we otherwise deem important enough) and what LLVM can produce correct code for basically everywhere, even if it isn't a single SIMD instruction.

FallingSnow commented

Is there a workaround for getting a dynamic shuffle, or is the best option to use runtime detection and _mm_shuffle_epi8, _mm256_shuffle_epi8, _mm512_shuffle_epi8, vqtbl1q_u8, vec_perm, or __builtin_shuffle?

workingjubilee (Contributor, Author) commented

Wow, uh, after opening this issue... things became very busy in my life. But I'm back to the vector mines! And I decided to start things off in a slightly more roundabout way: in #334 I have introduced a demo of byte-level dynamic swizzling for "one vector of bytes, one vector of index bytes" on wasm, AArch64, and x86 (including the SSSE3, AVX2, and AVX512VBMI feature levels), using "library code" (a pile of intrinsics).

The way I implemented the AVX2 version illuminates a path forward for more "arbitrary" implementations. It isn't the best codegen to be quite honest, but I looked at the scalar version and... woof. Still winning. In fact, the performance could probably get better if I went behind LLVM's back entirely, whipped out asm!, and hand-picked the instructions, but I want to have benches for that before I start in on it.

My intention is to introduce the intrinsic in Rust and have a desugaring step in our backend that does essentially what my library version does, hitting LLVM's "target intrinsics". Then, having written the code into our codegen, I'll try to port that from Rust to "LLVM C++".

So, @FallingSnow, the answer is that soon enough it'll be available as a function in our library. You'll still want to multiversion it, though; a minimal dispatch sketch follows.
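
In the meantime, runtime dispatch looks roughly like this (a minimal sketch; swizzle16_ssse3 and swizzle16_scalar are hypothetical helpers along the lines of the earlier sketches in this thread):

fn swizzle16(v: [u8; 16], idx: [u8; 16]) -> [u8; 16] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("ssse3") {
            // SAFETY: the runtime check above guarantees SSSE3 is available
            return unsafe { swizzle16_ssse3(v, idx) }; // hypothetical helper
        }
    }
    swizzle16_scalar(v, idx) // hypothetical portable fallback
}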
