
Add compress store / expand load intrinsics & maybe register -> register variant too #241 #240

Open
jorgecarleitao opened this issue Feb 6, 2022 · 10 comments
Labels
A-LLVM Area: LLVM C-feature-request Category: a feature request, i.e. not implemented / a PR

Comments

@jorgecarleitao
Contributor

jorgecarleitao commented Feb 6, 2022

One of the most important operations in analytics is the SQL-equivalent of a WHERE clause.

For SIMD, this is an operation that can be summarized as (pseudo rust code)

fn compress_store<T: Copy, const LANES: usize>(array: [T; LANES], mask: Mask<i8, LANES>, receiver: &mut [T; LANES], selected: &mut usize)

roughly translated to:

fn compress_store<T: Copy, const LANES: usize>(
    array: [T; LANES],
    mask: Mask<i8, LANES>,
    receiver: &mut [T; LANES],
    selected: &mut usize,
) where
    LaneCount<LANES>: SupportedLaneCount,
{
    *selected = 0;
    for (i, value) in array.iter().enumerate() {
        if mask.test(i) {
            // Write kept lanes contiguously and count how many were selected.
            receiver[*selected] = *value;
            *selected += 1;
        }
    }
}
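
For example, a minimal sketch of calling the function above (the values and lane count are illustrative):

let input: [u32; 8] = [3, 4, 7, 8, 10, 11, 12, 15];
let mask = Mask::<i8, 8>::from_array([false, true, false, true, true, false, true, false]);
let mut out = [0u32; 8];
let mut n = 0;
compress_store(input, mask, &mut out, &mut n);
assert_eq!(&out[..n], &[4, 8, 10, 12]);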

As pointed out by @programmerjake on Zulip, there are instructions for this.

IMO this is sufficiently relevant for portable-simd to support.

@jorgecarleitao jorgecarleitao added the C-feature-request Category: a feature request, i.e. not implemented / a PR label Feb 6, 2022
@jorgecarleitao jorgecarleitao changed the title Add "llvm.masked.expandload.*" Intrinsics Add "llvm.masked.compressstore.*" Intrinsics Feb 6, 2022
@programmerjake
Member

oops, we created bug reports simultaneously...description from #241:
Add support for llvm.masked.compressstore.* and llvm.masked.expandload.* intrinsics.

Also, I don't know if there are portable LLVM intrinsics for them, but supporting register-register compress/expand (without the memory access) would be nice too...those are supported on at least AVX512, SVE, RISC-V V, and SimpleV.

Zulip Discussion
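
For reference, here's a minimal scalar sketch of what the reg->reg variants would compute; the names compress/expand and the fill parameter are illustrative, not an existing API:

#![feature(portable_simd)]
use std::simd::{Mask, Simd};

// compress: selected lanes are packed to the front; remaining lanes get `fill`.
fn compress(v: Simd<u32, 8>, mask: Mask<i32, 8>, fill: u32) -> Simd<u32, 8> {
    let v = v.to_array();
    let mut out = [fill; 8];
    let mut n = 0;
    for i in 0..8 {
        if mask.test(i) {
            out[n] = v[i];
            n += 1;
        }
    }
    Simd::from_array(out)
}

// expand: the inverse; input values are unpacked, in order, into the selected lanes.
fn expand(v: Simd<u32, 8>, mask: Mask<i32, 8>, fill: u32) -> Simd<u32, 8> {
    let v = v.to_array();
    let mut out = [fill; 8];
    let mut n = 0;
    for i in 0..8 {
        if mask.test(i) {
            out[i] = v[n];
            n += 1;
        }
    }
    Simd::from_array(out)
}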

@programmerjake programmerjake changed the title Add "llvm.masked.compressstore.*" Intrinsics Add compress store / expand load intrinsics & maybe register -> register variant too #241 Feb 6, 2022
@calebzulawski
Member

The in-register variant actually seems more useful to me. What ends up in the remaining lanes? Zeros?

@programmerjake
Member

x86 supports leaving either zeros or whatever was there before in the unused lanes of the destination register.

@workingjubilee
Contributor

Wouldn't that get fixed up by LLVM's mem2reg pass?

@programmerjake
Member

Wouldn't that get fixed up by LLVM's mem2reg pass?

compress store / expand load would probably not be cleaned up by LLVM, because IIRC it doesn't have arch-independent intrinsics for reg->reg compress/expand.

@programmerjake
Member

one major use case is speeding up [T]::sort -- used in the AVX-512 fast sort mentioned here: https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/Fast.20sorting/near/328108693
source code (C++): https://github.com/intel/x86-simd-sort

I haven't looked, but I'd assume it can likely be converted to Rust using portable-simd if we added compress-store

@workingjubilee workingjubilee added the A-LLVM Area: LLVM label Mar 4, 2023
@farnoy
Contributor

farnoy commented Jan 4, 2024

I don't see a register-to-register LLVM intrinsic for compress/expand. Is the only version they have the one with a memory operand?

I thought it would be easier to implement register versions of these and then use my additions for masked load & store #374 to complete these operations. This is what I do in my program using my own fork of the library. It's also more optimal than doing compressstore as one operation on Zen 4, since LLVM doesn't know about the performance cliff of that instruction on this specific uarch.

But if we wanted to emulate compress/expand using swizzle_dyn, it would limit our options heavily. A generic LUT to compress u8x16 elements is 2^16 * 16B = 1MiB. Using a const expression to build it takes a long time to evaluate and the table pollutes the binary, and it's only one of the multiple variants we'd need. It would be more practical for larger elements, since u32x8 would only need a LUT that is 2^8 * 32B = 8KiB. One blocker for this is that swizzle_dyn is only implemented for Simd<u8, N> today; I'm not sure why.

I am using an approach like this myself, in application code, but I'm also exploiting domain-specific behavior to know which compress masks are possible in my use case and have a two-level LUT that's three orders of magnitude smaller.
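
To make the table idea concrete, here's a hedged sketch of how such a compile-time LUT could be built for 8 lanes; it stores 8 index bytes per mask (2 KiB) to feed a swizzle, rather than full 32-byte element vectors, and the names are purely illustrative:

// For each 8-bit mask, the source-lane indices that pack the selected lanes
// to the front (unused slots stay 0 here; an arbitrary, illustrative choice).
const fn compress_indices(mask: u8) -> [u8; 8] {
    let mut out = [0u8; 8];
    let mut n = 0;
    let mut i = 0;
    while i < 8 {
        if mask & (1 << i) != 0 {
            out[n] = i as u8;
            n += 1;
        }
        i += 1;
    }
    out
}

// 256 entries * 8 index bytes = 2 KiB of indices.
const COMPRESS_LUT: [[u8; 8]; 256] = {
    let mut table = [[0u8; 8]; 256];
    let mut m = 0;
    while m < 256 {
        table[m] = compress_indices(m as u8);
        m += 1;
    }
    table
};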

@programmerjake
Copy link
Member

programmerjake commented Jan 4, 2024

I don't see a register-to-register LLVM intrinsic for compress/expand. Is the only version they have the one with a memory operand?

yes, unfortunately (unless you use target-specific intrinsics, which we're trying to avoid as part of being portable).

I thought it would be easier to implement register versions of these and then use my additions for masked load & store #374 to complete these operations. This is what I do in my program using my own fork of the library. It's also more optimal than doing compressstore as one operation on Zen 4, since LLVM doesn't know about the performance cliff of that instruction on this specific uarch.

that sounds like an optimization that should be added to LLVM, ideally rustc's job should be to translate (after optimizations) to whatever LLVM IR is the target-independent canonical representation and LLVM's job is to translate that to the best target-specific code.

actually, now that I'm looking at it, 128-bit vpcompressb takes only 2 clock cycles on Zen 4, which is really fast (it should be faster than a load, dynamic swizzle, and store sequence), so I don't know what performance cliff you're referring to... timings are from Agner Fog's instruction tables PDF

But if we wanted to emulate compress/expand using swizzle_dyn, it would limit our options heavily. A generic LUT to compress u8x16 elements is 2^16 * 16B = 1MiB. Using a const expression to build it takes a long time to evaluate and the table pollutes the binary, and it's only one of the multiple variants we'd need. It would be more practical for larger elements, since u32x8 would only need a LUT that is 2^8 * 32B = 8KiB.

if the look-up table doesn't fit in the L1 cache, compress store is almost certainly faster.

One blocker for this is that swizzle_dyn is only implemented for Simd<u8, N> today; I'm not sure why.

because we want LLVM to gain a target-independent dynamic swizzle IR operation, which it doesn't currently have, so we're currently using the x86 dynamic byte swizzle intrinsic so we can experiment with it and figure out a good API without waiting on LLVM.

@farnoy
Contributor

farnoy commented Jan 4, 2024

that sounds like an optimization that should be added to LLVM, ideally rustc's job should be to translate (after optimizations) to whatever LLVM IR is the target-independent canonical representation and LLVM's job is to translate that to the best target-specific code.

I totally agree, I just mentioned it in passing. I opened up an issue with LLVM about this but have not heard anything since.

so idk what performance cliff you're referring to

The cliff is on compressstore specifically. Register compress, masked store and even expandload are all "fast" - in line with what you'd expect. Agner doesn't list vpcompressb with an m512 operand, but it's shown here: https://uops.info/table.html

if the look-up table doesn't fit in the L1 cache, compress store is almost certainly faster.

I'm proposing to use the look-up table only if there's no native instruction available. AVX512 and SVE (for 32b lanes at least) seem to have instructions for this. And only use a LUT for <=16 lanes, falling back to a scalar impl for bigger vectors, just like swizzle_dyn. I think that's sensible.
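
For the AVX512 case, the in-register form is a single instruction; here's a hedged sketch using the Intel-named intrinsic as exposed through core::arch (assuming a toolchain where the AVX-512 intrinsics are available; the wrapper name is illustrative):

#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m512i, __mmask16, _mm512_maskz_compress_epi32};

// Packs the 32-bit lanes selected by `mask` to the front of the result and
// zeroes the remaining lanes (the "maskz" zeroing form).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx512f")]
unsafe fn compress_zero_u32x16(v: __m512i, mask: __mmask16) -> __m512i {
    unsafe { _mm512_maskz_compress_epi32(mask, v) }
}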

because we want LLVM to gain a target-independent dynamic swizzle IR operation, which it doesn't currently have, so we're currently using the x86 dynamic byte swizzle intrinsic so we can experiment with it and figure out a good API without waiting on LLVM.

Is this new swizzle op going to be parameterized for different element types, or fixed to support only bytes? IOW, would we implement Simd::<u16, N>::swizzle_dyn with an endian-aware wrapper around Simd::<u8, N>::swizzle_dyn or directly with a different monomorphization of a platform-intrinsic leading to a slightly different LLVM IR op? If it's the first option, then we should be able to add that today, by using the existing implementation we have for bytes.
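
For the first option, a minimal little-endian sketch of such a wrapper (hypothetical function name, fixed to 8 lanes for brevity; a real implementation would handle big-endian targets and arbitrary lane counts):

#![feature(portable_simd)]
use std::simd::{u16x8, u8x16, Simd};

fn swizzle_dyn_u16(values: u16x8, idx: u16x8) -> u16x8 {
    // Each u16 lane i occupies bytes 2*i and 2*i + 1 (little-endian assumption).
    let idx = idx.to_array();
    let mut byte_idx = [0u8; 16];
    for lane in 0..8 {
        let src = (idx[lane] as u8).wrapping_mul(2);
        byte_idx[2 * lane] = src;
        byte_idx[2 * lane + 1] = src.wrapping_add(1);
    }
    let bytes: u8x16 = unsafe { core::mem::transmute(values) };
    let shuffled = bytes.swizzle_dyn(Simd::from_array(byte_idx));
    unsafe { core::mem::transmute(shuffled) }
}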
