-
Notifications
You must be signed in to change notification settings - Fork 24
Description
Proposal
Problem statement
There's currently no way to take advantage of compiler optimizations or hardware instructions if a naive implementation is written to extract multiple bits from an integer or deposit those bits into an integer according to a mask.
These operations (known as parallel bit extract/deposit, gather/scatter, or compress/expand) are useful building blocks for algorithms and data structures, and they are backed by hardware instructions on some platforms.
I will refer to these functions as some variant of "bit expand/compress" since I find those names more descriptive.
Motivating examples or use cases
Bit expand and compress would be good candidates for std inclusion because they are non-trivial to implement efficiently and may require the use of special intrinsics from the compiler to optimize portably according to the target platform, like the other bit operations of trailing_zeros and count_ones.
Expand and compress have use cases for working with variable length encodings (varints, Unicode), bit interleaving/deinterleaving (Morton codes/space-filling curves), perfect hashing/lookup tables, parsing tasks, and succinct data structures. They're also useful for expressing operations that would ordinarily take several masks and shifts, and simplifying that to one function call with a specified mask.
I'm not aware of any pattern that will get optimized into the desired instruction on x86 so the intrinsics must be used.
LLVM doesn't currently have a generic intrinsic but there has been discussion on the Discourse and there's a C++ proposal for C++26 that would include the functionality so it may be implemented in the future.
Solution sketch
A naive implementation is shown for each function. Nonzero and signed integers would be omitted.
/// Returns `self` with the bit locations specified by `mask` packed
/// contiguously into the least significant bits of the result.
/// ```
/// let n = 0b1011_1100;
///
/// assert_eq!(compress_bits(n, 0b0010_0100), 0b0000_0011);
/// assert_eq!(compress_bits(n, 0xF0), 0b0000_1011);
/// ```
pub const fn compress_bits(self_: u32, mask: u32) -> u32 {
let mut result = 0;
let mut i = 0;
let mut j = 0;
while i < u32::BITS {
let mask_bit = (mask >> i) & 1;
result |= (mask_bit & (self_ >> i)) << j;
j += mask_bit;
i += 1;
}
result
}
/// Returns a result with the least significant bits of `self` distributed to
/// the bit locations specified by `mask`.
/// ```
/// let n = 0b1010_1101;
///
/// assert_eq!(expand_bits(n, 0b0101_0101), 0b0101_0001);
/// assert_eq!(expand_bits(n, 0xF0), 0b1101_0000);
/// ```
pub const fn expand_bits(self_: u32, mask: u32) -> u32 {
let mut result = 0;
let mut i = 0;
let mut j = 0;
while i < u32::BITS {
let mask_bit = (mask >> i) & 1;
result |= (mask_bit & (self_ >> j)) << i;
j += mask_bit;
i += 1;
}
result
}A more optimal way to implement this is with an XOR parallel suffix as discussed in Hacker's Delight, or with native instructions.
The function can optimize to only a few instructions when the mask is known, as shown here.
Alternatives
- Do nothing and rely on third party crates implementing this in software or wrappers around the platform intrinsics.
- Wait for LLVM to implement the intrinsic.
As for naming, I think suitable Rust names would be either of the first two.
"Extract" is in my opinion too overloaded with meaning to be a good name for what it does.
compress_bits,expand_bitsgather_bits,scatter_bitsextract_bits,deposit_bits
Links and related work
PEXT, PDEP - Intel BMI2 (introduced with Haswell BMI2)
BEXT, BDEP - Arm SVE2
APL's Replicate and Expand
https://orlp.net/blog/extracting-depositing-bits/
https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set#Parallel_bit_deposit_and_extract
https://eisenwave.github.io/cpp-proposals/bit-permutations.html#interleaving-bits - C++26 proposal examples
https://discourse.llvm.org/t/how-would-i-go-about-implementing-new-bit-manipulation-builtins-for-my-proposal-especially-generic-builtins/76523
What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
- We think this problem seems worth solving, and the standard library might be the right place to solve it.
- We think that this probably doesn't belong in the standard library.
Second, if there's a concrete solution:
- We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
- We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.