Read batch of values for RLE, DELTA_BINARY_PACKED and PLAIN decoding #80
Conversation
This adds a `get_batch` method to `BitReader`, which is used by RLE, DELTA_BINARY_PACKED, and PLAIN decoding. It is also used indirectly by DICT encoding, which uses RLE encoding.

This is done similarly to PARQUET-671, where we use the fast unpacking from https://github.com/lemire/FrameOfReference to unpack bit-packed integers (each batch has 32 values). This is then used in:

1. PLAIN encoding, when decoding boolean types
2. DELTA_BINARY_PACKED encoding, when decoding deltas for each mini-block
3. RLE encoding, when bit-packing is used
4. Level decoder, when bit-packing is used

Test results show this can improve performance a lot:

Before:

test delta_bit_pack_i32_1k_128 ... bench: 46,039 ns/iter (+/- 9,908) = 31 MB/s
test delta_bit_pack_i32_1k_32 ... bench: 49,225 ns/iter (+/- 56,843) = 29 MB/s
test delta_bit_pack_i32_1k_64 ... bench: 47,451 ns/iter (+/- 26,391) = 30 MB/s
test delta_bit_pack_i32_1m_128 ... bench: 45,701,937 ns/iter (+/- 6,000,442) = 32 MB/s
test delta_bit_pack_i32_1m_32 ... bench: 46,565,152 ns/iter (+/- 11,740,361) = 32 MB/s
test delta_bit_pack_i32_1m_64 ... bench: 44,812,203 ns/iter (+/- 4,263,573) = 33 MB/s
test dict_i32_1k_128 ... bench: 36,773 ns/iter (+/- 3,072) = 34 MB/s
test dict_i32_1k_32 ... bench: 37,843 ns/iter (+/- 14,263) = 33 MB/s
test dict_i32_1k_64 ... bench: 37,140 ns/iter (+/- 3,701) = 34 MB/s
test dict_i32_1m_128 ... bench: 37,544,476 ns/iter (+/- 2,914,721) = 34 MB/s
test dict_i32_1m_32 ... bench: 37,173,955 ns/iter (+/- 1,527,749) = 35 MB/s
test dict_i32_1m_64 ... bench: 36,951,548 ns/iter (+/- 1,694,937) = 35 MB/s
test dict_str_1m_128 ... bench: 39,279,048 ns/iter (+/- 4,481,945) = 13 MB/s

After:

test delta_bit_pack_i32_1k_128 ... bench: 6,274 ns/iter (+/- 989) = 233 MB/s
test delta_bit_pack_i32_1k_32 ... bench: 6,295 ns/iter (+/- 722) = 232 MB/s
test delta_bit_pack_i32_1k_64 ... bench: 6,340 ns/iter (+/- 325) = 230 MB/s
test delta_bit_pack_i32_1m_128 ... bench: 5,981,495 ns/iter (+/- 333,282) = 249 MB/s
test delta_bit_pack_i32_1m_32 ... bench: 6,073,897 ns/iter (+/- 334,358) = 245 MB/s
test delta_bit_pack_i32_1m_64 ... bench: 6,311,875 ns/iter (+/- 317,241) = 236 MB/s
test dict_i32_1k_128 ... bench: 3,060 ns/iter (+/- 331) = 419 MB/s
test dict_i32_1k_32 ... bench: 6,734 ns/iter (+/- 590) = 190 MB/s
test dict_i32_1k_64 ... bench: 4,321 ns/iter (+/- 195) = 297 MB/s
test dict_i32_1m_128 ... bench: 3,879,224 ns/iter (+/- 200,880) = 338 MB/s
test dict_i32_1m_32 ... bench: 9,877,528 ns/iter (+/- 405,837) = 132 MB/s
test dict_i32_1m_64 ... bench: 5,911,826 ns/iter (+/- 265,232) = 222 MB/s
test dict_str_1m_128 ... bench: 7,353,068 ns/iter (+/- 583,300) = 71 MB/s
Great improvement! I will have a closer look later today to learn how it is done.
Looks good! Just left some minor comments.
Quite complicated code, is there some sort of summary that I could read to understand this? :)
Cargo.toml
Outdated
@@ -23,4 +23,4 @@ brotli = "1.1.2"
 flate2 = "0.2"
 rand = "0.4"
 thrift = "0.0.4"
-x86intrin = "0.4.3"
+x86intrin = "0.4.3"
\ No newline at end of file
Looks like there is no newline at the end of the file.
Yes, I'll revert this change.
assert!(loaded == self.values_current_mini_block);
} else {
  for _ in 0..self.values_current_mini_block {
    // TODO: load one batch at a time similar to int32
What currently stops us from doing this?
`bit_packing::unpack32` only handles bit widths up to 32, but these values are int64, so they cannot be processed this way.
unsafe {
  let in_buf = &self.buffer.data()[self.byte_offset..];
  let mut in_ptr = in_buf as *const [u8] as *const u8 as *const u32;
Ha, interesting cast.
src/util/mod.rs
Outdated
@@ -21,3 +21,4 @@ pub mod test_common;
 #[macro_use]
 pub mod bit_util;
 pub mod hash_util;
+pub mod bit_packing;
Should this be just `mod bit_packing`?
Yes, will remove the `pub`.
@sadikovi Thanks for the review. The core of this patch is to apply the ideas from https://github.com/lemire/FrameOfReference, which unpacks 32 bit-packed values at a time in a tight sequence, with very few instructions per value. This is implemented in the `bit_packing` module.
@sunchao Thanks for the explanation and for your effort to improve performance with these code changes. We should merge it!
Merged. Thanks @sadikovi for the review!
No problem, thanks for the improvements!
Fix #63