
Read batch of values for RLE, DELTA_BINARY_PACKED and PLAIN decoding #80

Merged
merged 4 commits into from
Apr 10, 2018

Conversation

sunchao
Owner

@sunchao sunchao commented Apr 8, 2018

This adds a `get_batch` method to `BitReader`, which is used by
RLE, DELTA_BINARY_PACKED, and PLAIN decoding. It is also used indirectly
by DICT encoding, which uses RLE encoding.

This is done similarly to PARQUET-671: we use the fast unpacking approach
from https://github.com/lemire/FrameOfReference to unpack bit-packed
integers (each batch has 32 values; see the sketch after the list below).
This is then used in:

  1. PLAIN encoding, when decoding boolean types
  2. DELTA_BINARY_PACKED encoding, when decoding deltas for each mini-block
  3. RLE encoding, when bit-packing is used
  4. Level decoder, when bit-packing is used
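
For illustration only, here is a minimal scalar sketch of the 32-value unpacking idea. The function name and signature are assumptions, not the patch's actual code; the real routines in `bit_packing.rs` are specialized, branch-free sequences generated per bit width:

```rust
/// Sketch: unpack 32 little-endian bit-packed values of `bit_width` bits
/// each from `input` into `output`. A real unpacker avoids the inner loop
/// by emitting one fixed sequence of shifts and masks per bit width.
fn unpack32_scalar(input: &[u8], output: &mut [u32; 32], bit_width: usize) {
    assert!(bit_width <= 32 && input.len() >= 4 * bit_width);
    let mut bit_offset = 0;
    for out in output.iter_mut() {
        let byte = bit_offset / 8;
        let shift = bit_offset % 8;
        // Load just enough bytes to cover a value that may straddle bytes.
        let nbytes = (shift + bit_width + 7) / 8;
        let mut v: u64 = 0;
        for (i, b) in input[byte..byte + nbytes].iter().enumerate() {
            v |= (*b as u64) << (8 * i);
        }
        *out = ((v >> shift) & ((1u64 << bit_width) - 1)) as u32;
        bit_offset += bit_width;
    }
}
```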

Benchmark results show a substantial performance improvement:

Before:

test delta_bit_pack_i32_1k_128 ... bench:      46,039 ns/iter (+/- 9,908) = 31 MB/s
test delta_bit_pack_i32_1k_32  ... bench:      49,225 ns/iter (+/- 56,843) = 29 MB/s
test delta_bit_pack_i32_1k_64  ... bench:      47,451 ns/iter (+/- 26,391) = 30 MB/s
test delta_bit_pack_i32_1m_128 ... bench:  45,701,937 ns/iter (+/- 6,000,442) = 32 MB/s
test delta_bit_pack_i32_1m_32  ... bench:  46,565,152 ns/iter (+/- 11,740,361) = 32 MB/s
test delta_bit_pack_i32_1m_64  ... bench:  44,812,203 ns/iter (+/- 4,263,573) = 33 MB/s
test dict_i32_1k_128           ... bench:      36,773 ns/iter (+/- 3,072) = 34 MB/s
test dict_i32_1k_32            ... bench:      37,843 ns/iter (+/- 14,263) = 33 MB/s
test dict_i32_1k_64            ... bench:      37,140 ns/iter (+/- 3,701) = 34 MB/s
test dict_i32_1m_128           ... bench:  37,544,476 ns/iter (+/- 2,914,721) = 34 MB/s
test dict_i32_1m_32            ... bench:  37,173,955 ns/iter (+/- 1,527,749) = 35 MB/s
test dict_i32_1m_64            ... bench:  36,951,548 ns/iter (+/- 1,694,937) = 35 MB/s
test dict_str_1m_128           ... bench:  39,279,048 ns/iter (+/- 4,481,945) = 13 MB/s

After:

test delta_bit_pack_i32_1k_128 ... bench:       6,274 ns/iter (+/- 989) = 233 MB/s
test delta_bit_pack_i32_1k_32  ... bench:       6,295 ns/iter (+/- 722) = 232 MB/s
test delta_bit_pack_i32_1k_64  ... bench:       6,340 ns/iter (+/- 325) = 230 MB/s
test delta_bit_pack_i32_1m_128 ... bench:   5,981,495 ns/iter (+/- 333,282) = 249 MB/s
test delta_bit_pack_i32_1m_32  ... bench:   6,073,897 ns/iter (+/- 334,358) = 245 MB/s
test delta_bit_pack_i32_1m_64  ... bench:   6,311,875 ns/iter (+/- 317,241) = 236 MB/s
test dict_i32_1k_128           ... bench:       3,060 ns/iter (+/- 331) = 419 MB/s
test dict_i32_1k_32            ... bench:       6,734 ns/iter (+/- 590) = 190 MB/s
test dict_i32_1k_64            ... bench:       4,321 ns/iter (+/- 195) = 297 MB/s
test dict_i32_1m_128           ... bench:   3,879,224 ns/iter (+/- 200,880) = 338 MB/s
test dict_i32_1m_32            ... bench:   9,877,528 ns/iter (+/- 405,837) = 132 MB/s
test dict_i32_1m_64            ... bench:   5,911,826 ns/iter (+/- 265,232) = 222 MB/s
test dict_str_1m_128           ... bench:   7,353,068 ns/iter (+/- 583,300) = 71 MB/s

Fix #63

@coveralls

Coverage Status

Coverage increased (+1.9%) to 94.783% when pulling 1fab4e7 on bit-packing into 266766f on master.

@coveralls

coveralls commented Apr 8, 2018

Coverage Status

Coverage increased (+1.9%) to 94.772% when pulling f2e380c on bit-packing into 266766f on master.

@sadikovi
Collaborator

sadikovi commented Apr 8, 2018

Great improvement! I will have a closer look later today to learn how it is done.

Collaborator

@sadikovi sadikovi left a comment


Looks good! Just left some minor comments.

This code is quite complicated; is there a summary somewhere I could read to understand it? :)

Cargo.toml Outdated
@@ -23,4 +23,4 @@ brotli = "1.1.2"
 flate2 = "0.2"
 rand = "0.4"
 thrift = "0.0.4"
-x86intrin = "0.4.3"
+x86intrin = "0.4.3"
\ No newline at end of file
Collaborator

Looks like there is no newline at the end of the file.

Owner Author

Yes, I'll revert this change.

    assert!(loaded == self.values_current_mini_block);
} else {
    for _ in 0..self.values_current_mini_block {
        // TODO: load one batch at a time similar to int32
Collaborator

What currently stops us from doing this?

Owner Author

`bit_packing::unpack32` only handles values up to 32 bits wide, but these are int64 values, so they cannot be processed that way.
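
To illustrate, here is a minimal sketch of the per-value fallback for wide values (the function name is an assumption, not the crate's actual `get_value` internals); widths over 32 cannot go through `unpack32`, which only produces `u32` outputs:

```rust
/// Sketch: read one little-endian bit-packed value of up to 64 bits.
/// This per-value path is what the mini-block loop falls back to for
/// int64 deltas, whose bit width can exceed 32.
fn read_value_u64(data: &[u8], bit_offset: usize, bit_width: usize) -> u64 {
    assert!(bit_width >= 1 && bit_width <= 64);
    let byte = bit_offset / 8;
    let shift = bit_offset % 8;
    let nbytes = (shift + bit_width + 7) / 8;
    // Accumulate in u128 so a 7-bit shift plus a 64-bit value still fits.
    let mut v: u128 = 0;
    for (i, b) in data[byte..byte + nbytes].iter().enumerate() {
        v |= (*b as u128) << (8 * i);
    }
    ((v >> shift) & ((1u128 << bit_width) - 1)) as u64
}
```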


unsafe {
    // Reinterpret the remaining input bytes as a stream of u32 words
    // for the 32-value unpacker.
    let in_buf = &self.buffer.data()[self.byte_offset..];
    let mut in_ptr = in_buf as *const [u8] as *const u8 as *const u32;
Collaborator

Ha, interesting cast.
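
For what it's worth, the same reinterpretation can be spelled with `as_ptr()` (a sketch; the caller still has to uphold the same length and alignment expectations as the chained casts):

```rust
// Equivalent sketch: &[u8] -> *const u32 without the chained `as` casts.
fn as_u32_ptr(bytes: &[u8]) -> *const u32 {
    bytes.as_ptr() as *const u32
}
```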

src/util/mod.rs Outdated
@@ -21,3 +21,4 @@ pub mod test_common;
 #[macro_use]
 pub mod bit_util;
 pub mod hash_util;
+pub mod bit_packing;
Collaborator

Should this be just `mod bit_packing`?

Owner Author

Yes, will remove the `pub`.

@sunchao
Owner Author

sunchao commented Apr 9, 2018

@sadikovi Thanks for the review. The core of this patch applies the ideas from https://github.com/lemire/FrameOfReference, which unpacks 32 values at a time in a tight sequence, using very few instructions per value. This is implemented in `bit_packing.rs`. The new `get_batch` method then uses it to return a batch of values at a time. This is much more efficient than calling `get_value` repeatedly, since the latter spends many instructions per value on bookkeeping such as computing trailing bits and reloading buffered values. The rest of the patch simply replaces `get_value` calls with `get_batch` wherever possible. A rough sketch of the difference is below.
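
A hypothetical sketch of the call-granularity difference (the stub types and signatures below are paraphrased for illustration, not the crate's exact API):

```rust
// Paraphrased, stubbed API shapes; for illustration only.
struct BitReader { /* buffered word, bit offset, input bytes, ... */ }

impl BitReader {
    // Per-value read: every call pays for trailing-bit math and
    // buffered-word reloads.
    fn get_value(&mut self, _num_bits: usize) -> Option<i32> {
        Some(0) // stub
    }

    // Batched read: internally unpacks 32 values per step
    // (unpack32-style), amortizing the bookkeeping across the batch.
    // Returns how many values were decoded.
    fn get_batch(&mut self, batch: &mut [i32], _num_bits: usize) -> usize {
        batch.len() // stub
    }
}

fn decode_into(reader: &mut BitReader, out: &mut [i32], num_bits: usize) {
    // Old pattern: one call per value.
    // for slot in out.iter_mut() { *slot = reader.get_value(num_bits).unwrap(); }

    // New pattern: one call per batch.
    let _decoded = reader.get_batch(out, num_bits);
}
```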

@sadikovi
Collaborator

sadikovi commented Apr 9, 2018

@sunchao Thanks for the explanation, and for your effort in improving performance with these changes. We should merge it!

@sunchao sunchao merged commit 1fd277a into master Apr 10, 2018
@sunchao
Owner Author

sunchao commented Apr 10, 2018

Merged. Thanks @sadikovi for the review!

@sunchao sunchao deleted the bit-packing branch April 10, 2018 04:57
@sadikovi
Collaborator

No problem, thanks for the improvements!

@sunchao sunchao mentioned this pull request Apr 14, 2018