Read batch of values for RLE, DELTA_BINARY_PACKED and PLAIN decoding #80
Conversation
This adds a `get_batch` method to `BitReader`, which is used by RLE, DELTA_BINARY_PACKED, and PLAIN decoding. It is also used indirectly by DICT encoding, which uses RLE encoding.

This is done similarly to PARQUET-671, where we use the fast unpacking from https://github.com/lemire/FrameOfReference to unpack bit-packed integers (each batch has 32 values). This is then used in:

1. PLAIN encoding, when decoding boolean types
2. DELTA_BINARY_PACKED encoding, when decoding deltas for each mini-block
3. RLE encoding, when bit-packing is used
4. Level decoder, when bit-packing is used

Test results show this can improve performance a lot:

Before:

test delta_bit_pack_i32_1k_128 ... bench: 46,039 ns/iter (+/- 9,908) = 31 MB/s
test delta_bit_pack_i32_1k_32 ... bench: 49,225 ns/iter (+/- 56,843) = 29 MB/s
test delta_bit_pack_i32_1k_64 ... bench: 47,451 ns/iter (+/- 26,391) = 30 MB/s
test delta_bit_pack_i32_1m_128 ... bench: 45,701,937 ns/iter (+/- 6,000,442) = 32 MB/s
test delta_bit_pack_i32_1m_32 ... bench: 46,565,152 ns/iter (+/- 11,740,361) = 32 MB/s
test delta_bit_pack_i32_1m_64 ... bench: 44,812,203 ns/iter (+/- 4,263,573) = 33 MB/s
test dict_i32_1k_128 ... bench: 36,773 ns/iter (+/- 3,072) = 34 MB/s
test dict_i32_1k_32 ... bench: 37,843 ns/iter (+/- 14,263) = 33 MB/s
test dict_i32_1k_64 ... bench: 37,140 ns/iter (+/- 3,701) = 34 MB/s
test dict_i32_1m_128 ... bench: 37,544,476 ns/iter (+/- 2,914,721) = 34 MB/s
test dict_i32_1m_32 ... bench: 37,173,955 ns/iter (+/- 1,527,749) = 35 MB/s
test dict_i32_1m_64 ... bench: 36,951,548 ns/iter (+/- 1,694,937) = 35 MB/s
test dict_str_1m_128 ... bench: 39,279,048 ns/iter (+/- 4,481,945) = 13 MB/s

After:

test delta_bit_pack_i32_1k_128 ... bench: 6,274 ns/iter (+/- 989) = 233 MB/s
test delta_bit_pack_i32_1k_32 ... bench: 6,295 ns/iter (+/- 722) = 232 MB/s
test delta_bit_pack_i32_1k_64 ... bench: 6,340 ns/iter (+/- 325) = 230 MB/s
test delta_bit_pack_i32_1m_128 ... bench: 5,981,495 ns/iter (+/- 333,282) = 249 MB/s
test delta_bit_pack_i32_1m_32 ... bench: 6,073,897 ns/iter (+/- 334,358) = 245 MB/s
test delta_bit_pack_i32_1m_64 ... bench: 6,311,875 ns/iter (+/- 317,241) = 236 MB/s
test dict_i32_1k_128 ... bench: 3,060 ns/iter (+/- 331) = 419 MB/s
test dict_i32_1k_32 ... bench: 6,734 ns/iter (+/- 590) = 190 MB/s
test dict_i32_1k_64 ... bench: 4,321 ns/iter (+/- 195) = 297 MB/s
test dict_i32_1m_128 ... bench: 3,879,224 ns/iter (+/- 200,880) = 338 MB/s
test dict_i32_1m_32 ... bench: 9,877,528 ns/iter (+/- 405,837) = 132 MB/s
test dict_i32_1m_64 ... bench: 5,911,826 ns/iter (+/- 265,232) = 222 MB/s
test dict_str_1m_128 ... bench: 7,353,068 ns/iter (+/- 583,300) = 71 MB/s
Great improvement! I will have a closer look later today to learn how it is done.
Looks good! Just left some minor comments.
Quite complicated code, is there some sort of summary that I could read to understand this? :)
Cargo.toml
Outdated
@@ -23,4 +23,4 @@ brotli = "1.1.2"
 flate2 = "0.2"
 rand = "0.4"
 thrift = "0.0.4"
-x86intrin = "0.4.3"
+x86intrin = "0.4.3"
\ No newline at end of file
Looks like there is no newline at the end of the file.
Yes, I'll revert this change.
assert!(loaded == self.values_current_mini_block);
} else {
  for _ in 0..self.values_current_mini_block {
    // TODO: load one batch at a time similar to int32
What currently stops us from doing this?
`bit_packing::unpack32` only handles bit widths up to 32, but these values are int64, so they cannot be processed this way.
unsafe {
  let in_buf = &self.buffer.data()[self.byte_offset..];
  let mut in_ptr = in_buf as *const [u8] as *const u8 as *const u32;
Ha, interesting cast.
src/util/mod.rs
Outdated
@@ -21,3 +21,4 @@ pub mod test_common;
 #[macro_use]
 pub mod bit_util;
 pub mod hash_util;
+pub mod bit_packing;
Should this be just `mod bit_packing`?
Yes, will remove the `pub`.
@sadikovi Thanks for the review. The core of this patch is to apply the ideas from https://github.com/lemire/FrameOfReference, which unpacks 32 bit-packed values at a time in a tight sequence, with very few instructions per value. This is implemented in the `bit_packing` module.
@sunchao Thanks for the explanation and for your effort to improve performance with these code changes. We should merge it!
Merged. Thanks @sadikovi for the review!
No problem, thanks for the improvements!
Fix #63