Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(storage): support run-length encoding #507

Merged
merged 11 commits into from
Feb 23, 2022
Merged

Conversation

ludics
Copy link
Member

@ludics ludics commented Feb 20, 2022

storage: support run-length encoding #255

  • add 3 BlockType with prefix Rle
  • add rle_block_builder, rle_block_iterator
  • update corresponding ColumnBuilder and ColumnIterator

close #255

Signed-off-by: Lu Di leonludics@gmail.com

@ludics ludics requested a review from skyzh February 20, 2022 10:18
@ludics ludics changed the title storage: support running-length encoding feat(storage): support running-length encoding Feb 20, 2022
@yingjunwu
Copy link
Collaborator

It's called run-length encoding (RLE), not running-length encoding.

Copy link
Member

@skyzh skyzh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this feature! But before proceeding to review, I'd like to know why we need 2 separate RLEBuilder / RLEIterator for variable / fixed-length types? If we write bounds from the Array:

pub struct RleBlockIterator<A, B>
where
    A: Array,
    B: BlockIterator<A>,

This should cover all cases, including both primitive types and Bytes types. Did you meet any difficulty when implementing in this way?

(... in #255, I only considered the case for primitive types. But as long as you have implemented RLE for all types, I think it's reasonable to de-dup code.)

@skyzh
Copy link
Member

skyzh commented Feb 20, 2022

By the way, I've force enabled is_rle and run TPC-H tests on this branch. It seems that...

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: TestError { kind: QueryResultMismatch { sql: "select     sum(l_extendedprice * l_discount) as revenue from     lineitem where     l_shipdate >= date '1994-01-01'     and l_shipdate < date '1994-01-01' + interval '1' year     and l_discount between 0.08 - 0.01 and 0.08 + 0.01     and l_quantity < 24;", expected: "163821038.3323", actual: "163823134.1083" }, loc: Location { file: "tests/sql/tpch-full/_q6.slt", line: 1 } }', src/main.rs:217:41

Some queries are producing wrong results. See https://github.com/risinglightdb/risinglight/blob/main/docs/01-tpch.md#developers-add-new-tpc-h-tests on how to run TPC-H tests with RisingLight.

Don't worry about that, I think I might find the root cause when I review.

@ludics ludics changed the title feat(storage): support running-length encoding feat(storage): support run-length encoding Feb 20, 2022
@xxchan
Copy link
Member

xxchan commented Feb 20, 2022

This should cover all cases, including both primitive types and Bytes types. Did you meet any difficulty when implementing in this way?

If the answer is yes, I think we'd better choose not to implement it to keep our codebase simple. Because RLE may not be very useful for non-primitive types?

@skyzh
Copy link
Member

skyzh commented Feb 21, 2022

Because RLE may not be very useful for non-primitive types?

There's a one-char column l_returnflag in lineitem table in TPC-H test. RLE is still helpful for those using char to store enum-like types.

Signed-off-by: ludics <leonludics@gmail.com>
Signed-off-by: ludics <leonludics@gmail.com>
Signed-off-by: ludics <leonludics@gmail.com>
Signed-off-by: ludics <leonludics@gmail.com>
Signed-off-by: ludics <leonludics@gmail.com>
@ludics
Copy link
Member Author

ludics commented Feb 21, 2022

This should cover all cases, including both primitive types and Bytes types. Did you meet any difficulty when implementing in this way?

In commit f163ed, I added a new trait RLETypeEncode and implemented only one builder/iterator for both primitive types and Bytes types. However, the RLEBlockBuilder is not very efficient as the previous_value is stored as Option<Vec<u8>> and the item to be appended will also be transformed to Vec<u8> for primitive types to make a comparison. Storing previous_value as Option<&[u8]> is more efficient, but there will be some lifetime problems...

@skyzh
Copy link
Member

skyzh commented Feb 21, 2022

We may store previous value as <A::Item as ToOwned>::Owned? This is equivalent to i32, String, etc.

…ator

Signed-off-by: ludics <leonludics@gmail.com>
Signed-off-by: ludics <leonludics@gmail.com>
Signed-off-by: ludics <leonludics@gmail.com>
Copy link
Member

@skyzh skyzh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM, excellent work!

src/array/mod.rs Outdated Show resolved Hide resolved
src/array/utf8_array.rs Outdated Show resolved Hide resolved
src/storage/secondary/block.rs Outdated Show resolved Hide resolved
src/storage/secondary/block/rle_block_builder.rs Outdated Show resolved Hide resolved
src/storage/secondary/block/rle_block_builder.rs Outdated Show resolved Hide resolved
src/storage/secondary/block/rle_block_iterator.rs Outdated Show resolved Hide resolved
src/storage/secondary/column/primitive_column_builder.rs Outdated Show resolved Hide resolved
@ludics ludics force-pushed the ludi/rle branch 2 times, most recently from b394b96 to f4d06e4 Compare February 23, 2022 11:50
Signed-off-by: ludics <leonludics@gmail.com>
@ludics ludics enabled auto-merge (squash) February 23, 2022 12:03
Copy link
Member

@skyzh skyzh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM! You may get it merged and fix the remaining small problems later. Thanks for your contribution.

I'm still investigating why enabling RLE will produce wrong result in TPC-H. I'll create an issue later about this problem.

@ludics ludics removed the request for review from likg227 February 23, 2022 12:21
@skyzh skyzh disabled auto-merge February 23, 2022 12:27
@skyzh
Copy link
Member

skyzh commented Feb 23, 2022

Well, I've finally found the issue with skip. I'll start another issue on this.

@skyzh skyzh enabled auto-merge (squash) February 23, 2022 12:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

storage: support run-length encoding
4 participants