Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(array): introduce DataChunkBuilder and split chunks in HashJoinExecutor #466

Merged
merged 13 commits into from
Feb 23, 2022

Conversation

TennyZhuang
Copy link
Collaborator

@TennyZhuang TennyZhuang commented Feb 14, 2022

The PR introduces DataChunkBuilder, which is responsible for two things:

  1. Ensure that all columns are pushed with the same size.
  2. Ensure that the size of the DataChunk should not be greater than capacity.

Signed-off-by: TennyZhuang zty0826@gmail.com

…xecutor

Signed-off-by: TennyZhuang <zty0826@gmail.com>
Signed-off-by: TennyZhuang <zty0826@gmail.com>
Signed-off-by: TennyZhuang <zty0826@gmail.com>
Signed-off-by: TennyZhuang <zty0826@gmail.com>
Copy link
Member

@wangrunji0408 wangrunji0408 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two suggestions:

  1. Don't consume the builder when the size reaches the limit, so that users will no longer need to rebuild it by themselves.
impl DataChunkBuilder {
    #[must_use = "should yield the DataChunk when size limit is reached"]
    pub fn push_row(&mut self, row: impl IntoIterator<Item = DataValue>) -> Option<DataChunk>;

    /// Generate a [`DataChunk`] with the remaining rows, leaving the builder empty.
    #[must_use]
    pub fn take(&mut self) -> Option<DataChunk>;
}
  1. Provide a helper macro to simplify the usage.
macro_rules! yield_if_some {
    ($maybe_chunk:expr) => {
        if let Some(chunk) = $maybe_chunk {
            yield chunk;
        }
    }
}

Then the usage should be something like:

let mut builder = DataChunkBuilder::new(...);
loop {
    // ...
    yield_if_some!(builder.push_row(values));
    // if necessary
    yield_if_some!(builder.take());
}
yield_if_some!(builder.take());

@TennyZhuang
Copy link
Collaborator Author

Don't consume the builder when the size reaches the limit, so that users will no longer need to rebuild it by themselves.

But it can force users to handle the finished DataChunk correctly by compiler error instead of just lint (

@TennyZhuang
Copy link
Collaborator Author

Provide a helper macro to simplify the usage.

Unfortunately, macros does not work well with yield.

src/array/data_chunk_builder.rs Outdated Show resolved Hide resolved
src/array/data_chunk_builder.rs Outdated Show resolved Hide resolved
@TennyZhuang
Copy link
Collaborator Author

@wangrunji0408 PTAL

Signed-off-by: TennyZhuang <zty0826@gmail.com>
Signed-off-by: TennyZhuang <zty0826@gmail.com>
Signed-off-by: TennyZhuang <zty0826@gmail.com>
Copy link
Member

@wangrunji0408 wangrunji0408 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general. Sorry for the late response ==

src/array/data_chunk_builder.rs Outdated Show resolved Hide resolved
src/executor/hash_join.rs Show resolved Hide resolved
src/executor/hash_join.rs Outdated Show resolved Hide resolved
Signed-off-by: TennyZhuang <zty0826@gmail.com>
Signed-off-by: TennyZhuang <zty0826@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants