feat: add binder and executor for aggregation #69

Merged
merged 8 commits into from Nov 8, 2021

Conversation

Contributor

@pleiadesian pleiadesian commented Oct 28, 2021

TODO in aggregation

  • Avg
  • Count (only rowcount is supported so far)
  • wildcard (e.g., count(*))
  • distinct keyword
  • aggregation with multiple arguments (e.g., json_object_agg ( key "any", value "any" ) -> json)

Member

@skyzh skyzh left a comment

Good work!

src/binder/expression/agg.rs (outdated, resolved)
src/executor/aggregation/sum.rs (outdated, resolved)
src/executor/hash_agg.rs (outdated, resolved)
src/executor/hash_agg.rs (outdated, resolved)
src/executor/hash_agg.rs (outdated, resolved)
src/executor/hash_agg.rs (outdated, resolved)
src/types/mod.rs (resolved)
src/executor/hash_agg.rs (outdated, resolved)
@wangrunji0408
Member

It seems that aggregation.slt is blocked by floating-point number casting...
You could add simple test cases for now and leave these big tests for the future.

Some tips:

  • You can add # before the subtest command to comment it out, since it's not supported yet.
  • You can add a halt command to stop the test early.

Member

@skyzh skyzh left a comment

Good work! From my perspective, this PR is too large for a single review. I think it could be split into at least two parts:

  • add binder and executor for simple aggregation, and sum state
  • add group by support and HashAgg
  • (if you want to split further) add max/min and rowcount state

As this PR generally looks good to me and needs only a few modifications before merging, I think it's okay to submit one big patch this time. Next time, I think a PR of ~300 LoC is a reasonable size. Just add new functionality little by little.

src/executor/aggregation/min_max.rs (outdated, resolved)
src/executor/aggregation/min_max.rs (outdated, resolved)
match (array, &self.input_datatype, self.is_min) {
    (ArrayImpl::Int32(arr), DataTypeKind::Int, true) => {
        let mut temp: Option<i32> = None;
        temp = arr.iter().fold(temp, min_i32);
Member

Why not set the fold's initial value to self.result? Then we wouldn't need the following match to do extra work.

Contributor Author

self.result is a DataValue. So, to get the result, we would have
self.result = array.iter().fold(self.result.clone(), min_i32);
But we cannot call iter() on ArrayImpl, and we would have to refactor min_i32 to operate on ArrayImpl instead of i32.
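
A minimal sketch of what seeding the fold with the previous result could look like, assuming the state keeps its result as a plain `Option<i32>` rather than a `DataValue` and receives an already-downcast batch. `MinI32State` and the slice-based `update` are hypothetical simplifications, not the PR's types:

```rust
/// Combines a running minimum with one nullable input value.
/// `None` stands for SQL NULL and is skipped.
fn min_i32(acc: Option<i32>, value: Option<&i32>) -> Option<i32> {
    match (acc, value) {
        (Some(a), Some(v)) => Some(a.min(*v)),
        (None, Some(v)) => Some(*v),
        (acc, None) => acc,
    }
}

/// Hypothetical state that stores its result as `Option<i32>`.
struct MinI32State {
    result: Option<i32>,
}

impl MinI32State {
    /// Feeds one batch; the previous result is the fold's initial value,
    /// so no separate merge step is needed afterwards.
    fn update(&mut self, batch: &[Option<i32>]) {
        self.result = batch.iter().map(Option::as_ref).fold(self.result, min_i32);
    }
}
```

With a typed result the merge after the fold disappears; whether that is worth giving up the uniform `DataValue` result is a separate design question.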

src/executor/aggregation/min_max.rs (outdated, resolved)
src/executor/aggregation/rowcount.rs (resolved)
        array.filter(visibility.iter().copied().collect::<Vec<_>>().into_iter())
    }
};
match (array, &self.input_datatype, self.is_min) {
Member

Typically, is_min should be some kind of const generic parameter, so that we can reduce runtime overhead. We can leave this optimization for later.
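
As a rough illustration of the const generic idea, here is a sketch under the assumption that the state works on a slice of nullable `i32` values. `MinMaxI32State` and its fields are hypothetical names, not the PR's code:

```rust
/// Hypothetical min/max state. `IS_MIN` is fixed at compile time,
/// so each instantiation is specialized for one direction.
struct MinMaxI32State<const IS_MIN: bool> {
    result: Option<i32>,
}

impl<const IS_MIN: bool> MinMaxI32State<IS_MIN> {
    fn new() -> Self {
        Self { result: None }
    }

    /// Updates the state with one batch of nullable values (NULLs are skipped).
    fn update(&mut self, batch: &[Option<i32>]) {
        for value in batch.iter().flatten() {
            self.result = Some(match self.result {
                None => *value,
                // `IS_MIN` is a constant here, so the unused arm can
                // typically be optimized away per instantiation.
                Some(current) if IS_MIN => current.min(*value),
                Some(current) => current.max(*value),
            });
        }
    }
}
```

The caller then picks `MinMaxI32State::<true>` or `MinMaxI32State::<false>` once, when the aggregation is bound, instead of carrying a runtime flag.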

src/executor/hash_agg.rs (outdated, resolved)
for col in group_cols.iter() {
    group_key.push(col.get(row_idx));
}
let vis_map = key_to_vis_maps.entry(group_key.clone()).or_insert_with(|| {
Member

@skyzh skyzh Nov 7, 2021

So even if we do not need to insert a new group_key into key_to_vis_maps, we still need to clone the group_key. I don't think this is necessary.

Meanwhile, I suggest using Arc<str> instead of String everywhere we handle user input, so as to reduce clone overhead. I'll draft an RFC about this later.

Member

I suggest Cow<str>
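
On the group_key clone raised at the start of this thread, one possible shape is to look the key up by reference first and move it into the map only when it is new. This is a sketch only: `GroupKey` is a hypothetical stand-in for the PR's key type, and `mark_visible` is an illustrative helper, not a function from the PR:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the PR's group key: one value per group column.
type GroupKey = Vec<i64>;

/// Marks `row_idx` as visible for `key` without cloning the key.
fn mark_visible(
    key_to_vis_maps: &mut HashMap<GroupKey, Vec<bool>>,
    key: GroupKey,
    row_idx: usize,
    num_rows: usize,
) {
    // Look up by reference first; this is the path taken for repeated keys.
    if let Some(vis_map) = key_to_vis_maps.get_mut(&key) {
        vis_map[row_idx] = true;
        return;
    }
    // Only a new key is moved into the map, so no clone is needed here either.
    let mut vis_map = vec![false; num_rows];
    vis_map[row_idx] = true;
    key_to_vis_maps.insert(key, vis_map);
}
```

The trade-off is a second hash lookup for keys that are not yet present, in exchange for never cloning the key.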

src/executor/simple_agg.rs (outdated, resolved)
src/executor/simple_agg.rs (outdated, resolved)
Member

@wangrunji0408 wangrunji0408 left a comment

LGTM in general. 👍

src/executor/hash_agg.rs (outdated, resolved)
src/array/mod.rs (outdated, resolved)
src/binder/expression/agg_call.rs (outdated, resolved)
Comment on lines +78 to +90
impl PartialEq for DataValue {
    fn eq(&self, other: &Self) -> bool {
        match (self, other) {
            (Self::Null, Self::Null) => true,
            (Self::Bool(left), Self::Bool(right)) => left == right,
            (Self::Int32(left), Self::Int32(right)) => left == right,
            (Self::Int64(left), Self::Int64(right)) => left == right,
            (Self::String(left), Self::String(right)) => left == right,
            (Self::Float64(left), Self::Float64(right)) => left == right,
            _ => false,
        }
    }
}
Member

This implementation seems to be equivalent to #[derive(PartialEq)]?

Member

@skyzh skyzh Nov 7, 2021

PartialEq was added because of a clippy warning. First, we need a custom Hash implementation for DataValue that doesn't take the concrete type into account and only feeds the actual value into the hasher. Because of that, we also need to implement PartialEq ourselves instead of using derive(PartialEq). However, I don't think this is a good thing to do in our system. Maybe we can take another look and change the implementation.
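
To make the constraint concrete: any manual `Hash` has to agree with `PartialEq`, meaning `a == b` must imply `hash(a) == hash(b)`. The sketch below uses a trimmed-down, hypothetical `DataValue` (the float variant is left out because `f64` does not implement `Hash`), and `hash_of` is an illustrative helper rather than part of the PR:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical, trimmed-down value type for illustration.
#[derive(Debug, Clone)]
enum DataValue {
    Null,
    Bool(bool),
    Int32(i32),
    Int64(i64),
    String(String),
}

impl PartialEq for DataValue {
    fn eq(&self, other: &Self) -> bool {
        match (self, other) {
            (Self::Null, Self::Null) => true,
            (Self::Bool(l), Self::Bool(r)) => l == r,
            (Self::Int32(l), Self::Int32(r)) => l == r,
            (Self::Int64(l), Self::Int64(r)) => l == r,
            (Self::String(l), Self::String(r)) => l == r,
            _ => false,
        }
    }
}

impl Eq for DataValue {}

impl Hash for DataValue {
    /// Feeds only the payload into the hasher, not the variant tag.
    fn hash<H: Hasher>(&self, state: &mut H) {
        match self {
            Self::Null => 0u8.hash(state),
            Self::Bool(v) => v.hash(state),
            Self::Int32(v) => v.hash(state),
            Self::Int64(v) => v.hash(state),
            Self::String(v) => v.hash(state),
        }
    }
}

/// Helper for checking that equal values produce equal hashes.
fn hash_of(value: &DataValue) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

Dropping the variant tag from the hash only permits extra collisions between variants; it cannot break the contract, because values of different variants never compare equal here.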

src/executor/aggregation/min_max.rs (outdated, resolved)
src/executor/aggregation/rowcount.rs (outdated, resolved)
src/executor/aggregation/min_max.rs (outdated, resolved)
src/executor/simple_agg.rs (outdated, resolved)
src/executor/hash_agg.rs (outdated, resolved)
(ArrayImpl::Int32(arr), DataTypeKind::Int) => {
    let temp = arr
        .iter()
        .fold(None, if self.is_min { min_i32 } else { max_i32 });
Member

@skyzh skyzh Nov 7, 2021

Moving a branch like if self.is_min into each iteration of the loop might be very inefficient. We can make it a const generic parameter later.

Member

Oops, I just found out that the if is used to decide which function to use, rather than sitting inside the function. This looks reasonable to me.
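
The distinction can be sketched as follows, assuming `min_i32` and `max_i32` have the shape below (their real bodies are not shown in this thread). `min_or_max` is a hypothetical wrapper used only for illustration:

```rust
/// Assumed shape of the fold helpers; `None` stands for SQL NULL.
fn min_i32(acc: Option<i32>, value: Option<&i32>) -> Option<i32> {
    match (acc, value) {
        (Some(a), Some(v)) => Some(a.min(*v)),
        (None, Some(v)) => Some(*v),
        (acc, None) => acc,
    }
}

fn max_i32(acc: Option<i32>, value: Option<&i32>) -> Option<i32> {
    match (acc, value) {
        (Some(a), Some(v)) => Some(a.max(*v)),
        (None, Some(v)) => Some(*v),
        (acc, None) => acc,
    }
}

/// Hypothetical wrapper: the `if` is evaluated once and yields a function
/// pointer; the fold then calls that pointer for every element.
fn min_or_max(values: &[Option<i32>], is_min: bool) -> Option<i32> {
    let combine: fn(Option<i32>, Option<&i32>) -> Option<i32> =
        if is_min { min_i32 } else { max_i32 };
    values.iter().map(Option::as_ref).fold(None, combine)
}
```

The per-element cost is then an indirect call rather than a repeated test of `is_min`.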

src/executor/simple_agg.rs (outdated, resolved)
Member

@skyzh skyzh left a comment

Rest LGTM. I don't want this PR to be held up for too many days, so we can get it merged for now and resolve the minor issues in follow-up PRs.

SelectItem::Wildcard => {
    // TODO: support wildcard in aggregation
Member

If this is not implemented, just panic.
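
One way to follow this suggestion is the standard `todo!` macro, which panics with the given message instead of silently falling through. The `SelectItem` enum and `bind_select_item` below are simplified, hypothetical stand-ins for the binder code:

```rust
/// Simplified stand-in for the parser's select item type.
enum SelectItem {
    Expr(String),
    Wildcard,
}

/// Hypothetical binder step: returns the bound expression name.
fn bind_select_item(item: &SelectItem) -> String {
    match item {
        SelectItem::Expr(name) => name.clone(),
        // Fail loudly until wildcard support exists.
        SelectItem::Wildcard => todo!("support wildcard in aggregation"),
    }
}
```

A query that hits the unsupported path then stops with a clear message rather than producing a wrong plan.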

/// `AggregationState` records the state of an aggregation
pub trait AggregationState: 'static + Send + Sync {
    fn update(&mut self, array: &ArrayImpl) -> Result<(), ExecutorError>;
Member

Please also document the trait functions.
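
A sketch of what documented trait functions might look like. The argument types are simplified to `Option<i64>` values, `ExecutorError` is a unit-struct stand-in, and `output` and `SumState` are hypothetical additions for illustration. The documented behavior (NULLs ignored, overflow reported as an error) is an assumption, not something this thread confirms:

```rust
/// Stand-in for the PR's `ExecutorError`.
#[derive(Debug, PartialEq)]
pub struct ExecutorError;

/// `AggregationState` records the state of an aggregation.
pub trait AggregationState: 'static + Send + Sync {
    /// Folds every value of `array` into the state.
    ///
    /// NULL entries (`None`) are ignored.
    fn update(&mut self, array: &[Option<i64>]) -> Result<(), ExecutorError>;

    /// Folds a single value into the state; used for row-by-row updates.
    fn update_single(&mut self, value: Option<i64>) -> Result<(), ExecutorError>;

    /// Returns the current result, or `None` if no non-NULL value was seen.
    fn output(&self) -> Option<i64>;
}

/// Example implementor: a running sum.
pub struct SumState {
    result: Option<i64>,
}

impl AggregationState for SumState {
    fn update(&mut self, array: &[Option<i64>]) -> Result<(), ExecutorError> {
        for value in array {
            self.update_single(*value)?;
        }
        Ok(())
    }

    fn update_single(&mut self, value: Option<i64>) -> Result<(), ExecutorError> {
        if let Some(v) = value {
            // Report overflow instead of wrapping silently.
            let sum = self.result.unwrap_or(0).checked_add(v).ok_or(ExecutorError)?;
            self.result = Some(sum);
        }
        Ok(())
    }

    fn output(&self) -> Option<i64> {
        self.result
    }
}
```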

pub trait AggregationState: 'static + Send + Sync {
    fn update(&mut self, array: &ArrayImpl) -> Result<(), ExecutorError>;

    fn update_single(&mut self, value: &DataValue) -> Result<(), ExecutorError>;
Member

Why do we need to implement update_single?

Member

I see, it is used in HashAgg. I still prefer feeding a batch into the aggregator and using a visibility bitmap or Iterator<Item = bool> to indicate valid entries. Let's do this later.

Contributor Author

#69 (comment)
As mentioned earlier, constructing a visibility bitmap for every unique group key incurs high time and space costs. If we instead update row by row, as in the current implementation, we avoid the cost of bitmap construction.
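
The two approaches discussed here can be compared with a small sketch. It assumes a single `i32` group column and an `i64` sum; both function names are hypothetical and neither is the PR's HashAgg code:

```rust
use std::collections::HashMap;

/// Batch style: build one visibility vector per distinct key, then scan the
/// whole value column once per group. Allocates groups * rows booleans.
fn group_sums_with_bitmaps(keys: &[i32], values: &[i64]) -> HashMap<i32, i64> {
    let mut vis_maps: HashMap<i32, Vec<bool>> = HashMap::new();
    for (row_idx, key) in keys.iter().enumerate() {
        let vis = vis_maps
            .entry(*key)
            .or_insert_with(|| vec![false; keys.len()]);
        vis[row_idx] = true;
    }
    vis_maps
        .into_iter()
        .map(|(key, vis)| {
            let sum: i64 = values
                .iter()
                .zip(vis)
                .filter(|(_, visible)| *visible)
                .map(|(value, _)| *value)
                .sum();
            (key, sum)
        })
        .collect()
}

/// Row-by-row style: one hash lookup and one state update per row, no bitmaps.
fn group_sums_row_by_row(keys: &[i32], values: &[i64]) -> HashMap<i32, i64> {
    let mut sums = HashMap::new();
    for (key, value) in keys.iter().zip(values) {
        *sums.entry(*key).or_insert(0) += *value;
    }
    sums
}
```

Both produce the same sums. The difference is the extra memory and scanning in the bitmap version versus a per-row state update in the other.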

// Update states
let num_rows = chunk.cardinality();
for row_idx in 0..num_rows {
    let mut group_key = HashKey::new();
Member

The group key doesn't need to be constructed before checking its existence in state_entries. We can optimize this later.

        builder.push(result);
        builder.finish()
    }
    None => ArrayBuilderImpl::new(&DataType::new(DataTypeKind::Int, true)).finish(),
Member

When will the data_type be null in SimpleAgg?

Contributor Author

When executing aggregation on an empty table, the result is None. Then the executor should return an empty array in the DataChunk.

Member

But the type of the array should be determined in advance, e.g. SELECT sum(x) + 1.0 FROM table when x is f64. In this case, SimpleAgg will return an I32Array, and the + 1.0 part might fail with a mismatched expression type.
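
A sketch of the point about typing the empty result, assuming the aggregation's return type is known from binding. `DataTypeKind` and `ArrayImpl` are reduced to two hypothetical variants each, and `empty_array` and `add_one_f64` are illustrative helpers:

```rust
/// Reduced, hypothetical versions of the PR's type and array enums.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataTypeKind {
    Int,
    Double,
}

#[derive(Debug, PartialEq)]
enum ArrayImpl {
    Int32(Vec<i32>),
    Float64(Vec<f64>),
}

/// Builds an empty array whose variant follows the declared return type,
/// instead of always falling back to an Int32 array.
fn empty_array(return_type: DataTypeKind) -> ArrayImpl {
    match return_type {
        DataTypeKind::Int => ArrayImpl::Int32(Vec::new()),
        DataTypeKind::Double => ArrayImpl::Float64(Vec::new()),
    }
}

/// Stand-in for a downstream `+ 1.0` expression that expects a float input.
fn add_one_f64(array: &ArrayImpl) -> Result<Vec<f64>, String> {
    match array {
        ArrayImpl::Float64(values) => Ok(values.iter().map(|v| v + 1.0).collect()),
        other => Err(format!("expected Float64 array, got {:?}", other)),
    }
}
```

With the typed fallback, the downstream expression sees the array variant it was planned for even when the input table is empty.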

    Ok(())
}

fn finish_agg(states: SmallVec<[Box<dyn AggregationState>; 16]>) -> DataChunk {
Member

Box<dyn AggregationState> can be defined as a separate type. pub type BoxedAggregationState = Box<dyn AggregationState>;
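
A small sketch of the alias in use. To stay within the standard library it swaps the PR's `SmallVec` for a plain `Vec` and returns `Vec<i64>` instead of a `DataChunk`. The one-method trait and `RowCountState` are hypothetical simplifications:

```rust
/// Simplified trait with a single hypothetical accessor.
pub trait AggregationState {
    fn output(&self) -> i64;
}

/// The alias suggested in the review.
pub type BoxedAggregationState = Box<dyn AggregationState>;

/// Hypothetical state used only to exercise the alias.
struct RowCountState {
    count: i64,
}

impl AggregationState for RowCountState {
    fn output(&self) -> i64 {
        self.count
    }
}

/// Signatures read more easily with the alias than with `Box<dyn ...>` spelled out.
fn finish_agg(states: Vec<BoxedAggregationState>) -> Vec<i64> {
    states.iter().map(|state| state.output()).collect()
}
```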

3 participants