-
Notifications
You must be signed in to change notification settings - Fork 19
Add statistics for column chunk metadata and data page #94
Conversation
Cheers! Yes, I was going to review #89 after your changes - will do asap. |
src/file/metadata.rs
Outdated
/// Optional column statistics that are used per row group and per page. | ||
/// | ||
/// Use `order` field to determine comparison order for a type. | ||
/// When statistics contain deprecated min/max fields, then comparison is always signed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove then
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will do.
src/file/metadata.rs
Outdated
/// When statistics contain deprecated min/max fields, then comparison is always signed. | ||
#[derive(Clone, Debug, PartialEq)] | ||
pub enum Statistics { | ||
Boolean { order: ColumnOrder, min: Option<bool>, max: Option<bool>, nulls: u64 }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wonder if it would be better to have a Statistics
trait with methods such as null_count
, has_nulls
, etc., and a separate struct TypedStatistics<T>
that implements the former. The latter can implement more methods in future such as set_min_max
, etc., that are specific to the generic type T
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I am a bit confused. I chose not using T
, so it is easy to extract relevant statistics a type.
My main concern with typed statistics is that we would need a wrapper similar to column reader. That is why I chose to remove this part and directly implement as enum of different statistics for physical type.
Could you elaborate a little bit more your plan on adding typed statistics? Should we consider statistics as read-only? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I was wondering if this could be similar to the RowGroupStatistics
in the C++ version: https://github.com/apache/parquet-cpp/blob/master/src/parquet/statistics.h#L83, which is readonly, and then TypedStatistics for different types.
I'm not very familiar with how statistics will be used in the whole codebase though. I think the current approach is fine as well - we can come back and revisit this if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting... let me think about this. I do want to expose min/max methods.
src/file/metadata.rs
Outdated
impl Statistics { | ||
/// Returns number of nulls for the column. | ||
/// Note that even though statistics are for leaf columns, null count also takes into | ||
// account null complex types, e.g. lists. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add a /
src/file/metadata.rs
Outdated
/// This is used when min_value/max_value are set for statistics. | ||
/// | ||
/// Copied from `parquet.thrift`, see the file for more details on order of values. | ||
fn column_order(physical_type: Type, logical_type: LogicalType) -> ColumnOrder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps this can be a method for ColumnDescriptor
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might have been the wrong name for ColumnOrder
. This is not a sort order, this is an order which should be used when comparing current value with min
and max
from statistics. I might nee to rename it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will rewrite it, so it will be a method on TypeDefinedOrder
struct as part of ColumnOrder
enum. Something like this:
// Sort order for statistics, to define `min <= value <= max`.
enum SortOrder {
SIGNED,
UNSIGNED,
UNKNOWN
}
// Column order for each leaf column.
enum ColumnOrder {
TypeDefinedOrder(SortOrder),
Undefined
}
impl ColumnOrder {
// Creates new column order for a leaf column.
pub fn new(field: &Type) -> Self {
...
}
// Returns sort order for a type.
fn sort_order(field: &Type) -> SortOrder {
...
}
// Returns default sort order based on physical type.
fn default_sort_order(field: &Type) -> SortOrder {
...
}
}
I just need to figure out how to handle legacy min
/max
values as opposed to min_value
/max_value
. I will follow the reference: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Maybe also take a look at how C++ handles this: https://github.com/apache/parquet-cpp/blob/master/src/parquet/metadata.cc#L53.
I'll take a look too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I had a look, and this makes clear when to use type defined order, because I was a bit confused with parquet-mr code. Will update accordingly.
src/file/metadata.rs
Outdated
|
||
/// Column order to compare values with statistics min/max. | ||
#[derive(Clone, Debug, PartialEq)] | ||
pub enum ColumnOrder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: change this to SortOrder
to be consistent with parquet-mr/cpp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will rename it.
src/file/metadata.rs
Outdated
/// Method to convert from Thrift definition. | ||
/// Note that column type should match statistics, otherwise, there is a risk of | ||
/// invalid conversion. | ||
pub fn from_thrift( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we also need to process column_orders
from FileMetaData
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on my comment above, FileMetaData
will have a method that returns column orders for leaf columns, and sort orders could be extracted from it.
fn column_orders(&self) -> &[ColumnOrder] {
...
}
fn column_order(&self, i: usize) -> &ColumnOrder {
...
}
@sunchao Would you mind taking a look at my replies? I would like to clarify the changes first. Thanks! |
@sadikovi : sure. I've left some comments. |
I will make changes ASAP and let you know. Thanks! |
@sunchao I added typed statistics and refactored Several issues that I found/introduced:
Could you advise on the changes/further steps? Thanks. |
@sadikovi Thanks for the update. Will take a look soon. |
I am planning to go back to a previous implementation with enum, sorry. I just can’t figure out how to work with traits and go back to typed impls. I mean no offence nor disrespect to anyone, but I feel like enum could be better when reading values from statistics. Plus, I am updating some other bits, did not do much today, promise to finish it ASAP. |
OK no worries. Sorry for the wrong suggestion 😳 . I think the enum approach is fine too. |
@sunchao it is all good, your suggestion was on point. I just got a bit carried away. I will spend some time coming up with a better solution. |
@sunchao I updated the code with a new iteration of I also designed Statistics updates, but I have not implemented them here, instead I put the code (not fully complete, but compiles) in gist: https://gist.github.com/sadikovi/a2a5d79f4e4368ea50a5b6b5a0e58cce The idea is treating |
Thanks @sadikovi . Will take a look soon. |
I think I need to update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sadikovi . This looks good. I left some comments.
src/basic.rs
Outdated
/// min/max. | ||
/// | ||
/// See reference in | ||
/// https://github.com/apache/parquet-cpp/blob/master/src/parquet/types.h#L120 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: don't reference the line number as it may get changed soon.
|
||
/// Statistics for a column chunk and data page. | ||
#[derive(Debug, PartialEq)] | ||
pub enum Statistics { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back to the question of implementing this using trait. I think you use something like the following:
pub trait Statistics {
/// Returns `true` if statistics have old `min` and `max` fields set.
/// This means that the column order is likely to be undefined, which, for old files
/// could mean a signed sort order of values.
///
/// Refer to [`ColumnOrder`](`::basic::ColumnOrder`) and
/// [`SortOrder`](`::basic::SortOrder`) for more information.
fn is_min_max_deprecated(&self) -> bool;
/// Returns number of null values for the column.
/// Note that this includes all nulls when column is part of the complex type.
fn null_count(&self) -> u64;
/// Returns `true` if statistics collected any null values, `false` otherwise.
fn has_nulls(&self) -> bool {
self.null_count() == 0
}
/// Returns `true` if min value and max value are set.
fn has_min_max_set(&self) -> bool;
}
and then have TypedStatistics
to implement this:
impl<T: DataType> Statistics for TypedStatistics<T> {
/// Whether or not min and max values are set.
fn has_min_max_set(&self) -> bool {
self.min.is_some() && self.max.is_some()
}
/// Returns null count.
fn null_count(&self) -> u64 {
self.null_count
}
/// Returns `true` if statistics were created using old min/max fields.
fn is_min_max_deprecated(&self) -> bool {
self.is_min_max_deprecated
}
}
The from_thrift
function now needs to return Box<Statistics>
:
pub fn from_thrift(
physical_type: Type,
thrift_stats: Option<TStatistics>
) -> Option<Box<Statistics>> {
...
let res: Box<Statistics> = match physical_type {
Type::BOOLEAN => {
Box::new(TypedStatistics::<BoolType>::new(
min.map(|data| data[0] != 0),
max.map(|data| data[0] != 0),
null_count,
old_format
))
},
...
There are some other issues to solve with this approach though. For instance, handling assert_eq
for boxed traits.
Not sure if you have explored this approach - just my 2 cents. As I mentioned before, the current enum approach should be fine too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I tried this approach (https://github.com/sadikovi/parquet-rs/blob/011532c2450b2c7355488066d65deef8dd50e207/src/file/statistics.rs), and it did compile and work, but then I failed to find a way of extracting min/max values from typed statistics, once it was Box<Statistics>
. Assertion in tests was also difficult to add. Those were the main reasons of switching to enum, so I could cast it properly (similar to column reader).
IMHO, enum also works, and I posted a code in gist that shows how to implement statistics collector for writes, which we will need, so it is still possible to do with enum.
I do not want anyone, including myself, to end up rewriting this implementation if we find it not suitable long term - so it might be better to discuss this further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries. I didn't know the difficulty of extracting min/max values. Let's go with the enum approach.
} | ||
|
||
/// Typed implementation for [`Statistics`]. | ||
pub struct TypedStatistics<T: DataType> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a TODO for the distinct count value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added distinct_count
field to statistics.
self.max.as_ref().unwrap() | ||
} | ||
|
||
/// Whether or not min and max values are set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we mention this will panic if min/max is not set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I update the comments.
column.physical_type() | ||
); | ||
res.push(ColumnOrder::TYPE_DEFINED_ORDER(sort_order)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: panic if it is some other TColumnOrder
? although not possible at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it is not possible at the moment. Since it is compile-safe, we will update it, once there is another option added to column order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
for (i, column) in schema_descr.columns().iter().enumerate() { | ||
match orders[i] { | ||
TColumnOrder::TYPEORDER(_) => { | ||
let sort_order = ColumnOrder::get_sort_order( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe make get_sort_order
a method for ColumnDescriptor
? since it already have all the info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I added method to column order was emphasis that this is a column order feature. When adding to column descriptor, we should consider adding actual ColumnOrder
to it, because having method get_sort_order
might be a bit confusing, people would want to get sort order, but maybe it is a format that does not support column order, so the actual sort order might be different from the one returned by the method.
I was thinking about adding ColumnOrder
field to ColumnDescriptor
. But this will require some additional changes, that I can make in a separate PR. Let me know if you think it is a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. The current approach is fine too. Let's keep it this way then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Merged. Thanks @sadikovi for the nice work! |
@sunchao Thanks a lot! I do feel like I could have done a better job. I will open other issues if the changes to statistics are necessary. Let me know if there is anything you would like to update. |
This PR adds
Statistics
enum instatistics.rs
that tracksmin
,max
anddistinct_count
,nulls
, which are min value for a column in a row group, max value for a column in a row group, optional number of distinct values, and number of nulls respectively.Statistics
is enum that is mapped to all physical types we have, it is considered to be immutable. The actual implementation is inTypedStatistics<T: DataType>
. This should be used when implementing statistics updates (see the comments below related to the updates).To support new ordering feature in statistics we added
ColumnOrder
andSortOrder
. All legacy statistics should be treated asColumnOrder::UNDEFINED
andSortOrder::SIGNED
.Closes #83