Add statistics for column chunk metadata and data page #94

sadikovi · 2018-04-19T08:23:02Z

This PR adds a Statistics enum in statistics.rs that tracks min, max, distinct_count and nulls: the minimum and maximum values for a column in a row group, the optional number of distinct values, and the number of nulls, respectively.

Statistics is an enum with a variant for each physical type we support, and it is considered immutable. The actual implementation lives in TypedStatistics<T: DataType>, which should be used when implementing statistics updates (see the comments below related to the updates).

To support the new ordering feature in statistics, we added ColumnOrder and SortOrder. All legacy statistics should be treated as ColumnOrder::UNDEFINED and SortOrder::SIGNED.
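
As a rough illustration of the legacy rule above, here is a minimal sketch, not the code from this PR. The variant names follow the ones quoted in this thread; the `sort_order` method and its fallback are assumptions made for the example.

```rust
/// Sort order used when comparing a value with statistics min/max.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SortOrder {
    SIGNED,
    UNSIGNED,
    UNDEFINED,
}

/// Column order for a leaf column.
#[allow(non_camel_case_types)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ColumnOrder {
    TYPE_DEFINED_ORDER(SortOrder),
    UNDEFINED,
}

impl ColumnOrder {
    /// Hypothetical helper: returns the sort order to use for comparisons.
    /// Legacy statistics (undefined column order) fall back to signed order.
    pub fn sort_order(&self) -> SortOrder {
        match *self {
            ColumnOrder::TYPE_DEFINED_ORDER(order) => order,
            ColumnOrder::UNDEFINED => SortOrder::SIGNED,
        }
    }
}
```

With this shape, a reader that finds no column order in the file metadata can still pick a comparison order without a special case at each call site.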

Closes #83

coveralls · 2018-04-19T08:39:45Z

Coverage decreased (-0.02%) to 94.982% when pulling cb4b79f on sadikovi:add-statistics into cdfca93 on sunchao:master.

sunchao · 2018-04-19T23:54:55Z

Thanks @sadikovi ! I'll take a look at this soon. Meanwhile, could you take a look at #89 ? I made some changes on that PR.

sadikovi · 2018-04-20T00:11:50Z

Cheers! Yes, I was going to review #89 after your changes - will do asap.

sunchao · 2018-04-20T04:34:25Z

src/file/metadata.rs

+/// Optional column statistics that are used per row group and per page.
+///
+/// Use `order` field to determine comparison order for a type.
+/// When statistics contain deprecated min/max fields, then comparison is always signed.


nit: remove then?

Yes, will do.

sunchao · 2018-04-20T05:16:19Z

src/file/metadata.rs

+/// When statistics contain deprecated min/max fields, then comparison is always signed.
+#[derive(Clone, Debug, PartialEq)]
+pub enum Statistics {
+  Boolean { order: ColumnOrder, min: Option<bool>, max: Option<bool>, nulls: u64 },


wonder if it would be better to have a Statistics trait with methods such as null_count, has_nulls, etc., and a separate struct TypedStatistics<T> that implements the former. The latter can implement more methods in future such as set_min_max, etc., that are specific to the generic type T.

Sorry, I am a bit confused. I chose not to use T, so that it is easy to extract the relevant statistics for a type.

My main concern with typed statistics is that we would need a wrapper similar to the column reader. That is why I chose to remove this part and implement it directly as an enum of different statistics, one per physical type.

Could you elaborate a little more on your plan for adding typed statistics? Should we consider statistics as read-only? Thanks.

Yes I was wondering if this could be similar to the RowGroupStatistics in the C++ version: https://github.com/apache/parquet-cpp/blob/master/src/parquet/statistics.h#L83, which is readonly, and then TypedStatistics for different types.

I'm not very familiar with how statistics will be used in the whole codebase though. I think the current approach is fine as well - we can come back and revisit this if needed.

Interesting... let me think about this. I do want to expose min/max methods.

sunchao · 2018-04-20T05:16:35Z

src/file/metadata.rs

+impl Statistics {
+  /// Returns number of nulls for the column.
+  /// Note that even though statistics are for leaf columns, null count also takes into
+  // account null complex types, e.g. lists.


nit: add a /

sunchao · 2018-04-20T05:39:32Z

src/file/metadata.rs

+  /// This is used when min_value/max_value are set for statistics.
+  ///
+  /// Copied from `parquet.thrift`, see the file for more details on order of values.
+  fn column_order(physical_type: Type, logical_type: LogicalType) -> ColumnOrder {


perhaps this can be a method for ColumnDescriptor?

This might have been the wrong name for ColumnOrder. This is not a sort order; it is the order that should be used when comparing the current value with min and max from statistics. I might need to rename it.

I will rewrite it so that it is a method on the TypeDefinedOrder struct as part of the ColumnOrder enum. Something like this:

// Sort order for statistics, to define `min <= value <= max`.
enum SortOrder { SIGNED, UNSIGNED, UNKNOWN }

// Column order for each leaf column.
enum ColumnOrder { TypeDefinedOrder(SortOrder), Undefined }

impl ColumnOrder {
  // Creates new column order for a leaf column.
  pub fn new(field: &Type) -> Self { ... }

  // Returns sort order for a type.
  fn sort_order(field: &Type) -> SortOrder { ... }

  // Returns default sort order based on physical type.
  fn default_sort_order(field: &Type) -> SortOrder { ... }
}

I just need to figure out how to handle legacy min/max values as opposed to min_value/max_value. I will follow the reference: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java

Sounds good. Maybe also take a look at how C++ handles this: https://github.com/apache/parquet-cpp/blob/master/src/parquet/metadata.cc#L53.

I'll take a look too.

Thanks! I had a look, and this makes clear when to use type defined order, because I was a bit confused with parquet-mr code. Will update accordingly.
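
To make the signed/unsigned distinction discussed in this thread concrete, here is a small sketch of a `min <= value <= max` check for 32-bit values. The `in_range` function is hypothetical and is not part of the PR; it only shows why the sort order matters when the same stored bits are interpreted differently.

```rust
/// Sort order used when comparing a value with statistics min/max.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SortOrder {
    SIGNED,
    UNSIGNED,
}

/// Hypothetical helper: checks `min <= value <= max` under the given sort order.
/// For unsigned order the same 32 bits are reinterpreted as `u32`.
pub fn in_range(order: SortOrder, value: i32, min: i32, max: i32) -> bool {
    match order {
        SortOrder::SIGNED => min <= value && value <= max,
        SortOrder::UNSIGNED => {
            let (v, lo, hi) = (value as u32, min as u32, max as u32);
            lo <= v && v <= hi
        }
    }
}
```

For example, a max whose bits are all ones is `-1` when read as signed but `u32::MAX` when read as unsigned, so `in_range(SortOrder::UNSIGNED, 5, 0, -1)` holds while the signed check on the same inputs does not.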

sunchao · 2018-04-20T05:53:46Z

src/file/metadata.rs

+
+/// Column order to compare values with statistics min/max.
+#[derive(Clone, Debug, PartialEq)]
+pub enum ColumnOrder {


nit: change this to SortOrder to be consistent with parquet-mr/cpp?

Yes, I will rename it.

sunchao · 2018-04-20T06:02:56Z

src/file/metadata.rs

+  /// Method to convert from Thrift definition.
+  /// Note that column type should match statistics, otherwise, there is a risk of
+  /// invalid conversion.
+  pub fn from_thrift(


do we also need to process column_orders from FileMetaData?

Yes, I will update.

Based on my comment above, FileMetaData will have a method that returns column orders for leaf columns, and sort orders could be extracted from it.

fn column_orders(&self) -> &[ColumnOrder] { ... }
fn column_order(&self, i: usize) -> &ColumnOrder { ... }
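
One possible shape for these accessors is sketched below. It is an assumption for illustration only: it stores the orders as an `Option` (the "return option in column_orders method" commit later in this PR suggests something similar), and the simplified `ColumnOrder`, the `new` constructor, and the fallback in `column_order` are all hypothetical.

```rust
/// Simplified column order for the sketch (the real enum carries a sort order).
#[allow(non_camel_case_types)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ColumnOrder {
    TYPE_DEFINED_ORDER,
    UNDEFINED,
}

/// Hypothetical, stripped-down file metadata holding only column orders.
pub struct FileMetaData {
    column_orders: Option<Vec<ColumnOrder>>,
}

impl FileMetaData {
    pub fn new(column_orders: Option<Vec<ColumnOrder>>) -> Self {
        FileMetaData { column_orders }
    }

    /// Returns column orders for all leaf columns, or `None` for files
    /// written without column order information.
    pub fn column_orders(&self) -> Option<&[ColumnOrder]> {
        self.column_orders.as_deref()
    }

    /// Returns the column order for the i-th leaf column, falling back to
    /// `UNDEFINED` when the file has no column orders.
    /// Panics if orders are present and `i` is out of bounds.
    pub fn column_order(&self, i: usize) -> ColumnOrder {
        self.column_orders
            .as_ref()
            .map(|orders| orders[i])
            .unwrap_or(ColumnOrder::UNDEFINED)
    }
}
```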

sadikovi · 2018-04-21T09:11:31Z

@sunchao Would you mind taking a look at my replies? I would like to clarify the changes first. Thanks!

sunchao · 2018-04-21T20:46:25Z

@sadikovi : sure. I've left some comments.

sadikovi · 2018-04-21T23:01:53Z

I will make changes ASAP and let you know. Thanks!

sadikovi · 2018-04-22T08:56:20Z

@sunchao I added typed statistics and refactored ColumnOrder and SortOrder, and added column_orders to FileMetaData, to get a sense of what it would look like.

Several issues that I found/introduced:

- Currently, if column_orders is None, we still create a vector filled with ColumnOrder::UNDEFINED. Should we do that, or should we just return None?
- It is difficult to extract values with typed statistics, meaning that one would have to downcast Box<Statistics> to TypedStatistics<T> to get min/max.
- It is also difficult to add test cases to check statistics, because of the point above.

Could you advise on the changes/further steps? Thanks.
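
The downcasting problem mentioned above can be sketched as follows. This is not the code from the PR: the `as_any` method and the `min_i32` helper are assumptions added to show what a caller would need in order to get a typed min back out of a boxed trait object.

```rust
use std::any::Any;

/// Object-safe statistics trait; `as_any` is a hypothetical escape hatch
/// that makes downcasting possible.
pub trait Statistics {
    fn null_count(&self) -> u64;
    fn as_any(&self) -> &dyn Any;
}

/// Typed statistics for a value type `T`.
pub struct TypedStatistics<T> {
    pub min: Option<T>,
    pub max: Option<T>,
    pub null_count: u64,
}

impl<T: 'static> Statistics for TypedStatistics<T> {
    fn null_count(&self) -> u64 {
        self.null_count
    }

    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Extracts the min as `i32`; the caller has to know the concrete type up
/// front and gets `None` if the guess is wrong or min is not set.
pub fn min_i32(stats: &dyn Statistics) -> Option<i32> {
    stats
        .as_any()
        .downcast_ref::<TypedStatistics<i32>>()
        .and_then(|typed| typed.min)
}
```

Every physical type would need a similar helper or an explicit downcast at the call site, which is the extra friction described in the list above.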

sunchao · 2018-04-24T05:26:55Z

@sadikovi Thanks for the update. Will take a look soon.

sadikovi · 2018-04-24T05:51:50Z

I am planning to go back to a previous implementation with enum, sorry. I just can’t figure out how to work with traits and go back to typed impls. I mean no offence nor disrespect to anyone, but I feel like enum could be better when reading values from statistics.

Plus, I am updating some other bits, did not do much today, promise to finish it ASAP.

sunchao · 2018-04-24T06:05:48Z

OK no worries. Sorry for the wrong suggestion 😳 . I think the enum approach is fine too.

sadikovi · 2018-04-25T01:44:11Z

@sunchao it is all good, your suggestion was on point. I just got a bit carried away. I will spend some time coming up with a better solution.

sadikovi · 2018-04-26T08:41:56Z

@sunchao I updated the code with a new iteration of Statistics. The change is minor; I updated the PR description if you would like a bit of an overview.

I also designed Statistics updates, but I have not implemented them here. Instead, I put the code (not fully complete, but it compiles) in a gist: https://gist.github.com/sadikovi/a2a5d79f4e4368ea50a5b6b5a0e58cce

The idea is to treat Statistics as immutable and have MutableStatisticsBuffer for updates. Underneath, we update TypedStatistics through the traits StatisticsUpdate and AsStatistics, which allow us to update signed/unsigned values and convert back to Statistics.
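
The split between an immutable snapshot and a mutable buffer might look roughly like the sketch below. It does not reproduce the gist: the type names `Int32Statistics` and `Int32StatisticsBuffer` and their methods are hypothetical, and only signed comparison for a single type is shown.

```rust
/// Immutable statistics snapshot for an int32 column (hypothetical type).
#[derive(Clone, Debug, PartialEq)]
pub struct Int32Statistics {
    pub min: Option<i32>,
    pub max: Option<i32>,
    pub null_count: u64,
}

/// Mutable buffer that accumulates values while writing (hypothetical type).
#[derive(Debug, Default)]
pub struct Int32StatisticsBuffer {
    min: Option<i32>,
    max: Option<i32>,
    null_count: u64,
}

impl Int32StatisticsBuffer {
    /// Folds one value into the buffer; `None` counts as a null.
    pub fn update(&mut self, value: Option<i32>) {
        match value {
            Some(v) => {
                self.min = Some(self.min.map_or(v, |m| m.min(v)));
                self.max = Some(self.max.map_or(v, |m| m.max(v)));
            }
            None => self.null_count += 1,
        }
    }

    /// Converts the current state into an immutable snapshot.
    pub fn as_statistics(&self) -> Int32Statistics {
        Int32Statistics {
            min: self.min,
            max: self.max,
            null_count: self.null_count,
        }
    }
}
```

Readers only ever see the snapshot type, so code that consumes statistics cannot modify them by accident.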

sunchao · 2018-04-28T05:22:29Z

Thanks @sadikovi . Will take a look soon.

sadikovi · 2018-04-28T23:28:44Z

I think I need to update file/mod.rs and lib.rs to include a note about statistics. Will do it ASAP.

sunchao

Thanks @sadikovi . This looks good. I left some comments.

sunchao · 2018-04-29T04:40:58Z

src/basic.rs

+/// min/max.
+///
+/// See reference in
+/// https://github.com/apache/parquet-cpp/blob/master/src/parquet/types.h#L120


nit: don't reference the line number as it may get changed soon.

sunchao · 2018-04-29T06:10:57Z

src/file/statistics.rs

+
+/// Statistics for a column chunk and data page.
+#[derive(Debug, PartialEq)]
+pub enum Statistics {


Back to the question of implementing this using a trait. I think you could use something like the following:

pub trait Statistics {
  /// Returns `true` if statistics have old `min` and `max` fields set.
  /// This means that the column order is likely to be undefined, which, for old files
  /// could mean a signed sort order of values.
  ///
  /// Refer to [`ColumnOrder`](`::basic::ColumnOrder`) and
  /// [`SortOrder`](`::basic::SortOrder`) for more information.
  fn is_min_max_deprecated(&self) -> bool;

  /// Returns number of null values for the column.
  /// Note that this includes all nulls when column is part of the complex type.
  fn null_count(&self) -> u64;

  /// Returns `true` if statistics collected any null values, `false` otherwise.
  fn has_nulls(&self) -> bool {
    self.null_count() > 0
  }

  /// Returns `true` if min value and max value are set.
  fn has_min_max_set(&self) -> bool;
}

and then have TypedStatistics to implement this:

impl<T: DataType> Statistics for TypedStatistics<T> {
  /// Whether or not min and max values are set.
  fn has_min_max_set(&self) -> bool {
    self.min.is_some() && self.max.is_some()
  }

  /// Returns null count.
  fn null_count(&self) -> u64 {
    self.null_count
  }

  /// Returns `true` if statistics were created using old min/max fields.
  fn is_min_max_deprecated(&self) -> bool {
    self.is_min_max_deprecated
  }
}

The from_thrift function now needs to return Box<Statistics>:

pub fn from_thrift(
  physical_type: Type,
  thrift_stats: Option<TStatistics>
) -> Option<Box<Statistics>> {
  ...
  let res: Box<Statistics> = match physical_type {
    Type::BOOLEAN => {
      Box::new(TypedStatistics::<BoolType>::new(
        min.map(|data| data[0] != 0),
        max.map(|data| data[0] != 0),
        null_count,
        old_format
      ))
    },
    ...

There are some other issues to solve with this approach though. For instance, handling assert_eq for boxed traits.

Not sure if you have explored this approach - just my 2 cents. As I mentioned before, the current enum approach should be fine too.

Yes, I tried this approach (https://github.com/sadikovi/parquet-rs/blob/011532c2450b2c7355488066d65deef8dd50e207/src/file/statistics.rs), and it did compile and work, but then I failed to find a way of extracting min/max values from typed statistics once it was Box<Statistics>. Assertions in tests were also difficult to add. Those were the main reasons for switching to an enum, so I could cast it properly (similar to the column reader).

IMHO, the enum also works, and I posted code in a gist that shows how to implement a statistics collector for writes, which we will need, so it is still possible to do with an enum.

I do not want anyone, including myself, to end up rewriting this implementation if we find it not suitable long term - so it might be better to discuss this further.

No worries. I didn't know the difficulty of extracting min/max values. Let's go with the enum approach.
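
For comparison, here is a minimal sketch of how extraction and test assertions work with the enum approach. Only two variants are shown, the field layout is simplified from the diff quoted earlier in this review, and the `int32_min_max` helper is hypothetical.

```rust
/// Simplified statistics enum with one variant per physical type.
#[derive(Clone, Debug, PartialEq)]
pub enum Statistics {
    Boolean { min: Option<bool>, max: Option<bool>, nulls: u64 },
    Int32 { min: Option<i32>, max: Option<i32>, nulls: u64 },
}

impl Statistics {
    /// Returns the number of nulls regardless of the physical type.
    pub fn null_count(&self) -> u64 {
        match self {
            Statistics::Boolean { nulls, .. } => *nulls,
            Statistics::Int32 { nulls, .. } => *nulls,
        }
    }

    /// Returns `true` if any nulls were recorded.
    pub fn has_nulls(&self) -> bool {
        self.null_count() > 0
    }
}

/// Hypothetical helper: extracts typed min/max with a plain pattern match,
/// no boxing or downcasting involved.
pub fn int32_min_max(stats: &Statistics) -> Option<(i32, i32)> {
    match stats {
        Statistics::Int32 { min: Some(min), max: Some(max), .. } => Some((*min, *max)),
        _ => None,
    }
}
```

Because the enum derives `PartialEq` and `Debug`, `assert_eq!` works on it directly, which sidesteps the boxed-trait assertion issue raised above.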

sunchao · 2018-04-29T06:11:20Z

src/file/statistics.rs

+}
+
+/// Typed implementation for [`Statistics`].
+pub struct TypedStatistics<T: DataType> {


maybe add a TODO for the distinct count value?

I added distinct_count field to statistics.

sunchao · 2018-04-29T06:14:56Z

src/file/statistics.rs

+    self.max.as_ref().unwrap()
+  }
+
+  /// Whether or not min and max values are set.


should we mention this will panic if min/max is not set?

Yes, I updated the comments.
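
A sketch of what such documented accessors could look like is below. The struct is simplified from the diff, and the non-panicking `min_opt` variant is an assumption added only to contrast with the panicking accessors.

```rust
/// Simplified typed statistics holding optional min/max values.
pub struct TypedStatistics<T> {
    min: Option<T>,
    max: Option<T>,
}

impl<T> TypedStatistics<T> {
    pub fn new(min: Option<T>, max: Option<T>) -> Self {
        TypedStatistics { min, max }
    }

    /// Whether or not min and max values are set.
    pub fn has_min_max_set(&self) -> bool {
        self.min.is_some() && self.max.is_some()
    }

    /// Returns the min value.
    /// Panics if min is not set; check `has_min_max_set` first.
    pub fn min(&self) -> &T {
        self.min.as_ref().expect("min value is not set")
    }

    /// Returns the max value.
    /// Panics if max is not set; check `has_min_max_set` first.
    pub fn max(&self) -> &T {
        self.max.as_ref().expect("max value is not set")
    }

    /// Hypothetical non-panicking alternative to `min`.
    pub fn min_opt(&self) -> Option<&T> {
        self.min.as_ref()
    }
}
```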

sunchao · 2018-04-29T06:19:19Z

src/file/reader.rs

+                column.physical_type()
+              );
+              res.push(ColumnOrder::TYPE_DEFINED_ORDER(sort_order));
+            }


nit: panic if it is some other TColumnOrder? although not possible at the moment.

No, it is not possible at the moment. Since the match is checked at compile time, we will update it once another option is added to column order.

Sounds good.

sunchao · 2018-04-29T06:20:25Z

src/file/reader.rs

+        for (i, column) in schema_descr.columns().iter().enumerate() {
+          match orders[i] {
+            TColumnOrder::TYPEORDER(_) => {
+              let sort_order = ColumnOrder::get_sort_order(


nit: maybe make get_sort_order a method for ColumnDescriptor? since it already has all the info.

I added the method to column order to emphasise that this is a column order feature. If we add it to the column descriptor, we should consider adding the actual ColumnOrder to it as well, because a get_sort_order method on its own might be confusing: people would call it to get the sort order, but the file format might not support column order, so the actual sort order could differ from the one the method returns.

I was thinking about adding ColumnOrder field to ColumnDescriptor. But this will require some additional changes, that I can make in a separate PR. Let me know if you think it is a good idea.

I see. The current approach is fine too. Let's keep it this way then.

sunchao

LGTM!

sunchao · 2018-04-30T06:43:59Z

Merged. Thanks @sadikovi for the nice work!

sadikovi · 2018-04-30T06:53:55Z

@sunchao Thanks a lot! I do feel like I could have done a better job. I will open other issues if the changes to statistics are necessary. Let me know if there is anything you would like to update.

sadikovi added 8 commits April 17, 2018 16:52

add statistics

bd87d48

add column order

66af22f

update doc

dc878be

update doc

9640b50

minor updates, add tests

e05f26c

add more tests

65823aa

add statistics to data pages

670dfc6

minor test changes

81d709a

sunchao reviewed Apr 20, 2018


sadikovi mentioned this pull request Apr 21, 2018

Releasing 0.2.0? #88

Closed

sadikovi added 4 commits April 22, 2018 11:57

add column order to basic.rs

cf115ff

add tests for column order

6eb52b4

add typed statistics

9fc7055

update tests

3a8dd7f

return option in column_orders method

011532c

sadikovi added 4 commits April 25, 2018 15:53

Merge remote-tracking branch 'origin/master' into add-statistics

29cd1cc

update comments and minor code changes

868ec7b

another iteration on stats

b6bd5b1

add methods to create stats, update tests

1760734

sadikovi added 6 commits April 25, 2018 22:38

remove test file

372e01e

update methods and comments

0aee494

add tests

fdc9ec0

add tests

f70781d

update comments

7f3e907

update page test

85ea887

sadikovi added 3 commits April 27, 2018 09:15

update statistics imports

eda9dd3

add file reader tests

038f6f6

update comments

ed37464

sunchao reviewed Apr 29, 2018


sadikovi added 2 commits April 30, 2018 12:22

rebase branch

74ae715

address comments, add distinct_count

1f24daa

sunchao approved these changes Apr 30, 2018


update doc

cb4b79f

sunchao merged commit 132e2e2 into sunchao:master Apr 30, 2018

sadikovi deleted the add-statistics branch April 30, 2018 06:47
