This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

Add statistics for column chunk metadata and data page #94

Merged
merged 29 commits into from
Apr 30, 2018

Conversation

sadikovi
Collaborator

@sadikovi sadikovi commented Apr 19, 2018

This PR adds a Statistics enum in statistics.rs that tracks min, max, distinct_count, and nulls: the min value for a column in a row group, the max value for a column in a row group, an optional number of distinct values, and the number of nulls, respectively.

Statistics is an enum with one variant for each physical type we have, and it is considered immutable. The actual implementation is in TypedStatistics<T: DataType>, which should be used when implementing statistics updates (see the comments below related to the updates).

To support the new ordering feature in statistics, we added ColumnOrder and SortOrder. All legacy statistics should be treated as ColumnOrder::UNDEFINED and SortOrder::SIGNED.
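
To illustrate why the sort order matters when comparing a value against min/max, here is a minimal sketch. It is not the code in this PR: `compare_bytes` is a hypothetical helper, and the two-variant `SortOrder` is a simplified stand-in. It only shows that the same bytes can order differently under signed and unsigned comparison.

```rust
use std::cmp::Ordering;

/// Simplified sort order (assumption: only the two cases needed for the demo).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SortOrder {
  SIGNED,
  UNSIGNED,
}

/// Hypothetical helper: compares two byte slices lexicographically, treating
/// each byte as signed (`i8`) or unsigned (`u8`) depending on `order`.
pub fn compare_bytes(a: &[u8], b: &[u8], order: SortOrder) -> Ordering {
  match order {
    // Slices of `u8` already compare lexicographically as unsigned bytes.
    SortOrder::UNSIGNED => a.cmp(b),
    // Reinterpret every byte as `i8` before comparing.
    SortOrder::SIGNED => {
      let a_signed = a.iter().map(|&x| x as i8);
      let b_signed = b.iter().map(|&x| x as i8);
      a_signed.cmp(b_signed)
    }
  }
}
```

With these definitions, the byte 0x80 is greater than 0x01 under `UNSIGNED` but smaller under `SIGNED`, so min and max written with one order cannot safely be read with the other.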

Closes #83

@coveralls

coveralls commented Apr 19, 2018

Coverage decreased (-0.02%) to 94.982% when pulling cb4b79f on sadikovi:add-statistics into cdfca93 on sunchao:master.

@sunchao
Owner

sunchao commented Apr 19, 2018

Thanks @sadikovi ! I'll take a look at this soon. Meanwhile, could you take a look at #89 ? I made some changes on that PR.

@sadikovi
Collaborator Author

Cheers! Yes, I was going to review #89 after your changes - will do asap.

/// Optional column statistics that are used per row group and per page.
///
/// Use `order` field to determine comparison order for a type.
/// When statistics contain deprecated min/max fields, then comparison is always signed.
Owner

nit: remove then?

Collaborator Author

Yes, will do.

/// When statistics contain deprecated min/max fields, then comparison is always signed.
#[derive(Clone, Debug, PartialEq)]
pub enum Statistics {
Boolean { order: ColumnOrder, min: Option<bool>, max: Option<bool>, nulls: u64 },
Owner

wonder if it would be better to have a Statistics trait with methods such as null_count, has_nulls, etc., and a separate struct TypedStatistics<T> that implements the former. The latter can implement more methods in future such as set_min_max, etc., that are specific to the generic type T.

Collaborator Author

Sorry, I am a bit confused. I chose not to use T so that it is easy to extract the relevant statistics for a type.

My main concern with typed statistics is that we would need a wrapper similar to the column reader. That is why I chose to remove this part and implement statistics directly as an enum with one variant per physical type.

Could you elaborate a little more on your plan for adding typed statistics? Should we consider statistics read-only? Thanks.
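
A minimal sketch of what "easy to extract" means with the enum approach. The enum is reduced to two physical types and `int32_min_max` is a hypothetical helper, so treat it as an illustration rather than the code under review.

```rust
/// Simplified statistics enum with one variant per physical type.
/// Only two physical types are shown (assumption for brevity).
#[derive(Clone, Debug, PartialEq)]
pub enum Statistics {
  Boolean { min: Option<bool>, max: Option<bool>, nulls: u64 },
  Int32 { min: Option<i32>, max: Option<i32>, nulls: u64 },
}

impl Statistics {
  /// Returns the number of nulls, regardless of the physical type.
  pub fn null_count(&self) -> u64 {
    match *self {
      Statistics::Boolean { nulls, .. } => nulls,
      Statistics::Int32 { nulls, .. } => nulls,
    }
  }
}

/// Hypothetical helper: returns the `i32` min/max pair when `stats` is the
/// `Int32` variant and both values are set, `None` otherwise.
pub fn int32_min_max(stats: &Statistics) -> Option<(i32, i32)> {
  match *stats {
    Statistics::Int32 { min: Some(min), max: Some(max), .. } => Some((min, max)),
    _ => None,
  }
}
```

Because the enum derives `PartialEq` and `Debug`, values can also be compared directly with `assert_eq!` in tests.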

Owner

Yes, I was wondering if this could be similar to RowGroupStatistics in the C++ version: https://github.com/apache/parquet-cpp/blob/master/src/parquet/statistics.h#L83, which is read-only, with TypedStatistics for the different types.

I'm not very familiar with how statistics will be used in the whole codebase though. I think the current approach is fine as well - we can come back and revisit this if needed.

Collaborator Author

Interesting... let me think about this. I do want to expose min/max methods.

impl Statistics {
/// Returns number of nulls for the column.
/// Note that even though statistics are for leaf columns, null count also takes into
// account null complex types, e.g. lists.
Owner

nit: add a /

/// This is used when min_value/max_value are set for statistics.
///
/// Copied from `parquet.thrift`, see the file for more details on order of values.
fn column_order(physical_type: Type, logical_type: LogicalType) -> ColumnOrder {
Owner

perhaps this can be a method for ColumnDescriptor?

Collaborator Author

This might have been the wrong name for ColumnOrder. This is not a sort order; it is the order that should be used when comparing the current value with min and max from statistics. I might need to rename it.

Collaborator Author

@sadikovi sadikovi Apr 21, 2018

I will rewrite it, so it will be a method on TypeDefinedOrder struct as part of ColumnOrder enum. Something like this:

// Sort order for statistics, to define `min <= value <= max`.
enum SortOrder {
  SIGNED,
  UNSIGNED,
  UNKNOWN
}

// Column order for each leaf column.
enum ColumnOrder {
  TypeDefinedOrder(SortOrder),
  Undefined
}

impl ColumnOrder {
  // Creates new column order for a leaf column.
  pub fn new(field: &Type) -> Self {
    ...
  }

  // Returns sort order for a type.
  fn sort_order(field: &Type) -> SortOrder {
    ...
  }

  // Returns default sort order based on physical type.
  fn default_sort_order(field: &Type) -> SortOrder {
    ...
  }
}

I just need to figure out how to handle legacy min/max values as opposed to min_value/max_value. I will follow the reference: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java

Owner

Sounds good. Maybe also take a look at how C++ handles this: https://github.com/apache/parquet-cpp/blob/master/src/parquet/metadata.cc#L53.

I'll take a look too.

Collaborator Author

Thanks! I had a look, and this makes clear when to use type defined order, because I was a bit confused with parquet-mr code. Will update accordingly.


/// Column order to compare values with statistics min/max.
#[derive(Clone, Debug, PartialEq)]
pub enum ColumnOrder {
Owner

nit: change this to SortOrder to be consistent with parquet-mr/cpp?

Collaborator Author

Yes, I will rename it.

/// Method to convert from Thrift definition.
/// Note that column type should match statistics, otherwise, there is a risk of
/// invalid conversion.
pub fn from_thrift(
Owner

do we also need to process column_orders from FileMetaData?

Collaborator Author

Yes, I will update.

Collaborator Author

@sadikovi sadikovi Apr 21, 2018

Based on my comment above, FileMetaData will have a method that returns column orders for leaf columns, and sort orders could be extracted from it.

fn column_orders(&self) -> &[ColumnOrder] {
  ...
}

fn column_order(&self, i: usize) -> &ColumnOrder {
  ...
}
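
A rough sketch of how such accessors could behave when the file carries no column orders. The stripped-down `FileMetaData`, its constructor, and the return types are assumptions made for illustration (they differ from the signatures above by keeping missing orders as `None` and returning an owned value); this is only one of the possible designs.

```rust
#[derive(Clone, Debug, PartialEq)]
pub enum SortOrder {
  SIGNED,
  UNSIGNED,
  UNDEFINED,
}

#[derive(Clone, Debug, PartialEq)]
pub enum ColumnOrder {
  TypeDefinedOrder(SortOrder),
  Undefined,
}

/// Hypothetical, stripped-down file metadata that only stores column orders.
pub struct FileMetaData {
  column_orders: Option<Vec<ColumnOrder>>,
}

impl FileMetaData {
  pub fn new(column_orders: Option<Vec<ColumnOrder>>) -> Self {
    FileMetaData { column_orders }
  }

  /// Returns column orders for all leaf columns, or `None` if the file
  /// does not define them.
  pub fn column_orders(&self) -> Option<&[ColumnOrder]> {
    self.column_orders.as_ref().map(|orders| orders.as_slice())
  }

  /// Returns the column order for the `i`-th leaf column, falling back to
  /// `ColumnOrder::Undefined` when the file has no column orders.
  /// Panics if orders are present and `i` is out of bounds.
  pub fn column_order(&self, i: usize) -> ColumnOrder {
    match self.column_orders {
      Some(ref orders) => orders[i].clone(),
      None => ColumnOrder::Undefined,
    }
  }
}
```

Keeping the `Option` lets callers tell "no column orders in the file" apart from "orders present but undefined", at the cost of a slightly more awkward accessor.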

@sadikovi
Collaborator Author

@sunchao Would you mind taking a look at my replies? I would like to clarify the changes first. Thanks!

@sadikovi sadikovi mentioned this pull request Apr 21, 2018
@sunchao
Owner

sunchao commented Apr 21, 2018

@sadikovi : sure. I've left some comments.

@sadikovi
Collaborator Author

I will make changes ASAP and let you know. Thanks!

@sadikovi
Collaborator Author

@sunchao I added typed statistics and refactored ColumnOrder and SortOrder, and added column_orders to FileMetaData, to get a sense of what it would look like.

Several issues that I found/introduced:

  • Currently, if column_orders is None, we still create a vector filled with ColumnOrder::UNDEFINED. Should we do that, or should we just return None?
  • Difficult to extract values with typed statistics, meaning that one would have to downcast Box<Statistics> to TypedStatistics<T> to get min/max.
  • Also difficult to add test cases to check statistics, because of the point above.
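
The downcasting issue can be sketched as follows. This is an assumed, simplified shape (the `as_any` method, the plain `T` parameter instead of `T: DataType`, and the `min_i32` helper are all hypothetical), meant only to show the extra step a caller needs once statistics sit behind a trait object.

```rust
use std::any::Any;

/// Type-erased, read-only view of statistics.
pub trait Statistics {
  fn null_count(&self) -> u64;
  /// Exposes the concrete value as `Any` so that callers can downcast.
  fn as_any(&self) -> &dyn Any;
}

/// Typed statistics; `T` stands in for the native value type here.
pub struct TypedStatistics<T> {
  pub min: Option<T>,
  pub max: Option<T>,
  pub null_count: u64,
}

impl<T: 'static> Statistics for TypedStatistics<T> {
  fn null_count(&self) -> u64 {
    self.null_count
  }

  fn as_any(&self) -> &dyn Any {
    self
  }
}

/// Hypothetical helper: recovers the `i32` min value from a trait object.
/// Returns `None` if the concrete type is not `TypedStatistics<i32>` or if
/// min is not set.
pub fn min_i32(stats: &dyn Statistics) -> Option<i32> {
  stats
    .as_any()
    .downcast_ref::<TypedStatistics<i32>>()
    .and_then(|typed| typed.min)
}
```

Every caller has to name the concrete type it expects, which is the friction described in the last two points of the list.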

Could you advise on the changes/further steps? Thanks.

@sunchao
Owner

sunchao commented Apr 24, 2018

@sadikovi Thanks for the update. Will take a look soon.

@sadikovi
Collaborator Author

I am planning to go back to a previous implementation with enum, sorry. I just can’t figure out how to work with traits and go back to typed impls. I mean no offence nor disrespect to anyone, but I feel like enum could be better when reading values from statistics.

Plus, I am updating some other bits, did not do much today, promise to finish it ASAP.

@sunchao
Owner

sunchao commented Apr 24, 2018

OK no worries. Sorry for the wrong suggestion 😳 . I think the enum approach is fine too.

@sadikovi
Collaborator Author

@sunchao it is all good, your suggestion was on point. I just got a bit carried away. I will spend some time coming up with a better solution.

@sadikovi
Collaborator Author

@sunchao I updated the code with a new iteration of Statistics. The change is minor. I also updated the PR description, in case you would like an overview.

I also designed Statistics updates, but I have not implemented them here. Instead, I put the code (not fully complete, but it compiles) in a gist: https://gist.github.com/sadikovi/a2a5d79f4e4368ea50a5b6b5a0e58cce

The idea is to treat Statistics as immutable and have a MutableStatisticsBuffer for updates. Underneath, we update TypedStatistics through two traits, StatisticsUpdate and AsStatistics, which allow us to update signed/unsigned values and convert back to Statistics.
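
The split between an immutable snapshot and a mutable buffer could look roughly like the sketch below. It does not reproduce the gist: the type names `Int32Statistics` and `Int32StatisticsBuffer` are hypothetical, only `i32` with its natural signed order is handled, and the traits mentioned above are left out.

```rust
/// Immutable statistics snapshot (hypothetical, `i32` only).
#[derive(Clone, Debug, PartialEq)]
pub struct Int32Statistics {
  pub min: Option<i32>,
  pub max: Option<i32>,
  pub null_count: u64,
}

/// Mutable accumulator used while values are being written (hypothetical).
#[derive(Default)]
pub struct Int32StatisticsBuffer {
  min: Option<i32>,
  max: Option<i32>,
  null_count: u64,
}

impl Int32StatisticsBuffer {
  pub fn new() -> Self {
    Self::default()
  }

  /// Records one value; `None` is counted as a null.
  pub fn update(&mut self, value: Option<i32>) {
    match value {
      Some(v) => {
        // Keep the smallest and largest value seen so far.
        self.min = Some(self.min.map_or(v, |m| m.min(v)));
        self.max = Some(self.max.map_or(v, |m| m.max(v)));
      }
      None => self.null_count += 1,
    }
  }

  /// Converts the accumulated state into an immutable snapshot.
  pub fn to_statistics(&self) -> Int32Statistics {
    Int32Statistics { min: self.min, max: self.max, null_count: self.null_count }
  }
}
```

Readers only ever see the snapshot type, so the read path stays free of mutation concerns.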

@sunchao
Owner

sunchao commented Apr 28, 2018

Thanks @sadikovi . Will take a look soon.

@sadikovi
Collaborator Author

I think I need to update file/mod.rs and lib.rs to include a note about statistics. Will do it ASAP.

Owner

@sunchao sunchao left a comment

Thanks @sadikovi . This looks good. I left some comments.

src/basic.rs Outdated
/// min/max.
///
/// See reference in
/// https://github.com/apache/parquet-cpp/blob/master/src/parquet/types.h#L120
Owner

nit: don't reference the line number as it may get changed soon.


/// Statistics for a column chunk and data page.
#[derive(Debug, PartialEq)]
pub enum Statistics {
Owner

@sunchao sunchao Apr 29, 2018

Back to the question of implementing this using a trait. I think you could use something like the following:

pub trait Statistics {
  /// Returns `true` if statistics have old `min` and `max` fields set.
  /// This means that the column order is likely to be undefined, which, for old files
  /// could mean a signed sort order of values.
  ///
  /// Refer to [`ColumnOrder`](`::basic::ColumnOrder`) and
  /// [`SortOrder`](`::basic::SortOrder`) for more information.
  fn is_min_max_deprecated(&self) -> bool;

  /// Returns number of null values for the column.
  /// Note that this includes all nulls when column is part of the complex type.
  fn null_count(&self) -> u64;

  /// Returns `true` if statistics collected any null values, `false` otherwise.
  fn has_nulls(&self) -> bool {
    self.null_count() > 0
  }

  /// Returns `true` if min value and max value are set.
  fn has_min_max_set(&self) -> bool;
}

and then have TypedStatistics implement this:

impl<T: DataType> Statistics for TypedStatistics<T> {
  /// Whether or not min and max values are set.
  fn has_min_max_set(&self) -> bool {
    self.min.is_some() && self.max.is_some()
  }

  /// Returns null count.
  fn null_count(&self) -> u64 {
    self.null_count
  }

  /// Returns `true` if statistics were created using old min/max fields.
  fn is_min_max_deprecated(&self) -> bool {
    self.is_min_max_deprecated
  }
}

The from_thrift function now needs to return Box<Statistics>:

pub fn from_thrift(
  physical_type: Type,
  thrift_stats: Option<TStatistics>
) -> Option<Box<Statistics>> {
   ...
      let res: Box<Statistics> = match physical_type {
        Type::BOOLEAN => {
          Box::new(TypedStatistics::<BoolType>::new(
            min.map(|data| data[0] != 0),
            max.map(|data| data[0] != 0),
            null_count,
            old_format
          ))
        },
   ...

There are some other issues to solve with this approach though. For instance, handling assert_eq for boxed traits.

Not sure if you have explored this approach - just my 2 cents. As I mentioned before, the current enum approach should be fine too.

Collaborator Author

Yes, I tried this approach (https://github.com/sadikovi/parquet-rs/blob/011532c2450b2c7355488066d65deef8dd50e207/src/file/statistics.rs). It did compile and work, but I failed to find a way to extract min/max values from typed statistics once they were behind Box<Statistics>. Assertions in tests were also difficult to add. Those were the main reasons for switching to an enum, so that I could cast it properly (similar to the column reader).

IMHO, the enum also works. I posted code in a gist that shows how to implement a statistics collector for writes, which we will need, so it is still possible to do this with an enum.

I do not want anyone, including myself, to end up rewriting this implementation if we find it not suitable long term - so it might be better to discuss this further.

Owner

No worries. I didn't know the difficulty of extracting min/max values. Let's go with the enum approach.

}

/// Typed implementation for [`Statistics`].
pub struct TypedStatistics<T: DataType> {
Owner

maybe add a TODO for the distinct count value?

Collaborator Author

I added a distinct_count field to statistics.

self.max.as_ref().unwrap()
}

/// Whether or not min and max values are set.
Owner

should we mention this will panic if min/max is not set?

Collaborator Author

Yes, I updated the comments.
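
A small sketch of the behaviour being documented, assuming accessors that unwrap the stored options as in the snippet under review. The struct is simplified (plain `T` instead of `T: DataType`, hypothetical `new` constructor) and uses `expect` so that the panic message is explicit.

```rust
/// Simplified typed statistics holding optional min and max values.
pub struct TypedStatistics<T> {
  min: Option<T>,
  max: Option<T>,
}

impl<T> TypedStatistics<T> {
  pub fn new(min: Option<T>, max: Option<T>) -> Self {
    TypedStatistics { min, max }
  }

  /// Returns the min value.
  ///
  /// Panics if min is not set; use `has_min_max_set` to check first.
  pub fn min(&self) -> &T {
    self.min.as_ref().expect("min value is not set")
  }

  /// Returns the max value.
  ///
  /// Panics if max is not set; use `has_min_max_set` to check first.
  pub fn max(&self) -> &T {
    self.max.as_ref().expect("max value is not set")
  }

  /// Whether or not both min and max values are set.
  pub fn has_min_max_set(&self) -> bool {
    self.min.is_some() && self.max.is_some()
  }
}
```

Callers that cannot guarantee the values are present would guard each access with `has_min_max_set()`.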

column.physical_type()
);
res.push(ColumnOrder::TYPE_DEFINED_ORDER(sort_order));
}
Owner

nit: panic if it is some other TColumnOrder? although not possible at the moment.

Collaborator Author

No, it is not possible at the moment. Since it is compile-safe, we will update it, once there is another option added to column order.
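
What "compile-safe" means here can be shown with a tiny sketch. Both enums are simplified stand-ins (the Thrift-generated type carries a payload that is dropped here, and `convert` is a hypothetical function); the point is only that a `match` without a wildcard arm must cover every variant.

```rust
/// Stand-in for the Thrift-generated column order, which currently has a
/// single variant (payload omitted for brevity).
pub enum TColumnOrder {
  TYPEORDER,
}

#[derive(Debug, PartialEq)]
pub enum ColumnOrder {
  TypeDefinedOrder,
  Undefined,
}

/// Converts without a wildcard arm: if a new variant is ever added to
/// `TColumnOrder`, this match stops compiling until the case is handled,
/// so no runtime panic branch is needed.
pub fn convert(order: &TColumnOrder) -> ColumnOrder {
  match *order {
    TColumnOrder::TYPEORDER => ColumnOrder::TypeDefinedOrder,
  }
}
```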

Owner

Sounds good.

for (i, column) in schema_descr.columns().iter().enumerate() {
match orders[i] {
TColumnOrder::TYPEORDER(_) => {
let sort_order = ColumnOrder::get_sort_order(
Owner

nit: maybe make get_sort_order a method on ColumnDescriptor, since it already has all the info?

Collaborator Author

I added the method to ColumnOrder to emphasise that this is a column order feature. If we add it to ColumnDescriptor, we should consider adding the actual ColumnOrder to it as well, because a get_sort_order method on its own might be confusing: people would call it to get the sort order, but the file may be in a format that does not support column order, so the actual sort order might differ from the one the method returns.

I was thinking about adding a ColumnOrder field to ColumnDescriptor, but this would require some additional changes that I can make in a separate PR. Let me know if you think it is a good idea.

Owner

I see. The current approach is fine too. Let's keep it this way then.

Owner

@sunchao sunchao left a comment

LGTM!

@sunchao sunchao merged commit 132e2e2 into sunchao:master Apr 30, 2018
@sunchao
Owner

sunchao commented Apr 30, 2018

Merged. Thanks @sadikovi for the nice work!

@sadikovi sadikovi deleted the add-statistics branch April 30, 2018 06:47
@sadikovi
Collaborator Author

@sunchao Thanks a lot! I do feel like I could have done a better job. I will open other issues if the changes to statistics are necessary. Let me know if there is anything you would like to update.
