Skip to content
This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

Add WriterProperties to configure file writer #128

Merged
merged 6 commits into from
Jul 11, 2018

Conversation

sadikovi
Copy link
Collaborator

@sadikovi sadikovi commented Jul 8, 2018

This PR adds writer properties that allow to configure file writer and column writers.

User can create easily create properties with default values, or specify different configuration either globally or for an individual column.

I also added build.rs (again!). This is used to generate created_by string that includes current version and the latest git commit hash.

Notes:

  • We allow setting created_by.
  • We do not use PLAIN_DICTIONARY, we use RLE_DICTIONARY and PLAIN for dictionary encoding.
  • We do not force PLAIN encoding as fallback encoding, technically any encoding can be specified as fallback (parquet-cpp explicitly sets PLAIN).

@coveralls
Copy link

coveralls commented Jul 8, 2018

Pull Request Test Coverage Report for Build 541

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.07%) to 95.177%

Totals Coverage Status
Change from base Build 533: -0.07%
Covered Lines: 10735
Relevant Lines: 11279

💛 - Coveralls

@sadikovi
Copy link
Collaborator Author

sadikovi commented Jul 8, 2018

@sunchao could you review this PR? Let me know if I need to write more tests, also to increase coverage. Thanks!

const DEFAULT_STATISTICS_ENABLED: bool = true;
const DEFAULT_MAX_STATISTICS_SIZE: usize = 4096;
const DEFAULT_MAX_ROW_GROUP_LENGTH: usize = 64 * 1024 * 1024;
const DEFAULT_CREATED_BY: &str = env!("PARQUET_CREATED_BY");
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rely on the fact that this will be substituted on actual &str when compiling, so that it will be part of the binary.

pub fn dictionary_data_page_encoding(&self) -> Encoding {
// PLAIN_DICTIONARY encoding is deprecated in writer version 1.
// Dictionary values are encoded using RLE_DICTIONARY encoding.
Encoding::RLE_DICTIONARY
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think we should be using deprecated PLAIN_DICTIONARY. Note that parquet-cpp and parquet-mr specify them when using writer version 1.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

if value == Encoding::PLAIN_DICTIONARY || value == Encoding::RLE_DICTIONARY {
panic!("Dictionary encoding can not be used as fallback encoding");
}
// TODO: Check if we need to force PLAIN encoding as dictionary fallback.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting question. Parquet-cpp forces PLAIN encoding as fallback, but I could not find any mentioning of having fallback as PLAIN only - I guess, technically any encoding could be used for fallback, except dictionary encoding. Right now it is possible to set any encoding for fallback.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I think parquet-cpp supports other types of decoding but haven't used them in the writer path, so currently they only support PLAIN in fallback as a side effect. In parquet-mr though, other types of fallback encoding schemas are used so I think we are on the correct track.

Copy link
Owner

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sadikovi ! left some comments.

pub fn dictionary_data_page_encoding(&self) -> Encoding {
// PLAIN_DICTIONARY encoding is deprecated in writer version 1.
// Dictionary values are encoded using RLE_DICTIONARY encoding.
Encoding::RLE_DICTIONARY
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

if value == Encoding::PLAIN_DICTIONARY || value == Encoding::RLE_DICTIONARY {
panic!("Dictionary encoding can not be used as fallback encoding");
}
// TODO: Check if we need to force PLAIN encoding as dictionary fallback.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I think parquet-cpp supports other types of decoding but haven't used them in the writer path, so currently they only support PLAIN in fallback as a side effect. In parquet-mr though, other types of fallback encoding schemas are used so I think we are on the correct track.


/// Sets encoding for this column.
///
/// This field is treated as fallback encoding, and should not be dictionary encoding.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the dictionary encoding is disabled for this column, then the parameter will be used as the primary encoding, correct?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct. If dictionary is enabled, this acts as fallback encoding. If dictionary is disabled, we will use this encoding as a primary. I will update the comment to make it clear.


/// Sets (fallback) encoding for any column.
pub fn set_encoding(mut self, value: Encoding) -> Self {
self.default_column_properties.set_encoding(value);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check if value is dictionary encoding here? what if people pass RLE_DICTIONARY as value and then enable dictionary encoding for some column?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! This is a bug, I will test for it. Yes, we should have some place where we check all of the configurations. Will update.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add all checks when I build WriterProperties - that should solve the problem.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the code again. It seems that we do not allow setting RLE_DICTIONARY or PLAIN_DICTIONARY, regardless of whether a dictionary is enabled or not, so someone wants to enable dictionary encoding they should use another method. I will add a test for this situation though.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

/// so some default value must be used.
#[derive(Debug, Clone, PartialEq)]
struct ColumnProperties {
encoding: Option<Encoding>,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why these are Options? can they be initialized with the global default settings?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think so. The reason is this:

Let's assume that global default compression is UNCOMPRESSED and encoding is PLAIN. I set column A to have encoding RLE, while it has all others as globals, then I set compression for all columns to be SNAPPY, it is not very convenient to propagate this value to the per-column value, e.g. SNAPPY instead of UNCOMPRESSED.

This is how it works now:

For non-column settings, e.g. page size => initially we set them to the global defaults, when value changes, we just set it in a builder and propagate to properties. For column settings, e.g. encoding or compression => per column settings -> global (all columns) settings -> global defaults

For example, I can set global encoding as RLE, but for column A I set DELTA_BINARY_PACKED. So it
will look like the following:

Column Per column settings Global (all columns) settings Global defaults
Column A DELTA_BINARY_PACKED RLE PLAIN
Column B RLE PLAIN

The reason I have Options in ColumnProperties, so I know whether or not the value has been set and
I can fall back to the global settings. And because I want to reuse ColumnProperties for per-column
and default settings, that is why we have another column "Global defaults".

This is how I extract the value:
If per column value is set, extract that value; otherwise, check the global all columns defaults;
if that is not set, return global default.

self.column_properties.get(col).and_then(|c| c.encoding())
  .or_else(|| self.default_column_properties.encoding())
  .unwrap_or(DEFAULT_ENCODING)

But it does look like I can initialise global (all columns) settings to global defaults, so we will have only 2 columns in the table above. I can do something like this:

self.column_properties.get(col).and_then(|c| c.encoding())
  .or_else(|| self.default_column_properties.encoding())
  .unwrap()

So all properties in self.default_column_properties will have Some(DEFAULT_XYZ).

Let me know what you think, or how you would do it differently, potentially simpler than this approach. Thanks!

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. This makes sense to me. Thanks for the explanation!

Copy link
Owner

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sadikovi
Copy link
Collaborator Author

Thanks @sunchao! I hope this will work:).

@sadikovi
Copy link
Collaborator Author

@sunchao Feel free to leave more comments - I am more than happy to address them, so we can make parquet-rs the best and the fastest Parquet implementation.

@sunchao
Copy link
Owner

sunchao commented Jul 11, 2018

@sadikovi Thanks. This looks good to me 👍 . Will merge now.

@sunchao sunchao merged commit dc2f8c6 into sunchao:master Jul 11, 2018
@sadikovi sadikovi deleted the add-writer-properties branch July 13, 2018 20:53
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants