-
Notifications
You must be signed in to change notification settings - Fork 20
Add WriterProperties to configure file writer #128
Conversation
@sunchao could you review this PR? Let me know if I need to write more tests, also to increase coverage. Thanks! |
src/file/properties.rs
Outdated
const DEFAULT_STATISTICS_ENABLED: bool = true; | ||
const DEFAULT_MAX_STATISTICS_SIZE: usize = 4096; | ||
const DEFAULT_MAX_ROW_GROUP_LENGTH: usize = 64 * 1024 * 1024; | ||
const DEFAULT_CREATED_BY: &str = env!("PARQUET_CREATED_BY"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rely on the fact that this will be substituted on actual &str
when compiling, so that it will be part of the binary.
pub fn dictionary_data_page_encoding(&self) -> Encoding { | ||
// PLAIN_DICTIONARY encoding is deprecated in writer version 1. | ||
// Dictionary values are encoded using RLE_DICTIONARY encoding. | ||
Encoding::RLE_DICTIONARY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think we should be using deprecated PLAIN_DICTIONARY. Note that parquet-cpp and parquet-mr specify them when using writer version 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me.
src/file/properties.rs
Outdated
if value == Encoding::PLAIN_DICTIONARY || value == Encoding::RLE_DICTIONARY { | ||
panic!("Dictionary encoding can not be used as fallback encoding"); | ||
} | ||
// TODO: Check if we need to force PLAIN encoding as dictionary fallback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting question. Parquet-cpp forces PLAIN encoding as fallback, but I could not find any mentioning of having fallback as PLAIN only - I guess, technically any encoding could be used for fallback, except dictionary encoding. Right now it is possible to set any encoding for fallback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I think parquet-cpp supports other types of decoding but haven't used them in the writer path, so currently they only support PLAIN in fallback as a side effect. In parquet-mr though, other types of fallback encoding schemas are used so I think we are on the correct track.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sadikovi ! left some comments.
pub fn dictionary_data_page_encoding(&self) -> Encoding { | ||
// PLAIN_DICTIONARY encoding is deprecated in writer version 1. | ||
// Dictionary values are encoded using RLE_DICTIONARY encoding. | ||
Encoding::RLE_DICTIONARY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me.
src/file/properties.rs
Outdated
if value == Encoding::PLAIN_DICTIONARY || value == Encoding::RLE_DICTIONARY { | ||
panic!("Dictionary encoding can not be used as fallback encoding"); | ||
} | ||
// TODO: Check if we need to force PLAIN encoding as dictionary fallback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I think parquet-cpp supports other types of decoding but haven't used them in the writer path, so currently they only support PLAIN in fallback as a side effect. In parquet-mr though, other types of fallback encoding schemas are used so I think we are on the correct track.
src/file/properties.rs
Outdated
|
||
/// Sets encoding for this column. | ||
/// | ||
/// This field is treated as fallback encoding, and should not be dictionary encoding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the dictionary encoding is disabled for this column, then the parameter will be used as the primary encoding, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is correct. If dictionary is enabled, this acts as fallback encoding. If dictionary is disabled, we will use this encoding as a primary. I will update the comment to make it clear.
src/file/properties.rs
Outdated
|
||
/// Sets (fallback) encoding for any column. | ||
pub fn set_encoding(mut self, value: Encoding) -> Self { | ||
self.default_column_properties.set_encoding(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check if value
is dictionary encoding here? what if people pass RLE_DICTIONARY as value
and then enable dictionary encoding for some column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! This is a bug, I will test for it. Yes, we should have some place where we check all of the configurations. Will update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add all checks when I build WriterProperties
- that should solve the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at the code again. It seems that we do not allow setting RLE_DICTIONARY or PLAIN_DICTIONARY, regardless of whether a dictionary is enabled or not, so someone wants to enable dictionary encoding they should use another method. I will add a test for this situation though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
src/file/properties.rs
Outdated
/// so some default value must be used. | ||
#[derive(Debug, Clone, PartialEq)] | ||
struct ColumnProperties { | ||
encoding: Option<Encoding>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why these are Option
s? can they be initialized with the global default settings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think so. The reason is this:
Let's assume that global default compression is UNCOMPRESSED and encoding is PLAIN. I set column A to have encoding RLE, while it has all others as globals, then I set compression for all columns to be SNAPPY, it is not very convenient to propagate this value to the per-column value, e.g. SNAPPY instead of UNCOMPRESSED.
This is how it works now:
For non-column settings, e.g. page size => initially we set them to the global defaults, when value changes, we just set it in a builder and propagate to properties. For column settings, e.g. encoding or compression => per column settings -> global (all columns) settings -> global defaults
For example, I can set global encoding as RLE, but for column A I set DELTA_BINARY_PACKED. So it
will look like the following:
Column | Per column settings | Global (all columns) settings | Global defaults |
---|---|---|---|
Column A | DELTA_BINARY_PACKED | RLE | PLAIN |
Column B | RLE | PLAIN |
The reason I have Options in ColumnProperties, so I know whether or not the value has been set and
I can fall back to the global settings. And because I want to reuse ColumnProperties for per-column
and default settings, that is why we have another column "Global defaults".
This is how I extract the value:
If per column value is set, extract that value; otherwise, check the global all columns defaults;
if that is not set, return global default.
self.column_properties.get(col).and_then(|c| c.encoding())
.or_else(|| self.default_column_properties.encoding())
.unwrap_or(DEFAULT_ENCODING)
But it does look like I can initialise global (all columns) settings to global defaults, so we will have only 2 columns in the table above. I can do something like this:
self.column_properties.get(col).and_then(|c| c.encoding())
.or_else(|| self.default_column_properties.encoding())
.unwrap()
So all properties in self.default_column_properties
will have Some(DEFAULT_XYZ).
Let me know what you think, or how you would do it differently, potentially simpler than this approach. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. This makes sense to me. Thanks for the explanation!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @sunchao! I hope this will work:). |
@sunchao Feel free to leave more comments - I am more than happy to address them, so we can make parquet-rs the best and the fastest Parquet implementation. |
@sadikovi Thanks. This looks good to me 👍 . Will merge now. |
This PR adds writer properties that allow to configure file writer and column writers.
User can create easily create properties with default values, or specify different configuration either globally or for an individual column.
I also added
build.rs
(again!). This is used to generatecreated_by
string that includes current version and the latest git commit hash.Notes:
created_by
.PLAIN_DICTIONARY
, we useRLE_DICTIONARY
andPLAIN
for dictionary encoding.PLAIN
encoding as fallback encoding, technically any encoding can be specified as fallback (parquet-cpp explicitly setsPLAIN
).