-
Notifications
You must be signed in to change notification settings - Fork 20
Add EncodingWriteSupport for column writer #196
Conversation
I think we should create a follow up ticket to handle default encodings depending on column type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
@sadikovi can you update the PR with rustfmt? |
I'd like to discuss the approach:
Let me know if you have any other suggestions or objections to this approach. Thanks! |
I noticed another issue though - it seems we use |
Yes, I was thinking about it. I can add it as follow up work.
…On Thu, 29 Nov 2018 at 7:41 AM, Chao Sun ***@***.***> wrote:
@sadikovi <https://github.com/sadikovi>
1. the added trait looks fine to me. I think it clearly expresses the
idea that we want to handle boolean type specially. Replacing it with a
function sounds good to me as well but I don't have strong preference on
either.
2. I'm fine with the silent fallback. We should definitely not panic.
Returning Result seems not nice too and it's also not quite compatible
with how properties are implemented at the moment.
I noticed another issue though - it seems we use PLAIN_ENCODING as the
default fallback encoding for all encoding types, which may not be the best
choice. We may want to use different encodings based on types and parquet
version (i.e., v1 or v2). This is how parquet-mr is doing.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#196 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AHbY3kkJGRGE2UDmlphs7inQ-zymZdKgks5uz4GmgaJpZM4Y4drt>
.
|
Actually, I can do all of that here, let me know if I should include default encodings depending on a column type work in this PR. |
Pull Request Test Coverage Report for Build 680
💛 - Coveralls |
I think it may be better to open a new PR since these two are not so closely related. |
Thanks. I looked at the changes, and it is not difficult to fix as part of this PR, and the changes are relatively small. I can push commit and then we can decide if we need to open another one. Will that be okay? |
Yes. Sounds good.
…On Thu, Nov 29, 2018 at 12:24 AM Ivan ***@***.***> wrote:
Thanks. I looked at the changes, and it is not difficult to fix as part of
this PR, and the changes are relatively small. I can push commit and then
we can decide if we need to open another one. Will that be okay?
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#196 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAe7NxCjBnkD1x-I8dC0m2jikxLXJcVHks5uz5mhgaJpZM4Y4drt>
.
|
@sunchao I updated the PR with encoding write support. Can you have a look when you have time? Thanks! |
src/column/writer.rs
Outdated
fn has_dictionary_support(_props: &WriterProperties) -> bool { true } | ||
} | ||
|
||
impl EncodingWriteSupport for ColumnWriterImpl<Int96Type> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: these seem unnecessary since they are the same as default value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks.
Could you take a quick look at tests as well? I feel like they could be implemented better. |
Thanks for merging! |
This PR addresses the issue when we would enable dictionary encoding for boolean values regardless of the parquet format v1 or v2. This prevents file to be read by other systems that use parquet-mr.
In parquet-mr column writer use either PLAIN (v1) or RLE (v2) when writing for boolean values. In our library we would just use an alternative encoding that was set for a column, which defaults to PLAIN.
This PR changes the behaviour. Now if no encoding explicitly set for a column, we will use EncodingWriteSupport trait to check if column type supports dictionary encoding and what alternative fallback encoding we can use. The values match parquet-mr implementation for both V1 and V2. It also changes API for WriterProperties - now
encoding(column)
returns None, if no encoding is set (used to return default PLAIN encoding).Closes #195