Skip to content
This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

Add EncodingWriteSupport for column writer #196

Merged
merged 6 commits into from
Dec 3, 2018

Conversation

sadikovi
Copy link
Collaborator

@sadikovi sadikovi commented Nov 28, 2018

This PR addresses the issue when we would enable dictionary encoding for boolean values regardless of the parquet format v1 or v2. This prevents file to be read by other systems that use parquet-mr.

In parquet-mr column writer use either PLAIN (v1) or RLE (v2) when writing for boolean values. In our library we would just use an alternative encoding that was set for a column, which defaults to PLAIN.

This PR changes the behaviour. Now if no encoding explicitly set for a column, we will use EncodingWriteSupport trait to check if column type supports dictionary encoding and what alternative fallback encoding we can use. The values match parquet-mr implementation for both V1 and V2. It also changes API for WriterProperties - now encoding(column) returns None, if no encoding is set (used to return default PLAIN encoding).

Closes #195

@sadikovi
Copy link
Collaborator Author

I think we should create a follow up ticket to handle default encodings depending on column type.

Copy link
Owner

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@sunchao
Copy link
Owner

sunchao commented Nov 28, 2018

@sadikovi can you update the PR with rustfmt?

@sadikovi
Copy link
Collaborator Author

I'd like to discuss the approach:

  • I added trait to handle dictionary enabled or disabled. It could be an overkill since we can just use a simple method for it, though trait does make it easy to check the value type.
  • I decided to silently fall back to an alternative encoding when the type is boolean. My other approaches included returning Result, which would become Err if boolean has dictionary encoding, or panicking. The first one requires API changes, and the second is just not nice.

Let me know if you have any other suggestions or objections to this approach. Thanks!

@sunchao
Copy link
Owner

sunchao commented Nov 29, 2018

@sadikovi

  1. the added trait looks fine to me. I think it clearly expresses the idea that we want to handle boolean type specially. Replacing it with a function sounds good to me as well but I don't have strong preference on either.
  2. I'm fine with the silent fallback. We should definitely not panic. Returning Result seems not nice too and it's also not quite compatible with how properties are implemented at the moment.

I noticed another issue though - it seems we use PLAIN_ENCODING as the default fallback encoding for all encoding types, which may not be the best choice. We may want to use different encodings based on types and parquet version (i.e., v1 or v2). This is how parquet-mr is doing.

@sadikovi
Copy link
Collaborator Author

sadikovi commented Nov 29, 2018 via email

@sadikovi
Copy link
Collaborator Author

Actually, I can do all of that here, let me know if I should include default encodings depending on a column type work in this PR.

@coveralls
Copy link

coveralls commented Nov 29, 2018

Pull Request Test Coverage Report for Build 680

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 34 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.09%) to 95.783%

Files with Coverage Reduction New Missed Lines %
file/properties.rs 16 91.04%
column/writer.rs 18 97.6%
Totals Coverage Status
Change from base Build 670: 0.09%
Covered Lines: 13196
Relevant Lines: 13777

💛 - Coveralls

@sunchao
Copy link
Owner

sunchao commented Nov 29, 2018

I think it may be better to open a new PR since these two are not so closely related.

@sadikovi
Copy link
Collaborator Author

Thanks. I looked at the changes, and it is not difficult to fix as part of this PR, and the changes are relatively small. I can push commit and then we can decide if we need to open another one. Will that be okay?

@sunchao
Copy link
Owner

sunchao commented Nov 29, 2018 via email

@sadikovi sadikovi changed the title Disable dictionary encoding for boolean values in writes Add EncodingWriteSupport for column writer Nov 30, 2018
@sadikovi
Copy link
Collaborator Author

@sunchao I updated the PR with encoding write support. Can you have a look when you have time? Thanks!

fn has_dictionary_support(_props: &WriterProperties) -> bool { true }
}

impl EncodingWriteSupport for ColumnWriterImpl<Int96Type> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: these seem unnecessary since they are the same as default value?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks.

@sadikovi
Copy link
Collaborator Author

sadikovi commented Dec 2, 2018

Could you take a quick look at tests as well? I feel like they could be implemented better.

@sunchao sunchao merged commit b7913af into sunchao:master Dec 3, 2018
@sadikovi
Copy link
Collaborator Author

sadikovi commented Dec 3, 2018

Thanks for merging!

@sadikovi sadikovi deleted the boolean-write-encoding branch December 3, 2018 07:20
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants