Skip to content
This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

Arrow schema converter. #185

Closed
wants to merge 4 commits into from

Conversation

liurenjie1024
Copy link
Contributor

This is the first step of adding an arrow reader and writer for parquet-rs.
This commit contains a converter which converts parquet schema to arrow schema.

@coveralls
Copy link

coveralls commented Nov 5, 2018

Pull Request Test Coverage Report for Build 665

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 27 unchanged lines in 4 files lost coverage.
  • Overall coverage decreased (-0.09%) to 95.628%

Files with Coverage Reduction New Missed Lines %
encodings/encoding.rs 1 94.36%
file/properties.rs 4 90.97%
errors.rs 4 26.67%
schema/types.rs 18 96.84%
Totals Coverage Status
Change from base Build 661: -0.09%
Covered Lines: 13276
Relevant Lines: 13883

💛 - Coveralls

@sunchao
Copy link
Owner

sunchao commented Nov 6, 2018

Thanks @liurenjie1024 . Will take a look soon.

Meanwhile, I created #186 to track the overall progress of adding Arrow support. Please feel free to create more tasks in it. I'm also planning to take up some tasks. :)

@liurenjie1024
Copy link
Contributor Author

@sunchao Thanks for opening the issue and comment with my thoughts about other tasks. Current I'm interested in the reader part and working on the reader that converts parquet to arrow.

Copy link
Owner

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @liurenjie1024 . Left some comments.

src/arrow_format/mod.rs Outdated Show resolved Hide resolved
src/arrow_format/schema.rs Outdated Show resolved Hide resolved

fn to_list(&self) -> Result<Option<DataType>> {
match &*self.schema {
Type::PrimitiveType {..} => panic!("This should not happen."),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just use assertion here for the check of primitive type and field.len == 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I think panic when it's primitive type is same.
  • I don't think we should panic when the fields.len() != 1 because this is an error format and should not cause our function to panic.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. For the latter maybe we can throw "not yet supported" error.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we return Err here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because if this happens, it means our code has bug.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To start with, returning an Err does not necessarily mean the code has bug. Err is just a way of expressing an invalid state or a bug and returning it without panicking. With Err it is easier to add a unit test to show that we return a proper error message.

By the way, there is no test for this, either with panic or Err. Also, what does it mean "It should not happen". What should not happen? Can you update the error message?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to go with panic, I would suggest including something like the actual schema that was compared, and some other metadata as part of the error message, so it is easier to debug later on when it happens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an implementation detail and the arguments are passed by our own code, rather than the user, so I think it's sensible to panic. But I agree that we should add more detailed message for it.

src/arrow_format/schema.rs Outdated Show resolved Hide resolved
basic_info: _,
fields
} if fields.len()==1 && list_item.name()!="array" &&
list_item.name()!=format!("{}_tuple", self.schema.name()) => {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just use list_item.name().ends_with("_tuple").

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are not the same. The list item name must be equal to {schema.name()}_tuple

Copy link
Owner

@sunchao sunchao Nov 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Looks good. I think we may need to fix the check in another place. It will also be good if the code can be shared..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I didn't notice the check you mentioned before. I think we need another refactor so that we can share the code.

src/arrow_format/schema.rs Outdated Show resolved Hide resolved

item_type.map(|opt| opt.map(|dt| DataType::List(Box::new(dt))))
},
_ => Err(ParquetError::ArrowError("Unrecognized list type.".to_string()))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above - can we improve this error message?

src/arrow_format/schema.rs Outdated Show resolved Hide resolved
/// Convert parquet schema to arrow schema, only preserving some leaf columns.
pub fn parquet_to_arrow_schema_by_columns<T>(
parquet_schema: SchemaDescPtr, column_indices: T) -> Result<Schema>
where T: IntoIterator<Item=usize> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why this has to be IntoIterator but not just a normal iterator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that the user can just pass a vec![1,2,3] without calling .iter(). BTW, if T implements Iterator, it also implements IntoIterator.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I was thinking that we don't necessarily need to consume the column_indices with IntoIterator, but just need to borrow them. However, it seems much harder to replace it with Iterator in this case. It's all good.

src/arrow_format/schema.rs Outdated Show resolved Hide resolved
@liurenjie1024
Copy link
Contributor Author

@sunchao Thanks for the review and I've fixed the comments. As for the format part, how do you think about a rustfmt file so that other contributors can follow?

@sunchao
Copy link
Owner

sunchao commented Nov 8, 2018

As for the format part, how do you think about a rustfmt file so that other contributors can follow?

Yes I think it is a good idea. We discussed about this before but later give up since couldn't achieve what we want with rustfmt. Let me try it once more.

Copy link
Collaborator

@sadikovi sadikovi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I left a few comments. Could you also add doc for the functions even if they are trivial, it is unclear what some of them do and why they do it, and will be difficult to maintain the code later.

src/errors.rs Show resolved Hide resolved
basic_info.has_repetition() && basic_info.repetition()==Repetition::REPEATED
}

fn is_self_included(&self) -> bool {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this function? How could self be part of self.leaves?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is used to test if the schema included in the leaves. We need it when the converter is converting a primitive schema. The leaves are not leaves of the schema, but columns that need to convert.

src/arrow/schema.rs Outdated Show resolved Hide resolved
src/arrow/mod.rs Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
src/arrow/schema.rs Outdated Show resolved Hide resolved
src/arrow/schema.rs Outdated Show resolved Hide resolved
src/arrow/schema.rs Outdated Show resolved Hide resolved
src/arrow/schema.rs Outdated Show resolved Hide resolved
src/schema/types.rs Show resolved Hide resolved
@liurenjie1024
Copy link
Contributor Author

@sadikovi Sorry for late reply. I've fixed some comments.

@liurenjie1024
Copy link
Contributor Author

@sunchao @sadikovi Could you help to review this?

@sunchao
Copy link
Owner

sunchao commented Nov 22, 2018

@liurenjie1024 sure, will take a look soon. Sorry for the delay.

Copy link
Collaborator

@sadikovi sadikovi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

//! [Apache Arrow](http://arrow.apache.org/) is a cross-language development platform for in-memory
//! data.
//!
//! This mod provides API for converting between arrow and parquet.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, but let's not forget to do that.

Copy link
Owner

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review (was on vacation). LGTM.

@liurenjie1024
Copy link
Contributor Author

@sunchao @sadikovi We have two LGTM, help to merge this?

@sunchao
Copy link
Owner

sunchao commented Nov 29, 2018

@liurenjie1024 I'm in the process of donating parquet-rs to Apache arrow, and think it might be better to do this after the merge is done (we can do a clean start of the arrow-parquet integration after that). I can merge this now too if you have follow-up work that depends on it. It shouldn't matter that much. What do you think?

@liurenjie1024
Copy link
Contributor Author

Yes, I think it would be better to merge it after merging with arrow.

@sunchao
Copy link
Owner

sunchao commented Nov 29, 2018

Great. We can create an umbrella JIRA after the merge, to track the parquet-arrow integration. I can help to get this PR in very soon after that.

@liurenjie1024
Copy link
Contributor Author

Cool.

@sunchao
Copy link
Owner

sunchao commented Dec 18, 2018

@liurenjie1024 I created a JIRA here: https://issues.apache.org/jira/browse/ARROW-4060. Could you file a PR in the arrow repo for this? Thanks.

andygrove pushed a commit to apache/arrow that referenced this pull request Jan 8, 2019
This is the first step of adding an arrow reader and writer for parquet-rs.
This commit contains a converter which converts parquet schema to arrow schema.
Copied from this pr sunchao/parquet-rs#185.

Author: Renjie Liu <liurenjie2008@gmail.com>

Closes #3279 from liurenjie1024/rust-arrow-schema-converter and squashes the following commits:

1bfa00f <Renjie Liu> Resolve conflict
8806b16 <Renjie Liu> Add parquet arrow converter
emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Jan 10, 2019
This is the first step of adding an arrow reader and writer for parquet-rs.
This commit contains a converter which converts parquet schema to arrow schema.
Copied from this pr sunchao/parquet-rs#185.

Author: Renjie Liu <liurenjie2008@gmail.com>

Closes apache#3279 from liurenjie1024/rust-arrow-schema-converter and squashes the following commits:

1bfa00f <Renjie Liu> Resolve conflict
8806b16 <Renjie Liu> Add parquet arrow converter
@sunchao
Copy link
Owner

sunchao commented Jan 24, 2019

Closing this as ARROW-4060 is already merged into Arrow.

@sunchao sunchao closed this Jan 24, 2019
alecmocatta pushed a commit to constellation-rs/amadeus that referenced this pull request Sep 7, 2019
This is the first step of adding an arrow reader and writer for parquet-rs.
This commit contains a converter which converts parquet schema to arrow schema.
Copied from this pr sunchao/parquet-rs#185.

Author: Renjie Liu <liurenjie2008@gmail.com>

Closes #3279 from liurenjie1024/rust-arrow-schema-converter and squashes the following commits:

1bfa00f <Renjie Liu> Resolve conflict
8806b16 <Renjie Liu> Add parquet arrow converter
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants