-
Notifications
You must be signed in to change notification settings - Fork 20
Conversation
Pull Request Test Coverage Report for Build 665
💛 - Coveralls |
Thanks @liurenjie1024 . Will take a look soon. Meanwhile, I created #186 to track the overall progress of adding Arrow support. Please feel free to create more tasks in it. I'm also planning to take up some tasks. :) |
@sunchao Thanks for opening the issue and comment with my thoughts about other tasks. Current I'm interested in the reader part and working on the reader that converts parquet to arrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @liurenjie1024 . Left some comments.
src/arrow_format/schema.rs
Outdated
|
||
fn to_list(&self) -> Result<Option<DataType>> { | ||
match &*self.schema { | ||
Type::PrimitiveType {..} => panic!("This should not happen."), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just use assertion here for the check of primitive type and field.len == 1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I think panic when it's primitive type is same.
- I don't think we should panic when the
fields.len() != 1
because this is an error format and should not cause our function to panic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. For the latter maybe we can throw "not yet supported" error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we return Err here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because if this happens, it means our code has bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To start with, returning an Err does not necessarily mean the code has bug. Err is just a way of expressing an invalid state or a bug and returning it without panicking. With Err it is easier to add a unit test to show that we return a proper error message.
By the way, there is no test for this, either with panic or Err. Also, what does it mean "It should not happen". What should not happen? Can you update the error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to go with panic, I would suggest including something like the actual schema that was compared, and some other metadata as part of the error message, so it is easier to debug later on when it happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an implementation detail and the arguments are passed by our own code, rather than the user, so I think it's sensible to panic. But I agree that we should add more detailed message for it.
src/arrow_format/schema.rs
Outdated
basic_info: _, | ||
fields | ||
} if fields.len()==1 && list_item.name()!="array" && | ||
list_item.name()!=format!("{}_tuple", self.schema.name()) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just use list_item.name().ends_with("_tuple")
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are not the same. The list item name must be equal to {schema.name()}_tuple
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Looks good. I think we may need to fix the check in another place. It will also be good if the code can be shared..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I didn't notice the check you mentioned before. I think we need another refactor so that we can share the code.
src/arrow_format/schema.rs
Outdated
|
||
item_type.map(|opt| opt.map(|dt| DataType::List(Box::new(dt)))) | ||
}, | ||
_ => Err(ParquetError::ArrowError("Unrecognized list type.".to_string())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above - can we improve this error message?
src/arrow_format/schema.rs
Outdated
/// Convert parquet schema to arrow schema, only preserving some leaf columns. | ||
pub fn parquet_to_arrow_schema_by_columns<T>( | ||
parquet_schema: SchemaDescPtr, column_indices: T) -> Result<Schema> | ||
where T: IntoIterator<Item=usize> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why this has to be IntoIterator
but not just a normal iterator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that the user can just pass a vec![1,2,3] without calling .iter(). BTW, if T implements Iterator, it also implements IntoIterator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I was thinking that we don't necessarily need to consume the column_indices
with IntoIterator
, but just need to borrow them. However, it seems much harder to replace it with Iterator
in this case. It's all good.
@sunchao Thanks for the review and I've fixed the comments. As for the format part, how do you think about a rustfmt file so that other contributors can follow? |
Yes I think it is a good idea. We discussed about this before but later give up since couldn't achieve what we want with rustfmt. Let me try it once more. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. I left a few comments. Could you also add doc for the functions even if they are trivial, it is unclear what some of them do and why they do it, and will be difficult to maintain the code later.
basic_info.has_repetition() && basic_info.repetition()==Repetition::REPEATED | ||
} | ||
|
||
fn is_self_included(&self) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need this function? How could self be part of self.leaves?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is used to test if the schema included in the leaves. We need it when the converter is converting a primitive schema. The leaves are not leaves of the schema, but columns that need to convert.
3e813d6
to
d37cf83
Compare
@sadikovi Sorry for late reply. I've fixed some comments. |
@liurenjie1024 sure, will take a look soon. Sorry for the delay. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
//! [Apache Arrow](http://arrow.apache.org/) is a cross-language development platform for in-memory | ||
//! data. | ||
//! | ||
//! This mod provides API for converting between arrow and parquet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, but let's not forget to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review (was on vacation). LGTM.
@liurenjie1024 I'm in the process of donating parquet-rs to Apache arrow, and think it might be better to do this after the merge is done (we can do a clean start of the arrow-parquet integration after that). I can merge this now too if you have follow-up work that depends on it. It shouldn't matter that much. What do you think? |
Yes, I think it would be better to merge it after merging with arrow. |
Great. We can create an umbrella JIRA after the merge, to track the parquet-arrow integration. I can help to get this PR in very soon after that. |
Cool. |
@liurenjie1024 I created a JIRA here: https://issues.apache.org/jira/browse/ARROW-4060. Could you file a PR in the arrow repo for this? Thanks. |
This is the first step of adding an arrow reader and writer for parquet-rs. This commit contains a converter which converts parquet schema to arrow schema. Copied from this pr sunchao/parquet-rs#185. Author: Renjie Liu <liurenjie2008@gmail.com> Closes #3279 from liurenjie1024/rust-arrow-schema-converter and squashes the following commits: 1bfa00f <Renjie Liu> Resolve conflict 8806b16 <Renjie Liu> Add parquet arrow converter
This is the first step of adding an arrow reader and writer for parquet-rs. This commit contains a converter which converts parquet schema to arrow schema. Copied from this pr sunchao/parquet-rs#185. Author: Renjie Liu <liurenjie2008@gmail.com> Closes apache#3279 from liurenjie1024/rust-arrow-schema-converter and squashes the following commits: 1bfa00f <Renjie Liu> Resolve conflict 8806b16 <Renjie Liu> Add parquet arrow converter
Closing this as ARROW-4060 is already merged into Arrow. |
This is the first step of adding an arrow reader and writer for parquet-rs. This commit contains a converter which converts parquet schema to arrow schema. Copied from this pr sunchao/parquet-rs#185. Author: Renjie Liu <liurenjie2008@gmail.com> Closes #3279 from liurenjie1024/rust-arrow-schema-converter and squashes the following commits: 1bfa00f <Renjie Liu> Resolve conflict 8806b16 <Renjie Liu> Add parquet arrow converter
This is the first step of adding an arrow reader and writer for parquet-rs.
This commit contains a converter which converts parquet schema to arrow schema.