Arrow schema converter. #185

liurenjie1024 · 2018-11-05T02:44:17Z

This is the first step of adding an arrow reader and writer for parquet-rs.
This commit contains a converter which converts parquet schema to arrow schema.

coveralls · 2018-11-05T03:19:45Z

Pull Request Test Coverage Report for Build 665

0 of 0 changed or added relevant lines in 0 files are covered.
27 unchanged lines in 4 files lost coverage.
Overall coverage decreased (-0.09%) to 95.628%

Files with Coverage Reduction	New Missed Lines	%
encodings/encoding.rs	1	94.36%
file/properties.rs	4	90.97%
errors.rs	4	26.67%
schema/types.rs	18	96.84%

Totals
Change from base Build 661:	-0.09%
Covered Lines:	13276
Relevant Lines:	13883

💛 - Coveralls

sunchao · 2018-11-06T07:07:28Z

Thanks @liurenjie1024 . Will take a look soon.

Meanwhile, I created #186 to track the overall progress of adding Arrow support. Please feel free to create more tasks in it. I'm also planning to take up some tasks. :)

liurenjie1024 · 2018-11-06T08:24:40Z

@sunchao Thanks for opening the issue; I've commented on it with my thoughts about other tasks. Currently I'm interested in the reader part and am working on the reader that converts parquet to arrow.

sunchao

Thanks @liurenjie1024 . Left some comments.

src/arrow_format/mod.rs

src/arrow_format/schema.rs

sunchao · 2018-11-07T07:23:21Z

src/arrow_format/schema.rs

+
+  fn to_list(&self) -> Result<Option<DataType>> {
+    match &*self.schema {
+      Type::PrimitiveType {..} => panic!("This should not happen."),


I think you can just use an assertion here to check for the primitive type and fields.len() == 1.

I think panicking when it's a primitive type amounts to the same thing.

I don't think we should panic when fields.len() != 1, because that means the input format is invalid, and invalid input should not cause our function to panic.

You're right. For the latter maybe we can throw "not yet supported" error.

Why can't we return Err here?

Because if this happens, it means our code has a bug.

To start with, returning an Err does not necessarily mean the code has a bug. Err is just a way of expressing an invalid state or a bug and returning it without panicking. With Err it is easier to add a unit test to show that we return a proper error message.

By the way, there is no test for this, either with panic or Err. Also, what does "This should not happen" mean? What should not happen? Can you update the error message?

If you want to go with panic, I would suggest including something like the actual schema that was compared, and some other metadata as part of the error message, so it is easier to debug later on when it happens.

This is an implementation detail and the arguments are passed by our own code, rather than the user, so I think it's sensible to panic. But I agree that we should add a more detailed message for it.
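To make the trade-off in this thread concrete, here is a minimal sketch, not the code from this PR. `Node`, `SchemaError`, and `list_item` are hypothetical stand-ins invented for illustration. It panics with a descriptive message on the internal invariant (a primitive node reaching list conversion) and returns an Err for a group whose field count is not 1, which can come from the input file.

```rust
/// Hypothetical stand-in for a parquet schema node (not the parquet-rs `Type`).
#[derive(Debug)]
enum Node {
    Primitive { name: String },
    Group { name: String, fields: Vec<Node> },
}

/// Hypothetical error type used only in this sketch.
#[derive(Debug, PartialEq)]
enum SchemaError {
    Unsupported(String),
}

/// Returns the single child of a list group.
fn list_item(node: &Node) -> Result<&Node, SchemaError> {
    match node {
        // Internal invariant: our own code only calls this on groups, so a
        // primitive here is a bug. The message names the offending node.
        Node::Primitive { name } => {
            panic!("list_item called on primitive node '{}'; expected a group", name)
        }
        // Invalid or unsupported input: report it to the caller, do not panic.
        Node::Group { name, fields } if fields.len() != 1 => {
            Err(SchemaError::Unsupported(format!(
                "list group '{}' has {} fields, expected exactly 1",
                name,
                fields.len()
            )))
        }
        Node::Group { fields, .. } => Ok(&fields[0]),
    }
}
```

With the Err variant, a unit test can assert on the exact message, which is the point raised above about testability.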

src/arrow_format/schema.rs

sunchao · 2018-11-07T07:43:53Z

src/arrow_format/schema.rs

+            basic_info: _,
+            fields
+          } if fields.len()==1 && list_item.name()!="array" &&
+            list_item.name()!=format!("{}_tuple", self.schema.name()) => {


You can just use list_item.name().ends_with("_tuple").

They are not the same. The list item name must be equal to {schema.name()}_tuple

I see. Looks good. I think we may need to fix the check in another place. It would also be good if the code could be shared.

Yes, I didn't notice the check you mentioned before. I think we need another refactor so that we can share the code.
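For readers following this thread, the difference between the two checks can be shown with a small sketch. The function names are hypothetical and only illustrate the two predicates discussed above; they are not helpers from the PR.

```rust
/// Exact check: the item name must equal `<list name>_tuple`.
fn is_tuple_name_exact(list_name: &str, item_name: &str) -> bool {
    item_name == format!("{}_tuple", list_name)
}

/// Looser check suggested in review: any name ending in `_tuple`.
fn is_tuple_name_suffix(item_name: &str) -> bool {
    item_name.ends_with("_tuple")
}
```

For a list named `my_list`, an item named `other_tuple` passes the suffix check but fails the exact one, which is why the two are not interchangeable.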

src/arrow_format/schema.rs

sunchao · 2018-11-07T07:49:42Z

src/arrow_format/schema.rs

+
+        item_type.map(|opt| opt.map(|dt| DataType::List(Box::new(dt))))
+      },
+      _ => Err(ParquetError::ArrowError("Unrecognized list type.".to_string()))


Same as above - can we improve this error message?

src/arrow_format/schema.rs

sunchao · 2018-11-07T07:54:11Z

src/arrow_format/schema.rs

+/// Convert parquet schema to arrow schema, only preserving some leaf columns.
+pub fn parquet_to_arrow_schema_by_columns<T>(
+  parquet_schema: SchemaDescPtr, column_indices: T) -> Result<Schema>
+  where T: IntoIterator<Item=usize> {


Any reason why this has to be IntoIterator but not just a normal iterator?

So that the user can just pass a vec![1,2,3] without calling .iter(). BTW, if T implements Iterator, it also implements IntoIterator.

Thanks. I was thinking that we don't necessarily need to consume the column_indices with IntoIterator, but just need to borrow them. However, it seems much harder to replace it with Iterator in this case. It's all good.
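A minimal sketch of the point about `IntoIterator`, using a hypothetical `select_columns` helper rather than the real `parquet_to_arrow_schema_by_columns`: a single bound accepts an owned `Vec<usize>`, a range, or a borrowed slice adapted with `.iter().copied()`.

```rust
/// Hypothetical helper: collects the selected column indices,
/// sorted and with duplicates removed.
fn select_columns<T>(column_indices: T) -> Vec<usize>
where
    T: IntoIterator<Item = usize>,
{
    let mut selected: Vec<usize> = column_indices.into_iter().collect();
    selected.sort_unstable();
    selected.dedup();
    selected
}
```

A caller that wants to keep its indices can pass `indices.iter().copied()`, so the `IntoIterator` bound does not force giving up ownership of the original collection.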

src/arrow_format/schema.rs

liurenjie1024 · 2018-11-08T03:50:32Z

@sunchao Thanks for the review; I've addressed the comments. As for the formatting part, what do you think about adding a rustfmt config file so that other contributors can follow it?

sunchao · 2018-11-08T04:24:37Z

As for the formatting part, what do you think about adding a rustfmt config file so that other contributors can follow it?

Yes, I think it is a good idea. We discussed this before but later gave up since we couldn't achieve what we wanted with rustfmt. Let me try it once more.

sadikovi

Thanks for the PR. I left a few comments. Could you also add docs for the functions, even if they are trivial? It is unclear what some of them do and why, which will make the code difficult to maintain later.

src/errors.rs

sadikovi · 2018-11-08T12:24:40Z

src/arrow/schema.rs

+    basic_info.has_repetition() && basic_info.repetition()==Repetition::REPEATED
+  }
+
+  fn is_self_included(&self) -> bool {


Why do you need this function? How could self be part of self.leaves?

This function tests whether the schema itself is included in the leaves. We need it when the converter is converting a primitive schema. The leaves here are not the leaves of the schema, but the columns that need to be converted.

src/arrow/schema.rs

src/arrow/mod.rs

Cargo.toml

src/arrow/schema.rs

src/schema/types.rs

liurenjie1024 · 2018-11-16T06:35:13Z

@sadikovi Sorry for the late reply. I've fixed some comments.

liurenjie1024 · 2018-11-22T02:41:46Z

@sunchao @sadikovi Could you help to review this?

sunchao · 2018-11-22T05:56:13Z

@liurenjie1024 sure, will take a look soon. Sorry for the delay.

sadikovi

Looks good.

sadikovi · 2018-11-22T07:33:23Z

src/arrow/mod.rs

+//! [Apache Arrow](http://arrow.apache.org/) is a cross-language development platform for in-memory
+//! data.
+//!
+//! This mod provides API for converting between arrow and parquet.


Okay, but let's not forget to do that.

sunchao

Sorry for the late review (was on vacation). LGTM.

liurenjie1024 · 2018-11-29T02:01:26Z

@sunchao @sadikovi We have two LGTMs. Could you help merge this?

sunchao · 2018-11-29T03:27:42Z

@liurenjie1024 I'm in the process of donating parquet-rs to Apache arrow, and think it might be better to do this after the merge is done (we can do a clean start of the arrow-parquet integration after that). I can merge this now too if you have follow-up work that depends on it. It shouldn't matter that much. What do you think?

liurenjie1024 · 2018-11-29T03:45:59Z

Yes, I think it would be better to merge it after merging with arrow.

sunchao · 2018-11-29T03:58:09Z

Great. We can create an umbrella JIRA after the merge, to track the parquet-arrow integration. I can help to get this PR in very soon after that.

liurenjie1024 · 2018-11-29T05:03:16Z

Cool.

sunchao · 2018-12-18T07:19:25Z

@liurenjie1024 I created a JIRA here: https://issues.apache.org/jira/browse/ARROW-4060. Could you file a PR in the arrow repo for this? Thanks.

This is the first step of adding an arrow reader and writer for parquet-rs. This commit contains a converter which converts parquet schema to arrow schema. Copied from this pr sunchao/parquet-rs#185. Author: Renjie Liu <liurenjie2008@gmail.com> Closes #3279 from liurenjie1024/rust-arrow-schema-converter and squashes the following commits: 1bfa00f <Renjie Liu> Resolve conflict 8806b16 <Renjie Liu> Add parquet arrow converter

sunchao · 2019-01-24T07:18:21Z

Closing this as ARROW-4060 is already merged into Arrow.

liurenjie1024 mentioned this pull request Nov 6, 2018

Add support for reading columns as Apache Arrow arrays #79

Open

sunchao mentioned this pull request Nov 6, 2018

Add Arrow Support #186

Open

6 tasks

sunchao reviewed Nov 7, 2018

sadikovi reviewed Nov 8, 2018

liurenjie1024 added 3 commits November 16, 2018 14:31

Arrow schema converter.

ecbbca8

Fix comments

c5d0bb8

fix comments

d37cf83

liurenjie1024 force-pushed the arrow-schema2 branch from 3e813d6 to d37cf83 Compare November 16, 2018 06:34

Formatting code.

7790fb4

sadikovi reviewed Nov 22, 2018

sunchao approved these changes Nov 26, 2018

liurenjie1024 mentioned this pull request Dec 28, 2018

ARROW-4060: [Rust] Add parquet arrow converter. apache/arrow#3279

Closed

sunchao closed this Jan 24, 2019
