Panic when reading parquet files #178
Comments
Interesting. Let me check the file. |
Well, this is a funny problem. Our reader fails when it checks that the number of column fields in the schema matches the number of columns in the row group. They should obviously match, except they don't. The reason is that the Thrift schema element sets the number of children for the primitive type, see below:
This results in the following schema; notice that there are no primitive types:
The reason this is happening is
But in this file, it is set to 0! Again, it is a very simple fix:
diff --git a/src/schema/types.rs b/src/schema/types.rs
index 8bc6d64..e3f7fa3 100644
--- a/src/schema/types.rs
+++ b/src/schema/types.rs
@@ -828,7 +828,7 @@ fn from_thrift_helper(
   let logical_type = LogicalType::from(elements[index].converted_type);
   let field_id = elements[index].field_id;
   match elements[index].num_children {
-    None => {
+    None | Some(0) => {
       // primitive type
       if elements[index].repetition_type.is_none() {
         return Err(general_err!(
It looks like parquet-cpp 1.3.2 that was used to write the file actually violates the Thrift definition!
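To show the change in isolation, here is a minimal, self-contained sketch. `SchemaElement` and `is_primitive` are simplified stand-ins invented for this example, not the crate's actual types; only the `None | Some(0)` match arm mirrors the patch above.

```rust
/// Simplified stand-in for the Thrift schema element
/// (hypothetical, for illustration only).
struct SchemaElement {
    name: String,
    num_children: Option<i32>,
}

/// Returns true when the element should be treated as a primitive (leaf) type.
/// Some writers emit `Some(0)` instead of leaving `num_children` unset,
/// so both cases are treated as "no children".
fn is_primitive(element: &SchemaElement) -> bool {
    match element.num_children {
        None | Some(0) => true,
        Some(_) => false,
    }
}

fn main() {
    let elements = vec![
        SchemaElement { name: "schema".to_string(), num_children: Some(2) },
        SchemaElement { name: "a".to_string(), num_children: None },
        SchemaElement { name: "b".to_string(), num_children: Some(0) },
    ];
    for e in &elements {
        println!("{}: primitive = {}", e.name, is_primitive(e));
    }
}
```

With the stricter `None`-only arm, the element `b` would be parsed as a group with zero fields, which is how a schema can end up with no primitive types at all.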
By the way, we would not be able to read the file with
ping @sunchao |
Thanks for reporting, @gnieto! Let me know if you would like to fix these problems; otherwise, I will open a PR. |
Hmm, this is an interesting finding. Thanks for identifying the issue so quickly, @sadikovi! Yes, we should fix it as well as support the extra types. We can also use the files in antirez-redis for testing purposes. |
Thanks! Yes, I will open PRs today or tomorrow. Do you mean we should add those files to the repository? I was wondering whether we could maintain a separate repository with all of the test files we plan to use, and run parquet-schema and parquet-read on them to make sure we did not break anything. What do you think? |
No, I meant to test them manually. Yes, we can have a separate repo just for the test files, as long as it is convenient to pull from. |
No, I do not want to move those files into a separate repo - that was just an idea, maybe do it in the future. Yes, I will check other files, see if we can read those. |
It looks like there is more than one problem with the file. The first is the one with num_children; the second is that the root message type has a repetition type. From the Thrift definition:
But it does have one in this file. @sunchao, I am happy to patch that as well. I am not sure why parquet-cpp is different. I will open a PR in a couple of days. |
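One possible lenient approach, sketched under assumptions: ignore a repetition type on the root element instead of rejecting the file, while still requiring one on every other element. The `Repetition` enum and `effective_repetition` helper below are hypothetical names made up for this example; this is not necessarily how the eventual patch handles it.

```rust
/// Hypothetical mirror of the Thrift repetition type, for illustration only.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Decides which repetition to keep for the schema element at `index`,
/// where index 0 is assumed to be the root message type.
fn effective_repetition(
    index: usize,
    declared: Option<Repetition>,
) -> Result<Option<Repetition>, String> {
    match (index, declared) {
        // Root: it should carry no repetition, so drop whatever a writer set.
        (0, _) => Ok(None),
        // Any other element keeps its declared repetition.
        (_, Some(r)) => Ok(Some(r)),
        // Non-root elements without a repetition are still an error.
        (_, None) => Err(format!(
            "repetition must be defined for element at index {}",
            index
        )),
    }
}

fn main() {
    println!("{:?}", effective_repetition(0, Some(Repetition::Required)));
    println!("{:?}", effective_repetition(1, Some(Repetition::Optional)));
    println!("{:?}", effective_repetition(2, None));
}
```

The trade-off is that a reader this tolerant accepts files that do not strictly follow the definition, which seems acceptable here since other implementations already write them.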
It looks like our schema Thrift deserialisation code is not as robust as parquet-cpp's. I will work on that. |
This is interesting. It seems the above definition has existed since parquet-format 1.0.0 (https://github.com/apache/parquet-format/blob/parquet-format-1.0.0/src/thrift/parquet.thrift), which was 5 years ago. I am not quite sure why parquet-cpp is different and whether parquet-mr also does the same thing. Thanks for working on this, @sadikovi! |
It is all good. I sent an email to the dev list with these questions; I will
try patching it in the meantime.
|
I tried reading files from the S3 bucket. One file can't be read due to the issue with Int96: it looks like our code thinks the value is invalid, but it could be our conversion. I will have a closer look. This relates to #148 in the sense that there are issues with Int96. |
Well, we can close it, it will work with the example file. But there is
another file timestamps.parquet which has Int96 values that we can’t
convert to dates to print.
I suggest we close it, and open another issue if needed.
…On Wed, 7 Nov 2018 at 7:54 PM, Chao Sun ***@***.***> wrote:
@sadikovi <https://github.com/sadikovi> , @gnieto
<https://github.com/gnieto> : let me know if this can be closed now. :)
|
It seems to be solved in the latest version. |
Is that related to the dates before 1970? We can open another issue for that. |
Yes, it is. It is the same problem as Int96(0, 0, 0).
|
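For reference on the Int96 discussion, here is a minimal sketch of a conversion that also handles dates before 1970. It assumes the common Int96 timestamp layout (three little-endian 32-bit words: the first two hold nanoseconds within the day, the third holds the Julian day number). The function name `int96_to_unix` and the tuple return type are choices made for this example, not the crate's API.

```rust
/// Julian day number of the Unix epoch (1970-01-01).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const SECONDS_PER_DAY: i64 = 86_400;
const NANOS_PER_SECOND: i64 = 1_000_000_000;

/// Converts an Int96 timestamp to (seconds, nanoseconds) relative to the
/// Unix epoch. Assumed layout: words[0] and words[1] are the low and high
/// halves of the nanoseconds within the day, words[2] is the Julian day.
fn int96_to_unix(words: [u32; 3]) -> (i64, u32) {
    let nanos_of_day = ((words[1] as u64) << 32 | words[0] as u64) as i64;
    let julian_day = words[2] as i64;
    // Signed arithmetic, so days before the epoch give negative seconds
    // instead of underflowing.
    let seconds = (julian_day - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY
        + nanos_of_day / NANOS_PER_SECOND;
    let nanos = (nanos_of_day % NANOS_PER_SECOND) as u32;
    (seconds, nanos)
}

fn main() {
    // The Unix epoch itself.
    println!("{:?}", int96_to_unix([0, 0, 2_440_588]));
    // Int96(0, 0, 0): Julian day 0, far before 1970.
    println!("{:?}", int96_to_unix([0, 0, 0]));
}
```

Whether a value like Int96(0, 0, 0) should then be printed as a date or reported as out of range is a separate decision; the sketch only shows that the arithmetic itself does not need to fail.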
I downloaded some sample files from https://github.com/gitential/datasets/tree/master/antirez-redis and I'm not able to load the schema or read the file.
Example file: https://s3.amazonaws.com/gitential-datasets/antirez-redis/tags.parquet
Branch: master
Command: RUST_BACKTRACE=1 cargo run --bin parquet-schema -- tags.parquet
Output: