Skip to content
This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

Panic when reading parquet files #178

Closed
gnieto opened this issue Oct 30, 2018 · 17 comments
Closed

Panic when reading parquet files #178

gnieto opened this issue Oct 30, 2018 · 17 comments
Assignees
Labels

Comments

@gnieto
Copy link
Contributor

gnieto commented Oct 30, 2018

I downloaded some sample files from https://github.com/gitential/datasets/tree/master/antirez-redis and I'm not able to load the schema or read the file.

Example file: https://s3.amazonaws.com/gitential-datasets/antirez-redis/tags.parquet
Branch: master
Command: RUST_BACKTRACE=1 cargo run --bin parquet-schema -- tags.parquet
Output:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `6`', src/file/metadata.rs:234:5
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::print
             at libstd/sys_common/backtrace.rs:71
             at libstd/sys_common/backtrace.rs:59
   2: std::panicking::default_hook::{{closure}}
             at libstd/panicking.rs:211
   3: std::panicking::default_hook
             at libstd/panicking.rs:227
   4: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:476
   5: std::panicking::continue_panic_fmt
             at libstd/panicking.rs:390
   6: std::panicking::begin_panic_fmt
             at libstd/panicking.rs:345
   7: parquet::file::metadata::RowGroupMetaData::from_thrift
             at src/file/metadata.rs:234
   8: <parquet::file::reader::SerializedFileReader<R>>::parse_metadata
             at src/file/reader.rs:197
   9: <parquet::file::reader::SerializedFileReader<R>>::new
             at src/file/reader.rs:150
  10: parquet_schema::main
             at src/bin/parquet-schema.rs:85
  11: std::rt::lang_start::{{closure}}
             at libstd/rt.rs:74
  12: std::panicking::try::do_call
             at libstd/rt.rs:59
             at libstd/panicking.rs:310
  13: __rust_maybe_catch_panic
             at libpanic_unwind/lib.rs:102
  14: std::rt::lang_start_internal
             at libstd/panicking.rs:289
             at libstd/panic.rs:392
             at libstd/rt.rs:58
  15: std::rt::lang_start
             at libstd/rt.rs:74
  16: main
  17: __libc_start_main
  18: _start
@sadikovi
Copy link
Collaborator

Interesting. Let me check the file.

@sadikovi
Copy link
Collaborator

Well, this is a funny problem. Our reader fails to compare number of column fields with the same value in row group - they should match obviously, except they don't. The reason is Thrift schema element sets number of children for the primitive type, see below:

SchemaElement { type_: None, type_length: None, repetition_type: Some(REQUIRED), name: "schema", num_children: Some(6), converted_type: None, scale: None, precision: None, field_id: None, logical_type: None }, 
SchemaElement { type_: Some(BYTE_ARRAY), type_length: None, repetition_type: Some(OPTIONAL), name: "id", num_children: Some(0), converted_type: Some(UTF8), scale: None, precision: None, field_id: None, logical_type: None }, 
SchemaElement { type_: Some(BYTE_ARRAY), type_length: None, repetition_type: Some(OPTIONAL), name: "name", num_children: Some(0), converted_type: Some(UTF8), scale: None, precision: None, field_id: None, logical_type: None }, 
SchemaElement { type_: Some(BYTE_ARRAY), type_length: None, repetition_type: Some(OPTIONAL), name: "message", num_children: Some(0), converted_type: Some(UTF8), scale: None, precision: None, field_id: None, logical_type: None }, 
SchemaElement { type_: Some(INT32), type_length: None, repetition_type: Some(OPTIONAL), name: "type", num_children: Some(0), converted_type: Some(UINT_8), scale: None, precision: None, field_id: None, logical_type: None }, 
SchemaElement { type_: Some(INT64), type_length: None, repetition_type: Some(OPTIONAL), name: "author_time", num_children: Some(0), converted_type: Some(TIMESTAMP_MILLIS), scale: None, precision: None, field_id: None, logical_type: None }, 
SchemaElement { type_: Some(INT64), type_length: None, repetition_type: Some(OPTIONAL), name: "__index_level_0__", num_children: Some(0), converted_type: None, scale: None, precision: None, field_id: None, logical_type: None }

Which results in the following schema, notice that there are no primitive types:

GroupType { 
  basic_info: BasicTypeInfo { name: "schema", repetition: Some(REQUIRED), logical_type: NONE, id: None }, 
  fields: [
    GroupType { 
      basic_info: BasicTypeInfo { name: "id", repetition: Some(OPTIONAL), logical_type: UTF8, id: None }, 
      fields: [] 
    }, 
    GroupType { 
      basic_info: BasicTypeInfo { name: "name", repetition: Some(OPTIONAL), logical_type: UTF8, id: None }, 
      fields: [] 
    }, 
    GroupType { 
      basic_info: BasicTypeInfo { name: "message", repetition: Some(OPTIONAL), logical_type: UTF8, id: None }, 
      fields: [] 
    }, 
    GroupType { 
      basic_info: BasicTypeInfo { name: "type", repetition: Some(OPTIONAL), logical_type: UINT_8, id: None }, 
      fields: []
    }, 
    GroupType { 
      basic_info: BasicTypeInfo { name: "author_time", repetition: Some(OPTIONAL), logical_type: TIMESTAMP_MILLIS, id: None }, 
      fields: [] 
    }, 
    GroupType { 
      basic_info: BasicTypeInfo { name: "__index_level_0__", repetition: Some(OPTIONAL), logical_type: NONE, id: None }, 
      fields: [] 
    }
  ] 
}

The reason this is happening is num_children: Some(0) part in SchemaElement, our code confirms to the Thrift definition and thinks that this is group type. Strictly speaking, our code does the right thing, see the definition of the field below:

/** Nested fields.  Since thrift does not support nested fields,
   * the nesting is flattened to a single list by a depth-first traversal.
   * The children count is used to construct the nested relationship.
   * This field is not set when the element is a primitive type
   */
  5: optional i32 num_children;

But in this file, it is set to 0! Again, it is a very simple fix:

diff --git a/src/schema/types.rs b/src/schema/types.rs
index 8bc6d64..e3f7fa3 100644
--- a/src/schema/types.rs
+++ b/src/schema/types.rs
@@ -828,7 +828,7 @@ fn from_thrift_helper(
   let logical_type = LogicalType::from(elements[index].converted_type);
   let field_id = elements[index].field_id;
   match elements[index].num_children {
-    None => {
+    None | Some(0) => {
       // primitive type
       if elements[index].repetition_type.is_none() {
         return Err(general_err!(

It looks like parquet-cpp 1.3.2 that was used to write the file actually violates the Thrift definition!
Here is the actual metadata of the file:

version: 1
num of rows: 172
created by: parquet-cpp version 1.3.2-SNAPSHOT
REQUIRED group schema {
  OPTIONAL BYTE_ARRAY id (UTF8);
  OPTIONAL BYTE_ARRAY name (UTF8);
  OPTIONAL BYTE_ARRAY message (UTF8);
  OPTIONAL INT32 type (UINT_8);
  OPTIONAL INT64 author_time (TIMESTAMP_MILLIS);
  OPTIONAL INT64 __index_level_0__;
}

By the way, we would not be able to read the file with parquet-read because we do not have display of INT32(UINT_8) fields yet, so we would have to add that as well.

  • Fix schema parsing for parquet-cpp 1.3.2 and others that would write 0 there.
  • Add display support in record reader for UINT8 and TIMESTAMP_MILLIS

ping @sunchao

@sadikovi
Copy link
Collaborator

Thanks for reporting @gnieto! Let me know if you would like to fix this problem(s), otherwise, I will open a PR.

@sunchao
Copy link
Owner

sunchao commented Oct 31, 2018

Hmm this is interesting finding. Thanks for identifying the issue so quickly @sadikovi ! Yes we should fix it as well as support the extra types. We can also use the files in antirez-redis for testing purpose.

@sadikovi
Copy link
Collaborator

Thanks! Yes, I will open PRs today or tomorrow. Do you mean we should add those files to the repository? I was thinking if we could maintain a separate repository with all of the test files we plan to use and run parquet-schema and parquet-read on them to make sure we did not break anything. What do you think?

@sunchao
Copy link
Owner

sunchao commented Oct 31, 2018

No I meant to test them manually. Yes we can have a separate repo just for the test files, as long as it is convenient to pull from parquet-rs and test. Do you want to move all the files under data/ to there too?

@sadikovi
Copy link
Collaborator

No, I do not want to move those files into a separate repo - that was just an idea, maybe do it in the future. Yes, I will check other files, see if we can read those.

@sadikovi
Copy link
Collaborator

sadikovi commented Nov 1, 2018

It looks like there is more than one problem with the file. First is the one with num_children, the second is the root message type has a repetition of Some(REQUIRED).

From Thrift definition:

/** repetition of the field. The root of the schema does not have a repetition_type.
   * All other nodes must have one */
  3: optional FieldRepetitionType repetition_type;

But it does have in this file. @sunchao I am happy to patch that as well? I am not sure why parquet-cpp is different. I will open PR in a couple of days.

@sadikovi
Copy link
Collaborator

sadikovi commented Nov 1, 2018

It looks like our schema Thrift deserialisation code is not as robust as parquet-cpp. I will work on that.

@sadikovi sadikovi self-assigned this Nov 1, 2018
@sunchao
Copy link
Owner

sunchao commented Nov 1, 2018

This is interesting. Seems the above definition already exist since parquet-format 1.0.0, which is 5 years ago.. not quite sure why parquet-cpp is different and whether parquet-mr also does the same thing.

Thanks for working on this @sadikovi !

@sadikovi
Copy link
Collaborator

sadikovi commented Nov 1, 2018 via email

@sadikovi
Copy link
Collaborator

sadikovi commented Nov 3, 2018

I tried reading files from the s3 bucket. One file can't be read due to the issue with Int96, looks like our code things the value is invalid, but it could be our conversion. I will have a closer look.

Relates to #148 in a sense there are issues with Int96.

@sunchao
Copy link
Owner

sunchao commented Nov 7, 2018

@sadikovi , @gnieto : let me know if this can be closed now. :)

@sadikovi
Copy link
Collaborator

sadikovi commented Nov 7, 2018 via email

@gnieto
Copy link
Contributor Author

gnieto commented Nov 7, 2018

It seems solved on the last version

@gnieto gnieto closed this as completed Nov 7, 2018
@sunchao
Copy link
Owner

sunchao commented Nov 7, 2018

But there is
another file timestamps.parquet which has Int96 values that we can’t
convert to dates to print.

Is that related to the dates before 1970? we can open another issue for that.

@sadikovi
Copy link
Collaborator

sadikovi commented Nov 7, 2018 via email

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

3 participants