
Error while creating a dataset (ArrowIOError: Invalid parquet file. Corrupt footer) #321

Closed
fralik opened this issue Feb 28, 2019 · 5 comments


fralik commented Feb 28, 2019

I am trying to make a dataset from existing data, but the process fails with the error ArrowIOError: Invalid parquet file. Corrupt footer.

Here is my setup. I am using Spark on Databricks and have loaded a dataframe into the variable a with the following schema:

a.printSchema()
root
 |-- DataPointId: long (nullable = true)
 |-- Label: integer (nullable = true)
 |-- wtc_TORGear_counts: double (nullable = true)
 |-- wtc_UPSLowBt_counts: double (nullable = true)
 |-- wtc_IO1TRef2_mean: double (nullable = true)
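For a self-contained repro, a can be any dataframe with this schema; here is a toy stand-in (made-up values, just for illustration):

from pyspark.sql.types import (StructType, StructField, LongType,
                               IntegerType, DoubleType)

# Same schema as my real data, with made-up rows.
schema = StructType([
    StructField('DataPointId', LongType(), True),
    StructField('Label', IntegerType(), True),
    StructField('wtc_TORGear_counts', DoubleType(), True),
    StructField('wtc_UPSLowBt_counts', DoubleType(), True),
    StructField('wtc_IO1TRef2_mean', DoubleType(), True),
])
a = spark.createDataFrame([(1, 0, 0.1, 0.2, 0.3), (2, 1, 0.4, 0.5, 0.6)], schema)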

And my code:

import numpy as np
from pyspark.sql.types import LongType, IntegerType, DoubleType

from petastorm.codecs import ScalarCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import Unischema, UnischemaField

ds_url = 'file:///dbfs' + working_dir_spark + 'petastorm-training'

HelloWorldSchema = Unischema('HelloWorldSchema', [
    UnischemaField('DataPointId', np.int64, (), ScalarCodec(LongType()), False),
    UnischemaField('Label', np.int64, (), ScalarCodec(IntegerType()), False),
    UnischemaField('wtc_TORGear_counts', np.double, (), ScalarCodec(DoubleType()), False),
    UnischemaField('wtc_UPSLowBt_counts', np.double, (), ScalarCodec(DoubleType()), False),
    UnischemaField('wtc_IO1TRef2_mean', np.double, (), ScalarCodec(DoubleType()), False),
])

with materialize_dataset(spark, ds_url, HelloWorldSchema):
    a.write.parquet(ds_url, mode='overwrite')

I get the same error when I try to use make_batch_reader. I am using Python 3.6, petastorm 0.6.0, and pyarrow 0.12.1.
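The read attempt looks roughly like this (a minimal sketch; make_batch_reader is petastorm's reader for plain Parquet stores):

from petastorm import make_batch_reader

# Open the store and pull one batch; the ArrowIOError surfaces here too.
with make_batch_reader(ds_url) as reader:
    batch = next(iter(reader))
    print(batch.DataPointId[:5])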

Does anyone know how to make it work?

selitvin (Collaborator) commented Mar 1, 2019

Your code looks good. A couple of questions:

  1. Are you planning to have tensors in your datasets (your example shows only scalars)? If not, I suggest you write the data without the materialize_dataset context and open it with make_batch_reader; make_batch_reader is much faster when the underlying dataset has only native Apache Parquet data types. See the sketch after this list.
  2. Are you able to read your store with standard Spark tools, and does it contain data? I think I have seen this kind of error message when the store is actually empty:
spark.read.parquet(ds_url).count()
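Roughly like this (an untested sketch, reusing your a and ds_url):

# Write a plain Parquet store; no petastorm metadata is needed when
# all columns are native Parquet types.
a.write.parquet(ds_url, mode='overwrite')

from petastorm import make_batch_reader

# Stream the store back as batches of rows (named tuples of numpy arrays).
with make_batch_reader(ds_url) as reader:
    for batch in reader:
        print(batch.Label[:10])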

Up until now we have been testing with pyarrow 0.11. I now see some failures with pyarrow 0.12, but they seem unrelated to your scenario.

fralik (Author) commented Mar 1, 2019

  1. I plan to use 2D matrices; I used scalars to keep the issue report simple. I tried make_batch_reader and got the same error message.
  2. Yes, I can read the source perfectly fine. Note that I use the dataframe in variable a as the source:
a.count()
501

and I do not have any empty values either; the count for every column is 501.

For reference, I went ahead and ran the HelloWorld example, and it gave me the same error. So I guess it is something in my setup.

selitvin (Collaborator) commented

@fralik, did you figure out whether this is an issue in your local setup or in petastorm?

fralik (Author) commented Mar 14, 2019

I only use Spark on Databricks, so there are not many options to test it further. I'll close the issue.
Thanks for looking into it.

sgvarsh commented Sep 10, 2019

@fralik I am also having the same issue on Databricks Spark. Did you find a solution or a workaround?
