
Error while creating a dataset (ArrowIOError: Invalid parquet file. Corrupt footer) #321

Closed
fralik opened this issue Feb 28, 2019 · 5 comments


fralik commented Feb 28, 2019

I am trying to make a dataset from existing data, but the process fails with the error ArrowIOError: Invalid parquet file. Corrupt footer.

Here is my setup. I am using Spark on Databricks and have loaded a dataframe into the variable a with the following schema:

a.printSchema()
root
 |-- DataPointId: long (nullable = true)
 |-- Label: integer (nullable = true)
 |-- wtc_TORGear_counts: double (nullable = true)
 |-- wtc_UPSLowBt_counts: double (nullable = true)
 |-- wtc_IO1TRef2_mean: double (nullable = true)
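For a self-contained repro, a can be any dataframe with this schema; here is a toy stand-in (made-up values, just for illustration):

from pyspark.sql.types import (StructType, StructField, LongType,
                               IntegerType, DoubleType)

# Same schema as my real data, with made-up rows.
schema = StructType([
    StructField('DataPointId', LongType(), True),
    StructField('Label', IntegerType(), True),
    StructField('wtc_TORGear_counts', DoubleType(), True),
    StructField('wtc_UPSLowBt_counts', DoubleType(), True),
    StructField('wtc_IO1TRef2_mean', DoubleType(), True),
])
a = spark.createDataFrame([(1, 0, 0.1, 0.2, 0.3), (2, 1, 0.4, 0.5, 0.6)], schema)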

And my code:

import numpy as np
from pyspark.sql.types import LongType, IntegerType, DoubleType

from petastorm.codecs import ScalarCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import Unischema, UnischemaField

ds_url = 'file:///dbfs' + working_dir_spark + 'petastorm-training'

HelloWorldSchema = Unischema('HelloWorldSchema', [
    UnischemaField('DataPointId', np.int64, (), ScalarCodec(LongType()), False),
    UnischemaField('Label', np.int64, (), ScalarCodec(IntegerType()), False),
    UnischemaField('wtc_TORGear_counts', np.double, (), ScalarCodec(DoubleType()), False),
    UnischemaField('wtc_UPSLowBt_counts', np.double, (), ScalarCodec(DoubleType()), False),
    UnischemaField('wtc_IO1TRef2_mean', np.double, (), ScalarCodec(DoubleType()), False),
])

with materialize_dataset(spark, ds_url, HelloWorldSchema):
    a.write.parquet(ds_url, mode='overwrite')

I get the same error when I try to use make_batch_reader. I am using Python 3.6, petastorm 0.6.0, and pyarrow 0.12.1.
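The read attempt looks roughly like this (a minimal sketch; make_batch_reader is petastorm's reader for plain Parquet stores):

from petastorm import make_batch_reader

# Open the store and pull one batch; the ArrowIOError surfaces here too.
with make_batch_reader(ds_url) as reader:
    batch = next(iter(reader))
    print(batch.DataPointId[:5])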

Does anyone know how to make it work?

selitvin (Collaborator) commented Mar 1, 2019

Your code looks good. A couple of questions:

  1. Are you planning to have tensors in your datasets (your example shows only scalars)? If not, I suggest you write the data without the materialize_dataset context and open it with make_batch_reader; make_batch_reader is much faster when the underlying dataset has only native Apache Parquet data types. See the sketch after this list.
  2. Are you able to read your store with standard Spark tools, and does it contain data? I think I have seen this kind of error message when the store is actually empty:
spark.read.parquet(ds_url).count()
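Roughly like this (an untested sketch, reusing your a and ds_url):

# Write a plain Parquet store; no petastorm metadata is needed when
# all columns are native Parquet types.
a.write.parquet(ds_url, mode='overwrite')

from petastorm import make_batch_reader

# Stream the store back as batches of rows (named tuples of numpy arrays).
with make_batch_reader(ds_url) as reader:
    for batch in reader:
        print(batch.Label[:10])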

Up until now we have been testing with pyarrow 0.11. I now see some failures with pyarrow 0.12, but they seem unrelated to your scenario.

fralik (Author) commented Mar 1, 2019

  1. I plan to use 2D matrices; I used scalars to keep the issue report simple. I tried make_batch_reader and got the same error message.
  2. Yes, I can read the source perfectly fine. Note that I use the dataframe in variable a as the source:
a.count()
501

and I do not have any empty values either; the count for every column is 501.

For reference, I went ahead and ran the HelloWorld example, and it gave me the same error. So I guess it is something in my setup.

selitvin (Collaborator) commented

@fralik, did you figure out whether this is an issue in your local setup or in petastorm?

fralik (Author) commented Mar 14, 2019

I only use Spark on Databricks, so there are not many options to test it further. I'll close the issue.
Thanks for looking into it.

sgvarsh commented Sep 10, 2019

@fralik I am also having the same issue on Databricks Spark. Did you find a solution or a workaround?
