
Hoodie dataset not queryable because of invalid parquet files. All invalid parquet files are 4 bytes in length. #58

Closed
prazanna opened this issue Jan 11, 2017 · 1 comment
@prazanna
Contributor

Root cause:

  • The 20170110210127 commit succeeded with all files containing valid content, and was archived and cleaned.
  • Komondor gets the RDD[WriteStatus] and calls count() on this RDD to update the number of bad records, etc.
  • Hoodie persists the RDD to avoid recomputation.
  • Because of DataNode restarts, some of the persisted partitions are not available, and Spark re-executes the upsert DAG for the missing partitions.
  • This kicks off the DAG again with the same commit time; this DAG tried 3 times to re-compute and failed (again most likely because of DataNodes restarting).
  • To account for partial failures in the update task, we delete the data file if its path already exists, so a bunch of data files are deleted and recreated.
  • The files that were open when the task failed were all 4-byte files (just the parquet magic header); see the sketch after this list for locating such files.
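
For reference, a minimal sketch (not Hoodie code; the base path argument and the `.parquet` suffix check are assumptions) for locating the truncated files described above, i.e. parquet files whose length is exactly 4 bytes and therefore contain nothing but the leading "PAR1" magic:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class FindTruncatedParquetFiles {
  public static void main(String[] args) throws IOException {
    Path basePath = new Path(args[0]);                  // dataset base path (assumption)
    FileSystem fs = basePath.getFileSystem(new Configuration());
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(basePath, true);
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      // A valid parquet file has the magic at both ends plus a footer, so a
      // 4-byte file can only be the leading "PAR1" written before the task died.
      if (status.getPath().getName().endsWith(".parquet") && status.getLen() == 4L) {
        System.out.println("truncated: " + status.getPath());
      }
    }
  }
}
```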

Resolution:

  • Should not auto-commit by default. Commit should be called after all the processing is done, publishing the data files atomically (see the sketch after this list).
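
A sketch of the proposed flow, assuming a hypothetical write-client interface (the `WriteClient`, `startCommit`, `upsert`, and `commit` names are illustrative, not the actual Hoodie API). The point is the ordering: persist and inspect the write statuses first, then publish the data files atomically in a single explicit commit:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

public class ExplicitCommitFlow {
  // Hypothetical client type standing in for Hoodie's own write client.
  interface WriteClient<R, S> {
    String startCommit();
    JavaRDD<S> upsert(JavaRDD<R> records, String commitTime);
    void commit(String commitTime, JavaRDD<S> statuses);
  }

  static <R, S> void writeWithExplicitCommit(WriteClient<R, S> client, JavaRDD<R> records) {
    String commitTime = client.startCommit();
    JavaRDD<S> statuses = client.upsert(records, commitTime);

    // Persist so that counting bad records etc. does not re-trigger the upsert DAG
    // if cached partitions are lost (the failure mode described above).
    statuses.persist(StorageLevel.MEMORY_AND_DISK());
    System.out.println("write statuses: " + statuses.count());

    // Only after all bookkeeping succeeds are the new data files published atomically.
    client.commit(commitTime, statuses);
  }
}
```
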
@prazanna prazanna added this to the 0.2.5 milestone Jan 11, 2017
@prazanna prazanna self-assigned this Jan 11, 2017
@prazanna prazanna added the bug label Jan 11, 2017
@prazanna
Contributor Author

Related issue: #59
