
Hoodie dataset not queryable because of invalid parquet files. All invalid parquet files are 4 bytes in length. #58

Closed
prazanna opened this issue Jan 11, 2017 · 1 comment
@prazanna
Contributor

Root cause:

  • The 20170110210127 commit succeeded with all files containing valid content, and was archived and cleaned.
  • Komondor gets the RDD[WriteStatus] and calls count() on this RDD to update the number of bad records, etc.
  • Hoodie persists the RDD to avoid recomputation.
  • Because of DataNode restarts, some of the persisted partitions are not available, and Spark re-executes the upsert DAG for the missing partitions.
  • This kicks off the DAG again with the same commit time; this DAG tried 3 times to re-compute and failed (again most likely because of DataNodes restarting).
  • To account for partial failures in the update task, we delete the data file if its path already exists, so a bunch of data files are deleted and recreated.
  • The files that were open when the task failed were all 4-byte files (just the parquet magic header); see the sketch after this list for locating such files.
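
For reference, a minimal sketch (not Hoodie code; the base path argument and the `.parquet` suffix check are assumptions) for locating the truncated files described above, i.e. parquet files whose length is exactly 4 bytes and therefore contain nothing but the leading "PAR1" magic:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class FindTruncatedParquetFiles {
  public static void main(String[] args) throws IOException {
    Path basePath = new Path(args[0]);                  // dataset base path (assumption)
    FileSystem fs = basePath.getFileSystem(new Configuration());
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(basePath, true);
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      // A valid parquet file has the magic at both ends plus a footer, so a
      // 4-byte file can only be the leading "PAR1" written before the task died.
      if (status.getPath().getName().endsWith(".parquet") && status.getLen() == 4L) {
        System.out.println("truncated: " + status.getPath());
      }
    }
  }
}
```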

Resolution:

  • Should not auto-commit by default. Commit should be called after all the processing is done, publishing the data files atomically (see the sketch after this list).
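
A sketch of the proposed flow, assuming a hypothetical write-client interface (the `WriteClient`, `startCommit`, `upsert`, and `commit` names are illustrative, not the actual Hoodie API). The point is the ordering: persist and inspect the write statuses first, then publish the data files atomically in a single explicit commit:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

public class ExplicitCommitFlow {
  // Hypothetical client type standing in for Hoodie's own write client.
  interface WriteClient<R, S> {
    String startCommit();
    JavaRDD<S> upsert(JavaRDD<R> records, String commitTime);
    void commit(String commitTime, JavaRDD<S> statuses);
  }

  static <R, S> void writeWithExplicitCommit(WriteClient<R, S> client, JavaRDD<R> records) {
    String commitTime = client.startCommit();
    JavaRDD<S> statuses = client.upsert(records, commitTime);

    // Persist so that counting bad records etc. does not re-trigger the upsert DAG
    // if cached partitions are lost (the failure mode described above).
    statuses.persist(StorageLevel.MEMORY_AND_DISK());
    System.out.println("write statuses: " + statuses.count());

    // Only after all bookkeeping succeeds are the new data files published atomically.
    client.commit(commitTime, statuses);
  }
}
```
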
@prazanna prazanna added this to the 0.2.5 milestone Jan 11, 2017
@prazanna prazanna self-assigned this Jan 11, 2017
@prazanna prazanna added the bug label Jan 11, 2017
@prazanna
Contributor Author

Related issue: #59
