-
Notifications
You must be signed in to change notification settings - Fork 346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Garbage Collector- validate that the written report DataFrame isn't empty #4239
Garbage Collector- validate that the written report DataFrame isn't empty #4239
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This definitely solves the issue, but it introduces an inconsistency that will be hard for automated programs to follow.
concatToGCLogsPrefix(storageNSForHadoopFS, s"deleted_objects/$time/deleted.parquet") | ||
) | ||
|
||
spark.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer closing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
closes in line 345
) | ||
|
||
spark.close() | ||
if(!removed.isEmpty) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But: this finishes running silently if it succeeded. So any program downstream that needs to process the list of deleted objects now has to have a special-case for a GC run that succeeded without deleting anything.
I would very much prefer a consistent output, of an empty object. There do exist empty Parquet files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat, thanks!
Mon blocking, but can the String be non-nullable? That will better reflect what it really is.
Also, I think this is worth releasing today any which way. It's a serious boost to usability! |
Check that the written DataFrame is not empty before writing it, so that it won't throw an exception.
How was this tested?
I ran the GC in dry-run mode against a repo with ~850K objects to be deleted. It used to fail on that repo but now it finishes successfully.
Fixes #4238