
Garbage Collector: validate that the written report DataFrame isn't empty #4239

Merged

Conversation

Contributor

@Jonathan-Rosenberg Jonathan-Rosenberg commented Sep 22, 2022

Check that the DataFrame is not empty before writing it, so that the write does not throw an exception.

How was this tested?

I ran the GC in dry-run mode against a repo with ~850K objects to be deleted. It used to fail on that repo but now it finishes successfully.

Fixes #4238
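The guard described above can be sketched as follows (a minimal sketch, assuming Spark's Scala API; `removed` and `reportPath` are hypothetical names, not taken from the PR):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: write the GC report only when there is something to
// report, so the Parquet writer is never invoked on an empty DataFrame.
def writeReport(removed: DataFrame, reportPath: String): Unit = {
  if (!removed.isEmpty) { // Dataset.isEmpty is available since Spark 2.4
    removed.write.parquet(reportPath)
  }
}
```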

@Jonathan-Rosenberg Jonathan-Rosenberg added the exclude-changelog PR description should not be included in next release changelog label Sep 22, 2022
Contributor

@arielshaqed arielshaqed left a comment


Thanks! This definitely solves the issue, but it introduces an inconsistency that will be hard for automated programs to follow.

concatToGCLogsPrefix(storageNSForHadoopFS, s"deleted_objects/$time/deleted.parquet")
)

spark.close()
Contributor


No longer closing?

Contributor Author


closes in line 345

)

spark.close()
if(!removed.isEmpty) {
Contributor


But: this now finishes silently when it succeeds. So any downstream program that needs to process the list of deleted objects has to special-case a GC run that succeeded without deleting anything.

I would much prefer a consistent output: an empty object. Empty Parquet files do exist.
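The consistent-output alternative suggested here can be sketched as follows (a hypothetical sketch assuming Spark's Scala API; the column name and output path are illustrative, not from the PR). A DataFrame built with an explicit schema writes a valid, schema-bearing Parquet file even with zero rows:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("gc-report").getOrCreate()

// Explicit schema, so an empty DataFrame still carries column metadata.
val schema = StructType(Seq(StructField("address", StringType, nullable = false)))

// An empty report: zero rows, but a well-defined schema.
val removed = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

// Parquet happily writes zero rows; downstream readers get a consistent
// (if empty) report instead of a missing file.
removed.write.parquet("deleted.parquet") // hypothetical output path
```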



@arielshaqed arielshaqed left a comment


Neat, thanks!
Non-blocking, but can the String be non-nullable? That would better reflect what it really is.


arielshaqed commented Sep 22, 2022

Also, I think this is worth releasing today any which way. It's a serious boost to usability!

@Jonathan-Rosenberg Jonathan-Rosenberg merged commit e486325 into master Sep 25, 2022
@Jonathan-Rosenberg Jonathan-Rosenberg deleted the fix/spark-metaclient/ignore-empty-dataframe branch September 25, 2022 11:28
Successfully merging this pull request may close these issues.

Running GC in dry run mode always fails