
Garbage Collector: validate that the written report DataFrame isn't empty #4239

Merged

Conversation

Contributor

@Jonathan-Rosenberg Jonathan-Rosenberg commented Sep 22, 2022

Check that the DataFrame is not empty before writing it, so that the write does not throw an exception.

How was this tested?

I ran the GC in dry-run mode against a repo with ~850K objects to be deleted. It used to fail on that repo but now it finishes successfully.

Fixes #4238
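The guard described above can be sketched as follows (a minimal sketch, assuming Spark's Scala API; `removed` and `reportPath` are hypothetical names, not taken from the PR):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: write the GC report only when there is something to
// report, so the Parquet writer is never invoked on an empty DataFrame.
def writeReport(removed: DataFrame, reportPath: String): Unit = {
  if (!removed.isEmpty) { // Dataset.isEmpty is available since Spark 2.4
    removed.write.parquet(reportPath)
  }
}
```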

@Jonathan-Rosenberg Jonathan-Rosenberg added the exclude-changelog PR description should not be included in next release changelog label Sep 22, 2022
Contributor

@arielshaqed arielshaqed left a comment


Thanks! This definitely solves the issue, but it introduces an inconsistency that will be hard for automated programs to follow.

concatToGCLogsPrefix(storageNSForHadoopFS, s"deleted_objects/$time/deleted.parquet")
)

spark.close()
Contributor


No longer closing?

Contributor Author


closes in line 345

)

spark.close()
if(!removed.isEmpty) {
Contributor


But: this now finishes silently when it succeeds. So any downstream program that needs to process the list of deleted objects has to special-case a GC run that succeeded without deleting anything.

I would much prefer a consistent output: an empty object. Empty Parquet files do exist.
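The consistent-output alternative suggested here can be sketched as follows (a hypothetical sketch assuming Spark's Scala API; the column name and output path are illustrative, not from the PR). A DataFrame built with an explicit schema writes a valid, schema-bearing Parquet file even with zero rows:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("gc-report").getOrCreate()

// Explicit schema, so an empty DataFrame still carries column metadata.
val schema = StructType(Seq(StructField("address", StringType, nullable = false)))

// An empty report: zero rows, but a well-defined schema.
val removed = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

// Parquet happily writes zero rows; downstream readers get a consistent
// (if empty) report instead of a missing file.
removed.write.parquet("deleted.parquet") // hypothetical output path
```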



@arielshaqed arielshaqed left a comment


Neat, thanks!
Non-blocking, but can the String be non-nullable? That would better reflect what it really is.


arielshaqed commented Sep 22, 2022

Also, I think this is worth releasing today any which way. It's a serious boost to usability!

@Jonathan-Rosenberg Jonathan-Rosenberg merged commit e486325 into master Sep 25, 2022
@Jonathan-Rosenberg Jonathan-Rosenberg deleted the fix/spark-metaclient/ignore-empty-dataframe branch September 25, 2022 11:28
Successfully merging this pull request may close these issues.

Running GC in dry run mode always fails