Closed
Labels
enhancement: New feature or request
Description
Splits sometimes need to be deleted:
- because they were replaced after a merge.
- because they were uploaded, but some failure prevented their publication. (In that case they are in the Staged state)
- because they only contain documents out of the retention period.
We never delete splits right away, to avoid interfering with in-flight queries. Instead, we change their state to MarkForDelete. The GC then periodically deletes the splits that have been in the MarkForDelete state for more than a given grace period.
Delete here means "delete them from the storage", then "delete them from the metastore".
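To make the lifecycle above concrete, here is a minimal sketch of one GC pass. It is not the actual implementation: the `Split` dataclass, the `updated_at` field, the in-memory `metastore` dict and `storage` set, and the `gc_pass` function are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Split:
    split_id: str
    state: str         # "Staged", "Published", or "MarkForDelete"
    updated_at: float  # hypothetical: time of the last state change, in seconds


def gc_pass(metastore: dict, storage: set, now: float, grace_period: float) -> list:
    """Delete splits that have been in MarkForDelete for longer than grace_period.

    `metastore` maps split_id -> Split; `storage` is the set of split ids
    that currently have a file in storage. Both are in-memory stand-ins.
    """
    deleted = []
    for split in list(metastore.values()):
        if split.state != "MarkForDelete":
            continue
        if now - split.updated_at <= grace_period:
            continue  # still within the grace period, in-flight queries may use it
        storage.discard(split.split_id)  # 1. delete from the storage
        del metastore[split.split_id]    # 2. then delete from the metastore
        deleted.append(split.split_id)
    return deleted
```

One plausible benefit of this ordering: if the pass fails between the two steps, the metastore entry still exists, so a later pass can retry the deletion instead of leaving an untracked file in storage.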
The current GC is pretty much stateless. Periodically it wakes up, and queries the metastore.
On a large deployment (50 indexes) with partitioning, our current approach to GC has proved too inefficient.
There is a lot of room for improvement for our GC:
- push down the predicates that compute the list of splits that are candidates for deletion, and add indexes (`SELECT split_id` etc. `LIMIT 100`?)
- incremental GC
- batch delete requests at the storage level
It would also be very helpful to improve the observability of the GC.