Improve GC performance #2126

Splits sometimes need to be deleted:
- because they were replaced after a merge.
- because they were uploaded, but some failure prevented their publication. (In that case, they are still in the `Staged` state.)
- because they only contain documents that are outside the retention period.

We never delete splits right away, to avoid interfering with in-flight queries. Instead, we change their state to `MarkForDelete`. The GC then periodically deletes the splits that have been in the `MarkForDelete` state for longer than a given grace period.
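As an illustration of the grace-period rule, here is a minimal Python sketch. The `Split` class, the `state_updated_at` field, and the helper names are hypothetical and only model the behavior described above; they are not the actual Quickwit types.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class SplitState(Enum):
    STAGED = "Staged"
    PUBLISHED = "Published"
    MARK_FOR_DELETE = "MarkForDelete"


@dataclass
class Split:
    split_id: str
    state: SplitState
    # Time of the last state change (hypothetical field name).
    state_updated_at: datetime


def mark_for_delete(split: Split, now: datetime) -> None:
    """Logical deletion: only the state changes, the files stay in place."""
    split.state = SplitState.MARK_FOR_DELETE
    split.state_updated_at = now


def deletable_splits(splits, now, grace_period):
    """Return the splits that have been marked for longer than the grace period."""
    return [
        split
        for split in splits
        if split.state is SplitState.MARK_FOR_DELETE
        and now - split.state_updated_at > grace_period
    ]


now = datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc)
splits = [
    Split("split-a", SplitState.PUBLISHED, now - timedelta(hours=5)),
    Split("split-b", SplitState.MARK_FOR_DELETE, now - timedelta(minutes=5)),
    Split("split-c", SplitState.MARK_FOR_DELETE, now - timedelta(hours=2)),
]
eligible = deletable_splits(splits, now, grace_period=timedelta(hours=1))
print([split.split_id for split in eligible])  # ['split-c']
```

Only `split-c` qualifies: `split-a` is still published, and `split-b` was marked too recently for in-flight queries to be guaranteed done with it.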

Here, "delete" means deleting the splits from the storage first, and then deleting them from the metastore.
The current GC is pretty much stateless: periodically, it wakes up and queries the metastore.
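The storage-then-metastore order can be sketched as follows. This is a toy model, not the Quickwit code: `InMemoryStorage`, `gc_pass`, and the dict-based metastore are hypothetical stand-ins. A likely benefit of this order (an assumption, not stated above) is that a failed storage deletion leaves the metastore entry in place, so a stateless GC finds the split again on its next pass instead of leaking an orphan file.

```python
class InMemoryStorage:
    """Toy storage; `failing` lists split ids whose deletion raises."""

    def __init__(self, files, failing=()):
        self.files = set(files)
        self.failing = set(failing)

    def delete(self, split_id):
        if split_id in self.failing:
            raise OSError(f"cannot delete {split_id}")
        self.files.discard(split_id)


def gc_pass(candidate_ids, storage, metastore_rows):
    """Delete from the storage first, then from the metastore."""
    removed_from_storage = []
    for split_id in candidate_ids:
        try:
            storage.delete(split_id)
        except OSError:
            # Keep the metastore entry so that the next pass retries.
            continue
        removed_from_storage.append(split_id)
    # Only splits whose files are really gone are dropped from the metastore.
    for split_id in removed_from_storage:
        metastore_rows.pop(split_id, None)
    return removed_from_storage


storage = InMemoryStorage(["split-1", "split-2"], failing=["split-2"])
metastore_rows = {"split-1": "MarkForDelete", "split-2": "MarkForDelete"}
deleted = gc_pass(["split-1", "split-2"], storage, metastore_rows)
print(deleted)                 # ['split-1']
print(sorted(metastore_rows))  # ['split-2']
```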

At large volume (50 indexes) with partitioning, our current approach to GC has proved too inefficient.

There is a lot of room for improvement in our GC:
- push down the predicates used to compute the list of candidate splits for deletion, add indexes, `SELECT split_id`, etc.
- `LIMIT 100`?
- incremental GC
- batch delete requests at the storage level
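To make some of these ideas concrete, here is a rough sketch using Python's built-in `sqlite3` module. The table layout, column names, index name, and helper functions are all assumptions made for the example, not the actual metastore schema. It pushes the state and grace-period predicates into the query, selects only `split_id`, bounds each round with `LIMIT`, and deletes each batch with a single request.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE splits (
    split_id TEXT PRIMARY KEY,
    index_id TEXT NOT NULL,
    split_state TEXT NOT NULL,
    update_timestamp INTEGER NOT NULL
);
-- Index matching the pushed-down predicates (hypothetical name).
CREATE INDEX splits_state_update_idx ON splits (split_state, update_timestamp);
""")


def fetch_candidates(conn, deadline, limit=100):
    """Let the database filter: return at most `limit` expired split ids."""
    rows = conn.execute(
        "SELECT split_id FROM splits "
        "WHERE split_state = 'MarkForDelete' AND update_timestamp <= ? "
        "ORDER BY update_timestamp LIMIT ?",
        (deadline, limit),
    ).fetchall()
    return [row[0] for row in rows]


def delete_batch(conn, storage_files, split_ids):
    """Delete one batch: storage first (one bulk call), then the metastore."""
    storage_files.difference_update(split_ids)  # stand-in for a bulk delete request
    placeholders = ",".join("?" * len(split_ids))
    conn.execute(f"DELETE FROM splits WHERE split_id IN ({placeholders})", split_ids)
    conn.commit()


def gc_in_batches(conn, storage_files, deadline, limit=100):
    """Repeat bounded rounds until no candidate is left; return the batch sizes."""
    batch_sizes = []
    while True:
        batch = fetch_candidates(conn, deadline, limit)
        if not batch:
            return batch_sizes
        delete_batch(conn, storage_files, batch)
        batch_sizes.append(len(batch))


now, grace = 1_000_000, 3600
rows = (
    [(f"old-{i}", "idx", "MarkForDelete", now - 2 * grace) for i in range(250)]
    + [(f"recent-{i}", "idx", "MarkForDelete", now - 60) for i in range(10)]
    + [(f"live-{i}", "idx", "Published", now - 2 * grace) for i in range(40)]
)
conn.executemany("INSERT INTO splits VALUES (?, ?, ?, ?)", rows)
storage_files = {row[0] for row in rows}

batch_sizes = gc_in_batches(conn, storage_files, deadline=now - grace)
print(batch_sizes)  # [100, 100, 50]
```

The 250 expired splits are removed in three bounded rounds, while the 10 recently marked splits and the 40 published ones are never fetched by the GC.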

It would also be very helpful to improve our observability.
