
Fast and safe tombstone GC for group0-managed tables #15607

Open · tgrabiec opened this issue Oct 2, 2023 · 27 comments

@tgrabiec (Contributor) commented Oct 2, 2023

This solution mainly applies to schema tables, but the same mechanism works for other group0 tables.

Schema tables have a gc grace period of 7 days, which allows replicas lagging behind schema changes to catch up. In some scenarios this can cause an accumulation of tombstones, which then slows down the processing of schema changes, since it needs to read the schema.

Also, if one of the nodes doesn't catch up within 7 days, we may end up with corrupted group0 state because tombstones will not be propagated to that node.

This period could be greatly reduced in raft mode if expiration is not time-based but coupled with group0 members catching up. Replicas already support expiring based on an external clock, which is part of the repair-based tombstone gc. We should switch group0 tables to use this mode (schema option tombstone_gc = {'mode': 'repair'}) and implement a mechanism which treats group0 catch-up as repair. This group0 mechanism should simply call tombstone_gc_state::update_repair_time(table_id, token_range, repair_time) for all group0 tables and the full token range to indicate that tombstones whose deletion time is <= repair_time can be expired.
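As a rough illustration of that hook (not the actual implementation): the update_repair_time() signature is the one quoted above, while the way the group0 table set and the full token range reach the function is hypothetical.

```cpp
#include <vector>
// tombstone_gc_state, table_id, dht::token_range and gc_clock are
// Scylla-internal types; their headers are omitted in this sketch.

// Hypothetical hook, called once group0 catch-up has been established:
// mark the full token range of every group0 table as "repaired" up to
// repair_time, so tombstones with deletion time <= repair_time may be GCed.
void treat_group0_catchup_as_repair(tombstone_gc_state& gc_state,
                                    const std::vector<table_id>& group0_tables,
                                    const dht::token_range& full_token_range,
                                    gc_clock::time_point repair_time) {
    for (const auto& id : group0_tables) {
        gc_state.update_repair_time(id, full_token_range, repair_time);
    }
}
```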

The mechanism in group0 to determine this can work as follows. Each command is assigned a monotonically increasing gc_clock::time_point (del_time) which will be the deletion time used by any tombstone in the mutations contained by the command, similar to how we assign a timestamp. We can define the per-state-machine del_time to be equal to that of the last command. It's easy to track by having each group0 command store it in some field as part of its mutations. When all followers catch up to a state machine whose del_time is x, then all tombstones whose deletion time is <= x can be expired. The group0 leader can propagate this information to members by piggybacking on group0 commands, which are propagated everywhere, and store the expiration del_time in a new per-state-machine field called exp_time. When followers receive exp_time, they call update_repair_time().
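A sketch of the bookkeeping this implies, with hypothetical struct and field names (the real state machine would persist these as part of its mutations):

```cpp
// Hypothetical layout, not actual Scylla code; gc_clock is Scylla's
// tombstone-GC clock type.

// Per-command metadata: every tombstone written by this command uses
// del_time as its deletion time (assigned monotonically, like the write
// timestamp).
struct group0_command_gc_info {
    gc_clock::time_point del_time;
};

// Per-state-machine GC state.
struct group0_gc_state {
    gc_clock::time_point del_time;  // del_time of the last applied command
    gc_clock::time_point exp_time;  // tombstones with deletion time <= exp_time may be expired
};
```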

The leader doesn't have to be very diligent in tracking exp_time; it can work in cycles like this:

  1. get the local state machine's del_time (d) and mark the current raft apply index (i). We know that when members catch up with this index, they have received all tombstones with deletion time <= d, so we can set exp_time to d.
  2. wait for all members to catch up with i
  3. set the in-memory exp_time to d; the next group0 command will pick it up.
  4. if del_time != d, go to 1; otherwise wait for a notification.

Piggybacking on the next command rather than sending the info instantly is good enough because tombstone accumulation is created by sending commands, so the two processes will be coupled. If there are no group0 commands, it's fine to delay expiry.

When a node restarts, it recovers exp_time from the group0 state on disk and calls update_repair_time().
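Putting the cycle together in pseudocode-style C++, using the hypothetical group0_gc_state from the sketch above; the helpers for the apply index, the catch-up barrier and the notification are made up for illustration, and leadership changes / errors are ignored:

```cpp
// Sketch of the leader-side cycle described above; every helper is hypothetical.
void exp_time_cycle(group0_gc_state& st) {
    for (;;) {
        // 1. Snapshot the local state machine's del_time and the current apply index.
        const gc_clock::time_point d = st.del_time;
        const uint64_t i = current_apply_index();    // hypothetical

        // 2. Wait for all group0 members to catch up with index i.
        wait_for_all_members_to_reach(i);            // hypothetical barrier

        // 3. Publish d: the next group0 command carries exp_time = d, and
        //    followers call update_repair_time() when they apply it.
        st.exp_time = d;

        // 4. If more tombstones were written meanwhile, loop immediately;
        //    otherwise sleep until a new command bumps del_time.
        if (st.del_time == d) {
            wait_for_del_time_change();              // hypothetical notification
        }
    }
}
// On restart, a node recovers exp_time from the group0 state on disk and
// replays update_repair_time() with it, as described above.
```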

\cc @kbr-scylla @kostja @gleb-cloudius

@kbr-scylla (Contributor) commented Dec 5, 2023

I would strongly prefer a solution that does not require modifying Raft's internals. Anything that uses the concept of "Raft leader" or "follower" or "apply index" would require that, I think.

But that part seems simple to do without going through Raft internals. If we know that every replica has applied changes up to some command X (which we can evaluate based on group 0 state_ids for example, not necessarily Raft's internal apply_index), we can then send a command (or include some extra data on the next command that something else sends - piggy-backing as you said) that would GC the tombstones up to that point.

However, what is non-obvious is how to do this:

wait for all members to catch up with i

Or more generally, how to check that every replica has reached a certain point.

The idea of (periodically?) sending an RPC to every group 0 member is what comes to my mind immediately. Easy to do with the topology coordinator, for example. But maybe there's something smarter we could do.
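For illustration, such a poll could look roughly like the sketch below; the RPC for reading a member's applied group 0 state position is hypothetical (real code would use Seastar futures rather than blocking sleeps, and group 0 state_ids rather than a plain integer):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical: block until every group 0 member reports having applied
// state up to at least `target`. The RPC is abstracted as a callable so the
// sketch stays self-contained.
void wait_until_all_members_reach(
        const std::vector<std::string>& members,
        uint64_t target,
        const std::function<uint64_t(const std::string&)>& applied_position_rpc,
        std::chrono::milliseconds poll_interval = std::chrono::seconds(1)) {
    while (!std::all_of(members.begin(), members.end(), [&](const std::string& m) {
        return applied_position_rpc(m) >= target;
    })) {
        std::this_thread::sleep_for(poll_interval);
    }
}
```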

@avikivity (Member):

Don't we already have this? truncating the log == acknowledgement that all members have caught up to some point

So we need to extend the log truncation message with the corresponding timestamp (establishing an equivalence between term/index and timestamp)

@kbr-scylla (Contributor):

truncating the log == acknowledgement that all members have caught up to some point

That's not what Raft does (neither our implementation nor the paper's description, IIRC).
The decision to truncate the log is made locally by each replica independently at the moment of taking a snapshot, and it depends on the length / size of the replica's log -- not on any remote information.

@gleb-cloudius (Contributor):

Isn't it enough to locally GC only those tombstones that are older than the current local snapshot?

@kbr-scylla (Contributor):

Isn't it enough to locally GC only those tombstones that are older than the current local snapshot?

No, scenario:

  1. command 1 causes nodes X and Y to write a live cell to a table
  2. command 2 causes node X to delete the cell with a tombstone (command 2 is not applied to Y due to a network partition or whatever)
  3. node X makes a snapshot and GCs the tombstone
  4. X transfers the snapshot to Y; Y does not get the tombstone because it was GCed on node X, so it keeps the live cell

@avikivity (Member):

truncating the log == acknowledgement that all members have caught up to some point

That's not what Raft does (neither our implementation nor the paper's description, IIRC). The decision to truncate the log is made locally by each replica independently at the moment of taking a snapshot, and it depends on the length / size of the replica's log -- not on any remote information.

Right, I misremembered; there's always the option to transfer the entire state.

@gleb-cloudius (Contributor):

Isn't it enough to locally GC only those tombstones that are older than the current local snapshot?

No, scenario:

1. command 1 causes nodes X and Y to write a live cell to a table

2. command 2 causes node X to delete the cell with a tombstone (command 2 is not applied to Y due to a network partition or whatever)

3. node X makes a snapshot and GCs the tombstone

4. X transfers the snapshot to Y; Y does not get the tombstone because it was GCed on node X, so it keeps the live cell

Hmm, yes. The decision needs to be global, not local.

@tgrabiec (Contributor, Author) commented Dec 5, 2023 via email

@gleb-cloudius (Contributor):

The Raft leader can easily do it: when it computes the commit index, it could also compute a catch-up index.

Commit index is irrelevant. Apply index is the relevant one.

@tgrabiec (Contributor, Author) commented Dec 5, 2023

The Raft leader can easily do it: when it computes the commit index, it could also compute a catch-up index.

Commit index is irrelevant. Apply index is the relevant one.

It's enough if we base the catch-up index on the replica's stable index, capped at the leader's commit index. That's because even though the replicas haven't applied the command yet, it's in the log and can't be uncommitted. So the command which tells them to change the expiry time will be applied after the replicas apply all the commands up to the catch-up index.

@kbr-scylla (Contributor):

It's enough if we base the catch-up index on the replica's stable index, capped at the leader's commit index. That's because even though the replicas haven't applied the command yet, it's in the log and can't be uncommitted.

Can't it happen that even though a replica knows some command is committed, it can still receive a snapshot before applying it, which will fast-forward the replica's state, skipping that command?

@gleb-cloudius (Contributor):

The Raft leader can easily do it: when it computes the commit index, it could also compute a catch-up index.

Commit index is irrelevant. Apply index is the relevant one.

It's enough if we base the catch-up index on the replica's stable index, capped at the leader's commit index. That's because even though the replicas haven't applied the command yet, it's in the log and can't be uncommitted. So the command which tells them to change the expiry time will be applied after the replicas apply all the commands up to the catch-up index.

Having it in the log does not mean it will be applied through the log IIRC. The following may theoretically happen:

  1. Command C1 is in the follower's log, but not applied yet
  2. The follower missed C2 and C3
  3. The leader snapshotted up to and including C3
  4. The leader tries to replicate C3 to the follower, finds a mismatch and transfers the snapshot instead. C1 is never applied directly even though it was part of the log at some point.

@kbr-scylla (Contributor):

And let me state again I don't like where this discussion is going -- modifying Raft internals or depending on Raft internals ("leader", "stable index", "apply index", etc.) to work around the limits of our particular state machine implementation.

@gleb-cloudius (Contributor):

And let me state again I don't like where this discussion is going -- modifying Raft internals or depending on Raft internals ("leader", "stable index", "apply index", etc.) to work around the limits of our particular state machine implementation.

Absolutely. The most we can depend on is the state machine's state (well, maybe we can depend on a leader in the same way we do with the topology coordinator). Things like the apply index are not even known to the leader, and even the commit index is something it figures out as it goes, so it can go back.

@tgrabiec (Contributor, Author) commented Dec 5, 2023 via email

@gleb-cloudius (Contributor) commented Dec 5, 2023

Why is it a problem? Still, the expiry command will be applied after the snapshot transfer completes, so after the replica receives the writes which were in C1.

I do not know what an "expiry command" is, but the scenario above shows how you can miss the tombstone entirely even though it was committed and replicated everywhere, but not applied.

@tgrabiec (Contributor, Author) commented Dec 5, 2023 via email

@tgrabiec (Contributor, Author) commented Dec 5, 2023 via email

@kostja (Contributor) commented Dec 5, 2023

I think that with group0-based schema versions the impact of the problem is reduced.
Is there any specific scenario you're concerned with, @tgrabiec?

@kostja (Contributor) commented Dec 5, 2023

Going forward there are more radical solutions to this problem. One is to change snapshot transfer for Raft snapshots to carry only the live data and avoid having to merge the snapshot with the existing state.
Another is to change the storage layer for Raft tables completely.
Either way it seems to be a follow-up to Raft-based tablets, not an independent problem that is only relevant for the group0 schema.

@tgrabiec (Contributor, Author) commented Dec 5, 2023

I think that with group0-based schema versions the impact of the problem is reduced. Is there any specific scenario you're concerned with, @tgrabiec?

The performance impact on digest calculation is reduced because we no longer do the digest calculation, which would have to scan the tombstones. We still do reads of the schema though (read_table_names_of_keyspace()). I don't have specific concerns, but I recall several issues in our tests which were tracked down to this problem.

Another thing is that relying on time-based expiry is a correctness problem, which can theoretically manifest in a corrupted group0 state if for some reason one of the nodes doesn't catch up within 7 days.

@tgrabiec tgrabiec changed the title Fast tombstone GC for group0-managed tables Fast and safe tombstone GC for group0-managed tables Dec 5, 2023
@kostja kostja added this to the Yellowfin milestone Dec 5, 2023
@kostja kostja self-assigned this Dec 5, 2023
@kbr-scylla (Contributor):

A customer case where they were affected by accumulating tombstones: https://github.com/scylladb/scylla-enterprise/issues/3741

This could help with such cases by clearing the tombstones faster.

@mykaul (Contributor) commented Feb 19, 2024

A customer case where they were affected by accumulating tombstones: scylladb/scylla-enterprise#3741

This could help with such cases by clearing the tombstones faster.

Should the milestone be changed then?

@nuivall (Member) commented Feb 21, 2024

This will also affect auth-v2 in the future (see #16578 (comment))

@mykaul (Contributor) commented Feb 21, 2024

@avikivity - this somehow resembles what we've discussed yesterday, on faster clearing of tombstones with Raft - probably easier/better with tablets.

@avikivity (Member):

Yes, it's similar but different (and easier). Raft-managed tables (it doesn't have to be just group 0) already know if they are missing any updates, and know when they catch up.

@kbr-scylla kbr-scylla modified the milestones: Yellowfin, 6.1 Feb 21, 2024
@kbr-scylla (Contributor):

Set milestone to 6.1

@denesb denesb removed their assignment May 23, 2024