
Fast and safe tombstone GC for group0-managed tables #15607

Open · tgrabiec opened this issue Oct 2, 2023 · 27 comments

@tgrabiec (Contributor) commented Oct 2, 2023

This solution mainly applies to schema tables, but the same mechanism works for other group0 tables.

Schema tables have a gc grace period of 7 days, which allows replicas lagging behind schema changes to catch up. In some scenarios this can cause an accumulation of tombstones, which then slows down the processing of schema changes, since it needs to read the schema.

Also, if one of the nodes doesn't catch up within 7 days, we may end up with corrupted group0 state because tombstones will not be propagated to that node.

This period could be greatly reduced in raft mode if expiration is not time-based but coupled with group0 members catching up. Replicas already support expiring based on an external clock, which is part of the repair-based tombstone gc. We should switch group0 tables to use this mode (schema option tombstone_gc = {'mode': 'repair'}) and implement a mechanism which treats group0 catch-up as repair. This group0 mechanism should simply call tombstone_gc_state::update_repair_time(table_id, token_range, repair_time) for all group0 tables and the full token range to indicate that tombstones whose deletion time is <= repair_time can be expired.
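As a rough illustration of that hook (not the actual implementation): the update_repair_time() signature is the one quoted above, while the way the group0 table set and the full token range reach the function is hypothetical.

```cpp
#include <vector>
// tombstone_gc_state, table_id, dht::token_range and gc_clock are
// Scylla-internal types; their headers are omitted in this sketch.

// Hypothetical hook, called once group0 catch-up has been established:
// mark the full token range of every group0 table as "repaired" up to
// repair_time, so tombstones with deletion time <= repair_time may be GCed.
void treat_group0_catchup_as_repair(tombstone_gc_state& gc_state,
                                    const std::vector<table_id>& group0_tables,
                                    const dht::token_range& full_token_range,
                                    gc_clock::time_point repair_time) {
    for (const auto& id : group0_tables) {
        gc_state.update_repair_time(id, full_token_range, repair_time);
    }
}
```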

The mechanism in group0 to determine this can work as follows. Each command is assigned a monotonically increasing gc_clock::time_point (del_time) which will be the deletion time used by any tombstone in the mutations contained by the command, similar to how we assign a timestamp. We can define the per-state-machine del_time to be equal to that of the last command. It's easy to track by having each group0 command store it in some field as part of its mutations. When all followers catch up to a state machine whose del_time is x, then all tombstones whose deletion time is <= x can be expired. The group0 leader can propagate this information to members by piggybacking on group0 commands, which are propagated everywhere, and store the expiration del_time in a new per-state-machine field called exp_time. When followers receive exp_time, they call update_repair_time().
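A sketch of the bookkeeping this implies, with hypothetical struct and field names (the real state machine would persist these as part of its mutations):

```cpp
// Hypothetical layout, not actual Scylla code; gc_clock is Scylla's
// tombstone-GC clock type.

// Per-command metadata: every tombstone written by this command uses
// del_time as its deletion time (assigned monotonically, like the write
// timestamp).
struct group0_command_gc_info {
    gc_clock::time_point del_time;
};

// Per-state-machine GC state.
struct group0_gc_state {
    gc_clock::time_point del_time;  // del_time of the last applied command
    gc_clock::time_point exp_time;  // tombstones with deletion time <= exp_time may be expired
};
```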

The leader doesn't have to be very diligent in tracking exp_time; it can work in cycles like this:

  1. get the local state machine's del_time (d) and mark the current raft apply index (i). We know that when members catch up with this index, they have received all tombstones with deletion time <= d, so we can set exp_time to d.
  2. wait for all members to catch up with i
  3. set the in-memory exp_time to d; the next group0 command will pick it up.
  4. if del_time != d, go to 1; otherwise wait for a notification.

Piggybacking on the next command rather than sending the info instantly is good enough because tombstone accumulation is created by sending commands, so the two processes will be coupled. If there are no group0 commands, it's fine to delay expiry.

When a node restarts, it recovers exp_time from the group0 state on disk and calls update_repair_time().
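Putting the cycle together in pseudocode-style C++, using the hypothetical group0_gc_state from the sketch above; the helpers for the apply index, the catch-up barrier and the notification are made up for illustration, and leadership changes / errors are ignored:

```cpp
// Sketch of the leader-side cycle described above; every helper is hypothetical.
void exp_time_cycle(group0_gc_state& st) {
    for (;;) {
        // 1. Snapshot the local state machine's del_time and the current apply index.
        const gc_clock::time_point d = st.del_time;
        const uint64_t i = current_apply_index();    // hypothetical

        // 2. Wait for all group0 members to catch up with index i.
        wait_for_all_members_to_reach(i);            // hypothetical barrier

        // 3. Publish d: the next group0 command carries exp_time = d, and
        //    followers call update_repair_time() when they apply it.
        st.exp_time = d;

        // 4. If more tombstones were written meanwhile, loop immediately;
        //    otherwise sleep until a new command bumps del_time.
        if (st.del_time == d) {
            wait_for_del_time_change();              // hypothetical notification
        }
    }
}
// On restart, a node recovers exp_time from the group0 state on disk and
// replays update_repair_time() with it, as described above.
```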

\cc @kbr-scylla @kostja @gleb-cloudius

@kbr-scylla (Contributor) commented Dec 5, 2023

I would strongly prefer a solution that does not require modifying Raft's internals. Anything that uses the concept of "Raft leader" or "follower" or "apply index" would require that, I think.

But that part seems simple to do without going through Raft internals. If we know that every replica has applied changes up to some command X (which we can evaluate based on group 0 state_ids for example, not necessarily Raft's internal apply_index), we can then send a command (or include some extra data on the next command that something else sends - piggy-backing as you said) that would GC the tombstones up to that point.

However, what is non-obvious is how to do this:

wait for all members to catch up with i

Or more generally, how to check that every replica has reached a certain point.

The idea of (periodically?) sending an RPC to every group 0 member is what comes to my mind immediately. Easy to do with the topology coordinator, for example. But maybe there's something smarter we could do.
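For illustration, such a poll could look roughly like the sketch below; the RPC for reading a member's applied group 0 state position is hypothetical (real code would use Seastar futures rather than blocking sleeps, and group 0 state_ids rather than a plain integer):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical: block until every group 0 member reports having applied
// state up to at least `target`. The RPC is abstracted as a callable so the
// sketch stays self-contained.
void wait_until_all_members_reach(
        const std::vector<std::string>& members,
        uint64_t target,
        const std::function<uint64_t(const std::string&)>& applied_position_rpc,
        std::chrono::milliseconds poll_interval = std::chrono::seconds(1)) {
    while (!std::all_of(members.begin(), members.end(), [&](const std::string& m) {
        return applied_position_rpc(m) >= target;
    })) {
        std::this_thread::sleep_for(poll_interval);
    }
}
```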

@avikivity (Member):

Don't we already have this? truncating the log == acknowledgement that all members have caught up to some point

So we need to extend the log truncation message with the corresponding timestamp (establishing an equivalence between term/index and timestamp)

@kbr-scylla (Contributor):

truncating the log == acknowledgement that all members have caught up to some point

That's not what Raft does (neither our implementation nor the paper's description, IIRC).
The decision to truncate the log is made locally by each replica independently at the moment of taking a snapshot, and it depends on the length / size of the replica's log -- not on any remote information.

@gleb-cloudius (Contributor):

Isn't it enough to locally GC only those tombstones that are older than the current local snapshot?

@kbr-scylla (Contributor):

Isn't it enough to locally GC only those tombstones that are older than the current local snapshot?

No, scenario:

  1. command 1 causes nodes X and Y to write a live cell to a table
  2. command 2 causes node X to delete the cell with a tombstone (command 2 is not applied to Y due to a network partition or whatever)
  3. node X makes a snapshot and GCs the tombstone
  4. X transfers the snapshot to Y; Y does not get the tombstone because it was GCed on node X, so it keeps the live cell

@avikivity (Member):

truncating the log == acknowledgement that all members have caught up to some point

That's not what Raft does (neither our implementation nor the paper's description, IIRC). The decision to truncate the log is made locally by each replica independently at the moment of taking a snapshot, and it depends on the length / size of the replica's log -- not on any remote information.

Right, I misremembered; there's always the option to transfer the entire state.

@gleb-cloudius (Contributor):

Isn't it enough to locally GC only those tombstones that are older than the current local snapshot?

No, scenario:

1. command 1 causes nodes X and Y to write a live cell to a table

2. command 2 causes node X to delete the cell with a tombstone (command 2 is not applied to Y due to a network partition or whatever)

3. node X makes a snapshot and GCs the tombstone

4. X transfers the snapshot to Y; Y does not get the tombstone because it was GCed on node X, so it keeps the live cell

Hmm, yes. The decision needs to be global, not local.

@tgrabiec (Contributor, Author) commented Dec 5, 2023 via email

@gleb-cloudius (Contributor):

The Raft leader can easily do it: when it computes the commit index, it could also compute a catch-up index.

Commit index is irrelevant. Apply index is the relevant one.

@tgrabiec (Contributor, Author) commented Dec 5, 2023

The Raft leader can easily do it: when it computes the commit index, it could also compute a catch-up index.

Commit index is irrelevant. Apply index is the relevant one.

It's enough if we base the catch-up index on the replica's stable index, capped at the leader's commit index. That's because even though the replicas haven't applied the command yet, it's in the log and can't be uncommitted. So the command which tells them to change the expiry time will be applied after the replicas apply all the commands up to the catch-up index.

@kbr-scylla (Contributor):

It's enough if we base the catch-up index on the replica's stable index, capped at the leader's commit index. That's because even though the replicas haven't applied the command yet, it's in the log and can't be uncommitted.

Can't it happen that even though a replica knows some command is committed, it can still receive a snapshot before applying it, which will fast-forward the replica's state, skipping that command?

@gleb-cloudius (Contributor):

The Raft leader can easily do it: when it computes the commit index, it could also compute a catch-up index.

Commit index is irrelevant. Apply index is the relevant one.

It's enough if we base the catch-up index on the replica's stable index, capped at the leader's commit index. That's because even though the replicas haven't applied the command yet, it's in the log and can't be uncommitted. So the command which tells them to change the expiry time will be applied after the replicas apply all the commands up to the catch-up index.

Having it in the log does not mean it will be applied through the log IIRC. The following may theoretically happen:

  1. Command C1 is in the follower's log, but not applied yet
  2. The follower missed C2 and C3
  3. The leader snapshotted up to and including C3
  4. The leader tries to replicate C3 to the follower, finds a mismatch and transfers the snapshot instead. C1 is never applied directly even though it was part of the log at some point.

@kbr-scylla (Contributor):

And let me state again I don't like where this discussion is going -- modifying Raft internals or depending on Raft internals ("leader", "stable index", "apply index", etc.) to work around the limits of our particular state machine implementation.

@gleb-cloudius (Contributor):

And let me state again I don't like where this discussion is going -- modifying Raft internals or depending on Raft internals ("leader", "stable index", "apply index", etc.) to work around the limits of our particular state machine implementation.

Absolutely. The most we can depend on is the state machine's state (well, maybe we can depend on a leader in the same way we do with the topology coordinator). Things like the apply index are not even known to the leader, and even the commit index is something it figures out as it goes, so it can go back.

@tgrabiec (Contributor, Author) commented Dec 5, 2023 via email

@gleb-cloudius (Contributor) commented Dec 5, 2023

Why is it a problem? Still, the expiry command will be applied after the snapshot transfer completes, so after the replica receives the writes which were in C1.

I do not know what an "expiry command" is, but the scenario above shows how you can miss the tombstone entirely even though it was committed and replicated everywhere, but not applied.

@tgrabiec (Contributor, Author) commented Dec 5, 2023 via email

@tgrabiec (Contributor, Author) commented Dec 5, 2023 via email

@kostja (Contributor) commented Dec 5, 2023

I think that with group0-based schema versions the impact of the problem is reduced.
Is there any specific scenario you're concerned with, @tgrabiec?

@kostja (Contributor) commented Dec 5, 2023

Going forward there are more radical solutions to this problem. One is to change snapshot transfer for Raft snapshots to carry only the live data and avoid having to merge the snapshot with the existing state.
Another is to change the storage layer for Raft tables completely.
Either way it seems to be a follow-up to Raft-based tablets, not an independent problem that is only relevant for the group0 schema.

@tgrabiec (Contributor, Author) commented Dec 5, 2023

I think that with group0-based schema versions the impact of the problem is reduced. Is there any specific scenario you're concerned with, @tgrabiec?

The performance impact on digest calculation is reduced because we no longer do the digest calculation, which would have to scan the tombstones. We still do reads of the schema though (read_table_names_of_keyspace()). I don't have specific concerns, but I recall several issues in our tests which were tracked down to this problem.

Another thing is that relying on time-based expiry is a correctness problem, which can theoretically manifest in a corrupted group0 state if for some reason one of the nodes doesn't catch up within 7 days.

@tgrabiec tgrabiec changed the title Fast tombstone GC for group0-managed tables Fast and safe tombstone GC for group0-managed tables Dec 5, 2023
@kostja kostja added this to the Yellowfin milestone Dec 5, 2023
@kostja kostja self-assigned this Dec 5, 2023
@kbr-scylla (Contributor):

A customer case where they were affected by accumulating tombstones: https://github.com/scylladb/scylla-enterprise/issues/3741

This could help with such cases by clearing the tombstones faster.

@mykaul (Contributor) commented Feb 19, 2024

A customer case where they were affected by accumulating tombstones: scylladb/scylla-enterprise#3741

This could help with such cases by clearing the tombstones faster.

Should the milestone be changed then?

@nuivall (Member) commented Feb 21, 2024

This will also affect auth-v2 in the future (see #16578 (comment))

@mykaul (Contributor) commented Feb 21, 2024

@avikivity - this somehow resembles what we've discussed yesterday, on faster clearing of tombstones with Raft - probably easier/better with tablets.

@avikivity (Member):

Yes, it's similar but different (and easier). Raft-managed tables (it doesn't have to be just group 0) already know if they are missing any updates, and know when they catch up.

@kbr-scylla kbr-scylla modified the milestones: Yellowfin, 6.1 Feb 21, 2024
@kbr-scylla (Contributor):

Set milestone to 6.1

@denesb denesb removed their assignment May 23, 2024