Repair should "compact" data on the receiving side before writing it to disk #13308

Open
vladzcloudius opened this issue Mar 23, 2023 · 10 comments

Comments

@vladzcloudius
Contributor

vladzcloudius commented Mar 23, 2023

Currently, a (regular) repair is going to bring expired data (tombstones or TTLed data) to a replica that doesn't have it. This can cause such data to live forever in the cluster.

Instead, we should still send such data but "compact" it on the receiving end: if there is no local data that the expired data would shadow - drop that expired data instead of writing it.

Unlike compaction, repair can afford to check the actual data item in question and not just the corresponding sstable metadata, because it reads the data anyway to create a digest.

This proposal is orthogonal to #3561
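
A minimal sketch of the proposed receiver-side check (all names here are hypothetical and only illustrate the decision logic; the real implementation would live in Scylla's C++ repair path):

```python
# Hypothetical model of the proposed receiver-side "compaction" step; these
# names do not exist in Scylla and only illustrate the decision logic.
import time


def is_expired(item: dict, gc_before: float) -> bool:
    """A tombstone or TTLed cell is expired once its deletion/expiry time is
    older than the GC threshold (gc_grace / tombstone_gc policy)."""
    return item["expiry_time"] < gc_before


def shadows_local_data(item: dict, local_rows: dict) -> bool:
    """Does the local replica still hold older data that this item would shadow?"""
    local = local_rows.get(item["key"])
    return local is not None and local["timestamp"] < item["timestamp"]


def write_repair_item(item: dict, local_rows: dict, sstable_writer) -> None:
    """Receiver side of repair: drop expired items that shadow nothing locally,
    write everything else to disk as usual."""
    if is_expired(item, gc_before=time.time()) and not shadows_local_data(item, local_rows):
        return  # nothing local to shadow - skip writing the expired item
    sstable_writer.append(item)
```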

@vladzcloudius
Contributor Author

cc @asias @raphaelsc @tgrabiec

@tgrabiec
Contributor

If we do #3561, then we don't need this, right?

@vladzcloudius
Contributor Author

vladzcloudius commented Mar 23, 2023

If we do #3561, then we don't need this, right?

IMO we should not implement #3561: #3561 (comment)

...and implement this instead.

#3561 is dangerous and it breaks the semantics of the repair, as has been explained at length in #3561.

This proposal is going to both address the main issues #3561 wanted to address and preserve the current repair semantics - I'm referring to the promise of data consistency after the repair.

@tgrabiec
Contributor

#3561 is dangerous and it breaks the semantics of the repair ...

You're talking about replicating data which expired, which may help for the case when the user didn't run repair on time but there is still data which has not been garbage collected.

We (@asias) implemented the tombstone_gc = repair mode, which helps with that, but also ensures that the data is not garbage collected prematurely, so it helps in more cases. Maybe we should work towards making that the default mode, instead of adding more crutches for corner cases.

@vladzcloudius
Contributor Author

vladzcloudius commented Mar 23, 2023

#3561 is dangerous and it breaks the semantics of the repair ...

You're talking about replicating data which expired, which may help for the case when the user didn't run repair on time but there is still data which has not been garbage collected.

Correct.
Imagine RF=3, 3 nodes:

  • Node A and Node B have a tombstone.
  • Node C doesn't have the tombstone but has data that the tombstone was supposed to shadow.
  • The tombstone is expired but not evicted yet.
  • Your reads with QUORUM always return a correct result.
  • You run a repair and expect that after that the read result is going to be consistent.
  • However, #3561 ("Streaming should compact data before sending") is going to break that promise: with #3561 the expired tombstone is never sent to Node C, so if the tombstones on A and B are evicted after the repair, the data on Node C is going to be resurrected. This way, reads with QUORUM before the eviction are going to return a different result than reads after the eviction.

#13308 is going to avoid such a resurrection.
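
A toy model of that scenario (purely illustrative Python, not how Scylla represents replicas or reads) showing how QUORUM results flip once A and B evict their tombstones while C never received one:

```python
# Toy model of the RF=3 scenario above: key -> ("kind", timestamp) per replica.
from itertools import combinations

TOMBSTONE = ("tombstone", 20)  # expired but not yet evicted on A and B
OLD_DATA = ("data", 10)        # the value the tombstone was supposed to shadow

replicas = {
    "A": {"k": TOMBSTONE},
    "B": {"k": TOMBSTONE},
    "C": {"k": OLD_DATA},      # C never got the tombstone
}


def quorum_read_results(key):
    """All results a QUORUM (2-of-3) read could return."""
    results = set()
    for pair in combinations(replicas.values(), 2):
        versions = [r[key] for r in pair if key in r]
        if not versions:
            results.add(None)
            continue
        kind, _ = max(versions, key=lambda v: v[1])  # newest timestamp wins
        results.add(None if kind == "tombstone" else "old data")
    return results


print(quorum_read_results("k"))  # {None}: the tombstone always shadows the data

# With #3561, repair never sends the expired tombstone to C. Later the
# tombstones on A and B are evicted (gc_grace / tombstone_gc timeout passes):
del replicas["A"]["k"]
del replicas["B"]["k"]

print(quorum_read_results("k"))  # now also contains "old data": resurrection
```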

We (@asias) implemented the tombstone_gc = repair mode, which helps with that, but also ensures that the data is not garbage collected prematurely, so it helps in more cases. Maybe we should work towards making that the default mode, instead of adding more crutches for corner cases.

This is not going to help if #3561 is implemented - in a situation like the example above, the tombstones on A and B are going to be evicted and the data on C is going to be resurrected.

Like I said, #3561 is plain dangerous.

@tgrabiec
Contributor

This is not going to help if #3561 is implemented - in a situation like the example above, the tombstones on A and B are going to be evicted and the data on C is going to be resurrected.

With tombstone_gc = repair, the tombstone will not be marked as expired until node C gets it. If something is expired it implies that it was replicated everywhere, so we can also always compact it.
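
A simplified model of that invariant (hypothetical names; in Scylla the real rule is driven by per-range repair history). The CQL shown in the comment assumes the tombstone_gc = {'mode': 'repair'} option syntax, so verify it against the ScyllaDB docs for your version:

```python
# Simplified model of repair-based tombstone GC eligibility. Scylla tracks
# repair history per table/token range; this just shows the rule itself.

def tombstone_is_purgeable(deletion_time: float, last_successful_repair: float) -> bool:
    """Under tombstone_gc = repair, a tombstone may only be purged if the range
    it belongs to was repaired *after* the deletion, i.e. every replica is
    guaranteed to have received it."""
    return deletion_time < last_successful_repair


# Enabling the mode on a table (assumed syntax, verify against the ScyllaDB docs):
#   ALTER TABLE ks.t WITH tombstone_gc = {'mode': 'repair'};
```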

@tgrabiec
Contributor

tgrabiec commented Mar 24, 2023

tombstone_gc = repair also helps with scylladb/scylla-manager#3324 (comment), because we don't need to change the GC grace period, and there's no need to restore it afterwards.

@vladzcloudius
Contributor Author

vladzcloudius commented Mar 24, 2023

This is not going to help if #3561 is implemented - in a situation like the example above, the tombstones on A and B are going to be evicted and the data on C is going to be resurrected.

With tombstone_gc = repair, the tombstone will not be marked as expired until node C gets it. If something is expired it implies that it was replicated everywhere, so we can also always compact it.

Oh, right.
Anyway, modes other than tombstone_gc = repair are still going to be used, because they make more sense in many cases - like time series workloads, where repairs should only be run when nodes are lost. You cannot and should not assume a specific configuration.

I suggest we take the discussion of why tombstone_gc = repair may not suit everybody elsewhere.

The bottom line: for the current default tombstone_gc mode we need a solution to the real problem which haunts us. And this proposal represents a safe solution, while #3561 doesn't.

This solution will also work correctly with tombstone_gc = repair.

@DoronArazii added this to the 5.x milestone May 16, 2023
asias added a commit to asias/scylla that referenced this issue Jun 6, 2023
Consider the following example:

- Node n1, n2 in the cluster
- Create a table with RF 2
- Insert rows with 'using ttl 10'
- After the rows are expired
- Compaction runs on n1 or n2
- Run repair on n1

After the rows are expired, it is possible that compaction on n1 or n2
could purge the expired rows at different times. As a result, when
repair runs, it is possible that n1 has the expired data but n2 does
not. Repair will sync the expired data to n2 from n1. Node n1 cannot
just ignore the expired rows, because the expired rows might shadow
rows on n2.

This patch improves the handling of expired rows by checking on the
receiver node: if the receiver does not have any rows that the expired
row might shadow, it skips writing the expired row to disk. This
prevents compaction and repair from purging and re-syncing the expired
rows back and forth. For example:

Before

- n1: expired rows, n2: expired rows
- compaction
- n1: expired rows, n2: no expired rows
- repair
- n1: expired rows, n2: expired rows
- compaction
- n1: no expired rows, n2: expired rows
- repair
- n1: expired rows, n2: expired rows

After

- n1: expired rows, n2: expired rows
- compaction
- n1: expired rows, n2: no expired rows
- repair
- n1: expired rows, n2: no expired rows
- compaction
- n1: no expired rows, n2: no expired rows
- repair
- n1: no expired rows, n2: no expired rows

Fixes scylladb#13308
Tests dtest/test_skip_expired_rows_on_repair
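
For reference, a rough reproduction sketch of the scenario above, using the Python cassandra driver (which works with Scylla). The keyspace/table names and the gc_grace_seconds = 0 setting are assumptions made so that compaction can purge the expired rows quickly; the nodetool steps run outside the script on a hypothetical 2-node cluster:

```python
# Rough repro sketch of the commit-message scenario (assumed: a 2-node cluster
# reachable at 127.0.0.1 and the cassandra-driver package installed).
import time

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

session.execute("CREATE KEYSPACE IF NOT EXISTS ks WITH replication = "
                "{'class': 'SimpleStrategy', 'replication_factor': 2}")
# gc_grace_seconds = 0 lets expired rows become purgeable immediately (repro-only).
session.execute("CREATE TABLE IF NOT EXISTS ks.t (pk int PRIMARY KEY, v int) "
                "WITH gc_grace_seconds = 0")

for pk in range(100):
    session.execute("INSERT INTO ks.t (pk, v) VALUES (%s, %s) USING TTL 10", (pk, pk))

time.sleep(15)  # let the rows expire

# Then, per the commit message, outside this script:
#   nodetool compact ks t   # on n1 or n2: purges the expired rows on that node only
#   nodetool repair ks t    # on n1: without the fix the expired rows are streamed back
```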
@asias
Contributor

asias commented Jun 6, 2023

A PR has been sent for this issue: #14148

@mykaul modified the milestones: 5.x, 5.4 Jun 6, 2023
@vladzcloudius
Contributor Author

A PR has been sent for this issue: #14148

@avikivity @denesb here is the "receiving side" patchwork I was referring to on Botond's PR earlier today.

@mykaul added the P2 High Priority label Aug 23, 2023
@DoronArazii modified the milestones: 5.4, 6.0 Sep 3, 2023
@mykaul modified the milestones: 6.0, Backlog Nov 12, 2023