Repair should "compact" data on the receiving side before writing it to disk #13308
Comments
If we do #3561, then we don't need this, right?
IMO we should not implement #3561: #3561 (comment) ...and should implement this instead. #3561 is dangerous and it breaks the semantics of repair, as has been explained at length in #3561. This proposal will both address the main issues #3561 wanted to address and preserve the current repair semantics - I'm referring to the promise of data consistency after the repair.
You're talking about replicating data which has expired, which may help for the case when the user didn't run repair on time but there is still data which was not garbage collected. We (@asias) implemented the ...
Correct.
#13308 is going to avoid such a resurrection.
This is not going to help if #3561 is implemented - in the situation from the example above, tombstones on A and B are going to be evicted and the data on C is going to be resurrected. Like I said, #3561 is plain dangerous.
Oh, right. I suggest we don't discuss why [...]. The bottom line for the current default [...]: this solution will also work correctly with [...].
Consider the following example:

- Nodes n1, n2 in the cluster
- Create a table with RF 2
- Insert rows with 'USING TTL 10'
- After the rows are expired
- Compaction runs on n1 or n2
- Run repair on n1

After the rows are expired, compaction on n1 or n2 could purge the expired rows at different times. As a result, when repair runs, it is possible that n1 has the expired data but n2 does not. Repair will then sync the expired data to n2 from n1. Node n1 cannot simply ignore the expired rows, because the expired rows might shadow rows on n2.

This patch improves the handling of expired rows by checking on the receiving node: if the receiver does not have any rows the expired row might shadow, it skips writing the expired row to disk. This prevents compaction and repair from purging and resurrecting the expired rows back and forth. For example:

Before:
- n1: expired rows, n2: expired rows
- compaction
- n1: expired rows, n2: no expired rows
- repair
- n1: expired rows, n2: expired rows
- compaction
- n1: no expired rows, n2: expired rows
- repair
- n1: expired rows, n2: expired rows

After:
- n1: expired rows, n2: expired rows
- compaction
- n1: expired rows, n2: no expired rows
- repair
- n1: expired rows, n2: no expired rows
- compaction
- n1: no expired rows, n2: no expired rows
- repair
- n1: no expired rows, n2: no expired rows

Fixes scylladb#13308

Tests: dtest/test_skip_expired_rows_on_repair
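For readers who want to try this by hand, here is a minimal sketch of the reproduction steps above using the Python cassandra-driver and nodetool. The contact points, keyspace/table names, the `gc_grace_seconds = 0` setting and the exact timings are illustrative assumptions, not part of the original report:

```python
import subprocess
import time

from cassandra.cluster import Cluster

# Assumed: a two-node cluster (n1 = 127.0.0.1, n2 = 127.0.0.2) and
# nodetool on PATH. All names below are made up for this sketch.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

session.execute(
    "CREATE KEYSPACE IF NOT EXISTS ks WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 2}")
# gc_grace_seconds = 0 so compaction may purge expired rows right away.
session.execute(
    "CREATE TABLE IF NOT EXISTS ks.t (pk int PRIMARY KEY, v int) "
    "WITH gc_grace_seconds = 0")

session.execute("INSERT INTO ks.t (pk, v) VALUES (1, 1) USING TTL 10")

# Flush both nodes so the TTLed rows land in sstables everywhere.
for host in ('127.0.0.1', '127.0.0.2'):
    subprocess.check_call(['nodetool', '-h', host, 'flush', 'ks'])

time.sleep(11)  # wait for the rows to expire

# Major-compact only n2: n2 purges the expired rows, n1 still has them.
subprocess.check_call(['nodetool', '-h', '127.0.0.2', 'compact', 'ks', 't'])

# Without the patch, repair streams the expired rows from n1 back to n2;
# with the patch, the receiver drops them if they shadow nothing locally.
subprocess.check_call(['nodetool', '-h', '127.0.0.1', 'repair', 'ks'])
```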
A PR has been sent for this issue: #14148
@avikivity @denesb here is the "receiving side" patchwork I was referring to on Botond's PR earlier today.
Currently a (regular) repair will bring expired data (tombstones or TTLed data) to a replica that doesn't have it. This can cause such data to live forever in the cluster.

Instead, we should still send such data but "compact" it on the receiving end: if there is no local data that the expired data would shadow, drop the expired data.

Unlike compaction, repair can afford to check the actual data item in question, not just the corresponding sstable metadata, because it reads the data anyway to create a digest. A sketch of this decision follows.
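To make the proposed receiver-side step concrete, here is a small self-contained Python model of the idea. It is an illustrative sketch, not Scylla's actual C++ repair code; the `Row` shape and all helper names are invented for this example:

```python
# A toy model of the proposed receiver-side "compaction" during repair.
import time
from dataclasses import dataclass
from typing import List

@dataclass
class Row:
    key: str
    timestamp: int   # write timestamp
    expiry: int      # absolute expiry time in seconds (0 = never expires)

def is_expired(row: Row, now: int) -> bool:
    """True if the row is past its TTL (or is otherwise dead data)."""
    return row.expiry != 0 and row.expiry <= now

def shadows_local_data(incoming: Row, local_rows: List[Row]) -> bool:
    """True if the incoming expired row would still delete something we
    hold locally: an older row with the same key."""
    return any(r.key == incoming.key and r.timestamp < incoming.timestamp
               for r in local_rows)

def apply_repair_rows(incoming: List[Row], local_rows: List[Row]) -> List[Row]:
    """Return only the incoming rows worth writing to disk."""
    now = int(time.time())
    kept = []
    for row in incoming:
        if not is_expired(row, now):
            kept.append(row)                 # live data: apply as today
        elif shadows_local_data(row, local_rows):
            kept.append(row)                 # expired, but still needed
        # else: expired and shadows nothing locally -> drop it, so repair
        # does not resurrect data that compaction already purged here.
    return kept
```

A real implementation would also need to account for range tombstones and gc_grace_seconds, but the shape of the decision is the same: expired data that shadows nothing on the receiver is dropped instead of written.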
This proposal is orthogonal to #3561