Repair should "compact" data on the receiving side before writing it to disk #13308

Open
vladzcloudius opened this issue Mar 23, 2023 · 10 comments

Comments

@vladzcloudius
Contributor

vladzcloudius commented Mar 23, 2023

Currently, a (regular) repair is going to bring expired data (tombstones or TTLed data) to a replica that doesn't have it. This can cause such data to live forever in the cluster.

Instead, we should still send such data but "compact" it on the receiving end: if there is no local data that the expired data would shadow - drop that expired data instead of writing it.

Unlike compaction, repair can afford to check the actual data item in question and not just the corresponding sstable metadata, because it reads the data anyway to create a digest.

This proposal is orthogonal to #3561
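
A minimal sketch of the proposed receiver-side check (all names here are hypothetical and only illustrate the decision logic; the real implementation would live in Scylla's C++ repair path):

```python
# Hypothetical model of the proposed receiver-side "compaction" step; these
# names do not exist in Scylla and only illustrate the decision logic.
import time


def is_expired(item: dict, gc_before: float) -> bool:
    """A tombstone or TTLed cell is expired once its deletion/expiry time is
    older than the GC threshold (gc_grace / tombstone_gc policy)."""
    return item["expiry_time"] < gc_before


def shadows_local_data(item: dict, local_rows: dict) -> bool:
    """Does the local replica still hold older data that this item would shadow?"""
    local = local_rows.get(item["key"])
    return local is not None and local["timestamp"] < item["timestamp"]


def write_repair_item(item: dict, local_rows: dict, sstable_writer) -> None:
    """Receiver side of repair: drop expired items that shadow nothing locally,
    write everything else to disk as usual."""
    if is_expired(item, gc_before=time.time()) and not shadows_local_data(item, local_rows):
        return  # nothing local to shadow - skip writing the expired item
    sstable_writer.append(item)
```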

@vladzcloudius
Contributor Author

cc @asias @raphaelsc @tgrabiec

@tgrabiec
Contributor

If we do #3561, then we don't need this, right?

@vladzcloudius
Contributor Author

vladzcloudius commented Mar 23, 2023

If we do #3561, then we don't need this, right?

IMO we should not implement #3561: #3561 (comment)

...and implement this instead.

#3561 is dangerous and it breaks the semantics of the repair, as has been explained at length in #3561.

This proposal is going to both address the main issues #3561 wanted to address and preserve the current repair semantics - I'm referring to the promise of data consistency after the repair.

@tgrabiec
Contributor

#3561 is dangerous and it breaks the semantics of the repair ...

You're talking about replicating data which expired, which may help for the case when the user didn't run repair on time but there is still data which has not been garbage collected.

We (@asias) implemented the tombstone_gc = repair mode, which helps with that, but also ensures that the data is not garbage collected prematurely, so it helps in more cases. Maybe we should work towards making that the default mode, instead of adding more crutches for corner cases.

@vladzcloudius
Contributor Author

vladzcloudius commented Mar 23, 2023

#3561 is dangerous and it breaks the semantics of the repair ...

You're talking about replicating data which expired, which may help for the case when the user didn't run repair on time but there is still data which has not been garbage collected.

Correct.
Imagine RF=3, 3 nodes:

  • Node A and Node B have a tombstone.
  • Node C doesn't have the tombstone but has data that the tombstone was supposed to shadow.
  • The tombstone is expired but not evicted yet.
  • Your reads with QUORUM always return a correct result.
  • You run a repair and expect that after that the read result is going to be consistent.
  • However, #3561 ("Streaming should compact data before sending") is going to break that promise: with #3561 the expired tombstone is never sent to Node C, so if the tombstones on A and B are evicted after the repair, the data on Node C is going to be resurrected. This way, reads with QUORUM before the eviction are going to return a different result than reads after the eviction.

#13308 is going to avoid such a resurrection.
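
A toy model of that scenario (purely illustrative Python, not how Scylla represents replicas or reads) showing how QUORUM results flip once A and B evict their tombstones while C never received one:

```python
# Toy model of the RF=3 scenario above: key -> ("kind", timestamp) per replica.
from itertools import combinations

TOMBSTONE = ("tombstone", 20)  # expired but not yet evicted on A and B
OLD_DATA = ("data", 10)        # the value the tombstone was supposed to shadow

replicas = {
    "A": {"k": TOMBSTONE},
    "B": {"k": TOMBSTONE},
    "C": {"k": OLD_DATA},      # C never got the tombstone
}


def quorum_read_results(key):
    """All results a QUORUM (2-of-3) read could return."""
    results = set()
    for pair in combinations(replicas.values(), 2):
        versions = [r[key] for r in pair if key in r]
        if not versions:
            results.add(None)
            continue
        kind, _ = max(versions, key=lambda v: v[1])  # newest timestamp wins
        results.add(None if kind == "tombstone" else "old data")
    return results


print(quorum_read_results("k"))  # {None}: the tombstone always shadows the data

# With #3561, repair never sends the expired tombstone to C. Later the
# tombstones on A and B are evicted (gc_grace / tombstone_gc timeout passes):
del replicas["A"]["k"]
del replicas["B"]["k"]

print(quorum_read_results("k"))  # now also contains "old data": resurrection
```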

We (@asias) implemented the tombstone_gc = repair mode, which helps with that, but also ensures that the data is not garbage collected prematurely, so it helps in more cases. Maybe we should work towards making that the default mode, instead of adding more crutches for corner cases.

This is not going to help if #3561 is implemented - in a situation like the example above, the tombstones on A and B are going to be evicted and the data on C is going to be resurrected.

Like I said, #3561 is plain dangerous.

@tgrabiec
Contributor

This is not going to help if #3561 is implemented - in a situation like the example above, the tombstones on A and B are going to be evicted and the data on C is going to be resurrected.

With tombstone_gc = repair, the tombstone will not be marked as expired until node C gets it. If something is expired it implies that it was replicated everywhere, so we can also always compact it.
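
A simplified model of that invariant (hypothetical names; in Scylla the real rule is driven by per-range repair history). The CQL shown in the comment assumes the tombstone_gc = {'mode': 'repair'} option syntax, so verify it against the ScyllaDB docs for your version:

```python
# Simplified model of repair-based tombstone GC eligibility. Scylla tracks
# repair history per table/token range; this just shows the rule itself.

def tombstone_is_purgeable(deletion_time: float, last_successful_repair: float) -> bool:
    """Under tombstone_gc = repair, a tombstone may only be purged if the range
    it belongs to was repaired *after* the deletion, i.e. every replica is
    guaranteed to have received it."""
    return deletion_time < last_successful_repair


# Enabling the mode on a table (assumed syntax, verify against the ScyllaDB docs):
#   ALTER TABLE ks.t WITH tombstone_gc = {'mode': 'repair'};
```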

@tgrabiec
Contributor

tgrabiec commented Mar 24, 2023

tombstone_gc = repair also helps with scylladb/scylla-manager#3324 (comment), because we don't need to change the GC grace period, and there's no need to restore it afterwards.

@vladzcloudius
Contributor Author

vladzcloudius commented Mar 24, 2023

This is not going to help if #3561 is implemented - in a situation like the example above, the tombstones on A and B are going to be evicted and the data on C is going to be resurrected.

With tombstone_gc = repair, the tombstone will not be marked as expired until node C gets it. If something is expired it implies that it was replicated everywhere, so we can also always compact it.

Oh, right.
Anyway, modes other than tombstone_gc = repair are still going to be used, because they make more sense in many cases - like time series workloads, where repairs should only be run when nodes are lost. You cannot and should not assume a specific configuration.

I suggest we take the discussion of why tombstone_gc = repair may not suit everybody elsewhere.

The bottom line: for the current default tombstone_gc mode we need a solution to the real problem which haunts us. And this proposal represents a safe solution, while #3561 doesn't.

This solution will also work correctly with tombstone_gc = repair.

@DoronArazii added this to the 5.x milestone May 16, 2023
asias added a commit to asias/scylla that referenced this issue Jun 6, 2023
Consider the following example:

- Node n1, n2 in the cluster
- Create a table with RF 2
- Insert rows with 'using ttl 10'
- After the rows are expired
- Compaction runs on n1 or n2
- Run repair on n1

After the rows are expired, it is possible that compaction on n1 or n2
could purge the expired rows at different times. As a result, when
repair runs, it is possible that n1 has the expired data but n2 does
not. Repair will sync the expired data to n2 from n1. Node n1 cannot
just ignore the expired rows, because the expired rows might shadow
rows on n2.

This patch improves the handling of expired rows by checking on the
receiver node: if the receiver does not have any rows that the expired
row might shadow, it skips writing the expired row to disk. This
prevents compaction and repair from purging and re-syncing the expired
rows back and forth. For example:

Before

- n1: expired rows, n2: expired rows
- compaction
- n1: expired rows, n2: no expired rows
- repair
- n1: expired rows, n2: expired rows
- compaction
- n1: no expired rows, n2: expired rows
- repair
- n1: expired rows, n2: expired rows

After

- n1: expired rows, n2: expired rows
- compaction
- n1: expired rows, n2: no expired rows
- repair
- n1: expired rows, n2: no expired rows
- compaction
- n1: no expired rows, n2: no expired rows
- repair
- n1: no expired rows, n2: no expired rows

Fixes scylladb#13308
Tests dtest/test_skip_expired_rows_on_repair
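
For reference, a rough reproduction sketch of the scenario above, using the Python cassandra driver (which works with Scylla). The keyspace/table names and the gc_grace_seconds = 0 setting are assumptions made so that compaction can purge the expired rows quickly; the nodetool steps run outside the script on a hypothetical 2-node cluster:

```python
# Rough repro sketch of the commit-message scenario (assumed: a 2-node cluster
# reachable at 127.0.0.1 and the cassandra-driver package installed).
import time

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

session.execute("CREATE KEYSPACE IF NOT EXISTS ks WITH replication = "
                "{'class': 'SimpleStrategy', 'replication_factor': 2}")
# gc_grace_seconds = 0 lets expired rows become purgeable immediately (repro-only).
session.execute("CREATE TABLE IF NOT EXISTS ks.t (pk int PRIMARY KEY, v int) "
                "WITH gc_grace_seconds = 0")

for pk in range(100):
    session.execute("INSERT INTO ks.t (pk, v) VALUES (%s, %s) USING TTL 10", (pk, pk))

time.sleep(15)  # let the rows expire

# Then, per the commit message, outside this script:
#   nodetool compact ks t   # on n1 or n2: purges the expired rows on that node only
#   nodetool repair ks t    # on n1: without the fix the expired rows are streamed back
```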
@asias
Contributor

asias commented Jun 6, 2023

A PR has been sent for this issue: #14148

@mykaul modified the milestones: 5.x, 5.4 Jun 6, 2023
@vladzcloudius
Contributor Author

A PR has been sent for this issue: #14148

@avikivity @denesb here is the "receiving side" patchwork I was referring to on Botond's PR earlier today.

@mykaul added the P2 High Priority label Aug 23, 2023
@DoronArazii modified the milestones: 5.4, 6.0 Sep 3, 2023
@mykaul modified the milestones: 6.0, Backlog Nov 12, 2023