Streaming should compact data before sending #3561
Comments
Compaction will reduce the amount of data we send, which is good. On the other hand, it will trigger more work during streaming and might have a higher impact on the CQL workload. |
Repair has the same problem. To solve this, we can introduce a compacting reader which compacts data before it emits the results. @denesb |
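To make the idea concrete, here is a minimal, self-contained sketch of such a reader. All type and member names here are invented for illustration; Scylla's actual reader interface is different. The wrapper drops exactly what compaction would drop: cells shadowed by a newer tombstone, and tombstones past their gc_grace period.

```cpp
#include <cstdint>
#include <optional>

// One element of a (much simplified) partition stream: either a live cell
// or a tombstone. Real readers emit mutation fragments; this is just a model.
struct entry {
    int64_t timestamp;         // write timestamp; a tombstone shadows
                               // entries written at or before its timestamp
    bool is_tombstone = false;
    int64_t deletion_time = 0; // wall-clock seconds; used for gc_grace expiry
};

class reader {
public:
    virtual std::optional<entry> next() = 0; // nullopt signals end of stream
    virtual ~reader() = default;
};

// Emits only what would survive compaction at time `now`: live cells not
// shadowed by a tombstone, and tombstones still within gc_grace.
class compacting_reader : public reader {
    reader& _source;
    int64_t _now;               // wall-clock seconds
    int64_t _gc_grace_seconds;
    std::optional<entry> _tomb; // most recent tombstone seen in the stream
public:
    compacting_reader(reader& source, int64_t now, int64_t gc_grace_seconds)
        : _source(source), _now(now), _gc_grace_seconds(gc_grace_seconds) {}

    std::optional<entry> next() override {
        while (auto e = _source.next()) {
            if (e->is_tombstone) {
                _tomb = e; // remember it: it still shadows older cells
                if (e->deletion_time + _gc_grace_seconds >= _now) {
                    return e; // within gc_grace: must still be sent
                }
                continue;     // expired: compacted away, never sent
            }
            if (_tomb && e->timestamp <= _tomb->timestamp) {
                continue;     // shadowed by a delete: not sent
            }
            return e;         // live data
        }
        return std::nullopt;
    }
};
```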
Cc @glommer fyi |
@glommer, @raphaelsc Let's take this into account for off-strategy compaction |
The compacting reader is available as of 342c967. |
Thanks botond for the awesome reader ;-)
|
@asias note that we don't need https://github.com/scylladb/scylla/blob/dee0b6834738778921fc582b5a90484a214ef56f/sstables/compaction.cc#L76 like we discussed before. Tomek pointed out that since streaming and repair read all sstables containing relevant data for the range being read, we can purge all tombstones, so you can just pass |
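A sketch of the distinction being drawn here; the predicate name and signature are hypothetical, not the actual compaction.cc interface. Regular compaction may purge an expired tombstone only if no sstable outside the compaction could still hold data the tombstone shadows, whereas a streaming or repair reader merges every sstable holding relevant data for the range, so it can always purge:

```cpp
#include <cstdint>
#include <functional>

// Decides whether an expired tombstone with the given write timestamp may
// be dropped instead of being emitted.
using can_purge_fn = std::function<bool(int64_t tombstone_timestamp)>;

// Regular compaction: data the tombstone shadows may live in sstables that
// are not part of this compaction, so only purge tombstones strictly older
// than everything outside it.
can_purge_fn compaction_can_purge(int64_t min_timestamp_outside_compaction) {
    return [=](int64_t ts) { return ts < min_timestamp_outside_compaction; };
}

// Streaming/repair: the reader already merges all sstables (and memtables)
// containing data for the streamed range, so nothing relevant is left out
// and every expired tombstone is safe to purge.
can_purge_fn streaming_can_purge() {
    return [](int64_t) { return true; };
}
```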
Good to know. |
I would rather defer it until we GA repair-based node ops. |
@asias I believe that if we apply the compacting reader to streamed data before sending, we should make sure to send expired tombstones if they are the latest data version on the sending replica. If we don't, we will find ourselves in a situation where repair is unable to restore data consistency on the repaired nodes. |
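For concreteness, a sketch of the rule proposed here; the names are hypothetical, and the comments below dispute whether this behavior is desirable at all. The idea is to forward an expired tombstone only when it is still the newest version of the cell on the sending replica:

```cpp
#include <cstdint>
#include <optional>

// Proposed rule, as I read it: an expired tombstone is still worth sending
// if the replica has no newer write for that cell, i.e. the tombstone is
// the latest data version; otherwise it can be dropped as usual.
bool send_expired_tombstone(int64_t tombstone_timestamp,
                            std::optional<int64_t> newest_live_timestamp) {
    return !newest_live_timestamp ||
           tombstone_timestamp > *newest_live_timestamp;
}
```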
On Wed, Jun 23, 2021 at 8:30 PM Vladislav Zolotarov wrote:

> @asias I believe that if we apply the compacting reader to streamed data before sending, we should make sure to send expired tombstones if they are the latest data version on the sending replica.

This behavior would only work if the expired tombstones weren't compacted away, no? Please correct me if I am mistaken, but this would only work if the expired tombstones are still around by luck.

> If we don't, we will find ourselves in a situation where repair is unable to restore data consistency on the repaired nodes.
|
We need to repair before the tombstone expires and is compacted away anyway. If the tombstone is expired but still present, it is better to be safe and send it to peers even though it is expired. |
@denesb can you please clarify whether, without the compacting reader, the flat mutation reader for streaming currently returns the latest tombstone even if it is older than gc_grace_period? And what happens if we plug the compacting reader into the streaming reader? |
Currently the streaming reader returns all data: live, shadowed, expired or not. If we plug in the compacting reader, it will only return live data and non-expired tombstones. I strongly disagree with returning expired tombstones. It basically comes down to relying on luck to maintain correctness. |
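The expiry test implied above, spelled out as a sketch (parameter names are illustrative, not Scylla's API):

```cpp
#include <cstdint>

// A tombstone becomes eligible for purging once gc_grace has elapsed since
// the deletion; "non-expired tombstones" are exactly those failing this test.
bool tombstone_expired(int64_t deletion_time_seconds,
                       int64_t gc_grace_seconds,
                       int64_t now_seconds) {
    return deletion_time_seconds + gc_grace_seconds < now_seconds;
}
```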
@denesb Nobody is relying on it. But it's stupid not to use the tombstone if you still have it. It's a matter of CONSISTENCY and not CORRECTNESS as we discussed earlier. |
On Wed, Jul 07, 2021 at 04:35:50AM -0700, Asias He wrote:

> > Without repair, cleanup may be slow, but it is guaranteed to happen. With repair as it is today, it does not. How do you know the problems we see today are not caused by the current repair? Pointing out a bug in the system is not exaggeration even if you think it is hard to trigger. But here we do not even have any evidence for that. For all we know, it is triggered all of the time and causes problems that we blame on something else (slow compaction).
>
> With repair, cleanup is guaranteed to happen too.
>
> > Sorry. I do not see how:
> > 1. A, B, C all have tombstones.
> > 2. A gced it.
> > 3. Repair restored it to A.
> > 4. B, C gced it.
> > 5. Repair restored it to B and C.
> > 6. Goto 1.
>
> cleanup == compaction on a node to remove the expired tombstone. It is possible that what you describe can happen. It is also possible that when you run repair, every node has done the cleanup already. Eventually, the tombstone can go away. The hint after repair can solve this problem. The hint is effectively saying you are safe to drop the expired tombstone.
>
> > "Can go away" is not the same as "with repair, cleanup is guaranteed to happen too" like you said above; "can" != "guaranteed" and is not enough. Hence the current behaviour _is_ broken.
>
> We are referring to different things by "cleanup". Like I just mentioned above, by "cleanup" I meant compaction on a node to remove the expired tombstone.
>
> > It does not accomplish much if it gets replicated back again.
>
> It is a price to pay since we do not have protection with gc_grace_period currently. The gc_grace_period mechanism is broken. Without syncing tombstones, it is even more broken.

It is "broken" by design, and it is documented how it should be used properly. Users who follow the documentation should not pay the price.

--
Gleb.
|
So now you think we should drop the expired tombstone with the current code even without the gc_grace_period protection? |
On Wed, Jul 07, 2021 at 04:43:22AM -0700, Asias He wrote:
> So now you think we should drop the expired tombstone with the current code even without the gc_grace_period protection?
If we are not going to work on a protection (which I think we should, since it looks like Cassandra has it with incremental repair), we should add a flag to nodetool to enable transferring them conditionally. A user usually knows that they missed a repair, I hope.

--
Gleb.
|
@tgrabiec @gleb-cloudius The "performance penalty" is not going to be measurable, because the frequency of the "extra work" will equal the frequency of compactions that evict tombstones in the scenario @tgrabiec described earlier, and the time between two consecutive compactions like these is an "eternity" compared to the frequency of requests. If that's the only argument against the fix, then I believe this argument can be safely ignored. One way or another, I think it makes no sense to argue about read-repair on a GH issue about "regular" repair, so I created a separate issue for that (#8970). I have nothing to add to what @asias wrote here: #3561 (comment). The bottom line: the issue is obvious, both for read-repair and for regular repair. |
@bhalevy - Per the discussion today, should we revisit this? At least raise the priority so it won't be lost in 5.x? |
Yes, I think that P2 is appropriate. |
Sure. |
PR is here: #14756 |
Not a regression, not backporting. |
Currently streaming combines sstable and memtable readers but doesn't compact. That may result in lots of unnecessary data (deleted or expired) being sent.
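To make the before/after concrete, here is a toy model of the pipeline; the names are hypothetical and flat timestamped entries stand in for real mutation fragments. Today's behavior is the plain merge; the proposed fix compacts the merged stream before it is sent:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

struct entry {
    int64_t timestamp;     // write timestamp
    bool is_tombstone;
    int64_t deletion_time; // wall-clock seconds; meaningful for tombstones
    bool operator<(const entry& o) const { return timestamp < o.timestamp; }
};

// What streaming does today: merge memtable and sstable contents (both
// assumed sorted by timestamp) and send everything, including deleted and
// expired entries.
std::vector<entry> merged_stream(const std::vector<entry>& memtable,
                                 const std::vector<entry>& sstable) {
    std::vector<entry> out;
    std::merge(memtable.begin(), memtable.end(),
               sstable.begin(), sstable.end(), std::back_inserter(out));
    return out;
}

// The proposed fix: compact the merged stream before sending. Scanning
// newest-to-oldest makes shadowing easy to track; expired tombstones still
// shadow older cells but are not emitted themselves.
std::vector<entry> compacted_stream(const std::vector<entry>& merged,
                                    int64_t now, int64_t gc_grace_seconds) {
    std::vector<entry> out;
    int64_t tombstone_ts = INT64_MIN; // newest tombstone seen so far
    for (auto it = merged.rbegin(); it != merged.rend(); ++it) {
        if (it->is_tombstone) {
            tombstone_ts = std::max(tombstone_ts, it->timestamp);
            if (it->deletion_time + gc_grace_seconds >= now) {
                out.push_back(*it); // within gc_grace: send it
            }
        } else if (it->timestamp > tombstone_ts) {
            out.push_back(*it);     // live and not shadowed: send it
        }
    }
    std::reverse(out.begin(), out.end()); // restore oldest-first order
    return out;
}
```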