
Streaming should compact data before sending #3561

Closed
tgrabiec opened this issue Jun 29, 2018 · 92 comments · Fixed by #14756
Labels: area/streaming, P2 High Priority, symptom/performance (Issues causing performance problems)

@tgrabiec
Contributor

Currently streaming combines the sstable and memtable readers, but doesn't compact the combined stream. That may result in a lot of unnecessary data (deleted, expired) being sent.
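A minimal sketch of what "compact before sending" would mean, using simplified stand-in types rather than Scylla's actual reader API: the merged sstable + memtable stream can contain shadowed and purgeable entries that compaction would drop, so streaming them is wasted work.

```cpp
// Simplified model, not the real reader API: keep only the newest version of
// each key from the merged stream and drop tombstones that are already
// purgeable, so they never reach the wire.
#include <cstdint>
#include <map>
#include <vector>

struct entry {
    int64_t key;
    int64_t write_ts;          // write timestamp
    bool tombstone;            // deletion marker?
    int64_t deletion_time = 0; // seconds; only meaningful for tombstones
};

std::vector<entry> compact_for_streaming(const std::vector<entry>& merged,
                                         int64_t now_s, int64_t gc_grace_s) {
    // Older versions of a key are shadowed by the newest one and need not be sent.
    std::map<int64_t, entry> newest;
    for (const auto& e : merged) {
        auto it = newest.find(e.key);
        if (it == newest.end() || e.write_ts > it->second.write_ts) {
            newest[e.key] = e;
        }
    }
    std::vector<entry> out;
    for (const auto& kv : newest) {
        const entry& e = kv.second;
        // A tombstone older than gc_grace is purgeable; shipping it is pure overhead.
        if (e.tombstone && e.deletion_time + gc_grace_s < now_s) {
            continue;
        }
        out.push_back(e);
    }
    return out;
}
```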

@tgrabiec tgrabiec added the symptom/performance Issues causing performance problems label Jun 29, 2018
@tzach tzach added this to the 2.x milestone Jul 8, 2018
@slivne slivne added the Eng-3 label Oct 27, 2019
@asias
Contributor

asias commented Mar 3, 2020

The compaction will reduce the amount of data we send, which is good. On the other hand, it will trigger more work during streaming and might have a higher impact on the CQL workload.

@asias
Contributor

asias commented Mar 3, 2020

Repair has the same problem. To solve this, we can introduce a compacting reader which compacts the data before it emits the results. @denesb

@asias
Contributor

asias commented Mar 3, 2020

@slivne @bhalevy @glommer I talked with @denesb in the past; such a reader is doable and not that complicated, but he needs someone to prioritize it.

@bhalevy
Member

bhalevy commented Mar 5, 2020

Cc @glommer fyi

@bhalevy
Member

bhalevy commented Mar 5, 2020

@glommer, @raphaelsc Let's take this into account for off-strategy compaction

@denesb
Contributor

denesb commented Mar 24, 2020

The compacting reader is available as of 342c967.

@asias
Contributor

asias commented Apr 1, 2020 via email

@denesb
Contributor

denesb commented Apr 1, 2020

@asias note that we don't need https://github.com/scylladb/scylla/blob/dee0b6834738778921fc582b5a90484a214ef56f/sstables/compaction.cc#L76 as we discussed before. Tomek pointed out that since streaming and repair read all sstables containing relevant data for the range we read, we can purge all tombstones, so you can just pass `[] (const dht::decorated_key&) { return api::max_timestamp; }` as `get_max_purgeable`.
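A rough, self-contained model of that suggestion (the types and names below are stand-ins for dht::decorated_key, api::max_timestamp and the compaction get_max_purgeable hook, not the actual interfaces): a tombstone is dropped only if it has expired past gc_grace and its timestamp is below what get_max_purgeable returns for the key, so a callback that always returns the maximum timestamp removes the second restriction.

```cpp
// Illustrative stand-ins only; not the real sstables::compaction interface.
#include <cstdint>
#include <functional>
#include <limits>

struct decorated_key { int64_t token; }; // stand-in for dht::decorated_key

using max_purgeable_fn = std::function<int64_t(const decorated_key&)>;

bool can_purge_tombstone(int64_t tombstone_ts, int64_t deletion_time_s,
                         int64_t now_s, int64_t gc_grace_s,
                         const decorated_key& dk,
                         const max_purgeable_fn& get_max_purgeable) {
    const bool expired = deletion_time_s + gc_grace_s < now_s;
    // Only purge if no sstable outside the set being read could still hold
    // data shadowed by this tombstone -- which is what get_max_purgeable encodes.
    return expired && tombstone_ts < get_max_purgeable(dk);
}

// Streaming and repair read all relevant sstables, so everything is safe to
// purge -- the equivalent of passing api::max_timestamp:
const max_purgeable_fn purge_everything = [](const decorated_key&) {
    return std::numeric_limits<int64_t>::max();
};
```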

@asias
Contributor

asias commented Apr 1, 2020

> @asias note that we don't need https://github.com/scylladb/scylla/blob/dee0b6834738778921fc582b5a90484a214ef56f/sstables/compaction.cc#L76 as we discussed before. Tomek pointed out that since streaming and repair read all sstables containing relevant data for the range we read, we can purge all tombstones, so you can just pass `[] (const dht::decorated_key&) { return api::max_timestamp; }` as `get_max_purgeable`.

Good to know.

@slivne slivne modified the milestones: 4.x, 4.3 May 25, 2020
@slivne
Contributor

slivne commented Nov 4, 2020

@bhalevy / @asias let's add this to the queue and implement it now that we have a compacting reader.

@asias
Contributor

asias commented Nov 5, 2020

I would rather defer it until we GA repair-based node ops.

@slivne slivne modified the milestones: 4.3, 4.4 Nov 26, 2020
@slivne slivne modified the milestones: 4.4, 4.5 Mar 29, 2021
@vladzcloudius
Contributor

@asias I believe that if we apply a compacting reader to streamed data before sending, we should make sure to still send expired tombstones when they are the latest data version on the sending replica.

If we don't do that, we will find ourselves in a situation where repair is unable to restore data consistency on the repaired nodes.

@raphaelsc
Member

raphaelsc commented Jun 23, 2021 via email

@asias
Contributor

asias commented Jun 23, 2021

We need to repair before the tombstone expires and is compacted away anyway. If the tombstone is expired but still present, it is better to send it to peers, even though it is expired, to be safe.

@asias
Contributor

asias commented Jun 24, 2021

@denesb can you please clarify whether the flat mutation reader for streaming currently returns the latest tombstone if it is older than gc_grace_period, without the compacting reader? And what happens if we plug the compacting reader into the streaming reader?

@denesb
Contributor

denesb commented Jun 24, 2021

Currently the streaming reader will return all data: live, shadowed, expired or not. If we plug in the compacting reader, it will only return live data and non-expired tombstones. I strongly disagree about returning expired tombstones. It basically comes down to relying on luck to maintain correctness.
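As a compact illustration of the two behaviours just described (the enum and function names are invented for the example, not Scylla code):

```cpp
enum class fragment_kind { live_row, shadowed_row, live_tombstone, expired_tombstone };

// Today's streaming reader: everything goes on the wire.
constexpr bool plain_reader_emits(fragment_kind) { return true; }

// With the compacting reader plugged in: only live data and tombstones
// still within gc_grace are emitted.
constexpr bool compacting_reader_emits(fragment_kind k) {
    return k == fragment_kind::live_row || k == fragment_kind::live_tombstone;
}
```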

@vladzcloudius
Contributor

vladzcloudius commented Jun 24, 2021

> Currently the streaming reader will return all data: live, shadowed, expired or not. If we plug in the compacting reader, it will only return live data and non-expired tombstones. I strongly disagree about returning expired tombstones. It basically comes down to relying on luck to maintain correctness.

@denesb Nobody is relying on it. But it's stupid not to use the tombstone if you still have it.
No need to make things worse than they already are.

It's a matter of CONSISTENCY and not CORRECTNESS, as we discussed earlier.

@gleb-cloudius
Contributor

gleb-cloudius commented Jul 7, 2021 via email

@asias
Contributor

asias commented Jul 7, 2021

On Wed, Jul 07, 2021 at 04:35:50AM -0700, Asias He wrote:

> On Wed, Jul 07, 2021 at 01:47:10AM -0700, Asias He wrote:
> On Wed, Jul 07, 2021 at 01:24:51AM -0700, Asias He wrote:
> Without repair, cleanup may be slow, but it is guaranteed to happen. With repair as it is today, it does not. How do you know the problems we see today are not caused by the current repair? Pointing out a bug in the system is not exaggeration even if you think it is hard to trigger. But here we do not even have any evidence for that. For all we know it is triggered all of the time and causes us problems that we blame on something else (slow compaction).

With repair, cleanup is guaranteed to happen too.

> Sorry. I do not see how:
> 1. A, B, C all have tombstones.
> 2. A gced it.
> 3. Repair restored it to A.
> 4. B, C gced it.
> 5. Repair restored it to B and C.
> 6. goto 1

cleanup == compaction on a node to remove the expired tombstone

It is possible what you describe can happen. It is also possible that when you run repair, every node has done the cleanup already. Eventually, the tombstone can go away. The hint after repair can solve this problem. The hint is effectively saying you are safe to drop the expired tombstone.

> "Can go away" is not the same as "with repair, cleanup is guaranteed to happen too" like you said above; "can" != "guaranteed" and is not enough. Hence the current behaviour is broken.

We are referring to different things by "cleanup". As I just mentioned above, by "cleanup" I meant compaction on a node to remove the expired tombstone.

> It does not accomplish much if it gets replicated back again.

It is a price to pay since we do not have the gc_grace_period protection currently. The gc_grace_period mechanism is broken. Without syncing tombstones, it is even more broken.

> It is "broken" by design and it is documented how it should be used properly. Users who follow the documentation should not pay the price.

So now you think we should drop the expired tombstone with the current code, even without the gc_grace_period protection?

@gleb-cloudius
Contributor

gleb-cloudius commented Jul 7, 2021 via email

@vladzcloudius
Contributor

vladzcloudius commented Jul 8, 2021

> You're talking about performance when we're in a corrupted state. I'm talking about performance when we're not in a corrupted state. The fix would affect it too.
>
> This is precisely the point of disagreement: we should not pessimize a well-maintained system in favor of making a non-maintained one behave marginally better.

@tgrabiec @gleb-cloudius The "performance penalty" is not going to be measurable, because the frequency of the "extra work" is going to be equal to the frequency of compactions that evict tombstones in the scenario described by @tgrabiec earlier, and the time between two consecutive such compactions is an "eternity" compared to the frequency of requests. If that's the only argument against the fix, then I believe it can be safely ignored.

One way or another, I think it makes no sense to argue about read-repair on a GH issue related to "regular" repair, so I created a separate issue for that (#8970).

I have nothing to add to what @asias wrote here: #3561 (comment).

The bottom line: the issue is obvious (both for read-repair and for regular repair).
That is my view on the current state of things.
I'm open to any feasible solution to the issues at hand.

@mykaul
Contributor

mykaul commented Jul 18, 2023

@bhalevy - Per the discussion today, should we revisit this? At least give it a higher priority so it won't be lost in 5.x?

@bhalevy bhalevy added the P2 High Priority label Jul 18, 2023
@bhalevy
Member

bhalevy commented Jul 18, 2023

Yes, I think that P2 is appropriate.
@denesb can you pitch in if @raphaelsc is overwhelmed with other issues?

@mykaul mykaul removed the Eng-3 label Jul 18, 2023
@denesb
Contributor

denesb commented Jul 19, 2023

> @denesb can you pitch in if @raphaelsc is overwhelmed with other issues?

Sure.

@denesb
Contributor

denesb commented Jul 19, 2023

PR is here: #14756

@avikivity
Member

Not a regression, not backporting.
