Streaming should compact data before sending #3561
Comments
Compaction will reduce the amount of data we send, which is good. On the other hand, it will trigger more work during streaming and might have a higher impact on the CQL workload. |
Repair has the same problem. To solve this, we can introduce a compacting reader which compacts data before it emits the results. @denesb |
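To make the idea concrete, here is a minimal, self-contained sketch of such a reader. All type and member names here are invented for illustration; Scylla's actual reader interface is different. The wrapper drops exactly what compaction would drop: cells shadowed by a newer tombstone, and tombstones past their gc_grace period.

```cpp
#include <cstdint>
#include <optional>

// One element of a (much simplified) partition stream: either a live cell
// or a tombstone. Real readers emit mutation fragments; this is just a model.
struct entry {
    int64_t timestamp;         // write timestamp; a tombstone shadows
                               // entries written at or before its timestamp
    bool is_tombstone = false;
    int64_t deletion_time = 0; // wall-clock seconds; used for gc_grace expiry
};

class reader {
public:
    virtual std::optional<entry> next() = 0; // nullopt signals end of stream
    virtual ~reader() = default;
};

// Emits only what would survive compaction at time `now`: live cells not
// shadowed by a tombstone, and tombstones still within gc_grace.
class compacting_reader : public reader {
    reader& _source;
    int64_t _now;               // wall-clock seconds
    int64_t _gc_grace_seconds;
    std::optional<entry> _tomb; // most recent tombstone seen in the stream
public:
    compacting_reader(reader& source, int64_t now, int64_t gc_grace_seconds)
        : _source(source), _now(now), _gc_grace_seconds(gc_grace_seconds) {}

    std::optional<entry> next() override {
        while (auto e = _source.next()) {
            if (e->is_tombstone) {
                _tomb = e; // remember it: it still shadows older cells
                if (e->deletion_time + _gc_grace_seconds >= _now) {
                    return e; // within gc_grace: must still be sent
                }
                continue;     // expired: compacted away, never sent
            }
            if (_tomb && e->timestamp <= _tomb->timestamp) {
                continue;     // shadowed by a delete: not sent
            }
            return e;         // live data
        }
        return std::nullopt;
    }
};
```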
Cc @glommer fyi |
@glommer, @raphaelsc Let's take this into account for off-strategy compaction |
The compacting reader is available as of 342c967. |
Thanks botond for the awesome reader ;-)
|
@asias note that we don't need https://github.com/scylladb/scylla/blob/dee0b6834738778921fc582b5a90484a214ef56f/sstables/compaction.cc#L76 like we discussed before. Tomek pointed out that since streaming and repair read all sstables containing relevant data for the range being read, we can purge all tombstones, so you can just pass |
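A sketch of the distinction being drawn here; the predicate name and signature are hypothetical, not the actual compaction.cc interface. Regular compaction may purge an expired tombstone only if no sstable outside the compaction could still hold data the tombstone shadows, whereas a streaming or repair reader merges every sstable holding relevant data for the range, so it can always purge:

```cpp
#include <cstdint>
#include <functional>

// Decides whether an expired tombstone with the given write timestamp may
// be dropped instead of being emitted.
using can_purge_fn = std::function<bool(int64_t tombstone_timestamp)>;

// Regular compaction: data the tombstone shadows may live in sstables that
// are not part of this compaction, so only purge tombstones strictly older
// than everything outside it.
can_purge_fn compaction_can_purge(int64_t min_timestamp_outside_compaction) {
    return [=](int64_t ts) { return ts < min_timestamp_outside_compaction; };
}

// Streaming/repair: the reader already merges all sstables (and memtables)
// containing data for the streamed range, so nothing relevant is left out
// and every expired tombstone is safe to purge.
can_purge_fn streaming_can_purge() {
    return [](int64_t) { return true; };
}
```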
Good to know. |
I would rather defer it until we GA repair-based node ops. |
@asias I believe that if we apply the compacting reader to streamed data before sending, we should make sure to send expired tombstones if they are the latest data version on the sending replica. If we don't, we will find ourselves in a situation where repair is unable to restore data consistency on the repaired nodes. |
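For concreteness, a sketch of the rule proposed here; the names are hypothetical, and the comments below dispute whether this behavior is desirable at all. The idea is to forward an expired tombstone only when it is still the newest version of the cell on the sending replica:

```cpp
#include <cstdint>
#include <optional>

// Proposed rule, as I read it: an expired tombstone is still worth sending
// if the replica has no newer write for that cell, i.e. the tombstone is
// the latest data version; otherwise it can be dropped as usual.
bool send_expired_tombstone(int64_t tombstone_timestamp,
                            std::optional<int64_t> newest_live_timestamp) {
    return !newest_live_timestamp ||
           tombstone_timestamp > *newest_live_timestamp;
}
```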
On Wed, Jun 23, 2021 at 8:30 PM Vladislav Zolotarov wrote:

> @asias I believe that if we apply the compacting reader to streamed data before sending, we should make sure to send expired tombstones if they are the latest data version on the sending replica.

This behavior would only work if the expired tombstones weren't compacted away, no? Please correct me if I am mistaken, but this would only work if the expired tombstones are still around by luck.

> If we don't, we will find ourselves in a situation where repair is unable to restore data consistency on the repaired nodes.
|
We need to repair before the tombstone expires and is compacted away anyway. If the tombstone is expired but still present, it is better to be safe and send it to peers even though it is expired. |
@denesb can you please clarify whether, without the compacting reader, the flat mutation reader for streaming currently returns the latest tombstone even if it is older than gc_grace_period? And what happens if we plug the compacting reader into the streaming reader? |
Currently the streaming reader returns all data: live, shadowed, expired or not. If we plug in the compacting reader, it will only return live data and non-expired tombstones. I strongly disagree with returning expired tombstones. It basically comes down to relying on luck to maintain correctness. |
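The expiry test implied above, spelled out as a sketch (parameter names are illustrative, not Scylla's API):

```cpp
#include <cstdint>

// A tombstone becomes eligible for purging once gc_grace has elapsed since
// the deletion; "non-expired tombstones" are exactly those failing this test.
bool tombstone_expired(int64_t deletion_time_seconds,
                       int64_t gc_grace_seconds,
                       int64_t now_seconds) {
    return deletion_time_seconds + gc_grace_seconds < now_seconds;
}
```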
@denesb Nobody is relying on it. But it's stupid not to use the tombstone if you still have it. It's a matter of CONSISTENCY and not CORRECTNESS as we discussed earlier. |
On Wed, Jul 07, 2021 at 04:35:50AM -0700, Asias He wrote:

> > Without repair, cleanup may be slow, but it is guaranteed to happen. With repair as it is today, it does not. How do you know the problems we see today are not caused by the current repair? Pointing out a bug in the system is not exaggeration even if you think it is hard to trigger. But here we do not even have any evidence for that. For all we know, it is triggered all of the time and causes problems that we blame on something else (slow compaction).
>
> With repair, cleanup is guaranteed to happen too.
>
> > Sorry. I do not see how:
> > 1. A, B, C all have tombstones.
> > 2. A gced it.
> > 3. Repair restored it to A.
> > 4. B, C gced it.
> > 5. Repair restored it to B and C.
> > 6. Goto 1.
>
> cleanup == compaction on a node to remove the expired tombstone. It is possible that what you describe can happen. It is also possible that when you run repair, every node has done the cleanup already. Eventually, the tombstone can go away. The hint after repair can solve this problem. The hint is effectively saying you are safe to drop the expired tombstone.
>
> > "Can go away" is not the same as "with repair, cleanup is guaranteed to happen too" like you said above; "can" != "guaranteed" and is not enough. Hence the current behaviour _is_ broken.
>
> We are referring to different things by "cleanup". Like I just mentioned above, by "cleanup" I meant compaction on a node to remove the expired tombstone.
>
> > It does not accomplish much if it gets replicated back again.
>
> It is a price to pay since we do not have protection with gc_grace_period currently. The gc_grace_period mechanism is broken. Without syncing tombstones, it is even more broken.

It is "broken" by design, and it is documented how it should be used properly. Users who follow the documentation should not pay the price.

--
Gleb.
|
So now you think we should drop the expired tombstone with the current code even without the gc_grace_period protection? |
On Wed, Jul 07, 2021 at 04:43:22AM -0700, Asias He wrote:
> So now you think we should drop the expired tombstone with the current code even without the gc_grace_period protection?
If we are not going to work on a protection (which I think we should, since it looks like Cassandra has it with incremental repair), we should add a flag to nodetool to enable transferring them conditionally. A user usually knows that they missed a repair, I hope.

--
Gleb.
|
@tgrabiec @gleb-cloudius The "performance penalty" is not going to be measurable, because the frequency of the "extra work" will equal the frequency of compactions that evict tombstones in the scenario @tgrabiec described earlier, and the time between two consecutive compactions like these is an "eternity" compared to the frequency of requests. If that's the only argument against the fix, then I believe this argument can be safely ignored. One way or another, I think it makes no sense to argue about read-repair on a GH issue about "regular" repair, so I created a separate issue for that (#8970). I have nothing to add to what @asias wrote here: #3561 (comment). The bottom line: the issue is obvious, both for read-repair and for regular repair. |
@bhalevy - Per the discussion today, should we revisit this? At least raise the priority so it won't be lost in 5.x? |
Yes, I think that P2 is appropriate. |
Sure. |
PR is here: #14756 |
Not a regression, not backporting. |
Currently streaming combines sstable and memtable readers but doesn't compact. That may result in lots of unnecessary data (deleted or expired) being sent.
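To make the before/after concrete, here is a toy model of the pipeline; the names are hypothetical and flat timestamped entries stand in for real mutation fragments. Today's behavior is the plain merge; the proposed fix compacts the merged stream before it is sent:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

struct entry {
    int64_t timestamp;     // write timestamp
    bool is_tombstone;
    int64_t deletion_time; // wall-clock seconds; meaningful for tombstones
    bool operator<(const entry& o) const { return timestamp < o.timestamp; }
};

// What streaming does today: merge memtable and sstable contents (both
// assumed sorted by timestamp) and send everything, including deleted and
// expired entries.
std::vector<entry> merged_stream(const std::vector<entry>& memtable,
                                 const std::vector<entry>& sstable) {
    std::vector<entry> out;
    std::merge(memtable.begin(), memtable.end(),
               sstable.begin(), sstable.end(), std::back_inserter(out));
    return out;
}

// The proposed fix: compact the merged stream before sending. Scanning
// newest-to-oldest makes shadowing easy to track; expired tombstones still
// shadow older cells but are not emitted themselves.
std::vector<entry> compacted_stream(const std::vector<entry>& merged,
                                    int64_t now, int64_t gc_grace_seconds) {
    std::vector<entry> out;
    int64_t tombstone_ts = INT64_MIN; // newest tombstone seen so far
    for (auto it = merged.rbegin(); it != merged.rend(); ++it) {
        if (it->is_tombstone) {
            tombstone_ts = std::max(tombstone_ts, it->timestamp);
            if (it->deletion_time + gc_grace_seconds >= now) {
                out.push_back(*it); // within gc_grace: send it
            }
        } else if (it->timestamp > tombstone_ts) {
            out.push_back(*it);     // live and not shadowed: send it
        }
    }
    std::reverse(out.begin(), out.end()); // restore oldest-first order
    return out;
}
```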