QUORUM reads can start randomly returning resurrected/current data if the data was deleted and a full repair failed to run within the gc_grace_seconds time frame
#8970
Open
vladzcloudius opened this issue Jul 2, 2021 · 3 comments
@vladzcloudius it's not really random data, is it? Do you mean the reads can return resurrected data inconsistently/randomly?
@bhalevy yes, that's what I meant. I'll update the description to make it clearer.
Installation details
HEAD: 373fa3f
Cluster size: 3
Description
If the data was deleted with CL=QUORUM, the tombstone did not reach one of the replicas, and the table was not repaired within gc_grace_seconds, then subsequent CL=QUORUM reads will randomly return either the old data or no data, depending on which replicas are chosen.
When this happens, a read-repair is executed whenever the replica without the tombstone is involved, but it is unable to restore the replicas' consistency.
As a result, read-repair keeps sending data, burning CPU and harming latency.
How to reproduce on a local machine
Create a ccm cluster of 3 nodes and follow the scenario from the description via cqlsh (…). Then read the data - the old data is returned (…). Read the data again - no data this time (…). A sketch of this flow is given below.
How to see that read-repair is running and not succeeding
Increase the replication factor of system_traces to 3 (…) and inspect the traces of the reads; this step is included in the sketch below.
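The following is a minimal, illustrative sketch of such a reproduction, not the exact steps from this report: the keyspace/table ks.t, the cluster name, the Cassandra version passed to ccm, the gc_grace_seconds value and the sleep are all assumptions, and on a real cluster hinted handoff may deliver the missing tombstone unless it is disabled or its window has passed.

```bash
# Create and start a 3-node ccm cluster (version is illustrative;
# scylla-ccm has its own create flags).
ccm create res -v 4.1.3 -n 3 -s

# Illustrative schema with a short gc_grace_seconds so the tombstone
# expires quickly; ccm nodes listen on 127.0.0.1/2/3 by default.
cqlsh 127.0.0.1 -e "
  CREATE KEYSPACE ks WITH replication =
    {'class': 'SimpleStrategy', 'replication_factor': 3};
  CREATE TABLE ks.t (pk int PRIMARY KEY, v int) WITH gc_grace_seconds = 60;"
cqlsh 127.0.0.1 -e "CONSISTENCY QUORUM; INSERT INTO ks.t (pk, v) VALUES (1, 1);"

# Make one replica miss the tombstone: stop it, delete at QUORUM, restart it.
ccm node3 stop
cqlsh 127.0.0.1 -e "CONSISTENCY QUORUM; DELETE FROM ks.t WHERE pk = 1;"
ccm node3 start

# Let gc_grace_seconds pass without a repair, then read at QUORUM a few
# times: the result flips between the old row and no row depending on
# which two replicas serve the read.
sleep 90
cqlsh 127.0.0.1 -e "CONSISTENCY QUORUM; SELECT * FROM ks.t WHERE pk = 1;"

# To watch the (futile) read-repairs, raise the RF of system_traces to 3
# and run the read with tracing enabled.
cqlsh 127.0.0.1 -e "ALTER KEYSPACE system_traces WITH replication =
  {'class': 'SimpleStrategy', 'replication_factor': 3};"
cqlsh 127.0.0.1 -e "CONSISTENCY QUORUM; TRACING ON; SELECT * FROM ks.t WHERE pk = 1;"
```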
What's the root-cause?
The root cause is that our combined reader does not return a tombstone if it is the latest version of the data but has already expired; instead it returns no data (NULL).
As a result, the reconciliation logic concludes that the "old" data is the newest version and sends a mutation to the replica holding the tombstone. The mutation is applied, but since the tombstone still carries the later timestamp, the reader returns NULL data again the next time, and round we go.
For the same reason, the reconciliation logic returns the old data to the user (as the CQL SELECT result) every time the replica without the tombstone is involved, and NULL data when two replicas with the tombstone are read.
This state persists as long as the expired tombstones are not compacted away or the replicas' consistency is restored somehow (see below).
Unfortunately, the expired tombstones will not be compacted away until they are compacted together with the data they shadow, and there is no guarantee when that happens unless a major compaction is issued.
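To see the divergent per-replica state behind this with the illustrative ks.t table from the sketch above, one rough heuristic is to ask each node individually with CONSISTENCY ONE (not guaranteed to be served by the local replica, but usually good enough on a small ccm cluster):

```bash
# The replica that missed the tombstone still returns the old row (with
# its original write timestamp); the replicas holding the expired
# tombstone return nothing, because the combined reader drops the
# expired tombstone instead of reporting it.
for ip in 127.0.0.1 127.0.0.2 127.0.0.3; do
  echo "== $ip =="
  cqlsh "$ip" -e "CONSISTENCY ONE; SELECT pk, v, writetime(v) FROM ks.t WHERE pk = 1;"
done
```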
How to work around it?
Option 1
Run a full repair - luckily, our regular repair does not suffer from the same issue as read-repair does today.
However, if #3561 is implemented as proposed in its opening message, a repair would no longer be able to get the system out of the state described above.
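For example, on the illustrative ccm cluster above (a sketch; on Cassandra you may need the --full flag to force a non-incremental repair):

```bash
# Fully repair the affected keyspace (ks is the illustrative keyspace
# from the sketch above) on every node.
for n in node1 node2 node3; do
  ccm "$n" nodetool repair ks
done
```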
Option 2
Run a major compaction on all the relevant replica nodes.
This will cause the tombstones to be evicted and will guarantee data resurrection, but at least your reads will become consistent.
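For example, with the illustrative ks.t table and ccm cluster from above:

```bash
# Major-compact the table on every replica node so the expired tombstones
# finally meet the data they shadow and both are purged together.
for n in node1 node2 node3; do
  ccm "$n" nodetool compact ks t
done
```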
Option 3
Increase gc_grace_seconds so that the existing tombstones are no longer expired.
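In CQL terms, with the illustrative ks.t table, pick a value large enough that the existing tombstones' deletion time plus the new gc_grace_seconds still lies in the future:

```bash
# 864000 (10 days) is only an example value.
cqlsh 127.0.0.1 -e "ALTER TABLE ks.t WITH gc_grace_seconds = 864000;"
```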