QUORUM reads can start randomly returning resurrected/current data if the data was deleted and a full repair failed to run within the gc_grace_seconds time frame
#8970
Open
vladzcloudius opened this issue Jul 2, 2021 · 3 comments
@vladzcloudius it's not really random data, is it? Do you mean the reads can return resurrected data inconsistently/randomly?
@bhalevy yes, that's what I meant. I'll update the description to make it clearer.
Installation details
HEAD: 373fa3f
Cluster size: 3
Description
If the data was deleted with CL=QUORUM, the tombstone did not reach one of the replicas, and the table was not repaired within gc_grace_seconds, then subsequent CL=QUORUM reads will randomly return either the old data or no data, depending on which replicas are chosen.
When this happens, a read-repair is executed whenever the replica without the tombstone is involved, but it is unable to restore the replicas' consistency.
As a result, read-repair keeps sending data, burning CPU and harming latency.
How to reproduce on a local machine
Create a ccm cluster of 3 nodes and follow the scenario from the description via cqlsh (…). Then read the data - the old data is returned (…). Read the data again - no data this time (…). A sketch of this flow is given below.
How to see that read-repair is running and not succeeding
Increase the replication factor of system_traces to 3 (…) and inspect the traces of the reads; this step is included in the sketch below.
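The following is a minimal, illustrative sketch of such a reproduction, not the exact steps from this report: the keyspace/table ks.t, the cluster name, the Cassandra version passed to ccm, the gc_grace_seconds value and the sleep are all assumptions, and on a real cluster hinted handoff may deliver the missing tombstone unless it is disabled or its window has passed.

```bash
# Create and start a 3-node ccm cluster (version is illustrative;
# scylla-ccm has its own create flags).
ccm create res -v 4.1.3 -n 3 -s

# Illustrative schema with a short gc_grace_seconds so the tombstone
# expires quickly; ccm nodes listen on 127.0.0.1/2/3 by default.
cqlsh 127.0.0.1 -e "
  CREATE KEYSPACE ks WITH replication =
    {'class': 'SimpleStrategy', 'replication_factor': 3};
  CREATE TABLE ks.t (pk int PRIMARY KEY, v int) WITH gc_grace_seconds = 60;"
cqlsh 127.0.0.1 -e "CONSISTENCY QUORUM; INSERT INTO ks.t (pk, v) VALUES (1, 1);"

# Make one replica miss the tombstone: stop it, delete at QUORUM, restart it.
ccm node3 stop
cqlsh 127.0.0.1 -e "CONSISTENCY QUORUM; DELETE FROM ks.t WHERE pk = 1;"
ccm node3 start

# Let gc_grace_seconds pass without a repair, then read at QUORUM a few
# times: the result flips between the old row and no row depending on
# which two replicas serve the read.
sleep 90
cqlsh 127.0.0.1 -e "CONSISTENCY QUORUM; SELECT * FROM ks.t WHERE pk = 1;"

# To watch the (futile) read-repairs, raise the RF of system_traces to 3
# and run the read with tracing enabled.
cqlsh 127.0.0.1 -e "ALTER KEYSPACE system_traces WITH replication =
  {'class': 'SimpleStrategy', 'replication_factor': 3};"
cqlsh 127.0.0.1 -e "CONSISTENCY QUORUM; TRACING ON; SELECT * FROM ks.t WHERE pk = 1;"
```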
What's the root-cause?
The root cause is that our combined reader does not return a tombstone if it is the latest version of the data but has already expired; instead it returns no data (NULL).
As a result, the reconciliation logic concludes that the "old" data is the newest version and sends a mutation to the replica holding the tombstone. The mutation is applied, but since the tombstone still carries the later timestamp, the reader returns NULL data again the next time, and round we go.
For the same reason, the reconciliation logic returns the old data to the user (as the CQL SELECT result) every time the replica without the tombstone is involved, and NULL data when two replicas with the tombstone are read.
This state persists as long as the expired tombstones are not compacted away or the replicas' consistency is restored somehow (see below).
Unfortunately, the expired tombstones will not be compacted away until they are compacted together with the data they shadow, and there is no guarantee when that happens unless a major compaction is issued.
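To see the divergent per-replica state behind this with the illustrative ks.t table from the sketch above, one rough heuristic is to ask each node individually with CONSISTENCY ONE (not guaranteed to be served by the local replica, but usually good enough on a small ccm cluster):

```bash
# The replica that missed the tombstone still returns the old row (with
# its original write timestamp); the replicas holding the expired
# tombstone return nothing, because the combined reader drops the
# expired tombstone instead of reporting it.
for ip in 127.0.0.1 127.0.0.2 127.0.0.3; do
  echo "== $ip =="
  cqlsh "$ip" -e "CONSISTENCY ONE; SELECT pk, v, writetime(v) FROM ks.t WHERE pk = 1;"
done
```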
How to work around it?
Option 1
Run a full repair - luckily, our regular repair does not suffer from the same issue as read-repair does today.
However, if #3561 is implemented as proposed in its opening message, a repair would no longer be able to get the system out of the state described above.
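For example, on the illustrative ccm cluster above (a sketch; on Cassandra you may need the --full flag to force a non-incremental repair):

```bash
# Fully repair the affected keyspace (ks is the illustrative keyspace
# from the sketch above) on every node.
for n in node1 node2 node3; do
  ccm "$n" nodetool repair ks
done
```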
Option 2
Run a major compaction on all the relevant replica nodes.
This will cause the tombstones to be evicted and will guarantee data resurrection, but at least your reads will become consistent.
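For example, with the illustrative ks.t table and ccm cluster from above:

```bash
# Major-compact the table on every replica node so the expired tombstones
# finally meet the data they shadow and both are purged together.
for n in node1 node2 node3; do
  ccm "$n" nodetool compact ks t
done
```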
Option 3
Increase gc_grace_seconds so that the existing tombstones are no longer expired.
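In CQL terms, with the illustrative ks.t table, pick a value large enough that the existing tombstones' deletion time plus the new gc_grace_seconds still lies in the future:

```bash
# 864000 (10 days) is only an example value.
cqlsh 127.0.0.1 -e "ALTER TABLE ks.t WITH gc_grace_seconds = 864000;"
```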