Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QUORUM Reads can start randomly return resurrected/current data if the data was deleted and if a full repair failed to run in a gc_grace_second time frame #8970

Open
vladzcloudius opened this issue Jul 2, 2021 · 3 comments
Assignees
Milestone

Comments

@vladzcloudius
Copy link
Contributor

Installation details

HEAD: 373fa3f
Cluster size: 3

Description

If the data was deleted with a CL=QUORUM, the tombstone haven't reached one of the replicas and if it wasn't repaired during the gc_grace_seconds following reads with CL=QUORUM are going to randomly return either the old or a NULL data depending on which replicas are chosen.

When this happens read-repair will be executed when the replica without a tombstone is involved but will not be able to restore replicas' consistency.

As a result read-repair will keep on sending data, burning the CPU and harming the latency.

How to reproduce on a local machine

  1. Create a scylla ccm cluster of 3 nodes.
  2. Populate some data with cassandra-stress:
$ cassandra-stress write n=1 no-warmup -mode native cql3 -schema keyspace="ks2" "replication(factor=3)"
  1. See what's the key for the data you populated and make sure that it's on all 3 replicas:
cqlsh> CONSISTENCY ALL
Consistency level set to ALL
cqlsh> SELECT * FROM ks2.standard1 LIMIT 1;
 key                    | C0                                                                     | C1                                                                     | C2                                                                     | C3                                                                     | C4
------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------
 0x4c4e5032304e34334d30 | 0xfb073d21b291e3030d2be5d3217fe78e75f6dcd1fb8d690ed26677bc13b52cf5e0de | 0xfe069ee8cb3ed5589e7b15891b6fada8d719cb68257fd31b118db7a157bdd185e615 | 0xfc9b745e4beb31506821babde05d31e13e0c0f15e73807f028b9406960058c1450a9 | 0x1288207dc2f2527b2e49e7794a0b67b23a9dbe489913aa12b96519dd95abd0ec5d43 | 0xa09c829ae45d230939d8b003a86ed37e2624581181070f529891e2751dee8c5c4386
                                                               
(1 rows)
  1. Change a gc_grace_seconds to 5 seconds
cqlsh> ALTER table ks2.standard1 with gc_grace_seconds = 5;
  1. Stop node3.
  2. Set consistency to QUORUM and delete the key from (3).
cqlsh> CONSISTENCY QUORUM
Consistency level set to QUORUM
cqlsh> DELETE FROM ks2.standard1 where key=0x4c4e5032304e34334d30;
  1. Wait for 10 seconds
$ sleep 10
  1. Read the data - see how no data is returned
cqlsh> SELECT * FROM ks2.standard1 where key=0x4c4e5032304e34334d30;                                                                                                                                                                                
                                                                                                                                                                                                                                                               key | C0 | C1 | C2 | C3 | C4                                                                                                                                                                                                                                 
-----+----+----+----+----+----                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                              
(0 rows)          
  1. Start node3 and stop node2 (don't exit the cqlsh). Then read the data - old data is returned:
cqlsh> SELECT * FROM ks2.standard1 where key=0x4c4e5032304e34334d30;                                                 
                                                                                                                                                                                                                                                               key                    | C0                                                                     | C1                                                                     | C2                                                                     | C3       
                                                              | C4                                                                                                                                                                                            ------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+----------
--------------------------------------------------------------+------------------------------------------------------------------------                                                                                                                       
 0x4c4e5032304e34334d30 | 0xfb073d21b291e3030d2be5d3217fe78e75f6dcd1fb8d690ed26677bc13b52cf5e0de | 0xfe069ee8cb3ed5589e7b15891b6fada8d719cb68257fd31b118db7a157bdd185e615 | 0xfc9b745e4beb31506821babde05d31e13e0c0f15e73807f028b9406960058c1450a9 | 0x1288207
dc2f2527b2e49e7794a0b67b23a9dbe489913aa12b96519dd95abd0ec5d43 | 0xa09c829ae45d230939d8b003a86ed37e2624581181070f529891e2751dee8c5c4386                                                                                                                        
                                                               
(1 rows)
  1. Start node2 and stop node3 (don't exit cqlsh). Read the data - no data again:
cqlsh> SELECT * FROM ks2.standard1 where key=0x4c4e5032304e34334d30;

 key | C0 | C1 | C2 | C3 | C4
-----+----+----+----+----+----

(0 rows)

How to see that read-repair is running and not succeeding

  1. Run steps (1)-(9) above.
  2. Change RF of system_traces to 3:
cqlsh> ALTER keyspace system_traces  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'}  AND durable_writes = true;
  1. Enable CQL Tracing and read the data 2 or more times and see how a read-repair is executing every time:
cqlsh> TRACING ON
Now Tracing is enabled
cqlsh> SELECT * FROM ks2.standard1 where key=0x343131504b3850323631;                                                                                                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                              
 key                    | C0                                                                     | C1                                                                     | C2                                                                     | C3                                                                     | C4
------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------
 0x343131504b3850323631 | 0xe47ca02cc31a43e7b77d06979fb0788b6f900766abad58ff4d76330f2f006a64f052 | 0x112b68a452dcbfd42d8680b11cf598b6fdd57850daaa1523b06cf33f94e8a3505da0 | 0xa12a44e2aa7e86138c50a49a3aa21058877bb610164321bef504629a41f14dd04ed4 | 0xd9d0ae5483a47575d1232cf1aa3cd868bb82e41a436b2c67a33cf88c0a9b0eabb79a | 0x57a9aa5deac4c8595f58f3082d796a8056a4d34d2233c529455341b4295deafa7edb
                                                                                                                                                                                                                                                              
(1 rows)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
Tracing session: 0dcfcb70-db5f-11eb-80a1-bb092b031d93                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
 activity                                                                                                                                               | timestamp                  | source    | source_elapsed | client                                                                                                                                                                                                              
--------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------                                                                                                                                                                                                                                                                                                                                               Execute CQL3 query | 2021-07-02 13:58:06.503000 | 127.0.0.1 |              0 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                                          Parsing a statement [shard 0] | 2021-07-02 13:58:06.503298 | 127.0.0.1 |              2 | 127.0.0.1                                 
                                                                                                                       Processing a statement [shard 0] | 2021-07-02 13:58:06.503577 | 127.0.0.1 |            281 | 127.0.0.1                                 
 Creating read executor for token -9223043771259877837 with all: {127.0.0.1, 127.0.0.3} targets: {127.0.0.1, 127.0.0.3} repair decision: NONE [shard 0] | 2021-07-02 13:58:06.507053 | 127.0.0.1 |           3757 | 127.0.0.1                                 
                                                                                                                  read_data: querying locally [shard 0] | 2021-07-02 13:58:06.507061 | 127.0.0.1 |           3765 | 127.0.0.1                                 
                                                         Start querying singular range {{-9223043771259877837, pk{000a343131504b3850323631}}} [shard 0] | 2021-07-02 13:58:06.507077 | 127.0.0.1 |           3780 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                 read_digest: sending a message to /127.0.0.3 [shard 0] | 2021-07-02 13:58:06.507093 | 127.0.0.1 |           3797 | 127.0.0.1                                                                                                                                                                                                           
                                     Querying cache for range {{-9223043771259877837, pk{000a343131504b3850323631}}} and slice {(-inf, +inf)} [shard 0] | 2021-07-02 13:58:06.508337 | 127.0.0.1 |           5040 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                                             Querying is done [shard 0] | 2021-07-02 13:58:06.508440 | 127.0.0.1 |           5144 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                read_digest: message received from /127.0.0.1 [shard 0] | 2021-07-02 13:58:06.509819 | 127.0.0.3 |             22 | 127.0.0.1                                                                                                                                                                                                           
                                                         Start querying singular range {{-9223043771259877837, pk{000a343131504b3850323631}}} [shard 0] | 2021-07-02 13:58:06.509845 | 127.0.0.3 |             49 | 127.0.0.1                                                                                                                                                                                                           
                                     Querying cache for range {{-9223043771259877837, pk{000a343131504b3850323631}}} and slice {(-inf, +inf)} [shard 0] | 2021-07-02 13:58:06.509887 | 127.0.0.3 |             91 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                                             Querying is done [shard 0] | 2021-07-02 13:58:06.509972 | 127.0.0.3 |            176 | 127.0.0.1                                                                                                                                                                                                           
                                                                               read_digest handling is done, sending a response to /127.0.0.1 [shard 0] | 2021-07-02 13:58:06.509986 | 127.0.0.3 |            190 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                    read_digest: got response from /127.0.0.3 [shard 0] | 2021-07-02 13:58:06.512820 | 127.0.0.1 |           9524 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                         read_mutation_data: querying locally [shard 0] | 2021-07-02 13:58:06.512836 | 127.0.0.1 |           9540 | 127.0.0.1                                                                                                                                                                                                           
                                                                                          read_mutation_data: sending a message to /127.0.0.3 [shard 0] | 2021-07-02 13:58:06.512872 | 127.0.0.1 |           9576 | 127.0.0.1                                                                                                                                                                                                           
                                     Querying cache for range {{-9223043771259877837, pk{000a343131504b3850323631}}} and slice {(-inf, +inf)} [shard 0] | 2021-07-02 13:58:06.512969 | 127.0.0.1 |           9673 | 127.0.0.1                                                                                                                                                                                                           
                                                                                         read_mutation_data: message received from /127.0.0.1 [shard 0] | 2021-07-02 13:58:06.513245 | 127.0.0.3 |             16 | 127.0.0.1                                                                                                                                                                                                           
                                     Querying cache for range {{-9223043771259877837, pk{000a343131504b3850323631}}} and slice {(-inf, +inf)} [shard 0] | 2021-07-02 13:58:06.513299 | 127.0.0.3 |             70 | 127.0.0.1                                                                                                                                                                                                           
                                                                        read_mutation_data handling is done, sending a response to /127.0.0.1 [shard 0] | 2021-07-02 13:58:06.513462 | 127.0.0.3 |            233 | 127.0.0.1                                                                                                                                                                                                           
                                                                                             read_mutation_data: got response from /127.0.0.3 [shard 0] | 2021-07-02 13:58:06.514391 | 127.0.0.1 |          11094 | 127.0.0.1                                                                                                                                                                                                           
                                          Creating write handler for read repair token: -9223043771259877837 endpoint: {127.0.0.3, 127.0.0.1} [shard 0] | 2021-07-02 13:58:06.514553 | 127.0.0.1 |          11256 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                               Got a response from /127.0.0.3 [shard 0] | 2021-07-02 13:58:06.514574 | 127.0.0.1 |          11277 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                                 Executing a mutation locally [shard 0] | 2021-07-02 13:58:06.514579 | 127.0.0.1 |          11282 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                               Got a response from /127.0.0.1 [shard 0] | 2021-07-02 13:58:06.514683 | 127.0.0.1 |          11386 | 127.0.0.1                                                                                                                                                                                                           
                                                                                 Delay decision due to throttling: do not delay, resuming now [shard 0] | 2021-07-02 13:58:06.514686 | 127.0.0.1 |          11389 | 127.0.0.1                                                                                                                                                                                                                                                                                                                         Mutation successfully completed [shard 0] | 2021-07-02 13:58:06.514695 | 127.0.0.1 |          11398 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                         Done processing - preparing a result [shard 0] | 2021-07-02 13:58:06.514738 | 127.0.0.1 |          11442 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                                                       Request complete | 2021-07-02 13:58:06.514474 | 127.0.0.1 |          11474 | 127.0.0.1                                                                                                                                                                                                           


cqlsh> SELECT * FROM ks2.standard1 where key=0x343131504b3850323631;                                                                                                                                      

 key                    | C0                                                                     | C1                                                                     | C2                                                                     | C3                                                                     | C4
------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------------------------------------------------------------------
 0x343131504b3850323631 | 0xe47ca02cc31a43e7b77d06979fb0788b6f900766abad58ff4d76330f2f006a64f052 | 0x112b68a452dcbfd42d8680b11cf598b6fdd57850daaa1523b06cf33f94e8a3505da0 | 0xa12a44e2aa7e86138c50a49a3aa21058877bb610164321bef504629a41f14dd04ed4 | 0xd9d0ae5483a47575d1232cf1aa3cd868bb82e41a436b2c67a33cf88c0a9b0eabb79a | 0x57a9aa5deac4c8595f58f3082d796a8056a4d34d2233c529455341b4295deafa7edb

(1 rows)                                                                                                  

Tracing session: 12cfdac0-db5f-11eb-80a1-bb092b031d93                                                     

 activity                                                                                                                                               | timestamp                  | source    | source_elapsed | client                                                                                                                                                                                                              
--------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------                                                                                                                                                                                                          
                                                                                                                                     Execute CQL3 query | 2021-07-02 13:58:14.892000 | 127.0.0.1 |              0 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                                          Parsing a statement [shard 0] | 2021-07-02 13:58:14.892653 | 127.0.0.1 |              1 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                                       Processing a statement [shard 0] | 2021-07-02 13:58:14.892887 | 127.0.0.1 |            236 | 127.0.0.1                                                                                                                                                                                                           
 Creating read executor for token -9223043771259877837 with all: {127.0.0.1, 127.0.0.3} targets: {127.0.0.1, 127.0.0.3} repair decision: NONE [shard 0] | 2021-07-02 13:58:14.892984 | 127.0.0.1 |            332 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                                  read_data: querying locally [shard 0] | 2021-07-02 13:58:14.892990 | 127.0.0.1 |            339 | 127.0.0.1                                                                                                                                                                                                                                                                    Start querying singular range {{-9223043771259877837, pk{000a343131504b3850323631}}} [shard 0] | 2021-07-02 13:58:14.893000 | 127.0.0.1 |            349 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                 read_digest: sending a message to /127.0.0.3 [shard 0] | 2021-07-02 13:58:14.893015 | 127.0.0.1 |            364 | 127.0.0.1                                                                                                                                                                                                                                                Querying cache for range {{-9223043771259877837, pk{000a343131504b3850323631}}} and slice {(-inf, +inf)} [shard 0] | 2021-07-02 13:58:14.893096 | 127.0.0.1 |            444 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                                             Querying is done [shard 0] | 2021-07-02 13:58:14.893243 | 127.0.0.1 |            591 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                read_digest: message received from /127.0.0.1 [shard 0] | 2021-07-02 13:58:14.893460 | 127.0.0.3 |             19 | 127.0.0.1                                                                                                                                                                                                                                                                    Start querying singular range {{-9223043771259877837, pk{000a343131504b3850323631}}} [shard 0] | 2021-07-02 13:58:14.893482 | 127.0.0.3 |             41 | 127.0.0.1                                                                                                                                                                                                           
                                     Querying cache for range {{-9223043771259877837, pk{000a343131504b3850323631}}} and slice {(-inf, +inf)} [shard 0] | 2021-07-02 13:58:14.893517 | 127.0.0.3 |             76 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                                             Querying is done [shard 0] | 2021-07-02 13:58:14.893596 | 127.0.0.3 |            155 | 127.0.0.1                                                                                                                                                                                                           
                                                                               read_digest handling is done, sending a response to /127.0.0.1 [shard 0] | 2021-07-02 13:58:14.893608 | 127.0.0.3 |            167 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                    read_digest: got response from /127.0.0.3 [shard 0] | 2021-07-02 13:58:14.894522 | 127.0.0.1 |           1871 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                         read_mutation_data: querying locally [shard 0] | 2021-07-02 13:58:14.894534 | 127.0.0.1 |           1883 | 127.0.0.1                                                                                                                                                                                                           
                                                                                          read_mutation_data: sending a message to /127.0.0.3 [shard 0] | 2021-07-02 13:58:14.894552 | 127.0.0.1 |           1901 | 127.0.0.1                                                                                                                                                                                                           
                                     Querying cache for range {{-9223043771259877837, pk{000a343131504b3850323631}}} and slice {(-inf, +inf)} [shard 0] | 2021-07-02 13:58:14.894607 | 127.0.0.1 |           1955 | 127.0.0.1                                                                                                                                                                                                           
                                                                                         read_mutation_data: message received from /127.0.0.1 [shard 0] | 2021-07-02 13:58:14.894966 | 127.0.0.3 |             15 | 127.0.0.1                                                                                                                                                                                                           
                                     Querying cache for range {{-9223043771259877837, pk{000a343131504b3850323631}}} and slice {(-inf, +inf)} [shard 0] | 2021-07-02 13:58:14.895010 | 127.0.0.3 |             59 | 127.0.0.1                                                                                                                                                                                                           
                                                                        read_mutation_data handling is done, sending a response to /127.0.0.1 [shard 0] | 2021-07-02 13:58:14.895100 | 127.0.0.3 |            150 | 127.0.0.1                                                                                                                                                                                                           
                                                                                             read_mutation_data: got response from /127.0.0.3 [shard 0] | 2021-07-02 13:58:14.895798 | 127.0.0.1 |           3147 | 127.0.0.1                                                                                                                                                                                                           
                                          Creating write handler for read repair token: -9223043771259877837 endpoint: {127.0.0.3, 127.0.0.1} [shard 0] | 2021-07-02 13:58:14.895903 | 127.0.0.1 |           3252 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                               Got a response from /127.0.0.3 [shard 0] | 2021-07-02 13:58:14.895918 | 127.0.0.1 |           3266 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                                 Executing a mutation locally [shard 0] | 2021-07-02 13:58:14.895922 | 127.0.0.1 |           3270 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                               Got a response from /127.0.0.1 [shard 0] | 2021-07-02 13:58:14.895984 | 127.0.0.1 |           3333 | 127.0.0.1                                                                                                                                                                                                           
                                                                                 Delay decision due to throttling: do not delay, resuming now [shard 0] | 2021-07-02 13:58:14.895986 | 127.0.0.1 |           3334 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                              Mutation successfully completed [shard 0] | 2021-07-02 13:58:14.895996 | 127.0.0.1 |           3345 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                         Done processing - preparing a result [shard 0] | 2021-07-02 13:58:14.896032 | 127.0.0.1 |           3380 | 127.0.0.1                                                                                                                                                                                                           
                                                                                                                                       Request complete | 2021-07-02 13:58:14.895404 | 127.0.0.1 |           3404 | 127.0.0.1                                                                                                                                                                                                           


What's the root-cause?

The root-cause is that our combined reader doesn't return a tombstone if it is the latest version of data and it is expired. Instead it returns a NULL data.
As a result a reconciliation logic concludes that the "old" data is the newest one and sends a mutation to the replica with the tombstone which is applied but since the tombstone still has the later timestamp the next time reader returns a NULL data again and round we go.

For the same reason the reconciliation logic return the old data to the user (as a CQL SELECT result) every time the replica without a tombstone is involved and a NULL data when two replicas with the tombstone are read.

This state will remain as long as expired tombstones are not compacted away or replicas' consistency is restored somehow (see below).

Unfortunately, expired tombstones are not going to be compacted away as long as they are not compacted with the data they shadow and there isn't any guarantee when this is going to happen unless a major compaction is issued.

How to work around?

Option 1

Run a full repair - luckily our regular repair doesn't suffer from the same issue as a read-repair today.
However if #3561 is implemented as proposed in the opening message a repair would no longer be able to get the system out of the state above.

Option 2

Run a major compaction on all relevant replica nodes.

This would make tombstones to be evicted and will guarantee a data resurrection. But at least your reads will become consistent.

Option 3

Increase a gc_grace_seconds so that tombstones become not expired.

@vladzcloudius
Copy link
Contributor Author

@bhalevy
Copy link
Member

bhalevy commented Jul 4, 2021

@vladzcloudius it's not really random data, isn't it?
Do you mean the reads can return resurrected data inconsistently/randomly?

@vladzcloudius
Copy link
Contributor Author

@vladzcloudius it's not really random data, isn't it?
Do you mean the reads can return resurrected data inconsistently/randomly?

@bhalevy yes, that's what I meant. I'll update the description to make it clearer.

@vladzcloudius vladzcloudius changed the title QUORUM Reads can start constantly return random data if the data was deleted and if a full repair failed to run in a gc_grace_second time frame QUORUM Reads can start constantly randomly return resurrected/current data if the data was deleted and if a full repair failed to run in a gc_grace_second time frame Jul 8, 2021
@vladzcloudius vladzcloudius changed the title QUORUM Reads can start constantly randomly return resurrected/current data if the data was deleted and if a full repair failed to run in a gc_grace_second time frame QUORUM Reads can start randomly return resurrected/current data if the data was deleted and if a full repair failed to run in a gc_grace_second time frame Jul 8, 2021
@DoronArazii DoronArazii added this to the 5.x milestone Jan 5, 2023
@denesb denesb self-assigned this May 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants