Possible split-brain with LWT ops over CQL lists #7116
I've written a variant of this test which operates over plain old registers, to rule out any weirdness with CQL collections. Those also appear to exhibit some sort of weird nonlinearizable behavior--maybe a stale read? Hard to tell with just reads and writes. For instance, check out what key 10531 does in this test run. I'm eliding a bunch of other writes/reads to get to the weird bit.

; Process 5524 writes 4, speaking to node n5:
{:type :invoke, :f :txn, :value [[:w 10531 4]], :time 346339471007, :process 5524, :index 685723}
{:type :ok, :f :txn, :value [[:w 10531 4]], :time 348263505500, :process 5524, :index 685834}
; Process 5879 writes 14, also on node n5:
{:type :invoke, :f :txn, :value [[:w 10532 9] [:w 10531 14]], :time 348450414605, :process 5879, :index 685859}
{:type :ok, :f :txn, :value [[:w 10532 9] [:w 10531 14]], :time 349026558073, :process 5879, :index 685880}
; A read on node n5 shows 14, but a concurrent read on node n4 shows the old value of 4!
{:type :invoke, :f :txn, :value [[:r 10532 nil] [:r 10531 nil]], :time 349567838206, :process 4708, :index 685905}
{:type :invoke, :f :txn, :value [[:r 10531 nil]], :time 350427521895, :process 5769, :index 685987}
{:type :ok, :f :txn, :value [[:r 10531 14]], :time 350499193743, :process 5769, :index 685994}
{:type :ok, :f :txn, :value [[:r 10532 10] [:r 10531 4]], :time 353391562348, :process 4708, :index 686064}

You can reproduce this using jepsen.scylla af86be8, with something like:

lein run test -w wr-register --time-limit 600 --concurrency 10n -r 1000 --test-count 20 --nemesis partition --nemesis-interval 20

This one involved network partitions--I'm still looking to see if it occurs without faults.
Ah, good, it's not partition-dependent--we can reproduce single-cell linearizability faults without partitions in healthy clusters. It's relatively infrequent, though--took a few thousand seconds to catch this one: http://jepsen.io.s3.amazonaws.com/analyses/scylla-4.2/20200825T225228.000-0400.zip
To refine this a bit: nonlinearizability persists even when reads are executed with
... produced the following history for key 95. One of the very first operations is a write of 5. It completes at 20.1 seconds.

{:type :invoke, :f :txn, :value [[:w 95 5] [:w 98 31]], :time 19289533242, :process 0, :index 147}
...
{:type :ok, :f :txn, :value [[:w 95 5] [:w 98 31]], :time 20107770399, :process 0, :index 204}

At 23 seconds, another write sets 95 to 8:
And a subsequent read at ~24 seconds observes the new value of 8:

{:type :invoke, :f :txn, :value [[:r 95 nil] [:r 99 nil]], :time 23749325067, :process 84, :index 383}
...
{:type :ok, :f :txn, :value [[:r 95 8] [:r 99 113]], :time 24309638221, :process 84, :index 408}

A thousand-odd writes occur over the next 40 seconds, including several successful operations:

{:type :ok, :f :txn, :value [[:w 99 115] [:w 95 9]], :time 24365152122, :process 69, :index 410}
{:type :ok, :f :txn, :value [[:w 95 12]], :time 25964001811, :process 64, :index 520}
{:type :ok, :f :txn, :value [[:w 95 14] [:w 96 30]], :time 28821871078, :process 265, :index 656}
{:type :ok, :f :txn, :value [[:w 99 183] [:w 95 16]], :time 29582091953, :process 265, :index 694}
{:type :ok, :f :txn, :value [[:w 95 368]], :time 35266952756, :process 209, :index 33144}
{:type :ok, :f :txn, :value [[:w 102 887] [:w 95 1336]], :time 43007046683, :process 474, :index 85417}
{:type :ok, :f :txn, :value [[:w 97 5735] [:w 95 1483]], :time 46844142349, :process 470, :index 92992}
{:type :ok, :f :txn, :value [[:w 95 1808]], :time 58741618886, :process 899, :index 116859}
{:type :ok, :f :txn, :value [[:w 102 8971] [:w 95 1824]], :time 58972573806, :process 918, :index 117335}

When suddenly, at 60 seconds, the value flips back to 5:

{:type :invoke, :f :txn, :value [[:r 102 nil] [:r 95 nil]], :time 59634475022, :process 216, :index 118702}
...
{:type :ok, :f :txn, :value [[:r 102 9611] [:r 95 5]], :time 60886189322, :process 216, :index 119242}
I've done some tuning: with 767baf5, you should be able to hit faults in roughly a minute by doing either:
I can confirm this is present on 4.1 as well. With 7fab1f4, you can use the --version flag to swap between versions:
(specifically, this is Scylla version 4.1.6-0.20200831.6d9ff622df with build-id c2ed04d00095c63b7456b5fcda6775c5a9f33b0b)
We are currently investigating an issue where some nodes don't return the appropriate learned value.
This appears to be resolved in scylla-server_4.2~rc4-0.20200915.fdf86ffb8-1_amd64.deb!
Oh, a weird question--if identical hashes mean that users can experience linearizability violations... does that mean that hash collisions could cause them too?
Yes.
Huh! Might be interesting to go about deliberately constructing rows with hash collisions. :-)
Note, though, that the timestamp is also hashed and is chosen by the server, so you do not control all of the data, which makes creating hash collisions harder.
The root cause of the problem is #4567, fixed in Scylla 4.2+.
In a workload comprised of single-key select and update queries appending unique integers to CQL lists, it appears that histories may not be linearizable, even in healthy clusters--no partitions, crashes, etc. Reads executed at SERIAL fail to observe recent writes. Reads reflect states that could not possibly be the product of any order of updates.

Our schema involves a single table, lists, where elements are identified by a composite primary key [part, id]. part, in this workload, is always 0. Updates append a single unique integer to a randomly selected row's value, by primary key. We use a trivial IF expression so that these updates go through LWT but still upsert:
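For illustration, the statement could look something like the following sketch; the exact table layout and the name of the dummy column used for the trivial IF (lwt_trivial here) are assumptions, not necessarily the test's literal CQL.

-- Assumed schema: composite primary key [part, id], with a list<int> value.
-- The lwt_trivial column exists only to give the IF clause something to test.
CREATE TABLE lists (
  part        int,
  id          int,
  value       list<int>,
  lwt_trivial int,
  PRIMARY KEY (part, id)
);

-- Append a unique integer to one row. The condition holds whenever the
-- dummy column is unset, so the update takes the Paxos/LWT path while
-- still creating the row if it does not exist yet.
UPDATE lists SET value = value + [123]
  WHERE part = 0 AND id = 5
  IF lwt_trivial = null;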
Our reads are of the form:
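Roughly, a single-row select by primary key; the column and key values here are assumed for illustration:

-- Hypothetical read of one row's list by primary key. The SERIAL consistency
-- level is set on the statement by the client (e.g. CONSISTENCY SERIAL in
-- cqlsh), not in the CQL text itself.
SELECT value FROM lists WHERE part = 0 AND id = 5;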
...and are always executed with SERIAL consistency.
Stale Reads
In this test run, which performed roughly 1172 acknowledged transactions over 120 seconds, we observed a single stale read:
In short, we failed to observe an append of 4 that completed 196 ms before the read even began. This history actually contains an astonishing two hundred sixty-four transactions, all of which depend, either via data dependencies or realtime edges, on every other transaction in the cluster, so there might be other, worse anomalies lurking in there--the checker gave up looking:
Incompatible Orders
Three keys (36, 39, 42) exhibited incompatible orders: reads which could not have arisen from a total order of appends. Appended values can go missing from lists, then "reappear" later. Here's key 36:
This looks like a normal list, until the first four and last seven elements disappear for no reason. Then they come back, two seconds later. Timestamps here are in nanoseconds--elements [17, 18, 19] were present for roughly 4.6 seconds before vanishing.
On key 39, we alternate between two different timelines. Note the process IDs: process mod 5 gives the node index they talked to. This looks like split brain--node n4 had [2 12], but node n5 had [2 6 1 4 5 3]. After roughly 4 seconds, node n5 managed to pick up on the append of 12.

In case these are CQL collection-specific behaviors, I'm going to try writing a variant of this test which uses reads and writes over standard ints instead.
You can reproduce this with jepsen.scylla caf3d3c by running
Installation details
Scylla version (or git commit hash): 4.2
Cluster size: 5
OS (RHEL/CentOS/Ubuntu/AWS AMI): Debian 10