Possible split-brain with LWT ops over CQL lists #7116

Closed

aphyr opened this issue Aug 25, 2020 · 12 comments

aphyr commented Aug 25, 2020

In a workload composed of single-key select and update queries appending unique integers to CQL lists, it appears that histories may not be linearizable, even in healthy clusters--no partitions, crashes, etc. Reads executed at SERIAL fail to observe recent writes, and reflect states that could not possibly be the product of any order of updates.

Our schema involves a single table, lists, where rows are identified by the composite primary key (part, id). In this workload, part is always 0.

CREATE TABLE IF NOT EXISTS "lists" (
  part int,
  id int,
  lwt_trivial int,
  value list<int>,
  PRIMARY KEY (part, id)
) WITH compaction = {'class' : 'SizeTieredCompactionStrategy'};

Updates append a single unique integer to a randomly selected row's value, by primary key. We use a trivial IF expression so that these updates go through LWT but still upsert:

UPDATE "lists" SET value = value + [10] WHERE part = 0 AND id = 41 IF lwt_trivial = null;

Our reads are of the form:

SELECT * FROM "lists" WHERE part = 0 AND id = 51;

...and are always executed with SERIAL consistency.
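
For illustration, here's a minimal client-side sketch of how these statements might be issued, assuming the qbits/alia Clojure driver (alia 4.x) and a placeholder keyspace; the actual jepsen.scylla client code may differ:

;; A minimal sketch, assuming the qbits/alia driver; jepsen.scylla's
;; actual client wrapper may differ.
(require '[qbits.alia :as alia])

(def cluster (alia/cluster {:contact-points ["n1" "n2" "n3" "n4" "n5"]}))
(def session (alia/connect cluster "jepsen_keyspace")) ; keyspace name is hypothetical

;; Append via LWT: the trivial IF forces the Paxos path but still upserts.
(alia/execute session
              "UPDATE \"lists\" SET value = value + [10]
               WHERE part = 0 AND id = 41 IF lwt_trivial = null;")

;; Read back, always at SERIAL consistency.
(alia/execute session
              "SELECT * FROM \"lists\" WHERE part = 0 AND id = 51;"
              {:consistency :serial})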

Stale Reads

In this test run, which performed roughly 1172 acknowledged transactions over 120 seconds, we observed a single stale read:

G-single-realtime #0
Let:
  T1 = {:type :ok, :f :txn, :value [[:r 43 [25 26 27]]], :time 109236930848, :process 1673, :index 5026}
  T2 = {:type :ok, :f :txn, :value [[:append 43 4]], :time 108768382054, :process 1632, :index 4988}

Then:
  - T1 < T2, because T1 did not observe T2's append of 4 to 43.
  - However, T2 < T1, because T2 completed at index 4988, 0.196 seconds before the invocation of T1, at index 5005: a contradiction!

In short, we failed to observe an append of 4 that completed 196 ms before the read even began. This history actually contains an astonishing two hundred sixty-four transactions, all of which depend, either via data dependencies or realtime edges, on every other transaction in the cluster, so there might be other, worse anomalies lurking in there--the checker gave up looking:

                        :cycle-search-timeout [{:type :cycle-search-timeout,
                                                :anomaly-spec-type :G0-realtime,
                                                :scc-size 264}],

Incompatible Orders

Three keys (36, 39, 42) exhibited incompatible orders: reads which could not have arisen from a total order of appends. Appended values can go missing from lists, then "reappear" later. Here's key 36:

{:type :ok, :f :txn, :value [[:r 36 [17 18 19]]], :time 91376794993, :process 1216, :index 3926}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19]]], :time 91499714393, :process 1216, :index 3932}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19]]], :time 91533393942, :process 1216, :index 3934}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19]]], :time 91570987098, :process 1216, :index 3936}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20]]], :time 91783200441, :process 1216, :index 3966}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20]]], :time 91947447893, :process 1216, :index 3974}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20]]], :time 91983632413, :process 1216, :index 3976}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22]]], :time 92103365054, :process 1216, :index 3982}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22]]], :time 92187090416, :process 1216, :index 3986}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22]]], :time 92226487566, :process 1216, :index 3988}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22]]], :time 92265031099, :process 1216, :index 3990}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22]]], :time 92303676170, :process 1216, :index 3992}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22 23]]], :time 92392034777, :process 1216, :index 3996}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22 23]]], :time 92554104555, :process 1216, :index 4004}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22 23]]], :time 92590421500, :process 1216, :index 4006}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22 23 24]]], :time 92667072456, :process 1216, :index 4010}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22 23 24]]], :time 92708989806, :process 1216, :index 4012}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22 23 24]]], :time 92802954055, :process 1216, :index 4016}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22 23 24]]], :time 93033376860, :process 1216, :index 4026}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22 23 24]]], :time 94115850303, :process 1369, :index 4036}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22 23 24]]], :time 94137022635, :process 1309, :index 4038}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22 23 24 5 9 11 1 26 27]]], :time 94624237116, :process 1257, :index 4088}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22 23 24 5 9 11 1 26 27]]], :time 94752877417, :process 1340, :index 4136}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22 23 24 5 9 11 1 26 27 21]]], :time 95139570386, :process 1355, :index 4156}
{:type :ok, :f :txn, :value [[:r 36 [22 23 24]]], :time 95994166342, :process 1358, :index 4226}
{:type :ok, :f :txn, :value [[:r 36 [17 18 19 20 22 23 24 5 9 11 1 26 27 21]]], :time 98204456356, :process 1216, :index 4372}

This looks like a normal list, until the first four and last seven elements disappear for no reason. Then they come back, two seconds later. Timestamps here are in nanoseconds--elements [17, 18, 19] were present for roughly 4.6 seconds before vanishing.
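
The "incompatible orders" condition itself is easy to illustrate with a toy check (only a sketch, not Jepsen's actual Elle checker): if appends to a key happen in one total order, every read of that key observes a prefix of that order, so any two reads must be prefix-comparable--and the reads of key 36 above are not.

;; Toy illustration of "incompatible orders" (not the real checker).
(defn prefix? [a b]
  (= (seq a) (seq (take (count a) b))))

(defn compatible-order? [reads]
  (every? (fn [[a b]] (or (prefix? a b) (prefix? b a)))
          (for [a reads, b reads] [a b])))

(compatible-order? [[17 18 19] [17 18 19 20] [17 18 19 20 22]])
;; => true
(compatible-order? [[17 18 19 20 22 23 24] [22 23 24]])
;; => false: these reads of key 36 cannot come from one append order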

On key 39, we alternate between two different timelines. Note the process IDs: process mod 5 gives the node index they talked to (sketched below). This looks like split brain--node n4 had [2 12], but node n5 had [2 6 1 4 5 3]. After roughly 4 seconds, node n5 managed to pick up on the append of 12.

{:type :ok, :f :txn, :value [[:r 39 [2]]], :time 99733288150, :process 1370, :index 4500}
{:type :ok, :f :txn, :value [[:r 39 [2 6 1 4 5 3]]], :time 104291210058, :process 1694, :index 4728}
{:type :ok, :f :txn, :value [[:r 39 [2 12]]], :time 108976694275, :process 1688, :index 5006}
{:type :ok, :f :txn, :value [[:r 39 [2 12]]], :time 109024373455, :process 1543, :index 5010}
{:type :ok, :f :txn, :value [[:r 39 [2 12]]], :time 109718655735, :process 1683, :index 5044}
{:type :ok, :f :txn, :value [[:r 39 [2 6 1 4 5 3 12 16 14 29]]], :time 113527547843, :process 1624, :index 5316}
{:type :ok, :f :txn, :value [[:r 39 [2 6 1 4 5 3 12 16 14 29 31]]], :time 115688709483, :process 1617, :index 5388}
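
The process-to-node mapping used above can be sketched like so, assuming nodes n1..n5 assigned round-robin, which matches the reads shown:

;; process mod 5 picks the node a process talks to (assumed n1..n5 layout).
(defn node-of [process]
  (str "n" (inc (mod process 5))))

(node-of 1688) ;; => "n4", which read [2 12]
(node-of 1694) ;; => "n5", which read [2 6 1 4 5 3]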

In case these are CQL collection-specific behaviors, I'm going to try writing a variant of this test which uses reads and writes over standard ints instead.

You can reproduce this with jepsen.scylla caf3d3c by running

lein run test -w list-append --time-limit 120 --concurrency 20n -r 1000 --test-count 10 --max-txn-length 1

This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.

  • I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.

Installation details
Scylla version (or git commit hash): 4.2
Cluster size: 5
OS (RHEL/CentOS/Ubuntu/AWS AMI): Debian 10


aphyr commented Aug 26, 2020

I've written a variant of this test which operates over plain old registers, to rule out any weirdness with CQL collections. Registers also appear to exhibit some sort of nonlinearizable behavior--maybe a stale read? It's hard to tell with just reads and writes. For instance, check out what key 10531 does in this test run; I'm eliding a bunch of other writes and reads to get to the weird bit.

; Process 5524 writes 4, speaking to node n5:
{:type :invoke, :f :txn, :value [[:w 10531 4]], :time 346339471007, :process 5524, :index 685723}
{:type :ok, :f :txn, :value [[:w 10531 4]], :time 348263505500, :process 5524, :index 685834}
; Process 5879 writes 14, also on node n5:
{:type :invoke, :f :txn, :value [[:w 10532 9] [:w 10531 14]], :time 348450414605, :process 5879, :index 685859}
{:type :ok, :f :txn, :value [[:w 10532 9] [:w 10531 14]], :time 349026558073, :process 5879, :index 685880}
; A read on node n5 shows 14, but a concurrent read on node n4 shows the old value of 4!
{:type :invoke, :f :txn, :value [[:r 10532 nil] [:r 10531 nil]], :time 349567838206, :process 4708, :index 685905}
{:type :invoke, :f :txn, :value [[:r 10531 nil]], :time 350427521895, :process 5769, :index 685987}
{:type :ok, :f :txn, :value [[:r 10531 14]], :time 350499193743, :process 5769, :index 685994}
{:type :ok, :f :txn, :value [[:r 10532 10] [:r 10531 4]], :time 353391562348, :process 4708, :index 686064}
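
To make the realtime relationship explicit, here's the gap between the write of 14 completing and the n4 read (which still returned 4) being invoked, computed from the :time fields above (nanoseconds):

;; Times (ns) copied from the history above.
(def write-14-ok-time    349026558073) ; process 5879's :ok
(def n4-read-invoke-time 349567838206) ; process 4708's :invoke

(/ (- n4-read-invoke-time write-14-ok-time) 1e9)
;; => ~0.54 seconds: the read began well after the write of 14
;;    completed, yet returned the older value 4.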

You can reproduce this using jepsen.scylla af86be8, with something like

lein run test -w wr-register --time-limit 600 --concurrency 10n -r 1000 --test-count 20 --nemesis partition --nemesis-interval 20

This one involved network partitions--I'm still looking to see if it occurs without faults.


aphyr commented Aug 26, 2020

Ah, good, it's not partition-dependent--we can reproduce single-cell linearizability faults without partitions in healthy clusters. It's relatively infrequent, though--took a few thousand seconds to catch this one: http://jepsen.io.s3.amazonaws.com/analyses/scylla-4.2/20200825T225228.000-0400.zip


aphyr commented Aug 27, 2020

To refine this a bit: nonlinearizability persists even when reads are executed with :consistency :serial and writes are executed with :consistency :quorum and :serial-consistency :serial--previously we were just setting :consistency to :serial (see the sketch at the end of this comment). For example, this two-minute test with jepsen.scylla 3988808, using network partitions...

lein run test -w wr-register --time-limit 120 --concurrency 10n -r 1000 --test-count 10 --max-writes-per-key 10000 --key-count 100 --nemesis partition --nemesis-interval 20

... produced the following history for key 95. One of the very first operations is a write of 5. It completes at 20.1 seconds.

{:type :invoke, :f :txn, :value [[:w 95 5] [:w 98 31]], :time 19289533242, :process 0, :index 147}
...
{:type :ok, :f :txn, :value [[:w 95 5] [:w 98 31]], :time 20107770399, :process 0, :index 204}

At 23 seconds, another write sets 95 to 8:

{:type :invoke, :f :txn, :value [[:w 99 104] [:w 95 8]], :time 23081676258, :process 89, :index 347}
{:type :ok, :f :txn, :value [[:w 99 104] [:w 95 8]], :time 23638924442, :process 89, :index 378}

And a subsequent read at ~24 seconds observes the new value of 8:

{:type :invoke, :f :txn, :value [[:r 95 nil] [:r 99 nil]], :time 23749325067, :process 84, :index 383}
...
{:type :ok, :f :txn, :value [[:r 95 8] [:r 99 113]], :time 24309638221, :process 84, :index 408}

A thousand-odd writes occur over the next 40 seconds, including several successful operations:

{:type :ok, :f :txn, :value [[:w 99 115] [:w 95 9]], :time 24365152122, :process 69, :index 410}
{:type :ok, :f :txn, :value [[:w 95 12]], :time 25964001811, :process 64, :index 520}
{:type :ok, :f :txn, :value [[:w 95 14] [:w 96 30]], :time 28821871078, :process 265, :index 656}
{:type :ok, :f :txn, :value [[:w 99 183] [:w 95 16]], :time 29582091953, :process 265, :index 694}
{:type :ok, :f :txn, :value [[:w 95 368]], :time 35266952756, :process 209, :index 33144}
{:type :ok, :f :txn, :value [[:w 102 887] [:w 95 1336]], :time 43007046683, :process 474, :index 85417}
{:type :ok, :f :txn, :value [[:w 97 5735] [:w 95 1483]], :time 46844142349, :process 470, :index 92992}
{:type :ok, :f :txn, :value [[:w 95 1808]], :time 58741618886, :process 899, :index 116859}
{:type :ok, :f :txn, :value [[:w 102 8971] [:w 95 1824]], :time 58972573806, :process 918, :index 117335}

Then, suddenly, at 60 seconds, the value flips back to 5:

{:type :invoke, :f :txn, :value [[:r 102 nil] [:r 95 nil]], :time 59634475022, :process 216, :index 118702}
...
{:type :ok, :f :txn, :value [[:r 102 9611] [:r 95 5]], :time 60886189322, :process 216, :index 119242}
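
Here is the sketch of the refined settings referenced above, again assuming the qbits/alia driver; the table and UPDATE statement are hypothetical stand-ins for the actual wr-register statements in jepsen.scylla:

(require '[qbits.alia :as alia]) ; session as in the earlier sketch

;; Reads: SERIAL, as before.
(alia/execute session
              "SELECT * FROM registers WHERE part = 0 AND id = 95;" ; hypothetical table
              {:consistency :serial})

;; Writes: QUORUM for the commit/learn phase, SERIAL for the Paxos phase
;; (previously writes just set :consistency :serial).
(alia/execute session
              "UPDATE registers SET value = 5 WHERE part = 0 AND id = 95 IF lwt_trivial = null;"
              {:consistency :quorum, :serial-consistency :serial})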


aphyr commented Sep 2, 2020

I've done some tuning: with 767baf5, you should be able to hit faults in roughly a minute by doing either:

lein run test-all --concurrency 2n --time-limit 30 --nemesis partition -w wr-register
lein run test-all --concurrency 2n --time-limit 30 --nemesis partition -w list-append


aphyr commented Sep 2, 2020

I can confirm this is present on 4.1 as well. With 7fab1f4, you can use the --version flag to swap between versions:

lein run test -w wr-register --nemesis partition --version 4.1

(specifically, this is Scylla version 4.1.6-0.20200831.6d9ff622df with build-id c2ed04d00095c63b7456b5fcda6775c5a9f33b0b)


alecco commented Sep 3, 2020

We are currently investigating an issue where some nodes don't return the appropriate learned value.


aphyr commented Sep 15, 2020

This appears to be resolved in scylla-server_4.2~rc4-0.20200915.fdf86ffb8-1_amd64.deb!

aphyr closed this as completed Sep 15, 2020

aphyr commented Sep 16, 2020

Oh, a weird question--if identical hashes mean that users can experience linearizability violations... does that mean that hash collisions could cause them too?


kostja commented Oct 1, 2020

Yes.


aphyr commented Oct 2, 2020

Huh! Might be interesting to go about deliberately constructing rows with hash collisions. :-)

gleb-cloudius commented

Note, though, that the timestamp is also hashed, and it is chosen by the server, so you do not control all of the data, which makes creating hash collisions harder.

kostja added the jepsen label Dec 23, 2020

kostja commented Dec 23, 2020

The root cause of the problem is #4567, fixed in Scylla 4.2+
