
Aborted reads of writes which fail with "Not enough replicas available" #7258

Closed
1 task done
aphyr opened this issue Sep 18, 2020 · 9 comments
@aphyr

aphyr commented Sep 18, 2020

Infrequently, with network partitions and process crashes, LWT operations which fail with a "Not enough replicas available" message may, in fact, be visible to later reads. @kostja confirms that this message should be interpreted as a definite failure, which means this behavior is technically an aborted read.

For example, see this list-append test, which contains the write:

{:type :fail,
 :f :txn,
 :value [[:append 629 6]
         [:append 628 37]
         [:append 623 101]],
 :time 164324993082,
 :process 839,
 :error {:type :no-host-available,
         :definite? true,
         :message "All host(s) tried for query failed (tried: n5/192.168.122.15:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive)))"}
 :index 32000}

Which was immediately followed by the read:

{:type :ok,
 :f :txn,
 :value [[:r 628 [37]]],
 :time 164330209696,
 :process 924,
 :index 32002}

A subsequent read observed [:r 628 [37 83]], which is surprising, because the write of 83 to key 628 also failed:

969     :fail   :txn    [[:append 618 50] [:append 628 83]]     {:type :no-host-available, :definite? true, :message All host(s) tried for query failed (tried: n5/192.168.122.15:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive)))}

... And yet, its write to key 618 also apparently succeeded:

759     :fail   :txn    [[:append 629 127] [:append 618 52]]    {:type :no-host-available, :definite? true, :message All host(s) tried for query failed (tried: n5/192.168.122.15:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive)))}
721     :ok     :txn    [[:append 618 69]]
950     :ok     :txn    [[:append 618 74]]
648     :ok     :txn    [[:r 618 [50 52 69 74]]]

Here, failed writes of 50 and 52 actually succeeded, and were followed, in the history of that key, by acknowledged writes of 69 and 74.

This sort of problem--getting a definite failure for something that probably should have been indefinite--often comes up as a result of internal retry logic, where the retry mechanism either a.) improperly retries an (e.g. non-idempotent) indeterminate failure, or b.) properly retries, and returns the most recent, rather than the most indeterminate, of the errors it encountered. Since we haven't, so far, observed duplicate writes, I suspect the latter. I know Cassandra clients have a RetryPolicy--perhaps it's to blame here? I'm still investigating.
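
To make hypothesis (b) concrete, here is a sketch against the RetryPolicy interface of the 3.x Java driver line (which the Scylla fork shares). The policy below is hypothetical, not the driver's actual default: it retries a write timeout on another host, so the error the application finally sees is whatever the second attempt produces (for instance an UnavailableException from a host on the wrong side of a partition), rather than the earlier, indeterminate timeout.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.WriteType;
import com.datastax.driver.core.exceptions.DriverException;
import com.datastax.driver.core.policies.RetryPolicy;

// Hypothetical policy, for illustration only; NOT the driver's actual default.
public class LastErrorWinsRetryPolicy implements RetryPolicy {

  @Override
  public RetryDecision onWriteTimeout(Statement stmt, ConsistencyLevel cl, WriteType writeType,
                                      int requiredAcks, int receivedAcks, int nbRetry) {
    // The write may already have taken effect, but retrying on another host
    // discards that uncertainty: whatever error the retry produces is what
    // the application ultimately observes.
    return nbRetry == 0 ? RetryDecision.tryNextHost(cl) : RetryDecision.rethrow();
  }

  @Override
  public RetryDecision onUnavailable(Statement stmt, ConsistencyLevel cl,
                                     int requiredReplica, int aliveReplica, int nbRetry) {
    // Surfaced directly to the caller, who reads it as a definite failure.
    return RetryDecision.rethrow();
  }

  @Override
  public RetryDecision onReadTimeout(Statement stmt, ConsistencyLevel cl,
                                     int requiredResponses, int receivedResponses,
                                     boolean dataRetrieved, int nbRetry) {
    return RetryDecision.rethrow();
  }

  @Override
  public RetryDecision onRequestError(Statement stmt, ConsistencyLevel cl,
                                      DriverException e, int nbRetry) {
    // Connection-level errors also get shunted to another host.
    return RetryDecision.tryNextHost(cl);
  }

  public void init(Cluster cluster) {}
  public void close() {}
}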

You can reproduce this issue with jepsen-scylla b724545, by running

lein run test -w list-append --nemesis partition,kill --nemesis-interval 10 --time-limit 300 --concurrency 10n -r 100 --local-deb scylla-server_4.2~rc4-0.20200915.fdf86ffb8-1_amd64.deb --test-count 10 --write-consistency quorum

This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.

  • I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.

Installation details
Scylla version (or git commit hash): scylla-server_4.2~rc4-0.20200915.fdf86ffb8-1_amd64.deb
Cluster size: 5
OS (RHEL/CentOS/Ubuntu/AWS AMI): Debian Buster

@aphyr
Author

aphyr commented Sep 19, 2020

I thiiink this might involve an overly-aggressive default client retry policy in com.scylladb/scylla-driver-core version 3.7.1-scylla-2. Reproducing this is tricky, so I can't tell for sure, but so far I haven't encountered this issue when we replace the default retry policy with one that always rethrows.
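
For reference, a minimal sketch of that workaround with the 3.x driver, using the stock FallthroughRetryPolicy (which rethrows on every callback); the contact point below is a placeholder rather than a value from the test:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.FallthroughRetryPolicy;

public class NoRetryClient {
  public static void main(String[] args) {
    // Never retry: every UnavailableException / WriteTimeoutException reaches
    // the application exactly as the coordinator reported it.
    Cluster cluster = Cluster.builder()
        .addContactPoint("192.168.122.15")                 // placeholder node address
        .withRetryPolicy(FallthroughRetryPolicy.INSTANCE)
        .build();
    Session session = cluster.connect();                   // keyspace omitted; set as needed
    // ... run LWT statements here ...
    cluster.close();
  }
}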

@slivne slivne added this to the 4.3 milestone Sep 20, 2020
@gleb-cloudius
Contributor

gleb-cloudius commented Sep 21, 2020 via email

@aphyr
Author

aphyr commented Sep 23, 2020

Huh, okay! The error docs are remarkably unhelpful about what an UnavailableException actually means--thanks for clearing that up! I managed to dig up an old Cassandra blog post which mentions UnavailableExceptions in the context of LWT, and... I think it implies what you're saying as well; it's just confusing. They explicitly call out that WriteTimeoutException is indefinite, and do not say this about UnavailableException:

If the paxos phase fails, the driver will throw a WriteTimeoutException with a WriteType.CAS as retrieved with WriteTimeoutException#getWriteType(). In this situation you can't know if the CAS operation has been applied so you need to retry it in order to fallback on a stable state. Because lightweight transactions are much more expensive than regular updates, the driver doesn't automatically retry it for you. The paxos phase can also lead to an UnavailableException if not enough replicas are available. In this situation, retries won't help as only SERIAL and LOCAL_SERIAL consistencies are available.

But the following paragraph notes that UnavailableException can also be thrown from the commit phase--commit and learning are the same in this context, I'm guessing?

The commit phase is then similar to regular Cassandra writes in the sense that it will throw an UnavailableException or a WriteTimeoutException if the amount of required replicas or acknowledges isn't met. In this situation rather than retrying the entire CAS operation, you can simply ignore this error if you make sure to use setConsistencyLevel(ConsistencyLevel.SERIAL) on the subsequent read statements on the column that was touched by this transaction, as it will force Cassandra to commit any remaining uncommitted Paxos state before proceeding with the read.

So... this does seem to suggest that "Unavailable" might actually mean the cluster was available enough for the operation to succeed!
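
For what it's worth, the recovery path that blog post recommends would look roughly like the sketch below with the 3.x driver. The table and column names are hypothetical; the relevant part is setConsistencyLevel(ConsistencyLevel.SERIAL) on the read-back, which forces any in-flight Paxos state for the row to be committed before a value is returned.

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class SerialReadback {
  // After an ambiguous commit-phase error, read the touched row at SERIAL so
  // any uncommitted Paxos state is applied before the value comes back.
  static ResultSet readAfterAmbiguousCas(Session session, int key) {
    Statement read = new SimpleStatement(
        "SELECT value FROM lists WHERE key = ?", key)      // hypothetical schema
        .setConsistencyLevel(ConsistencyLevel.SERIAL);
    return session.execute(read);
  }
}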

@gleb-cloudius
Contributor

gleb-cloudius commented Sep 23, 2020 via email

@aphyr
Author

aphyr commented Sep 30, 2020

It might be helpful to compare the diagrams in the Datastax documentation for UnavailableException vs other classes of failure, like WriteTimeout:

[Diagram: UnavailableException, from the driver error documentation]

[Diagram: WriteTimeoutException, from the driver error documentation]

Notice that NoReplicas (Unavailable) is returned when the coordinator has performed no IO to replicas, whereas WriteTimeout does involve a coordinator which has sent at least one message to a replica. These diagrams strongly suggest that users should interpret UnavailableException as a definite failure.

On this basis... I'm inclined to say this looks like a bug--this is a server implementation detail which is leaking into the client layer, and causing clients to incorrectly infer that the operation a.) has definitely failed, and b.) is safely retryable. It's OK to perform this check, but if Scylla has already issued a message to replicas (which might commit!), Scylla should respond with an obviously indefinite error.

@gleb-cloudius
Contributor

gleb-cloudius commented Sep 30, 2020 via email

@aphyr
Author

aphyr commented Sep 30, 2020

If the exception were named PartlyCommittedException that would be fine! The problem is that a.) the name "Unavailable", b.) the differentiation in both type hierarchy and retry policy callbacks, and c.) the Cassandra documentation all suggest that an UnavailableException is a categorically different kind of error from a timeout. If users interpret "unavailable" that way, then for those users, this behavior constitutes an aborted read, which is a serious violation of the linearizability property people expect from LWT. They might, for example, improperly retry non-retriable operations, or report a definite failure to the user. I don't think the chances of this are high, exactly--you need to be doing an LWT operation when a partition or crash happens--but for larger clusters with higher throughput, that might pose a risk of, say, double-booking a resource, paying someone twice, or otherwise corrupting logical state.

Is there an argument to preserving the current behavior because it conveys more information about when in the Paxos process the operation failed? Perhaps! But is this information actually useful? I'm not sure! What can I, as a user, actually gain from knowing that the transaction failed specifically during the Paxos commit phase (even though it may already be committed!), versus failing somewhere after issuing the first message to replicas? I can't envision a scenario where I could actually use this information to decide anything, and I think it's outweighed by the risk of misinterpreting the error as a definite failure--especially given the paucity of Scylla error documentation.

You can, if you like, keep throwing UnavailableException here, and offer a sort of error-message apologetics. Perhaps a warning on the LWT transaction docs saying "Be aware: UnavailableException may mean the cluster actually was available and your operation succeeded". Right now there's--as far as I can tell--basically no Scylla documentation for any kind of error messages? Anything would help, really.

Alternatively, you could return an exception which obviously signals the indeterminacy problem, like a WriteTimeout. I think that's probably easier--you won't have to keep explaining the behavior to users, and it doesn't put the onus on users to review and rewrite their existing exception-handling code.
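
To illustrate what's at stake in application code, here is a hypothetical sketch (the seats table and return values are invented) that follows the documented convention: UnavailableException means the write definitely didn't happen, WriteTimeoutException means the outcome is unknown. Surfacing the learn-stage failure as a timeout keeps code like this correct; surfacing it as Unavailable makes the first branch draw the wrong conclusion.

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.UnavailableException;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class BookingExample {
  // Hypothetical LWT booking; table and return values are illustrative only.
  static String tryBook(Session session, String seat, String user) {
    try {
      ResultSet rs = session.execute(
          "INSERT INTO seats (seat, owner) VALUES (?, ?) IF NOT EXISTS", seat, user);
      return rs.wasApplied() ? "booked" : "taken";
    } catch (UnavailableException e) {
      // Documented meaning: the write never started, so it is safe to report a
      // definite failure or retry from scratch. This is the branch the behavior
      // in this issue silently breaks.
      return "failed";
    } catch (WriteTimeoutException e) {
      // Outcome unknown: must read back (e.g. at SERIAL) before telling the
      // user anything.
      return "unknown";
    }
  }
}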

@kostja
Contributor

kostja commented Oct 6, 2020

Should be backported to 4.3 at least (ideally to 4.2)

tgrabiec pushed a commit that referenced this issue Oct 7, 2020
Unavailable exception means that the operation was not started and it can be
retried safely. If LWT fails in the learn stage, though, it most certainly
means that its effect will already be observable. The patch returns a timeout
exception instead, which means uncertainty.

Fixes #7258

Message-Id: <20201001130724.GA2283830@scylladb.com>
(cherry picked from commit 3e8dbb3)
@tgrabiec
Contributor

tgrabiec commented Oct 7, 2020

Backported to 4.2 and 4.1
