Aborted reads of writes which fail with "Not enough replicas available" #7258
I thiiink this might involve an overly-aggressive default client retry policy in com.scylladb/scylla-driver-core version 3.7.1-scylla-2. Reproducing this is tricky, so I can't tell for sure, but so far I haven't encountered this issue when we replace the default retry policy with one that always rethrows. |
On Fri, Sep 18, 2020 at 01:51:39PM +0000, Kyle Kingsbury wrote:
Infrequently, with network partitions and process crashes, LWT operations which fail with a "Not enough replicas available" message may, in fact, be visible to later reads. @kostja confirms that this message should be interpreted as a definite failure, which means this behavior is technically an aborted read.
Kostja is incorrect here. As the code is written right now we check
availability twice. First before we start paxos operation and second
during learning stage. If the error happens during learning stage the
write will be visible to future serial reads.
It should be possible to distinguish when the error was thrown by
looking at the CL reported in the error. If it is serial or local_serial
it is the former; otherwise the latter.
…--
Gleb.
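
A minimal sketch of the distinction described above, assuming the consistency level reported in the error is available to the client as a plain string. The helper name `classify_unavailable` and its return labels are hypothetical, and the rule reflects only the Scylla behavior described in this comment, not any driver API.

```python
# Consistency levels used for the Paxos phase of an LWT.
SERIAL_LEVELS = {"SERIAL", "LOCAL_SERIAL"}


def classify_unavailable(reported_cl: str) -> str:
    """Guess which availability check raised an Unavailable error for an LWT.

    `reported_cl` is the consistency level carried in the error (assumed to be
    exposed as a string). The labels returned here are illustrative only.
    """
    if reported_cl.upper() in SERIAL_LEVELS:
        # Raised by the check before the Paxos operation started.
        return "before-paxos"
    # Raised by the check during the learning stage: the write may already
    # be visible to future serial reads.
    return "learn-stage"
```

With this rule, an error reporting `QUORUM` would be treated as a possibly visible write, while one reporting `SERIAL` would be treated as not started.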
|
Huh, okay! The error docs are remarkably unhelpful about what an UnavailableException actually means--thanks for clearing that up! I managed to dig up an old Cassandra blog post which mentions UnavailableExceptions in the context of LWT, and... I think it implies what you're saying as well; it's just confusing. They explicitly call out that WriteTimeoutException is indefinite, and do not say this about UnavailableException:
But the following paragraph notes that UnavailableException can also be thrown from the commit phase--commit and learning are the same in this context, I'm guessing?
So... this does seem to suggest that "Unavailable" might actually mean the cluster was available enough for the operation to succeed! |
On Tue, Sep 22, 2020 at 07:26:47PM -0700, Kyle Kingsbury wrote:
Huh, okay! The [error docs](https://docs.datastax.com/en/devapp/doc/devapp/driversServerErrors.html) are remarkably unhelpful about what an UnavailableException actually means--thanks for clearing that up! I managed to dig up an old [Cassandra blog post](https://www.datastax.com/blog/2014/10/cassandra-error-handling-done-right) which mentions UnavailableExceptions in the context of LWT, and... I think it implies what you're saying as well; it's just confusing. They explicitly call out that WriteTimeoutException is indefinite, and do *not* say this about UnavailableException:
Note that the behaviour I described here is specific for Scylla.
IIRC Cassandra does not check availability again before attempting
the "learn" stage and will just timeout.
> If the paxos phase fails, the driver will throw a WriteTimeoutException with a WriteType.CAS as retrieved with WriteTimeoutException#getWriteType(). In this situation you can't know if the CAS operation has been applied so you need to retry it in order to fallback on a stable state. Because lightweight transactions are much more expensive that regular updates, the driver doesn't automatically retry it for you. The paxos phase can also lead to an UnavailableException if not enough replicas are available. In this situation, retries won't help as only SERIAL and LOCAL_SERIAL consistencies are available.
But the following paragraph notes that UnavailableException can *also* be thrown from the commit phase--commit and learning are the same in this context, I'm guessing?
> The commit phase is then similar to regular Cassandra writes in the sense that it will throw an UnavailableException or a WriteTimeoutException if the amount of required replicas or acknowledges isn't met. In this situation rather than retrying the entire CAS operation, you can simply ignore this error if you make sure to use setConsistencyLevel(ConsistencyLevel.SERIAL) on the subsequent read statements on the column that was touched by this transaction, as it will force Cassandra to commit any remaining uncommitted Paxos state before proceeding with the read.
So... this does seem to suggest that "Unavailable" might actually mean the cluster *was* available enough for the operation to succeed!
This is a philosophical question :) What does it mean for an operation to
succeed? If a user writes with "learn" CL set to QUORUM it may rightfully
expect to be able to read the data back with non serial QUORUM read in
case of the write succeeding. If learn fails it means that the data is
only available for serial reads.
…--
Gleb.
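
The difference between serial and non-serial reads after a failed learn stage can be illustrated with a toy model. This is a deliberately simplified sketch, not Scylla's or Cassandra's implementation: the class `ToyLwtCluster`, its fields, and the `learn_ok` flag are all hypothetical, and a real non-serial read might still see the value on some replicas.

```python
class ToyLwtCluster:
    """Toy single-key model: a value decided by Paxos vs. values learned by replicas."""

    def __init__(self, replicas: int = 3):
        self.decided = None                # decided by Paxos, not yet learned
        self.learned = [None] * replicas   # what each replica has learned

    def lwt_write(self, value, learn_ok: bool) -> str:
        self.decided = value               # Paxos phase succeeded
        if not learn_ok:
            return "error-in-learn"        # client sees an error
        self._learn()
        return "ok"

    def _learn(self):
        self.learned = [self.decided] * len(self.learned)
        self.decided = None

    def quorum_read(self):
        # Non-serial read: only sees data that replicas have learned.
        return self.learned[0]

    def serial_read(self):
        # Serial read: finishes any decided-but-unlearned Paxos state first.
        if self.decided is not None:
            self._learn()
        return self.quorum_read()
```

In this model a write that fails during learn is invisible to `quorum_read()` until a `serial_read()` completes it, which matches the advice in the quoted blog post to use SERIAL for subsequent reads.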
|
It might be helpful to compare the diagrams in the Datastax documentation for UnavailableException vs other classes of failure, like WriteTimeout: Notice that NoReplicas (Unavailable) is returned when the coordinator has performed no IO to replicas, whereas WriteTimeout does involve a coordinator which has sent at least one message to a replica. These diagrams strongly suggest that users should interpret On this basis... I'm inclined to say this looks like a bug--this is a server implementation detail which is leaking into the client layer, and causing clients to incorrectly infer that the operation has a.) definitely failed, and b.) is safely retryable. It's OK to perform this check, but if Scylla has already issued a message to replicas (which might commit!), Scylla should respond with an obviously indefinite error. |
On Tue, Sep 29, 2020 at 05:49:31PM -0700, Kyle Kingsbury wrote:
On this basis... I'm inclined to say this looks like a bug--this is a server implementation detail which is leaking into the client layer, and causing clients to incorrectly infer that the operation has a.) definitely failed, and b.) is safely retryable. It's OK to perform this check, but if Scylla has already issued a message to replicas (which might commit!), Scylla should respond with an obviously indefinite error.
We do not have to follow Cassandra to the letter, so we can document
differently. Of course it is possible to return "indefinite error"
(which in current cql protocol we support will map to a timeout error),
but by doing so we will drop information we have. Namely the fact that
a query was successfully committed, but not learned. From a user
perspective, how is a less certain error better than a more certain one?
…--
Gleb.
|
If the exception were named Is there an argument to preserving the current behavior because it conveys more information about when in the Paxos process the operation failed? Perhaps! But is this information actually useful? I'm not sure! What can I, as a user, actually gain from knowing that the transaction failed specifically during the Paxos commit phase (even though it may already be committed!), versus failing somewhere after issuing the first message to replicas? I can't envision a scenario where I could actually use this information to decide anything, and I think it's outweighed by the risk of misinterpreting the error as a definite failure--especially given the paucity of Scylla error documentation. You can, if you like, keep throwing UnavailableException here, and offer a sort of error-message apologetics. Perhaps a warning on the LWT transaction docs saying "Be aware: Alternatively, you could return an exception which obviously signals the indeterminacy problem, like a WriteTimeout. I think that's probably easier--you won't have to keep explaining the behavior to users, and it doesn't put the onus on users to review and rewrite their existing exception-handling code. |
Should be backported to 4.3 at least (ideally to 4.2) |
Unavailable exception means that operation was not started and it can be retried safely. If lwt fails in the learn stage though it most certainly means that its effect will be observable already. The patch returns timeout exception instead which means uncertainty. Fixes #7258 Message-Id: <20201001130724.GA2283830@scylladb.com> (cherry picked from commit 3e8dbb3)
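
The behavior change described in this commit message can be sketched as a small decision table. This is an illustration under stated assumptions, not the patch itself: the function names, the stage labels, and the `patched` flag are hypothetical, and the error names are plain strings rather than real exception types.

```python
def lwt_error(stage: str, patched: bool) -> str:
    """Error reported when too few replicas are available at `stage`.

    `stage` is "before-paxos" or "learn" (hypothetical labels).
    """
    if stage == "before-paxos":
        # Nothing was started, so the failure is definite.
        return "Unavailable"
    if stage == "learn":
        # The effect is most likely observable already; after the patch
        # the error signals uncertainty instead of definite failure.
        return "WriteTimeout" if patched else "Unavailable"
    raise ValueError(f"unknown stage: {stage}")


def is_safely_retryable(error: str) -> bool:
    # Per the commit message: Unavailable means the operation was not started.
    return error == "Unavailable"
```

Before the patch, a learn-stage failure would pass the `is_safely_retryable` check even though the write could be visible; after it, the same failure is reported as uncertain.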
Backported to 4.2 and 4.1 |
Infrequently, with network partitions and process crashes, LWT operations which fail with a "Not enough replicas available" message may, in fact, be visible to later reads. @kostja confirms that this message should be interpreted as a definite failure, which means this behavior is technically an aborted read.
For example, see this list-append test, which contains the write:
Which was immediately followed by the read:
A subsequent read observed `[:r 628 [37 83]]`, which is surprising, because the write of 83 to key 628 also failed:... And yet, its write to key 618 also apparently succeeded:
Here, failed writes of 50 and 52 actually succeeded, and were followed, in the history of that key, by acknowledged writes of 69 and 74.
This sort of problem--getting a definite failure for something that probably should have been indefinite--often comes up as a result of internal retry logic, where the retry mechanism either a.) improperly retries an (e.g. non-idempotent) indeterminate failure, or b.) properly retries, and returns the most recent, rather than the most indeterminate, of the errors it encountered. Since we haven't, so far, observed duplicate writes, I suspect the latter. I know Cassandra clients have a RetryPolicy--perhaps it's to blame here? I'm still investigating.
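
To illustrate case (b), here is a minimal sketch of a retry loop that reports the most indeterminate error it saw rather than the most recent one. The exception classes and `call_with_retries` are hypothetical and do not correspond to the driver's RetryPolicy API; the sketch also assumes the operation is safe to retry at all.

```python
class DefiniteFailure(Exception):
    """The operation definitely did not take effect."""


class IndefiniteFailure(Exception):
    """The operation may or may not have taken effect."""


def call_with_retries(op, attempts: int = 3):
    """Call `op` up to `attempts` times, preferring indefinite errors on failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    first_indefinite = None
    last_error = None
    for _ in range(attempts):
        try:
            return op()
        except IndefiniteFailure as e:
            if first_indefinite is None:
                first_indefinite = e
            last_error = e
        except DefiniteFailure as e:
            last_error = e
    # If any attempt was indeterminate, the overall outcome is indeterminate,
    # even if a later attempt failed definitely.
    raise first_indefinite if first_indefinite is not None else last_error
```

A loop that simply re-raised `last_error` would turn a timeout followed by a definite failure into a definite failure, which is the pattern suspected here.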
You can reproduce this issue with jepsen-scylla b724545, by running
Installation details
Scylla version (or git commit hash): scylla-server_4.2~rc4-0.20200915.fdf86ffb8-1_amd64.deb
Cluster size: 5
OS (RHEL/CentOS/Ubuntu/AWS AMI): Debian Buster