New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Single-partition updates aren't isolated #7170
Comments
This is a good point. You yourself discovered it 7 years ago in Cassandra - https://aphyr.com/posts/294-call-me-maybe-cassandra. The same point still applies to Scylla because it uses the same timestamp-based conflict resolution without a consistent tie-breaker for equal timestamps. Indeed the identity of the coordinator doing the update (and serializing the timestamps it uses for those updates) could have served as such a tie-breaker (timeuuid is one ugly way to do that that we already use in other places). Or, indeed, we can change the documentation to at least make it clearer we have this issue. For example, in a recent blog post we did comparing Scylla to DynamoDB (https://www.scylladb.com/2020/05/12/comparing-cql-and-the-dynamodb-api/), we stated this explicitly:
|
@nyh can you please suggest a doc update (till other solution is implemented?) |
@tzach perhaps the main offender is, like @aphyr said, https://docs.scylladb.com/getting-started/dml/, which we copied from Cassandra as-is, and has the statement like "All updates for an INSERT are applied atomically and in isolation." - in multiple places for multiple different operations. We could drop these statements, but should probably also add a clarification like I wrote above about the fact that row updates are not serialized if the same timestamp is used. One thing that bothers me - and I'm not an expert enough on this (maybe @gleb-cloudius @tgrabiec or @avikivity would know) - is that I vaguely remember that Scylla does make some effort to do row updates atomically in all sorts of places in the code and not to break up a single row. Did we, or can we, officially decide that this effort has all been for nothing, and in fact we can't make such a guarantee at all, so we shouldn't even try? |
Kostja brought up that this might be due to batch-specific anomalies, so I wrote a second variant of the test to confirm single-row behavior. With perfectly synchronized clocks, a single row, and 500 reads/sec & 500 writes/sec, we see isolation violations roughly every 20 seconds. |
The following docs are updated with by scylladb/scylla-docs#3150 |
Documentation looks good--I'd say we can close this! |
TL;DR: CASSANDRA-13550 applies to Scylla too.
Scylla's DML documentation states:
However, single-partition BATCH statements are not, in fact, isolated--presumably, INSERT, UPDATE, and DELETE aren't either. For instance, consider this workload which performs BATCH updates and SELECTs over a fixed collection of rows in a single partition. Each update picks a unique integer
x
, and sets every row's value to eitherx
or-x
. Since every BATCH updates the same set of keys, and BATCHs are theoretically isolated, every SELECT over those rows should observe a state in which every row has the same absolute value. Instead, we observe reads like...Here, each line is a transaction corresponding to a single SELECT query, and
[:r 5 -3]
denotes that that SELECT observed the value of key 5 to be -3. If Scylla's BATCH updates were isolated per-partition, we would expect to observe every value to be either +/-3 or +/-2, but not a mix. This particular isolation violation is called read skew: a single SELECT observed a mixed state between two transactions.I believe this issue is the same I described in Cassandra, circa 2013: conflict resolution in Cassandra is performed at the level of cells, rather than partitions. Since cell values are resolved using client-supplied timestamps plus lexicographic order, and timestamps are not required to be unique, timestamp collisions can result in conflict resolution which selectively discards some, but not all, writes performed in a single UPDATE, BATCH, etc. This problem occurs with read and write consistency level ALL, and is exacerbated when timestamps are more likely to collide.
Read-repair and other mechanisms also violate claimed isolation guarantees.
At least some people at Scylla already know about this problem: as @duarten noted, @avikivity opened CASSANDRA-13550 against Cassandra's documentation in 2017! Might I suggest dropping the isolation language from the Scylla documentation as well? Alternatively, Scylla could consider modifying the conflit resolution strategy. For instance, servers could use Flake IDs to guarantee timestamp uniqueness--using client-provided timestamps to derive the timestamp portion of flake IDs where desired.
This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.
Installation details
Scylla version (or git commit hash): 4.2
Cluster size: 5
OS (RHEL/CentOS/Ubuntu/AWS AMI): Debian
The text was updated successfully, but these errors were encountered: