Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single-partition updates aren't isolated #7170

Closed
1 task done
aphyr opened this issue Sep 4, 2020 · 8 comments
Closed
1 task done

Single-partition updates aren't isolated #7170

aphyr opened this issue Sep 4, 2020 · 8 comments
Assignees
Labels
Milestone

Comments

@aphyr
Copy link

aphyr commented Sep 4, 2020

TL;DR: CASSANDRA-13550 applies to Scylla too.

Scylla's DML documentation states:

All updates for an INSERT are applied atomically and in isolation.

In an UPDATE statement, all updates within the same partition key are applied atomically and in isolation.

All updates in a BATCH belonging to a given partition key are performed in isolation.

By default, Scylla uses a batch log to ensure all operations in a batch eventually complete or none will (note, however, that operations are only isolated within a single partition)

However, single-partition BATCH statements are not, in fact, isolated--presumably, INSERT, UPDATE, and DELETE aren't either. For instance, consider this workload which performs BATCH updates and SELECTs over a fixed collection of rows in a single partition. Each update picks a unique integer x, and sets every row's value to either x or -x. Since every BATCH updates the same set of keys, and BATCHs are theoretically isolated, every SELECT over those rows should observe a state in which every row has the same absolute value. Instead, we observe reads like...

[[:r 5 -3] [:r 3 -2] [:r 4 3]]
[[:r 3 -2] [:r 4 -5] [:r 5 -3]]
[[:r 4 -5] [:r 3 -2] [:r 5 -3]]
[[:r 6 -1] [:r 7 -1] [:r 8 -4]]

Here, each line is a transaction corresponding to a single SELECT query, and [:r 5 -3] denotes that that SELECT observed the value of key 5 to be -3. If Scylla's BATCH updates were isolated per-partition, we would expect to observe every value to be either +/-3 or +/-2, but not a mix. This particular isolation violation is called read skew: a single SELECT observed a mixed state between two transactions.

I believe this issue is the same I described in Cassandra, circa 2013: conflict resolution in Cassandra is performed at the level of cells, rather than partitions. Since cell values are resolved using client-supplied timestamps plus lexicographic order, and timestamps are not required to be unique, timestamp collisions can result in conflict resolution which selectively discards some, but not all, writes performed in a single UPDATE, BATCH, etc. This problem occurs with read and write consistency level ALL, and is exacerbated when timestamps are more likely to collide.

Read-repair and other mechanisms also violate claimed isolation guarantees.

At least some people at Scylla already know about this problem: as @duarten noted, @avikivity opened CASSANDRA-13550 against Cassandra's documentation in 2017! Might I suggest dropping the isolation language from the Scylla documentation as well? Alternatively, Scylla could consider modifying the conflit resolution strategy. For instance, servers could use Flake IDs to guarantee timestamp uniqueness--using client-provided timestamps to derive the timestamp portion of flake IDs where desired.

This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.

  • I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.

Installation details
Scylla version (or git commit hash): 4.2
Cluster size: 5
OS (RHEL/CentOS/Ubuntu/AWS AMI): Debian

@nyh
Copy link
Contributor

nyh commented Sep 6, 2020

This is a good point. You yourself discovered it 7 years ago in Cassandra - https://aphyr.com/posts/294-call-me-maybe-cassandra. The same point still applies to Scylla because it uses the same timestamp-based conflict resolution without a consistent tie-breaker for equal timestamps.

Indeed the identity of the coordinator doing the update (and serializing the timestamps it uses for those updates) could have served as such a tie-breaker (timeuuid is one ugly way to do that that we already use in other places). Or, indeed, we can change the documentation to at least make it clearer we have this issue. For example, in a recent blog post we did comparing Scylla to DynamoDB (https://www.scylladb.com/2020/05/12/comparing-cql-and-the-dynamodb-api/), we stated this explicitly:

These ordinary writes are not always serialized. They usually are: CQL assigns a timestamp (with microsecond resolution) to each write and the latest timestamp wins. However two concurrent writes which happen to receive exactly the same timestamp may get mixed up, with values from both writes being present in the result. For more information on this problem, see this Jepsen post.

@tzach
Copy link
Contributor

tzach commented Sep 6, 2020

@nyh can you please suggest a doc update (till other solution is implemented?)

@nyh
Copy link
Contributor

nyh commented Sep 6, 2020

@tzach perhaps the main offender is, like @aphyr said, https://docs.scylladb.com/getting-started/dml/, which we copied from Cassandra as-is, and has the statement like "All updates for an INSERT are applied atomically and in isolation." - in multiple places for multiple different operations.

We could drop these statements, but should probably also add a clarification like I wrote above about the fact that row updates are not serialized if the same timestamp is used.

One thing that bothers me - and I'm not an expert enough on this (maybe @gleb-cloudius @tgrabiec or @avikivity would know) - is that I vaguely remember that Scylla does make some effort to do row updates atomically in all sorts of places in the code and not to break up a single row. Did we, or can we, officially decide that this effort has all been for nothing, and in fact we can't make such a guarantee at all, so we shouldn't even try?

@tgrabiec
Copy link
Contributor

tgrabiec commented Sep 6, 2020

@nyh There was some discussion about dropping the guarantee here: #2379

@kostja kostja self-assigned this Sep 16, 2020
@slivne slivne added the jepsen label Sep 24, 2020
@slivne slivne added this to the 4.4 milestone Sep 24, 2020
@kostja
Copy link
Contributor

kostja commented Sep 24, 2020

@aphyr
Copy link
Author

aphyr commented Sep 28, 2020

Kostja brought up that this might be due to batch-specific anomalies, so I wrote a second variant of the test to confirm single-row behavior. With perfectly synchronized clocks, a single row, and 500 reads/sec & 500 writes/sec, we see isolation violations roughly every 20 seconds.

20200928T142938.000-0400.zip

@tzach
Copy link
Contributor

tzach commented Sep 29, 2020

@aphyr
Copy link
Author

aphyr commented Dec 12, 2020

Documentation looks good--I'd say we can close this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants