
Lost inserts in CQL sets and maps with chaotic timestamps #7082

Closed
aphyr opened this issue Aug 21, 2020 · 23 comments

@aphyr commented Aug 21, 2020

In ScyllaDB 4.2, it appears that inserting unique elements into a CQL map or set can, when clocks are not synchronized, result in the loss of acknowledged inserts. See, for example, this Jepsen test run, in which a single client, performing approximately one insert per second to a single CQL set, loses roughly 35% of acknowledged inserts over a 30 second period--as observed by a single read performed after 20 seconds of quiescence.

This behavior appears to depend on timestamps: when we do not set a TimestampGenerator on clients, or provide clients with a TimestampGenerator which yields values directly from System.currentTimeMillis(), the behavior disappears; when we introduce jitter (on the order of 1 second) to those values, write loss appears.

Our schema for this workload is a single table with an integer primary key and a set of integer elements. We insert a single row into this table, and then have clients update that row, each updating the elements field to add a single unique integer. After a few seconds of these inserts, we pause for 20 seconds to allow Scylla time to recover, then perform a single read by primary key.
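For concreteness, the workload corresponds roughly to the following CQL (the table and column names here are only illustrative; the actual test may use different identifiers):

CREATE TABLE IF NOT EXISTS jepsen.sets (
    pk       int PRIMARY KEY,
    elements set<int>
);

-- Initial insert of a single row holding an empty set.
INSERT INTO jepsen.sets (pk, elements) VALUES (0, {});

-- Each client operation adds one unique integer to the set,
-- roughly once per second for ~30 seconds.
UPDATE jepsen.sets SET elements = elements + {1} WHERE pk = 0;
UPDATE jepsen.sets SET elements = elements + {2} WHERE pk = 0;

-- After ~20 seconds of quiescence, a single read by primary key.
SELECT elements FROM jepsen.sets WHERE pk = 0;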

Our randomized timestamps are generated by this generator--tuning uncertainty-s to zero appears to resolve the write loss issue.
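As a rough illustration of what this jitter looks like at the driver level, here is a hypothetical sketch against the DataStax Java driver 3.x TimestampGenerator interface (which yields microseconds since the epoch); this is not the actual Clojure noisy-timestamps code, just an approximation of its behavior:

import java.util.concurrent.ThreadLocalRandom;
import com.datastax.driver.core.TimestampGenerator;

// Hypothetical sketch: wall-clock time plus uniform noise of up to
// +/- uncertainty seconds, loosely mirroring noisy-timestamps.
public class NoisyTimestampGenerator implements TimestampGenerator {
    private final long uncertaintyMicros;

    public NoisyTimestampGenerator(long uncertaintySeconds) {
        this.uncertaintyMicros = uncertaintySeconds * 1_000_000L;
    }

    @Override
    public long next() {
        // Driver timestamps are microseconds since the Unix epoch.
        long nowMicros = System.currentTimeMillis() * 1000L;
        long jitter = ThreadLocalRandom.current()
                .nextLong(-uncertaintyMicros, uncertaintyMicros + 1);
        return nowMicros + jitter;
    }
}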

You can reproduce this using Jepsen's fork of the Scylla Jepsen tests by cloning 604502f and running

lein run test-all -w cset --time-limit 30 --nemesis none --concurrency 1 -r 1

There may be additional options required (e.g. --user, --password, --nodes-file) depending on your Jepsen environment. See https://github.com/jepsen-io/jepsen#setting-up-a-jepsen-environment for guidance.

This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.

  • I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.

Installation details
Scylla version (or git commit hash): 4.2
Cluster size: 5 nodes
OS (RHEL/CentOS/Ubuntu/AWS AMI): Debian 10

Hardware details
Platform (physical/VM/cloud instance type/docker): LXC
Hardware: sockets=2 cores=24 hyperthreading=2 memory=128GB
Disks: (SSD/HDD, count) SSD (shared via LXC)

@haaawk (Contributor) commented Aug 22, 2020

Hi @aphyr,

Thanks for creating the issue. I can see that the final select is done with CL=ALL but what's the CL of the updates (writec)?

@haaawk (Contributor) commented Aug 22, 2020

I think the link to this generator is broken. Did you mean noisy-timestamps?

@haaawk (Contributor) commented Aug 22, 2020

Just to clarify - there's a single client and a single node Scylla cluster in this test? or are there more nodes in the cluster?

@haaawk (Contributor) commented Aug 22, 2020

I would expect writes to be done with CL=ONE based on this but there's also that so I'm not sure.

@haaawk (Contributor) commented Aug 22, 2020

> Just to clarify - there's a single client and a single node Scylla cluster in this test? or are there more nodes in the cluster?

I can see RF of 3 is used so I would expect at least 3 Scylla nodes

@haaawk (Contributor) commented Aug 22, 2020

@aphyr I have a hypothesis that the test might be wrong if any update ends up with a timestamp smaller than the timestamp of the initial insert. I expect this can happen because uncertainty-s is set to 100 seconds. Every update with a timestamp smaller than the timestamp of the initial insert will be missing, and that's correct by design.

See the following example:

cqlsh:ks> create TABLE ks.ks (pk int, s set<int>, primary key(pk));
cqlsh:ks> insert into ks.ks (pk, s) values(1, {}) USING TIMESTAMP 100;
cqlsh:ks> select * from ks.ks;
 pk | s
----+--------
  1 | {}
cqlsh:ks> update ks.ks USING TIMESTAMP 110 SET s = s + {5} WHERE pk=1;
cqlsh:ks> select * from ks.ks;
 pk | s
----+--------
  1 | {5}
cqlsh:ks> update ks.ks USING TIMESTAMP 101 SET s = s + {7} WHERE pk=1;
cqlsh:ks> select * from ks.ks;
 pk | s
----+--------
  1 | {5, 7}
cqlsh:ks> update ks.ks USING TIMESTAMP 50 SET s = s + {10} WHERE pk=1;
cqlsh:ks> select * from ks.ks;
 pk | s
----+--------
  1 | {5, 7}

Note that element 10 is missing because timestamp 50 is smaller than timestamp 100. This is correct because the initial insert (insert into ks.ks (pk, s) values(1, {}) USING TIMESTAMP 100;) overrides the value of s with an empty set and does that with timestamp 100. This means any write with a timestamp smaller than 100 will be overwritten by this empty set.

@haaawk (Contributor) commented Aug 22, 2020

Probably the test could be fixed by waiting for at least uncertainty-s seconds after running setup and before running the first update.

@aphyr (Author) commented Aug 22, 2020

> Thanks for creating the issue. I can see that the final select is done with CL=ALL but what's the CL of the updates (writec)?

This occurs with either CL=ALL or CL=ONE.

> I think the link to this generator is broken. Did you mean noisy-timestamps?

Ah, yes, thank you. Fixed.

> Just to clarify - there's a single client and a single node Scylla cluster in this test? or are there more nodes in the cluster?

There is a single client, and five nodes.

> I would expect writes to be done with CL=ONE based on this but there's also that so I'm not sure.

The code in this commit uses CL=ONE, yes; I've tested with a variety of CLs and all fail.

> I can see RF of 3 is used so I would expect at least 3 Scylla nodes

Yup, five nodes!

> Every update with a timestamp smaller than the timestamp of the initial insert will be missing, and that's correct by design.

I have to admit, I'm somewhat surprised by this, though on further reflection, I think it makes sense. I was expecting (based on the way the test was originally written) that row updates commuted with row creation (which would preserve these updates), or that row updates to a row which doesn't logically exist yet would fail (which would result in client-visible errors). But you're right--the documentation for INSERT describes it as either an insert or an update, which allows inserts to destroy data! That'd explain why the test fails: the initial insert introduces a non-commutative, destructive operation.

It sounds like the right way to preserve set insertions is to ensure that one never inserts a row with a set, and instead performs only updates--relying on the fact that UPDATE implicitly creates nonexistent rows? Indeed, dropping the insert from the test workload appears to resolve the problem.
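For illustration, the update-only variant looks something like this (reusing the illustrative identifiers above); the first UPDATE creates the set cells for the row, so no prior INSERT is needed, and pure set additions commute regardless of timestamp order:

-- No initial INSERT of an empty set; each add is a pure set addition.
UPDATE jepsen.sets SET elements = elements + {1} WHERE pk = 0;
UPDATE jepsen.sets SET elements = elements + {2} WHERE pk = 0;

SELECT elements FROM jepsen.sets WHERE pk = 0;

Such a row has no row marker and exists only by virtue of its set cells, which is fine for this workload.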

@ManManson--you wrote this test initially--does this sound right to you?

@haaawk (Contributor) commented Aug 22, 2020

Yes. Dropping the insert will work too.

@haaawk (Contributor) commented Aug 22, 2020

I guess we can close the issue then. What do you think, @aphyr ?

@aphyr (Author) commented Aug 22, 2020

I'm unsure! The documentation for CQL sets steers users towards inserting sets with a CQL insert, as opposed to update, and when I talked about this issue with @kostja, he didn't realize it was unsafe either! It sounds like this kind of insert+update is an antipattern, because it can lead to write loss--but we already know that CQL sets are unsafe in the presence of deletion updates, so... maybe it's not a huge deal? I do think this is the sort of thing that deserves documentation--some sort of best-practices guidelines, perhaps?

@haaawk (Contributor) commented Aug 22, 2020

It depends on the perspective. If the user understands that an insert assigns the value of the set (it's like assigning a variable), then it all makes sense. This is how INSERTs in Scylla and Cassandra work for all types, so it's consistent. After all, that's the most natural semantics of an insert I can think of. What was your understanding of what INSERT does?

> It sounds like this kind of insert+update is an antipattern, because it can lead to write loss

I don't look at it this way. This is the same as setting the set to some value (for example {1, 2, 3}) and then modifying it with updates. If someone understands how data mutations work in Scylla/Cassandra, it's not a surprise at all. Data will be lost only if you do updates with timestamps smaller than the INSERT's, and then the behavior is still correct. It works like this for every type. If you insert value X into an int column and then try to update that column to Y, but the update turns out to have a smaller timestamp than the insert, then the data will be "lost" in exactly the same way as for a set.
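To spell out the int-column analogy with an illustrative cqlsh session (the timestamps here are arbitrary example values):

cqlsh:ks> create TABLE ks.t (pk int primary key, v int);
cqlsh:ks> insert into ks.t (pk, v) values (1, 10) USING TIMESTAMP 100;
cqlsh:ks> update ks.t USING TIMESTAMP 50 SET v = 20 WHERE pk=1;
cqlsh:ks> select * from ks.t;
 pk | v
----+----
  1 | 10

The update to 20 is "lost" because its timestamp (50) is older than the insert's (100), exactly as with the set.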

The first step to understanding the way it works is the fact that INSERT INTO ks.tb(pk, s) VALUES(0, {}) is like UPDATE ks.tb SET s = {} WHERE pk = 0. It assigns a value to the column no matter what the value of the column was before.

What I'm really trying to say is that this is how timestamp-based eventual consistency works in Scylla. It is consistent and shouldn't be a surprise to anyone who understands it.

I agree that the model is not simple and we have to help users understand it correctly, but I don't think there's any issue with sets here. Everything works as expected, in a way consistent with every other type.

@aphyr (Author) commented Aug 22, 2020

> It is consistent and shouldn't be a surprise to anyone who understands it.

I agree that it may be consistent, but I should note that this behavior was... apparently not expected by the Scylla engineer who wrote this test, the Scylla engineer who helped me explore it, and the writers who wrote the CQL collection documentation. Heck, I've been working with AP databases for a little over a decade, and I failed to catch it too! There's sort of a double surprise here--CQL looks like SQL, where insert + update would be safe--and on the commutative side, if you're used to CRDTs and expecting something like a G-set + unsafe deletes built on wide rows, you might expect the insert to commute with updates as well.

Maybe this is well-understood in the Cassandra/Scylla community at large! I'd like to figure out how well-understood it is, because if it's surprising, it might warrant some discussion in the Jepsen writeup.

@haaawk (Contributor) commented Aug 22, 2020

>> It is consistent and shouldn't be a surprise to anyone who understands it.

> I agree that it may be consistent, but I should note that this behavior was... apparently not expected by the Scylla engineer who wrote this test, the Scylla engineer who helped me explore it, and the writers who wrote the CQL collection documentation.

I don't know what to tell you except that everyone makes mistakes every now and then.

> Heck, I've been working with AP databases for a little over a decade, and I failed to catch it too! There's sort of a double surprise here--CQL looks like SQL, where insert + update would be safe--and on the commutative side, if you're used to CRDTs and expecting something like a G-set + unsafe deletes built on wide rows, you might expect the insert to commute with updates as well.

The most important point is to understand that insert is like a variable assignment. Then everything becomes easy to understand.

Then what the test tries to do is (in Java):

// Set<Integer> column;

// Insert
column = new HashSet<>();
// Update
column.add(1);
// Update
column.add(2);
// Update
column.add(3);
// column contains 1, 2 and 3

When you think of it like this, it becomes apparent that reordering these operations can leave some of the elements missing from the set.

// Set<Integer> column;

// Update
column.add(1);
// Insert
column = new HashSet<>();
// Update
column.add(2);
// Update
column.add(3);
// column contains 2 and 3

> Maybe this is well-understood in the Cassandra/Scylla community at large! I'd like to figure out how well-understood it is, because if it's surprising, it might warrant some discussion in the Jepsen writeup.

It's certainly worth mentioning, just for the sake of helping users become better educated.

But it would just be documenting the existing, correct behavior. There's no issue in the implementation or in Scylla's behavior itself.

@nyh (Contributor) commented Aug 22, 2020

My three cents on this issue:

  1. When you write insert into ks.ks (pk, s) values(1, {}) in CQL, the key thing to notice is the "{}". What the CQL insert syntax means is that you are asking to assign a new value to this column. This is a set column, and you asked to assign the empty set to it. The new empty set replaces whatever content existed in the set with earlier timestamps, just like setting a string to "hello" replaces whatever string of whatever length was set at earlier timestamps. So not only is the behavior you saw correct, I also can't think of any other reasonable behavior for this "insert {}" operation besides the "empty the set" behavior.

  2. Our documentation, which you pointed to (https://docs.scylladb.com/getting-started/types/#sets), even tries to explain this, but apparently doesn't do a good enough job. It includes the text "Replace the existing set entirely", but for some reason only says this in a code comment (!) for the "UPDATE" syntax, while it applies just as much to the "INSERT" syntax.

  3. But, despite my two "excuses" above, we have to face reality: @aphyr is an expert in the field and still didn't understand this, so clearly our documentation in this area is not good enough, and we need to fix it. We need to explain that Scylla supports three different CRDT operations on a set column: adding elements, removing elements, and replacing the entire set (see the sketch below). The operation to replace the entire set has two different syntaxes, using either INSERT or UPDATE (and we should link to another page explaining the subtle difference between an INSERT and an UPDATE - the row marker issue - which is a delicate CQL point that nobody coming from other databases is likely to understand).
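For illustration, the three operations look like this in CQL (reusing the ks.tb example from above; the values are arbitrary):

-- Add elements to the set.
UPDATE ks.tb SET s = s + {1, 2} WHERE pk = 0;

-- Remove elements from the set.
UPDATE ks.tb SET s = s - {1} WHERE pk = 0;

-- Replace the entire set. Both forms overwrite whatever the set held at
-- earlier timestamps; the INSERT additionally writes a row marker.
UPDATE ks.tb SET s = {3, 4} WHERE pk = 0;
INSERT INTO ks.tb (pk, s) VALUES (0, {3, 4});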

To summarize, this is not a bug in the code, but it should be considered a bug in the documentation, and fixed in the documentation, so let's not close this issue before we fix the documentation - or at least open a documentation issue.

@aphyr (Author) commented Aug 22, 2020

Yeah, I do understand the behavior now, and I agree that it is defensible based on the current documentation. I agree it's not a bug! It's more that I made some incorrect assumptions, based on how I expected CQL INSERTs and sets to behave, and based on how the current test suite was designed. It's not something I'm gonna report as a bug in the writeup, but I do think it's interesting to talk about! I'm gonna try and get a sense of how CQL users expect this to behave, and whether the same kind of use pattern that's shown in the documentation is something that users are also doing in the wild.

@haaawk (Contributor) commented Aug 22, 2020

I would suggest renaming the issue, though, to avoid the misconception that something's wrong when it's not.

BTW @aphyr could you please elaborate more on the issue with deletes and sets you mentioned in one of the previous comments?

@nyh (Contributor) commented Aug 23, 2020

I opened a doc issue: scylladb/scylla-doc-issues#346
@aphyr @haaawk please check whether the changes I proposed to the documentation in that issue are what you would have expected to see, and if so, can we close this issue?

@haaawk (Contributor) commented Aug 23, 2020

LGTM @nyh

@avikivity (Member)

Closing, reporter agrees it is not a bug.

@kostja self-assigned this Dec 23, 2020
@kostja added the jepsen label Dec 23, 2020
@kostja (Contributor) commented Dec 23, 2020

@nyh (Contributor) commented Dec 23, 2020

@kostja why did you assign a closed issue to me?
I wanted to review the doc commit you mention, but it seems completely unrelated :-( Maybe you have the wrong link?

@nyh (Contributor) commented Dec 23, 2020

Oh, sorry, I found the right commit in that PR: https://github.com/scylladb/scylla-docs/pull/3150/commits/f5db644ee15df27480cb745867a1b70f06a6b132
Unassigning myself from this issue.

@nyh removed their assignment Dec 23, 2020