Alternator Streams thinks that PutItem is a REMOVE+MODIFY #6930
This patch adds additional tests for Alternator Streams, which helped uncover 9 new issues. The stream tests are noticeably slower than most other Alternator tests: test_streams.py now has 27 tests taking a total of 20 seconds. Much of this slowness is attributed to Alternator Streams' 512 "shards" per stream in the single-node test setup with 256 vnodes, meaning that we need over 1000 GetRecords API requests per test. These tests could be made significantly faster (as little as 4 seconds) by setting a lower number of vnodes; issue #6979 is about doing this in the future.

The tests in this patch have comments explaining clearly (I hope) what they test, and also pointing to issues I opened about the problems discovered through these tests. In particular, the tests reproduce the following bugs:

Refs #6918
Refs #6926
Refs #6930
Refs #6933
Refs #6935
Refs #6939
Refs #6942

The tests also work around the following issues (and can be changed to be more strict and reproduce these issues):

Refs #6918
Refs #6931

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200804154755.1461309-1-nyh@scylladb.com>
On paper it looks like it would be solvable. The same more or less applies, but is much more easily handled, for the query limit. On the other hand, I don't think Alternator giving a slightly different event stream in these cases is a super huge deal; it does technically express the same thing. It would be nice to know if it would really hamper a real user.
At some point we were thinking about marking this tombstone with ts instead of ts-1. This works fine with CQL because "update ... set x = null using timestamp T" actually uses T-1. I don't remember whether we did this in the end or not. @kbr- do you remember? I think we did.
The issue was reported 23 days ago. @nyh, can you confirm this is still happening?
The test still xfails (this is why I love test-first development).
@kbr Alternator put_item explicitly puts a row tombstone at ts-1 when doing an insert. See executor.cc:1277.
Right, I'll explain. When Alternator wishes to replace an item completely (not merge some new attributes with old attributes) it needs to delete the old item and then write the new item. If a write and a delete are done with the same timestamp the delete wins, which is why we must do the delete at ts-1. This is not a new Alternator invention - Scylla already does exactly the same thing when replacing a collection: it puts a range tombstone for the collection at timestamp ts-1 and the new collection cells at timestamp ts.
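To illustrate why the tombstone goes at ts-1, here is a minimal Python sketch of the timestamp reconciliation described above. This is a toy model, not Scylla code; all names and data shapes are hypothetical. The key rule is that a cell survives a tombstone only if its timestamp is strictly greater, so on a timestamp tie the tombstone wins.

```python
# Toy model of timestamp reconciliation (illustrative only; names are
# hypothetical, this is not Scylla code). A cell survives a tombstone
# only if its timestamp is strictly greater - on a tie, the tombstone wins.

def apply_tombstone(cells, tombstone_ts):
    """cells maps column -> (timestamp, value)."""
    return {c: (t, v) for c, (t, v) in cells.items() if t > tombstone_ts}

def put_item(row, new_cells, ts, tombstone_ts):
    """Replace an item: write the new cells at timestamp ts, then apply
    a row tombstone at tombstone_ts."""
    merged = dict(row)
    merged.update({c: (ts, v) for c, v in new_cells.items()})
    return apply_tombstone(merged, tombstone_ts)

old = {"a": (90, 3), "b": (95, 4)}          # item previously had a=3, b=4

# A tombstone at ts would also destroy the new write (tie -> delete wins):
assert put_item(old, {"a": 7}, 100, 100) == {}

# A tombstone at ts-1 removes all old cells but keeps the new write,
# giving the replace-not-merge semantics PutItem needs:
assert put_item(old, {"a": 7}, 100, 99) == {"a": (100, 7)}
```

The second assertion shows the point of the discussion: after the replace, b=4 is gone rather than merged with the new a=7.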
This is a good question. If the item did exist before, replacing it is indeed equivalent to a REMOVE followed by an INSERT (see #6918 on the separate issues of an item which didn't exist before, and of INSERT vs MODIFY). I think we need to learn more about how real users use DynamoDB Streams, and how much they might care about this issue and about #6918. So maybe these issues should be considered very low priority. On the other hand, it would have been nicer to have perfect DynamoDB compatibility, without making excuses about why we chose not to be compatible.
Writing a cell completely replaces the older versions of this cell. Unless:
* you have multiple columns in the schema
* you want to write a completely new row that is empty for all columns except one column with some new value.

So if I understand correctly - why do you need a row tombstone?
Okay, if you're asking me whether we should give special treatment to mutations which have a row tombstone with timestamp one less than the written cells' timestamp:

First, it would also affect updates coming from CQL. So if someone in CQL creates a batch as follows:

    begin batch
    delete from ks.t using timestamp 42 where pk = 0 and ck = 0;
    update ks.t using timestamp 43 set v = 0 where pk = 0 and ck = 0;
    apply batch;

they would get a single entry in CDC (instead of two) with a completely new type of operation due to our special handling.

Second, CDC code is complex already. Believe me, you don't want to have any more edge cases there.

Generally speaking, CDC was written with CQL in mind. That's why it has special handling for the collection tombstone case. If we now mix Alternator-frontend-specific logic into the code, we're going to make a mess. And I'm afraid the problem discussed in this thread is not the last problem we'll run into.

I think we should have a translation layer between the CDC log table and Alternator Streams which detects such Alternator-frontend-specific stuff (maybe before even considering query windows etc., so we don't run into problems such as the one Calle described) and makes the necessary adjustments.
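A minimal sketch of what such a translation layer could look like, assuming a hypothetical event shape of (operation, key, timestamp) tuples; this is not the actual Alternator Streams code, just an illustration of collapsing the artificial REMOVE-at-ts-1 followed by a MODIFY-at-ts pair for the same key:

```python
def collapse_putitem_events(events):
    """Collapse a REMOVE at ts-1 immediately followed by a MODIFY at ts
    for the same key into a single MODIFY, hiding the tombstone that
    PutItem generated internally. Events are (op, key, ts) tuples; the
    shape is hypothetical."""
    out = []
    i = 0
    while i < len(events):
        cur = events[i]
        nxt = events[i + 1] if i + 1 < len(events) else None
        if (nxt is not None and cur[0] == "REMOVE" and nxt[0] == "MODIFY"
                and cur[1] == nxt[1] and nxt[2] == cur[2] + 1):
            out.append(nxt)   # keep only the MODIFY; drop the REMOVE
            i += 2
        else:
            out.append(cur)   # a genuine event passes through unchanged
            i += 1
    return out

stream = [("REMOVE", "item1", 99), ("MODIFY", "item1", 100),
          ("REMOVE", "item2", 50)]               # item2 is a real deletion
assert collapse_putitem_events(stream) == [("MODIFY", "item1", 100),
                                           ("REMOVE", "item2", 50)]
```

Note that this simple pairwise scan assumes both events are visible in the same batch; as Calle pointed out, the pair may straddle a query window, which is exactly what makes this harder in practice.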
Instead of just setting ts-1 for the tombstone, we could, for internal purposes, mark it in such a way that it is recognized as not being user-generated, for all the cases in Scylla, i.e. collections etc. It should effectively make CDC easier.
Because we want to replace the entire row - not a single cell. Imagine that no matter what this row previously contained, you want to set "a=3". Even if it previously had "b=4", you need to remove it as well - not merge and have both "a=3" and "b=4". As I said, Scylla also has exactly the same need in a CQL operation which wants to set - i.e., replace - a collection (e.g., a set). It also needs to delete the previous contents of the collection (this is a range tombstone) and write new contents. In that existing code, originating in Cassandra, Scylla uses timestamp ts-1 for the range tombstone, and in Alternator we did exactly the same when replacing an item.
Exactly, but 1. it doesn't have to be "except one column", and 2. some of the columns are collections (in Alternator, we have one main map column) and the entire collection must be cleared - we don't know which specific items we need to delete, so we need a tombstone.
I suggest you add to your tests a CQL test which replaces a collection with a new value (there's CQL syntax for this!). In the CDC log, you'll see a deletion at timestamp TS-1 and the new cells at timestamp TS. Think about whether that is what you want to see - or not.
Right. Note that in Alternator we assume the user cannot choose the timestamp on their own. So a user cannot manufacture this case.
Yes, that sounds like a reasonable approach, but it might not be easy because of the time window issues that Calle described.
+1 to Kamil's comment. This is Alternator-stream specific and better not to inject it into core CDC code.
@haaawk we already do ts-1 detection in CDC code, to deal with collection tombstones and avoid generating too many CDC rows. I don't think it's unreasonable that even this code would benefit from the ts-1 pattern being marked as "generated by internal code" and thus treated explicitly. In fact, it would be safer.
You can use a collection tombstone for that, you don't need a row tombstone.
No, you'll see one entry with one timestamp. We changed it some time ago (#6491).
That's the special collection handling that I mentioned in my previous post.
This is a very good observation, which I already made for other reasons in #6918 (comment). Whereas we currently use a row tombstone to delete the item, we could use individual cell tombstones (for regular columns, if not overwritten by a new value) plus a collection tombstone (for the ":attrs" column) to delete the item. If CDC already has special handling for collection tombstones at ts-1 - but doesn't have the same handling for row tombstones - then it will indeed solve this problem.
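A small sketch, in the same toy model as before (hypothetical names, not Scylla code), of the two strategies: a row tombstone at ts-1 versus individual cell tombstones for the columns not overwritten. Both produce the same final item state, but the second never emits a row-level deletion for CDC to turn into a REMOVE event:

```python
def put_item_row_tombstone(row, new_cells, ts):
    # Current approach: a row tombstone at ts-1 kills every older cell,
    # then the new cells are written at ts.
    kept = {c: (t, v) for c, (t, v) in row.items() if t > ts - 1}
    kept.update({c: (ts, v) for c, v in new_cells.items()})
    return kept

def put_item_cell_tombstones(row, new_cells, ts):
    # Alternative: delete only the columns not overwritten, one cell
    # tombstone each (a collection tombstone would cover the map column).
    kept = dict(row)
    for col in row:
        if col not in new_cells:
            del kept[col]          # an individual cell tombstone
    kept.update({c: (ts, v) for c, v in new_cells.items()})
    return kept

old = {"a": (90, 3), "b": (95, 4)}
assert put_item_row_tombstone(old, {"a": 7}, 100) == \
       put_item_cell_tombstones(old, {"a": 7}, 100) == {"a": (100, 7)}
```

In the real system the cell tombstones would carry their own timestamps; the point here is only that the replace semantics survive without any row tombstone, so CDC's existing collection-tombstone handling could apply.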
Oh good, I didn't remember this. So, is there any particular reason why you changed this specifically just for collection tombstones and not for the similar issue of entire-row tombstones? I guess the main reason is that CQL only does the former, and doesn't have syntax for the latter (overwriting an entire item)?
So we made a decision based on CQL specifics, because CDC was designed with CQL in mind. EDIT: @nyh, this should answer your last question. I don't think we should do such a thing for row tombstones.
@kbr - I agree. We should in fact not do any of these things. The tombstones should be marked as "artificial", i.e. created internally to express something. Then we could instead safely generate more expressive CDC info for whatever is happening.
Today, CDC takes a base table mutation and returns a log table mutation. It's very simple. What you're proposing would require a significant redesign: we would have to take additional data as input.
Yes. And I think it might be worth considering doing so. Having more explicit track-keeping of what we are doing is not a bad thing.
The best solution here seems to be to just use cell/collection tombstones instead of row tombstones. That would solve the issue without adding any complexity to CDC itself.
It's clear this is a difference, and we accept it until we find there is an issue with the approach we have taken. Pushing this out until we get user feedback.
This issue is similar to issue #6918 (perhaps it should be folded into that issue? I'm not sure), but even more surprising: it turns out that every PutItem operation appears today in the Alternator Stream as two events - a REMOVE and a MODIFY! This is instead of just one event - a MODIFY or an INSERT (depending on whether the item existed before - see #6918).

This "REMOVE" event appears because our implementation of PutItem inserts a row tombstone with timestamp ts-1 together with the new value with timestamp ts. We need this tombstone because the new row is supposed to completely replace the old row - not be merged into it.

I'm not sure how we can or should solve this. CDC already gives special attention to tombstones preceding a value's timestamp by exactly one, so maybe this situation can already be recognized by CDC and the "REMOVE" event avoided at that lower level. But if we can't do that, at least the Alternator layer should notice this situation - a REMOVE followed by a MODIFY with a timestamp difference of one - and hide the REMOVE event from its output.
CC @elcallio