Alternator Streams thinks that PutItem is a REMOVE+MODIFY #6930

Open
nyh opened this issue Jul 27, 2020 · 20 comments

Comments

@nyh
Contributor

nyh commented Jul 27, 2020

This issue is similar to issue #6918 (perhaps it should be folded into that issue, but I'm not sure), but even more surprising:
It turns out that every PutItem operation appears today in the Alternator Stream as two events - a REMOVE and a MODIFY! This is instead of just one event - a MODIFY or an INSERT (depending on whether the item existed before - see #6918).

This "REMOVE" event appears because our implementation of PutItem inserts a row tombstone with timestamp ts-1 together with the new value with timestamp ts. We need this tombstone because the new row is supposed to completely replace the old one - not be merged into it.

I'm not sure how we can or should solve this. CDC already gives special attention to tombstones preceding a value's timestamp by exactly one, so maybe this situation can be recognized by CDC itself and the "REMOVE" event suppressed at that lower level. But if we can't do that, at least the Alternator layer should notice this situation - a REMOVE followed by a MODIFY with a timestamp difference of exactly one - and hide the REMOVE event from its output.
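
To make the symptom concrete, here is a minimal, hypothetical boto3 sketch of the reproduction (the real reproducer is test_streams_putitem_keys_only in test/alternator/test_streams.py; the table name, endpoint and credentials below are made up, and the shard iteration is omitted for brevity):

import boto3

# Hypothetical endpoint/credentials for a local Alternator node:
kwargs = dict(endpoint_url='http://localhost:8000', region_name='us-east-1',
              aws_access_key_id='alternator', aws_secret_access_key='secret')
dynamodb = boto3.resource('dynamodb', **kwargs)

table = dynamodb.create_table(
    TableName='issue6930',
    KeySchema=[{'AttributeName': 'p', 'KeyType': 'HASH'}],
    AttributeDefinitions=[{'AttributeName': 'p', 'AttributeType': 'S'}],
    BillingMode='PAY_PER_REQUEST',
    StreamSpecification={'StreamEnabled': True, 'StreamViewType': 'KEYS_ONLY'})
table.wait_until_exists()

# A plain PutItem of a brand-new item:
table.put_item(Item={'p': 'dog', 'a': 3})

# After iterating all shards with GetShardIterator/GetRecords (omitted here),
# DynamoDB reports a single record with eventName INSERT, while Alternator
# today reports two records: a REMOVE followed by a MODIFY.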

CC @elcallio

@nyh nyh added the area/alternator and area/alternator-streams labels Jul 27, 2020
@slivne slivne added this to the 4.3 milestone Jul 30, 2020
psarna pushed a commit that referenced this issue Aug 5, 2020
This patch adds additional tests for Alternator Streams, which helped
uncover 9 new issues.

The stream tests are noticeably slower than most other Alternator tests -
test_streams.py now has 27 tests taking a total of 20 seconds. Much of this
slowness is attributed to Alternator Stream's 512 "shards" per stream in the
single-node test setup with 256 vnodes, meaning that we need over 1000 API
requests per test using GetRecords. These tests could be made significantly
faster (as little as 4 seconds) by setting a lower number of vnodes.
Issue #6979 is about doing this in the future.

The tests in this patch have comments explaining clearly (I hope) what they
test, and also pointing to issues I opened about the problems discovered
through these tests. In particular, the tests reproduce the following bugs:

Refs #6918
Refs #6926
Refs #6930
Refs #6933
Refs #6935
Refs #6939
Refs #6942

The tests also work around the following issues (and can be changed to
be more strict and reproduce these issues):

Refs #6918
Refs #6931

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200804154755.1461309-1-nyh@scylladb.com>
avikivity pushed a commit that referenced this issue Aug 5, 2020
@elcallio
Contributor

On paper it looks like it would be solvable in get_records, but there are some caveats, all due to selection windows and the result limit. Since the tombstone marking "it's a new row" is at ts-1, it would be perfectly possible to get a query window that includes the tombstone but not the actual insert. So, to handle this we would pretty much be forced to always hold back a trailing REMOVE and adjust the next query window to see if it is followed by an insert at the REMOVE's ts+1.
But this would instead somewhat break cases where we have a lot of actual deletes in a row, in which case we would be constantly staggering. Perhaps manageable, but not nice.
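
For concreteness, a rough sketch of that "hold back a trailing REMOVE" idea (not the actual get_records code; records are simplified to (timestamp, event_name, keys) tuples):

def coalesce_put_tombstones(records, deferred=None):
    # 'records' is one GetRecords batch; 'deferred' is a REMOVE held back from
    # the previous batch, since a query window may end right between the ts-1
    # tombstone and the ts write. Returns (visible_records, new_deferred).
    out, pending = [], deferred
    for rec in records:
        if pending is not None:
            ts, _, keys = pending
            if rec[0] == ts + 1 and rec[2] == keys:
                pending = None              # PutItem's synthetic REMOVE: drop it
            else:
                out.append(pending)         # a genuine delete: emit it, one step late
                pending = None
        if rec[1] == 'REMOVE':
            pending = rec                   # can't decide yet: defer
        else:
            out.append(rec)
    return out, pending

As described above, a long run of genuine deletes keeps the last REMOVE deferred across every window boundary - the "constant staggering" downside.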

The same more or less applies to the query limit, though it is much more easily handled there.
It strikes me that the pattern "tombstone at ts-1" is common enough that we should maybe formalize it and recognize it in CDC, and generate something identifiable for it. It is, after all, used for all non-atomics etc. as well. Maybe coalesce such a pattern into a cdc_op::replace? @kbr, @haaawk - thoughts?

OTOH, I don't think Alternator giving a slightly different event stream in these cases is a super huge deal. It does technically express the same thing. It would be nice to know if it really would hamper a real user...

@haaawk
Contributor

haaawk commented Aug 19, 2020

On paper it looks like it would be solvable in get_records, but there are some caveats, all due to selection windows and the result limit. Since the tombstone marking "it's a new row" is at ts-1, it would be perfectly possible to get a query window that includes the tombstone but not the actual insert. So, to handle this we would pretty much be forced to always hold back a trailing REMOVE and adjust the next query window to see if it is followed by an insert at the REMOVE's ts+1.
But this would instead somewhat break cases where we have a lot of actual deletes in a row, in which case we would be constantly staggering. Perhaps manageable, but not nice.

The same more or less applies to the query limit, though it is much more easily handled there.
It strikes me that the pattern "tombstone at ts-1" is common enough that we should maybe formalize it and recognize it in CDC, and generate something identifiable for it. It is, after all, used for all non-atomics etc. as well. Maybe coalesce such a pattern into a cdc_op::replace? @kbr, @haaawk - thoughts?

OTOH, I don't think Alternator giving a slightly different event stream in these cases is a super huge deal. It does technically express the same thing. It would be nice to know if it really would hamper a real user...

At some point we were thinking about marking this tombstone with ts instead of ts-1. This works fine with CQL because update SET x = null using timestamp T actually uses T-1. I don't remember whether we did this in the end or not. @kbr- do you remember? I think we did.

@kbr-
Contributor

kbr- commented Aug 19, 2020

  1. This should happen only for collections, because that's where the T-1 tombstone appears. The OP says this happens for every PutItem. Does PutItem refer to collections only?
  2. Actually this shouldn't happen for non-TTLed operations even with collections. Today, if we replace a collection without a TTL, we show it as a single delta entry in the CDC log; we no longer split it.

The issue was reported 23 days ago - @nyh, can you confirm this is still happening?

@nyh
Contributor Author

nyh commented Aug 19, 2020

The test still xfails (this is why I love test-first development).

test/alternator/run --runxfail test_streams.py::test_streams_putitem_keys_only

@elcallio
Contributor

@kbr Alternator put_item explicitly puts a row tombstone at ts-1 when doing insert. See executor.cc:1277

@nyh
Contributor Author

nyh commented Aug 19, 2020

@kbr Alternator put_item explicitly puts a row tombstone at ts-1 when doing insert. See executor.cc:1277

Right, I'll explain. When Alternator wishes to replace an item completely (not merge some new attributes with old attributes) it needs to delete the old item and then write the new item. If a write and a delete are done with the same timestamp, the delete wins, which is why we must do the delete at ts-1. This is not a new Alternator invention - Scylla already does exactly the same thing when replacing a collection: it puts a range tombstone for the collection at timestamp ts-1 and the new collection cells with timestamp ts.
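
As a toy illustration of that timestamp rule (plain Python, not Scylla's actual reconciliation code):

# Each write is (timestamp, is_tombstone). Higher timestamp wins; on a tie,
# the tombstone (deletion) wins over live data - hence the ts-1 tombstone.
def reconcile(a, b):
    if a[0] != b[0]:
        return a if a[0] > b[0] else b
    return a if a[1] else b

new_write = (10, False)                                   # PutItem's new value at ts
assert reconcile((9, True), new_write) == new_write       # tombstone at ts-1: value survives
assert reconcile((10, True), new_write) == (10, True)     # tombstone at ts: delete would win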

@nyh
Contributor Author

nyh commented Aug 19, 2020

OTOH, I don't think Alternator giving a slightly different event stream in these cases is a super huge deal. It does technically express the same thing. It would be nice to know if it really would hamper a real user...

This is a good question. If the item did exist before, replacing it is indeed equivalent to a REMOVE followed by an INSERT (See #6918 on the separate issues of an item which didn't exist before, and on INSERT vs MODIFY). I think we need to learn more about how real users use DynamoDB streams, and how much they might care about this issue and about #6918.

So maybe these issues should be considered very low priority. On the other hand, it would have been nicer to have perfect DynamoDB compatibility, without making excuses for why we chose not to be compatible.

@kbr-
Contributor

kbr- commented Aug 19, 2020

Writing a cell completely replaces the older versions of this cell.
I don't understand why you need a row tombstone?

Unless:

  • you have multiple columns in the schema
  • you want to write a completely new row that is empty for all columns except one column with some new value.

So if I understand correctly

  • those alternator "items" are rows composed of multiple cells which represent attributes,
  • changing a single attribute is adding a single cell
  • replacing an item is putting a row tombstone and adding a cell

Okay, if you're asking me whether we should give special treatment to mutations which have a row tombstone with timestamp T-1 and a cell with timestamp T, my answer is: I don't know, but I'd probably refrain from doing that.

First, it would also affect updates coming from CQL. So if someone in CQL creates a batch as follows:

begin batch
delete from ks.t using timestamp 42 where pk = 0 and ck = 0;
update ks.t using timestamp 43 set v = 0 where pk = 0 and ck = 0;
apply batch;

they would get a single entry in CDC (instead of two) with a completely new type of operation due to our special handling.

Second, CDC code is complex already. Believe me, you don't want to have any more edge cases there.

Generally speaking, CDC was written with CQL in mind. That's why it has special handling for the collection tombstone case. If now we mix-in Alternator-frontend-specific logic to the code, we're going to make a mess. And I'm afraid the problem discussed in this thread is not the first problem we'll run into.

I think we should have a translation layer between CDC log table and Alternator streams which detects such Alternator-frontend-specific stuff (maybe before even considering query windows etc. so we don't run into problems such as the one Calle described) and makes the necessary adjustments.

@elcallio
Contributor

Instead of just setting ts-1 for the tombstone, we could, for internal purposes, mark it in such a way that it is recognized as not being user generated - for all such cases in Scylla, i.e. collections etc. It should effectively make CDC easier.
I'm not sure how such a mark would look, or whether we have any status bits or anything to play with, but it would extend the expressiveness of our internal manipulations so we can detect what we mean, not what we do.

@nyh
Contributor Author

nyh commented Aug 19, 2020

Writing a cell completely replaces the older versions of this cell.
I don't understand why you need a row tombstone?

Because we want to replace the entire row - not a single cell.
Imagine that, no matter what this row previously contained, you want it to end up containing only "a=3". Even if it previously had "b=4", you need to remove that as well - not merge and end up with both "a=3" and "b=4".
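
A toy illustration of the difference, with plain Python dicts standing in for items (not real Alternator code):

old_item = {'p': 'dog', 'a': 1, 'b': 4}

# PutItem({'p': 'dog', 'a': 3}) replaces the whole item - 'b' must disappear:
after_put = {'p': 'dog', 'a': 3}

# An attribute-merging update (UpdateItem-style) would instead give:
after_update = {**old_item, 'a': 3}          # {'p': 'dog', 'a': 3, 'b': 4}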

As I said, Scylla also has exactly the same need in a CQL operation which wants to set - i.e., replace - a collection (e.g., a set). It also needs to delete the previous contents of the collection (this is a range tombstone) and write new contents. In that existing code originating in Cassandra, Scylla uses timestamp ts-1 for the range tombstone, and in Alternator we did exactly the same when replacing an item.

Unless:

* you have multiple columns in the schema
* you want to write a completely new row that is empty for all columns except one column with some new value.

Exactly, but (1) it doesn't have to be "except one column", and (2) some of the columns are collections (in Alternator, we have one main map column) and the entire collection must be cleared - we don't know which specific items we need to delete, so we need a tombstone.

First, it would also affect updates coming from CQL.

I suggest you add to your tests a CQL test which replaces a collection with a new value (there's CQL syntax for this!). In the CDC log, you'll see a deletion at timestamp TS-1 and the new cells at timestamp TS. Consider whether that is what you want to see - or not.

So if someone in CQL creates a batch as follows:

begin batch
delete from ks.t using timestamp 42 where pk = 0 and ck = 0;
update ks.t using timestamp 43 set v = 0 where pk = 0 and ck = 0;
apply batch;

they would get a single entry in CDC (instead of two) with a completely new type of operation due to our special handling.

Right. Note that in Alternator we assume the user cannot choose the timestamp on their own. So a user cannot manufacture this case.

Second, CDC code is complex already. Believe me, you don't want to have any more edge cases there.

Generally speaking, CDC was written with CQL in mind. That's why it has special handling for the collection tombstone case. If now we mix-in Alternator-frontend-specific logic to the code, we're going to make a mess. And I'm afraid the problem discussed in this thread is not the first problem we'll run into.

I think we should have a translation layer between CDC log table and Alternator streams which detects such Alternator-frontend-specific stuff (maybe before even considering query windows etc. so we don't run into problems such as the one Calle described) and makes the necessary adjustments.

Yes, sounds like a reasonable approach, but might not be easy because of the time window issues that Calle described.

@haaawk
Contributor

haaawk commented Aug 19, 2020 via email

@elcallio
Contributor

elcallio commented Aug 19, 2020

@haaawk we already do ts-1 detection in CDC code, to deal with collection tombstones and avoid generating too many CDC rows. I don't think it unreasonable that even this code would benefit from the ts-1 pattern being marked as "generated by internal code" and thus treated explicitly. In fact, it would be safer.
Again, I am not advocating we add more assumptions to CDC. I am advocating that the ts-1 tombstone pattern needs to be better identifiable as "created by Scylla code to emulate certain ops" (like replacing a collection, or a row). Right now we are making assumptions based on just timestamps, which is generally bad, because even in Alternator, where user timestamps do not exist, it would still be perfectly possible to generate a delete + put with exactly one timestamp tick between them - just not very likely.

@kbr-
Contributor

kbr- commented Aug 19, 2020

@nyh

Some of the columns are collections (in Alternator, we have one main map column) and the entire collection must be cleared - we don't know which specific items we need to delete, so we need a tombstone.

You can use a collection tombstone for that, you don't need a row tombstone.

I suggest you add to your tests a CQL test which replaces a collection with a new value (there's CQL syntax for this!). In the CDC log, you'll see a deletion at timestamp TS-1 and the new cells at timestamp TS. Consider whether that is what you want to see - or not.

No, I'll see one entry with one timestamp. We changed it some time ago (#6491).

cqlsh> create keyspace ks with replication = {'class':'SimpleStrategy', 'replication_factor': 1};
cqlsh> create table ks.t (pk int, ck int, v map<int, int>, primary key (pk, ck)) with cdc = {'enabled': true};
cqlsh> update ks.t set v = {0:0} where pk = 0 and ck = 0;
cqlsh> select "cdc$time", v, "cdc$deleted_v", "cdc$deleted_elements_v" from ks.t_scylla_cdc_log;

 cdc$time                             | v      | cdc$deleted_v | cdc$deleted_elements_v
--------------------------------------+--------+---------------+------------------------
 5f1930ec-e223-11ea-1385-76387c3931ec | {0: 0} |          True |                   null

That's the special collection handling that I mentioned in my previous post.

@nyh
Contributor Author

nyh commented Aug 19, 2020

@nyh

Some of the columns are collections (in Alternator, we have one main map column) and the entire collection must be cleared - we don't know which specific items we need to delete, so we need a tombstone.

You can use a collection tombstone for that, you don't need a row tombstone.

This is a very good observation, which I already made for other reasons in #6918 (comment). Whereas we currently use a row tombstone to delete the item, we could use individual cell tombstones (for regular columns, if not overwritten by a new value) plus a collection tombstone (for the ":attrs" column) to delete the item. If CDC already has special handling for collection tombstones at ts-1 - but doesn't have the same handling for row tombstones - then it will indeed solve this problem.

I suggest you add to your tests a CQL test which replaces a collection with a new value (there's CQL syntax for this!). In the CDC log, you'll see a deletion at timestamp TS-1 and the new cells at timestamp TS. Consider whether that is what you want to see - or not.

No, I'll see one entry with one timestamp. We changed it some time ago (#6491).

Oh good, I didn't remember this. So, any particular reason why you changed this specifically just for collection tombstones and not for the similar issue of entire-row tombstones? I guess the main reason is that CQL only does the former, and doesn't have syntax for the latter (for overwriting an entire item)?

@kbr-
Contributor

kbr- commented Aug 19, 2020

@elcallio

we already do ts-1 detection in CDC code, to deal with collection tombstones and avoid generating too many CDC rows

This is done by simply adding 1 to every collection tombstone's timestamp.
The "moral justification" why we can/should do that is the following: if X is a non-frozen column, then the statement
update ... using timestamp T set X = V ...
creates a collection tombstone with timestamp T-1. So we show it in the CDC log under timestamp T - 1 + 1 = T.

So we made a decision based on CQL specifics, because CDC was designed with CQL in mind.

EDIT: @nyh this should answer your last question

I don't think we should do such a thing for row tombstones, because
delete from ... using timestamp T where ...
creates a timestamp T row tombstone; showing it under timestamp T+1 in the CDC log would be very confusing.

@elcallio
Contributor

@kbr - I agree. We should in fact not do any of these things. The tombstones should be marked as "artificial", "created internally", to express something. Then we could instead safely generate more expressive CDC info for whatever is happening.

@kbr-
Contributor

kbr- commented Aug 19, 2020

Today, CDC takes a base table mutation and returns a log table mutation. It's very simple.

What you're proposing would require a significant redesign: we would have to take additional data as input.
It would require changes in other places in the code, not just CDC. For example, we would have to plumb down the information about an "internally generated" collection tombstone all the way from the CQL layer down to storage_proxy together with the mutation.

@elcallio
Contributor

Yes. And I think it might be worth considering doing so. Having more explicit bookkeeping of what we are doing is not a bad thing.

@haaawk
Contributor

haaawk commented Aug 24, 2020

@nyh

Some of the columns are collections (in Alternator, we have one main map column) and the entire collection must be cleared - we don't know which specific items we need to delete, so we need a tombstone.

You can use a collection tombstone for that, you don't need a row tombstone.

This is a very good observation, which I already made for other reasons in #6918 (comment). Whereas we currently use a row tombstone to delete the item, we could use individual cell tombstones (for regular columns, if not overwritten by a new value) plus a collection tombstone (for the ":attrs" column) to delete the item. If CDC already has special handling for collection tombstones at ts-1 - but doesn't have the same handling for row tombstones - then it will indeed solve this problem.

Using cell/collection tombstones instead of row tombstones seems to be the best solution here. It would solve the issue without adding any complexity to CDC itself.

@slivne slivne modified the milestones: 4.3, 4.x Feb 16, 2021
@slivne
Contributor

slivne commented Feb 16, 2021

It's clear this is a difference - and we accept it until we find there is an issue with the approach we have taken. Pushing this out until we get user feedback.
