Alternator Streams thinks that PutItem is a REMOVE+MODIFY #6930

Open
nyh opened this issue Jul 27, 2020 · 20 comments

Comments

@nyh
Contributor

nyh commented Jul 27, 2020

This issue is similar to issue #6918 (perhaps it should be folded into that issue, but I'm not sure), but even more surprising:
It turns out that every PutItem operation appears today in the Alternator Stream as two events - a REMOVE and a MODIFY! This is instead of just one event - a MODIFY or an INSERT (depending on whether the item existed before - see #6918).

This "REMOVE" event appears because our implementation of PutItem inserts a row tombstone with timestamp ts-1 together with the new value with timestamp ts. We need this tombstone because the new row is supposed to completely replace the old one - not be merged into it.

I'm not sure how we can or should solve this. CDC already gives special attention to tombstones preceding a value's timestamp by exactly one, so maybe this situation can be recognized by CDC itself and the "REMOVE" event suppressed at that lower level. But if we can't do that, at least the Alternator layer should notice this situation - a REMOVE followed by a MODIFY with a timestamp difference of exactly one - and hide the REMOVE event from its output.
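
To make the symptom concrete, here is a minimal, hypothetical boto3 sketch of the reproduction (the real reproducer is test_streams_putitem_keys_only in test/alternator/test_streams.py; the table name, endpoint and credentials below are made up, and the shard iteration is omitted for brevity):

import boto3

# Hypothetical endpoint/credentials for a local Alternator node:
kwargs = dict(endpoint_url='http://localhost:8000', region_name='us-east-1',
              aws_access_key_id='alternator', aws_secret_access_key='secret')
dynamodb = boto3.resource('dynamodb', **kwargs)

table = dynamodb.create_table(
    TableName='issue6930',
    KeySchema=[{'AttributeName': 'p', 'KeyType': 'HASH'}],
    AttributeDefinitions=[{'AttributeName': 'p', 'AttributeType': 'S'}],
    BillingMode='PAY_PER_REQUEST',
    StreamSpecification={'StreamEnabled': True, 'StreamViewType': 'KEYS_ONLY'})
table.wait_until_exists()

# A plain PutItem of a brand-new item:
table.put_item(Item={'p': 'dog', 'a': 3})

# After iterating all shards with GetShardIterator/GetRecords (omitted here),
# DynamoDB reports a single record with eventName INSERT, while Alternator
# today reports two records: a REMOVE followed by a MODIFY.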

CC @elcallio

@nyh nyh added the area/alternator and area/alternator-streams labels Jul 27, 2020
@slivne slivne added this to the 4.3 milestone Jul 30, 2020
psarna pushed a commit that referenced this issue Aug 5, 2020
This patch adds additional tests for Alternator Streams, which helped
uncover 9 new issues.

The stream tests are noticeably slower than most other Alternator tests -
test_streams.py now has 27 tests taking a total of 20 seconds. Much of this
slowness is attributed to Alternator Stream's 512 "shards" per stream in the
single-node test setup with 256 vnodes, meaning that we need over 1000 API
requests per test using GetRecords. These tests could be made significantly
faster (as little as 4 seconds) by setting a lower number of vnodes.
Issue #6979 is about doing this in the future.

The tests in this patch have comments explaining clearly (I hope) what they
test, and also pointing to issues I opened about the problems discovered
through these tests. In particular, the tests reproduce the following bugs:

Refs #6918
Refs #6926
Refs #6930
Refs #6933
Refs #6935
Refs #6939
Refs #6942

The tests also work around the following issues (and can be changed to
be more strict and reproduce these issues):

Refs #6918
Refs #6931

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200804154755.1461309-1-nyh@scylladb.com>
avikivity pushed a commit that referenced this issue Aug 5, 2020
@elcallio
Contributor

On paper it looks like it would be solvable in get_records, but there are some caveats, all due to selection windows and the result limit. Since the tombstone marking "it's a new row" is at ts-1, it would be perfectly possible to get a query window that includes the tombstone but not the actual insert. So, to handle this we would pretty much be forced to always hold back a trailing REMOVE and adjust the next query window to see if it is followed by an insert at the REMOVE's ts+1.
But this would instead somewhat break cases where we have a lot of actual deletes in a row, in which case we would be constantly staggering. Perhaps manageable, but not nice.
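
For concreteness, a rough sketch of that "hold back a trailing REMOVE" idea (not the actual get_records code; records are simplified to (timestamp, event_name, keys) tuples):

def coalesce_put_tombstones(records, deferred=None):
    # 'records' is one GetRecords batch; 'deferred' is a REMOVE held back from
    # the previous batch, since a query window may end right between the ts-1
    # tombstone and the ts write. Returns (visible_records, new_deferred).
    out, pending = [], deferred
    for rec in records:
        if pending is not None:
            ts, _, keys = pending
            if rec[0] == ts + 1 and rec[2] == keys:
                pending = None              # PutItem's synthetic REMOVE: drop it
            else:
                out.append(pending)         # a genuine delete: emit it, one step late
                pending = None
        if rec[1] == 'REMOVE':
            pending = rec                   # can't decide yet: defer
        else:
            out.append(rec)
    return out, pending

As described above, a long run of genuine deletes keeps the last REMOVE deferred across every window boundary - the "constant staggering" downside.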

The same more or less applies to the query limit, though it is much more easily handled there.
It strikes me that the pattern "tombstone at ts-1" is common enough that we should maybe formalize it and recognize it in CDC, and generate something identifiable for it. It is, after all, used for all non-atomics etc. as well. Maybe coalesce such a pattern into a cdc_op::replace? @kbr, @haaawk - thoughts?

OTOH, I don't think Alternator giving a slightly different event stream in these cases is a super huge deal. It does technically express the same thing. It would be nice to know if it really would hamper a real user...

@haaawk
Contributor

haaawk commented Aug 19, 2020

On paper it looks like it would be solvable in get_records, but there are some caveats, all due to selection windows and the result limit. Since the tombstone marking "it's a new row" is at ts-1, it would be perfectly possible to get a query window that includes the tombstone but not the actual insert. So, to handle this we would pretty much be forced to always hold back a trailing REMOVE and adjust the next query window to see if it is followed by an insert at the REMOVE's ts+1.
But this would instead somewhat break cases where we have a lot of actual deletes in a row, in which case we would be constantly staggering. Perhaps manageable, but not nice.

The same more or less applies to the query limit, though it is much more easily handled there.
It strikes me that the pattern "tombstone at ts-1" is common enough that we should maybe formalize it and recognize it in CDC, and generate something identifiable for it. It is, after all, used for all non-atomics etc. as well. Maybe coalesce such a pattern into a cdc_op::replace? @kbr, @haaawk - thoughts?

OTOH, I don't think Alternator giving a slightly different event stream in these cases is a super huge deal. It does technically express the same thing. It would be nice to know if it really would hamper a real user...

At some point we were thinking about marking this tombstone with ts instead of ts-1. This works fine with CQL because update SET x = null using timestamp T actually uses T-1. I don't remember whether we did this in the end or not. @kbr- do you remember? I think we did.

@kbr-
Contributor

kbr- commented Aug 19, 2020

  1. This should happen only for collections, because that's where the T-1 tombstone appears. The OP says this happens for every PutItem. Does PutItem refer to collections only?
  2. Actually this shouldn't happen for non-TTLed operations even with collections. Today, if we replace a collection without a TTL, we show it as a single delta entry in the CDC log; we no longer split it.

The issue was reported 23 days ago - @nyh, can you confirm this is still happening?

@nyh
Contributor Author

nyh commented Aug 19, 2020

The test still xfails (this is why I love test-first development).

test/alternator/run --runxfail test_streams.py::test_streams_putitem_keys_only

@elcallio
Contributor

@kbr Alternator put_item explicitly puts a row tombstone at ts-1 when doing insert. See executor.cc:1277

@nyh
Contributor Author

nyh commented Aug 19, 2020

@kbr Alternator put_item explicitly puts a row tombstone at ts-1 when doing insert. See executor.cc:1277

Right, I'll explain. When Alternator wishes to replace an item completely (not merge some new attributes with old attributes) it needs to delete the old item and then write the new item. If a write and a delete are done with the same timestamp, the delete wins, which is why we must do the delete at ts-1. This is not a new Alternator invention - Scylla already does exactly the same thing when replacing a collection: it puts a range tombstone for the collection at timestamp ts-1 and the new collection cells with timestamp ts.
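
As a toy illustration of that timestamp rule (plain Python, not Scylla's actual reconciliation code):

# Each write is (timestamp, is_tombstone). Higher timestamp wins; on a tie,
# the tombstone (deletion) wins over live data - hence the ts-1 tombstone.
def reconcile(a, b):
    if a[0] != b[0]:
        return a if a[0] > b[0] else b
    return a if a[1] else b

new_write = (10, False)                                   # PutItem's new value at ts
assert reconcile((9, True), new_write) == new_write       # tombstone at ts-1: value survives
assert reconcile((10, True), new_write) == (10, True)     # tombstone at ts: delete would win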

@nyh
Contributor Author

nyh commented Aug 19, 2020

OTOH, I don't think Alternator giving a slightly different event stream in these cases is a super huge deal. It does technically express the same thing. It would be nice to know if it really would hamper a real user...

This is a good question. If the item did exist before, replacing it is indeed equivalent to a REMOVE followed by an INSERT (See #6918 on the separate issues of an item which didn't exist before, and on INSERT vs MODIFY). I think we need to learn more about how real users use DynamoDB streams, and how much they might care about this issue and about #6918.

So maybe these issues should be considered very low priority. On the other hand, it would have been nicer to have perfect DynamoDB compatibility, without making excuses for why we chose not to be compatible.

@kbr-
Contributor

kbr- commented Aug 19, 2020

Writing a cell completely replaces the older versions of this cell.
I don't understand why you need a row tombstone?

Unless:

  • you have multiple columns in the schema
  • you want to write a completely new row that is empty for all columns except one column with some new value.

So if I understand correctly

  • those alternator "items" are rows composed of multiple cells which represent attributes,
  • changing a single attribute is adding a single cell
  • replacing an item is putting a row tombstone and adding a cell

Okay, if you're asking me whether we should give special treatment to mutations which have a row tombstone with timestamp T-1 and a cell with timestamp T, my answer is: I don't know, but I'd probably refrain from doing that.

First, it would also affect updates coming from CQL. So if someone in CQL creates a batch as follows:

begin batch
delete from ks.t using timestamp 42 where pk = 0 and ck = 0;
update ks.t using timestamp 43 set v = 0 where pk = 0 and ck = 0;
apply batch;

they would get a single entry in CDC (instead of two) with a completely new type of operation due to our special handling.

Second, CDC code is complex already. Believe me, you don't want to have any more edge cases there.

Generally speaking, CDC was written with CQL in mind. That's why it has special handling for the collection tombstone case. If now we mix-in Alternator-frontend-specific logic to the code, we're going to make a mess. And I'm afraid the problem discussed in this thread is not the first problem we'll run into.

I think we should have a translation layer between CDC log table and Alternator streams which detects such Alternator-frontend-specific stuff (maybe before even considering query windows etc. so we don't run into problems such as the one Calle described) and makes the necessary adjustments.

@elcallio
Contributor

Instead of just setting ts-1 for the tombstone, we could, for internal purposes, mark it in such a way that it is recognized as not being user generated - for all such cases in Scylla, i.e. collections etc. It should effectively make CDC easier.
I'm not sure how such a mark would look, or whether we have any status bits or anything to play with, but it would extend the expressiveness of our internal manipulations so we can detect what we mean, not what we do.

@nyh
Contributor Author

nyh commented Aug 19, 2020

Writing a cell completely replaces the older versions of this cell.
I don't understand why you need a row tombstone?

Because we want to replace the entire row - not a single cell.
Imagine that, no matter what this row previously contained, you want it to end up containing only "a=3". Even if it previously had "b=4", you need to remove that as well - not merge and end up with both "a=3" and "b=4".
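
A toy illustration of the difference, with plain Python dicts standing in for items (not real Alternator code):

old_item = {'p': 'dog', 'a': 1, 'b': 4}

# PutItem({'p': 'dog', 'a': 3}) replaces the whole item - 'b' must disappear:
after_put = {'p': 'dog', 'a': 3}

# An attribute-merging update (UpdateItem-style) would instead give:
after_update = {**old_item, 'a': 3}          # {'p': 'dog', 'a': 3, 'b': 4}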

As I said, Scylla also has exactly the same need in a CQL operation which wants to set - i.e., replace - a collection (e.g., a set). It also needs to delete the previous contents of the collection (this is a range tombstone) and write new contents. In that existing code originating in Cassandra, Scylla uses timestamp ts-1 for the range tombstone, and in Alternator we did exactly the same when replacing an item.

Unless:

* you have multiple columns in the schema
* you want to write a completely new row that is empty for all columns except one column with some new value.

Exactly, but (1) it doesn't have to be "except one column", and (2) some of the columns are collections (in Alternator, we have one main map column) and the entire collection must be cleared - we don't know which specific items we need to delete, so we need a tombstone.

First, it would also affect updates coming from CQL.

I suggest you add to your tests a CQL test which replaces a collection with a new value (there's CQL syntax for this!). In the CDC log, you'll see a deletion at timestamp TS-1 and the new cells at timestamp TS. Consider whether that is what you want to see - or not.

So if someone in CQL creates a batch as follows:

begin batch
delete from ks.t using timestamp 42 where pk = 0 and ck = 0;
update ks.t using timestamp 43 set v = 0 where pk = 0 and ck = 0;
apply batch;

they would get a single entry in CDC (instead of two) with a completely new type of operation due to our special handling.

Right. Note that in Alternator we assume the user cannot choose the timestamp on their own. So a user cannot manufacture this case.

Second, CDC code is complex already. Believe me, you don't want to have any more edge cases there.

Generally speaking, CDC was written with CQL in mind. That's why it has special handling for the collection tombstone case. If now we mix-in Alternator-frontend-specific logic to the code, we're going to make a mess. And I'm afraid the problem discussed in this thread is not the first problem we'll run into.

I think we should have a translation layer between CDC log table and Alternator streams which detects such Alternator-frontend-specific stuff (maybe before even considering query windows etc. so we don't run into problems such as the one Calle described) and makes the necessary adjustments.

Yes, sounds like a reasonable approach, but might not be easy because of the time window issues that Calle described.

@haaawk
Contributor

haaawk commented Aug 19, 2020 via email

@elcallio
Contributor

elcallio commented Aug 19, 2020

@haaawk we already do ts-1 detection in CDC code, to deal with collection tombstones and avoid generating too many CDC rows. I don't think it unreasonable that even this code would benefit from the ts-1 pattern being marked as "generated by internal code" and thus treated explicitly. In fact, it would be safer.
Again, I am not advocating we add more assumptions to CDC. I am advocating that the ts-1 tombstone pattern needs to be better identifiable as "created by Scylla code to emulate certain ops" (like replacing a collection, or a row). Right now we are making assumptions based on just timestamps, which is generally bad, because even in Alternator, where user timestamps do not exist, it would still be perfectly possible to generate a delete + put with exactly one timestamp tick between them - just not very likely.

@kbr-
Contributor

kbr- commented Aug 19, 2020

@nyh

Some of the columns are collections (in Alternator, we have one main map column) and the entire collection must be cleared - we don't know which specific items we need to delete, so we need a tombstone.

You can use a collection tombstone for that, you don't need a row tombstone.

I suggest you add to your tests a CQL test which replaces a collection with a new value (there's CQL syntax for this!). In the CDC log, you'll see a deletion at timestamp TS-1 and the new cells at timestamp TS. Consider whether that is what you want to see - or not.

No, I'll see one entry with one timestamp. We changed it some time ago (#6491).

cqlsh> create keyspace ks with replication = {'class':'SimpleStrategy', 'replication_factor': 1};
cqlsh> create table ks.t (pk int, ck int, v map<int, int>, primary key (pk, ck)) with cdc = {'enabled': true};
cqlsh> update ks.t set v = {0:0} where pk = 0 and ck = 0;
cqlsh> select "cdc$time", v, "cdc$deleted_v", "cdc$deleted_elements_v" from ks.t_scylla_cdc_log;

 cdc$time                             | v      | cdc$deleted_v | cdc$deleted_elements_v
--------------------------------------+--------+---------------+------------------------
 5f1930ec-e223-11ea-1385-76387c3931ec | {0: 0} |          True |                   null

That's the special collection handling that I mentioned in my previous post.

@nyh
Contributor Author

nyh commented Aug 19, 2020

@nyh

Some of the columns are collections (in Alternator, we have one main map column) and the entire collection must be cleared - we don't know which specific items we need to delete, so we need a tombstone.

You can use a collection tombstone for that, you don't need a row tombstone.

This is a very good observation, which I already made for other reasons in #6918 (comment). Whereas we currently use a row tombstone to delete the item, we could use individual cell tombstones (for regular columns, if not overwritten by a new value) plus a collection tombstone (for the ":attrs" column) to delete the item. If CDC already has special handling for collection tombstones at ts-1 - but doesn't have the same handling for row tombstones - then it will indeed solve this problem.

I suggest you add to your tests a CQL test which replaces a collection with a new value (there's CQL syntax for this!). In the CDC log, you'll see a deletion at timestamp TS-1 and the new cells at timestamp TS. Consider whether that is what you want to see - or not.

No, I'll see one entry with one timestamp. We changed it some time ago (#6491).

Oh good, I didn't remember this. So, any particular reason why you changed this specifically just for collection tombstones and not for the similar issue of entire-row tombstones? I guess the main reason is that CQL only does the former, and doesn't have syntax for the latter (for overwriting an entire item)?

@kbr-
Contributor

kbr- commented Aug 19, 2020

@elcallio

we already do ts-1 detection in CDC code, to deal with collection tombstones and avoid generating too many CDC rows

This is done by simply adding 1 to every collection tombstone's timestamp.
The "moral justification" why we can/should do that is the following: if X is a non-frozen column, then the statement
update ... using timestamp T set X = V ...
creates a collection tombstone with timestamp T-1. So we show it in the CDC log under timestamp T - 1 + 1 = T.

So we made a decision based on CQL specifics, because CDC was designed with CQL in mind.

EDIT: @nyh this should answer your last question

I don't think we should do such a thing for row tombstones, because
delete from ... using timestamp T where ...
creates a timestamp T row tombstone; showing it under timestamp T+1 in the CDC log would be very confusing.

@elcallio
Contributor

@kbr - I agree. We should in fact not do any of these things. The tombstones should be marked as "artificial", "created internally", to express something. Then we could instead safely generate more expressive CDC info for whatever is happening.

@kbr-
Contributor

kbr- commented Aug 19, 2020

Today, CDC takes a base table mutation and returns a log table mutation. It's very simple.

What you're proposing would require a significant redesign: we would have to take additional data as input.
It would require changes in other places in the code, not just CDC. For example, we would have to plumb down the information about an "internally generated" collection tombstone all the way from the CQL layer down to storage_proxy together with the mutation.

@elcallio
Contributor

Yes. And I think it might be worth considering doing so. Having more explicit bookkeeping of what we are doing is not a bad thing.

@haaawk
Contributor

haaawk commented Aug 24, 2020

@nyh

Some of the columns are collections (in Alternator, we have one main map column) and the entire collection must be cleared - we don't know which specific items we need to delete, so we need a tombstone.

You can use a collection tombstone for that, you don't need a row tombstone.

This is a very good observation, which I already made for other reasons in #6918 (comment). Whereas we currently use a row tombstone to delete the item, we could use individual cell tombstones (for regular columns, if not overwritten by a new value) plus a collection tombstone (for the ":attrs" column) to delete the item. If CDC already has special handling for collection tombstones at ts-1 - but doesn't have the same handling for row tombstones - then it will indeed solve this problem.

Using cell/collection tombstones instead of row tombstones seems to be the best solution here. It would solve the issue without adding any complexity to CDC itself.

@slivne slivne modified the milestones: 4.3, 4.x Feb 16, 2021
@slivne
Contributor

slivne commented Feb 16, 2021

It's clear this is a difference - and we accept it until we find there is an issue with the approach we have taken. Pushing this out until we get user feedback.
