atomic_cell: compare value last #14183

bhalevy · 2023-06-08T09:09:52Z

Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
apache/cassandra@e225c88 (CASSANDRA-14592),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:

Cassandra compatibility
Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

Fixes #14182

Also, this series:

updates dml documentation
updates internal documentation
updates and adds unit tests and cql pytest reproducing compare_cells_for_merge may wrongly prefer an expired cell over a live and expiring one #14182

avikivity · 2023-06-08T09:54:56Z

When did the change in the algorithm happen (commit)?

bhalevy · 2023-06-08T10:07:08Z

When did the change in the algorithm happen (commit)?

The problem was there early on, when f43836e initially added handling of expirations, already evaluated after comparing the cell value.

avikivity · 2023-06-08T10:23:16Z

When did the change in the algorithm happen (commit)?

The problem was there early on, when f43836e initially added handling of expirations, already evaluated after comparing the cell value.

Ok, so backport all the way to 0.10.

avikivity · 2023-06-08T10:24:19Z

@tgrabiec please review. I don't have intuition about whether the old or new algorithm are more correct.

nyh · 2023-06-08T10:28:48Z

A comment I wrote in the issue:
"If we change the order in our code, we may need to do something to avoid the above "oscillation" problem on mixed clusters during upgrade".
Do you think it's fine to not handle this issue?

avikivity · 2023-06-08T10:30:36Z

I think it's rare enough that we can allow oscillation during upgrade.

nyh · 2023-06-08T10:49:38Z

also add a cql pytest to reproducing #14182.

We already have a test file, "test-timestamp.py", for tests of USING TIMESTAMP. Please add your test to that file - and feel free to rename it in your patch (I admit test_using_timestamp.py is a better name, and I don't know why this test ended up the only one having a minus instead of an underscore in its name...)

scylladb-promoter · 2023-06-08T10:54:49Z

CI state SUCCESS - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/1721/

bhalevy · 2023-06-08T11:13:12Z

also add a cql pytest to reproducing #14182.

We already have a test file, "test-timestamp.py", for tests of USING TIMESTAMP. Please add your test to that file - and feel free to rename it in your patch (I admit test_using_timestamp.py is a better name, and I don't know why this test ended up the only one having a minus instead of an underscore in its name...)

Done in fdf5c99

scylladb-promoter · 2023-06-08T12:50:18Z

CI state FAILURE - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/1725/

scylladb-promoter · 2023-06-08T14:29:32Z

CI state SUCCESS - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/1730/

test/boost/mutation_test.cc

vladzcloudius · 2023-06-08T21:02:15Z

test/boost/mutation_test.cc

+    testlog.debug("Live cells are ordered before expiring cells, regardless of their value");
+    assert_order(
+        atomic_cell::make_live(*bytes_type, 1, bytes("value2")),
+        atomic_cell::make_live(*bytes_type, 1, bytes("value1"), expiry_1, ttl_1));


This test/condition is not obvious. Cells should always be ordered by their write time if their "timestamp" is the same, regardless if they are alive, expired, dead (tombstone) or expiring.

The last written value in this case should always win.

Only expiring/dead cells have a write time. Live cells don't have one.

I see. It kinda makes sense to me now too: they want to prefer the value that has the longest "expire_at" and the alive cell has an infinite value for this (formally).

According to CASSANDRA-14592 they keep expiring cells because they'll become tombstones in the future and then they'll trump live cells. If they kept the non-expiring cell (as it will survive longer), the expiration would be lost. Their description of the issue indicates that this inconsistency depends on WHEN the comparison took place, as if it was made early enough, before the cell expired, then the live cell could have been picked (based on its value), but if the comparison took place later on, after the cell expired, then it would be considered as a tombstone and then it would prevail, so preferring expiring cells over live ones is there to be consistent with the rule the prefers tombstones over live cells.

vladzcloudius · 2023-06-08T21:05:44Z

mutation/atomic_cell.cc

@@ -79,23 +79,20 @@ compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right) {
        return left.is_live() ? std::strong_ordering::less : std::strong_ordering::greater;


I think that giving a live cell preference to not-live cell even if the not-live cell was written after the live cell is not obvious at all.

We have no way to determine this. We use timestamps to determine order and we are in a tie. Our ordering broke down, and we just need some order (bonus if it is also intuitive).

Also, being compatible with C* is a plus.

@denesb Don't we always store a coordinator time together with a "timestamp" (which can be overridden with USING TIMESTAMP)?
Or we only have it if data is TTLed and we have an expire_at?

@bhalevy is the code with this patch copies the latest C* logic 1-to-1?

@vladzcloudius it doesn't copy the C* logic but it is compatible with it, using a different implementation.
Also, it extend C*'s algorithm as we examine the cell remaining ttl, while C* doesn't so that. They stop processing at the expiry time and if both cells expire at the same time, their value is compared.

Dead cell wins with a live cell because expiring cells turn into dead cells of the same timestamp, so they are later in the lifecycle of a cell.

test/boost/mutation_test.cc

vladzcloudius · 2023-06-09T14:33:48Z

mutation/atomic_cell.cc

        if (left.is_live_and_has_ttl() != right.is_live_and_has_ttl()) {
            // prefer expiring cells.
            return left.is_live_and_has_ttl() ? std::strong_ordering::greater : std::strong_ordering::less;
        }
        if (left.is_live_and_has_ttl()) {
            if (left.expiry() != right.expiry()) {
                return left.expiry() <=> right.expiry();
-            } else {
+            } else if (right.ttl() != left.ttl()) {
                // prefer the cell that was written later,


This comment doesn't seem to reflect the new logic: the following comparison is going to order based on the TTL value - not the "expiry" (which would match the old comment).

You probably want to move this comment above line 87 and add a different comment here.

The comment does refer to selection by ttl, although it's not obvious.
Having shorter ttl with same expiry means that the derived write time is later.
With that in mind, the cell becomes purgeable gc_grace_seconds after its write time, hence it will survive for longer (as a tombstone).

vladzcloudius

The code makes sense to me logically.
Beyond a few cosmetic changes I requested yesterday and today it looks good to me.

avikivity · 2023-06-09T21:45:11Z

The patch fixes the issue, but that's only because they are phrased the same way.

Issue: we should do X
Patch: do X

That's convenient, but it avoids the question of what's the right behavior, what's the user impact, and what should be written in the release notes.

bhalevy · 2023-06-10T05:22:46Z

The patch fixes the issue, but that's only because they are phrased the same way.

Issue: we should do X Patch: do X

That's convenient, but it avoids the question of what's the right behavior, what's the user impact, and what should be written in the release notes.

I edited the issue title and description to describe the user visible impact rather than the technical details.

nyh

Although I understand this patch, I don't like the fact that it's motivation is ambiguous, and it makes it harder to review if the implementation lives up to the motivation. I can think of three related, but different motivations:

@vladzcloudius believes that this this patch fixes a real "error", and your commit message also claims it avoids a "weird corner case". What is this error / corner case, and without explaining what it is, how can I review that the code fixes it and the test reproduces it?
The fact that Cassandra's ordering is different from Scylla's was mentioned several times, and a possible motivation was to make them the same. But did we? The test is marked "scylla_only", suggesting it fails on Cassandra. But why?
Why I try to explain the motivation of this patch to myself, it's not about the TTL at all! Instead, it's about what to do when a user send different writes A and write B with the same timestamp - how can we try (if possible) to make sure either A or B gets written but not a mixture half-A-half-B? When writes have expiration-times, we are lucky: In CQL, when you do a write, all of it will have the same expiration time. So all the cells in A have the same expiration time tA, and all the cells in B have the same expiration tB. If tA != tB (e.g., we have default-TTL and more than a second passed between the two writes), if we base the reconciliation decision on the expiration time, not cell values - we can consistently choose either A and B without mixing them! If we're lucky, we shouldn't waste this luck!

I think the real motivation is, or should be, 3. Stating that this is the motivation isn't just nice for understanding, or for understanding how serious (or not) this issue is, it can also change the way you look at testing this issue. Benny, your test has a lot of sleeps waiting until the the data actually expires. Although it might make sense to have a slow test that exactly reproduces the same scenario that the user saw, I think you can actually reproduce the usefulness of this patch in one second (unfortunately, not less) - without waiting for anything to expire: Do a write A of two int cells with values 0, 1 with TTL 1000, sleep a whole second (unfortunately), then do a write B of the cells with values 1, 0 with TTL 1000. Now read. Before this patch, I believe you'll read a mixture of A and B (I think 0,0)). After this patch, you'll read one of A or B - not a mixture of the two. Also, unless I'm missing something, the same test should pass on Cassandra.

nyh · 2023-06-11T06:24:54Z

test/cql-pytest/test_using_timestamp.py

@@ -90,3 +90,56 @@ def test_key_writetime(cql, table1):
        cql.execute(f'SELECT writetime(k) FROM {table1}')
    with pytest.raises(InvalidRequest, match='PRIMARY KEY part k'):
        cql.execute(f'SELECT ttl(k) FROM {table1}')
+
+# Reproducer for https://github.com/scylladb/scylladb/issues/14182
+def test_rewrite_using_same_timeout_and_ttl(scylla_only, cql, table1):


Wait, what?
I thought the main point of this patch was to behave like Cassandra always did in this specific case. So why a scylla_only test case?

I wasn't sure at the time that we're indeed compatible.
Today I'm more confident that we are (with recent enough version of Cassandra).
I'll test that in the next version of this series.

nyh · 2023-06-11T06:27:30Z

test/cql-pytest/test_using_timestamp.py

+            errors.append(f"Expected (k={k}, v={expected}) after immediate rewrite in iteration {i}, but got {res[0]}")
+
+        delay = 4
+        time.sleep(delay)


Sad you need such a long sleep. To be honest, I don't understand why you need to sleep at all, can't you read the data and its ttl() immediately and see this problem without needing to wait for the TTL?
This also relates to my question about what it is that we are really fixing in this PR, I'll write it as a separate comment.

I'd like to reproduce the actual issue, rather than testing for indirect evidence.
But I'll see if it can be reproduced reliably using short timeouts.

Sometimes after you understand an issue you understand that although the customer noticed the problem only after the expiration, if they used "SELECT ttl(...)" they would have seen it immediately after the second write, so you don't need to wait for the expiration. But I guess I'm ok with this if a couple of 1-second sleeps are involved, and not a dozen 4-second sleeps.

nyh · 2023-06-11T06:28:47Z

test/cql-pytest/test_using_timestamp.py

+            errors.append(f"Expected (k={k}, v={expected}) after first expiration in iteration {i}, but got {res[0]}")
+
+        expire2 = t2 + ttl2
+        time.sleep(expire2 - time.time() + 2)


Please remember that typical cql-pytest tests take time measured in milliseconds, and think if there is no way to do this test without so many multi-second sleeps. I think no sleeps are really needed here (see above) but maybe I'm missing something?

I wanted to reproduce the null value issue, and that requires the first insert to expire.

bhalevy · 2023-06-11T09:41:37Z

In v2 (ea9cb20)

rebased
cleaned up changes to unit test (separating cosmetic changes from functional ones)
added and update source documentation for compare_atomic_cell_for_merge
Update user-facing doc: docs/cql/dml.rst with a new, Update ordering section

scylladb-promoter · 2023-06-11T09:44:30Z

CI state FAILURE - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/1749/

bhalevy · 2023-06-11T09:47:40Z

Although I understand this patch, I don't like the fact that it's motivation is ambiguous, and it makes it harder to review if the implementation lives up to the motivation.

Good point. Let's clear that out.

I can think of three related, but different motivations:

@vladzcloudius believes that this this patch fixes a real "error", and your commit message also claims it avoids a "weird corner case". What is this error / corner case, and without explaining what it is, how can I review that the code fixes it and the test reproduces it?

Sure. I'll describe the case more clearly.

The fact that Cassandra's ordering is different from Scylla's was mentioned several times, and a possible motivation was to make them the same. But did we? The test is marked "scylla_only", suggesting it fails on Cassandra. But why?

Today, after reviewing both implementation, I convinced myself the implementation are now compatible and the scylla_only marker can be removed (will do in next version of the series)

Why I try to explain the motivation of this patch to myself, it's not about the TTL at all! Instead, it's about what to do when a user send different writes A and write B with the same timestamp - how can we try (if possible) to make sure either A or B gets written but not a mixture half-A-half-B? When writes have expiration-times, we are lucky: In CQL, when you do a write, all of it will have the same expiration time. So all the cells in A have the same expiration time tA, and all the cells in B have the same expiration tB. If tA != tB (e.g., we have default-TTL and more than a second passed between the two writes), if we base the reconciliation decision on the expiration time, not cell values - we can consistently choose either A and B without mixing them! If we're lucky, we shouldn't waste this luck!

That's a good point and another motivator for this change (that I didn't think about).

I think the real motivation is, or should be, 3. Stating that this is the motivation isn't just nice for understanding, or for understanding how serious (or not) this issue is, it can also change the way you look at testing this issue. Benny, your test has a lot of sleeps waiting until the the data actually expires. Although it might make sense to have a slow test that exactly reproduces the same scenario that the user saw, I think you can actually reproduce the usefulness of this patch in one second (unfortunately, not less) - without waiting for anything to expire: Do a write A of two int cells with values 0, 1 with TTL 1000, sleep a whole second (unfortunately), then do a write B of the cells with values 1, 0 with TTL 1000. Now read. Before this patch, I believe you'll read a mixture of A and B (I think 0,0)). After this patch, you'll read one of A or B - not a mixture of the two. Also, unless I'm missing something, the same test should pass on Cassandra.

I'll try that.

bhalevy · 2023-06-11T09:55:57Z

When did the change in the algorithm happen (commit)?

@avikivity our original implementation was based on a respectively old version of Cassandra.
They changed their algorithm in apache/cassandra@e225c88 (CASSANDRA-14592), but we didn't update our implementation correspondingly.

bhalevy · 2023-06-11T11:23:53Z

In v3 (3214dbf):

Improved user-facing docs
Improved annotations in git commits
cql-pytest
- reduce timeouts
- removed scylla_only marker
- added test case for updating multiple cells
updated description of this PR to clarify the motivation for the change.

bhalevy · 2023-06-19T05:17:33Z

This test file was already reamed when we noticed its odd name last week, you don't need to send another patch to do that.

@nyh I don't see such PR nor a patch on the mailing list.
Care to provide a reference so I know when to rebase and on what?

nyh · 2023-06-19T07:29:30Z

@bhalevy this is so funny, we managed to confuse each other. It was you, who already added a patch in this PR, to rename test-timestamp.py to test_using_timestamp.py. It wasn't another PR - it was this one :-) So you don't need to submit another PR to rename this patch, you already did, this one...

Of course, if you want you can split out this rename from this PR and send it as a separate PR. Maybe that's what you meant you wanted to do, and I just misunderstood you?

bhalevy · 2023-06-19T08:00:58Z

@bhalevy this is so funny, we managed to confuse each other. It was you, who already added a patch in this PR, to rename test-timestamp.py to test_using_timestamp.py. It wasn't another PR - it was this one :-) So you don't need to submit another PR to rename this patch, you already did, this one...

Of course, if you want you can split out this rename from this PR and send it as a separate PR. Maybe that's what you meant you wanted to do, and I just misunderstood you?

Yes, I want to separate it since it's fixing a preexisting issue and would have different backport requirements.

bhalevy · 2023-06-19T08:30:30Z

#14293

Currently, it is hard to tell which of the many sub-cases fail in this unit test, in case any of them fails. This change uses logging in debug and trace level to help with that by reproducing the error with --logger-log-level testlog=trace (The cases are deterministic so reproducing should not be a problem) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was changed in CASSANDRA-14592 for consistency with the preference for dead cells over live cells, as expiring cells will become tombstones at a future time and then they'd win over live cells with the same timestamp, hence they should win also before expiration. In addition, comparing the cell value before expiration can lead to unintuitive corner cases where rewriting a cell using the same timestamp but different TTL may cause scylla to return the cell with null value if it expired in the meanwhile. Also, when multiple columns are written using two upserts using the same write timestamp but with different expiration, selecting cells by their value may return a mixed result where each cell is selected individually from either upsert, by picking the cells with the largest values for each column, while using the expiration time to break tie will lead to a more consistent results where a set of cell from only one of the upserts will be selected. Fixes scylladb#14182 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

…e expiry is the same As in compare_atomic_cell_for_merge, we want to consider the row marker ttl for ordering, in case both are expiring and have the same expiration time. This was missed in a57c087 and a085ef7. With that in mind, add documentation to compare_row_marker_for_merge and a mutual note to both functions about their equivalence. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

…imestamp Add reproducers for scylladb#14182: test_rewrite_different_values_using_same_timestamp verifies expiration-based cell reconciliation. test_rewrite_different_values_using_same_timestamp_and_expiration is a scylla_only test, verifying that when two cells with same timestamp and same expiration are compared, the one with the lesser ttl prevails. test_rewrite_using_same_timestamp_select_after_expiration reproduces the specific issue hit in scylladb#14182 where a cell is selected after it expires since it has a lexicographically larger value than the other cell with later expiration. test_rewrite_multiple_cells_using_same_timestamp verifies atomicity of inserts of multiple columns, with a TTL. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

bhalevy · 2023-06-20T07:41:10Z

39aea90: rebased

and add docs/dev/timestamp-conflict-resolution.md to document the details of the conflict resolution algorithm. Refs scylladb#14063 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

bhalevy · 2023-06-20T08:57:06Z

26ff8f7: fixed warnings in docs/cql/dml.rst regarding empty lines in bullet-points list

scylladb-promoter · 2023-06-20T10:11:04Z

CI state SUCCESS - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/1939/

DoronArazii · 2023-06-20T10:57:16Z

@avikivity please merge

Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value. To summarize, the motivation for this change is three fold: 1. Cassandra compatibility 2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration. 3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time. Fixes scylladb#14182 Also, this series: - updates dml documentation - updates internal documentation - updates and adds unit tests and cql pytest reproducing scylladb#14182 Closes scylladb#14183 * github.com:scylladb/scylladb: docs: dml: add update ordering section cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same atomic_cell: compare_atomic_cell_for_merge: update and add documentation compare_atomic_cell_for_merge: compare value last for live cells mutation_test: test_cell_ordering: improve debuggability (cherry picked from commit 87b4606)

Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value. To summarize, the motivation for this change is three fold: 1. Cassandra compatibility 2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration. 3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time. \Fixes scylladb#14182 Also, this series: - updates dml documentation - updates internal documentation - updates and adds unit tests and cql pytest reproducing scylladb#14182 \Closes scylladb#14183 * github.com:scylladb/scylladb: docs: dml: add update ordering section cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same atomic_cell: compare_atomic_cell_for_merge: update and add documentation compare_atomic_cell_for_merge: compare value last for live cells mutation_test: test_cell_ordering: improve debuggability (cherry picked from commit 87b4606)

Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value. To summarize, the motivation for this change is three fold: 1. Cassandra compatibility 2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration. 3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time. \Fixes #14182 Also, this series: - updates dml documentation - updates internal documentation - updates and adds unit tests and cql pytest reproducing #14182 \Closes #14183 * github.com:scylladb/scylladb: docs: dml: add update ordering section cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same atomic_cell: compare_atomic_cell_for_merge: update and add documentation compare_atomic_cell_for_merge: compare value last for live cells mutation_test: test_cell_ordering: improve debuggability (cherry picked from commit 87b4606) Closes #14647

Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value. To summarize, the motivation for this change is three fold: 1. Cassandra compatibility 2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration. 3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time. \Fixes #14182 Also, this series: - updates dml documentation - updates internal documentation - updates and adds unit tests and cql pytest reproducing #14182 \Closes #14183 * github.com:scylladb/scylladb: docs: dml: add update ordering section cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same atomic_cell: compare_atomic_cell_for_merge: update and add documentation compare_atomic_cell_for_merge: compare value last for live cells mutation_test: test_cell_ordering: improve debuggability (cherry picked from commit 87b4606) Closes #14649

Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value. To summarize, the motivation for this change is three fold: 1. Cassandra compatibility 2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration. 3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time. \Fixes #14182 Also, this series: - updates dml documentation - updates internal documentation - updates and adds unit tests and cql pytest reproducing #14182 \Closes #14183 * github.com:scylladb/scylladb: docs: dml: add update ordering section cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same atomic_cell: compare_atomic_cell_for_merge: update and add documentation compare_atomic_cell_for_merge: compare value last for live cells mutation_test: test_cell_ordering: improve debuggability (cherry picked from commit 87b4606) Closes #14651

Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value. To summarize, the motivation for this change is three fold: 1. Cassandra compatibility 2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration. 3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time. \Fixes scylladb#14182 Also, this series: - updates dml documentation - updates internal documentation - updates and adds unit tests and cql pytest reproducing scylladb#14182 \Closes scylladb#14183 * github.com:scylladb/scylladb: docs: dml: add update ordering section cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same atomic_cell: compare_atomic_cell_for_merge: update and add documentation compare_atomic_cell_for_merge: compare value last for live cells mutation_test: test_cell_ordering: improve debuggability (cherry picked from commit 87b4606)

bhalevy requested review from nyh and tgrabiec as code owners June 8, 2023 09:09

bhalevy force-pushed the atomic_cell-compare-value-last branch from b7ae290 to 9eae75f Compare June 8, 2023 09:22

bhalevy force-pushed the atomic_cell-compare-value-last branch from 9eae75f to 8cd6755 Compare June 8, 2023 11:12

bhalevy force-pushed the atomic_cell-compare-value-last branch from 8cd6755 to fdf5c99 Compare June 8, 2023 11:14

vladzcloudius suggested changes Jun 8, 2023

View reviewed changes

vladzcloudius suggested changes Jun 9, 2023

View reviewed changes

nyh requested changes Jun 11, 2023

View reviewed changes

bhalevy force-pushed the atomic_cell-compare-value-last branch from fdf5c99 to ea9cb20 Compare June 11, 2023 09:38

bhalevy force-pushed the atomic_cell-compare-value-last branch from ea9cb20 to 3214dbf Compare June 11, 2023 11:21

bhalevy requested a review from nyh June 11, 2023 11:24

tgrabiec approved these changes Jun 19, 2023

View reviewed changes

bhalevy added 5 commits June 20, 2023 10:10

atomic_cell: compare_atomic_cell_for_merge: update and add documentation

6717e45

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

bhalevy force-pushed the atomic_cell-compare-value-last branch from 6dd2e15 to 39aea90 Compare June 20, 2023 07:40

docs: dml: add update ordering section

26ff8f7

and add docs/dev/timestamp-conflict-resolution.md to document the details of the conflict resolution algorithm. Refs scylladb#14063 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

bhalevy force-pushed the atomic_cell-compare-value-last branch from 39aea90 to 26ff8f7 Compare June 20, 2023 08:56

scylladb-promoter merged commit 87b4606 into scylladb:master Jun 20, 2023
4 checks passed

bhalevy mentioned this pull request Jul 12, 2023

[Backport 5.3] Merge 'atomic_cell: compare value last' from Benny Halevy #14647

Closed

bhalevy mentioned this pull request Jul 12, 2023

[Backport 5.2] atomic cell compare value last #14649

Closed

bhalevy mentioned this pull request Jul 12, 2023

[Backport 5.1] Merge 'atomic_cell: compare value last' from Benny Halevy #14651

Closed

bhalevy mentioned this pull request Jul 12, 2023

[Backport 5.0] Merge 'atomic_cell: compare value last' from Benny Halevy #14659

Closed

		@@ -79,23 +79,20 @@ compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right) {
		return left.is_live() ? std::strong_ordering::less : std::strong_ordering::greater;

atomic_cell: compare value last #14183

atomic_cell: compare value last #14183

Conversation

bhalevy commented Jun 8, 2023 • edited

avikivity commented Jun 8, 2023

bhalevy commented Jun 8, 2023

avikivity commented Jun 8, 2023

avikivity commented Jun 8, 2023

nyh commented Jun 8, 2023

avikivity commented Jun 8, 2023

nyh commented Jun 8, 2023

scylladb-promoter commented Jun 8, 2023

bhalevy commented Jun 8, 2023 • edited

scylladb-promoter commented Jun 8, 2023

scylladb-promoter commented Jun 8, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vladzcloudius Jun 9, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vladzcloudius Jun 9, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vladzcloudius left a comment

Choose a reason for hiding this comment

avikivity commented Jun 9, 2023

bhalevy commented Jun 10, 2023

nyh left a comment • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bhalevy commented Jun 11, 2023

scylladb-promoter commented Jun 11, 2023

bhalevy commented Jun 11, 2023

bhalevy commented Jun 11, 2023

bhalevy commented Jun 11, 2023

bhalevy commented Jun 19, 2023

nyh commented Jun 19, 2023

bhalevy commented Jun 19, 2023

bhalevy commented Jun 19, 2023

bhalevy commented Jun 20, 2023

bhalevy commented Jun 20, 2023

scylladb-promoter commented Jun 20, 2023

DoronArazii commented Jun 20, 2023

bhalevy commented Jun 8, 2023 •

edited

bhalevy commented Jun 8, 2023 •

edited

vladzcloudius Jun 9, 2023 •

edited

vladzcloudius Jun 9, 2023 •

edited

nyh left a comment •

edited