-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
atomic_cell: compare value last #14183
atomic_cell: compare value last #14183
Conversation
b7ae290
to
9eae75f
Compare
When did the change in the algorithm happen (commit)? |
The problem was there early on, when f43836e initially added handling of expirations, already evaluated after comparing the cell value. |
Ok, so backport all the way to 0.10. |
@tgrabiec please review. I don't have intuition about whether the old or new algorithm are more correct. |
A comment I wrote in the issue: |
I think it's rare enough that we can allow oscillation during upgrade. |
We already have a test file, "test-timestamp.py", for tests of USING TIMESTAMP. Please add your test to that file - and feel free to rename it in your patch (I admit test_using_timestamp.py is a better name, and I don't know why this test ended up the only one having a minus instead of an underscore in its name...) |
CI state |
9eae75f
to
8cd6755
Compare
Done in fdf5c99 |
8cd6755
to
fdf5c99
Compare
CI state |
CI state |
testlog.debug("Live cells are ordered before expiring cells, regardless of their value"); | ||
assert_order( | ||
atomic_cell::make_live(*bytes_type, 1, bytes("value2")), | ||
atomic_cell::make_live(*bytes_type, 1, bytes("value1"), expiry_1, ttl_1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test/condition is not obvious. Cells should always be ordered by their write time if their "timestamp" is the same, regardless if they are alive, expired, dead (tombstone) or expiring.
The last written value in this case should always win.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only expiring/dead cells have a write time. Live cells don't have one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. It kinda makes sense to me now too: they want to prefer the value that has the longest "expire_at" and the alive cell has an infinite value for this (formally).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to CASSANDRA-14592 they keep expiring cells because they'll become tombstones in the future and then they'll trump live cells. If they kept the non-expiring cell (as it will survive longer), the expiration would be lost. Their description of the issue indicates that this inconsistency depends on WHEN the comparison took place, as if it was made early enough, before the cell expired, then the live cell could have been picked (based on its value), but if the comparison took place later on, after the cell expired, then it would be considered as a tombstone and then it would prevail, so preferring expiring cells over live ones is there to be consistent with the rule the prefers tombstones over live cells.
@@ -79,23 +79,20 @@ compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right) { | |||
return left.is_live() ? std::strong_ordering::less : std::strong_ordering::greater; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that giving a live cell preference to not-live cell even if the not-live cell was written after the live cell is not obvious at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have no way to determine this. We use timestamps to determine order and we are in a tie. Our ordering broke down, and we just need some order (bonus if it is also intuitive).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, being compatible with C* is a plus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@denesb Don't we always store a coordinator time together with a "timestamp" (which can be overridden with USING TIMESTAMP
)?
Or we only have it if data is TTLed and we have an expire_at
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bhalevy is the code with this patch copies the latest C* logic 1-to-1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vladzcloudius it doesn't copy the C* logic but it is compatible with it, using a different implementation.
Also, it extend C*'s algorithm as we examine the cell remaining ttl, while C* doesn't so that. They stop processing at the expiry time and if both cells expire at the same time, their value is compared.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dead cell wins with a live cell because expiring cells turn into dead cells of the same timestamp, so they are later in the lifecycle of a cell.
mutation/atomic_cell.cc
Outdated
if (left.is_live_and_has_ttl() != right.is_live_and_has_ttl()) { | ||
// prefer expiring cells. | ||
return left.is_live_and_has_ttl() ? std::strong_ordering::greater : std::strong_ordering::less; | ||
} | ||
if (left.is_live_and_has_ttl()) { | ||
if (left.expiry() != right.expiry()) { | ||
return left.expiry() <=> right.expiry(); | ||
} else { | ||
} else if (right.ttl() != left.ttl()) { | ||
// prefer the cell that was written later, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment doesn't seem to reflect the new logic: the following comparison is going to order based on the TTL value - not the "expiry" (which would match the old comment).
You probably want to move this comment above line 87 and add a different comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment does refer to selection by ttl, although it's not obvious.
Having shorter ttl with same expiry means that the derived write time is later.
With that in mind, the cell becomes purgeable gc_grace_seconds after its write time, hence it will survive for longer (as a tombstone).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code makes sense to me logically.
Beyond a few cosmetic changes I requested yesterday and today it looks good to me.
The patch fixes the issue, but that's only because they are phrased the same way. Issue: we should do X That's convenient, but it avoids the question of what's the right behavior, what's the user impact, and what should be written in the release notes. |
I edited the issue title and description to describe the user visible impact rather than the technical details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I understand this patch, I don't like the fact that it's motivation is ambiguous, and it makes it harder to review if the implementation lives up to the motivation. I can think of three related, but different motivations:
- @vladzcloudius believes that this this patch fixes a real "error", and your commit message also claims it avoids a "weird corner case". What is this error / corner case, and without explaining what it is, how can I review that the code fixes it and the test reproduces it?
- The fact that Cassandra's ordering is different from Scylla's was mentioned several times, and a possible motivation was to make them the same. But did we? The test is marked "scylla_only", suggesting it fails on Cassandra. But why?
- Why I try to explain the motivation of this patch to myself, it's not about the TTL at all! Instead, it's about what to do when a user send different writes A and write B with the same timestamp - how can we try (if possible) to make sure either A or B gets written but not a mixture half-A-half-B? When writes have expiration-times, we are lucky: In CQL, when you do a write, all of it will have the same expiration time. So all the cells in A have the same expiration time tA, and all the cells in B have the same expiration tB. If tA != tB (e.g., we have default-TTL and more than a second passed between the two writes), if we base the reconciliation decision on the expiration time, not cell values - we can consistently choose either A and B without mixing them! If we're lucky, we shouldn't waste this luck!
I think the real motivation is, or should be, 3. Stating that this is the motivation isn't just nice for understanding, or for understanding how serious (or not) this issue is, it can also change the way you look at testing this issue. Benny, your test has a lot of sleeps waiting until the the data actually expires. Although it might make sense to have a slow test that exactly reproduces the same scenario that the user saw, I think you can actually reproduce the usefulness of this patch in one second (unfortunately, not less) - without waiting for anything to expire: Do a write A of two int cells with values 0, 1 with TTL 1000, sleep a whole second (unfortunately), then do a write B of the cells with values 1, 0 with TTL 1000. Now read. Before this patch, I believe you'll read a mixture of A and B (I think 0,0)). After this patch, you'll read one of A or B - not a mixture of the two. Also, unless I'm missing something, the same test should pass on Cassandra.
@@ -90,3 +90,56 @@ def test_key_writetime(cql, table1): | |||
cql.execute(f'SELECT writetime(k) FROM {table1}') | |||
with pytest.raises(InvalidRequest, match='PRIMARY KEY part k'): | |||
cql.execute(f'SELECT ttl(k) FROM {table1}') | |||
|
|||
# Reproducer for https://github.com/scylladb/scylladb/issues/14182 | |||
def test_rewrite_using_same_timeout_and_ttl(scylla_only, cql, table1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, what?
I thought the main point of this patch was to behave like Cassandra always did in this specific case. So why a scylla_only test case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure at the time that we're indeed compatible.
Today I'm more confident that we are (with recent enough version of Cassandra).
I'll test that in the next version of this series.
errors.append(f"Expected (k={k}, v={expected}) after immediate rewrite in iteration {i}, but got {res[0]}") | ||
|
||
delay = 4 | ||
time.sleep(delay) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sad you need such a long sleep. To be honest, I don't understand why you need to sleep at all, can't you read the data and its ttl() immediately and see this problem without needing to wait for the TTL?
This also relates to my question about what it is that we are really fixing in this PR, I'll write it as a separate comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to reproduce the actual issue, rather than testing for indirect evidence.
But I'll see if it can be reproduced reliably using short timeouts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes after you understand an issue you understand that although the customer noticed the problem only after the expiration, if they used "SELECT ttl(...)" they would have seen it immediately after the second write, so you don't need to wait for the expiration. But I guess I'm ok with this if a couple of 1-second sleeps are involved, and not a dozen 4-second sleeps.
errors.append(f"Expected (k={k}, v={expected}) after first expiration in iteration {i}, but got {res[0]}") | ||
|
||
expire2 = t2 + ttl2 | ||
time.sleep(expire2 - time.time() + 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remember that typical cql-pytest tests take time measured in milliseconds, and think if there is no way to do this test without so many multi-second sleeps. I think no sleeps are really needed here (see above) but maybe I'm missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to reproduce the null value issue, and that requires the first insert to expire.
fdf5c99
to
ea9cb20
Compare
In v2 (ea9cb20)
|
CI state |
Good point. Let's clear that out.
Sure. I'll describe the case more clearly.
Today, after reviewing both implementation, I convinced myself the implementation are now compatible and the scylla_only marker can be removed (will do in next version of the series)
That's a good point and another motivator for this change (that I didn't think about).
I'll try that. |
@avikivity our original implementation was based on a respectively old version of Cassandra. |
ea9cb20
to
3214dbf
Compare
In v3 (3214dbf):
|
@nyh I don't see such PR nor a patch on the mailing list. |
@bhalevy this is so funny, we managed to confuse each other. It was you, who already added a patch in this PR, to rename test-timestamp.py to test_using_timestamp.py. It wasn't another PR - it was this one :-) So you don't need to submit another PR to rename this patch, you already did, this one... Of course, if you want you can split out this rename from this PR and send it as a separate PR. Maybe that's what you meant you wanted to do, and I just misunderstood you? |
Yes, I want to separate it since it's fixing a preexisting issue and would have different backport requirements. |
Currently, it is hard to tell which of the many sub-cases fail in this unit test, in case any of them fails. This change uses logging in debug and trace level to help with that by reproducing the error with --logger-log-level testlog=trace (The cases are deterministic so reproducing should not be a problem) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was changed in CASSANDRA-14592 for consistency with the preference for dead cells over live cells, as expiring cells will become tombstones at a future time and then they'd win over live cells with the same timestamp, hence they should win also before expiration. In addition, comparing the cell value before expiration can lead to unintuitive corner cases where rewriting a cell using the same timestamp but different TTL may cause scylla to return the cell with null value if it expired in the meanwhile. Also, when multiple columns are written using two upserts using the same write timestamp but with different expiration, selecting cells by their value may return a mixed result where each cell is selected individually from either upsert, by picking the cells with the largest values for each column, while using the expiration time to break tie will lead to a more consistent results where a set of cell from only one of the upserts will be selected. Fixes scylladb#14182 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
…e expiry is the same As in compare_atomic_cell_for_merge, we want to consider the row marker ttl for ordering, in case both are expiring and have the same expiration time. This was missed in a57c087 and a085ef7. With that in mind, add documentation to compare_row_marker_for_merge and a mutual note to both functions about their equivalence. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
…imestamp Add reproducers for scylladb#14182: test_rewrite_different_values_using_same_timestamp verifies expiration-based cell reconciliation. test_rewrite_different_values_using_same_timestamp_and_expiration is a scylla_only test, verifying that when two cells with same timestamp and same expiration are compared, the one with the lesser ttl prevails. test_rewrite_using_same_timestamp_select_after_expiration reproduces the specific issue hit in scylladb#14182 where a cell is selected after it expires since it has a lexicographically larger value than the other cell with later expiration. test_rewrite_multiple_cells_using_same_timestamp verifies atomicity of inserts of multiple columns, with a TTL. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
6dd2e15
to
39aea90
Compare
39aea90: rebased |
and add docs/dev/timestamp-conflict-resolution.md to document the details of the conflict resolution algorithm. Refs scylladb#14063 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
39aea90
to
26ff8f7
Compare
26ff8f7: fixed warnings in docs/cql/dml.rst regarding empty lines in bullet-points list |
CI state |
@avikivity please merge |
Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value. To summarize, the motivation for this change is three fold: 1. Cassandra compatibility 2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration. 3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time. Fixes scylladb#14182 Also, this series: - updates dml documentation - updates internal documentation - updates and adds unit tests and cql pytest reproducing scylladb#14182 Closes scylladb#14183 * github.com:scylladb/scylladb: docs: dml: add update ordering section cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same atomic_cell: compare_atomic_cell_for_merge: update and add documentation compare_atomic_cell_for_merge: compare value last for live cells mutation_test: test_cell_ordering: improve debuggability (cherry picked from commit 87b4606)
Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value. To summarize, the motivation for this change is three fold: 1. Cassandra compatibility 2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration. 3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time. \Fixes scylladb#14182 Also, this series: - updates dml documentation - updates internal documentation - updates and adds unit tests and cql pytest reproducing scylladb#14182 \Closes scylladb#14183 * github.com:scylladb/scylladb: docs: dml: add update ordering section cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same atomic_cell: compare_atomic_cell_for_merge: update and add documentation compare_atomic_cell_for_merge: compare value last for live cells mutation_test: test_cell_ordering: improve debuggability (cherry picked from commit 87b4606)
Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value. To summarize, the motivation for this change is three fold: 1. Cassandra compatibility 2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration. 3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time. \Fixes scylladb#14182 Also, this series: - updates dml documentation - updates internal documentation - updates and adds unit tests and cql pytest reproducing scylladb#14182 \Closes scylladb#14183 * github.com:scylladb/scylladb: docs: dml: add update ordering section cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same atomic_cell: compare_atomic_cell_for_merge: update and add documentation compare_atomic_cell_for_merge: compare value last for live cells mutation_test: test_cell_ordering: improve debuggability (cherry picked from commit 87b4606)
Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value. To summarize, the motivation for this change is three fold: 1. Cassandra compatibility 2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration. 3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time. \Fixes scylladb#14182 Also, this series: - updates dml documentation - updates internal documentation - updates and adds unit tests and cql pytest reproducing scylladb#14182 \Closes scylladb#14183 * github.com:scylladb/scylladb: docs: dml: add update ordering section cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same atomic_cell: compare_atomic_cell_for_merge: update and add documentation compare_atomic_cell_for_merge: compare value last for live cells mutation_test: test_cell_ordering: improve debuggability (cherry picked from commit 87b4606)
Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value. To summarize, the motivation for this change is three fold: 1. Cassandra compatibility 2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration. 3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time. \Fixes #14182 Also, this series: - updates dml documentation - updates internal documentation - updates and adds unit tests and cql pytest reproducing #14182 \Closes #14183 * github.com:scylladb/scylladb: docs: dml: add update ordering section cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same atomic_cell: compare_atomic_cell_for_merge: update and add documentation compare_atomic_cell_for_merge: compare value last for live cells mutation_test: test_cell_ordering: improve debuggability (cherry picked from commit 87b4606) Closes #14647
Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value. To summarize, the motivation for this change is three fold: 1. Cassandra compatibility 2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration. 3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time. \Fixes #14182 Also, this series: - updates dml documentation - updates internal documentation - updates and adds unit tests and cql pytest reproducing #14182 \Closes #14183 * github.com:scylladb/scylladb: docs: dml: add update ordering section cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same atomic_cell: compare_atomic_cell_for_merge: update and add documentation compare_atomic_cell_for_merge: compare value last for live cells mutation_test: test_cell_ordering: improve debuggability (cherry picked from commit 87b4606) Closes #14649
Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value. To summarize, the motivation for this change is three fold: 1. Cassandra compatibility 2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration. 3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time. \Fixes #14182 Also, this series: - updates dml documentation - updates internal documentation - updates and adds unit tests and cql pytest reproducing #14182 \Closes #14183 * github.com:scylladb/scylladb: docs: dml: add update ordering section cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same atomic_cell: compare_atomic_cell_for_merge: update and add documentation compare_atomic_cell_for_merge: compare value last for live cells mutation_test: test_cell_ordering: improve debuggability (cherry picked from commit 87b4606) Closes #14651
Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their value first, before checking if either of them is expiring and if both are expiring, comparing their expiration time and ttl value to determine which of them will expire later or was written later. This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value. To summarize, the motivation for this change is three fold: 1. Cassandra compatibility 2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration. 3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time. \Fixes scylladb#14182 Also, this series: - updates dml documentation - updates internal documentation - updates and adds unit tests and cql pytest reproducing scylladb#14182 \Closes scylladb#14183 * github.com:scylladb/scylladb: docs: dml: add update ordering section cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same atomic_cell: compare_atomic_cell_for_merge: update and add documentation compare_atomic_cell_for_merge: compare value last for live cells mutation_test: test_cell_ordering: improve debuggability (cherry picked from commit 87b4606)
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.
This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
apache/cassandra@e225c88 (CASSANDRA-14592),
where the cell expiration is considered before the cell value.
To summarize, the motivation for this change is three fold:
Fixes #14182
Also, this series: