Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

atomic_cell: compare value last #14183

Merged

Conversation

bhalevy
Copy link
Member

@bhalevy bhalevy commented Jun 8, 2023

Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
apache/cassandra@e225c88 (CASSANDRA-14592),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:

  1. Cassandra compatibility
  2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
  3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

Fixes #14182

Also, this series:

@bhalevy bhalevy requested review from nyh and tgrabiec as code owners June 8, 2023 09:09
@bhalevy bhalevy force-pushed the atomic_cell-compare-value-last branch from b7ae290 to 9eae75f Compare June 8, 2023 09:22
@avikivity
Copy link
Member

When did the change in the algorithm happen (commit)?

@bhalevy
Copy link
Member Author

bhalevy commented Jun 8, 2023

When did the change in the algorithm happen (commit)?

The problem was there early on, when f43836e initially added handling of expirations, already evaluated after comparing the cell value.

@avikivity
Copy link
Member

When did the change in the algorithm happen (commit)?

The problem was there early on, when f43836e initially added handling of expirations, already evaluated after comparing the cell value.

Ok, so backport all the way to 0.10.

@avikivity
Copy link
Member

@tgrabiec please review. I don't have intuition about whether the old or new algorithm are more correct.

@nyh
Copy link
Contributor

nyh commented Jun 8, 2023

A comment I wrote in the issue:
"If we change the order in our code, we may need to do something to avoid the above "oscillation" problem on mixed clusters during upgrade".
Do you think it's fine to not handle this issue?

@avikivity
Copy link
Member

I think it's rare enough that we can allow oscillation during upgrade.

@nyh
Copy link
Contributor

nyh commented Jun 8, 2023

also add a cql pytest to reproducing #14182.

We already have a test file, "test-timestamp.py", for tests of USING TIMESTAMP. Please add your test to that file - and feel free to rename it in your patch (I admit test_using_timestamp.py is a better name, and I don't know why this test ended up the only one having a minus instead of an underscore in its name...)

@scylladb-promoter
Copy link
Contributor

@bhalevy bhalevy force-pushed the atomic_cell-compare-value-last branch from 9eae75f to 8cd6755 Compare June 8, 2023 11:12
@bhalevy
Copy link
Member Author

bhalevy commented Jun 8, 2023

also add a cql pytest to reproducing #14182.

We already have a test file, "test-timestamp.py", for tests of USING TIMESTAMP. Please add your test to that file - and feel free to rename it in your patch (I admit test_using_timestamp.py is a better name, and I don't know why this test ended up the only one having a minus instead of an underscore in its name...)

Done in fdf5c99

@bhalevy bhalevy force-pushed the atomic_cell-compare-value-last branch from 8cd6755 to fdf5c99 Compare June 8, 2023 11:14
@scylladb-promoter
Copy link
Contributor

@scylladb-promoter
Copy link
Contributor

test/boost/mutation_test.cc Show resolved Hide resolved
test/boost/mutation_test.cc Show resolved Hide resolved
test/boost/mutation_test.cc Outdated Show resolved Hide resolved
testlog.debug("Live cells are ordered before expiring cells, regardless of their value");
assert_order(
atomic_cell::make_live(*bytes_type, 1, bytes("value2")),
atomic_cell::make_live(*bytes_type, 1, bytes("value1"), expiry_1, ttl_1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test/condition is not obvious. Cells should always be ordered by their write time if their "timestamp" is the same, regardless if they are alive, expired, dead (tombstone) or expiring.

The last written value in this case should always win.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only expiring/dead cells have a write time. Live cells don't have one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. It kinda makes sense to me now too: they want to prefer the value that has the longest "expire_at" and the alive cell has an infinite value for this (formally).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to CASSANDRA-14592 they keep expiring cells because they'll become tombstones in the future and then they'll trump live cells. If they kept the non-expiring cell (as it will survive longer), the expiration would be lost. Their description of the issue indicates that this inconsistency depends on WHEN the comparison took place, as if it was made early enough, before the cell expired, then the live cell could have been picked (based on its value), but if the comparison took place later on, after the cell expired, then it would be considered as a tombstone and then it would prevail, so preferring expiring cells over live ones is there to be consistent with the rule the prefers tombstones over live cells.

@@ -79,23 +79,20 @@ compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right) {
return left.is_live() ? std::strong_ordering::less : std::strong_ordering::greater;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that giving a live cell preference to not-live cell even if the not-live cell was written after the live cell is not obvious at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have no way to determine this. We use timestamps to determine order and we are in a tie. Our ordering broke down, and we just need some order (bonus if it is also intuitive).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, being compatible with C* is a plus.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@denesb Don't we always store a coordinator time together with a "timestamp" (which can be overridden with USING TIMESTAMP)?
Or we only have it if data is TTLed and we have an expire_at?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bhalevy is the code with this patch copies the latest C* logic 1-to-1?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vladzcloudius it doesn't copy the C* logic but it is compatible with it, using a different implementation.
Also, it extend C*'s algorithm as we examine the cell remaining ttl, while C* doesn't so that. They stop processing at the expiry time and if both cells expire at the same time, their value is compared.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead cell wins with a live cell because expiring cells turn into dead cells of the same timestamp, so they are later in the lifecycle of a cell.

test/boost/mutation_test.cc Show resolved Hide resolved
test/boost/mutation_test.cc Show resolved Hide resolved
test/boost/mutation_test.cc Show resolved Hide resolved
test/boost/mutation_test.cc Show resolved Hide resolved
test/boost/mutation_test.cc Show resolved Hide resolved
if (left.is_live_and_has_ttl() != right.is_live_and_has_ttl()) {
// prefer expiring cells.
return left.is_live_and_has_ttl() ? std::strong_ordering::greater : std::strong_ordering::less;
}
if (left.is_live_and_has_ttl()) {
if (left.expiry() != right.expiry()) {
return left.expiry() <=> right.expiry();
} else {
} else if (right.ttl() != left.ttl()) {
// prefer the cell that was written later,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment doesn't seem to reflect the new logic: the following comparison is going to order based on the TTL value - not the "expiry" (which would match the old comment).

You probably want to move this comment above line 87 and add a different comment here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment does refer to selection by ttl, although it's not obvious.
Having shorter ttl with same expiry means that the derived write time is later.
With that in mind, the cell becomes purgeable gc_grace_seconds after its write time, hence it will survive for longer (as a tombstone).

Copy link
Contributor

@vladzcloudius vladzcloudius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code makes sense to me logically.
Beyond a few cosmetic changes I requested yesterday and today it looks good to me.

@avikivity
Copy link
Member

The patch fixes the issue, but that's only because they are phrased the same way.

Issue: we should do X
Patch: do X

That's convenient, but it avoids the question of what's the right behavior, what's the user impact, and what should be written in the release notes.

@bhalevy
Copy link
Member Author

bhalevy commented Jun 10, 2023

The patch fixes the issue, but that's only because they are phrased the same way.

Issue: we should do X Patch: do X

That's convenient, but it avoids the question of what's the right behavior, what's the user impact, and what should be written in the release notes.

I edited the issue title and description to describe the user visible impact rather than the technical details.

Copy link
Contributor

@nyh nyh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I understand this patch, I don't like the fact that it's motivation is ambiguous, and it makes it harder to review if the implementation lives up to the motivation. I can think of three related, but different motivations:

  1. @vladzcloudius believes that this this patch fixes a real "error", and your commit message also claims it avoids a "weird corner case". What is this error / corner case, and without explaining what it is, how can I review that the code fixes it and the test reproduces it?
  2. The fact that Cassandra's ordering is different from Scylla's was mentioned several times, and a possible motivation was to make them the same. But did we? The test is marked "scylla_only", suggesting it fails on Cassandra. But why?
  3. Why I try to explain the motivation of this patch to myself, it's not about the TTL at all! Instead, it's about what to do when a user send different writes A and write B with the same timestamp - how can we try (if possible) to make sure either A or B gets written but not a mixture half-A-half-B? When writes have expiration-times, we are lucky: In CQL, when you do a write, all of it will have the same expiration time. So all the cells in A have the same expiration time tA, and all the cells in B have the same expiration tB. If tA != tB (e.g., we have default-TTL and more than a second passed between the two writes), if we base the reconciliation decision on the expiration time, not cell values - we can consistently choose either A and B without mixing them! If we're lucky, we shouldn't waste this luck!

I think the real motivation is, or should be, 3. Stating that this is the motivation isn't just nice for understanding, or for understanding how serious (or not) this issue is, it can also change the way you look at testing this issue. Benny, your test has a lot of sleeps waiting until the the data actually expires. Although it might make sense to have a slow test that exactly reproduces the same scenario that the user saw, I think you can actually reproduce the usefulness of this patch in one second (unfortunately, not less) - without waiting for anything to expire: Do a write A of two int cells with values 0, 1 with TTL 1000, sleep a whole second (unfortunately), then do a write B of the cells with values 1, 0 with TTL 1000. Now read. Before this patch, I believe you'll read a mixture of A and B (I think 0,0)). After this patch, you'll read one of A or B - not a mixture of the two. Also, unless I'm missing something, the same test should pass on Cassandra.

@@ -90,3 +90,56 @@ def test_key_writetime(cql, table1):
cql.execute(f'SELECT writetime(k) FROM {table1}')
with pytest.raises(InvalidRequest, match='PRIMARY KEY part k'):
cql.execute(f'SELECT ttl(k) FROM {table1}')

# Reproducer for https://github.com/scylladb/scylladb/issues/14182
def test_rewrite_using_same_timeout_and_ttl(scylla_only, cql, table1):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, what?
I thought the main point of this patch was to behave like Cassandra always did in this specific case. So why a scylla_only test case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure at the time that we're indeed compatible.
Today I'm more confident that we are (with recent enough version of Cassandra).
I'll test that in the next version of this series.

errors.append(f"Expected (k={k}, v={expected}) after immediate rewrite in iteration {i}, but got {res[0]}")

delay = 4
time.sleep(delay)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sad you need such a long sleep. To be honest, I don't understand why you need to sleep at all, can't you read the data and its ttl() immediately and see this problem without needing to wait for the TTL?
This also relates to my question about what it is that we are really fixing in this PR, I'll write it as a separate comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to reproduce the actual issue, rather than testing for indirect evidence.
But I'll see if it can be reproduced reliably using short timeouts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes after you understand an issue you understand that although the customer noticed the problem only after the expiration, if they used "SELECT ttl(...)" they would have seen it immediately after the second write, so you don't need to wait for the expiration. But I guess I'm ok with this if a couple of 1-second sleeps are involved, and not a dozen 4-second sleeps.

errors.append(f"Expected (k={k}, v={expected}) after first expiration in iteration {i}, but got {res[0]}")

expire2 = t2 + ttl2
time.sleep(expire2 - time.time() + 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remember that typical cql-pytest tests take time measured in milliseconds, and think if there is no way to do this test without so many multi-second sleeps. I think no sleeps are really needed here (see above) but maybe I'm missing something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to reproduce the null value issue, and that requires the first insert to expire.

@bhalevy bhalevy force-pushed the atomic_cell-compare-value-last branch from fdf5c99 to ea9cb20 Compare June 11, 2023 09:38
@bhalevy
Copy link
Member Author

bhalevy commented Jun 11, 2023

In v2 (ea9cb20)

  • rebased
  • cleaned up changes to unit test (separating cosmetic changes from functional ones)
  • added and update source documentation for compare_atomic_cell_for_merge
  • Update user-facing doc: docs/cql/dml.rst with a new, Update ordering section

@scylladb-promoter
Copy link
Contributor

@bhalevy
Copy link
Member Author

bhalevy commented Jun 11, 2023

Although I understand this patch, I don't like the fact that it's motivation is ambiguous, and it makes it harder to review if the implementation lives up to the motivation.

Good point. Let's clear that out.

I can think of three related, but different motivations:

  1. @vladzcloudius believes that this this patch fixes a real "error", and your commit message also claims it avoids a "weird corner case". What is this error / corner case, and without explaining what it is, how can I review that the code fixes it and the test reproduces it?

Sure. I'll describe the case more clearly.

  1. The fact that Cassandra's ordering is different from Scylla's was mentioned several times, and a possible motivation was to make them the same. But did we? The test is marked "scylla_only", suggesting it fails on Cassandra. But why?

Today, after reviewing both implementation, I convinced myself the implementation are now compatible and the scylla_only marker can be removed (will do in next version of the series)

  1. Why I try to explain the motivation of this patch to myself, it's not about the TTL at all! Instead, it's about what to do when a user send different writes A and write B with the same timestamp - how can we try (if possible) to make sure either A or B gets written but not a mixture half-A-half-B? When writes have expiration-times, we are lucky: In CQL, when you do a write, all of it will have the same expiration time. So all the cells in A have the same expiration time tA, and all the cells in B have the same expiration tB. If tA != tB (e.g., we have default-TTL and more than a second passed between the two writes), if we base the reconciliation decision on the expiration time, not cell values - we can consistently choose either A and B without mixing them! If we're lucky, we shouldn't waste this luck!

That's a good point and another motivator for this change (that I didn't think about).

I think the real motivation is, or should be, 3. Stating that this is the motivation isn't just nice for understanding, or for understanding how serious (or not) this issue is, it can also change the way you look at testing this issue. Benny, your test has a lot of sleeps waiting until the the data actually expires. Although it might make sense to have a slow test that exactly reproduces the same scenario that the user saw, I think you can actually reproduce the usefulness of this patch in one second (unfortunately, not less) - without waiting for anything to expire: Do a write A of two int cells with values 0, 1 with TTL 1000, sleep a whole second (unfortunately), then do a write B of the cells with values 1, 0 with TTL 1000. Now read. Before this patch, I believe you'll read a mixture of A and B (I think 0,0)). After this patch, you'll read one of A or B - not a mixture of the two. Also, unless I'm missing something, the same test should pass on Cassandra.

I'll try that.

@bhalevy
Copy link
Member Author

bhalevy commented Jun 11, 2023

When did the change in the algorithm happen (commit)?

@avikivity our original implementation was based on a respectively old version of Cassandra.
They changed their algorithm in apache/cassandra@e225c88 (CASSANDRA-14592), but we didn't update our implementation correspondingly.

@bhalevy bhalevy force-pushed the atomic_cell-compare-value-last branch from ea9cb20 to 3214dbf Compare June 11, 2023 11:21
@bhalevy
Copy link
Member Author

bhalevy commented Jun 11, 2023

In v3 (3214dbf):

  • Improved user-facing docs
  • Improved annotations in git commits
  • cql-pytest
    • reduce timeouts
    • removed scylla_only marker
    • added test case for updating multiple cells
  • updated description of this PR to clarify the motivation for the change.

@bhalevy bhalevy requested a review from nyh June 11, 2023 11:24
@bhalevy
Copy link
Member Author

bhalevy commented Jun 19, 2023

  1. This test file was already reamed when we noticed its odd name last week, you don't need to send another patch to do that.

@nyh I don't see such PR nor a patch on the mailing list.
Care to provide a reference so I know when to rebase and on what?

@nyh
Copy link
Contributor

nyh commented Jun 19, 2023

@bhalevy this is so funny, we managed to confuse each other. It was you, who already added a patch in this PR, to rename test-timestamp.py to test_using_timestamp.py. It wasn't another PR - it was this one :-) So you don't need to submit another PR to rename this patch, you already did, this one...

Of course, if you want you can split out this rename from this PR and send it as a separate PR. Maybe that's what you meant you wanted to do, and I just misunderstood you?

@bhalevy
Copy link
Member Author

bhalevy commented Jun 19, 2023

@bhalevy this is so funny, we managed to confuse each other. It was you, who already added a patch in this PR, to rename test-timestamp.py to test_using_timestamp.py. It wasn't another PR - it was this one :-) So you don't need to submit another PR to rename this patch, you already did, this one...

Of course, if you want you can split out this rename from this PR and send it as a separate PR. Maybe that's what you meant you wanted to do, and I just misunderstood you?

Yes, I want to separate it since it's fixing a preexisting issue and would have different backport requirements.

@bhalevy
Copy link
Member Author

bhalevy commented Jun 19, 2023

#14293

Currently, it is hard to tell which of the many sub-cases
fail in this unit test, in case any of them fails.

This change uses logging in debug and trace level
to help with that by reproducing the error
with --logger-log-level testlog=trace
(The cases are deterministic so reproducing should not
be a problem)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was changed in CASSANDRA-14592
for consistency with the preference for dead cells over live cells,
as expiring cells will become tombstones at a future time
and then they'd win over live cells with the same timestamp,
hence they should win also before expiration.

In addition, comparing the cell value before expiration
can lead to unintuitive corner cases where rewriting
a cell using the same timestamp but different TTL
may cause scylla to return the cell with null value
if it expired in the meanwhile.

Also, when multiple columns are written using two upserts
using the same write timestamp but with different expiration,
selecting cells by their value may return a mixed result
where each cell is selected individually from either upsert,
by picking the cells with the largest values for each column,
while using the expiration time to break tie will lead
to a more consistent results where a set of cell from
only one of the upserts will be selected.

Fixes scylladb#14182

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
…e expiry is the same

As in compare_atomic_cell_for_merge, we want to consider
the row marker ttl for ordering, in case both are expiring
and have the same expiration time.

This was missed in a57c087
and a085ef7.

With that in mind, add documentation to compare_row_marker_for_merge
and a mutual note to both functions about their
equivalence.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
…imestamp

Add reproducers for scylladb#14182:

test_rewrite_different_values_using_same_timestamp verifies
expiration-based cell reconciliation.

test_rewrite_different_values_using_same_timestamp_and_expiration
is a scylla_only test, verifying that when
two cells with same timestamp and same expiration
are compared, the one with the lesser ttl prevails.

test_rewrite_using_same_timestamp_select_after_expiration
reproduces the specific issue hit in scylladb#14182
where a cell is selected after it expires since
it has a lexicographically larger value than
the other cell with later expiration.

test_rewrite_multiple_cells_using_same_timestamp verifies
atomicity of inserts of multiple columns, with a TTL.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@bhalevy bhalevy force-pushed the atomic_cell-compare-value-last branch from 6dd2e15 to 39aea90 Compare June 20, 2023 07:40
@bhalevy
Copy link
Member Author

bhalevy commented Jun 20, 2023

39aea90: rebased

and add docs/dev/timestamp-conflict-resolution.md
to document the details of the conflict resolution algorithm.

Refs scylladb#14063

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@bhalevy bhalevy force-pushed the atomic_cell-compare-value-last branch from 39aea90 to 26ff8f7 Compare June 20, 2023 08:56
@bhalevy
Copy link
Member Author

bhalevy commented Jun 20, 2023

26ff8f7: fixed warnings in docs/cql/dml.rst regarding empty lines in bullet-points list

@scylladb-promoter
Copy link
Contributor

CI state SUCCESS - https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/1939/

@DoronArazii
Copy link

@avikivity please merge

@scylladb-promoter scylladb-promoter merged commit 87b4606 into scylladb:master Jun 20, 2023
4 checks passed
bhalevy pushed a commit to bhalevy/scylla that referenced this pull request Jul 12, 2023
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

Fixes scylladb#14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing scylladb#14182

Closes scylladb#14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability

(cherry picked from commit 87b4606)
bhalevy pushed a commit to bhalevy/scylla that referenced this pull request Jul 12, 2023
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

\Fixes scylladb#14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing scylladb#14182

\Closes scylladb#14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability

(cherry picked from commit 87b4606)
bhalevy pushed a commit to bhalevy/scylla that referenced this pull request Jul 12, 2023
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

\Fixes scylladb#14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing scylladb#14182

\Closes scylladb#14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability

(cherry picked from commit 87b4606)
bhalevy pushed a commit to bhalevy/scylla that referenced this pull request Jul 12, 2023
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

\Fixes scylladb#14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing scylladb#14182

\Closes scylladb#14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability

(cherry picked from commit 87b4606)
denesb pushed a commit that referenced this pull request Jul 12, 2023
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

\Fixes #14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182

\Closes #14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability

(cherry picked from commit 87b4606)

Closes #14647
denesb pushed a commit that referenced this pull request Jul 12, 2023
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

\Fixes #14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182

\Closes #14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability

(cherry picked from commit 87b4606)

Closes #14649
denesb pushed a commit that referenced this pull request Jul 12, 2023
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

\Fixes #14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182

\Closes #14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability

(cherry picked from commit 87b4606)

Closes #14651
bhalevy pushed a commit to bhalevy/scylla that referenced this pull request Jul 12, 2023
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

\Fixes scylladb#14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing scylladb#14182

\Closes scylladb#14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability

(cherry picked from commit 87b4606)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

compare_cells_for_merge may wrongly prefer an expired cell over a live and expiring one
10 participants