Merge 'atomic_cell: compare value last' from Benny Halevy
Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their values first, before checking whether either of them is expiring and, if both are expiring, comparing their expiration times and TTL values to determine which of them will expire later or was written later.

This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is threefold:

1. Cassandra compatibility.
2. Prevent an edge case where a null value is returned by a select query when an expired cell has a larger value than a cell with a later expiration.
3. A generalization of the above: value-based reconciliation may cause a select query to return a mixture of upserts if multiple upserts use the same timestamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based on the expiration times will choose cells consistently from one of the upserts, as all cells in the respective upsert will carry the same expiration time.

Fixes #14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182

Closes #14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability
Showing 6 changed files with 324 additions and 40 deletions.
@@ -0,0 +1,37 @@
# Timestamp conflict resolution

The fundamental rule for ordering cells that insert, update, or delete data in a given row and column
is that the cell with the highest timestamp wins.

However, it is possible that multiple such cells will carry the same `TIMESTAMP`.
In this case, conflicts must be resolved in a consistent way by all nodes.
Otherwise, if nodes picked an arbitrary cell in case of a conflict and reached different results,
reading from different replicas would detect the inconsistency and trigger
a read-repair that generates yet another cell, which would still conflict with the existing cells,
with no guarantee of convergence.
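
The need for a consistent rule can be illustrated with a toy sketch. It is not ScyllaDB code: the cell representation (a `(timestamp, value)` tuple) and both merge functions are hypothetical, and the "keep the first arrival" policy merely stands in for any arbitrary choice. Two replicas receive the same two cells in different orders.

```python
from functools import reduce

# Toy cells: (write_timestamp, value). Both carry the same timestamp.
cell_a = (100, b"apple")
cell_b = (100, b"banana")

def merge_arbitrary(current, incoming):
    """Hypothetical bad policy: on a timestamp tie, keep whichever cell arrived first."""
    return incoming if incoming[0] > current[0] else current

def merge_deterministic(current, incoming):
    """Order-independent policy: ties are broken by comparing the whole tuple."""
    return max(current, incoming)

replica_1 = [cell_a, cell_b]  # arrival order on one replica
replica_2 = [cell_b, cell_a]  # arrival order on another replica

arbitrary = [reduce(merge_arbitrary, cells) for cells in (replica_1, replica_2)]
deterministic = [reduce(merge_deterministic, cells) for cells in (replica_1, replica_2)]

print(arbitrary[0] == arbitrary[1])          # replicas disagree under the arbitrary policy
print(deterministic[0] == deterministic[1])  # replicas agree under the deterministic policy
```

Under the arbitrary policy each replica keeps a different cell, so a read that touches both sees a mismatch; the deterministic policy yields the same winner regardless of arrival order.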

The first tie-breaking rule when two cells have the same write timestamp is that
dead cells win over live cells; and if both cells are deleted, the one with the later deletion time prevails.

If both cells are alive, their expiration time is examined.
Cells that are written with a non-zero TTL (either implicit, as determined by
the table's default TTL, or explicit, `USING TTL`) are due to expire
TTL seconds after the time they were written (as determined by the coordinator,
and rounded to 1-second resolution). That time is the cell's expiration time.
When cells expire, they become tombstones, shadowing any data written with a write timestamp
less than or equal to the timestamp of the expiring cell.
Therefore, cells that have an expiration time win over cells with no expiration time.
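
As a rough illustration of the arithmetic, the sketch below computes an expiration time and applies the "expiring beats non-expiring" rule. The function names are hypothetical, and the write time is assumed to be already expressed in whole seconds; the exact rounding performed by the coordinator is not modeled here.

```python
from typing import Optional

def expiration_time(write_time_s: int, ttl_s: int) -> Optional[int]:
    """Expiration time in whole seconds, or None when the cell has no TTL."""
    if ttl_s == 0:
        return None
    return write_time_s + ttl_s

def expiring_cell_wins(expiry_a: Optional[int], expiry_b: Optional[int]) -> Optional[str]:
    """For two live cells with equal timestamps: name the winner if exactly one expires."""
    if (expiry_a is None) == (expiry_b is None):
        return None  # this rule does not decide; later tie-breakers apply
    return "a" if expiry_a is not None else "b"

with_ttl = expiration_time(write_time_s=1_000, ttl_s=60)
without_ttl = expiration_time(write_time_s=1_000, ttl_s=0)
print(with_ttl, without_ttl, expiring_cell_wins(with_ttl, without_ttl))
```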

If both cells have an expiration time, the one with the later expiration time wins;
and if they have the same expiration time (in whole-second resolution),
their write time is derived from the expiration time less the original time-to-live value,
and the one that was written at a later time prevails.
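
A small sketch of this step, using hypothetical `(expiry_s, ttl_s)` pairs in whole seconds rather than any real cell type: with equal expiration times, the cell with the smaller TTL has the later derived write time and therefore wins.

```python
def derived_write_time(expiry_s: int, ttl_s: int) -> int:
    """Write time recovered as expiration time minus the original TTL."""
    return expiry_s - ttl_s

def pick_among_expiring(a, b):
    """a and b are (expiry_s, ttl_s) pairs; returns the winner, or None if still tied."""
    if a[0] != b[0]:
        return a if a[0] > b[0] else b      # later expiration wins
    written_a = derived_write_time(*a)
    written_b = derived_write_time(*b)
    if written_a != written_b:
        return a if written_a > written_b else b  # later write wins
    return None  # same expiry and TTL: fall through to the value comparison

# Both expire at t=1060; the second was written at t=1030, the first at t=1000.
print(pick_among_expiring((1060, 60), (1060, 30)))
```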

Finally, if both cells are live and have no expiration, or have the same expiration time and time-to-live,
the cell with the lexicographically greater value prevails.
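
The full chain of rules described in this document can be sketched as a single function. This is an illustrative model only, not the `compare_atomic_cell_for_merge` implementation: the `Cell` class, its field names, and `winner` are hypothetical, and values are compared as Python `bytes`, which order lexicographically by unsigned byte.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Cell:
    timestamp: int                       # write timestamp
    value: bytes = b""                   # payload; unused for dead cells
    deletion_time: Optional[int] = None  # set only for dead cells
    expiry: Optional[int] = None         # expiration time in whole seconds; None = no TTL
    ttl: int = 0                         # original TTL in seconds

    @property
    def is_live(self) -> bool:
        return self.deletion_time is None

def winner(a: Cell, b: Cell) -> Cell:
    # 1. The highest write timestamp wins.
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # 2. Dead cells win over live cells.
    if a.is_live != b.is_live:
        return b if a.is_live else a
    # 3. Both dead: the later deletion time wins.
    if not a.is_live:
        return a if a.deletion_time >= b.deletion_time else b
    # 4. Both live: an expiring cell wins over a non-expiring one.
    if (a.expiry is None) != (b.expiry is None):
        return a if a.expiry is not None else b
    if a.expiry is not None:
        # 5. Both expiring: the later expiration time wins.
        if a.expiry != b.expiry:
            return a if a.expiry > b.expiry else b
        # 6. Same expiry: a smaller TTL means a later write, which wins.
        if a.ttl != b.ttl:
            return a if a.ttl < b.ttl else b
    # 7. Last resort: the lexicographically greater value wins.
    return a if a.value >= b.value else b

expiring = Cell(timestamp=10, value=b"a", expiry=1060, ttl=60)
plain = Cell(timestamp=10, value=b"z")
print(winner(expiring, plain) is expiring)  # expiration is considered before the value
```

Because the value is consulted only in the final step, a cell with a larger value no longer beats a cell that expires later.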

Note that when multiple columns are `INSERT`ed or `UPDATE`d using the same timestamp,
`SELECT`ing those columns might return a result that mixes cells from both upserts.
This may happen when neither upsert has an expiration time, or when their expiration times and TTLs are
respectively the same (in whole-second resolution). In such a case, cell selection is based on the cell values
in each column, independently of the other columns.
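
A minimal sketch of that mixing effect, with made-up column names and values: two upserts to the same row share a timestamp and have no TTL, so each column is resolved independently by value.

```python
# Two hypothetical upserts to the same row, same timestamp, no TTL.
upsert_1 = {"a": b"1", "b": b"9"}
upsert_2 = {"a": b"2", "b": b"8"}

# With everything else equal, each column independently keeps the greater value.
merged = {column: max(upsert_1[column], upsert_2[column]) for column in upsert_1}

print(merged)  # column "a" comes from upsert_2, column "b" from upsert_1
```

The merged row matches neither upsert as written, which is the mixed result described in the note.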