Skip to content

Commit

Permalink
Merge 'atomic_cell: compare value last' from Benny Halevy
Browse files Browse the repository at this point in the history
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
apache/cassandra@e225c88 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is three fold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by select query when an expired cell has a larger value than a cell with later expiration.
3. A generalization of the above: value-based reconciliation may cause select query to return a mixture of upserts, if multiple upserts use the same timeastamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based the expiration times will choose cells consistently from either upserts, as all cells in the respective upsert will carry the same expiration time.

Fixes #14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182

Closes #14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability
  • Loading branch information
tgrabiec committed Jun 20, 2023
2 parents 8bfe3ca + 26ff8f7 commit 87b4606
Show file tree
Hide file tree
Showing 6 changed files with 324 additions and 40 deletions.
61 changes: 55 additions & 6 deletions docs/cql/dml.rst
Expand Up @@ -592,7 +592,7 @@ of eventual consistency on an event of a timestamp collision:

``INSERT`` statements happening concurrently at different cluster
nodes proceed without coordination. Eventually cell values
supplied by a statement with the highest timestamp will prevail.
supplied by a statement with the highest timestamp will prevail (see :ref:`update ordering <update-ordering>`).

Unless a timestamp is provided by the client, Scylla will automatically
generate a timestamp with microsecond precision for each
Expand All @@ -601,7 +601,7 @@ by the same node are unique. Timestamps assigned at different
nodes are not guaranteed to be globally unique.
With a steadily high write rate timestamp collision
is not unlikely. If it happens, i.e. two ``INSERTS`` have the same
timestamp, the lexicographically bigger value prevails:
timestamp, a conflict resolution algorithm determines which of the inserted cells prevails (see :ref:`update ordering <update-ordering>`).

Please refer to the :ref:`UPDATE <update-parameters>` section for more information on the :token:`update_parameter`.

Expand Down Expand Up @@ -709,8 +709,8 @@ Similarly to ``INSERT``, ``UPDATE`` statement happening concurrently at differen
cluster nodes proceed without coordination. Cell values
supplied by a statement with the highest timestamp will prevail.
If two ``UPDATE`` statements or ``UPDATE`` and ``INSERT``
statements have the same timestamp,
lexicographically bigger value prevails.
statements have the same timestamp, a conflict resolution algorithm determines which cells prevails
(see :ref:`update ordering <update-ordering>`).

Regarding the :token:`assignment`:

Expand Down Expand Up @@ -749,7 +749,7 @@ parameters:
Scylla ensures that query timestamps created by the same coordinator node are unique (even across different shards
on the same node). However, timestamps assigned at different nodes are not guaranteed to be globally unique.
Note that with a steadily high write rate, timestamp collision is not unlikely. If it happens, e.g. two INSERTS
have the same timestamp, conflicting cell values are compared and the cells with the lexicographically bigger value prevail.
have the same timestamp, a conflict resolution algorithm determines which of the inserted cells prevails (see :ref:`update ordering <update-ordering>` for more information):
- ``TTL``: specifies an optional Time To Live (in seconds) for the inserted values. If set, the inserted values are
automatically removed from the database after the specified time. Note that the TTL concerns the inserted values, not
the columns themselves. This means that any subsequent update of the column will also reset the TTL (to whatever TTL
Expand All @@ -759,6 +759,55 @@ parameters:
- ``TIMEOUT``: specifies a timeout duration for the specific request.
Please refer to the :ref:`SELECT <using-timeout>` section for more information.

.. _update-ordering:

Update ordering
~~~~~~~~~~~~~~~

:ref:`INSERT <insert-statement>`, :ref:`UPDATE <update-statement>`, and :ref:`DELETE <delete_statement>`
operations are ordered by their ``TIMESTAMP``.

Ordering of such changes is done at the cell level, where each cell carries a write ``TIMESTAMP``,
other attributes related to its expiration when it has a non-zero time-to-live (``TTL``),
and the cell value.

The fundamental rule for ordering cells that insert, update, or delete data in a given row and column
is that the cell with the highest timestamp wins.

However, it is possible that multiple such cells will carry the same ``TIMESTAMP``.
There could be several reasons for ``TIMESTAMP`` collision:

* Benign collision can be caused by "replay" of a mutation, e.g., due to client retry, or due to internal processes.
In such cases, the cells are equivalent, and any of them can be selected arbitrarily.
* ``TIMESTAMP`` collisions might be normally caused by parallel queries that are served
by different coordinator nodes. The coordinators might calculate the same write ``TIMESTAMP``
based on their local time in microseconds.
* Collisions might also happen with user-provided timestamps if the application does not guarantee
unique timestamps with the ``USING TIMESTAMP`` parameter (see :ref:`Update parameters <update-parameters>` for more information).

As said above, in the replay case, ordering of cells should not matter, as they carry the same value
and same expiration attributes, so picking any of them will reach the same result.
However, other ``TIMESTAMP`` conflicts must be resolved in a consistent way by all nodes.
Otherwise, if nodes would have picked an arbitrary cell in case of a conflict and they would
reach different results, reading from different replicas would detect the inconsistency and trigger
read-repair that will generate yet another cell that would still conflict with the existing cells,
with no guarantee for convergence.

Therefore, Scylla implements an internal, consistent conflict-resolution algorithm
that orders cells with conflicting ``TIMESTAMP`` values based on other properties, like:

* whether the cell is a tombstone or a live cell,
* whether the cell has an expiration time,
* the cell ``TTL``,
* and finally, what value the cell carries.

The conflict-resolution algorithm is documented in `Scylla's internal documentation <https://github.com/scylladb/scylladb/blob/master/docs/dev/timestamp-conflict-resolution.md>`_
and it may be subject to change.

Reliable serialization can be achieved using unique write ``TIMESTAMP``
and by using :doc:`Lightweight Transactions (LWT) </using-scylla/lwt>` to ensure atomicity of
:ref:`INSERT <insert-statement>`, :ref:`UPDATE <update-statement>`, and :ref:`DELETE <delete_statement>`.

.. _delete_statement:

DELETE
Expand Down Expand Up @@ -798,7 +847,7 @@ For more information on the :token:`update_parameter` refer to the :ref:`UPDATE
In a ``DELETE`` statement, all deletions within the same partition key are applied atomically,
meaning either all columns mentioned in the statement are deleted or none.
If ``DELETE`` statement has the same timestamp as ``INSERT`` or
``UPDATE`` of the same primary key, delete operation prevails.
``UPDATE`` of the same primary key, delete operation prevails (see :ref:`update ordering <update-ordering>`).

A ``DELETE`` operation can be conditional through the use of an ``IF`` clause, similar to ``UPDATE`` and ``INSERT``
statements. Each such ``DELETE`` gets a globally unique timestamp.
Expand Down
37 changes: 37 additions & 0 deletions docs/dev/timestamp-conflict-resolution.md
@@ -0,0 +1,37 @@
# Timestamp conflict resolution

The fundamental rule for ordering cells that insert, update, or delete data in a given row and column
is that the cell with the highest timestamp wins.

However, it is possible that multiple such cells will carry the same `TIMESTAMP`.
In this case, conflicts must be resolved in a consistent way by all nodes.
Otherwise, if nodes would have picked an arbitrary cell in case of a conflict and they would
reach different results, reading from different replicas would detect the inconsistency and trigger
read-repair that will generate yet another cell that would still conflict with the existing cells,
with no guarantee for convergence.

The first tie-breaking rule when two cells have the same write timestamp is that
dead cells win over live cells; and if both cells are deleted, the one with the later deletion time prevails.

If both cells are alive, their expiration time is examined.
Cells that are written with a non-zero TTL (either implicit, as determined by
the table's default TTL, or explicit, `USING TTL`) are due to expire
TTL seconds after the time they were written (as determined by the coordinator,
and rounded to 1 second resolution). That time is the cell's expiration time.
When cells expire, they become tombstones, shadowing any data written with a write timestamp
less than or equal to the timestamp of the expiring cell.
Therefore, cells that have an expiration time win over cells with no expiration time.

If both cells have an expiration time, the one with the latest expiration time wins;
and if they have the same expiration time (in whole second resolution),
their write time is derived from the expiration time less the original time-to-live value
and the one that was written at a later time prevails.

Finally, if both cells are live and have no expiration, or have the same expiration time and time-to-live,
the cell with the lexicographically bigger value prevails.

Note that when multiple columns are INSERTed or UPDATEed using the same timestamp,
SELECTing those columns might return a result that mixes cells from either upsert.
This may happen when both upserts have no expiration time, or both their expiration time and TTL are the
same, respectively (in whole second resolution). In such a case, cell selection would be based on the cell values
in each column, independently of each other.
36 changes: 24 additions & 12 deletions mutation/atomic_cell.cc
Expand Up @@ -66,36 +66,48 @@ atomic_cell::atomic_cell(const abstract_type& type, atomic_cell_view other)
set_view(_data);
}

// Based on:
// - org.apache.cassandra.db.AbstractCell#reconcile()
// - org.apache.cassandra.db.BufferExpiringCell#reconcile()
// - org.apache.cassandra.db.BufferDeletedCell#reconcile()
// Based on Cassandra's resolveRegular function:
// - https://github.com/apache/cassandra/blob/e4f31b73c21b04966269c5ac2d3bd2562e5f6c63/src/java/org/apache/cassandra/db/rows/Cells.java#L79-L119
//
// Note: the ordering algorithm for cell is the same as for rows,
// except that the cell value is used to break a tie in case all other attributes are equal.
// See compare_row_marker_for_merge.
std::strong_ordering
compare_atomic_cell_for_merge(atomic_cell_view left, atomic_cell_view right) {
// Largest write timestamp wins.
if (left.timestamp() != right.timestamp()) {
return left.timestamp() <=> right.timestamp();
}
// Tombstones always win reconciliation with live cells of the same timestamp
if (left.is_live() != right.is_live()) {
return left.is_live() ? std::strong_ordering::less : std::strong_ordering::greater;
}
if (left.is_live()) {
auto c = compare_unsigned(left.value(), right.value()) <=> 0;
if (c != 0) {
return c;
}
// Prefer expiring cells (which will become tombstones at some future date) over live cells.
// See https://issues.apache.org/jira/browse/CASSANDRA-14592
if (left.is_live_and_has_ttl() != right.is_live_and_has_ttl()) {
// prefer expiring cells.
return left.is_live_and_has_ttl() ? std::strong_ordering::greater : std::strong_ordering::less;
}
// If both are expiring, choose the cell with the latest expiry or derived write time.
if (left.is_live_and_has_ttl()) {
// Prefer cell with latest expiry
if (left.expiry() != right.expiry()) {
return left.expiry() <=> right.expiry();
} else {
// prefer the cell that was written later,
// so it survives longer after it expires, until purged.
} else if (right.ttl() != left.ttl()) {
// The cell write time is derived by (expiry - ttl).
// Prefer the cell that was written later,
// so it survives longer after it expires, until purged,
// as it become purgeable gc_grace_seconds after it was written.
//
// Note that this is an extension to Cassandra's algorithm
// which stops at the expiration time, and if equal,
// move forward to compare the cell values.
return right.ttl() <=> left.ttl();
}
}
// The cell with the largest value wins, if all other attributes of the cells are identical.
// This is quite arbitrary, but still required to break the tie in a deterministic way.
return compare_unsigned(left.value(), right.value());
} else {
// Both are deleted

Expand Down
18 changes: 16 additions & 2 deletions mutation/mutation_partition.cc
Expand Up @@ -1108,20 +1108,34 @@ operator<<(std::ostream& os, const mutation_partition::printer& p) {
constexpr gc_clock::duration row_marker::no_ttl;
constexpr gc_clock::duration row_marker::dead;

// Note: the ordering algorithm for rows is the same as for cells,
// except that there is no cell value to break a tie in case all other attributes are equal.
// See compare_atomic_cell_for_merge.
int compare_row_marker_for_merge(const row_marker& left, const row_marker& right) noexcept {
// Largest write timestamp wins.
if (left.timestamp() != right.timestamp()) {
return left.timestamp() > right.timestamp() ? 1 : -1;
}
// Tombstones always win reconciliation with live rows of the same timestamp
if (left.is_live() != right.is_live()) {
return left.is_live() ? -1 : 1;
}
if (left.is_live()) {
// Prefer expiring rows (which will become tombstones at some future date) over live rows.
// See https://issues.apache.org/jira/browse/CASSANDRA-14592
if (left.is_expiring() != right.is_expiring()) {
// prefer expiring cells.
return left.is_expiring() ? 1 : -1;
}
if (left.is_expiring() && left.expiry() != right.expiry()) {
return left.expiry() < right.expiry() ? -1 : 1;
// If both are expiring, choose the cell with the latest expiry or derived write time.
if (left.is_expiring()) {
if (left.expiry() != right.expiry()) {
return left.expiry() < right.expiry() ? -1 : 1;
} else if (left.ttl() != right.ttl()) {
// The cell write time is derived by (expiry - ttl).
// Prefer row that was written later (and has a smaller ttl).
return left.ttl() < right.ttl() ? 1 : -1;
}
}
} else {
// Both are either deleted or missing
Expand Down
52 changes: 33 additions & 19 deletions test/boost/mutation_test.cc
Expand Up @@ -687,14 +687,11 @@ SEASTAR_TEST_CASE(test_cell_ordering) {
auto expiry_2 = now + ttl_2;

auto assert_order = [] (atomic_cell_view first, atomic_cell_view second) {
if (compare_atomic_cell_for_merge(first, second) >= 0) {
testlog.trace("Expected {} < {}", first, second);
abort();
}
if (compare_atomic_cell_for_merge(second, first) <= 0) {
testlog.trace("Expected {} < {}", second, first);
abort();
}
testlog.trace("Expected {} < {}", first, second);
BOOST_REQUIRE(compare_atomic_cell_for_merge(first, second) < 0);

testlog.trace("Expected {} > {}", second, first);
BOOST_REQUIRE(compare_atomic_cell_for_merge(second, first) > 0);
};

auto assert_equal = [] (atomic_cell_view c1, atomic_cell_view c2) {
Expand All @@ -703,68 +700,85 @@ SEASTAR_TEST_CASE(test_cell_ordering) {
BOOST_REQUIRE(compare_atomic_cell_for_merge(c2, c1) == 0);
};

testlog.debug("Live cells with same value are equal");
assert_equal(
atomic_cell::make_live(*bytes_type, 0, bytes("value")),
atomic_cell::make_live(*bytes_type, 0, bytes("value")));

testlog.debug("Non-expiring live cells are ordered before expiring cells");
assert_order(
atomic_cell::make_live(*bytes_type, 1, bytes("value")),
atomic_cell::make_live(*bytes_type, 1, bytes("value"), expiry_1, ttl_1));

testlog.debug("Non-expiring live cells are ordered before expiring cells, regardless of their value");
assert_order(
atomic_cell::make_live(*bytes_type, 1, bytes("value2")),
atomic_cell::make_live(*bytes_type, 1, bytes("value1"), expiry_1, ttl_1));

testlog.debug("Dead cells with same expiry are equal");
assert_equal(
atomic_cell::make_dead(1, expiry_1),
atomic_cell::make_dead(1, expiry_1));

testlog.debug("Non-expiring live cells are ordered before expiring cells, with empty value");
assert_order(
atomic_cell::make_live(*bytes_type, 1, bytes()),
atomic_cell::make_live(*bytes_type, 1, bytes(), expiry_2, ttl_2));

// Origin doesn't compare ttl (is it wise?)
// But we do. See https://github.com/scylladb/scylla/issues/10156
// and https://github.com/scylladb/scylla/issues/10173
testlog.debug("Expiring cells with higher ttl are ordered before expiring cells with smaller ttl and same expiry time");
assert_order(
atomic_cell::make_live(*bytes_type, 1, bytes("value"), expiry_1, ttl_2),
atomic_cell::make_live(*bytes_type, 1, bytes("value"), expiry_1, ttl_1));

testlog.debug("Cells are ordered by value if all else is equal");
assert_order(
atomic_cell::make_live(*bytes_type, 0, bytes("value1")),
atomic_cell::make_live(*bytes_type, 0, bytes("value2")));

testlog.debug("Cells are ordered by value in lexicographical order if all else is equal");
assert_order(
atomic_cell::make_live(*bytes_type, 0, bytes("value12")),
atomic_cell::make_live(*bytes_type, 0, bytes("value2")));

// Live cells are ordered first by timestamp...
testlog.debug("Live cells are ordered first by timestamp...");
assert_order(
atomic_cell::make_live(*bytes_type, 0, bytes("value2")),
atomic_cell::make_live(*bytes_type, 1, bytes("value1")));

// ..then by value
assert_order(
atomic_cell::make_live(*bytes_type, 1, bytes("value1"), expiry_2, ttl_2),
atomic_cell::make_live(*bytes_type, 1, bytes("value2"), expiry_1, ttl_1));

// ..then by expiry
testlog.debug("...then by expiry");
assert_order(
atomic_cell::make_live(*bytes_type, 1, bytes(), expiry_1, ttl_1),
atomic_cell::make_live(*bytes_type, 1, bytes(), expiry_2, ttl_1));

// Dead wins
testlog.debug("...then by ttl (in reverse)");
assert_order(
atomic_cell::make_live(*bytes_type, 1, bytes(), expiry_1, ttl_2),
atomic_cell::make_live(*bytes_type, 1, bytes(), expiry_1, ttl_1));

testlog.debug("...then by value");
assert_order(
atomic_cell::make_live(*bytes_type, 1, bytes("value1"), expiry_1, ttl_1),
atomic_cell::make_live(*bytes_type, 1, bytes("value2"), expiry_1, ttl_1));

testlog.debug("Dead wins");
assert_order(
atomic_cell::make_live(*bytes_type, 1, bytes("value")),
atomic_cell::make_dead(1, expiry_1));

// Dead wins with expiring cell
testlog.debug("Dead wins with expiring cell");
assert_order(
atomic_cell::make_live(*bytes_type, 1, bytes("value"), expiry_2, ttl_2),
atomic_cell::make_dead(1, expiry_1));

// Deleted cells are ordered first by timestamp
testlog.debug("Deleted cells are ordered first by timestamp...");
assert_order(
atomic_cell::make_dead(1, expiry_2),
atomic_cell::make_dead(2, expiry_1));

// ...then by expiry
testlog.debug("...then by expiry");
assert_order(
atomic_cell::make_dead(1, expiry_1),
atomic_cell::make_dead(1, expiry_2));
Expand Down

0 comments on commit 87b4606

Please sign in to comment.