Skip to content

Commit

Permalink
Delta validation (#130)
Browse files Browse the repository at this point in the history
* Bugfix: validation of delta before applying them.

Due to the nature UDP, the existence of resets and the fact that we are
gossipping to several nodes at the same time, it is possible for our
obsolete deltas to arrive.

This PR adds some validation to detect if the delta is valid, and
whether it will bring us to a better state or not.

It also removes the nodes to reset information, which was actually
taking a large amount of the MTU on large clusters.
(For 20 nodes, around 1KB)

Reset is now just expressed by sending the delta with `from_version = 0`.

Closes #129

Removing hidden contract

We avoid computing tombstone's Instant upon deserialization.
It was hiding a very hidden contract forcing us to deserialize mutation
in the order of their version.

With this change, we defer the computation of the instant to the
call of the apply_delta method. All of the tombstone from a delta
get the exact same `Instant`.

* Apply suggestions from code review

Co-authored-by: Raphaël Marinier <raphaelm@google.com>

---------

Co-authored-by: Raphaël Marinier <raphael.marinier@gmail.com>
Co-authored-by: Raphaël Marinier <raphaelm@google.com>
  • Loading branch information
3 people committed Mar 7, 2024
1 parent a88cbf7 commit 78f8aff
Show file tree
Hide file tree
Showing 9 changed files with 801 additions and 385 deletions.
68 changes: 68 additions & 0 deletions ALGORITHM.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
Chitchat relies on an extension of the Scuttlebutt reconciliation algorithm, to handle:
- Key-Value deletion
- Member leaving the cluster.

For simplification, we will just discuss the replication of a single map on a cluster of n node, on which only one node can write to.

(In scuttlebutt, each node has ownership over such a map.)

There are two kinds of update event on this map:
- Set(K, V)
- Deletion(K)

The writer associates an incremental version to all of the updates.

Let's assume the writer node has done V updates and its map has reached a final
state $\mathcal{S}_f$.
We can use the notation $ ({u_{0}}, ..., {u_i}, ..., {u}_{v=V}) $ to describe the sequence of updates, ordered by version.

At any point in time, each node in the cluster has a view of the state map $\mathcal{S}$. This view is not necessarily up-to-date.

Every second or so, a node $N$ will contact a peers $N'$, requesting for updates. The peer will then send back a subset of updates $(u_i, ..., u_j)$.

By design the algorithm attempts to work with a limited budget. The update is truncated to fit a maximum size called the `mtu`. In `chitchat`, the protocol used is UDP, and the default mtu is of around `65KB`

We call such an update a `delta`.

Upon reception of the updates, the node $N$ may apply this update to update its state.
We can write this operation.

$$\mathcal{S}' = \mathcal{S}~ \top \left(u_i, ..., u_j\right)$$

In order to inform the peer about where the updates should start from, the
node will send two parameters describing how much of the state is replicated:

- `max_version`: One big difference with reliable broadcast, is that this `max version` does NOT mean that the node's state reflects all of the events earlier than
`max_version`. Rather, it says that if the node were to receive all of the events within between `max_version` and $V$ then it will end up to date.
The difference is subtle, but comes from an important optimization of scuttlebutt.

If a node asks for an update above $max_version = m$, and that for a given key $k$ there are two updates after $m$, there is no need to send the first one: it will eventually be overridden by the second one.

- `last_gc_version`: Rather than being deleted right away, the algorithm is using a tombstone mechanism. Key entries are kept by marked as deleted. From the client point of view, everything looks like the key really has been deleted. Eventually, in order
to free memory, we garbage collect all of the key values marked for deletion that are older than a few hours. We then keep track of the maximum `last_gc_version`.

At each round of the protocol, a node maintains a pair $(last\_gc\_version, max\_version)$ which can only increase in lexicographical order.

We also defined the subsequence of missing updates $(u')$ extracted from $(u_i)$ by keeping only the update matching the predicate:

- $i > max\_version$ if $u_i$ is a set operation

- $i > max(max\_version, last\_gc\_version)$ if $u_i$ is a delete operation.

The key of the algorithm is to maintain the following invariant.
$$ \mathcal{S}~\top (u') = \mathcal{S}$$

In plain words, the protocol attempts to make sure that, if any nodes were to apply
- all insert updates above its `max_version`
- all delete updates above its `max(max_version, last_gc_version)`
it would have managed to fully replicate $S_f$.

Knowing the `max_version` and the `last_gc_version` of a peer, a node will emit a delta that, if applied to such valid state, would maintain the invariant above and
increase $(last\_gc\_version, max\_version)$ lexicographically.

Due to the nature of UDP and to the concurrent handshake, deltas may not arrive in order or be duplicated. For this reason, upon reception of a delta, a node must ensure that the delta can be applied to the current state, without breaking the invariant,
and increasing $(last\_gc\_version, max\_version)$.




Loading

0 comments on commit 78f8aff

Please sign in to comment.