
Reduce common case resource overhead of update operation read-repairs by introducing metadata-only read phase #3703

Closed
vekterli opened this issue Oct 10, 2017 · 4 comments · Fixed by #23873

@vekterli
Member

tl;dr: let us be able to trade off an extra roundtrip during updates to out-of-sync buckets for reduced network, I/O and CPU usage in the common case.

Today, when an update operation arrives on a distributor it may enter one of two code paths:

  1. If bucket replicas are in sync, the update is sent directly to the replicas and executed against the backend content nodes. This is known as the "fast path".
  2. Otherwise, there may be diverging versions of the document. If we send the update directly to the individual replicas we might introduce inconsistencies by applying partial updates to different versions. In this case we perform a read-repair where the document is fetched from all mutually diverging replicas and the update is performed on the distributor against the most recent version. A put operation with the result is sent to all replicas to force convergence to a shared version. This is known as the "safe/slow path".

The slow path, aside from being slower as its name implies, is highly susceptible to false positives. Since the distributor operates on a bucket-level granularity, it's enough for 1 out of 1000 docs in a bucket to be divergent for the entire bucket to be marked out of sync. Updates to the 999 other documents in the bucket will therefore trigger a slow path unnecessarily (but the distributor cannot know this a priori; for all it knows every single document in the bucket is divergent).
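The bucket-granularity check described above can be sketched as follows. This is a minimal illustration, not actual Vespa code; the `Replica` type and `choose_update_path` function are hypothetical stand-ins for the distributor's internal state and logic:

```python
from dataclasses import dataclass

@dataclass
class Replica:
    node: int
    checksum: int   # bucket-level metadata, not per-document state

def choose_update_path(replicas: list[Replica]) -> str:
    # The comparison is at *bucket* granularity: a single divergent document
    # out of a thousand makes the checksums differ, forcing the slow path
    # for every update that targets the bucket.
    if len({r.checksum for r in replicas}) == 1:
        return "fast"   # send update directly to all replicas
    return "slow"       # read-repair: Get from all, resolve on distributor, Put back

print(choose_update_path([Replica(0, 0xABC), Replica(1, 0xABC)]))  # fast
print(choose_update_path([Replica(0, 0xABC), Replica(1, 0xDEF)]))  # slow
```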

Today we unconditionally perform the update operation on the distributor when executing a slow path update. This is because we've already expended the effort to read the document from disk, so we might as well use it instead of incurring further I/O on the content nodes themselves. This works fine for small documents and/or updates, but breaks down when documents and/or updates are large. The single-threaded execution model of the distributor limits the number of operations it can perform per second, whereas the content nodes can run with arbitrarily large thread pools.

I suggest the following changes:

  1. Fast-path update handling remains unchanged
  2. The current two-phase update scheme is extended to three phases. The initial phase fetches only the document versions instead of the documents themselves. If and only if all versions match, trigger a fast path update. Otherwise, proceed with the original two-phase slow path for replicas that have diverging document versions.

Document versions (timestamps) are kept in-memory in Proton and are therefore "free" to read. We still get an extra distributor<->content node roundtrip, but only in the case where buckets are out of sync.
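The proposed scheme can be sketched like this. It is an in-memory model with hypothetical names (`Doc`, `execute_update`), not actual Vespa code; replicas are modeled as a node-to-document map and the metadata fetch as a set of timestamps:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    body: str
    timestamp: int

def execute_update(replicas: dict[int, Doc], bucket_in_sync: bool,
                   new_body: str, now: int) -> str:
    """Return which path handled the update: 'fast' or 'slow'."""

    def fast_path() -> str:
        # Send the update straight to every replica; no reads needed.
        for doc in replicas.values():
            doc.body, doc.timestamp = new_body, now
        return "fast"

    if bucket_in_sync:
        return fast_path()                       # phase 0: unchanged today

    # New metadata-only phase: fetch just the timestamps. Proton keeps
    # these in memory, so this costs a roundtrip but no disk I/O.
    if len({doc.timestamp for doc in replicas.values()}) == 1:
        return fast_path()                       # all versions match

    # Original slow path: read full documents, apply the update to the
    # newest version on the distributor, write the result everywhere.
    newest = max(replicas.values(), key=lambda d: d.timestamp)
    merged = Doc(new_body, now)                  # stands in for update(newest)
    for node in list(replicas):
        replicas[node] = Doc(merged.body, merged.timestamp)
    return "slow"
```

Note that in this sketch the out-of-sync-but-matching-timestamps case pays one extra roundtrip (the timestamp fetch) before taking the fast path, which is exactly the trade-off described in the tl;dr.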

@vekterli vekterli changed the title Reduce common case resource overhead of update operations read-repairs by introducing metadata-only read phase Reduce common case resource overhead of update operation read-repairs by introducing metadata-only read phase Oct 10, 2017
@bratseth
Member

Sounds good to me.

@geirst geirst added this to the soon milestone Oct 11, 2017
@vekterli
Member Author

Seems like we should be able to support this without any wire format changes by sending down Gets with a field-set of [none] and only returning the last modified timestamp. Just have to verify that the timestamp is always set if present, regardless of field-set.

@vekterli
Member Author

A subset of this proposal is implemented as part of #11319. It does not implement the metadata-only additional phase, but does trigger fast path updates if documents are in sync across replicas after the existing read phase.

@vekterli
Member Author

We've very recently identified a race condition regression, introduced with #11319, that may cause inconsistent document versions to be created in the following scenario:

  • The update is sent with create: true
  • The replicas of the target bucket are already out of sync when the update arrives on the distributor
  • A concurrent mutation to the target bucket modifies the size of its replica set (i.e. adds a replica) in the time between sending the Get operations and the subsequent Update operations
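A minimal model of the race is sketched below. This is a hypothetical illustration (the function name and the use of `ts=0` as a "no prior version" marker are assumptions based on the log message, not Vespa internals): a node added to the replica set after the Gets were sent never served a read, so a `create: true` update creates a fresh document there while the pre-existing replica reports the version the update was applied against.

```python
def timestamps_reported(replicas_at_get: dict[int, int],
                        replicas_at_update: list[int]) -> dict[int, int]:
    """Per-node timestamps the update replies would report back."""
    reported = {}
    for node in replicas_at_update:
        # A node present during the Get phase reports the document version
        # the update hit; a node added in between never held the document,
        # so create=true makes a brand-new one (ts=0 stands in for "none").
        reported[node] = replicas_at_get.get(node, 0)
    return reported

# Node 6 held the document when the Gets went out; node 5 was added by a
# concurrent mutation before the Update operations were sent.
print(timestamps_reported({6: 1576000000000001}, [5, 6]))
```

The diverging per-node timestamps in the result correspond to the warning shown below.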

A log warning of the following form should be emitted if this particular edge case is triggered:

Update operation for 'id:foo::bar' in bucket Bucket(BucketSpace(0x0000000000000001), BucketId(0x4000000000f00baa)) updated documents with different timestamps. This should not happen and may indicate undetected replica divergence. Found ts=0 on node 5, ts=1576000000000001 on node 6

A grep for "updated documents with different timestamps" in the Vespa logs will show if this has happened. If it has, the safest option is to re-feed the given document ID, as it may have replica divergence that Vespa cannot detect and automatically fix.
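For checking logs programmatically, the scan can be sketched as below. The regular expression is an assumption derived from the example warning above, not a documented log format, so adjust it against your actual log lines:

```python
import re

# Pattern assumed from the example warning in this issue; not a stable API.
WARNING_RE = re.compile(r"Update operation for '([^']+)' in bucket .* "
                        r"updated documents with different timestamps")

def divergent_doc_ids(log_lines):
    """Collect document IDs mentioned in divergence warnings, for re-feeding."""
    ids = set()
    for line in log_lines:
        m = WARNING_RE.search(line)
        if m:
            ids.add(m.group(1))
    return ids

sample = ["Update operation for 'id:foo::bar' in bucket Bucket(...) "
          "updated documents with different timestamps. This should not happen "
          "and may indicate undetected replica divergence. "
          "Found ts=0 on node 5, ts=1576000000000001 on node 6"]
print(divergent_doc_ids(sample))   # {'id:foo::bar'}
```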

This regression has been fixed with #11561, but note that the fix is not yet part of a public release. In the meantime, if this problem is observed, it's possible to disable the code path that triggers the bug by adding a config override to services.xml for the affected content cluster:

<config name="vespa.config.content.core.stor-distributormanager">
  <restart_with_fast_update_path_if_all_get_timestamps_are_consistent>false</restart_with_fast_update_path_if_all_get_timestamps_are_consistent>
</config>

This is a live change and does not require a process restart.
