
Reduce common case resource overhead of update operation read-repairs by introducing metadata-only read phase #3703

Closed
vekterli opened this issue Oct 10, 2017 · 4 comments · Fixed by #23873

@vekterli
Member

tl;dr: let us be able to trade off an extra roundtrip during updates to out-of-sync buckets for reduced network, I/O and CPU usage in the common case.

Today, when an update operation arrives on a distributor it may enter one of two code paths:

  1. If bucket replicas are in sync, the update is sent directly to the replicas and executed against the backend content nodes. This is known as the "fast path".
  2. Otherwise, there may be diverging versions of the document. If we send the update directly to the individual replicas we might introduce inconsistencies by applying partial updates to different versions. In this case we perform a read-repair where the document is fetched from all mutually diverging replicas and the update is performed on the distributor against the most recent version. A put operation with the result is sent to all replicas to force convergence to a shared version. This is known as the "safe/slow path".

The slow path, aside from being slower as its name implies, is highly susceptible to false positives. Since the distributor operates on a bucket-level granularity, it's enough for 1 out of 1000 docs in a bucket to be divergent for the entire bucket to be marked out of sync. Updates to the 999 other documents in the bucket will therefore trigger a slow path unnecessarily (but the distributor cannot know this a priori; for all it knows every single document in the bucket is divergent).
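The bucket-granularity check described above can be sketched as follows. This is a minimal illustration, not actual Vespa code; the `Replica` type and `choose_update_path` function are hypothetical stand-ins for the distributor's internal state and logic:

```python
from dataclasses import dataclass

@dataclass
class Replica:
    node: int
    checksum: int   # bucket-level metadata, not per-document state

def choose_update_path(replicas: list[Replica]) -> str:
    # The comparison is at *bucket* granularity: a single divergent document
    # out of a thousand makes the checksums differ, forcing the slow path
    # for every update that targets the bucket.
    if len({r.checksum for r in replicas}) == 1:
        return "fast"   # send update directly to all replicas
    return "slow"       # read-repair: Get from all, resolve on distributor, Put back

print(choose_update_path([Replica(0, 0xABC), Replica(1, 0xABC)]))  # fast
print(choose_update_path([Replica(0, 0xABC), Replica(1, 0xDEF)]))  # slow
```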

Today we unconditionally perform the update operation on the distributor when executing a slow path update. This is because we've already expended the effort to read the document from disk, so we might as well use it instead of incurring further I/O on the content nodes themselves. This works fine for small documents and/or updates, but breaks down when documents and/or updates are large. The single-threaded execution model of the distributor limits the number of operations it can perform per second, whereas the content nodes can run with arbitrarily large thread pools.

I suggest the following changes:

  1. Fast-path update handling remains unchanged
  2. The current two-phase update scheme is extended to three phases. The initial phase fetches only the document versions instead of the documents themselves. If and only if all versions match, trigger a fast path update. Otherwise, proceed with the original two-phase slow path for replicas that have diverging document versions.

Document versions (timestamps) are kept in-memory in Proton and are therefore "free" to read. We still get an extra distributor<->content node roundtrip, but only in the case where buckets are out of sync.
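The proposed scheme can be sketched like this. It is an in-memory model with hypothetical names (`Doc`, `execute_update`), not actual Vespa code; replicas are modeled as a node-to-document map and the metadata fetch as a set of timestamps:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    body: str
    timestamp: int

def execute_update(replicas: dict[int, Doc], bucket_in_sync: bool,
                   new_body: str, now: int) -> str:
    """Return which path handled the update: 'fast' or 'slow'."""

    def fast_path() -> str:
        # Send the update straight to every replica; no reads needed.
        for doc in replicas.values():
            doc.body, doc.timestamp = new_body, now
        return "fast"

    if bucket_in_sync:
        return fast_path()                       # phase 0: unchanged today

    # New metadata-only phase: fetch just the timestamps. Proton keeps
    # these in memory, so this costs a roundtrip but no disk I/O.
    if len({doc.timestamp for doc in replicas.values()}) == 1:
        return fast_path()                       # all versions match

    # Original slow path: read full documents, apply the update to the
    # newest version on the distributor, write the result everywhere.
    newest = max(replicas.values(), key=lambda d: d.timestamp)
    merged = Doc(new_body, now)                  # stands in for update(newest)
    for node in list(replicas):
        replicas[node] = Doc(merged.body, merged.timestamp)
    return "slow"
```

Note that in this sketch the out-of-sync-but-matching-timestamps case pays one extra roundtrip (the timestamp fetch) before taking the fast path, which is exactly the trade-off described in the tl;dr.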

@vekterli vekterli changed the title Reduce common case resource overhead of update operations read-repairs by introducing metadata-only read phase Reduce common case resource overhead of update operation read-repairs by introducing metadata-only read phase Oct 10, 2017
@bratseth
Member

Sounds good to me.

@geirst geirst added this to the soon milestone Oct 11, 2017
@vekterli
Member Author

Seems like we should be able to support this without any wire format changes by sending down Gets with a field-set of [none] and only returning the last modified timestamp. Just have to verify that the timestamp is always set if present, regardless of field-set.

@vekterli
Member Author

A subset of this proposal is implemented as part of #11319. It does not implement the metadata-only additional phase, but does trigger fast path updates if documents are in sync across replicas after the existing read phase.

@vekterli
Member Author

We've very recently identified a race condition regression, introduced with #11319, that may cause inconsistent document versions to be created in the following scenario:

  • The update is sent with create: true
  • The replicas of the target bucket are already out of sync when the update arrives on the distributor
  • A concurrent mutation to the target bucket modifies the size of its replica set (i.e. adds a replica) in the time between sending the Get operations and the subsequent Update operations
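A minimal model of the race is sketched below. This is a hypothetical illustration (the function name and the use of `ts=0` as a "no prior version" marker are assumptions based on the log message, not Vespa internals): a node added to the replica set after the Gets were sent never served a read, so a `create: true` update creates a fresh document there while the pre-existing replica reports the version the update was applied against.

```python
def timestamps_reported(replicas_at_get: dict[int, int],
                        replicas_at_update: list[int]) -> dict[int, int]:
    """Per-node timestamps the update replies would report back."""
    reported = {}
    for node in replicas_at_update:
        # A node present during the Get phase reports the document version
        # the update hit; a node added in between never held the document,
        # so create=true makes a brand-new one (ts=0 stands in for "none").
        reported[node] = replicas_at_get.get(node, 0)
    return reported

# Node 6 held the document when the Gets went out; node 5 was added by a
# concurrent mutation before the Update operations were sent.
print(timestamps_reported({6: 1576000000000001}, [5, 6]))
```

The diverging per-node timestamps in the result correspond to the warning shown below.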

A log warning of the following form should be emitted if this particular edge case is triggered:

Update operation for 'id:foo::bar' in bucket Bucket(BucketSpace(0x0000000000000001), BucketId(0x4000000000f00baa)) updated documents with different timestamps. This should not happen and may indicate undetected replica divergence. Found ts=0 on node 5, ts=1576000000000001 on node 6

A grep for "updated documents with different timestamps" in the Vespa logs will show if this has happened. If it has, the safest option is to re-feed the given document ID, as it may have replica divergence that Vespa cannot detect and automatically fix.
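For checking logs programmatically, the scan can be sketched as below. The regular expression is an assumption derived from the example warning above, not a documented log format, so adjust it against your actual log lines:

```python
import re

# Pattern assumed from the example warning in this issue; not a stable API.
WARNING_RE = re.compile(r"Update operation for '([^']+)' in bucket .* "
                        r"updated documents with different timestamps")

def divergent_doc_ids(log_lines):
    """Collect document IDs mentioned in divergence warnings, for re-feeding."""
    ids = set()
    for line in log_lines:
        m = WARNING_RE.search(line)
        if m:
            ids.add(m.group(1))
    return ids

sample = ["Update operation for 'id:foo::bar' in bucket Bucket(...) "
          "updated documents with different timestamps. This should not happen "
          "and may indicate undetected replica divergence. "
          "Found ts=0 on node 5, ts=1576000000000001 on node 6"]
print(divergent_doc_ids(sample))   # {'id:foo::bar'}
```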

This regression has been fixed with #11561, but note that the fix is not yet part of a public release. In the meantime, if this problem is observed, it's possible to disable the code path that triggers the bug by adding a config override to services.xml for the affected content cluster:

<config name="vespa.config.content.core.stor-distributormanager">
  <restart_with_fast_update_path_if_all_get_timestamps_are_consistent>false</restart_with_fast_update_path_if_all_get_timestamps_are_consistent>
</config>

This is a live change and does not require a process restart.
