Avoid fast path update restart race with concurrently created replica #11561

vekterli · 2019-12-13T10:14:13Z

@geirst please review. This fix was hammered out in a cramped seat at 35,000 feet (occasionally interrupted by a high fidelity surround sound setup of screaming infants) so I will likely polish these changes a bit later...!

After the recent change to allow safe path updates to be restarted
as fast path updates iff all observed document timestamps are equal,
a race condition regression was introduced. If the bucket that the
update operation was scheduled towards got a new replica concurrently
created between the time that safe path Gets were sent and received,
it was possible for updates to be sent to inconsistent replicas. This
is because the Get and Update operations use the current database
state at their start time, not a stable snapshot state from the start
time of the two phase update operation itself.

Add an explicit check that the replica state between sending Gets and
Updates is unchanged. If it has changed, a fast path restart is not
permitted.
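The check described above can be sketched as follows. This is a minimal illustration, not the distributor's actual code: `Replica`, `BucketDatabase`, and `may_restart_as_fast_path` are hypothetical names, and treating "replica state" as the set of (node, checksum) pairs for the bucket is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Sequence


@dataclass(frozen=True)
class Replica:
    # Hypothetical replica descriptor: which node holds it and its content checksum.
    node: int
    checksum: int


class BucketDatabase:
    """Hypothetical stand-in for the bucket database (not the real Vespa API)."""

    def __init__(self) -> None:
        self._replicas: Dict[int, List[Replica]] = {}

    def add_replica(self, bucket: int, replica: Replica) -> None:
        self._replicas.setdefault(bucket, []).append(replica)

    def replica_state(self, bucket: int) -> FrozenSet[Replica]:
        # Immutable snapshot of the replicas currently known for the bucket.
        return frozenset(self._replicas.get(bucket, []))


def may_restart_as_fast_path(
    state_when_gets_sent: FrozenSet[Replica],
    state_when_updates_sent: FrozenSet[Replica],
    observed_timestamps: Sequence[int],
) -> bool:
    # Existing condition: every Get observed the same document timestamp.
    timestamps_equal = len(set(observed_timestamps)) == 1
    # Added condition: no replica was created or changed while the Gets were in flight.
    replica_state_unchanged = state_when_gets_sent == state_when_updates_sent
    return timestamps_equal and replica_state_unchanged


db = BucketDatabase()
db.add_replica(1, Replica(node=0, checksum=0xABC))
db.add_replica(1, Replica(node=1, checksum=0xABC))

state_before = db.replica_state(1)            # captured when the safe path Gets are sent
db.add_replica(1, Replica(node=2, checksum=0))  # replica created concurrently
state_after = db.replica_state(1)             # captured when the Updates would be sent

# Both Gets returned the same timestamp, but the replica set changed in between,
# so the operation must stay on the safe path.
restart_allowed = may_restart_as_fast_path(state_before, state_after, [1000, 1000])
```

In this sketch the new replica on node 2 never received a Get, so equal timestamps from nodes 0 and 1 say nothing about its contents; comparing the two snapshots is what catches that case.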

geirst

👍

Even with the fix in #11561 we are still observing replica divergence warnings in the logs. Disabling this feature entirely until the issue has been fully investigated and a complete fix has been implemented. Also emit a log message when the distributor has forced convergence of a detected inconsistent update.

vekterli requested a review from geirst December 13, 2019 10:14

vekterli assigned geirst Dec 13, 2019

aressem changed the title ~~Avoid fast past update restart race with concurrently created replica~~ Avoid fast past update restart race with concurrently created replica Dec 16, 2019

geirst approved these changes Dec 16, 2019

geirst assigned vekterli and unassigned geirst Dec 16, 2019

vekterli changed the title ~~Avoid fast past update restart race with concurrently created replica~~ Avoid fast past update restart race with concurrently created replica Dec 16, 2019

vekterli merged commit 6f5128b into master Dec 16, 2019

vekterli deleted the vekterli/avoid-fast-path-update-race-with-concurrent-replica-creation branch December 16, 2019 15:39

vekterli mentioned this pull request Dec 17, 2019

Reduce common case resource overhead of update operation read-repairs by introducing metadata-only read phase #3703

Closed

This was referenced Dec 20, 2019

Disable fast update path restarts by default #11612

Merged

Ensure missing documents on replicas are not erroneously considered consistent #11613

Merged

vekterli mentioned this pull request Jan 7, 2020

Regression introduced in Vespa 7.141 may cause data loss or inconsistencies when using 'create: true' updates #11686

Closed
