
Strict repository lock handling #3569

Merged: 11 commits, Oct 2, 2022

Conversation

@MichaelEischer (Member) commented Nov 6, 2021

What does this PR change? What problem does it solve?

restic operations did not check that their lock was still valid. This can lead to a situation where, for example, a backup client is paused, `unlock` is then called for the repository and removes the now stale lock, and the backup continues some time later. If `prune` was called in the meantime, this results in a broken snapshot.

This PR changes the locking behavior to actually be strict. If restic is unable to refresh its locks in time, the whole operation is canceled. This is done by tying the context used by restic's commands to the lock lifetime: if the monitoring goroutine for the lock detects that the lock file was not refreshed in time, the context is canceled and the command is thereby forcibly terminated.

To keep the implementation simple, there are now two goroutines per lock: one periodically refreshes the lock file, and the other monitors that each refresh completes in time. The deadline for refreshing the lock file is a few minutes shorter than the duration after which a lock file becomes stale; this is intended to compensate for a small amount of clock drift between clients.
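
For illustration, a minimal sketch of this two-goroutine design in Go is shown below. It is not restic's actual code: the function name `monitorLock`, the `refresh` callback, and the concrete durations are assumptions chosen for the example.

```go
package lockdemo

import (
	"context"
	"time"
)

const (
	refreshInterval = 5 * time.Minute  // assumed: how often the lock file is rewritten
	staleTimeout    = 30 * time.Minute // assumed: when other clients treat a lock as stale
	// The refresh deadline is kept a few minutes shorter than staleTimeout to
	// tolerate a small amount of clock drift between clients.
	refreshDeadline = staleTimeout - 5*time.Minute
)

// monitorLock starts the two goroutines described above and returns a context
// that is canceled as soon as the lock can no longer be guaranteed to be valid.
func monitorLock(ctx context.Context, refresh func(context.Context) error) context.Context {
	ctx, cancel := context.WithCancel(ctx)
	refreshed := make(chan struct{}, 1)

	// Goroutine 1: periodically rewrite the lock file.
	go func() {
		ticker := time.NewTicker(refreshInterval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if err := refresh(ctx); err == nil {
					select {
					case refreshed <- struct{}{}:
					default: // a notification is already pending
					}
				}
			}
		}
	}()

	// Goroutine 2: cancel the context if no successful refresh arrives in time.
	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			case <-refreshed:
				// refreshed in time; wait for the next refresh
			case <-time.After(refreshDeadline):
				cancel() // the lock may have gone stale; abort the whole operation
				return
			}
		}
	}()

	return ctx
}
```

Because both goroutines also select on ctx.Done(), unlocking the repository (which cancels the parent context) stops them as well.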

Most code changes revolve around cleaning up the usage of the global context previously available via globalOptions.ctx. Depending on the command, that context was accessed either via gopts.ctx or through a local copy in ctx. In addition, a few commands introduced an extra context using context.WithCancel, which is unnecessary as the global context is already canceled when restic shuts down. Taken together, these different ways of using the context made it non-obvious which contexts have to be tied to the lock lifetime to ensure that a command properly terminates after a failed lock refresh.
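
As an illustration of passing the context explicitly instead of reading it from a global, the following sketch wires a root context through cobra's `ExecuteContext` and retrieves it in a handler via `cmd.Context()`. The command layout and the `runBackup` placeholder are assumptions, not restic's actual wiring.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"

	"github.com/spf13/cobra"
)

// runBackup is a placeholder handler; it receives the context as an explicit
// parameter instead of pulling it out of a shared globalOptions struct.
func runBackup(ctx context.Context, args []string) error {
	// Real work would pass ctx to long-running calls and periodically check ctx.Err().
	if err := ctx.Err(); err != nil {
		return err
	}
	fmt.Println("backup would run here with an explicit context")
	return nil
}

func main() {
	// The root context is canceled when the process is interrupted, which in
	// turn cancels every command handler that received it.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stop()

	root := &cobra.Command{Use: "example"}
	root.AddCommand(&cobra.Command{
		Use: "backup",
		RunE: func(cmd *cobra.Command, args []string) error {
			// The context arrives via cobra instead of a global variable.
			return runBackup(cmd.Context(), args)
		},
	})

	if err := root.ExecuteContext(ctx); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```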

This PR implements the "additional behavior changes" proposed in #2715. However, it does not check whether a lock file has disappeared, as that could lead to race conditions with storage backends which do not provide strongly consistent directory listings.

Was the change previously discussed in an issue or on the forum?

Fixes #2715

Checklist

  • I have read the contribution guidelines.
  • I have enabled maintainer edits.
  • I have added tests for all code changes.
  • I have added documentation for relevant changes (in the manual).
  • There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • I have run gofmt on the code in all commits.
  • All commit messages are formatted in the same style as the other commits in the repo.
  • I'm done! This pull request is ready for review.

@MichaelEischer force-pushed the strict-locking branch 2 times, most recently from 4becb95 to 7fb3289 on November 14, 2021 at 16:52
@MichaelEischer (Member, Author)

I've added an extra commit which lets restic fail if it is unable to read any of the lock files in the repository. The idea is to prevent overlooking concurrent restic processes if a lock file is unreadable for some reason.

@XLTechie commented Sep 8, 2022

+1! This seems like a critical change.
I have recently started using restic on multiple large servers, backing up to a cloud storage provider on a schedule, and this almost immediately became a concern.
I hope this can be resolved and merged soon!

@MichaelEischer (Member, Author)

> I have recently started using restic on multiple large servers, backing up to a cloud storage provider on a schedule, and this almost immediately became a concern.

In what regard? For servers that don't hibernate or use standby, this PR is likely not very relevant.

@MichaelEischer (Member, Author)

I've changed the timer expiry checks as apparently timers are stopped during standby: golang/go#35012

@XLTechie commented Sep 11, 2022 via email

@fd0 (Member) left a comment

I haven't done an in-depth review of the new lock code, but looked at the design and I like it a lot!

@MichaelEischer (Member, Author)

The globalContext is now passed through cobra. (credit goes to @fd0)

The gopts.ctx is cancelled when the main() method of restic exits.
Previously the global context was either accessed via gopts.ctx,
stored in a local variable and then used within that function or
sometimes both. This makes it very hard to follow which ctx or a wrapped
version of it reaches which method.

Thus just drop the context from the globalOptions struct and pass it
explicitly to every command line handler method.

Restic continued e.g. a backup task even when it failed to renew the
lock or failed to do so in time. For example, if a backup client enters
standby during the backup, this can allow other operations like `prune`
to run in the meantime (after calling `unlock`). After leaving standby,
the backup client will continue its backup and upload indexes which
refer to pack files that were removed in the meantime.

This commit introduces a goroutine that explicitly monitors for locks
that are not refreshed in time. To simplify the implementation, each
lock now has its own pair of goroutines to refresh the lock and monitor
for timeouts. The monitoring goroutine now causes the backup to fail,
as the client has lost its lock in the meantime.

The lock refresh goroutines are bound to the context used to lock the
repository initially. The context returned by `lockRepo` is also
cancelled when any of the goroutines exits. This ensures that the
context is cancelled whenever for any reason the lock is no longer
refreshed.

The tests check that the wrapped context is properly canceled whenever
the repository is unlocked or when the lock refresh fails.

While searching for lock files from concurrently running restic
instances, restic ignored unreadable lock files. These can either be
in fact invalid or just temporarily unreadable. As it is not really
possible to differentiate between the two cases, err on the side of
caution and consider the repository as already locked.

The code retries searching for other locks up to three times to smooth
out temporarily unreadable lock files.
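
The following is a rough sketch of that retry idea. The `checkForOtherLocks` helper, its `listLocks` callback, and the five-second delay are hypothetical; only the retry count of three and the "treat unreadable as locked" rule come from the description above.

```go
package lockdemo

import (
	"context"
	"errors"
	"time"
)

// ErrRepositoryLocked is returned when another lock (or an unreadable lock
// file, which is treated the same way) is found.
var ErrRepositoryLocked = errors.New("repository is already locked")

// checkForOtherLocks retries the lock listing a few times so that a lock file
// that is only temporarily unreadable does not immediately abort the operation.
// listLocks is a hypothetical placeholder for the function that scans the lock
// files; it reports how many other locks exist and whether any was unreadable.
func checkForOtherLocks(ctx context.Context, listLocks func(context.Context) (otherLocks int, unreadable bool, err error)) error {
	const retries = 3
	var lastErr error
	for i := 0; i < retries; i++ {
		if i > 0 {
			// brief pause before retrying; the delay is an assumption
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(5 * time.Second):
			}
		}
		otherLocks, unreadable, err := listLocks(ctx)
		if err != nil {
			lastErr = err
			continue
		}
		if otherLocks > 0 || unreadable {
			// err on the side of caution: an unreadable lock file counts as a lock
			lastErr = ErrRepositoryLocked
			continue
		}
		return nil // no conflicting locks found
	}
	return lastErr
}
```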

Monotonic timers are paused during standby, so these timers won't fire
after waking up. Fall back to periodic polling to detect overly large
clock jumps. See golang/go#35012 for a discussion of Go timers during
standby.
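
A minimal sketch of such a polling fallback is shown below, assuming the last successful refresh is tracked as a wall-clock timestamp. `Round(0)` strips the monotonic reading so the comparison reflects the wall clock; the polling interval and function name are assumptions.

```go
package lockdemo

import (
	"context"
	"time"
)

// pollForExpiredRefresh cancels the operation when the last successful refresh
// (a wall-clock timestamp) is older than the deadline. Polling with a ticker
// works even if a timer set before standby never fires after wake-up.
func pollForExpiredRefresh(ctx context.Context, cancel context.CancelFunc, lastRefresh func() time.Time, deadline time.Duration) {
	ticker := time.NewTicker(30 * time.Second) // assumed polling interval
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Round(0) strips the monotonic reading, so the comparison uses the
			// wall clock and notices large jumps caused by standby.
			now := time.Now().Round(0)
			if now.Sub(lastRefresh().Round(0)) > deadline {
				cancel()
				return
			}
		}
	}
}
```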

@virtadpt commented May 8, 2023

> I should have said "virtual machine" rather than "server".

It's not just virtual machines. I started seeing this on my physical servers a couple of days ago.

@MichaelEischer (Member, Author)

Please have a look at #4199 or #4262

@MichaelEischer mentioned this pull request Jun 17, 2023

Successfully merging this pull request may close these issues.

  • prune not running because of missing data
  • Improve error resilience of lock handling