
Strict repository lock handling #3569

Merged: 11 commits, Oct 2, 2022

Conversation

@MichaelEischer (Member) commented Nov 6, 2021

What does this PR change? What problem does it solve?

restic operations did not check that their lock was still valid. This can lead to a situation where, for example, a backup client is paused, `unlock` is then called for the repository and removes the now stale lock, and the backup continues some time later. If `prune` was called in the meantime, this results in a broken snapshot.

This PR changes the locking behavior to actually be strict. If restic is unable to refresh its locks in time, the whole operation is canceled. This is done by tying the context used by restic's commands to the lock lifetime: if the monitoring goroutine for the lock detects that the lock file was not refreshed in time, the context is canceled and the command is thereby forcibly terminated.

To keep the implementation simple, there are now two goroutines per lock: one periodically refreshes the lock file, and the other monitors that each refresh completes in time. The deadline for refreshing the lock file is a few minutes shorter than the duration after which a lock file becomes stale; this is intended to compensate for a small amount of clock drift between clients.
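
For illustration, a minimal sketch of this two-goroutine design in Go is shown below. It is not restic's actual code: the function name `monitorLock`, the `refresh` callback, and the concrete durations are assumptions chosen for the example.

```go
package lockdemo

import (
	"context"
	"time"
)

const (
	refreshInterval = 5 * time.Minute  // assumed: how often the lock file is rewritten
	staleTimeout    = 30 * time.Minute // assumed: when other clients treat a lock as stale
	// The refresh deadline is kept a few minutes shorter than staleTimeout to
	// tolerate a small amount of clock drift between clients.
	refreshDeadline = staleTimeout - 5*time.Minute
)

// monitorLock starts the two goroutines described above and returns a context
// that is canceled as soon as the lock can no longer be guaranteed to be valid.
func monitorLock(ctx context.Context, refresh func(context.Context) error) context.Context {
	ctx, cancel := context.WithCancel(ctx)
	refreshed := make(chan struct{}, 1)

	// Goroutine 1: periodically rewrite the lock file.
	go func() {
		ticker := time.NewTicker(refreshInterval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if err := refresh(ctx); err == nil {
					select {
					case refreshed <- struct{}{}:
					default: // a notification is already pending
					}
				}
			}
		}
	}()

	// Goroutine 2: cancel the context if no successful refresh arrives in time.
	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			case <-refreshed:
				// refreshed in time; wait for the next refresh
			case <-time.After(refreshDeadline):
				cancel() // the lock may have gone stale; abort the whole operation
				return
			}
		}
	}()

	return ctx
}
```

Because both goroutines also select on ctx.Done(), unlocking the repository (which cancels the parent context) stops them as well.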

Most code changes revolve around cleaning up the usage of the global context previously available via globalOptions.ctx. Depending on the command, that context was accessed either via gopts.ctx or through a local copy in ctx. In addition, a few commands introduced an extra context using context.WithCancel, which is unnecessary as the global context is already canceled when restic shuts down. Taken together, these different ways of using the context made it non-obvious which contexts have to be tied to the lock lifetime to ensure that a command properly terminates after a failed lock refresh.
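
As an illustration of passing the context explicitly instead of reading it from a global, the following sketch wires a root context through cobra's `ExecuteContext` and retrieves it in a handler via `cmd.Context()`. The command layout and the `runBackup` placeholder are assumptions, not restic's actual wiring.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"

	"github.com/spf13/cobra"
)

// runBackup is a placeholder handler; it receives the context as an explicit
// parameter instead of pulling it out of a shared globalOptions struct.
func runBackup(ctx context.Context, args []string) error {
	// Real work would pass ctx to long-running calls and periodically check ctx.Err().
	if err := ctx.Err(); err != nil {
		return err
	}
	fmt.Println("backup would run here with an explicit context")
	return nil
}

func main() {
	// The root context is canceled when the process is interrupted, which in
	// turn cancels every command handler that received it.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stop()

	root := &cobra.Command{Use: "example"}
	root.AddCommand(&cobra.Command{
		Use: "backup",
		RunE: func(cmd *cobra.Command, args []string) error {
			// The context arrives via cobra instead of a global variable.
			return runBackup(cmd.Context(), args)
		},
	})

	if err := root.ExecuteContext(ctx); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```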

This PR implements the "additional behavior changes" proposed in #2715. However, it does not check whether a lock file has disappeared, as that could lead to race conditions with storage backends which do not provide strongly consistent directory listings.

Was the change previously discussed in an issue or on the forum?

Fixes #2715

Checklist

  • I have read the contribution guidelines.
  • I have enabled maintainer edits.
  • I have added tests for all code changes.
  • I have added documentation for relevant changes (in the manual).
  • There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • I have run gofmt on the code in all commits.
  • All commit messages are formatted in the same style as the other commits in the repo.
  • I'm done! This pull request is ready for review.

@MichaelEischer force-pushed the strict-locking branch 2 times, most recently from 4becb95 to 7fb3289 on November 14, 2021 at 16:52
@MichaelEischer (Member, Author)

I've added an extra commit which lets restic fail if it is unable to read any of the lock files in the repository. The idea is to prevent overlooking concurrent restic processes if a lock file is unreadable for some reason.

@XLTechie commented Sep 8, 2022

+1! This seems like a critical change.
I have recently started using restic on multiple large servers, backing up to a cloud storage provider on a schedule, and this almost immediately became a concern.
I hope this can be resolved and merged soon!

@MichaelEischer (Member, Author)

> I have recently started using restic on multiple large servers, backing up to a cloud storage provider on a schedule, and this almost immediately became a concern.

In what regard? For servers that don't hibernate or use standby, this PR is likely not very relevant.

@MichaelEischer (Member, Author)

I've changed the timer expiry checks as apparently timers are stopped during standby: golang/go#35012

@XLTechie commented Sep 11, 2022 via email

@fd0 (Member) left a comment

I haven't done an in-depth review of the new lock code, but looked at the design and I like it a lot!

@MichaelEischer (Member, Author)

The globalContext is now passed through cobra. (credit goes to @fd0)

The gopts.ctx is cancelled when the main() method of restic exits.
Previously the global context was either accessed via gopts.ctx,
stored in a local variable and then used within that function or
sometimes both. This makes it very hard to follow which ctx or a wrapped
version of it reaches which method.

Thus just drop the context from the globalOptions struct and pass it
explicitly to every command line handler method.

Restic continued e.g. a backup task even when it failed to renew the
lock or failed to do so in time. For example, if a backup client enters
standby during the backup, this can allow other operations like `prune`
to run in the meantime (after calling `unlock`). After leaving standby,
the backup client will continue its backup and upload indexes which
refer to pack files that were removed in the meantime.

This commit introduces a goroutine that explicitly monitors for locks
that are not refreshed in time. To simplify the implementation, each
lock now has its own pair of goroutines to refresh the lock and monitor
for timeouts. The monitoring goroutine now causes the backup to fail,
as the client has lost its lock in the meantime.

The lock refresh goroutines are bound to the context used to lock the
repository initially. The context returned by `lockRepo` is also
cancelled when any of the goroutines exits. This ensures that the
context is cancelled whenever for any reason the lock is no longer
refreshed.

The tests check that the wrapped context is properly canceled whenever
the repository is unlocked or when the lock refresh fails.

While searching for lock files from concurrently running restic
instances, restic ignored unreadable lock files. These can either be
in fact invalid or just temporarily unreadable. As it is not really
possible to differentiate between the two cases, err on the side of
caution and consider the repository as already locked.

The code retries searching for other locks up to three times to smooth
out temporarily unreadable lock files.
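
The following is a rough sketch of that retry idea. The `checkForOtherLocks` helper, its `listLocks` callback, and the five-second delay are hypothetical; only the retry count of three and the "treat unreadable as locked" rule come from the description above.

```go
package lockdemo

import (
	"context"
	"errors"
	"time"
)

// ErrRepositoryLocked is returned when another lock (or an unreadable lock
// file, which is treated the same way) is found.
var ErrRepositoryLocked = errors.New("repository is already locked")

// checkForOtherLocks retries the lock listing a few times so that a lock file
// that is only temporarily unreadable does not immediately abort the operation.
// listLocks is a hypothetical placeholder for the function that scans the lock
// files; it reports how many other locks exist and whether any was unreadable.
func checkForOtherLocks(ctx context.Context, listLocks func(context.Context) (otherLocks int, unreadable bool, err error)) error {
	const retries = 3
	var lastErr error
	for i := 0; i < retries; i++ {
		if i > 0 {
			// brief pause before retrying; the delay is an assumption
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(5 * time.Second):
			}
		}
		otherLocks, unreadable, err := listLocks(ctx)
		if err != nil {
			lastErr = err
			continue
		}
		if otherLocks > 0 || unreadable {
			// err on the side of caution: an unreadable lock file counts as a lock
			lastErr = ErrRepositoryLocked
			continue
		}
		return nil // no conflicting locks found
	}
	return lastErr
}
```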

Monotonic timers are paused during standby, so these timers won't fire
after waking up. Fall back to periodic polling to detect overly large
clock jumps. See golang/go#35012 for a discussion of Go timers during
standby.
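
A minimal sketch of such a polling fallback is shown below, assuming the last successful refresh is tracked as a wall-clock timestamp. `Round(0)` strips the monotonic reading so the comparison reflects the wall clock; the polling interval and function name are assumptions.

```go
package lockdemo

import (
	"context"
	"time"
)

// pollForExpiredRefresh cancels the operation when the last successful refresh
// (a wall-clock timestamp) is older than the deadline. Polling with a ticker
// works even if a timer set before standby never fires after wake-up.
func pollForExpiredRefresh(ctx context.Context, cancel context.CancelFunc, lastRefresh func() time.Time, deadline time.Duration) {
	ticker := time.NewTicker(30 * time.Second) // assumed polling interval
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Round(0) strips the monotonic reading, so the comparison uses the
			// wall clock and notices large jumps caused by standby.
			now := time.Now().Round(0)
			if now.Sub(lastRefresh().Round(0)) > deadline {
				cancel()
				return
			}
		}
	}
}
```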

@virtadpt commented May 8, 2023

> I should have said "virtual machine" rather than "server".

It's not just virtual machines. I started seeing this on my physical servers a couple of days ago.

@MichaelEischer (Member, Author)

Please have a look at #4199 or #4262

@MichaelEischer mentioned this pull request Jun 17, 2023

Successfully merging this pull request may close these issues.

  • prune not running because of missing data
  • Improve error resilience of lock handling