Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry loading of corrupted data from backend / cache #4800

Merged
merged 25 commits into from May 18, 2024

Conversation

MichaelEischer
Copy link
Member

@MichaelEischer MichaelEischer commented May 9, 2024

What does this PR change? What problem does it solve?

The PR cleans up how data is loaded from the backend and ensures that all places include a fallback to handle transient errors or a corrupt file in the cache.

The main component is repository.LoadRaw which replaces backend.LoadAll. In addition, almost all direct backend.Load() uses outside the backend/cache code have been replaced with calls to LoadRaw to benefit from its error handling. The only exceptions are LoadBlob, ListPack, checkPack and LoadBlobsFromPack. The first three include custom retry code to handle transient errors and the LoadBlobsFromPack falls back to LoadBlob.

Direct uses of Backend.Load() outside the repository package are discouraged now and will eventually be no longer possible.

The retry strategy from transient errors in general consists of explicitly forgetting the damaged file from the cache and retrying the operation once. The retries no longer happens as part of the retries performed by the RetryBackend but instead are handled separately. Thus, retries triggered by the RetryBackend now exclusively relate to backend / download errors. For example a pack file with corrupt blobs will now be downloaded at most twice.

In conjunction with the just mentioned retry strategy, the cache no longer automatically removes a file if the processing callback failed. In particular, since #4605 the automatic removal only worked in few cases. Now, the cached broken file is passed to cache.Forget(h) explicitly by LoadRaw etc. cache.Forget(h) only deletes a file at most once during the runtime of restic. This ensures that if a file is corrupted at the backend, then it will only be downloaded and forgotten at most once.

The cached file in the cache is now considered authoritative. Requesting a file section that is out of bounds of the cached file, now immediately yields an error. The retry strategy of LoadRaw etc. then forgets broken files. The main benefit of the new behavior is that out of bounds accesses of truncated files now fail immediately if the file is cacheable, instead of yielding endless retries.

The changes are complemented with a few cleanups to reduce code duplication:

  • LoadUnpacked now uses LoadRaw
  • LoadBlob uses the PackedBlobIterator
  • And several minor cleanups, like removing the duplicate blob loading fallback from repair packs.

Was the change previously discussed in an issue or on the forum?

Prerequisite for #4784, see #4627.
Related to #4774.

Checklist

  • I have read the contribution guidelines.
  • I have enabled maintainer edits.
  • I have added tests for all code changes.
  • [ ] I have added documentation for relevant changes (in the manual).
  • There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • I have run gofmt on the code in all commits.
  • All commit messages are formatted in the same style as the other commits in the repo.
  • I'm done! This pull request is ready for review.

The helper is only intended for usage by backend implementations.
LoadRaw also includes improved context cancellation handling similar to the
implementation in repository.LoadUnpacked.

The removed cache backend test will be added again later on.
This replaces calling the low-level backend.Load() method.
Both functions were using a similar implementation.
LoadBlobsFromPack already implements the same fallback behavior.
This warning should already have been removed once the feature flag was
dropped.
A file is always cached whole. Thus, any out of bounds access will also
fail when directed at the backend. To handle case in which the cached
file is broken, then caller must call Cache.Forget(h) for the file in
question.
This is inspired by the circuit breaker pattern used for distributed
systems. If too many requests fails, then it is better to immediately
fail new requests for a limited time to give the backend time to
recover.

By only forgetting a file in the cache at most once, we can ensure that
a broken file is only retrieved once again from the backend. If the file
stored there is broken, previously it would be cached and deleted
continuously. Now, it is retrieved only once again, all later requests
just use the cached copy and either succeed or fail immediately.
This ensures that the pack header is actually read completely.
Previously, for a truncated file it was possible to only read a part of
the header, as backend.Load(...) is not guaranteed to return as many
bytes as requested by the length parameter.
Copy link
Member Author

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@MichaelEischer MichaelEischer merged commit eb6c653 into restic:master May 18, 2024
13 checks passed
@MichaelEischer MichaelEischer deleted the cleanup-load branch May 18, 2024 19:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

1 participant