Retry loading of corrupted data from backend / cache #4800

MichaelEischer · 2024-05-09T20:53:38Z

What does this PR change? What problem does it solve?

The PR cleans up how data is loaded from the backend and ensures that all places include a fallback to handle transient errors or a corrupt file in the cache.

The main component is repository.LoadRaw which replaces backend.LoadAll. In addition, almost all direct backend.Load() uses outside the backend/cache code have been replaced with calls to LoadRaw to benefit from its error handling. The only exceptions are LoadBlob, ListPack, checkPack and LoadBlobsFromPack. The first three include custom retry code to handle transient errors and the LoadBlobsFromPack falls back to LoadBlob.

Direct uses of Backend.Load() outside the repository package are discouraged now and will eventually be no longer possible.

The retry strategy from transient errors in general consists of explicitly forgetting the damaged file from the cache and retrying the operation once. The retries no longer happens as part of the retries performed by the RetryBackend but instead are handled separately. Thus, retries triggered by the RetryBackend now exclusively relate to backend / download errors. For example a pack file with corrupt blobs will now be downloaded at most twice.

In conjunction with the just mentioned retry strategy, the cache no longer automatically removes a file if the processing callback failed. In particular, since #4605 the automatic removal only worked in few cases. Now, the cached broken file is passed to cache.Forget(h) explicitly by LoadRaw etc. cache.Forget(h) only deletes a file at most once during the runtime of restic. This ensures that if a file is corrupted at the backend, then it will only be downloaded and forgotten at most once.

The cached file in the cache is now considered authoritative. Requesting a file section that is out of bounds of the cached file, now immediately yields an error. The retry strategy of LoadRaw etc. then forgets broken files. The main benefit of the new behavior is that out of bounds accesses of truncated files now fail immediately if the file is cacheable, instead of yielding endless retries.

The changes are complemented with a few cleanups to reduce code duplication:

LoadUnpacked now uses LoadRaw
LoadBlob uses the PackedBlobIterator
And several minor cleanups, like removing the duplicate blob loading fallback from repair packs.

Was the change previously discussed in an issue or on the forum?

Prerequisite for #4784, see #4627.
Related to #4774.

Checklist

I have read the contribution guidelines.
I have enabled maintainer edits.
I have added tests for all code changes.
~~[ ] I have added documentation for relevant changes (in the manual).~~
There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
I have run gofmt on the code in all commits.
All commit messages are formatted in the same style as the other commits in the repo.
I'm done! This pull request is ready for review.

The helper is only intended for usage by backend implementations.

LoadRaw also includes improved context cancellation handling similar to the implementation in repository.LoadUnpacked. The removed cache backend test will be added again later on.

This replaces calling the low-level backend.Load() method.

Both functions were using a similar implementation.

LoadBlobsFromPack already implements the same fallback behavior.

This warning should already have been removed once the feature flag was dropped.

A file is always cached whole. Thus, any out of bounds access will also fail when directed at the backend. To handle case in which the cached file is broken, then caller must call Cache.Forget(h) for the file in question.

This is inspired by the circuit breaker pattern used for distributed systems. If too many requests fails, then it is better to immediately fail new requests for a limited time to give the backend time to recover. By only forgetting a file in the cache at most once, we can ensure that a broken file is only retrieved once again from the backend. If the file stored there is broken, previously it would be cached and deleted continuously. Now, it is retrieved only once again, all later requests just use the cached copy and either succeed or fail immediately.

This ensures that the pack header is actually read completely. Previously, for a truncated file it was possible to only read a part of the header, as backend.Load(...) is not guaranteed to return as many bytes as requested by the length parameter.

MichaelEischer

LGTM

MichaelEischer mentioned this pull request May 9, 2024

Feature Request: Re-Read on checksum-fail (check --data-read) #4774

Open

MichaelEischer force-pushed the cleanup-load branch from a00785b to 17e0d1a Compare May 9, 2024 21:58

This was referenced May 10, 2024

Repository: Remove Backend() method #4802

Merged

Do not retry permanent backend failures #4803

Merged

MichaelEischer force-pushed the cleanup-load branch from 99bf392 to 6d58863 Compare May 18, 2024 18:09

MichaelEischer added 25 commits May 18, 2024 21:26

backend: move LimitReadCloser to util package

47232bf

The helper is only intended for usage by backend implementations.

repository: move backend.LoadAll to repository.LoadRaw

1d6d365

LoadRaw also includes improved context cancellation handling similar to the implementation in repository.LoadUnpacked. The removed cache backend test will be added again later on.

debug/repair packs/upgrade repo v2: use repository.LoadRaw

779c8d3

This replaces calling the low-level backend.Load() method.

repository: Implement repository.LoadUnpacked using LoadRaw

021fb49

Both functions were using a similar implementation.

repository: remove redundant debug log

6563f1d

repository: unify blob decoding code

503c814

cache: code cleanups

e939035

repository: make reloading broken files explicit

2ace242

repository: retry failed LoadBlob once

e33ce7f

repository: retry failed ListPack once

7017adb

check: fix error message formatting

e401af0

repository: remove redundant blob loading fallback from RepairPacks

433a6aa

LoadBlobsFromPack already implements the same fallback behavior.

repair packs: drop experimental warning

8cce06d

This warning should already have been removed once the feature flag was dropped.

cache: Always use cached file if it exists

97a307d

A file is always cached whole. Thus, any out of bounds access will also fail when directed at the backend. To handle case in which the cached file is broken, then caller must call Cache.Forget(h) for the file in question.

repository: fix caching of tree packs in LoadBlobsFromPack

385cee0

check: verify pack a second time if broken

3ff063e

cache: test forget behavior

5214af8

cache: cleanup debug logs

ac805d6

repository: rework and extend LoadRaw tests

4f45668

repository: test LoadBlob retries

bf16096

repository: test retries of ListPack

987c3b2

check: test checkPack retries

ff0744b

fix compatibility with go 1.19

8f8d872

MichaelEischer force-pushed the cleanup-load branch from 6d58863 to 74d9065 Compare May 18, 2024 19:29

MichaelEischer commented May 18, 2024

View reviewed changes

MichaelEischer merged commit eb6c653 into restic:master May 18, 2024
13 checks passed

MichaelEischer deleted the cleanup-load branch May 18, 2024 19:34

MichaelEischer mentioned this pull request May 18, 2024

add retries for corrupted blobs to changelog #4809

Merged

8 tasks

MichaelEischer mentioned this pull request May 25, 2024

Retry blob loading if backend returns corrupted data #4440

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry loading of corrupted data from backend / cache #4800

Retry loading of corrupted data from backend / cache #4800

MichaelEischer commented May 9, 2024 •

edited

MichaelEischer left a comment

Retry loading of corrupted data from backend / cache #4800

Retry loading of corrupted data from backend / cache #4800

Conversation

MichaelEischer commented May 9, 2024 • edited

What does this PR change? What problem does it solve?

Was the change previously discussed in an issue or on the forum?

Checklist

MichaelEischer left a comment

Choose a reason for hiding this comment

MichaelEischer commented May 9, 2024 •

edited