Stream packs in check --read-data and during repacking #3484

Merged
merged 13 commits into restic:master from stream-check-repack on Mar 26, 2022

Conversation

MichaelEischer (Member)

What does this PR change? What problem does it solve?

check --read-data and prune keep downloaded packs in temporary files, which can end up being written to disk. This can cause a large amount of data to be written to disk.

The PR converts both operations to stream pack files directly from the backend without using temporary files. Note that temporary files are still used for pack uploads during prune and backup. The streaming operation is implemented using StreamPack, which loads the requested blobs of a pack file, and ListPacks in the MasterIndex, which returns all blobs contained in a set of pack files, with the result grouped by pack file.
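As a rough sketch of how a caller might use this (processBlob is a hypothetical placeholder, repo is assumed to be a repository with the usual Backend().Load and Key() accessors, and the StreamPack signature is the one quoted further down in the review):

// Sketch only: stream all blobs of one pack directly from the backend.
// The blobs slice would typically come from the repository index.
err := repository.StreamPack(ctx, repo.Backend().Load, repo.Key(), packID, blobs,
	func(blob restic.BlobHandle, buf []byte, err error) error {
		if err != nil {
			return err // returning an error aborts StreamPack
		}
		// buf contains the decrypted plaintext; it may be reused for the
		// next blob, so copy it if it has to outlive this callback
		return processBlob(blob, buf)
	})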

The repack step of prune used DownloadAndHash to verify the pack and then extract the blobs that should be kept. Now only the relevant part of the pack file is loaded, which reduces the amount of data downloaded during prune. Verifying the pack hash previously provided the assurance that the pack contains the list of blobs it was expected to have according to the repository index. This assurance is now provided by directly accessing the pack file based on the index. As before, each individual blob is checked to match its expected hash before any further processing.

The check --read-data implementation injects a special Load method into StreamPack to calculate the pack hash, verify the individual blobs and extract the pack header for further processing, all in a single pass.
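The hashing part of such a Load wrapper could look roughly like the sketch below. This is not the actual hashingLoader from this PR; wrapWithHash is a made-up name, and it assumes the inner callback reads the whole pack file (as check --read-data does) so that the SHA-256 of everything read can be compared against the pack ID. Required imports: bytes, context, crypto/sha256, fmt, io, plus restic's repository and restic packages.

func wrapWithHash(beLoad repository.BackendLoadFn, packID restic.ID) repository.BackendLoadFn {
	return func(ctx context.Context, h restic.Handle, length int, offset int64, fn func(rd io.Reader) error) error {
		return beLoad(ctx, h, length, offset, func(rd io.Reader) error {
			hasher := sha256.New()
			// everything the inner callback reads is also fed into the hash
			if err := fn(io.TeeReader(rd, hasher)); err != nil {
				return err
			}
			if !bytes.Equal(hasher.Sum(nil), packID[:]) {
				return fmt.Errorf("pack %v: hash mismatch", packID)
			}
			return nil
		})
	}
}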

In addition, the filerestorer code is switched to use StreamPack, which removes the duplicate blob decryption implementation outside of the repository package.

ListPacks in the MasterIndex iterates multiple times over the repository index to keep the memory overhead low.
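One possible shape of such a trade-off (purely illustrative, not the actual MasterIndex.ListPacks code; listPackBlobs, forEachIndexEntry and handle are hypothetical names) is a batched scan: instead of building blob lists for all packs at once, process the packs in small batches and rescan the index once per batch.

func listPackBlobs(packs []restic.ID, batchSize int,
	forEachIndexEntry func(fn func(packID restic.ID, blob restic.Blob)),
	handle func(packID restic.ID, blobs []restic.Blob)) {

	for start := 0; start < len(packs); start += batchSize {
		end := start + batchSize
		if end > len(packs) {
			end = len(packs)
		}
		batch := make(map[restic.ID]bool, end-start)
		for _, id := range packs[start:end] {
			batch[id] = true
		}
		// only this batch's blob lists are kept in memory at a time
		grouped := make(map[restic.ID][]restic.Blob)
		forEachIndexEntry(func(packID restic.ID, blob restic.Blob) {
			if batch[packID] {
				grouped[packID] = append(grouped[packID], blob)
			}
		})
		for packID, blobs := range grouped {
			handle(packID, blobs)
		}
	}
}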

Was the change previously discussed in an issue or on the forum?

Fixes #3375
Related to #3465

Checklist

  • [x] I have read the contribution guidelines.
  • [x] I have enabled maintainer edits.
  • [x] I have added tests for all code changes.
  • [ ] I have added documentation for relevant changes (in the manual).
  • [x] There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • [x] I have run gofmt on the code in all commits.
  • [x] All commit messages are formatted in the same style as the other commits in the repo.
  • [x] I'm done! This pull request is ready for review.

@MichaelEischer force-pushed the stream-check-repack branch 3 times, most recently from be1826d to a7bede4, on August 21, 2021
bufRd.Reset(hrd)

// skip to start of first blob, offset == 0 for correct pack files
_, err := bufRd.Discard(int(offset))
@greatroar (Contributor) commented on Aug 30, 2021:

Is Discard the only reason for using a bufio.Reader here? If so, this could also be implemented as io.CopyN(ioutil.Discard, hrd, offset).

@MichaelEischer (Member, Author) replied:

repository.StreamPack also uses a bufio.Reader. The reader in the hashingLoader gets reused by StreamPack, which is essential to be able to extract the pack file header later on. Without using a bufio.Reader here, we'd end up in a situation where the reader in StreamPack might already have read and cached a part of the pack file header.

type BackendLoadFn func(ctx context.Context, h restic.Handle, length int, offset int64, fn func(rd io.Reader) error) error

func StreamPack(ctx context.Context, beLoad BackendLoadFn, key *crypto.Key, packID restic.ID, blobs []restic.Blob, handleBlobFn func(blob restic.BlobHandle, buf []byte, err error) error) error {
@greatroar (Contributor) commented:
Why not use goroutines instead of a callback? Like,

type BlobContent struct {
	restic.BlobHandle
	Plaintext []byte
	Err       error
}

func StreamPack(ctx context.Context, beLoad BackendLoadFn, key *crypto.Key, packID restic.ID, blobs []restic.Blob) <-chan *BlobContent

@MichaelEischer (Member, Author) replied:

The handleBlobFn can currently return an error to force StreamPack to try loading the blobs again, which won't work with goroutines. My idea here was to allow retrying broken downloads or temporary decryption errors. That functionality would currently only be used in Repack, in the first if err != nil. However, I've just noticed that the error handling for SaveBlob in Repack was broken, as that error is one which cannot be retried.

So I guess I'll remove at least the return value of handleBlobFn. The downside of letting StreamPack return a channel is that we also have to make sure that the goroutine is shut down properly, which would probably require using an errgroup, as for StreamTrees.

@MichaelEischer (Member, Author) replied:

Hmm, looks like I can't get rid of the return value of handleBlobFn. If the filerestorer or repack encounters an error, then StreamPack should just abort with an error. I've changed StreamPack to interpret any error returned by handleBlobFn as a permanent error, to prevent retries. It would be possible to get something similar by using errgroups, but in a quick test I ended up with really messy error handling in the filerestorer.

Another complication of returning a channel would be that the plaintext buffer can no longer be reused easily.
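The contract that came out of this discussion could be summarized by the following hypothetical callback (saveBlob stands in for whatever the caller does with the plaintext, e.g. re-saving the blob during repack):

handleBlob := func(blob restic.BlobHandle, buf []byte, err error) error {
	if err != nil {
		return err // e.g. a blob that failed to load or decrypt
	}
	// any non-nil error returned here is treated as permanent:
	// StreamPack aborts instead of retrying the download
	return saveBlob(blob, buf)
}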

@MichaelEischer mentioned this pull request on Mar 6, 2022
@fd0 (Member) left a comment:
Very impressive, I like it a lot! I've read through the code and found no issues with it, so in principle we can merge it.

The only thing that bugs me is that StreamPack() has no dedicated tests. Do you think it would be a good idea to add tests for it?

@fd0 (Member) commented on Mar 21, 2022:

I've pushed a commit adding a few tests, please have a look!

@MichaelEischer (Member, Author) replied:
The tests look fine. Thanks for adding them. Do you want to merge the PR or should I?

@fd0 (Member) commented on Mar 24, 2022:

Let's do the release first, then merge this PR.

@fd0 merged commit 4d5db61 into restic:master on Mar 26, 2022
@MichaelEischer deleted the stream-check-repack branch on March 26, 2022, 22:35
@MichaelEischer mentioned this pull request on Apr 23, 2022