Stream packs in check --read-data and during repacking #3484

Merged
merged 13 commits into restic:master from stream-check-repack on Mar 26, 2022

Conversation

MichaelEischer (Member)

What does this PR change? What problem does it solve?

check --read-data and prune keep downloaded packs in temporary files, which can end up being written to disk. This can cause a large amount of data to be written to disk.

The PR converts both operations to stream pack files directly from the backend without using temporary files. Note that temporary files are still used for pack uploads during prune and backup. The streaming operation is implemented using StreamPack, which loads the requested blobs of a pack file, and ListPacks in the MasterIndex, which returns all blobs contained in a set of pack files, with the result grouped by pack file.
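As a rough sketch of how a caller might use this (processBlob is a hypothetical placeholder, repo is assumed to be a repository with the usual Backend().Load and Key() accessors, and the StreamPack signature is the one quoted further down in the review):

// Sketch only: stream all blobs of one pack directly from the backend.
// The blobs slice would typically come from the repository index.
err := repository.StreamPack(ctx, repo.Backend().Load, repo.Key(), packID, blobs,
	func(blob restic.BlobHandle, buf []byte, err error) error {
		if err != nil {
			return err // returning an error aborts StreamPack
		}
		// buf contains the decrypted plaintext; it may be reused for the
		// next blob, so copy it if it has to outlive this callback
		return processBlob(blob, buf)
	})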

The repack step of prune used DownloadAndHash to verify the pack and then extract the blobs that should be kept. Now only the relevant part of the pack file is loaded, which reduces the amount of data downloaded during prune. Verifying the pack hash previously provided the assurance that the pack contains the list of blobs it was expected to have according to the repository index. This assurance is now provided by directly accessing the pack file based on the index. As before, each individual blob is checked to match its expected hash before any further processing.

The check --read-data implementation injects a special Load method into StreamPack to calculate the pack hash, verify the individual blobs and extract the pack header for further processing, all in a single pass.
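The hashing part of such a Load wrapper could look roughly like the sketch below. This is not the actual hashingLoader from this PR; wrapWithHash is a made-up name, and it assumes the inner callback reads the whole pack file (as check --read-data does) so that the SHA-256 of everything read can be compared against the pack ID. Required imports: bytes, context, crypto/sha256, fmt, io, plus restic's repository and restic packages.

func wrapWithHash(beLoad repository.BackendLoadFn, packID restic.ID) repository.BackendLoadFn {
	return func(ctx context.Context, h restic.Handle, length int, offset int64, fn func(rd io.Reader) error) error {
		return beLoad(ctx, h, length, offset, func(rd io.Reader) error {
			hasher := sha256.New()
			// everything the inner callback reads is also fed into the hash
			if err := fn(io.TeeReader(rd, hasher)); err != nil {
				return err
			}
			if !bytes.Equal(hasher.Sum(nil), packID[:]) {
				return fmt.Errorf("pack %v: hash mismatch", packID)
			}
			return nil
		})
	}
}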

In addition, the filerestorer code is switched to use StreamPack, which removes the duplicate blob decryption implementation outside of the repository package.

ListPacks in the MasterIndex iterates multiple times over the repository index to keep the memory overhead low.
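One possible shape of such a trade-off (purely illustrative, not the actual MasterIndex.ListPacks code; listPackBlobs, forEachIndexEntry and handle are hypothetical names) is a batched scan: instead of building blob lists for all packs at once, process the packs in small batches and rescan the index once per batch.

func listPackBlobs(packs []restic.ID, batchSize int,
	forEachIndexEntry func(fn func(packID restic.ID, blob restic.Blob)),
	handle func(packID restic.ID, blobs []restic.Blob)) {

	for start := 0; start < len(packs); start += batchSize {
		end := start + batchSize
		if end > len(packs) {
			end = len(packs)
		}
		batch := make(map[restic.ID]bool, end-start)
		for _, id := range packs[start:end] {
			batch[id] = true
		}
		// only this batch's blob lists are kept in memory at a time
		grouped := make(map[restic.ID][]restic.Blob)
		forEachIndexEntry(func(packID restic.ID, blob restic.Blob) {
			if batch[packID] {
				grouped[packID] = append(grouped[packID], blob)
			}
		})
		for packID, blobs := range grouped {
			handle(packID, blobs)
		}
	}
}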

Was the change previously discussed in an issue or on the forum?

Fixes #3375
Related to #3465

Checklist

  • [x] I have read the contribution guidelines.
  • [x] I have enabled maintainer edits.
  • [x] I have added tests for all code changes.
  • [ ] I have added documentation for relevant changes (in the manual).
  • [x] There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • [x] I have run gofmt on the code in all commits.
  • [x] All commit messages are formatted in the same style as the other commits in the repo.
  • [x] I'm done! This pull request is ready for review.

@MichaelEischer force-pushed the stream-check-repack branch 3 times, most recently from be1826d to a7bede4, on August 21, 2021
bufRd.Reset(hrd)

// skip to start of first blob, offset == 0 for correct pack files
_, err := bufRd.Discard(int(offset))
@greatroar (Contributor) commented on Aug 30, 2021:

Is Discard the only reason for using a bufio.Reader here? If so, this could also be implemented as io.CopyN(ioutil.Discard, hrd, offset).

@MichaelEischer (Member, Author) replied:

repository.StreamPack also uses a bufio.Reader. The reader in the hashingLoader gets reused by StreamPack, which is essential to be able to extract the pack file header later on. Without using a bufio.Reader here, we'd end up in a situation where the reader in StreamPack might already have read and cached a part of the pack file header.

type BackendLoadFn func(ctx context.Context, h restic.Handle, length int, offset int64, fn func(rd io.Reader) error) error

func StreamPack(ctx context.Context, beLoad BackendLoadFn, key *crypto.Key, packID restic.ID, blobs []restic.Blob, handleBlobFn func(blob restic.BlobHandle, buf []byte, err error) error) error {
@greatroar (Contributor) commented:
Why not use goroutines instead of a callback? Like,

type BlobContent struct {
	restic.BlobHandle
	Plaintext []byte
	Err       error
}

func StreamPack(ctx context.Context, beLoad BackendLoadFn, key *crypto.Key, packID restic.ID, blobs []restic.Blob) <-chan *BlobContent

@MichaelEischer (Member, Author) replied:

The handleBlobFn can currently return an error to force StreamPack to try loading the blobs again, which won't work with goroutines. My idea here was to allow retrying broken downloads or temporary decryption errors. That functionality would currently only be used in Repack, in the first if err != nil. However, I've just noticed that the error handling for SaveBlob in Repack was broken, as that error is one which cannot be retried.

So I guess I'll remove at least the return value of handleBlobFn. The downside of letting StreamPack return a channel is that we also have to make sure that the goroutine is shut down properly, which would probably require using an errgroup, as for StreamTrees.

@MichaelEischer (Member, Author) replied:

Hmm, looks like I can't get rid of the return value of handleBlobFn. If the filerestorer or repack encounters an error, then StreamPack should just abort with an error. I've changed StreamPack to interpret any error returned by handleBlobFn as a permanent error, to prevent retries. It would be possible to get something similar by using errgroups, but in a quick test I ended up with really messy error handling in the filerestorer.

Another complication of returning a channel would be that the plaintext buffer can no longer be reused easily.
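The contract that came out of this discussion could be summarized by the following hypothetical callback (saveBlob stands in for whatever the caller does with the plaintext, e.g. re-saving the blob during repack):

handleBlob := func(blob restic.BlobHandle, buf []byte, err error) error {
	if err != nil {
		return err // e.g. a blob that failed to load or decrypt
	}
	// any non-nil error returned here is treated as permanent:
	// StreamPack aborts instead of retrying the download
	return saveBlob(blob, buf)
}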

@MichaelEischer mentioned this pull request on Mar 6, 2022
@fd0 (Member) left a comment:
Very impressive, I like it a lot! I've read through the code and found no issues with it, so in principle we can merge it.

The only thing that bugs me is that StreamPack() has no dedicated tests. Do you think it would be a good idea to add tests for it?

@fd0 (Member) commented on Mar 21, 2022:

I've pushed a commit adding a few tests, please have a look!

@MichaelEischer (Member, Author) replied:
The tests look fine. Thanks for adding them. Do you want to merge the PR or should I?

@fd0 (Member) commented on Mar 24, 2022:

Let's do the release first, then merge this PR.

@fd0 merged commit 4d5db61 into restic:master on Mar 26, 2022
@MichaelEischer deleted the stream-check-repack branch on March 26, 2022, 22:35
@MichaelEischer mentioned this pull request on Apr 23, 2022