Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve archiver performance for small files #3955

Merged

Conversation

MichaelEischer
Copy link
Member

@MichaelEischer MichaelEischer commented Oct 7, 2022

What does this PR change? What problem does it solve?

As discovered in #3917 (comment) the backup command is slower for small files (up to about 5-6 chunks) than it could be. The main bottleneck appears to be the FileSaver, as increasing the --read-concurrency also drastically improves the backup performance. The FileSaver proceeds in two steps, first it reads the file, splits it into chunks and passes it to the BlobSaver and secondly it waits for all blobs to complete saving (the upload happens asynchronously). Depending on the blob size it can take a few milliseconds to process it. During this time the FileSaver is blocked and not performing any useful work.

This PR changes the second phase of the FileSaver to work asynchronously using callbacks. For this the BlobSaver no longer returns a FutureBlob but instead calls a callback once a blob was saved. This provides the necessary means to asynchronously complete the FutureFile. Similar changes to FutureTree will have to wait at least until #3880 is merged.

The asynchronous file completion is protected using a mutex to avoid synchronization problems.

Asynchronously completing a file has a somewhat unexpected side effect. The status will now often list more than two files at the same time, even though the read concurrency is limited to two. Although this look a bit unexpected, it is actually correct. The status displays which files restic is currently working on, and not which file restic is reading from. However, I wonder whether we should change that to avoid confusing users.

The second commit in this PR changes the status output to clear no longer used status lines. With a frequently changing number of currently processed files, this ended up being confusing.

Performance measurements:
Command: restic backup -n --exclude .git linux using a checkout of Linux 6.0.
CPU: AMD Ryzen 7 PRO 5850U ; Repository stored on NVME disk (warmed up page cache).
Times reported are those printed in the backup command summary.

restic version / read concurrency 2 4 8
master 7112a13 0:10 0:06 0:04
PR 0:04 0:03 0:03

I'm aware that the test dataset is too small to properly differentiate the different variants, but the trends should be obvious :-) .

Was the change previously discussed in an issue or on the forum?

See #3917.

Checklist

  • I have read the contribution guidelines.
  • I have enabled maintainer edits.
  • I have added tests for all code changes.
  • [ ] I have added documentation for relevant changes (in the manual).
  • There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • I have run gofmt on the code in all commits.
  • All commit messages are formatted in the same style as the other commits in the repo.
  • I'm done! This pull request is ready for review.

@JsBergbau
Copy link
Contributor

@MichaelEischer thanks for working on this.

I used the same folder as described there #3917

default read concurrency of 2

repository b677436b opened (repository version 2) successfully, password is correct
created new cache in C:\Users\XXX\AppData\Local\restic
no parent snapshot found, will read all files

Files:         556 new,     0 changed,     0 unmodified
Dirs:            4 new,     0 changed,     0 unmodified
Added to the repository: 2.526 GiB (2.504 GiB stored)

processed 556 files, 2.526 GiB in 1:52
snapshot d1c73cdc saved

So much faster than before. I noticed that before the PR restic showed at maximum as many files beeing processed as read concurrency was. Now the number varies a lot and about 4 to 10 files are shown simultaneously.

read concurrency of 6

repository e5410fd6 opened (repository version 2) successfully, password is correct
created new cache in C:\Users\XXX\AppData\Local\restic
no parent snapshot found, will read all files

Files:         556 new,     0 changed,     0 unmodified
Dirs:            4 new,     0 changed,     0 unmodified
Added to the repository: 2.526 GiB (2.504 GiB stored)

processed 556 files, 2.526 GiB in 1:52
snapshot 58d7deda saved

So now using read concurrency doesn't improve speed. However overall speed is decreased a bit compared to previous implementation when increasing the read concurrency.

@MichaelEischer
Copy link
Member Author

Now the number varies a lot and about 4 to 10 files are shown simultaneously.

That is somewhat expected with this PR. However, showing more files than before does not mean that restic is concurrently reading from more files than before.

So now using read concurrency doesn't improve speed. However overall speed is decreased a bit compared to previous implementation when increasing the read concurrency.

I wouldn't consider that minimal performance difference to be too relevant. However, I have to admit that I don't have any real idea why the PR should result in a slightly higher backup time than before.

@JsBergbau
Copy link
Contributor

I've done a test with the previous version again. With read conccurrency of 6 it is now 1:41

I've also tested the new restic version again. With read concurrency of 6 it takes 1:47

The difference for the faster speed is that the VM that was building the PR build was still running. With VM closed I guess Windows is caching the whole folder in RAM.

So this PR is about 6 % slower than the older implementation when using it with a higher read concurrency.

@JsBergbau
Copy link
Contributor

JsBergbau commented Oct 9, 2022

@MichaelEischer

Did some tests again. Previously I used for the test https://beta.restic.net/latest/
This binary is about 30 MB in size. Now I've build restic directly from masterbranch, so without this PR. With read concurrency of 6 restic needs now 1:50. So this PR is even with read concurrency of 6 faster. Restic version restic 0.14.0 (v0.9.5-2107-g1a6160d15) compiled with go1.15.15 on windows/amd64 with file size of 22.7 MB.

Previously used restic version without this PR (all benchmarks before this post) restic 0.14.0-dev (compiled manually) compiled with go1.18 on windows/amd64 and file size of 29.2 MB.

So speed differences resp. faster results were because of newer go version. Sorry for not thinking about this before.

So regarding performance this PR can safely be merged.

@MichaelEischer
Copy link
Member Author

What remains is to decide how to handle the much higher number of files displayed as part of the status message. It will likely confuse users...

@JsBergbau
Copy link
Contributor

As far as I understood with this pull request restic still reads by default 2 files in parallel. So the question is how does this then happen that there are now displayed more files than previously?
Still it is a good question how to handle this. I have some ideas, but lets wait for your answer.

@MichaelEischer
Copy link
Member Author

So the question is how does this then happen that there are now displayed more files than previously?

That is somewhat easy to explain. Let's assume we have a small file which only consists of a single chunk. Then restic reads the file (at this point it shows up on the status bar), hands it over to the blob saver which stores it in the repository, once that is completed the file is reported as complete and removed from the status bar. Only the first step actually involves reading the file from the disk. In particular, for maximum compression it takes quite some time to store the chunk in the repository.

Previously the read concurrency applied to this whole sequence, which means that restic cannot make use of more CPU cores that the read concurrency, and that the next file is only requested from disk when the previous one has completed processing. That way the CPU and disk are both underutilized.

With this PR, the read concurrency only applies to the step which actually reads the file. Thus, reading new files and compressing the old one happens concurrently.

The status bar currently shows all files which are in any of these processing steps. This has the consequence that all files which are currently read or saved to the repository are reported.

Still it is a good question how to handle this. I have some ideas, but lets wait for your answer.

The least confusing way is probably to hide files from the status bar once the file reading step is complete. The main complexity of that approach is that this introduces a new intermediate state in the output. The verbose output must still be able to report the final amount of data stored for a file. This means that we'd have to send an additional signal to the progress output once the file reading part is complete. But I guess that's better than confusing everyone.

@JsBergbau
Copy link
Contributor

I think the problem is that restic users are accustomed that only the files are shown that restic reads currently from disk.
So one option would be to show a message that since restic 0.15 are lot more files are shown like "Note since restic 0.15 restic parallelizes file reading and compression so more files are shown here than before. Files read concurrently from disk: $readConcurrency"

Or the other option is the cosmetic implementation which needs quite some time for no real functional gain.

@MichaelEischer
Copy link
Member Author

MichaelEischer commented Oct 22, 2022

So one option would be to show a message that since restic 0.15 are lot more files are shown like "Note since restic 0.15 restic parallelizes file reading and compression so more files are shown here than before. Files read concurrently from disk: $readConcurrency"

That just feels wrong. It gives the impression that we're too lazy to implement a proper status output.

I've implemented the variant which just hides all files which already have completed reading from disk. It's actually less messy to implement than I've expected.

After reading and chunking all data in a file, the FutureFile still has
to wait until the FutureBlobs are completed. This was done synchronously
which results in blocking the file saver and prevents the next file from
being read.

By replacing the FutureBlob with a callback, it becomes possible to
complete the FutureFile asynchronously.
Previously, the old status text remained until it was overwritten.
As the FileSaver is asynchronously waiting for all blobs of a file to be
stored, the number of active files is higher than the number of files
from which restic is reading concurrently. Thus to not confuse users,
only display files in the status from which restic is currently reading.
@MichaelEischer
Copy link
Member Author

LGTM. I've slightly tweaked a comment and added the isCompleted sanity check also to completeError.

@mamoit
Copy link

mamoit commented Jan 13, 2023

Damn...
Just tried out release 0.15.0 and my home dir hourly backup time was cut in half!
From around ~55s to <30s for <100MB changed in a total of 500k files and 30GB.

Hats off.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants