Reduce memory usage for large directories/files #3773

Merged (7 commits) on Jul 23, 2022


@MichaelEischer
Member

MichaelEischer commented May 29, 2022

What does this PR change? What problem does it solve?

restic requires a lot of memory to back up large files and directories. This PR reduces the peak memory usage by about 30-50% (depending on the GC settings). It is not intended as a complete solution to the memory usage problems, as that would require a repository format change as discussed in #2532.

Instead, this PR reduces the size of the structs used during backup, most notably the FutureNode/File/Tree structs. As the archiver preallocates the FutureNode array for each directory, the optimum would be to reduce each FutureNode to a single reference. As we need a way to wait for the result, that reference would have to point to a Go channel. However, a channel seems to require about 96 bytes on 64-bit platforms, which would regress the memory usage for files that are identical to the parent snapshot. Thus the FutureNode contains an additional reference to the result, which is used if the result is already available.
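To illustrate the layout described above, here is a minimal sketch of a two-word future. It is not the code from this PR: the names `node`, `futureNode`, `newReadyFuture`, `newPendingFuture`, `take` and `sizeInWords` are hypothetical, and the `node` fields are placeholders. It only shows how a future can hold either a channel or a direct reference to an already known result, so that no channel has to be allocated for entries reused from the parent snapshot.

```go
package main

import (
	"fmt"
	"unsafe"
)

// node stands in for restic's Node type; the fields are placeholders.
type node struct {
	Name string
	Size uint64
}

// futureNode is a hypothetical two-word future: either res is already set
// (for example because the node was reused from the parent snapshot) or
// ch will deliver the result later.
type futureNode struct {
	ch  <-chan *node
	res *node
}

// newReadyFuture wraps an already known result without allocating a channel.
func newReadyFuture(n *node) futureNode {
	return futureNode{res: n}
}

// newPendingFuture returns a future and the function used to complete it.
func newPendingFuture() (futureNode, func(*node)) {
	ch := make(chan *node, 1)
	return futureNode{ch: ch}, func(n *node) { ch <- n }
}

// take waits for the result if necessary and then clears all references
// held by the future so the garbage collector can reclaim them.
func (f *futureNode) take() *node {
	if f.res == nil && f.ch != nil {
		f.res = <-f.ch
	}
	n := f.res
	f.ch, f.res = nil, nil
	return n
}

// sizeInWords reports the size of futureNode in machine words.
func sizeInWords() uintptr {
	return unsafe.Sizeof(futureNode{}) / unsafe.Sizeof(uintptr(0))
}

func main() {
	// Two words, i.e. 16 bytes on 64-bit platforms.
	fmt.Println("words per futureNode:", sizeInWords())

	ready := newReadyFuture(&node{Name: "unchanged-file"})
	fmt.Println(ready.take().Name)

	pending, complete := newPendingFuture()
	complete(&node{Name: "new-file"})
	fmt.Println(pending.take().Name)
}
```

Clearing both fields in `take` is what allows a preallocated array of such futures to stay small once the results have been consumed.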

The next problem is that storing a Tree in the repository currently keeps up to four items in memory for each directory entry: the FutureNode array, the Nodes themselves, the serialized JSON of the tree, and its encrypted variant while it is stored in the repository.

As the FutureNode array is now rather small, it is sufficient to just clear all references within the FutureNode to allow freeing most memory. The PR introduces a TreeBuilder which incrementally serializes the Tree and thereby ensures that each Node is only kept once in memory: either as a Node or as its serialized representation.

That way each node is represented by at most two items in memory at a time.
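The idea of serializing a tree incrementally can be sketched as follows. This is an illustration under assumptions, not the implementation from this PR: the type `treeJSONBuilder`, its methods, the two `node` fields and the `{"nodes":[...]}` layout are chosen for the example only. Each node is marshaled as soon as it is added, so the caller can drop its reference to the node right away.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// node is a placeholder for a directory entry; the fields are illustrative.
type node struct {
	Name string `json:"name"`
	Type string `json:"type"`
}

// treeJSONBuilder is a hypothetical incremental serializer. It appends the
// JSON of each node to a buffer instead of collecting the nodes first.
type treeJSONBuilder struct {
	buf   bytes.Buffer
	count int
}

func newTreeJSONBuilder() *treeJSONBuilder {
	b := &treeJSONBuilder{}
	b.buf.WriteString(`{"nodes":[`)
	return b
}

// addNode serializes n immediately; the builder keeps no reference to n.
func (b *treeJSONBuilder) addNode(n *node) error {
	data, err := json.Marshal(n)
	if err != nil {
		return err
	}
	if b.count > 0 {
		b.buf.WriteByte(',')
	}
	b.buf.Write(data)
	b.count++
	return nil
}

// finalize closes the JSON document and returns the serialized tree.
func (b *treeJSONBuilder) finalize() []byte {
	b.buf.WriteString("]}")
	return b.buf.Bytes()
}

func main() {
	b := newTreeJSONBuilder()
	for _, n := range []node{{Name: "a", Type: "file"}, {Name: "b", Type: "dir"}} {
		n := n
		if err := b.addNode(&n); err != nil {
			panic(err)
		}
	}
	fmt.Println(string(b.finalize()))
}
```

With this approach a node exists either as a struct (before `addNode`) or as bytes in the buffer (after it), but not as both for the whole directory at once.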

And finally the FutureBlob struct is slimmed down to a single reference to a channel. To reduce the memory usage further, FutureBlobs are collected as soon as possible.
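A future that consists of nothing but a channel reference might look like the sketch below. The names `saveResult`, `futureBlob`, `newFutureBlob`, `take`, `errTaken` and `demo` are assumptions made for this example and do not mirror restic's actual types or signatures. The point is that taking the result also drops the channel, so a consumed future no longer keeps anything alive.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// saveResult is a placeholder for the outcome of storing one blob.
type saveResult struct {
	ID     string
	Length int
}

var errTaken = errors.New("result already taken")

// futureBlob is a hypothetical future that holds only a channel reference.
type futureBlob struct {
	ch <-chan saveResult
}

// newFutureBlob returns a future and the function used to complete it.
func newFutureBlob() (futureBlob, func(saveResult)) {
	ch := make(chan saveResult, 1)
	return futureBlob{ch: ch}, func(r saveResult) { ch <- r }
}

// take waits for the result, hands it to the caller and drops the channel
// so that the future itself no longer references any memory.
func (f *futureBlob) take(ctx context.Context) (saveResult, error) {
	if f.ch == nil {
		return saveResult{}, errTaken
	}
	select {
	case res := <-f.ch:
		f.ch = nil
		return res, nil
	case <-ctx.Done():
		return saveResult{}, ctx.Err()
	}
}

// demo completes a future, takes its result and then tries to take it again.
func demo() (first saveResult, second error) {
	fut, complete := newFutureBlob()
	complete(saveResult{ID: "abc123", Length: 42})
	first, _ = fut.take(context.Background())
	_, second = fut.take(context.Background())
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println(first.ID, first.Length)
	fmt.Println(second)
}
```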

Memory usage measurements:
Using Go 1.18.2.
Test data set: 500k empty files in a single folder, created with `for i in {1..500000} ; do touch abcefghijklt$i ; done`
Test command: `/usr/bin/time -v $restic backup -n ../test/huge`

| commit | resident set size (KB) | elapsed (m:ss) | GOGC=1 rss (KB) | GOGC=1 elapsed (m:ss) |
|---|---|---|---|---|
| master | 1371208 | 0:18.72 | 830552 | 0:27.79 |
| remove fileinfo | 1305528 | 0:17.88 | 770456 | 0:29.20 |
| unify node | 1056048 | 0:17.62 | 589192 | 0:34.06 |
| incremental tree | 887124 | 0:17.75 | 403124 | 0:34.45 |
| optimize large | 921864 | 0:17.72 | 402248 | 0:34.85 |

The table reports the results of the first test run. I repeated the tests another two times, which showed the same overall trends.
The memory usage decreases by roughly 35% with the default GC settings and by a bit over 50% using GOGC=1. The measurements without GOGC=1 are somewhat unstable as the GC can run at different points in time, but the general trend is stable across repetitions. The memory usage increase for optimize large appears to be an outlier, as the test case uses no FutureBlobs. The elapsed time increases a bit for GOGC=1, which seems to be caused by the different dataset that has to be scanned during GC. Without GOGC=1 the elapsed time seems to be largely unchanged, although it is also too noisy for any further conclusions.
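The percentages can be recomputed from the resident set sizes reported above. The small program below does that arithmetic; the helper name `reductionPercent` is made up for this example, and the choice of comparing master against both the incremental tree and optimize large rows is an assumption about which rows the summary refers to.

```go
package main

import "fmt"

// reductionPercent returns by how many percent after is smaller than before.
func reductionPercent(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	// Resident set sizes in KB, copied from the measurements above.
	const (
		masterDefault      = 1371208
		incrementalDefault = 887124
		optimizeDefault    = 921864
		masterGOGC1        = 830552
		incrementalGOGC1   = 403124
		optimizeGOGC1      = 402248
	)

	fmt.Printf("default GC, incremental tree: %.1f%%\n",
		reductionPercent(masterDefault, incrementalDefault))
	fmt.Printf("default GC, optimize large:   %.1f%%\n",
		reductionPercent(masterDefault, optimizeDefault))
	fmt.Printf("GOGC=1, incremental tree:     %.1f%%\n",
		reductionPercent(masterGOGC1, incrementalGOGC1))
	fmt.Printf("GOGC=1, optimize large:       %.1f%%\n",
		reductionPercent(masterGOGC1, optimizeGOGC1))
}
```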

Was the change previously discussed in an issue or on the forum?

Related to #2446, but the specific changes were not discussed.

Checklist

  • I have read the contribution guidelines.
  • I have enabled maintainer edits.
  • I have added tests for all code changes.
  • [ ] I have added documentation for relevant changes (in the manual).
  • There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • I have run gofmt on the code in all commits.
  • All commit messages are formatted in the same style as the other commits in the repo.
  • I'm done! This pull request is ready for review.

@greatroar
Contributor

Minor nitpick: TreeBuilder isn't the perfect name for the new type as it doesn't produce trees. TreeJSONMarshaler? TreeJSONer? Otherwise, LGTM.

@MichaelEischer
Member Author

What about TreeJSONBuilder? I've picked Builder as part of the name in an attempt to convey that it works incrementally.

@greatroar
Contributor

Sounds good.

@MichaelEischer
Member Author

Rebased to fix conflicts with #3733.

There is no real difference between the FutureTree and FutureFile
structs. However, differentiating both increases the size of the
FutureNode struct.

The FutureNode struct is now only 16 bytes large on 64-bit platforms.
That way it has a very low overhead if the corresponding file/directory
was not processed yet.

There is a special case for nodes that were reused from the parent
snapshot, as a Go channel seems to have 96 bytes overhead which would
result in a memory usage regression.

That way it is not necessary to keep both the Nodes forming a Tree and
the serialized JSON version in memory.

FutureBlob now uses a Take() method as a more memory-efficient way to
retrieve the future's result. In addition, futures are now collected
while saving the file. As only a limited number of blobs can be queued
for uploading, for a large file nearly all FutureBlobs already have
their result ready, such that the FutureBlob object just consumes
memory.
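
Collecting futures while the file is still being saved could be sketched like this. The code is an assumption-laden illustration rather than the archiver's real loop: `futureBlob`, `poll`, `wait`, `saveFile` and `immediateSave` are hypothetical names, and blob IDs are plain strings. After each chunk is queued, every future at the front of the pending list whose result is already available is consumed without blocking, which keeps the list short for large files.

```go
package main

import "fmt"

// futureBlob is a hypothetical future delivering a blob ID over a channel.
type futureBlob struct {
	ch <-chan string
}

// poll returns the result if it is already available, without blocking.
func (f *futureBlob) poll() (string, bool) {
	select {
	case id := <-f.ch:
		f.ch = nil
		return id, true
	default:
		return "", false
	}
}

// wait blocks until the result is available.
func (f *futureBlob) wait() string {
	id := <-f.ch
	f.ch = nil
	return id
}

// saveFile queues all chunks via save and returns the blob IDs in chunk
// order. Completed futures are collected while chunks are still queued.
func saveFile(chunks []string, save func(chunk string) futureBlob) []string {
	ids := make([]string, 0, len(chunks))
	var pending []futureBlob
	for _, c := range chunks {
		pending = append(pending, save(c))
		// Collect futures from the front as long as they are complete,
		// which preserves the order of the chunks.
		for len(pending) > 0 {
			id, ok := pending[0].poll()
			if !ok {
				break
			}
			ids = append(ids, id)
			pending = pending[1:]
		}
	}
	// Wait for whatever is still outstanding at the end of the file.
	for i := range pending {
		ids = append(ids, pending[i].wait())
	}
	return ids
}

// immediateSave is a stand-in for an uploader that finishes instantly.
func immediateSave(chunk string) futureBlob {
	ch := make(chan string, 1)
	ch <- "id-" + chunk
	return futureBlob{ch: ch}
}

func main() {
	fmt.Println(saveFile([]string{"a", "b", "c"}, immediateSave))
}
```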
@MichaelEischer
Copy link
Member Author

I've rebased the PR to fix the conflict with #3830 and added a changelog entry.

@MichaelEischer MichaelEischer merged commit 4ffd479 into restic:master Jul 23, 2022
@MichaelEischer MichaelEischer deleted the efficient-dir-json branch July 23, 2022 15:47