
Reduce index memory #2781

Merged: 5 commits merged into restic:master on Jun 14, 2020

Conversation

@aawsome (Contributor) commented Jun 12, 2020

What is the purpose of this change? What does it change?

Reduce the index memory usage by applying only small, local code changes.

The discussion in #2523 seems to have stalled, and I realize that the core developers do not have time to discuss that many changes to improve index memory.

So I cherry-picked the ideas that give the best improvement while requiring the smallest code modifications.

Currently restic stores index entries in a map where the key is 33 bytes long and the value is 24 bytes (a slice header), plus 48 bytes for each actually stored index entry.
Assuming a map load factor of 2/3, this gives 134 bytes per index entry.
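
Spelled out (my reading of the numbers above, with the 2/3 load factor taken as a 1.5x overhead on the map buckets):

```
(33 + 24) bytes * 1.5 + 48 bytes ≈ 85.5 + 48 ≈ 134 bytes per index entry
```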

This PR has three improvements:

  • Save offset and length as uint32 instead of uint. This saves 8 bytes per index entry on 64-bit architectures. With this change the pack size is effectively limited to 4 GiB, and restic will now panic if used with larger pack files.
  • Save the pack ID in a separate array and store only the index into this array in indexEntry. Assuming at least 8 blobs per pack on average, this saves at least 20 bytes per index entry. The big advantage of this change is that repositories with small files suffer particularly from index memory consumption, and this case benefits even more from the improvement.
  • As these two modifications make #2338 (Optimize in-memory index memory consumption) much more attractive, I added @MichaelEischer's idea to this PR (I did not find a git branch containing the changes to base this on). As @MichaelEischer pointed out, for smaller entries the approach of #2338 should be changed to store the indexEntry inline in the map, which I did.
    Together with the two other improvements this IMO saves an additional 28 bytes if there are no duplicates (which should be the usual case).

So with this PR, the memory usage is reduced by a total of 56 bytes per index entry (again under the above assumptions), leading to a total reduction of ~41% of the memory used by the index.
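
To make these numbers concrete, here is a minimal sketch of the per-entry layout before and after the changes. The type and field names are illustrative, not the exact identifiers from internal/repository/index.go; what matters are the sizes (48 vs. 16 bytes for the entry itself, and 8 + 20 + 28 = 56 saved bytes out of 134, i.e. the ~41% quoted above):

```go
package main

import (
	"fmt"
	"unsafe"
)

type packID [32]byte

// Old layout (illustrative): every entry carries the full 32-byte pack ID
// plus 64-bit offset/length fields.
type oldIndexEntry struct {
	pack           packID // 32 bytes
	offset, length uint   // 8 bytes each on 64-bit architectures
}

// New layout (illustrative): the pack ID is stored once in a shared []packID
// per index, and the entry keeps only its position there; offset/length
// shrink to uint32, which limits a pack file to 4 GiB.
type newIndexEntry struct {
	packIndex      int    // position in the shared pack ID slice
	offset, length uint32 // 4 bytes each
}

func main() {
	fmt.Println(unsafe.Sizeof(oldIndexEntry{})) // 48
	fmt.Println(unsafe.Sizeof(newIndexEntry{})) // 16
}
```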

Was the change discussed in an issue or in the forum before?

See #2523.
This may close some of the memory-related issues.

Checklist

  • I have read the Contribution Guidelines
  • I have enabled maintainer edits for this PR
  • I have not added tests for all changes in this PR - all present index tests succeed
  • I have not added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

@aawsome (Contributor, Author) commented Jun 12, 2020

Oh, I just realized that I should rebase now that #2749 has been merged.
This will also need a modification if #2773 (which implements other ideas from #2523) is merged first.

@aawsome (Contributor, Author) commented Jun 12, 2020

The PR is now rebased and contains three commits, one for each individual change.

@MichaelEischer / @rawtaz Thanks for giving the hint about git rebase -i! This command rocks!

@aawsome (Contributor, Author) commented Jun 12, 2020

There was another issue with saving pointers into the packID slice: every append may reallocate and copy the backing array, and as there are still references to the old array, it will never be garbage collected...
Hence I switched to saving the index into the packs slice instead.
I did, however, do something wrong when using git rebase, so the commits about saving packs and saving duplicate blobs separately are now merged into one commit - sorry about that!
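
A minimal sketch of the problem (illustrative types, not restic's actual code): a pointer taken into the slice keeps the old backing array alive after an append reallocates it, whereas a plain index does not.

```go
package main

import "fmt"

type packID [32]byte

func main() {
	packs := make([]packID, 0, 1)
	packs = append(packs, packID{1})

	// Taking a pointer into the slice pins its current backing array.
	p := &packs[0]

	// This append exceeds the capacity, so Go allocates a bigger backing
	// array and copies the elements over. The pointer still refers to the
	// old array, which cannot be garbage collected while p is reachable.
	packs = append(packs, packID{2})

	fmt.Println(p == &packs[0]) // false: p points into the abandoned array
	fmt.Println(*p == packs[0]) // true: same value, but from a pinned allocation

	// Storing the position instead of a pointer avoids the pinning and keeps
	// the index entry small.
	idx := 0
	fmt.Println(packs[idx])
}
```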

@MichaelEischer (Member) left a comment


I actually had implemented a proof-of-concept of rather similar changes in https://github.com/MichaelEischer/restic/commits/opt-hash-idx. However, I never got around to cleaning up the code. Feel free to "steal" useful parts.

For deduplicating packIDs into a packIndex, I've also used an additional hashmap during index construction to ensure that each packID is only added once. With the StorePack method from #2773 that's probably no longer necessary. See d2086a1.

The duplicates overflow map is similar to MichaelEischer@8d2fbe5. You might want to take the BenchmarkIndexAlloc from there. The commit message also contains a rather detailed explanation of how the changes work and why they were made. I can tell from my own experience with, e.g., some config file changes from roughly a decade ago, which had no explanation at all (and no, the original author didn't remember the why anymore), that it is extremely useful to include a detailed commit message when the reason for a change is not completely obvious.

The conversion to uint32 index/length fields is much less messy than my variant. 👍
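
For illustration, a minimal sketch of how the two points above (deduplicating pack IDs via a construction-time map, and an overflow map for duplicate blobs) can fit together. All names are hypothetical stand-ins, and the actual code in internal/repository/index.go differs in detail:

```go
package index

// blobHandle plays the role of restic.BlobHandle (blob ID plus blob type).
type packID [32]byte

type blobHandle struct {
	id  [32]byte
	typ uint8
}

type indexEntry struct {
	packIndex      int // position in Index.packs
	offset, length uint32
}

type Index struct {
	packs     []packID                    // each pack ID is stored exactly once
	packIndex map[packID]int              // construction-time helper: pack ID -> position in packs
	blob      map[blobHandle]indexEntry   // first (and usually only) occurrence, stored inline
	overflow  map[blobHandle][]indexEntry // rare additional occurrences of the same blob
}

// addToPacks returns the position of id in packs, appending it at most once.
func (idx *Index) addToPacks(id packID) int {
	if pos, ok := idx.packIndex[id]; ok {
		return pos
	}
	pos := len(idx.packs)
	idx.packs = append(idx.packs, id)
	idx.packIndex[id] = pos
	return pos
}

// store keeps the common single-occurrence case inline in the main map and
// pushes further occurrences of the same blob into the small overflow map.
func (idx *Index) store(h blobHandle, e indexEntry) {
	if _, ok := idx.blob[h]; !ok {
		idx.blob[h] = e
		return
	}
	idx.overflow[h] = append(idx.overflow[h], e)
}
```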

Inline review comment on internal/repository/index.go (outdated, resolved)
@aawsome (Contributor, Author) commented Jun 13, 2020

> I actually had implemented a proof-of-concept of rather similar changes in https://github.com/MichaelEischer/restic/commits/opt-hash-idx. However, I never got around to cleaning up the code. Feel free to "steal" useful parts.

Oh, I wasn't aware of this branch - I did steal some parts, thank you 😄

> For deduplicating packIDs into a packIndex, I've also used an additional hashmap during index construction to ensure that each packID is only added once. With the StorePack method from #2773 that's probably no longer necessary. See d2086a1.

I added this for the Store path - if Store is removed in the future, this code can be cleaned up as well.

> You might want to take the BenchmarkIndexAlloc from there. The commit message also contains a rather detailed explanation of how the changes work and why they were made. I can tell from my own experience with, e.g., some config file changes from roughly a decade ago, which had no explanation at all (and no, the original author didn't remember the why anymore), that it is extremely useful to include a detailed commit message when the reason for a change is not completely obvious.

I added the benchmark (thanks for the hint!) together with a detailed comment on the current state in the first commit.
In the following commits I also updated this comment to make clear what the impact of each change is, and added the explanation of why indexEntry is inlined into the map to the commit message.

I hope all comments are understandable; happy to get feedback or suggestions for improvements!

@MichaelEischer (Member) left a comment


I have a few more remarks on the comments, but besides that the PR seems finished.

10 inline review comments on internal/repository/index.go (outdated, resolved)
aawsome and others added 5 commits June 14, 2020 07:50
A side remark on the definition of Index.blob:

Another possibility would have been to use:
blob       map[restic.BlobHandle]*indexEntry

This would have led to the following sizes:
key: 32 + 1 = 33 bytes
value: 8 bytes
indexEntry:  8 + 4 + 4 = 16 bytes
each packID: 32 bytes

To save N index entries, we would therefore have needed (with OF the map overhead factor and BP the average number of blobs per pack):
N * OF * (33 + 8) bytes + N * 16 bytes + N * 32 bytes / BP = N * 82 bytes (for OF = 1.5, BP = 8)

More precisely, using a pointer instead of a direct entry is the better memory choice if:
OF * 8 bytes + entrysize < OF * entrysize <=> entrysize > 8 bytes * OF/(OF-1)
Under the assumption of OF = 1.5, this means using pointers would have been the better choice
if sizeof(indexEntry) > 24 bytes.
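
As a concrete check (my addition, using the numbers above): with OF = 1.5 the break-even size is 8 * 1.5 / 0.5 = 24 bytes, while the inlined indexEntry is only 16 bytes, so storing it by value in the map is indeed the cheaper choice here.
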
@aawsome (Contributor, Author) commented Jun 14, 2020

Here are the benchmark results on my laptop:

before the changes (only including commit 7419844):

BenchmarkPackerManager-4                           	       4	 265876473 ns/op	 198.21 MB/s	   93398 B/op	    1708 allocs/op
BenchmarkDecodeIndex-4                             	   77702	     16464 ns/op
BenchmarkIndexHasUnknown-4                         	13663558	        85.7 ns/op
BenchmarkIndexHasKnown-4                           	13718383	        87.5 ns/op
BenchmarkIndexAlloc-4                              	       1	1531219826 ns/op	904583360 B/op	 2707659 allocs/op
BenchmarkMasterIndexLookupSingleIndex-4            	 6282409	       217 ns/op
BenchmarkMasterIndexLookupMultipleIndex-4          	 1929421	       615 ns/op
BenchmarkMasterIndexLookupSingleIndexUnknown-4     	 8851906	       135 ns/op
BenchmarkMasterIndexLookupMultipleIndexUnknown-4   	 2059674	       564 ns/op
BenchmarkSaveAndEncrypt-4                          	 6194496	       191 ns/op	22009700.59 MB/s	       3 B/op	       0 allocs/op
BenchmarkLoadTree-4                                	    3322	    355004 ns/op
BenchmarkLoadBlob-4                                	     144	   8240706 ns/op	 121.35 MB/s
BenchmarkLoadAndDecrypt-4                          	     126	   9521347 ns/op	 105.03 MB/s
BenchmarkLoadIndex-4                               	      33	  40962411 ns/op
ok  	github.com/restic/restic/internal/repository	165.710s

with changes:

BenchmarkPackerManager-4                           	       4	 266619089 ns/op	 197.66 MB/s	   93466 B/op	    1709 allocs/op
BenchmarkDecodeIndex-4                             	   75967	     16353 ns/op
BenchmarkIndexHasUnknown-4                         	14869858	        85.5 ns/op
BenchmarkIndexHasKnown-4                           	13833954	        86.4 ns/op
BenchmarkIndexAlloc-4                              	       1	1130858274 ns/op	785641912 B/op	  976445 allocs/op
BenchmarkMasterIndexLookupSingleIndex-4            	 5429026	       223 ns/op
BenchmarkMasterIndexLookupMultipleIndex-4          	 1880902	       630 ns/op
BenchmarkMasterIndexLookupSingleIndexUnknown-4     	 8154114	       135 ns/op
BenchmarkMasterIndexLookupMultipleIndexUnknown-4   	 2070068	       556 ns/op
BenchmarkSaveAndEncrypt-4                          	 6143935	       192 ns/op	21827305.25 MB/s	       3 B/op	       0 allocs/op
BenchmarkLoadTree-4                                	    3522	    357187 ns/op
BenchmarkLoadBlob-4                                	     145	   8296829 ns/op	 120.53 MB/s
BenchmarkLoadAndDecrypt-4                          	     126	   9814544 ns/op	 101.89 MB/s
BenchmarkLoadIndex-4                               	      32	  40884265 ns/op

ok  	github.com/restic/restic/internal/repository	114.982s

@MichaelEischer (Member) left a comment


LGTM. Thanks!

Grrr, it seems like GitHub is unable to just show the raw difference introduced by a force push. However, git diff b6d687e2..1361341c or git range-diff b6d687e2...1361341c (which compares the patch series) comes to the rescue.

@MichaelEischer merged commit 74bc714 into restic:master on Jun 14, 2020
@aawsome deleted the reduce-index-memory branch on Jun 14, 2020 at 09:02
@greatroar (Contributor) commented:

Confirmed that this works: restic mount of a 1 TB repo used 1.714 GiB peak RAM at 1d66bb9, now 1.325 GiB, a 22.7% reduction. The discrepancy with the predicted 41% is largely due to the blob size cache.

@aawsome (Contributor, Author) commented Jun 14, 2020

I see I should bring #2587 back into good shape 😉 - will rebase it onto master soon...
