Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rebuild index in prune by using in-memory index #2842

Merged
merged 4 commits into from
Nov 6, 2020

Conversation

aawsome
Copy link
Contributor

@aawsome aawsome commented Jul 19, 2020

What is the purpose of this change? What does it change?

Optimize the rebuilding of the index during prune by using the in-memory index instead of reading all pack headers.

Was the change discussed in an issue or in the forum before?

This was a part of #2718 that has been extracted to be merged after #2718 has been merged.
To not save exact duplicates multiple times in the rebuilt index, either #2839 or #2863 should also be merged.

closes #1599
closes #3049

Checklist

  • I have read the Contribution Guidelines
  • I have enabled maintainer edits for this PR
  • I have added tests for all changes in this PR
  • I have updated the documentation for the changes (in the manual)
  • There's an updated file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

@aawsome aawsome mentioned this pull request Jul 19, 2020
11 tasks
@aawsome
Copy link
Contributor Author

aawsome commented Jul 19, 2020

Note that actually new index files are checked for "fullness" and saved only after having processed a complete index. This is not totally correct and will be a problem, if this is used on one big merged index. (see #2818 )

@aawsome
Copy link
Contributor Author

aawsome commented Jul 24, 2020

After #2818 has been merged, I rebased this PR. Now also the newly generated index is checked for "fullness" after writing the contents of pack.

@aawsome aawsome force-pushed the rebuild-index-inmem branch 2 times, most recently from bf6692c to af8ddbd Compare July 28, 2020 17:12
@aawsome aawsome mentioned this pull request Aug 2, 2020
8 tasks
@aawsome
Copy link
Contributor Author

aawsome commented Aug 2, 2020

rebased after #2840 is merged.

cmd/restic/cmd_prune.go Outdated Show resolved Hide resolved
Copy link
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented below the current implementation can forget blobs in rewritten packs and will cause check to warn about packs which are listed in multiple indexes. Please fix this.

This PR also needs tests. There is currently no test coverage for RebuildIndex. I think we also need a changelog entry, nothing too fancy as it will eventually be merged with the one for the main prune PR.

internal/repository/index.go Outdated Show resolved Hide resolved
internal/repository/master_index.go Outdated Show resolved Hide resolved
internal/repository/repository.go Outdated Show resolved Hide resolved
internal/repository/master_index.go Outdated Show resolved Hide resolved
cmd/restic/cmd_prune.go Outdated Show resolved Hide resolved
internal/repository/index.go Outdated Show resolved Hide resolved
internal/repository/index.go Outdated Show resolved Hide resolved
@aawsome
Copy link
Contributor Author

aawsome commented Oct 5, 2020

@MichaelEischer Thanks for your feedback!
If it's ok for you, I would suggest we first focus on #2718 and deal with this PR after.
Then the changelog of #2718 can be extended here.

Also I'm a bit short on time ATM and would like to spend this making #2718 merge-ready.

@MichaelEischer
Copy link
Member

Also I'm a bit short on time ATM and would like to spend this making #2718 merge-ready.

There's no need to hurry (too much). I actually hope to merge the VSS support (which shouldn't take too long anymore) before the prune rewrite PRs. That way we could release the VSS support soon (hopefully) and give the prune changes several weeks on master before getting them into a release.

Copy link
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After giving prune a try with the optimizations from this PR, #2718 and #2941, I've noticed that the index rebuild is missing a progress bar, please add one (Maybe based on the number of packs which are already added?). And obviously the performance is far better than before :-) .

cmd/restic/cmd_prune.go Outdated Show resolved Hide resolved
internal/repository/master_index.go Outdated Show resolved Hide resolved
internal/repository/master_index.go Show resolved Hide resolved
internal/repository/repository.go Outdated Show resolved Hide resolved
@aawsome
Copy link
Contributor Author

aawsome commented Oct 9, 2020

Ok, I just realized that (as you can see in the failing test - one of my "edge case repos") this PR should really only be merged after #2718!

The reason is that blobs not existing in the index are processed by current prune (as the first step is to rebuild the index) but then not saved in the rebuilt index (as this only saves the contents loaded from the index files).

This situation will change with #2718 which would simply fail for this setting and would request a manual rebuild-index. So I'll build this PR on top of #2718.

cmd/restic/cmd_prune.go Outdated Show resolved Hide resolved
internal/repository/master_index.go Outdated Show resolved Hide resolved
@aawsome
Copy link
Contributor Author

aawsome commented Oct 10, 2020

I rebased this to #2718
Now there are two open things left: The test case and the progress bar.

@aawsome aawsome force-pushed the rebuild-index-inmem branch 2 times, most recently from 8099cc1 to 384fa80 Compare October 10, 2020 17:08
@aawsome
Copy link
Contributor Author

aawsome commented Oct 10, 2020

I added the progress bar. Using packs as counter needs information from Repack how many packs have been written - which is unfortunately not available. I used blobs as counter. This information is also not available as we would need the information how many blobs have been actually written by Repack. However, if there are no duplicates or all duplicates have been repacked (one of those should be the usual case), prune knows the exact number.

I'm a bit struggling about the test case. Of course, the code is used in the integration tests and works there. As this saves the index to the repository, I cannot think about another good test except some kind of integration test. Which extra test case did you have in mind?

@MichaelEischer
Copy link
Member

MichaelEischer commented Oct 11, 2020

It would be possible to recalculate the number of packs by iterating over the index to collect all packs and then removing the rewritten ones. I haven't tested the blob counter yet, so i can't judge whether that's too noisy or not.

For the test case I have something like internal/index.TestIndexSave in mind.

@aawsome
Copy link
Contributor Author

aawsome commented Oct 13, 2020

It would be possible to recalculate the number of packs by iterating over the index to collect all packs and then removing the rewritten ones. I haven't tested the blob counter yet, so i can't judge whether that's too noisy or not.

I wanted to avoid the iteration and now count blobs before and after repacking to get the exact number of blobs that Repack created. This allows to exactly know how many blobs Save must save.
I don't know if I understood your "noisyness" concern correctly, but the progress bar will only update itself at most every minTickerTime (set to 1/60 s).

For the test case I have something like internal/index.TestIndexSave in mind.

I boldly copied (and adapted) that test function into internal/repositry 😉

@aawsome aawsome force-pushed the rebuild-index-inmem branch 2 times, most recently from b276c25 to 6dbca52 Compare October 16, 2020 18:16
@MichaelEischer
Copy link
Member

@aawsome I've added a hook some time ago that allows injecting / wrapping a backend into OpenRepository. Search for env.gopts.backendTestHook in the integration tests.

@aawsome
Copy link
Contributor Author

aawsome commented Oct 17, 2020

@aawsome I've added a hook some time ago that allows injecting / wrapping a backend into OpenRepository. Search for env.gopts.backendTestHook in the integration tests.

Great, that worked for me! The test is added (also tested that the test it failes without this PR; data is listed twice) and I also mention it in the changelog.

@aawsome
Copy link
Contributor Author

aawsome commented Oct 18, 2020

@MichaelEischer Struggling with the test yesterday, I forgot to mention another change:
When changing the progress bar from blobs to packs, I also changed internal/index.EachByPack to yield a structure such that idx.StorePack can be used in MasterIndex.Save instead of idx.Store - this is bit faster and it doesn't fill the packIDToIndex map.

cmd/restic/integration_test.go Outdated Show resolved Hide resolved
cmd/restic/integration_test.go Outdated Show resolved Hide resolved
changelog/unreleased/pull-2718 Outdated Show resolved Hide resolved
@aawsome
Copy link
Contributor Author

aawsome commented Nov 3, 2020

rebased this to the new version of #2718 (only docu and the PruneOptions in test was affected)

@aawsome
Copy link
Contributor Author

aawsome commented Nov 5, 2020

rebased after #2718 was merged.

Copy link
Member

@fd0 fd0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment about the computation of packs added, apart from that I'm happy. Thanks for your work!

Copy link
Member

@MichaelEischer MichaelEischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

High download bandwidth usage on prune Small changes, long prune time
4 participants