restic stats: print uncompressed size in mode raw-data #3915

Merged
merged 4 commits into from Oct 21, 2022

plumbeo
Contributor

@plumbeo plumbeo commented Sep 5, 2022

What does this PR change? What problem does it solve?

This PR prints the uncompressed size of a snapshot/repository when restic stats is called in mode raw-data. For example:

$ restic stats --mode=raw-data -r repotest/
repository ad894609 opened (repository version 2) successfully, password is correct
scanning...
Stats in raw-data mode:
    Snapshots processed:  41
       Total Blob Count:  150677
             Total Size:  71.085 GiB
Total Uncompressed Size:  77.307 GiB

Was the change previously discussed in an issue or on the forum?

No

Checklist

  • I have read the contribution guidelines.
  • I have enabled maintainer edits.
  • I have added tests for all code changes.
  • I have added documentation for relevant changes (in the manual).
  • There's a new file in changelog/unreleased/ that describes the changes for our users (see template).
  • I have run gofmt on the code in all commits.
  • All commit messages are formatted in the same style as the other commits in the repo.
  • I'm done! This pull request is ready for review.

Member

@MichaelEischer MichaelEischer left a comment

The PR could also use a changelog entry once we've finished deciding on what exactly to count for TotalUncompressedSize.

@plumbeo
Contributor Author

plumbeo commented Sep 8, 2022

When I was writing a changelog entry, I realised that printing the number of compressed blobs could be useful too, especially in cases where the repository has been migrated from version 1 to version 2 but not yet repacked to compress the old data, or to track the progress if it's being compressed in several steps.

Unconditionally printing the number of compressed blobs clutters the output a little though, because in some cases now and most cases in the future (since compression is on by default for new repositories) the number of compressed blobs is the same as the total number of blobs, so I added a condition that skips it when that's the case. We should probably discuss it though.

@MichaelEischer
Member

Wouldn't the size of the remaining uncompressed blobs be more useful to judge how much work is left to fully compress a repository? E.g. prune expects a size limit for repacking, not a blob count.

@plumbeo
Contributor Author

plumbeo commented Sep 12, 2022

I considered that, but since the compressed size of a pack is generally unrelated to the uncompressed size, the ratio of compressed to uncompressed size didn't feel as informative as the ratio of the number of compressed blobs to the total number of blobs. Also, "total size" and "total uncompressed size" can already be a little confusing, and adding "total size of uncompressed blobs" didn't really make it any better.

Maybe it could be better to just add a new mode where all the compression related statistics may go, because there are a lot of things that could be useful to a user who specifically cares about them but not to everybody. I'm thinking about at least some of:

  • total size: the actual size of the repository/snapshot
  • total uncompressed size: the size of the repository/snapshot if compression wasn't enabled
  • global compression gain: computed from the previous values, the gain from compression with the current mix of compressed/uncompressed blobs
  • total number of blobs: self-explanatory
  • total number of uncompressed blobs: same
  • total number of compressed blobs: same; together with the previous two it allows tracking the compression state of a repository, if it wasn't compressed from the beginning
  • total size of uncompressed blobs: the space occupied by uncompressed blobs
  • total size of compressed blobs: the space occupied by compressed blobs
  • total uncompressed size of compressed blobs: the space that would be occupied by compressed blobs if they weren't compressed
  • compression ratio: computed from the previous two, the "real" compression ratio of compressed data

And some of these only make sense if a repository/snapshot is only partially compressed. Also, better names would probably be needed.

@MichaelEischer
Member

> didn't feel as informative as the ratio of the number of compressed blobs to the total number of blobs.

But what does that ratio tell me? As the blob sizes can vary wildly, it's a bit random whether that ratio is representative or not. My feeling here is that the information a user is interested in is rather how much data still has to be compressed, or which fraction of the overall data size has already been compressed (1 - size of still uncompressed blobs / total uncompressed size). Although that information might be more useful in the prune command.

Btw, the repack size given to prune is before compression. That is, just summing up the sizes of all still uncompressed blobs would be what's necessary there.
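
A minimal sketch of that summation, for illustration only. The `blobInfo` type, its fields, and the "zero uncompressed length means not compressed" convention are assumptions made for this example and are not taken from restic's code:

```go
package main

import "fmt"

// blobInfo is a hypothetical stand-in for one blob entry in the index.
// It is NOT restic's actual type.
type blobInfo struct {
	StoredLength       uint64 // bytes the blob occupies in its pack file
	UncompressedLength uint64 // original length; 0 means "stored uncompressed" (assumed convention)
}

// isCompressed reports whether the blob is stored compressed,
// under the convention assumed above.
func (b blobInfo) isCompressed() bool {
	return b.UncompressedLength != 0
}

// remainingUncompressedBytes sums the sizes of all blobs that are still
// stored uncompressed, i.e. the amount of data a repack would have to
// read in order to compress everything.
func remainingUncompressedBytes(blobs []blobInfo) uint64 {
	var total uint64
	for _, b := range blobs {
		if !b.isCompressed() {
			total += b.StoredLength
		}
	}
	return total
}

func main() {
	// Made-up sample data: one compressed blob, two uncompressed ones.
	blobs := []blobInfo{
		{StoredLength: 400, UncompressedLength: 1000},
		{StoredLength: 2000},
		{StoredLength: 500},
	}
	fmt.Println(remainingUncompressedBytes(blobs)) // prints 2500
}
```

Because the sum only covers blobs that are not yet compressed, it is a pre-compression size, which is the unit the comment above says prune expects.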

> Maybe it could be better to just add a new mode where all the compression related statistics may go, because there are a lot of things that could be useful to a user who specifically cares about them but not to everybody.

I'd prefer to keep the number of different statistics somewhat limited. After all each new value means more code to maintain.

> And some of these only make sense if a repository/snapshot is only partially compressed.

A partially compressed repository is intended to be a transitional state, thus we shouldn't add too many statistics which are only useful for such repositories.

Judging from the discussion in https://forum.restic.net/t/how-to-check-if-files-were-compressed/5392, TotalUncompressedSize is the most asked-for information, in particular as it allows calculating the compression ratio.

However, I'm still not convinced by TotalCompressedBlobCount. The longer I think about it, the more I'm convinced that its only use would be to more or less guess how far along the process of compressing the repository already is. So, what about replacing it with CompressionProgress (or some other name), which is defined as 1 - "total size of not compressed blobs" / "total uncompressed blob size"? Then stats can just tell a user that the compression process is e.g. 40% done. That number would also allow extrapolating the size reduction to expect from compression: (1 - totalCompressedSize/totalUncompressedSize)*(1/CompressionProgress).
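
The two formulas in this comment can be sketched as follows. The function names and the sample numbers are made up for illustration, and `totalCompressedSize` is read here as the total size currently stored in the repository, which is an interpretation rather than something the comment spells out:

```go
package main

import "fmt"

// compressionProgress returns 1 - notCompressed/totalUncompressed, i.e. the
// fraction of the data (measured before compression) that is already stored
// compressed. Both arguments are sizes in the same unit.
func compressionProgress(notCompressed, totalUncompressed float64) float64 {
	if totalUncompressed == 0 {
		return 0 // empty repository: nothing to report
	}
	return 1 - notCompressed/totalUncompressed
}

// extrapolatedSaving scales the saving observed so far up to the whole
// repository: (1 - totalStored/totalUncompressed) * (1/progress).
// It assumes the remaining data compresses as well as the data compressed so far.
func extrapolatedSaving(totalStored, totalUncompressed, progress float64) float64 {
	if totalUncompressed == 0 || progress == 0 {
		return 0 // no data or nothing compressed yet: no basis for extrapolation
	}
	return (1 - totalStored/totalUncompressed) / progress
}

func main() {
	// Hypothetical repository: 1000 units before compression, 600 of them
	// still stored uncompressed, 900 units currently on disk.
	progress := compressionProgress(600, 1000)
	saving := extrapolatedSaving(900, 1000, progress)

	fmt.Printf("progress: %.0f%%\n", progress*100)       // progress: 40%
	fmt.Printf("expected saving: %.0f%%\n", saving*100) // expected saving: 25%
}
```

In this made-up case the repository is currently 10% smaller than its uncompressed size with 40% of the data compressed, so the extrapolation predicts roughly 25% once everything is compressed.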

@plumbeo
Contributor Author

plumbeo commented Oct 11, 2022

Yeah, it's better to only display what most users will want to see.

I just pushed some new commits that calculate and display the compression progress, calculated as the percentage of data in the repository that has been compressed. I also added the compression ratio, which is not easily obtained because it must be calculated on the actual compressed data and we were not providing the necessary information, and the space saving, calculated on the actual state of the repository. The latter follows directly from the total size and the total uncompressed size, but we might as well print it and save the users some time.

@plumbeo
Contributor Author

plumbeo commented Oct 11, 2022

This is the output on a partially compressed test repository:

$ restic stats --mode=raw-data -r test/ 
enter password for repository: 
repository 7571f6c2 opened (repository version 2) successfully, password is correct
scanning...
Stats in raw-data mode:
     Snapshots processed:  3
        Total Blob Count:  136
 Total Uncompressed Size:  166.959 MiB
              Total Size:  158.648 MiB
    Compression Progress:  48.15%
       Compression Ratio:  1.12x
Compression Space Saving:  4.98%
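
As a rough cross-check, the printed values are consistent with each other under two assumptions drawn from the discussion: the progress is the share of the total uncompressed size that belongs to compressed blobs, and the compression ratio is computed on compressed blobs only. The function names below are hypothetical and the inputs are the rounded figures from the output, so this is a sanity check rather than the code from the PR:

```go
package main

import "fmt"

// spaceSaving returns 1 - totalSize/totalUncompressed, the fraction of
// space saved in the repository's current, partially compressed state.
func spaceSaving(totalSize, totalUncompressed float64) float64 {
	return 1 - totalSize/totalUncompressed
}

// ratioOfCompressedData reconstructs the compression ratio of the blobs
// that are actually compressed, given only the totals and the progress.
func ratioOfCompressedData(totalSize, totalUncompressed, progress float64) float64 {
	// Uncompressed size of the blobs that are already compressed.
	compressedBlobsOriginal := progress * totalUncompressed
	// Blobs not yet compressed occupy their full size on disk.
	notCompressed := (1 - progress) * totalUncompressed
	// What remains of the stored size belongs to the compressed blobs.
	compressedBlobsStored := totalSize - notCompressed
	return compressedBlobsOriginal / compressedBlobsStored
}

func main() {
	// Rounded values copied from the output above, in MiB.
	const (
		totalUncompressed = 166.959
		totalSize         = 158.648
		progress          = 0.4815
	)

	fmt.Printf("space saving: %.2f%%\n", spaceSaving(totalSize, totalUncompressed)*100)
	// space saving: 4.98%
	fmt.Printf("ratio: %.2fx\n", ratioOfCompressedData(totalSize, totalUncompressed, progress))
	// ratio: 1.12x
}
```

Both results match the printed "Compression Space Saving" and "Compression Ratio" lines, which suggests the three derived values are computed from the same underlying totals.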

Member

@MichaelEischer MichaelEischer left a comment

The information provided to the user looks fine by now. However, I'd like to reduce the number of new JSON fields a bit (see comments). Adding 6 new fields with varying degrees of redundancy is too much.

Calculate and display compression ratio, space saving and progress
Member

@MichaelEischer MichaelEischer left a comment

LGTM. Thanks for keeping up with all the review comments!
