Add content hash to ls --json #2870

Closed
wojas opened this issue Aug 4, 2020 · 7 comments
Labels
state: need feedback (waiting for feedback, e.g. from the submitter)
state: need triaging (need categorizing, labeling, next-step decision)

Comments

@wojas
Contributor

wojas commented Aug 4, 2020

Output of restic version

restic 0.9.6 (v0.9.6-337-g0b21ec44-dirty)

What should restic do differently? Which functionality do you think we should add?

Add a content hash to the ls --json output.

When the file was stored in a single chunk, the real sha256 hash is available and the reported hash will have the format "sha256:...".

When the file was split into multiple chunks, it is not possible to show a real content hash, because restic does not currently store this information. We can however construct a hash out of the chunk hashes. To distinguish this from a real sha256 of the contents, this hash will have the format "multi:...".

This 'multi' hash can only be compared within a single repo, as different repos will split files in different locations.

I have a PR ready that I will add in a moment.

What are you trying to do?

I am trying to figure out if any original pictures have succumbed to bitrot since my initial restic backup of them.

Did restic help you today? Did it make you happy in any way?

Keeping the full backup history of my pictures in an efficient way makes it possible to recover from bitrot in the future.

@aawsome
Contributor

aawsome commented Aug 4, 2020

What you are basically looking for is a way to check whether your local files match the state of a backup snapshot. I agree that this would be a nice and useful extension to restic. It is similar to #2011.

However, I don't agree that listing hashes in ls is a good way to solve this. As you already pointed out, hashes for files in the repository that have been split into more than one chunk are not available to restic.
Honestly, I strongly dislike the "hack" of creating an artificial hash from the chunk hashes.

A solution using ls would need to print all chunk hashes together with the sizes of the chunks. This would allow an external program to split the files itself into the same chunks and check the hashes of these chunks.

But I would prefer to have this implemented in restic itself, either as an extension to the restore command or as a new command, e.g. verify.

@greatroar
Contributor

greatroar commented Aug 5, 2020

I share @aawsome's objection. The suggestion is to introduce a hash that doesn't correspond to anything in the restic object model and is also not the hash of a file on disk, so its usefulness is very limited. There must be a cleaner way to compare files that also works with a file outside the repo.

@wojas
Contributor Author

wojas commented Aug 5, 2020

I'm also not too happy with the 'multi' solution. Ideally restic would store the full file hash in the metadata, but this would impose a performance cost during backup. I don't think this is really worth it just for this.

My original approach just included the list of content hashes in the output. Unfortunately these content hashes are not very useful by themselves, unless of course you want to fetch the content. In order to check any local files, you would need to either know the chunk sizes or know the split secret and apply the same algorithm yourself.

This size information is currently unfortunately not available. Perhaps this is something we could add to the metadata and then print both Content and ContentSize slices? This would only work for newer snapshots, but that is OK I guess.

For my purposes of comparing files between different snapshots, just having the list of content hashes would suffice. Would my PR be acceptable if it simply exposed the list of Content IDs in the JSON and got rid of the weird multi hash? I guess this is useful anyway, because it allows you to reconstruct the file contents.
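To illustrate the snapshot comparison, here is a small Go sketch. It assumes the Content IDs of two snapshots from the same repo have already been parsed into maps from file path to an ordered list of hex IDs; the map layout and the names `changedPaths` and `equalIDs` are hypothetical, not restic output or API.

```go
package main

import (
	"fmt"
	"sort"
)

// equalIDs reports whether two ordered lists of chunk IDs are identical.
func equalIDs(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// changedPaths returns, in sorted order, the paths present in both snapshots
// whose chunk ID lists differ. Paths found in only one snapshot are ignored.
func changedPaths(older, newer map[string][]string) []string {
	var changed []string
	for path, oldIDs := range older {
		newIDs, ok := newer[path]
		if ok && !equalIDs(oldIDs, newIDs) {
			changed = append(changed, path)
		}
	}
	sort.Strings(changed)
	return changed
}

func main() {
	older := map[string][]string{"/pics/a.jpg": {"id1", "id2"}, "/pics/b.jpg": {"id3"}}
	newer := map[string][]string{"/pics/a.jpg": {"id1", "id2"}, "/pics/b.jpg": {"id9"}}
	fmt.Println(changedPaths(older, newer))
}
```

A path reported here has different stored content in the two snapshots; whether that is an intended edit or bitrot still has to be judged separately, for example by checking whether the modification time changed.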

I can also add a new issue to discuss the addition of a ContentSizes slice.

@aawsome
Contributor

aawsome commented Aug 5, 2020

> This size information is currently unfortunately not available. Perhaps this is something we could add to the metadata and then print both Content and ContentSize slices? This would only work for newer snapshots, but that is OK I guess.

The chunk size information is available to restic, it is saved in the index. Simply use something like

// Look up each chunk's length in the repository index.
for i, id := range node.Content {
	size[i] = repo.Index().Lookup(id, restic.DataBlob)[0].Length
}

However, I would still prefer to have a verify command or something similar within restic...

@wojas
Contributor Author

wojas commented Aug 5, 2020

> The chunk size information is available to restic, it is saved in the index. Simply use something like

Thanks! I will try this and see how it performs.

> However, I would still prefer to have a verify command or something similar within restic...

I agree that this would be very useful. I currently do not have enough time to make any promises, but I may have a look at how much effort that would take.

This does not, however, preclude making the ls JSON output more useful for those who want to do some restic data mining.

@MichaelEischer
Member

@wojas Your use case sounds a lot like #805 to me. Storing a hash covering a complete file was already requested in #1620.

@MichaelEischer added the labels "state: need feedback" and "state: need triaging" on Oct 10, 2020
@wojas
Contributor Author

wojas commented Jan 6, 2022

I have closed the PR implementing this clunky solution and will close this issue. #805 and #1620 would both solve the original issue I wanted to address. Thanks!

@wojas closed this as completed on Jan 6, 2022

4 participants