capture and validate backed up files checksums #1620
Comments
Is there anything in the above which does not cover the things you want? I'm not sure it's clear what this issue is really about. It would be great if in the future you use the issue template, as it helps structure the content of the issue in a way that gives us a clearer picture of it. And if it's a more general "discussion-like" topic you want to bring up, the forum is a great place for that. Thanks!
Say I back up a 10MB file named "PICTURE.JPEG". This file will be split into ~10 blobs and stored in 2+ pack files in the repository. While restic does checksum and validate the individual blobs and pack files used to store the file data in the repository (as well as other internal data structures), it does not currently checksum/validate the file as a whole, and that is what I am asking for in this user story.
Right, understood. So, what problem does this solve? If the verification of those ten different parts of the file is good, then what you want to check is basically that restic restored them in the right order, or that there is no hardware issue causing the restored file to be corrupt?
Right, I want to be able to check if restic restored parts in the right order and didn't lose any parts. I also want to manually check if files I have locally match files in the backup. |
Any update on this issue? I think having SHA256 checksums for the files (e.g., stored as an attribute within the node data structure) would be handy. Applications can use this field to detect duplicated files, to validate the files independently, etc. This field can be optional and turned off by default. One use case is to scan all the trees and build a SHA256 map of all the files without reading the blobs.
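A toy sketch of that last use case, with simplified stand-ins for restic's tree structures (the `Node`/`Tree` types and the `FileID` field here are hypothetical, not restic's real API): if each file node carried a stored hash, the map could be built from tree metadata alone, without reading any data blobs.

```go
package main

import "fmt"

// Node is a simplified stand-in for a tree entry; FileID is the
// hypothetical per-file hash discussed above, hex-encoded.
type Node struct {
	Name    string
	FileID  string
	Subtree *Tree // non-nil for directories
}

type Tree struct{ Nodes []*Node }

// collectFileIDs walks the tree and maps file hash -> paths, which
// also exposes duplicate files (same hash, multiple paths).
func collectFileIDs(t *Tree, prefix string, out map[string][]string) {
	for _, n := range t.Nodes {
		path := prefix + "/" + n.Name
		if n.Subtree != nil {
			collectFileIDs(n.Subtree, path, out)
			continue
		}
		out[n.FileID] = append(out[n.FileID], path)
	}
}

func main() {
	root := &Tree{Nodes: []*Node{
		{Name: "a.jpg", FileID: "abc123"},
		{Name: "pics", Subtree: &Tree{Nodes: []*Node{
			{Name: "copy.jpg", FileID: "abc123"},
		}}},
	}}
	m := map[string][]string{}
	collectFileIDs(root, "", m)
	fmt.Println(m["abc123"]) // both paths share one hash
}
```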
This is a diff to add file IDs to the backup process. If this looks acceptable, I can prepare a pull request.

```diff
diff --git a/cmd/restic/cmd_backup.go b/cmd/restic/cmd_backup.go
index dff1d31d..d34f8f1b 100644
--- a/cmd/restic/cmd_backup.go
+++ b/cmd/restic/cmd_backup.go
@@ -82,6 +82,7 @@ type BackupOptions struct {
 	FilesFrom []string
 	TimeStamp string
 	WithAtime bool
+	FileIDHasher string
 }

 var backupOptions BackupOptions
@@ -108,6 +109,8 @@ func init() {
 	f.StringArrayVar(&backupOptions.FilesFrom, "files-from", nil, "read the files to backup from file (can be combined with file args/can be specified multiple times)")
 	f.StringVar(&backupOptions.TimeStamp, "time", "", "time of the backup (ex. '2012-11-01 22:08:41') (default: now)")
 	f.BoolVar(&backupOptions.WithAtime, "with-atime", false, "store the atime for all files and directories")
+
+	f.StringVar(&backupOptions.FileIDHasher, "file-id-hasher", "", "hash algorithm for file id (default: none)")
 }

 // filterExisting returns a slice of all existing items, or an error if no
@@ -497,7 +500,8 @@ func runBackup(opts BackupOptions, gopts GlobalOptions, term *termstatus.Termina
 	p.V("start scan on %v", targets)
 	t.Go(func() error { return sc.Scan(t.Context(gopts.ctx), targets) })

-	arch := archiver.New(repo, targetFS, archiver.Options{})
+	archOpts := archiver.Options{FileIDHasher: opts.FileIDHasher}
+	arch := archiver.New(repo, targetFS, archOpts)
 	arch.SelectByName = selectByNameFilter
 	arch.Select = selectFilter
 	arch.WithAtime = opts.WithAtime
diff --git a/internal/archiver/archiver.go b/internal/archiver/archiver.go
index 4ce9ef59..1447294f 100644
--- a/internal/archiver/archiver.go
+++ b/internal/archiver/archiver.go
@@ -96,6 +96,8 @@ type Options struct {
 	// SaveTreeConcurrency sets how many trees are marshalled and saved to the
 	// repo concurrently.
 	SaveTreeConcurrency uint
+
+	FileIDHasher string
 }

 // ApplyDefaults returns a copy of o with the default options set for all unset
@@ -745,6 +747,9 @@ func (arch *Archiver) runWorkers(ctx context.Context, t *tomb.Tomb) {
 		arch.Options.FileReadConcurrency, arch.Options.SaveBlobConcurrency)
 	arch.fileSaver.CompleteBlob = arch.CompleteBlob
 	arch.fileSaver.NodeFromFileInfo = arch.nodeFromFileInfo
+	if arch.Options.FileIDHasher != "" {
+		arch.fileSaver.FileIDHasher, _ = restic.GetFileIDHasher(arch.Options.FileIDHasher)
+	}
 	arch.treeSaver = NewTreeSaver(ctx, t, arch.Options.SaveTreeConcurrency, arch.saveTree, arch.Error)
 }
diff --git a/internal/archiver/file_saver.go b/internal/archiver/file_saver.go
index 66defe35..c2d1ac20 100644
--- a/internal/archiver/file_saver.go
+++ b/internal/archiver/file_saver.go
@@ -2,6 +2,8 @@ package archiver
 import (
 	"context"
+	"encoding/hex"
+	"hash"
 	"io"
 	"os"
@@ -65,6 +67,8 @@ type FileSaver struct {
 	CompleteBlob func(filename string, bytes uint64)
 	NodeFromFileInfo func(filename string, fi os.FileInfo) (*restic.Node, error)
+
+	FileIDHasher restic.FileIDHasher
 }

 // NewFileSaver returns a new file saver. A worker pool with fileWorkers is
@@ -169,6 +173,11 @@ func (s *FileSaver) saveFile(ctx context.Context, chnker *chunker.Chunker, snPat
 	var results []FutureBlob

+	var fileID hash.Hash
+	if s.FileIDHasher != nil {
+		fileID = s.FileIDHasher.NewHash()
+	}
+
 	node.Content = []restic.ID{}
 	var size uint64
 	for {
@@ -182,6 +191,9 @@ func (s *FileSaver) saveFile(ctx context.Context, chnker *chunker.Chunker, snPat
 		buf.Data = chunk.Data
 		size += uint64(chunk.Length)
+		if fileID != nil {
+			fileID.Write(buf.Data)
+		}
 		if err != nil {
 			_ = f.Close()
@@ -223,6 +235,10 @@ func (s *FileSaver) saveFile(ctx context.Context, chnker *chunker.Chunker, snPat
 	node.Size = size
+	if fileID != nil {
+		node.FileID = hex.EncodeToString(fileID.Sum(nil))
+	}
+
 	return saveFileResponse{
 		node: node,
 		stats: stats,
diff --git a/internal/restic/node.go b/internal/restic/node.go
index 638306ea..b2ac41d3 100644
--- a/internal/restic/node.go
+++ b/internal/restic/node.go
@@ -46,6 +46,7 @@ type Node struct {
 	ExtendedAttributes []ExtendedAttribute `json:"extended_attributes,omitempty"`
 	Device uint64 `json:"device,omitempty"` // in case of Type == "dev", stat.st_rdev
 	Content IDs `json:"content"`
+	FileID string `json:"file_id,omitempty"`
 	Subtree *ID `json:"subtree,omitempty"`
 	Error string `json:"error,omitempty"`
```

The hasher file
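The hasher file itself isn't shown above. Judging only from how the diff uses `restic.FileIDHasher` (something with a `NewHash()` method returning a `hash.Hash`) and `restic.GetFileIDHasher` (a name-to-hasher lookup returning a value and an error), a minimal version might look like the following. This is a guessed reconstruction for illustration, not the actual proposed code; the supported algorithm names are assumptions.

```go
package main

import (
	"crypto/md5"
	"crypto/sha1"
	"crypto/sha256"
	"fmt"
	"hash"
)

// FileIDHasher matches how the diff uses it: something that can mint a
// fresh hash.Hash for each file being saved.
type FileIDHasher interface {
	NewHash() hash.Hash
}

// hasherFunc adapts a hash constructor to the interface.
type hasherFunc func() hash.Hash

func (f hasherFunc) NewHash() hash.Hash { return f() }

// GetFileIDHasher resolves the algorithm name passed via
// --file-id-hasher into a hasher, or reports an unknown name.
func GetFileIDHasher(name string) (FileIDHasher, error) {
	switch name {
	case "md5":
		return hasherFunc(md5.New), nil
	case "sha1":
		return hasherFunc(sha1.New), nil
	case "sha256":
		return hasherFunc(sha256.New), nil
	}
	return nil, fmt.Errorf("unknown file id hasher %q", name)
}

func main() {
	h, err := GetFileIDHasher("sha256")
	if err != nil {
		panic(err)
	}
	fmt.Println(h.NewHash().Size()) // digest size in bytes
}
```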
That's one way to implement this, yes. Thanks for the proposal. A few random thoughts from me:
Thanks for the feedback! The reason I think multiple hash algorithms should be supported is that users may have already chosen their own file hashing algorithms, and switching is probably not convenient. For example, Google Drive and SmugMug provide MD5 checksums, and git uses SHA1. If a user wants to build an HTTP frontend on top of a restic backup (similar to the dump command) using the same hashing algorithm (to check for duplicates and/or look up content in a database where they stored the file hashes), I think having such an option is important, and we can make SHA256 the default. Agreed on the other points. The code changes are meant to illustrate the idea, and the user interface needs more thought. It seems to me that a secondary index (file id to node) is probably the ideal outcome, but it requires more work and I haven't looked into how to make the changes. A full scan of the index to get a file by ID seems inefficient but doable (and could use the same dump command). Another option is building an external index mapping file id to tree id, which would accelerate the lookup but is a little unintuitive. Or we could add an index (file id to node) to snapshots without persistence and populate it during LoadSnapshot, but that requires loading all the subtrees (something like below). Any suggestions?
I'd also like to have file hashes captured for the purpose of having a better file version history. More discussion in this forum thread: https://forum.restic.net/t/per-file-version-history/929 I agree that an on/off switch with one algorithm is enough. Personally I prefer speed over security in this case, because restic already uses SHA-256 for blob-level hashes, so file-level hashes would be more of a convenience feature rather than a security feature.
I would be glad to see this feature implemented soon. Advantages relevant to me would be (some copied from above):
I would like to have sha256 hashes or the like available so that I can check for duplicate files that I don't want to sync into my repository. Storing these files is not a storage issue (thanks to restic's deduplication); it's an organization issue. The only workaround I could think of is hashing the files in a mount, but for a multi-terabyte remote repository that's not an option, and it's inefficient anyway. Is there any reasonable workaround that doesn't involve hashing my data over a network connection?
Coming from git-annex, it always seemed useful to have a separate hash store that can be queried before importing data.
I'm not sure what the implementation would look like, but another interesting idea would be to have the option to store parity files. For example, Gareth George (author of the backrest UI) talks about using par2 in his archives here: https://gareth.page/post/2023-11-02-data-immortality/#erasure-codes-for-great-good-and-the-benefits-of-par2-archives |
@johnmaguire The parity discussion belongs in #804.
As a (paranoid) user, I want `backup` to capture a SHA256 checksum of each file I back up. I also want `restore` to fail if the file checksum after restore does not match the checksum captured during backup. Likewise, I want `check` to have an option to verify that file data checksums match the checksums captured during backup.