capture and validate backed up files checksums #1620

Open
ifedorenko opened this issue Feb 16, 2018 · 14 comments
Labels
category: backup · category: resilience (preventing and recovering from repository problems) · type: feature suggestion (suggesting a new feature)

Comments

@ifedorenko
Contributor

As a (paranoid) user, I want backup to capture the SHA256 checksum of each file I back up. I also want restore to fail if a file's checksum after restore does not match the checksum captured during backup. Likewise, I want check to have an option to verify that file data matches the checksums captured during backup.

@rawtaz
Contributor

rawtaz commented Feb 16, 2018

For backup, the blobs in which your data is stored in the repository are already saved under a name that corresponds to the checksum of the data they hold, and SHA256 is already used for this.

The restore command will alert you if, upon restoring a blob, the checksum of the data it retrieves no longer corresponds to the checksum that was stored as part of the backup.

The check command does some basic checks by default, but if you pass it the --read-data option, it will verify these things for you (by specifically reading all the data and checking that it still matches the expected checksums).

Is there anything in the above which does not cover the things you want? I'm not sure it's clear what this issue is really about.

It would be great if in the future you use the issue template, as it helps structure the content of the issue so that we get a clearer picture of it. And if it's a more general, discussion-like topic you want to bring up, the forum is a great place for that. Thanks!

@fd0 added the "type: feature suggestion" label on Feb 16, 2018
@ifedorenko
Contributor Author

Say I back up a 10 MB file named "PICTURE.JPEG". This file will be split into ~10 blobs and stored in 2+ pack files in the repository. While restic does checksum and validate the individual blobs and pack files used to store the file data (as well as other internal data structures), it does not currently checksum/validate the file as a whole, and that is what I am asking for in this user story.
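
For illustration, a minimal sketch of what I mean (not restic code; fixed-size reads stand in for restic's content-defined chunker): the whole-file digest can be folded in during the same pass that produces the blobs, so no extra read is needed.

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// hashWhileChunking reads a file chunk by chunk and folds every chunk into
// a single whole-file SHA256, so the file-level checksum falls out of the
// same pass that produces the blobs.
func hashWhileChunking(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	buf := make([]byte, 1<<20) // 1 MiB chunks, for illustration only
	for {
		n, err := f.Read(buf)
		if n > 0 {
			h.Write(buf[:n]) // each "blob" also feeds the file hash
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	sum, err := hashWhileChunking("PICTURE.JPEG")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(sum)
}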

@rawtaz
Contributor

rawtaz commented Feb 16, 2018

Right, understood. So, what problem does this solve? If the verification of those ten different parts of the file is good, then what you want to check is basically that restic restored them in the right order, or that there is no hardware issue causing the restored file to be corrupt?

@ifedorenko
Contributor Author

Right, I want to be able to check whether restic restored the parts in the right order and didn't lose any of them. I also want to be able to manually check whether files I have locally match files in the backup.

@jdeng

jdeng commented Jan 17, 2019

Any update on this issue? I think having SHA256 checksums for the files (e.g., stored as an attribute in the node data structure) would be handy. Applications could use this field to detect duplicate files, validate files independently, etc. The field could be optional and turned off by default.

One use case is to scan all the trees and build a SHA256 map of all the files without reading the blobs.
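
As a rough illustration of that use case (hypothetical code: the file_id field name follows the diff below, loadTree stands in for whatever fetches and decrypts a tree by ID, and the JSON shape of the tree objects is assumed):

// Sketch: walk decoded tree objects and collect file_id -> paths without
// ever touching the data blobs.
package treemap

import "path"

type node struct {
	Name    string `json:"name"`
	Type    string `json:"type"`
	FileID  string `json:"file_id,omitempty"` // proposed field
	Subtree string `json:"subtree,omitempty"`
}

type tree struct {
	Nodes []node `json:"nodes"`
}

// loadTree is a hypothetical helper that fetches and decrypts a tree by ID.
var loadTree func(id string) (*tree, error)

// collect recursively indexes every file's hash under its path.
func collect(treeID, prefix string, out map[string][]string) error {
	t, err := loadTree(treeID)
	if err != nil {
		return err
	}
	for _, n := range t.Nodes {
		p := path.Join(prefix, n.Name)
		switch n.Type {
		case "file":
			if n.FileID != "" { // field may be unset in old snapshots
				out[n.FileID] = append(out[n.FileID], p)
			}
		case "dir":
			if err := collect(n.Subtree, p, out); err != nil {
				return err
			}
		}
	}
	return nil
}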

@jdeng

jdeng commented Jan 18, 2019

This is a diff that adds file IDs to the backup process. If it looks acceptable, I can prepare a pull request.

index dff1d31d..d34f8f1b 100644
--- a/cmd/restic/cmd_backup.go
+++ b/cmd/restic/cmd_backup.go
@@ -82,6 +82,7 @@ type BackupOptions struct {
 	FilesFrom        []string
 	TimeStamp        string
 	WithAtime        bool
+	FileIDHasher     string
 }
 
 var backupOptions BackupOptions
@@ -108,6 +109,8 @@ func init() {
 	f.StringArrayVar(&backupOptions.FilesFrom, "files-from", nil, "read the files to backup from file (can be combined with file args/can be specified multiple times)")
 	f.StringVar(&backupOptions.TimeStamp, "time", "", "time of the backup (ex. '2012-11-01 22:08:41') (default: now)")
 	f.BoolVar(&backupOptions.WithAtime, "with-atime", false, "store the atime for all files and directories")
+
+	f.StringVar(&backupOptions.FileIDHasher, "file-id-hasher", "", "hash algorithm for file id (default: none)")
 }
 
 // filterExisting returns a slice of all existing items, or an error if no
@@ -497,7 +500,8 @@ func runBackup(opts BackupOptions, gopts GlobalOptions, term *termstatus.Termina
 	p.V("start scan on %v", targets)
 	t.Go(func() error { return sc.Scan(t.Context(gopts.ctx), targets) })
 
-	arch := archiver.New(repo, targetFS, archiver.Options{})
+	archOpts := archiver.Options{FileIDHasher: opts.FileIDHasher}
+	arch := archiver.New(repo, targetFS, archOpts)
 	arch.SelectByName = selectByNameFilter
 	arch.Select = selectFilter
 	arch.WithAtime = opts.WithAtime
diff --git a/internal/archiver/archiver.go b/internal/archiver/archiver.go
index 4ce9ef59..1447294f 100644
--- a/internal/archiver/archiver.go
+++ b/internal/archiver/archiver.go
@@ -96,6 +96,8 @@ type Options struct {
 	// SaveTreeConcurrency sets how many trees are marshalled and saved to the
 	// repo concurrently.
 	SaveTreeConcurrency uint
+
+	FileIDHasher string
 }
 
 // ApplyDefaults returns a copy of o with the default options set for all unset
@@ -745,6 +747,9 @@ func (arch *Archiver) runWorkers(ctx context.Context, t *tomb.Tomb) {
 		arch.Options.FileReadConcurrency, arch.Options.SaveBlobConcurrency)
 	arch.fileSaver.CompleteBlob = arch.CompleteBlob
 	arch.fileSaver.NodeFromFileInfo = arch.nodeFromFileInfo
+	if arch.Options.FileIDHasher != "" {
+		arch.fileSaver.FileIDHasher, _ = restic.GetFileIDHasher(arch.Options.FileIDHasher)
+	}
 
 	arch.treeSaver = NewTreeSaver(ctx, t, arch.Options.SaveTreeConcurrency, arch.saveTree, arch.Error)
 }
diff --git a/internal/archiver/file_saver.go b/internal/archiver/file_saver.go
index 66defe35..c2d1ac20 100644
--- a/internal/archiver/file_saver.go
+++ b/internal/archiver/file_saver.go
@@ -2,6 +2,8 @@ package archiver
 
 import (
 	"context"
+	"encoding/hex"
+	"hash"
 	"io"
 	"os"
 
@@ -65,6 +67,8 @@ type FileSaver struct {
 	CompleteBlob func(filename string, bytes uint64)
 
 	NodeFromFileInfo func(filename string, fi os.FileInfo) (*restic.Node, error)
+
+	FileIDHasher restic.FileIDHasher
 }
 
 // NewFileSaver returns a new file saver. A worker pool with fileWorkers is
@@ -169,6 +173,11 @@ func (s *FileSaver) saveFile(ctx context.Context, chnker *chunker.Chunker, snPat
 
 	var results []FutureBlob
 
+	var fileID hash.Hash
+	if s.FileIDHasher != nil {
+		fileID = s.FileIDHasher.NewHash()
+	}
+
 	node.Content = []restic.ID{}
 	var size uint64
 	for {
@@ -182,6 +191,9 @@ func (s *FileSaver) saveFile(ctx context.Context, chnker *chunker.Chunker, snPat
 		buf.Data = chunk.Data
 
 		size += uint64(chunk.Length)
+		if fileID != nil {
+			fileID.Write(buf.Data)
+		}
 
 		if err != nil {
 			_ = f.Close()
@@ -223,6 +235,10 @@ func (s *FileSaver) saveFile(ctx context.Context, chnker *chunker.Chunker, snPat
 
 	node.Size = size
 
+	if fileID != nil {
+		node.FileID = hex.EncodeToString(fileID.Sum(nil))
+	}
+
 	return saveFileResponse{
 		node:  node,
 		stats: stats,
diff --git a/internal/restic/node.go b/internal/restic/node.go
index 638306ea..b2ac41d3 100644
--- a/internal/restic/node.go
+++ b/internal/restic/node.go
@@ -46,6 +46,7 @@ type Node struct {
 	ExtendedAttributes []ExtendedAttribute `json:"extended_attributes,omitempty"`
 	Device             uint64              `json:"device,omitempty"` // in case of Type == "dev", stat.st_rdev
 	Content            IDs                 `json:"content"`
+	FileID             string              `json:"file_id,omitempty"`
 	Subtree            *ID                 `json:"subtree,omitempty"`
 
 	Error string `json:"error,omitempty"`

The hasher file:

package restic

import (
	"crypto/md5"
	"crypto/sha1"
	"crypto/sha256"
	"fmt"
	"hash"
)

// FileIDHasher constructs hash.Hash instances used to compute whole-file IDs.
type FileIDHasher interface {
	NewHash() hash.Hash
}

type functionHasher struct {
	newHash func() hash.Hash
}

func (fh *functionHasher) NewHash() hash.Hash {
	return fh.newHash()
}

var hashers = make(map[string]FileIDHasher)

func init() {
	hashers["md5"] = &functionHasher{md5.New}
	hashers["sha1"] = &functionHasher{sha1.New}
	hashers["sha256"] = &functionHasher{sha256.New}
}

// RegisterFileIDHasher adds a named hasher; it fails if the name is taken.
func RegisterFileIDHasher(name string, hasher FileIDHasher) error {
	if _, ok := hashers[name]; ok {
		return fmt.Errorf("hasher %q already registered", name)
	}

	hashers[name] = hasher
	return nil
}

// GetFileIDHasher looks up a registered hasher by name.
func GetFileIDHasher(name string) (FileIDHasher, error) {
	if h, ok := hashers[name]; ok {
		return h, nil
	}

	return nil, fmt.Errorf("unknown hasher %q", name)
}
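
For completeness, a sketch of how the saver side would consume this registry (it mirrors the FileSaver changes in the diff above; assumes it lives in the same package with encoding/hex imported):

// computeFileID shows the intended use: file data arrives chunk by chunk,
// exactly as produced by the chunker in the FileSaver diff.
func computeFileID(name string, chunks [][]byte) (string, error) {
	h, err := GetFileIDHasher(name) // e.g. "sha256"
	if err != nil {
		return "", err
	}
	fh := h.NewHash()
	for _, c := range chunks {
		fh.Write(c)
	}
	return hex.EncodeToString(fh.Sum(nil)), nil // value stored in node.FileID
}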

@fd0
Member

fd0 commented Jan 19, 2019

That's one way to implement this, yes. Thanks for the proposal.

A few random thoughts from me:

  • I think it's not a good idea to support several algorithms. When we add this we should decide which hash it should be, then we can amend the design document and be done with it. SHA256 would be my choice, since we're already using it everywhere.
  • Before this can be implemented we need to decide what to do with the user interface and the settings for this option. I think we should (at most) have an on/off switch, nothing more. If it's off by default, nobody will turn it on. If it's on by default, people can turn it off if it causes too much CPU load. I'm leaning towards activating this by default, with an off switch.
  • Adding a new (optional) field in the repository data structures is a non-breaking change, we just need to agree on the details. A backup with a new version of restic adding this field may re-write all the metadata.
  • Whenever code accesses the new field, there must be a graceful fallback if the field is unset (a sketch follows this list).
  • It'd be great if the PR adding this field (or a follow-up PR proposed shortly after) adds functionality which uses the additional information, so we have something we can show users.
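
To make the fallback point concrete, a minimal sketch (not actual restic code; sha256File is a hypothetical helper that hashes a file on disk):

// verifyRestoredFile checks a restored file against the recorded FileID,
// degrading gracefully when the snapshot predates the new field.
func verifyRestoredFile(node *restic.Node, restoredPath string) error {
	if node.FileID == "" {
		// old snapshot without file IDs: only restic's existing
		// blob-level verification applies, so don't fail here
		return nil
	}
	sum, err := sha256File(restoredPath) // hypothetical helper
	if err != nil {
		return err
	}
	if sum != node.FileID {
		return fmt.Errorf("restored file %v: checksum mismatch", restoredPath)
	}
	return nil
}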

@jdeng

jdeng commented Jan 19, 2019

Thanks for the feedback! The reason I think multiple hash algorithms should be supported is that users may have already chosen their own file hashing algorithm, and switching is probably not convenient. For example, Google Drive and SmugMug provide MD5 checksums, and git uses SHA1. If a user wants to build an HTTP frontend on top of a restic backup (similar to the dump command) that uses the same hashing algorithm (to detect duplicates and/or look up content in a database where they store file hashes), I think having such an option is important, and we can make SHA256 the default.

Agreed on the other points. The code changes are meant to illustrate the idea, and the user interface needs more thought. A secondary index (file id to node) is probably the ideal outcome, but it requires more work and I haven't looked into how to make the changes. A full scan of the index to get a file by ID seems inefficient but doable (and could reuse the dump command). Another option is an external index mapping file id to tree id, which would accelerate lookups but is a little unintuitive. Or we could add an index (file id to node) to snapshots without persistence and populate it during LoadSnapshot, but that requires loading all the subtrees (something like below). Any suggestions?

index 61467013..6316ab05 100644
--- a/internal/restic/snapshot.go
+++ b/internal/restic/snapshot.go
@@ -25,6 +25,7 @@ type Snapshot struct {
 	Original *ID       `json:"original,omitempty"`
 
 	id *ID // plaintext ID, used during restore
+	nodeIdx map[string]*Node // node index by file id
 }
 
 // NewSnapshot returns an initialized snapshot struct for the current user and
@@ -201,6 +202,17 @@ func (sn *Snapshot) HasPaths(paths []string) bool {
 	return true
 }
 
+func (sn *Snapshot) LoadNodeIndex() error {
+	sn.nodeIdx = make(map[string]*Node)
+	//TODO:
+	return nil
+}
+
+func (sn *Snapshot) FindFileNode(id string) (*Node, error) {
+	//TODO: 
+	return nil, nil
+}
+
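
For illustration, the TODOs might be filled in roughly like this (hypothetical sketch only, with imports elided: Snapshot currently has no handle to the repository, so some tree-loading abstraction would have to be passed in, and the exact Tree/Node shapes are assumed):

// TreeLoader stands in for whatever loads and decrypts a tree by ID
// (e.g. the repository).
type TreeLoader interface {
	LoadTree(id ID) (*Tree, error)
}

func (sn *Snapshot) LoadNodeIndex(l TreeLoader) error {
	sn.nodeIdx = make(map[string]*Node)
	var walk func(id ID) error
	walk = func(id ID) error {
		tree, err := l.LoadTree(id)
		if err != nil {
			return err
		}
		for _, node := range tree.Nodes {
			if node.FileID != "" { // field may be unset in old snapshots
				sn.nodeIdx[node.FileID] = node
			}
			if node.Subtree != nil {
				if err := walk(*node.Subtree); err != nil {
					return err
				}
			}
		}
		return nil
	}
	return walk(*sn.Tree)
}

func (sn *Snapshot) FindFileNode(id string) (*Node, error) {
	if n, ok := sn.nodeIdx[id]; ok {
		return n, nil
	}
	return nil, fmt.Errorf("no node with file id %v", id)
}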

@cfbao

cfbao commented May 27, 2019

I'd also like to have file hashes captured, for the purpose of a better file version history.
Metadata like mtime and ctime are often unreliable at telling whether a file actually changed, which limits the usefulness of the find command. File-level hashes would solve this problem.

More discussions in this forum thread: https://forum.restic.net/t/per-file-version-history/929

I agree that an on/off switch with one algorithm is enough. Personally I prefer speed over security in this case: restic already uses SHA-256 for blob-level hashes, so file-level hashes would be more of a convenience feature than a security feature.

@Sina1910

I would be glad to find this feature implemented soon.

Advantages relevant to me would be (some copied from above):

  • a check mechanism when restoring files
  • the ability to manually check whether files I have locally match files in the backup
  • a basis for detecting duplicated files
  • a basis for building a map of the checksums of all files in a repo
  • better file version history (metadata like mtime and ctime are often unreliable)
  • identifying renamed/relocated files

@deliciouslytyped

I would like to have SHA256 hashes or the like available so that I can check for duplicate files that I don't want to sync into my repository. Storing these files is not a storage issue (thanks to restic's deduplication); it's an organization issue.

The only workaround I could think of is hashing the files in a restic mount, but for a multi-terabyte remote repository that's not an option, and it is inefficient anyway.

Is there any reasonable workaround that doesn't involve trying to hash my data over a network connection?
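
For the local side of such a comparison, a throwaway sketch that hashes a directory tree, so its output could be diffed against per-file checksums stored in the repo once restic records them:

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

func main() {
	root := os.Args[1]
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		f, err := os.Open(p)
		if err != nil {
			return err
		}
		defer f.Close()
		h := sha256.New()
		if _, err := io.Copy(h, f); err != nil {
			return err
		}
		// one "checksum  path" line per file, sha256sum-style
		fmt.Printf("%s  %s\n", hex.EncodeToString(h.Sum(nil)), p)
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}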

@stp-ip

stp-ip commented Mar 7, 2022

Coming from git-annex, it always seemed useful to have a separate hash store that can be queried before importing data.
The hash store could be cached on the client, and it could live on faster, more reliable storage than the cheaper data archives. This could be seen as another step in splitting up data storage: hashes > metadata > blobs.
A separate hash store should be optional, since it adds duplicate metadata that needs to be kept in sync.
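
A minimal sketch of the query-before-import idea (entirely hypothetical; here the client-side cache is just a JSON file mapping known file checksums to true):

package hashstore

import (
	"encoding/json"
	"os"
)

// HashStore is the client-side cache: the set of file checksums already
// present in the archive.
type HashStore map[string]bool

// Load reads the cached checksum set from disk.
func Load(path string) (HashStore, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var s HashStore
	return s, json.Unmarshal(b, &s)
}

// ShouldImport reports whether a file with this checksum is not yet stored.
func (s HashStore) ShouldImport(sum string) bool { return !s[sum] }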

@johnmaguire

I'm not sure what the implementation would look like, but another interesting idea would be to have the option to store parity files. For example, Gareth George (author of the backrest UI) talks about using par2 in his archives here: https://gareth.page/post/2023-11-02-data-immortality/#erasure-codes-for-great-good-and-the-benefits-of-par2-archives

@MichaelEischer
Member

@johnmaguire The parity discussion belongs in #804.
