capture and validate backed up files checksums #1620

Open
ifedorenko opened this issue Feb 16, 2018 · 14 comments
Labels
category: backup · category: resilience (preventing and recovering from repository problems) · type: feature suggestion (suggesting a new feature)

Comments

@ifedorenko
Contributor

As a (paranoid) user, I want backup to capture the SHA256 checksum of each file I back up. I also want restore to fail if a file's checksum after restore does not match the checksum captured during backup. Likewise, I want check to have an option to verify that file data matches the checksums captured during backup.

@rawtaz
Contributor

rawtaz commented Feb 16, 2018

For backup, the blobs in which your data is stored in the repository are already saved under a name that corresponds to the checksum of the data they hold, and SHA256 is already used for this.

The restore command will alert you if, upon restoring a blob, the checksum of the data it retrieves no longer corresponds to the checksum that was stored as part of the backup.

The check command does some basic checks by default, but if you pass it the --read-data option, it will verify these things for you (by specifically reading all the data and checking that it still matches the expected checksums).

Is there anything in the above which does not cover the things you want? I'm not sure it's clear what this issue is really about.

It would be great if in the future you use the issue template, as it helps structure the content of the issue so that we get a clearer picture of it. And if it's a more general, discussion-like topic you want to bring up, the forum is a great place for that. Thanks!

@fd0 added the "type: feature suggestion" label on Feb 16, 2018
@ifedorenko
Contributor Author

Say I back up a 10 MB file named "PICTURE.JPEG". This file will be split into ~10 blobs and stored in 2+ pack files in the repository. While restic does checksum and validate the individual blobs and pack files used to store the file data (as well as other internal data structures), it does not currently checksum/validate the file as a whole, and that is what I am asking for in this user story.
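
For illustration, a minimal sketch of what I mean (not restic code; fixed-size reads stand in for restic's content-defined chunker): the whole-file digest can be folded in during the same pass that produces the blobs, so no extra read is needed.

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// hashWhileChunking reads a file chunk by chunk and folds every chunk into
// a single whole-file SHA256, so the file-level checksum falls out of the
// same pass that produces the blobs.
func hashWhileChunking(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	buf := make([]byte, 1<<20) // 1 MiB chunks, for illustration only
	for {
		n, err := f.Read(buf)
		if n > 0 {
			h.Write(buf[:n]) // each "blob" also feeds the file hash
		}
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	sum, err := hashWhileChunking("PICTURE.JPEG")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(sum)
}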

@rawtaz
Contributor

rawtaz commented Feb 16, 2018

Right, understood. So, what problem does this solve? If the verification of those ten different parts of the file is good, then what you want to check is basically that restic restored them in the right order, or that there is no hardware issue causing the restored file to be corrupt?

@ifedorenko
Contributor Author

Right, I want to be able to check whether restic restored the parts in the right order and didn't lose any of them. I also want to be able to manually check whether files I have locally match files in the backup.

@jdeng

jdeng commented Jan 17, 2019

Any update on this issue? I think having SHA256 checksums for the files (e.g., stored as an attribute in the node data structure) would be handy. Applications could use this field to detect duplicate files, validate files independently, etc. The field could be optional and turned off by default.

One use case is to scan all the trees and build a SHA256 map of all the files without reading the blobs.
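
As a rough illustration of that use case (hypothetical code: the file_id field name follows the diff below, loadTree stands in for whatever fetches and decrypts a tree by ID, and the JSON shape of the tree objects is assumed):

// Sketch: walk decoded tree objects and collect file_id -> paths without
// ever touching the data blobs.
package treemap

import "path"

type node struct {
	Name    string `json:"name"`
	Type    string `json:"type"`
	FileID  string `json:"file_id,omitempty"` // proposed field
	Subtree string `json:"subtree,omitempty"`
}

type tree struct {
	Nodes []node `json:"nodes"`
}

// loadTree is a hypothetical helper that fetches and decrypts a tree by ID.
var loadTree func(id string) (*tree, error)

// collect recursively indexes every file's hash under its path.
func collect(treeID, prefix string, out map[string][]string) error {
	t, err := loadTree(treeID)
	if err != nil {
		return err
	}
	for _, n := range t.Nodes {
		p := path.Join(prefix, n.Name)
		switch n.Type {
		case "file":
			if n.FileID != "" { // field may be unset in old snapshots
				out[n.FileID] = append(out[n.FileID], p)
			}
		case "dir":
			if err := collect(n.Subtree, p, out); err != nil {
				return err
			}
		}
	}
	return nil
}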

@jdeng

jdeng commented Jan 18, 2019

This is a diff that adds file IDs to the backup process. If it looks acceptable, I can prepare a pull request.

index dff1d31d..d34f8f1b 100644
--- a/cmd/restic/cmd_backup.go
+++ b/cmd/restic/cmd_backup.go
@@ -82,6 +82,7 @@ type BackupOptions struct {
 	FilesFrom        []string
 	TimeStamp        string
 	WithAtime        bool
+	FileIDHasher     string
 }
 
 var backupOptions BackupOptions
@@ -108,6 +109,8 @@ func init() {
 	f.StringArrayVar(&backupOptions.FilesFrom, "files-from", nil, "read the files to backup from file (can be combined with file args/can be specified multiple times)")
 	f.StringVar(&backupOptions.TimeStamp, "time", "", "time of the backup (ex. '2012-11-01 22:08:41') (default: now)")
 	f.BoolVar(&backupOptions.WithAtime, "with-atime", false, "store the atime for all files and directories")
+
+	f.StringVar(&backupOptions.FileIDHasher, "file-id-hasher", "", "hash algorithm for file id (default: none)")
 }
 
 // filterExisting returns a slice of all existing items, or an error if no
@@ -497,7 +500,8 @@ func runBackup(opts BackupOptions, gopts GlobalOptions, term *termstatus.Termina
 	p.V("start scan on %v", targets)
 	t.Go(func() error { return sc.Scan(t.Context(gopts.ctx), targets) })
 
-	arch := archiver.New(repo, targetFS, archiver.Options{})
+	archOpts := archiver.Options{FileIDHasher: opts.FileIDHasher}
+	arch := archiver.New(repo, targetFS, archOpts)
 	arch.SelectByName = selectByNameFilter
 	arch.Select = selectFilter
 	arch.WithAtime = opts.WithAtime
diff --git a/internal/archiver/archiver.go b/internal/archiver/archiver.go
index 4ce9ef59..1447294f 100644
--- a/internal/archiver/archiver.go
+++ b/internal/archiver/archiver.go
@@ -96,6 +96,8 @@ type Options struct {
 	// SaveTreeConcurrency sets how many trees are marshalled and saved to the
 	// repo concurrently.
 	SaveTreeConcurrency uint
+
+	FileIDHasher string
 }
 
 // ApplyDefaults returns a copy of o with the default options set for all unset
@@ -745,6 +747,9 @@ func (arch *Archiver) runWorkers(ctx context.Context, t *tomb.Tomb) {
 		arch.Options.FileReadConcurrency, arch.Options.SaveBlobConcurrency)
 	arch.fileSaver.CompleteBlob = arch.CompleteBlob
 	arch.fileSaver.NodeFromFileInfo = arch.nodeFromFileInfo
+	if arch.Options.FileIDHasher != "" {
+		arch.fileSaver.FileIDHasher, _ = restic.GetFileIDHasher(arch.Options.FileIDHasher)
+	}
 
 	arch.treeSaver = NewTreeSaver(ctx, t, arch.Options.SaveTreeConcurrency, arch.saveTree, arch.Error)
 }
diff --git a/internal/archiver/file_saver.go b/internal/archiver/file_saver.go
index 66defe35..c2d1ac20 100644
--- a/internal/archiver/file_saver.go
+++ b/internal/archiver/file_saver.go
@@ -2,6 +2,8 @@ package archiver
 
 import (
 	"context"
+	"encoding/hex"
+	"hash"
 	"io"
 	"os"
 
@@ -65,6 +67,8 @@ type FileSaver struct {
 	CompleteBlob func(filename string, bytes uint64)
 
 	NodeFromFileInfo func(filename string, fi os.FileInfo) (*restic.Node, error)
+
+	FileIDHasher restic.FileIDHasher
 }
 
 // NewFileSaver returns a new file saver. A worker pool with fileWorkers is
@@ -169,6 +173,11 @@ func (s *FileSaver) saveFile(ctx context.Context, chnker *chunker.Chunker, snPat
 
 	var results []FutureBlob
 
+	var fileID hash.Hash
+	if s.FileIDHasher != nil {
+		fileID = s.FileIDHasher.NewHash()
+	}
+
 	node.Content = []restic.ID{}
 	var size uint64
 	for {
@@ -182,6 +191,9 @@ func (s *FileSaver) saveFile(ctx context.Context, chnker *chunker.Chunker, snPat
 		buf.Data = chunk.Data
 
 		size += uint64(chunk.Length)
+		if fileID != nil {
+			fileID.Write(buf.Data)
+		}
 
 		if err != nil {
 			_ = f.Close()
@@ -223,6 +235,10 @@ func (s *FileSaver) saveFile(ctx context.Context, chnker *chunker.Chunker, snPat
 
 	node.Size = size
 
+	if fileID != nil {
+		node.FileID = hex.EncodeToString(fileID.Sum(nil))
+	}
+
 	return saveFileResponse{
 		node:  node,
 		stats: stats,
diff --git a/internal/restic/node.go b/internal/restic/node.go
index 638306ea..b2ac41d3 100644
--- a/internal/restic/node.go
+++ b/internal/restic/node.go
@@ -46,6 +46,7 @@ type Node struct {
 	ExtendedAttributes []ExtendedAttribute `json:"extended_attributes,omitempty"`
 	Device             uint64              `json:"device,omitempty"` // in case of Type == "dev", stat.st_rdev
 	Content            IDs                 `json:"content"`
+	FileID             string              `json:"file_id,omitempty"`
 	Subtree            *ID                 `json:"subtree,omitempty"`
 
 	Error string `json:"error,omitempty"`

The hasher file:

package restic

import (
	"crypto/md5"
	"crypto/sha1"
	"crypto/sha256"
	"fmt"
	"hash"
)

// FileIDHasher constructs hash.Hash instances used to compute whole-file IDs.
type FileIDHasher interface {
	NewHash() hash.Hash
}

type functionHasher struct {
	newHash func() hash.Hash
}

func (fh *functionHasher) NewHash() hash.Hash {
	return fh.newHash()
}

var hashers = make(map[string]FileIDHasher)

func init() {
	hashers["md5"] = &functionHasher{md5.New}
	hashers["sha1"] = &functionHasher{sha1.New}
	hashers["sha256"] = &functionHasher{sha256.New}
}

// RegisterFileIDHasher adds a named hasher; it fails if the name is taken.
func RegisterFileIDHasher(name string, hasher FileIDHasher) error {
	if _, ok := hashers[name]; ok {
		return fmt.Errorf("hasher %q already registered", name)
	}

	hashers[name] = hasher
	return nil
}

// GetFileIDHasher looks up a registered hasher by name.
func GetFileIDHasher(name string) (FileIDHasher, error) {
	if h, ok := hashers[name]; ok {
		return h, nil
	}

	return nil, fmt.Errorf("unknown hasher %q", name)
}
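
For completeness, a sketch of how the saver side would consume this registry (it mirrors the FileSaver changes in the diff above; assumes it lives in the same package with encoding/hex imported):

// computeFileID shows the intended use: file data arrives chunk by chunk,
// exactly as produced by the chunker in the FileSaver diff.
func computeFileID(name string, chunks [][]byte) (string, error) {
	h, err := GetFileIDHasher(name) // e.g. "sha256"
	if err != nil {
		return "", err
	}
	fh := h.NewHash()
	for _, c := range chunks {
		fh.Write(c)
	}
	return hex.EncodeToString(fh.Sum(nil)), nil // value stored in node.FileID
}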

@fd0
Member

fd0 commented Jan 19, 2019

That's one way to implement this, yes. Thanks for the proposal.

A few random thoughts from me:

  • I think it's not a good idea to support several algorithms. When we add this we should decide which hash it should be, then we can amend the design document and be done with it. SHA256 would be my choice, since we're already using it everywhere.
  • Before this can be implemented we need to decide what to do with the user interface and the settings for this option. I think we should (at most) have an on/off switch, nothing more. If it's off by default, nobody will turn it on. If it's on by default, people can turn it off if it causes too much CPU load. I'm leaning towards activating this by default, with an off switch.
  • Adding a new (optional) field in the repository data structures is a non-breaking change, we just need to agree on the details. A backup with a new version of restic adding this field may re-write all the metadata.
  • Whenever code accesses the new field, there must be a graceful fallback if the field is unset (a sketch follows this list).
  • It'd be great if the PR adding this field (or a follow-up PR proposed shortly after) adds functionality which uses the additional information, so we have something we can show users.
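
To make the fallback point concrete, a minimal sketch (not actual restic code; sha256File is a hypothetical helper that hashes a file on disk):

// verifyRestoredFile checks a restored file against the recorded FileID,
// degrading gracefully when the snapshot predates the new field.
func verifyRestoredFile(node *restic.Node, restoredPath string) error {
	if node.FileID == "" {
		// old snapshot without file IDs: only restic's existing
		// blob-level verification applies, so don't fail here
		return nil
	}
	sum, err := sha256File(restoredPath) // hypothetical helper
	if err != nil {
		return err
	}
	if sum != node.FileID {
		return fmt.Errorf("restored file %v: checksum mismatch", restoredPath)
	}
	return nil
}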

@jdeng

jdeng commented Jan 19, 2019

Thanks for the feedback! The reason I think multiple hash algorithms should be supported is that users may have already chosen their own file hashing algorithm, and switching is probably not convenient. For example, Google Drive and SmugMug provide MD5 checksums, and git uses SHA1. If a user wants to build an HTTP frontend on top of a restic backup (similar to the dump command) that uses the same hashing algorithm (to detect duplicates and/or look up content in a database where they store file hashes), I think having such an option is important, and we can make SHA256 the default.

Agreed on the other points. The code changes are meant to illustrate the idea, and the user interface needs more thought. A secondary index (file id to node) is probably the ideal outcome, but it requires more work and I haven't looked into how to make the changes. A full scan of the index to get a file by ID seems inefficient but doable (and could reuse the dump command). Another option is an external index mapping file id to tree id, which would accelerate lookups but is a little unintuitive. Or we could add an index (file id to node) to snapshots without persistence and populate it during LoadSnapshot, but that requires loading all the subtrees (something like below). Any suggestions?

index 61467013..6316ab05 100644
--- a/internal/restic/snapshot.go
+++ b/internal/restic/snapshot.go
@@ -25,6 +25,7 @@ type Snapshot struct {
 	Original *ID       `json:"original,omitempty"`
 
 	id *ID // plaintext ID, used during restore
+	nodeIdx map[string]*Node // node index by file id
 }
 
 // NewSnapshot returns an initialized snapshot struct for the current user and
@@ -201,6 +202,17 @@ func (sn *Snapshot) HasPaths(paths []string) bool {
 	return true
 }
 
+func (sn *Snapshot) LoadNodeIndex() error {
+	sn.nodeIdx = make(map[string]*Node)
+	//TODO:
+	return nil
+}
+
+func (sn *Snapshot) FindFileNode(id string) (*Node, error) {
+	//TODO: 
+	return nil, nil
+}
+
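
For illustration, the TODOs might be filled in roughly like this (hypothetical sketch only, with imports elided: Snapshot currently has no handle to the repository, so some tree-loading abstraction would have to be passed in, and the exact Tree/Node shapes are assumed):

// TreeLoader stands in for whatever loads and decrypts a tree by ID
// (e.g. the repository).
type TreeLoader interface {
	LoadTree(id ID) (*Tree, error)
}

func (sn *Snapshot) LoadNodeIndex(l TreeLoader) error {
	sn.nodeIdx = make(map[string]*Node)
	var walk func(id ID) error
	walk = func(id ID) error {
		tree, err := l.LoadTree(id)
		if err != nil {
			return err
		}
		for _, node := range tree.Nodes {
			if node.FileID != "" { // field may be unset in old snapshots
				sn.nodeIdx[node.FileID] = node
			}
			if node.Subtree != nil {
				if err := walk(*node.Subtree); err != nil {
					return err
				}
			}
		}
		return nil
	}
	return walk(*sn.Tree)
}

func (sn *Snapshot) FindFileNode(id string) (*Node, error) {
	if n, ok := sn.nodeIdx[id]; ok {
		return n, nil
	}
	return nil, fmt.Errorf("no node with file id %v", id)
}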

@cfbao

cfbao commented May 27, 2019

I'd also like to have file hashes captured, for the purpose of a better file version history.
Metadata like mtime and ctime are often unreliable at telling whether a file actually changed, which limits the usefulness of the find command. File-level hashes would solve this problem.

More discussions in this forum thread: https://forum.restic.net/t/per-file-version-history/929

I agree that an on/off switch with one algorithm is enough. Personally I prefer speed over security in this case: restic already uses SHA-256 for blob-level hashes, so file-level hashes would be more of a convenience feature than a security feature.

@Sina1910

I would be glad to find this feature implemented soon.

Advantages relevant to me would be (some copied from above):

  • a check mechanism when restoring files
  • the ability to manually check whether files I have locally match files in the backup
  • a basis for detecting duplicated files
  • a basis for building a map of the checksums of all files in a repo
  • better file version history (metadata like mtime and ctime are often unreliable)
  • identifying renamed/relocated files

@deliciouslytyped

I would like to have SHA256 hashes or the like available so that I can check for duplicate files that I don't want to sync into my repository. Storing these files is not a storage issue (thanks to restic's deduplication); it's an organization issue.

The only workaround I could think of is hashing the files in a restic mount, but for a multi-terabyte remote repository that's not an option, and it is inefficient anyway.

Is there any reasonable workaround that doesn't involve trying to hash my data over a network connection?
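
For the local side of such a comparison, a throwaway sketch that hashes a directory tree, so its output could be diffed against per-file checksums stored in the repo once restic records them:

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

func main() {
	root := os.Args[1]
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		f, err := os.Open(p)
		if err != nil {
			return err
		}
		defer f.Close()
		h := sha256.New()
		if _, err := io.Copy(h, f); err != nil {
			return err
		}
		// one "checksum  path" line per file, sha256sum-style
		fmt.Printf("%s  %s\n", hex.EncodeToString(h.Sum(nil)), p)
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}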

@stp-ip

stp-ip commented Mar 7, 2022

Coming from git-annex, it always seemed useful to have a separate hash store that can be queried before importing data.
The hash store could be cached on the client, and it could live on faster, more reliable storage than the cheaper data archives. This could be seen as another step in splitting up data storage: hashes > metadata > blobs.
A separate hash store should be optional, since it adds duplicate metadata that needs to be kept in sync.
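
A minimal sketch of the query-before-import idea (entirely hypothetical; here the client-side cache is just a JSON file mapping known file checksums to true):

package hashstore

import (
	"encoding/json"
	"os"
)

// HashStore is the client-side cache: the set of file checksums already
// present in the archive.
type HashStore map[string]bool

// Load reads the cached checksum set from disk.
func Load(path string) (HashStore, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var s HashStore
	return s, json.Unmarshal(b, &s)
}

// ShouldImport reports whether a file with this checksum is not yet stored.
func (s HashStore) ShouldImport(sum string) bool { return !s[sum] }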

@johnmaguire

I'm not sure what the implementation would look like, but another interesting idea would be to have the option to store parity files. For example, Gareth George (author of the backrest UI) talks about using par2 in his archives here: https://gareth.page/post/2023-11-02-data-immortality/#erasure-codes-for-great-good-and-the-benefits-of-par2-archives

@MichaelEischer
Member

@johnmaguire The parity discussion belongs in #804.
