
Restic 0.10.0 always reports all directories "changed", adds duplicate metadata, when run on ZFS snapshots #3041

stephenedie opened this issue Oct 27, 2020 · 97 comments

@stephenedie

I run my backups from a ZFS snapshot in order to ensure the entire file-system is in a consistent state. After I upgraded to restic 0.10.0 from the previous official release, the backup started adding a duplicate copy of all the directory meta-data while claiming that all the directories have been changed. For example (pardon my bash):

# restic version
restic 0.10.0 compiled with go1.15.2 on freebsd/amd64

# commands executed for bug (repeatedly on unchanging 
# /usr/local/bin/restic backup -H $HOSTNAME --verbose=1 --cache-dir=$RESTIC_CACHE --exclude-file                 "${path}/${EXCLUDE_NAME}" "$path"
scan finished in 2.928s: 1604 files, 341.787 GiB

Files:           0 new,     0 changed,  1604 unmodified
Dirs:            0 new,   231 changed,     0 unmodified
Data Blobs:      0 new
Tree Blobs:    219 new
Added to the repo: 17.876 MiB

The result occurs repeatedly after re-running the backup on a new ZFS snapshot of an otherwise static file-system. I expect it to work like the previous version in which directories were not seen to be "changed".

I tested this on the same file-system except without using a ZFS snapshot, and it does not report directories as "changed" or upload duplicate metadata. Therefore, this problem seems to be particular to using ZFS snapshots. My method for backing up from ZFS snapshots is as follows:

  • create ZFS snapshot of file-system labeled "restic"
  • run backup of "$basedir/.zfs/snapshot/restic", where the snapshot is mounted
  • destroy the ZFS snapshot labeled "restic"

I find it interesting that restic is uploading new/unique directory meta-data with every run, suggesting that something about the directory meta-data is actually changing between runs. However, earlier versions of restic did not "see" these changes. I'm at a loss as to what's causing this.

In terms of severity, this is merely a nuisance to me---about 30ish MiBs added to the repo each day. However, I could see this being a bigger problem on systems with a lot more small files. Is there any way I can find out what aspect of the directory is being identified as "changed" from the command-line? Adding verbosity did not appear to do the trick.

@stephenedie
Author

Sorry, the formatting is a bit weird in my restic command example above. It should read:

# /usr/local/bin/restic backup -H $HOSTNAME --verbose=1 --cache-dir=$RESTIC_CACHE \
    --exclude-file "${path}/${EXCLUDE_NAME}" "$path"

@rawtaz
Contributor

rawtaz commented Oct 27, 2020

Can you identify a file that is experiencing this problem, and stat it between runs so that we can see what the filesystem says about it? Also, there's a PR that adds an option (for now named --trust-mtime) which in practice ignores ctime changes and just looks at mtime changes. I have no indication that ctime is the problem, but if you want to you can try it to see if it makes a difference: #2823 . The stat information is probably more relevant, though.

@rawtaz added the category: backup and state: need investigating (cause unknown, need investigating/troubleshooting) labels Oct 27, 2020
@greatroar
Contributor

#2823 only changes the behavior for regular files, not directories.

@aawsome
Contributor

aawsome commented Oct 27, 2020

Is there any way I can find out what aspect of the directory is being identified as "changed" from the command-line?

You can run restic diff <old snapshot ID> <new snapshot ID> --metadata

@stephenedie
Author

Thank you for the suggestions! The metadata diff just listed every file and directory with a U to its left. Boring.

The output of running stat on a contained directory is more insightful. All the fields remain the same between different ZFS snapshots, except st_dev. The st_dev field specifies the id of the underlying device, so it makes sense that this changes between ZFS snapshots. It also makes sense that Restic treats these directories as novel as a consequence, and this behavior appears to be more correct than in previous versions. Curiously, it seems that the st_dev field is also being stored by Restic in the directory meta-data, which is why I'm uploading brand new tree blobs with each run. However previously this did not occur, so the older versions must have ignored st_dev when checking for changes even though it's part of the stored meta-data!

Things get weirder when I run stat on files instead of directories. The st_dev field still changes between zfs snapshots but Restic 0.10.0 doesn't seem to care and behaves as earlier versions did for directories. This is an inconsistency that should perhaps be corrected. I can't tell from program behavior whether Restic is storing st_dev with the file meta-data too. If so, it would seem more correct to also check st_dev when comparing files for changes, but that would cause my runs to re-scan all the file contents. :(

With this in mind, perhaps a switch can be added to always ignore st_dev when comparing for changes, to accommodate my use-case and others where st_dev might be changing between runs? Thoughts?

@rawtaz
Contributor

rawtaz commented Oct 27, 2020

Can you please be more elaborate when you describe this? For example, can you show the commands and output of the stat commands? It would be nice to see what you're talking about here.

@stephenedie
Author

Here is a complete annotated session illustrating how the output of stat changes between ZFS snapshots and how this (presumably) affects Restic (starting with 0.10.0):

Step 1: Create new ZFS snapshot. Stat a file and a directory in that ZFS snapshot. Take note of value of the first field, which is st_dev:

# zfs snapshot main/media/video@restic
# stat /data/media/video/.zfs/snapshot/restic/download
10575765696535701816 27 drwxrwxr-t 2 xbmc media 18446744073709551615 19 "May 30 23:51:52 2020" "Jun 12 01:25:35 2017" "May 31 01:10:24 2020" "May 30 21:20:04 2020" 16384 49 0x800 /data/media/video/.zfs/snapshot/restic/download
# stat /data/media/video/.zfs/snapshot/restic/backups.txt
10575765696535701816 97 -rw-rw-r-T 1 root media 18446744073709551615 799 "May 30 21:20:12 2020" "May  1 18:53:09 2010" "May 30 21:20:12 2020" "May 30 21:20:12 2020" 4096 9 0x800 /data/media/video/.zfs/snapshot/restic/backups.txt

Step 2: Backup the contents of the ZFS snapshot. I presume Restic sees changed values for st_dev for all directories and adds tree blobs for them. I believe this behavior is new in 0.10.0. However, it still ignores st_dev for files:

# restic --cache-dir=./temp --verbose=1 backup /data/media/video/.zfs/snapshot/restic
open repository
repository XXXXXXXX opened successfully, password is correct
lock repository
load index files
using parent snapshot XXXXXXXX
start scan on [/data/media/video/.zfs/snapshot/restic]
start backup on [/data/media/video/.zfs/snapshot/restic]
scan finished in 3.199s: 1604 files, 341.787 GiB

Files:           0 new,     0 changed,  1604 unmodified
Dirs:            0 new,   231 changed,     0 unmodified
Data Blobs:      0 new
Tree Blobs:    219 new
Added to the repo: 17.878 MiB

processed 1604 files, 341.787 GiB in 0:30
snapshot XXXXXXXX saved

Step 3: Run the backup again on the same ZFS snapshot. Note that nothing new is added to the repo
this time:

# restic --cache-dir=./temp --verbose=1 backup /data/media/video/.zfs/snapshot/restic
open repository
repository XXXXXXXX opened successfully, password is correct
lock repository
load index files
using parent snapshot XXXXXXXX
start scan on [/data/media/video/.zfs/snapshot/restic]
start backup on [/data/media/video/.zfs/snapshot/restic]
scan finished in 3.311s: 1604 files, 341.787 GiB

Files:           0 new,     0 changed,  1604 unmodified
Dirs:            0 new,     0 changed,   231 unmodified
Data Blobs:      0 new
Tree Blobs:    219 new
Added to the repo: 0 B

processed 1604 files, 341.787 GiB in 0:04
snapshot XXXXXXXX saved

Step 4: Destroy the old ZFS snapshot and create a new ZFS snapshot of the exact same file-system. Note how st_dev is changed for the stat of both the file and directory:

# zfs destroy main/media/video@restic
# zfs snapshot main/media/video@restic
# stat /data/media/video/.zfs/snapshot/restic/download
11704657377658002536 27 drwxrwxr-t 2 xbmc media 18446744073709551615 19 "May 30 23:51:52 2020" "Jun 12 01:25:35 2017" "May 31 01:10:24 2020" "May 30 21:20:04 2020" 16384 49 0x800 /data/media/video/.zfs/snapshot/restic/download
# stat /data/media/video/.zfs/snapshot/restic/backups.txt 
11704657377658002536 97 -rw-rw-r-T 1 root media 18446744073709551615 799 "May 30 21:20:12 2020" "May  1 18:53:09 2010" "May 30 21:20:12 2020" "May 30 21:20:12 2020" 4096 9 0x800 /data/media/video/.zfs/snapshot/restic/backups.txt

Step 5: Run the backup one more time. Note that Restic reports all directories as changed and stores new tree blobs for them:

# restic --cache-dir=./temp --verbose=1 backup /data/media/video/.zfs/snapshot/restic
open repository
repository 509797e0 opened successfully, password is correct
lock repository
load index files
using parent snapshot 7615de93
start scan on [/data/media/video/.zfs/snapshot/restic]
start backup on [/data/media/video/.zfs/snapshot/restic]
scan finished in 3.504s: 1604 files, 341.787 GiB

Files:           0 new,     0 changed,  1604 unmodified
Dirs:            0 new,   231 changed,     0 unmodified
Data Blobs:      0 new
Tree Blobs:    219 new
Added to the repo: 17.876 MiB

processed 1604 files, 341.787 GiB in 0:31
snapshot 70edf3cd saved

Unfortunately, I'm not sure of any easy way to reproduce this behavior unless you have a way to change the underlying st_dev (the ID of the mounted device!) while leaving all the other data and metadata alone. Without being able to manage ZFS snapshots, I suppose you could dd a whole file-system between block devices to test changing st_dev. Ugh!

Just to cover bases, is there another possible explanation? Or should stat account for everything that might change about a directory, other than its contents?

@rawtaz
Contributor

rawtaz commented Oct 28, 2020

Thanks a lot for that clarity :) We'll have to look at whether it makes sense to look at the device ID.

@stephenedie
Author

Let me point out something that dawned on me in case it isn't obvious to you: The st_dev attribute on all the files contained in a directory is also changing between ZFS snapshots. This means that the directory change detection logic may be based on changing st_dev of the constituent files rather than st_dev of the directory itself.

I'm also having second thoughts about whether it is correct that changes to st_dev alone should be treated as changes by Restic. I'm also not sure how st_dev is used by different OSes. My FreeBSD man page says "The st_dev and st_ino fields together identify the file uniquely within the system". My Linux man page makes no such promise. What happens if the file-system is on a USB stick, and the USB stick gets inserted into different computers or even different ports. Does st_dev change? It seems like it could without some guarantee that st_dev will always be stable over time for files+directories residing on the same file-system. I don't know that it's designed to work that way though.

@greatroar
Contributor

My Linux man page makes no such promise. What happens if the file-system is on a USB stick, and the USB stick gets inserted into different computers or even different ports. Does st_dev change?

The GNU libc manual and The Linux Programming Interface both do. However, st_dev is better thought of as the connection to the device rather than the actual device. For internal disk drives that doesn't matter, but when I unplug my USB disk, plug in a USB stick and then plug in the disk again, the stick gets the disk's former device number and the disk gets a new one.

Restic doesn't look at the st_dev field for files because it's not considered in its change detection heuristic. It does not, and cannot have such a heuristic for directories: the timestamps on directories don't reflect changes to the files within, so those would get skipped too. In any case, it still records the metadata change, even for a file that is reported as unmodified (#2823 documents this in some more detail than the current manual). If you think of directories as entirely metadata, the fact that directories still change should make more sense.

The situation is a bit strange at first glance, but it usually works well. There's a few possibilities for improvement:

  • Restic could look at st_dev for changed files for consistency, but that would considerably slow down backups from ZFS snapshots (probably btrfs ones, too) and from removable media. If the default behavior were changed, --ignore-inode should be extended to turn it off, I think.
  • Restic backup could get a flag telling it not to record device numbers and/or inodes, so less near-duplicate directory information needs to be stored. Then again, 30MiB, even per day, is tiny compared to 341.787GiB.
  • The UI could be improved to list file metadata changes separately.

@mamoit

mamoit commented Nov 8, 2020

I'm running restic 0.10.0 on android (because I can), and the "all files and dirs have changed" situation happens when backing up the sdcard.
Running stat on the files yields a weird access time of 1979-12-31 00:00:00; everything else looks normal.
mount reports that the file system is sdcardfs, which I had never heard of to be honest.
Can this be considered the same issue as described, or should I report it on a new issue?

I also had an issue similar to this on my desktop, where a file that had a weird 1903 (or some year of the sort) modified time would always be backed up. I touched it and it stopped being backed up all the time.

@wpbrown

wpbrown commented Nov 13, 2020

I'm doing the exact same thing as @stephenedie except with btrfs snapshots and I'm having the same problem. Restic is resaving all the unchanged tree data for every snapshot.

  File: snap/z7
  Size: 34        	Blocks: 0          IO Block: 4096   directory
Device: 9bh/155d	Inode: 256         Links: 1
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-11-13 00:01:25.815487088 +0000
Modify: 2020-11-13 00:01:35.807477764 +0000
Change: 2020-11-13 00:01:35.807477764 +0000
 Birth: -
  File: snap/z8
  Size: 34        	Blocks: 0          IO Block: 4096   directory
Device: 9ch/156d	Inode: 256         Links: 1
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-11-13 02:57:22.120369133 +0000
Modify: 2020-11-13 00:01:35.807477764 +0000
Change: 2020-11-13 00:01:35.807477764 +0000
 Birth: -

Files use this change-detection logic, which doesn't include the device id:

func fileChanged(fi os.FileInfo, node *restic.Node, ignoreInode bool) bool {

If I'm reading this right, there is no change function for trees; it just relies on the content hash of a tree blob matching one that already exists in the repository:

func (arch *Archiver) saveTree(ctx context.Context, t *restic.Tree) (restic.ID, ItemStats, error) {

@wpbrown

wpbrown commented Nov 14, 2020

For btrfs I bind mount the latest snapshot to a directory so restic sees a stable path. Commenting out this line solves the issue, and unchanged snapshots are properly detected as unchanged by restic.

node.DeviceID = uint64(stat.dev())

Looking at @greatroar's PR, it seems like integrating an --ignore-device option there would be a good solution. I'm not sure there is much value in tracking the device id at all.
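
A minimal sketch of what such an option could look like (the --ignore-device name and the ignoreDevice variable are assumptions, not existing restic code); rather than deleting the line above, the assignment would simply become conditional:

// Hypothetical: only record the device number when not told to ignore it.
if !ignoreDevice { // e.g. set by a --ignore-device backup flag
	node.DeviceID = uint64(stat.dev())
}

With the assignment skipped, identical directory trees from different snapshots would serialize to the same bytes and therefore deduplicate to the same tree blobs again.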

@rawtaz
Contributor

rawtaz commented Nov 14, 2020

I'm not sure there is much value to tracking device id at all.

I agree, what's the use case for looking at the device ID? It can change quite easily. Shan't we just stop doing that (instead of introducing an option for it)?

@stephenedie
Author

In the detailed example I gave, I'm storing 17.876 MiB for 341.787 GiB. It's relatively small, but these are all video files with a very high average size. Running this on trees with lots of small files may be very inefficient.

I tend to agree that Restic should just always ignore and not store/hash st_dev.

@wpbrown

wpbrown commented Nov 14, 2020

Shan't we just stop doing that (instead of introducing an option for it)?

I tend to agree that Restic should just always ignore and not store/hash st_dev.

I can't think of a good reason. It's already ignored for files. Restic is already keying the tree off the absolute path. If that changes, the tree data is repacked. If the device under an absolute path changes (maybe a pre-failure disk gets replaced) and it's mounted at the same path, that wouldn't be any reason to repack the tree data.

The only reason I thought of an ignore option is "backward compat". If you stop tracking the device, everyone would get a one-time tree data repack upon upgrade. I suppose you could automatically include the device if an existing repo already has it, but since restic is pre-1.0, maybe now would be a great time to make the one-time change and not carry this baggage? Only users with a huge amount of small files would likely even notice, and even then it would only be a one-time annoyance.

@NobbZ

NobbZ commented Aug 19, 2021

Can I assume that this very same issue will also exist on BTRFS and LVM snapshots?

Currently I do not use either, but am tempted to use LVM snapshotting to prepare the volumes before backing them up.

In my experience so far, backing up a snapshot of the drive that contains e.g. the mysql database is less error-prone than doing an unsnapshotted backup, because of how files change over time...

@intentionally-left-nil

My 2 cents - it's best to make the file & directory logic similar. I can't imagine why it would be okay to treat a file as cached with a different device ID, but not the folder. So, either restic should add device ID checking to files, or remove it from folders. IMO the latter makes sense to me here.

For restic to produce the wrong behavior:

  1. The file would have to be in the exact same location as the parent snapshot
  2. The ctime/mtime (depending on flags) would have to be the same
  3. However, the user would have had to change the file on a different device. Yes, in theory it's possible to modify two files with the same timestamp on different devices, but I really don't think it's a likely case (and if it's something you're worried about, then it would make more sense to check files for this behavior as well as folders)

Since the device ID isn't stable to begin with, and there's a given use case (ZFS/btrfs snapshots) that is impacted by this, I would just remove the device ID entirely (my 2 cents).

@eyJhb

eyJhb commented Oct 5, 2021

I am not sure what the best solution for this would be, but this is exactly the use case I want to use restic for, i.e. mount zfs snapshots and then use restic to back them up :)

And in my case, I don't have video files, but am backing up my code projects, etc. - I have 248644 files and 60264 folders.

I agree with #3041 (comment) that either ...

  1. Add an option to ignore it
  2. Just ignore the device id altogether. I am unsure what value it adds.

Are there any use cases where it adds value? I would be happy to hear of any. Does borgbackup keep track of the device id as well?

Also, thanks for the project, I am excited to get going!
(also thanks to OP for making me aware of this issue)

@eyJhb

eyJhb commented Oct 7, 2021

I looked at the source for borgbackup, and they do not use the device ID to determine whether a file has changed, nor do they store it in any metadata.
What they have instead is an option named --one-file-system, which tells it not to recurse into other filesystems.

Description can be found here - https://github.com/borgbackup/borg/blob/3ad8dc8bd081a32206246f2a008dee161720144f/src/borg/archiver.py#L3320-L3329

I suggest we remove the DeviceID for now, as there is no reason to have it.
We could add the same --one-file-system option, but I am unsure if we want this?

Can any contributors comment on this? :) I will make a PR if it sounds OK.

@rawtaz
Contributor

rawtaz commented Oct 7, 2021

Is borg's --one-file-system different from restic's --one-file-system (that already exists for the backup command)?

@eyJhb

eyJhb commented Oct 7, 2021

Is borg's --one-file-system different from restic's --one-file-system (that already exists for the backup command)?

Sorry, this is one of the issues that is keeping me from using restic, so I didn't know restic already had this option. Thanks for pointing it out!

I guess just removing the DeviceID as metadata would suffice then, and of course keeping the current --one-file-system as-is.

@MichaelEischer
Member

MichaelEischer commented Oct 9, 2021

The DeviceID is necessary for the hardlink detection in the restore command to work properly. Otherwise two files on different devices but with the same inode numbers (+ both files already using hardlinks) could end up hardlinked together.

My suggestion would be to use pseudo device ids instead. restic could just map device ids to pseudo device ids starting from 0, incrementing that counter each time it encounters a new device id. That should essentially let the subvolume always get the same pseudo device id, which would then ensure that no new tree blobs are created.

[Edit] An alternative could be to add a --ignore-hardlinks option which would allow removing the DeviceID. But I'd rather prefer the pseudo device ids variant [/Edit]
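
A rough sketch of the pseudo device id mapping described above (the type and function names are made up for illustration; this is not restic's actual code):

// deviceMapper assigns small, stable pseudo ids to real st_dev values in
// the order they are first seen during a single backup run.
type deviceMapper struct {
	ids  map[uint64]uint64
	next uint64
}

func newDeviceMapper() *deviceMapper {
	return &deviceMapper{ids: make(map[uint64]uint64)}
}

// pseudoID returns 0 for the first device encountered, 1 for the second,
// and so on, so a backup rooted in a single subvolume always yields the
// same id even if the underlying st_dev changes between snapshots.
func (m *deviceMapper) pseudoID(realDev uint64) uint64 {
	id, ok := m.ids[realDev]
	if !ok {
		id = m.next
		m.next++
		m.ids[realDev] = id
	}
	return id
}

The stored DeviceID would then be the result of pseudoID(st_dev) instead of the raw st_dev value.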

@smlx
Contributor

smlx commented Oct 10, 2021

Since there is no way heuristically for restic to know that it should consider a new device ID to be the same as a different one in another snapshot, maybe an option --override-device-id or --force-device-id? Then people that want to consider different ZFS snapshots to be the same device can hard-code a value in their backup scripts.

@pvgoran

pvgoran commented Oct 10, 2021

Probably --map-device-id would be the best way to go. Like this: --map-device-id 1234=2345, or --map-device-id 1234:2345.

@kakra

kakra commented Mar 2, 2024

I think what we should learn from this is: inode and devid are not stable IDs, they should not go into the metadata storage of the repository - never ever. After each reboot, or even remount, those IDs can change, especially for filesystems using virtual backing devices (like nfs, btrfs subvolumes, probably more) or non-inode filesystems (like fat etc), in case of NFS even after a server reboot (which is why we have inode generation numbers which restic probably totally ignores).

Thus, detecting hard links can only be done reliably during one scan of the filesystem, and metadata must store this information in an inode/devid-agnostic way. Simply ignoring devid or inode is not a proper fix (unless you don't care about proper hardlink restore, which may have bad side effects when you restore).

Metadata should probably keep track of detected hard links, keep them in a hash and assign "virtual" IDs (virtid -> path), tho virtual IDs may not even be needed. Upon successive scans in a new backup session, all hash entries need to be verified and mapped to the current devid/inode numbers to properly update the hash (and discard entries from the hash if all paths do not refer to the same file; it only acts as a cache), and then store it back in the repository (excluding the mapped devid/inode numbers).

During restore this map can be used to recreate hard links to files already extracted. Partial restores should ignore existing hard links not part of the restore set, otherwise it would change existing files which are not part of the restore set - which could be unexpected or even bad.

A file has potentially changed if its ctime has changed. This way, we would no longer detect renamed/moved files, but does this matter with dedup backups anyway? The worst thing that happens is reading the file again even if no content was changed, and in that event, at least the file metadata has changed, which we would store anyway in this case.

With this in place, devids should no longer matter at all, and neither do inodes. It's really not the business of the backup repository if the newly backed up data source comes from different devices or a single/identical one - we just need to know the hardlinks while backing up the current set of files, and this knowledge needs to be fetched from the repository when creating new snapshots, and compared with this new situation.

The problem is really in the design of storing devid and inode numbers as part of the repository metadata. It should really go away, instead keep track of the hardlinks only by grouping paths together, store that in the metadata, and use that to reconstruct known valid hardlinks during the next backup session.

Also, a backup program should not try to be smart about what the parent snapshot is. It's either a snapshot of the base path currently stored, or it's something the user explicitly names on the cmdline (e.g. if you want to dynamically create a consistent snapshot of /home in /snapshots/home-1234, but pretend the parent is always /home, and restic would use that path to determine the last parent snapshot). So I really do not see a point here why we would need to store any filesystem IDs, be it devid, inode, or some UUID.

@kakra

kakra commented Mar 2, 2024

BTW: There's an edge case for file change detection which storing the inode in the archive tries to fix

Getting back to my previous comment: For this to work, the inode must not be stored in the repository. Instead, restic should use a cache storing inodes (and maybe other metadata) outside of the repository. If the cache is gone, well, so be it; the file needs to be read again even if ctime and other data in the repository matches. This cache should never be backed up in the repository because upon restore, inodes would be different anyway (which also implies why the inode must not be stored in the repository: there may be collisions after a restore and a later backup).

Such a cache still won't cover cases where the filesystem has unstable inode numbers. In this case, the user should be able to adjust via cmdline how file-change detection works (ctime, size, mtime, or any combination).
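
An illustrative sketch of such an out-of-repository cache (the struct, field names, and storage location are assumptions, not a proposed format):

// A per-path entry in a local change-detection cache, kept e.g. in the
// existing cache directory and never uploaded to the repository.
type changeCacheEntry struct {
	Ctime int64  `json:"ctime"`
	Mtime int64  `json:"mtime"`
	Size  int64  `json:"size"`
	Inode uint64 `json:"inode"` // only trustworthy while the filesystem stays mounted
}

// changeCache maps absolute paths to their last-seen metadata. Losing the
// cache only forces files to be read again; no backed-up data is affected.
type changeCache map[string]changeCacheEntry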

This also means that meta data from the repository is not suitable for file change detection anyways: you never know what happened to the data between backups, the user could have rsync'ed the filesystem to a new computer... But then again, no ctime would match because you cannot set ctime.

Because of this I now revise my previous comment: The inode is not needed to detect file changes, ctime is sufficient. But you can use the inode to detect file moves and hardlinks - but only while the inode generation matches and the filesystem hasn't been unmounted. Unmounting the filesystem invalidates all knowledge collected from inodes.

@haslersn

haslersn commented Mar 4, 2024

Thus, detecting hard links can only be done reliably during one scan of the filesystem

Detecting hard links using those stored device IDs and inode IDs is reliable, because they are stable within the same snapshot.

@kakra

kakra commented Mar 4, 2024

Detecting hard links using those stored device IDs and inode IDs is reliable, because they are stable within the same snapshot.

Yes, this is true. But this discussion originated from the problem of duplicated metadata in successive backup snapshots, and that's why these IDs should not be stored at all; instead, another way of keeping information about the hardlinks should be found and reconstructed when taking a new snapshot, by looking at the current files to be backed up and loading the current devid/inode from there.

@horvatkevin

We could give each hard-linked file a unique ID. Upon restore, we can then check whether the actual restore target's device IDs match across the hard-linked files to restore the hard link or to perform some sort of fallback/error if the device IDs do not match.

This would eliminate the requirement to keep track of device IDs or to store them in the metadata.

@haslersn

haslersn commented Mar 4, 2024

@kakra another way of keeping the information would have the same duplication problem, which needs to be solved.

@horvatkevin :

We could give each hard-linked file a unique ID.

We already have a unique ID (at least, unique within the same snapshot): i.e. the (Device_ID, Inode) tuple. You are essentially proposing to give this a different name (Unique_ID or so), but that does not solve the problem of how to keep it consistent across snapshots in order to prevent metadata duplication.

In #3041 (comment) I explained how to make these IDs consistent across snapshots (at least in a best effort manner).

Whether or not you like to rename it is completely orthogonal to solving the problem of duplication.

@kakra

kakra commented Mar 4, 2024

I propose that we do not need IDs at all, so there would be no duplication. Instead (but this needs additional conversion or migration) I propose that we keep a separate list of hard links grouped by inode/devid (the latter not stored in the metadata).

This way, on restore, hard links can be reconstructed. If needed, we can create virtual inode IDs (and probably call it a hardlink ID or something similar), which can reduce the amount of data that needs to be stored - but the classic inode/devid should not be stored in the repository, because you cannot ensure that new snapshots are based off the same IDs, and it is virtually impossible to maintain a proper, collision-free mapping from the original IDs to the current ones. We shouldn't try to fit the stored IDs into a purpose they are not meant to be used for.

@haslersn

In order to properly fix this issue, restic needs to reuse the device_ids from the parent snapshot. To do so, there needs to be a mechanism to detect which device_ids in the new (to be created) snapshot correspond to which device_ids from the parent snapshot. In the following, I wrote up a design that might make sense. Please comment on this before I might start to implement it.

This is only half of the solution because inode numbers are not stable IDs in the same way (although in most filesystems, it seems they are) - but technically, inode numbers are subject to change, and for e.g. NFS, they may not even be stable during the same snapshot because we do not record the inode generation number. If we start adding that, too, metadata duplication will only increase.

I'm not trying to say that your thoughts are wrong, but I think we should first try to define whether our assumptions are really true. And they are not: (dev_id,inode_id) is not a unique tuple during the time a snapshot is recorded - it's just very likely to be. We also need the inode generation number. And storing this in the metadata will greatly increase the chance of seeing new metadata in the next snapshot, essentially duplicating it. And this happens because the design of storing these IDs in the repository - although they are known to be unstable - is probably wrong to begin with.

Your design can still work; we "just" (simplified) need to count new hardlinks and store them in the inode field, leaving any other IDs at 0 (also the inode number for regular files), at least for persisted data; at runtime we need all the details. And we also need to add the generation number at runtime to be robust when backing up NFS (and similarly behaving) sources. As long as we only encounter a few hardlinks, there's not much more to do: duplicating a bit of hardlink metadata would probably be okay. If not, we should look into re-using those IDs for the new snapshot as well as possible. But this makes things complicated, and complicated things are usually more likely to contain bugs. Just storing a grouped list of hard links would be much easier.

You are essentially proposing to give this a different name (Unique_ID or so)

Sorry, but no, I'm not. That's what you want to make of it. I proposed to keep an extra list of hardlinked filenames grouped by inode. I just said (indirectly) that if we want to stick with the current fields (which we should remove), then we can probably re-use them for that purpose. But we should be agnostic about inode IDs and thus should not store them at all, e.g., by replacing this number with a new virtual ID. But I still think this makes things more complicated, and it doesn't keep the repository backwards compatible either. Also, inodes are not as unique as they sound. And even if some man page claims that, it may not be true. Even man pages contain errors or miss some details.

What I proposed, tho, is keeping (dev_id,inode) (or whatever identifies a file uniquely) in a cache instead of storing it in the repository. The repository design must work without this information being persisted anywhere; the cache can be used to speed things up.

you can create the device_id mapping efficiently: You simply look up the mountpoints on the currently booted system

No, you cannot. With automounts, a stat call won't resolve the mount point and mount it, so you'll get the wrong device IDs unless the automount has already mounted the destination volume. This behavior is documented somewhere deep in the kernel sources, along with why it does this. You'd have to access an existing path within the to-be-mounted volume, usually . - it will always exist; there's a difference between looking up .../path/to/folder and .../path/to/folder/.. This means you can get a different dev_id at the start of the snapshot vs. reading files into/from the snapshot, if you don't take care to prevent that, which completely messes up the mapping, I'd guess.

Conclusion, and please, @haslersn, don't feel personally offended, because this is not about your comments but the whole thread in general: I feel like there are a lot of misconceptions about what is unique and when, which information is correctly available at which point in time, and what things really are; even some of these behaviors changed between kernel versions, e.g. for the automounters (so we should not rely on a specific behavior)...

@kakra

kakra commented Mar 4, 2024

@horvatkevin

We could give each hard-linked file a unique ID. Upon restore, we can then check whether the actual restore target's device IDs match across the hard-linked files to restore the hard link or to perform some sort of fallback/error if the device IDs do not match.

I think you meant "inode IDs" here, and then, yes, that is generally my idea but instead of persisting this in the metadata per file, we should store it as extra metadata per archive, agnostic of any IDs.

This would eliminate the requirement to keep track of device IDs or to store them in the metadata.

Yes, given you mean "inode IDs". I think we already found that storing device IDs in the repository is quite pointless and should be ignored (...mostly, because currently it is needed to detect hardlinks uniquely if the backup spans multiple volumes). inode/devid (and probably inodegen) should be stored only at runtime, and not persisted in the archive. To speed up future scans, this data can be stored in a cache, tho. But the cache would pretty much be always stale for btrfs or zfs because of changing devid. And because we do not persist IDs in the archive, we need some other means of knowing which filenames link to the same on-disk file (aka hardlinks).

With this, we can still use the old repository format without modifying the existing logic, we create a new format readable by newer restic versions, and we prevent adding edge cases and bugs to the existing logic - which works well for most cases, except those we are adding edge cases for. I'm pretty sure this is not the last edge case and crutch we are going to add, and this is because we are misusing those IDs for a purpose they do not fulfill: they are not stable across remounts/reconnects - so the archives should not try to work around that: archives are by definition their own sort of "mount". Otherwise, we add a mapping here, an edge case there, a complete exception at another place, then another mapping, more edge cases. This will make the code unmaintainable in the long run, and in a few years, nobody will understand what it does.

@haslersn

haslersn commented Mar 4, 2024

I propose that we keep a separate list of hard links

I very much like this solution and this is, as far as I'm aware, the first time that somebody proposes this solution in this thread. So scrap my solution from #3041 (comment).

separate list of hard links grouped by inode/devid

Why grouped by inode/devid? I'd assume the data structure can simply be a set of sets of file paths. Two file paths that are in the same set are hard links to the same file. During restore, if those file paths happen to be on different file systems, they of course cannot be hard linked, so two separate inodes are created instead and a warning is logged.

Some more brainstorming:

In order that we don't need to read the whole set of sets into RAM, I propose to additionally store a SHA-256 reference into above-mentioned set of sets as part of every file's metadata (call it e.g. hardlink_list). This metadata field would be stored only for files whose inode has a reference count > 1. With this design, metadata duplication occurs only for files whose hardlink_list has changed (i.e., the set of file paths which link to the same inode has changed). In particular, there's no duplication if you don't have hard links (with ref count > 1).
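
A small sketch of how such a reference could be computed (this assumes the proposed hardlink_list field; none of it exists in the current repository format):

import (
	"crypto/sha256"
	"sort"
)

// hardlinkListID hashes the sorted set of paths that link to one inode,
// yielding a stable SHA-256 reference that is independent of scan order.
func hardlinkListID(paths []string) [32]byte {
	sorted := append([]string(nil), paths...)
	sort.Strings(sorted)
	h := sha256.New()
	for _, p := range sorted {
		h.Write([]byte(p))
		h.Write([]byte{0}) // separator so "a"+"bc" and "ab"+"c" differ
	}
	var id [32]byte
	copy(id[:], h.Sum(nil))
	return id
}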

@kakra

kakra commented Mar 4, 2024

I propose that we keep a separate list of hard links

I very much like this solution and this is, as far as I'm aware, the first time that somebody proposes this solution in this thread.

This idea mainly resulted from the observation that storing runtime-specific IDs (inode, devid) is probably the wrong way to go forward in the first place. We can keep that to read older repositories but should avoid it for newer archives; I don't think we need a migration. Also, since I originally participated in this thread, I learned a lot about inodes and what you can do with them, what they cannot do, and where the limits are, especially across different filesystems, including how device IDs work with volumes not backed by static physical devices. So whatever ideas I had before, I now think we should avoid persisting such IDs into the archives.

So scrap my solution from #3041 (comment).

While it would have worked, I think it only fixes an edge case and we are going to find more, and then we get code full of edge case handling. So let's do some more brainstorming for the new idea.

separate list of hard links grouped by inode/devid

Why grouped by inode/devid? I'd assume the data structure can simply be a set of sets of file paths.

Essentially, yes, a set of sets (or whatever name golang uses). "grouping" just means the following:

  1. You start scanning the file system or folder structure.
  2. During scan, you will record something that maps discovered devid/inode/gen to a set of path names, if nlinks > 1.
  3. If you encounter another file with nlinks > 1, you'll look up whether devid/inode/gen has already been recorded in a set, and either create a new one or append to the found set.
  4. Point 3 actually means you're grouping by some ID.

Later, when persisting the sets of hardlinks, they can be "anonymised" by removing the IDs; they are no longer interesting. You end up storing a set of sets (a small sketch of this grouping follows after the list below). We don't need the IDs because:

  1. When extracting from the archive, we just will match paths from the sets to find sibling files and create hardlinks.
  2. When adding a new snapshot, the IDs may have changed anyways, so any ID from the previous snapshot is potentially invalid anyways.
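
As referenced above, a minimal sketch of this grouping step (all names are illustrative, not restic internals):

// Files with more than one link are grouped by (devid, inode, generation)
// while scanning; the keys are dropped before the sets are persisted.
type fileID struct {
	dev, inode, gen uint64
}

func recordHardlink(groups map[fileID][]string, id fileID, path string, nlink uint64) {
	if nlink < 2 {
		return // not a hard link, nothing to record
	}
	groups[id] = append(groups[id], path)
}

// anonymise keeps only the path sets, i.e. the "set of sets" that would
// end up in the snapshot, without any device or inode numbers.
func anonymise(groups map[fileID][]string) [][]string {
	sets := make([][]string, 0, len(groups))
	for _, paths := range groups {
		if len(paths) > 1 {
			sets = append(sets, paths)
		}
	}
	return sets
}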

Two file paths that are in the same set are hard links to the same file. During restore, if those file paths happen to be on different file systems, they of course cannot be hard linked, so two separate inodes are created instead and a warning is logged.

Makes sense, I agree.

Some more brainstorming:

In order that we don't need to read the whole set of sets into RAM, I propose to additionally store a SHA-256 reference into above-mentioned set of sets as part of every file's metadata (call it e.g. hardlink_list).

We should have another layer of verification because there can be collisions, so consider having more than one reference per SHA256 entry. If the SHA256 is found, it should verify to find the exact matching entry. Collisions are unlikely but not impossible.

This metadata field would be stored only for files whose inode has a reference count > 1. With this design, metadata duplication occurs only for files whose hardlink_list has changed (i.e., the set of file paths which link to the same inode has changed). In particular, there's no duplication if you don't have hard links (with ref count > 1).

Exactly, that's the whole idea. We could probably do some sort of delta hardlink_list but is that worth the effort? I think someone had some stats that in their backup sets, around 3% of metadata accounts for hardlinks, so this is a very low number, probably with very low noise. I wonder if it would be easier to just store the complete list, it would mean we store 3% of what we would currently store with metadata duplication. That's a pretty good trade.

OTOH, I've not looked at the current implementation details - I believe you know a lot more about it. If it's easy to store just the differences to the hardlink_list, and it's easy to reconstruct from the parent snapshots, then go for it. What about sets removed from the hardlink_list because they no longer exist in the current snapshot? Will it be properly recorded?

This probably involves a breaking format change. For a new client, this is easy: The old code paths are still intact and can work like before for older archive data. We probably could even avoid a one-time duplication by keeping devid/inodeid from the previous snapshot but ignore that on reads in the new code path, and simply not write those ID data for new items in the archive.

An old client, tho, will only be able to read the older format.

As far as I know, there's a planned format change going to happen anyways, and having this ready by then would be a good opportunity to get it in.

Getting this in should avoid all headaches we could have in the future with "random" behavior of filesystems wrt object IDs (inode, dev, ...). It keeps the code easy and doesn't need to touch a lot of the existing code paths to still read old archives in the same way it does now. The only "complicated" things to get right is probably the implicit migration from old-format metadata to new-format metadata on the fly, and storing and reconstructing the hardlink_list.

If it helps performance, we should consider storing a cache with discovered IDs. Keeping it out of the archive means we can simply discard it on format change (or environment change), and the only effect is it being slower one time.

The nice thing about the whole idea is:

If we restore a backup, we will get new inodes. When stripping this data from the archive, even taking a new backup from restored data would not create duplicate metadata (except if we store the ctime, which we should probably avoid, too, and put that in said cache instead).

About ctime and a cache, and without knowledge if restic already does this: ctime serves no purpose in the archive, because it cannot be restored. If we store it in a cache instead, we can still get faster backups. And if the cache becomes destroyed, yes, we need to do full file scans to find if a file has changed - we cannot just skip it. But if the remainder metadata is the same, we also get no duplication.

But this latter idea is a different idea although it somewhat fits into the concept, and thus may be appropriate to change while at it.

Also noteworthy but not part of this problem: We probably still need options to tell which attribute change counts as a backup indicator: Somebody may want to backup only files which are newer, someone else may consider mtime bogus and just consider filesize as a change indicator, some filesystems may not even provide all the details (e.g. FAT has 2-seconds resolution for file times, I think).

Speaking content-wise: A path indicates a file uniquely. A devid/inode is just an internal representation of the file object and should not be taken as a unique identifier of the file unless specific conditions are set, e.g. it's still the same mount and same boot-cycle, for network filesystems even that may not be true (hence there are inode generation numbers), and some filesystems even do not know what an inode is and will just create some on the fly using some algorithm, valid as long as it needs to be. So at best, an inode is a time-limited unique identifier. It just identifies an index node for an unspecified period of time, not the content. But from a user perspective, the paths identifies the file and the contents, and that's what we want to backup.

Thanks for considering, this looks promising. :-)

@kakra

kakra commented Mar 4, 2024

We should have another layer of verification because there can be collisions, so consider having more than one reference per SHA256 entry. If the SHA256 is found, it should verify to find the exact matching entry. Collisions are unlikely but not impossible.

We could probably use ${SHA256}/${devid}/${inode}/${gen} to point into the set at runtime, then count a unique ID to actually store and reference in the metadata. This avoids a level of indirection, but it will restart the unique ID with every backup, thus duplicating the hardlink_list (if you try to de-duplicate it). This in turn could be avoided by re-using the previous hardlink list, recreating the keys, and cleaning it of all entries that no longer resolve to an identical list of hardlinks. Generating new unique IDs would then skip over the ones already in the archive, or just count from the maximum number observed.

@haslersn

haslersn commented Mar 5, 2024

We should have another layer of verification because there can be collisions, so consider having more than one reference per SHA256 entry. If the SHA256 is found, it should verify to find the exact matching entry. Collisions are unlikely but not impossible.

I was not speaking of the SHA-256 of the file. I was speaking of the SHA-256 of the set of file paths.

(SHA-256 is considered to be a cryptographically collision resistant hash function, so we can assume that collisions are never found. The Restic repository format already depends on this assumption by storing every object content-addressed by its SHA-256 hash.)

However, there's a problem: We don't know the set of file paths before the scan, so this would require a 2nd pass. I need to think more about how this can be solved in a single pass.

@kakra

kakra commented Mar 5, 2024

I was not speaking of the SHA-256 of the file. I was speaking of the SHA-256 of the set of file paths.

Neither was I... ;-) I thought you would be hashing the attributes of a hardlink to create an index into the hashlist, which you could reference from the metadata. But yes, hashing the path list probably works, too. And with this I better understand how you'd implement deduplication of hardlink lists. OTOH, my idea could solve the problem in a single pass because you don't rely on the discovered paths. You could then, at the end, compute the SHA256 over the path list before storing the list in the archive.

Slightly offtopic:

I'm not sure if restic really "relies" on this. Surely, it uses content-based addressing to identify duplicate blocks, but I'm pretty sure it still compares the contents to be sure they are really identical, and then uses a generated index to store a reference to the block. Or does it really reference blocks exclusively by the SHA256 of the contents? I mean, yes, "collision resistant" but that doesn't mean "collision free".

For Git, which has a similar issue with collisions, collisions have already been used to craft strange or impossible repositories, tho they couldn't be used to attack the verification chain of the commits yet.

So at least I would expect restic to also keep a full SHA256 of the complete file (which is composed of individual SHA256-addressed blocks) to have another layer of verification that extracted content at least matches what was originally stored deduplicated into the archive. Otherwise, collisions go unnoticed on extraction.

BTW: Calculating the probability of a collision must use the birthday paradox: It is more likely that two hashes collide with a growing number of hashes than trying to craft a single colliding hash. This is probably what many calculations get wrong, but it's still very unlikely.

I found an issue about this: #1732

IOW, given the low amount of hashes that hardlink lists would create, it is more likely to find hash collisions in the block storage than the hardlink storage.

@haslersn

haslersn commented Mar 5, 2024

Calculating the probability of a collision must use the birthday paradox

I know

It is more likely that two hashes collide with a growing number of hashes than trying to craft a single colliding hash

At the end of the sentence you probably meant "to find a collision for a fixed hash". Then that is true, but for SHA-256, no hash collision is known at all.

If a weakness in the SHA-256 is found in the future, then there might be shortcuts to find a collision (e.g. by narrowing the search space). But as long as this is not the case, it is expected that a collision will never be found.

git is a different story, because it uses SHA-1, which has only 160 bits. It is much easier to produce 2^80 hashes than to produce 2^128 hashes. Furthermore, SHA-1 has known weaknesses which allow for more efficient collision attacks and have ultimately led to successful collision attacks. Such weaknesses are not known for SHA-256.

@MichaelEischer
Member

MichaelEischer commented Mar 9, 2024

but the classic inode/devid should not be stored in the repository - because you cannot ensure that new snapshots are based off the same IDs, and it is virtually impossible to maintain a proper, collision-free mapping from the original IDs to the current ones.

I agree that getting rid of the device / inodeID is probably the only way to 100% fix this issue while still handling hardlinks.

But that doesn't mean that we can't introduce a simpler temporary solution, especially once the feature flags have landed.

Finding a good solution for the metadata duplication issue here appears complex enough that we shouldn't rush it into a particular restic version.

In order that we don't need to read the whole set of sets into RAM, I propose to additionally store a SHA-256 reference into above-mentioned set of sets as part of every file's metadata (call it e.g. hardlink_list).

The list of paths referencing the same file instance is only available at the end of a backup. So, the final hash for each file instance would have to be injected in a second pass, which is a complexity disaster. The backup would have to iterate through the whole snapshot and modify the data it just wrote. To make matters worse, all sets of sets would still have to be kept in memory.

The alternative of only storing some id (instead of the SHA256) that references some kind of mutable set is even worse, as it would introduce mutable data into the repository format.

I only see two ways to store the set of sets:

  • either as a large set that is written when finalizing a snapshot, as suggested by @kakra
  • or distribute the set into the tree as follows: instead of deviceID+inodeID, there's a new field hardlink_target that stores the path of the first file instance of this hardlink set. That also has the benefit that for each deviceID+inodeID set only a single path has to be kept in memory (a sketch of this follows below).

The "large set" approach also has the problem that it will likely require some additional mechanism to ensure that the set can be (partially) deduplicated across snapshots; otherwise, we'd just partially recreate the current problem. The inline approach on the other hand may require changes in multiple places if the first file of a hardlink set gets removed. But except for that it is far, far simpler to implement. (no need for extra set datastructures in the repository, which also requires garbage collection changes; when only restoring a part of a backup, also only the relevant part of the hardlink sets have to be reconstructed)

Also noteworthy but not part of this problem: We probably still need options to tell which attribute change counts as a backup indicator: Somebody may want to backup only files which are newer, someone else may consider mtime bogus and just consider filesize as a change indicator, some filesystems may not even provide all the details (e.g. FAT has 2-seconds resolution for file times, I think).

Change detection is out of scope for this issue. Let's focus on hardlink detection here.

@kakra

kakra commented Mar 10, 2024

@MichaelEischer Thanks for considering the ideas.

The list of paths referencing the same file instance is only available at the end of a backup. So, the final hash for each file instance would have to be injected in a second pass, which is a complexity disaster. The backup would have to iterate through the whole snapshot and modify the data it just wrote. To make matters worse, all sets of sets would still have to be kept in memory.

That's why I suggested to just use an index counter into the sets, mapping (devid/inodeid/...) -> (ID, [paths...]). Once you encounter a file with nlinks > 1, look up devid/inode and maybe inodegen in the map; if it is there, record the additional path and reference the ID, otherwise add a new free ID and reference that. After the snapshot is completed, all IDs can be recorded (maybe by sorting the paths and storing SHA256 references so it can be deduplicated, I don't know the inner workings), but do not record the (devid/inodeid/...) part because that is unstable.

When reading a parent snapshot, let's first read this mapping and clean it from files with ncount < 2 (no hardlink, or not found), and recreate the (devid/inodeid/...) information by looking up the paths, remove conflicting files (as in devid/inode/...) from the set of paths, then remove all maps with only one or zero paths left. This can be done before scanning the file system, so the following changed-files scan can then re-use what was already discovered, and add new hardlinks, and it would still work in a single pass. The "rebuild" part of the devid/inodeid/... can be omitted if we do not refer to uniquely created IDs (and store the "large set" instead, see below).

Essentially, this is your idea of the hardlink_target but normalizing it into a unique ID so we don't end up with complicated house-keeping if the referenced path vanishes.

To deduplicate the large set, we could store a list of SHA256 hashes belonging to the snapshot, and then store each set of paths individually as such a SHA256 reference. As long as the paths per set are sorted before hashing, the IDs remain stable. To eliminate duplication as best as possible, this list of SHA256 hashes could be hashed into a single value, too, and then referenced by the snapshot.

But I think the inline idea probably works better - as you suggested (tho, I think it's a little more complicated to implement correctly). The "large set" idea still would record a full new list of SHA256 references in case of changes.

I just mentioned the "change detection" because we are going to remove devid/inode from the backup snapshots, and that will probably touch change detection - because that's actually why the whole issue was posted in the first place. So we need a new definition of what change detection means. Currently, a file is probably recorded into the snapshot if the inode/devid changed (among other hints). Is it recorded just because those IDs changed, or will it still be recorded because, e.g., the ctime changed?

@MichaelEischer
Copy link
Member

That's why I suggested to just use an index counter into the sets mapping (devid/inodeid/...) -> (ID, [paths...]).

That results in unstable IDs that are extremely dependent on whether a parent snapshot is used or not. The effort to maintain such a mapping is also rather significant.

This can be done before scanning the file system, so the following changed-files scan can then re-use what was already discovered, and add new hardlinks, and it would still work in a single pass.

Such a prescan is a second pass.

Essentially, this is your idea of the hardlink_target but normalizing it into a unique ID so we don't end up with complicated house-keeping if the referenced path vanishes.

There's not much housekeeping necessary here. The map is reconstructed at runtime, and if the reference path vanishes, then all other instances of that hardlink will simply store a new reference path. That way there's no need for housekeeping. This variant might be slightly inefficient in some cases, but is really simple as it does not have to pass on data between snapshots.
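
For illustration, the runtime-only bookkeeping for this inline variant could be as small as the following Go sketch; firstSeen and hardlinkTarget are invented names, and hardlink_target is the node field proposed in this discussion, not an existing one:

package main

import "fmt"

type fileKey struct{ dev, ino uint64 }

// firstSeen maps a file instance to the first path under which it was
// encountered during this backup run. It is rebuilt on every run and
// never stored in the repository.
var firstSeen = map[fileKey]string{}

// hardlinkTarget returns "" for the first occurrence of a file instance
// (stored as a regular node) and the reference path for every further
// occurrence (which would then only store hardlink_target).
func hardlinkTarget(dev, ino uint64, path string) string {
    key := fileKey{dev, ino}
    if ref, ok := firstSeen[key]; ok {
        return ref
    }
    firstSeen[key] = path
    return ""
}

func main() {
    fmt.Printf("%q\n", hardlinkTarget(1001, 2001, "/backup/root/file")) // ""
    fmt.Printf("%q\n", hardlinkTarget(1001, 2001, "/backup/root/link")) // "/backup/root/file"
}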

To deduplicate the large set, we could store a list of SHA256 hashes belonging to the snapshot, and then store each set of paths individually as such a SHA256 reference

I don't think that is a good idea. The repository format and in particular the repository index are not designed to accommodate large numbers of very small objects. Those can drastically increase the memory consumption of the repository index, which is already rather high.

But I think the inline idea probably works better - as you suggested (though I think it's a little more complicated to implement correctly)

I think the implementation effort and complexity is rather the other way around. The large set approach will likely add 1k+ lines of code (not counting tests), whereas the inline approach should just require a few hundred lines of code.

Currently, a file is probably recorded into the snapshot if the inode/devid changed (among other hints). Is it recorded just because those IDs changed, or will it still be recorded because, e.g., the ctime changed?

Please take a look at func fileChanged(fi os.FileInfo, node *restic.Node, ignoreFlags uint) bool. The devid is not used for change detection.
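
For readers following along, here is a deliberately simplified sketch of that kind of content-change check. It is not restic's actual fileChanged implementation (which also considers further attributes such as ctime); it only illustrates the mtime/size/inode comparison:

package sketch

import "os"

// parentNode is an illustrative stand-in for the per-file metadata kept
// in the parent snapshot.
type parentNode struct {
    MTimeUnix int64
    Size      int64
    Inode     uint64
}

// ignoreInode is an illustrative flag, not restic's constant.
const ignoreInode uint = 1

// contentChanged sketches the decision "does the file content need to be
// read again?": unchanged mtime and size (and inode, unless ignored)
// means the content from the parent snapshot can be reused.
func contentChanged(fi os.FileInfo, inode uint64, prev parentNode, flags uint) bool {
    if fi.ModTime().Unix() != prev.MTimeUnix || fi.Size() != prev.Size {
        return true
    }
    if flags&ignoreInode == 0 && inode != prev.Inode {
        return true
    }
    return false
}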

I just mentioned the "change detection" because we are going to remove devid/inode from the backup snapshots, and that will probably touch change detection - because that's actually why the whole issue has been posted in the first place.

Is the problem here really that inodes are changing? The data from #3041 (comment) suggests that the primary problem is the deviceID. By replacing devid+inode with some other mechanism to detect hardlinks, we no longer need both for hardlink detection. That allows removing the devid, whereas the inode is also used for change detection. If necessary, we could beef up the --ignore-inode flag to not store the inode at all.

I just mentioned the "change detection" because we are going to remove devid/inode from the backup snapshots, and that will probably touch change detection - because that's actually why the whole issue has been posted in the first place.

The metadata stored for a file is always regenerated from scratch no matter whether the file exists in the parent snapshot or not. That ensures that restic doesn't accidentally miss file metadata changes. The "change detection" only determines whether the file content is read again or not.

@aawsome
Copy link
Contributor

aawsome commented Mar 10, 2024

As I am getting lots of notify mails on this topic, I want to add my 2 ct (or better some comments which are hopefully useful)

About hard link identification

As a matter of fact, hard links are determined by looking at nlinks, inode and devid.
Maybe as an additional check, we may/should verify whether the contents of two files are also identical.
But altogether, there is no other way to find out if a given file is hard-linked to another file.
Moreover, while inode and devid are definitely not assured to be fixed, there are many cases where they are in fact fixed and work pretty well.

About the snapshots:

We must always take into account that snapshots (when writing snapshots I mean all the trees referenced by a snapshot) may be a subset of a device. We may have excluded parts which in fact contain hard links to parts we have included. To make things worse, we must take into account that a snapshot may be modified later: rewrite is already implemented and can remove parts from a snapshot, and if a merge command is ever added, things may be added to a snapshot as well.
In all these cases we should ensure that hard links are still handled correctly within the snapshot.

About restore:

We have a similar situation as during backup when restoring: We may just restore a part of a snapshot, and hard links within that part should still be restored correctly. But it gets even more complicated if we do an in-place restore where some files may actually already exist.
Moreover, we may want to add options for users to customize behavior here: e.g. some users may find it nice to have an option to hard-link all files with identical content - regardless of whether they have been hard-linked originally...

About change detection:

I think the main topic here is about change detection and not actually how to store hard link information in the snapshots. There are people with changing devid/inodes which don't want files to be re-read and metadata to be duplicated if there was no change. Now the re-reading of files can be adjusted in the parent detection. However, I don't see why - in the case we have stated that a file didn't change w.r.t. the parent - we would want to take the content from the parent without reading the file but not take the hard link information from the parent. IMO whatever we take to save information about hard links should be just copied over from the parent in the case of a match!

About some suggestions:

  • referencing the first file (from @MichaelEischer): while I see that this is an attempt to remove dependency on the maybe-changing attributes devid/inode, this IMO introduces other complexity, especially in the mentioned case where snapshots are modified later or also in the case of partial restores.
  • About an extra scan to find out hardlinks: I agree with @MichaelEischer that this would only increase complexity. I also don't see what could be gained but wouldn't be possible within a single scan.
  • About the idea of adding a new "unique number": As we agree that devid/inode is the unique way to identify hard links, this is just another artificial mapping.

My alternative proposal:

  • Keep saved information of devid/inode as it is within the snapshots.
  • Allow to ignore changes of those within the parent detection (--ignore-inode exists, I would add --ignore-devid)
  • In case of files matching a parent, copy over devid/inode from the parent, as sketched after this list. (Note that without --ignore-devid or --ignore-inode this will do exactly what is done right now)
  • Keep the restore algorithm as it is depending on inode/devid. If wanted, add more options for users to customize restoring.
  • I agree that using this procedure we may save "wrong" devids or inodes in snapshots which might be irritating. A unique ID generated from both would have served better. But as both are not meaningful anyway when they keep changing, and are not used for anything except hardlink restore, it doesn't hurt. And IMO the effort to change the repo format would be much too high.
  • Also note that this proposal is completely independent from the question whether we should remove devid/inode information for files with nlinks < 2.
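
A minimal sketch of the "copy over from the parent" step mentioned above, assuming an illustrative node type that merely stands in for restic's internal metadata:

package sketch

// node is an illustrative stand-in for the metadata restic stores per file.
type node struct {
    Path     string
    DeviceID uint64
    Inode    uint64
}

// keepParentIDs sketches the proposal: if change detection decided the
// file is unchanged, reuse the devid/inode recorded in the parent
// snapshot so that the resulting tree blob stays identical and
// deduplicates against the parent's.
func keepParentIDs(current *node, parent *node, unchanged bool) {
    if parent == nil || !unchanged {
        return // new or changed file: keep the freshly read values
    }
    current.DeviceID = parent.DeviceID
    current.Inode = parent.Inode
}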

@kakra
Copy link

kakra commented Mar 11, 2024

  • About the idea of adding a new "unique number": As we agree that devid/inode is the unique way to identify hard links, this is just another artificial mapping.

No, it's not unique in a general sense. This is the whole point here. It can be seen as mostly stable while the snapshot is taken. Actually, when taking snapshots of network file systems, inodegen should also be taken into account. But other than that, it serves no purpose to store it in the archive, because it won't tell you anything about the file identity in the archive - at least not when referencing it later from a child snapshot, or when crossing file-system boundaries while creating the snapshot (in that latter case, inode won't be unique by definition without devid).

But yes, for snapshot creation we can keep things as-is. The current implementation works for identifying hard-links.

If it is as easy as ignoring devid/inode for snapshot creation in the context of the parent, then go for it. But I wonder what happens if we later restore such a snapshot? The "missing" files are inherited from the parent, thus we are going to see potentially incompatible devid/inode, and now, how do we reliably (and correctly) restore hardlinks? IMO, we need some way to identify hardlinks by a compatible identifier across snapshot inheritance.

Besides that, I really appreciate a KISS approach to the problem, thanks for your insights @aawsome.

I think these thoughts of yours are really important:

We have a similar situation as during backup when restoring: We may just restore a part of a snapshot, and hard links within that part should still be restored correctly. But it gets even more complicated if we do an in-place restore where some files may actually already exist.

By "in-place" you mean restoring a file into the contents of an existing file, possibly truncating it, without replace or unlink/rename? I think this is always the wrong way of doing it, so if some user wants to do it, they possibly shoot their own foot. This can easily be "fixed" by documenting the pitfalls. Or do you mean just restoring into an existing set of files, possibly correctly replacing existing files, instead of restoring into a completely empty directory?

Moreover, we may want to add options for users to customize behavior here: e.g. some users may find it nice to have an option to hard-link all files with identical content - regardless of whether they have been hard-linked originally...

While this may be a tempting idea, it is almost always a very bad idea - unless you have a very specific set of files (e.g., an NNTP spool). Maybe this should be left to external tools which create checksums of existing files and then transform identical contents into hardlinks. restic should probably not try to mimic such behavior.

@MichaelEischer
Copy link
Member

To make things worse, we must take into account that a snapshot may be modified later: rewrite is already implemented and can remove parts from a snapshot, and if a merge command is ever added, things may be added to a snapshot as well.

Is it actually a good idea to have spontaneously forming hardlinks when merging two snapshots? I'd rather not hardlink files than risk accidentally hardlinking the wrong files.

  • Allow to ignore changes of those within the parent detection (--ignore-inode exists, I would add --ignore-devid)

I'm not sure where that claim originated and why it continues to be repeated in this issue, but the change detection never checked the deviceID. (the tree blob deduplication is obviously by construction sensitive to deviceID changes). Based on the description that option should rather be called --keep-devid, as it would copy the existing deviceID/inode to the new node if the file appears to be unchanged. That happens strictly after the change detection was executed.

  • I agree that using this procedure we may save "wrong" devids or inodes in snapshots which might be irritating.

Having the "wrong" deviceID/inode would not be a problem in itself. However, based on how the deviceIDs are assigned by Linux (see https://www.kernel.org/doc/Documentation/admin-guide/devices.txt) this will very, very likely result in aliasing between different deviceIDs. As a result it wouldn't be surprising to see the same inode+deviceID in a snapshot refer to two completely unrelated files. That's just asking for trouble (it's only a little bit better than always omitting the deviceID).

As a bonus backup --keep-devid dataset and backup --force dataset can now legitimately create different snapshots, which doesn't help either.

But I wonder what happens if we later restore such a snapshot? The "missing" files are inherited from the parent, thus we are going to see potentially incompatible devid/inode, and now, how do we reliably (and correctly) restore hardlinks? IMO, we need some way to identify hardlinks by a compatible identifier across snapshot inheritance.

Simple, it's impossible to reliably restore hardlinks using that construction.

"Snapshot inheritance" is a concept that does not exist in restic. All tree metadata in a snapshot is always generated from scratch and if it is identical to an existing tree blob in the repository, then it can be deduplicated. The change detection in the backup command is just used to skip reading unchanged files. And while copying the deviceID of unchanged files over to the new snapshot would indeed be trivial to implement, it's a bad idea IMHO as it can introduce spurious hardlinks.


@tavianator
Copy link

I think I have a reasonable solution to this issue. If people agree with the approach, I'm happy to try to implement it myself.

Quick recap: it's possible for st_dev to change between backups. We'd like to ignore such changes, while still using (st_dev, st_ino) for hard-link detection. Imagine this scenario:

First backup:

st_dev | st_ino | path
-------+--------+-----
1000   | 2000   | /backup
1001   | 2000   | /backup/root
1001   | 2001   | /backup/root/file
1001   | 2001   | /backup/root/link
1002   | 2000   | /backup/home
1002   | 2001   | /backup/home/file
1002   | 2002   | /backup/home/other

Second backup:

st_dev | st_ino | path
-------+--------+-----
3000   | 2000   | /backup
3001   | 2000   | /backup/root
3001   | 2001   | /backup/root/file
3001   | 2001   | /backup/root/link
3001   | 2002   | /backup/root/other
1001   | 2000   | /backup/home
1001   | 2001   | /backup/home/file
1001   | 2002   | /backup/home/other
1001   | 2001   | /backup/home/link
1002   | 2000   | /backup/new

My idea is: whenever we see a new st_dev, check the existing backup for the same path. If it contains that path, map the new devid to the old one. Otherwise, map the new devid to a unique devid that's not part of the existing backup. For the above example:

st_dev       | st_ino | path
-------------+--------+-----
3000 => 1000 | 2000   | /backup
3001 => 1001 | 2000   | /backup/root
3001 => 1001 | 2001   | /backup/root/file
3001 => 1001 | 2001   | /backup/root/link
3001 => 1001 | 2002   | /backup/root/other
1001 => 1002 | 2000   | /backup/home
1001 => 1002 | 2001   | /backup/home/file
1001 => 1002 | 2002   | /backup/home/other
1001 => 1002 | 2001   | /backup/home/link
1002 => 1003 | 2000   | /backup/new

Hard-link detection still works, and the new backup can re-use as much of the old backup's trees as possible.

@jclulow
Copy link

jclulow commented May 2, 2024

How are you fabricating the new devid such that it will not accidentally conflict with a real devid in a subsequent backup?

@tavianator
Copy link

How are you fabricating the new devid such that it will not accidentally conflict with a real devid in a subsequent backup?

The idea is that subsequent backups would see that the devid is already used by the backup and map it to something else. Something like

// already mapped this real device id during the current run?
storedDev, ok = devMap[realDev]
if !ok {
    // not yet: reuse the devid that the existing backup stored for this path
    storedDev, ok = getDevFromExistingBackup(path)
    if !ok {
        // path not present in the existing backup: fabricate a fresh, unused devid
        storedDev = generateUniqueDev(devMap)
    }
    devMap[realDev] = storedDev
}

@jclulow
Copy link

jclulow commented May 2, 2024

Yeah the idea makes sense to me, I'm just curious about how generateUniqueDev() will avoid potential collisions with real dev IDs that might get used later on the system being backed up. Assuming you are in fact going to store the fabricated dev ID, that is, in the backup?

@tavianator
Copy link

It doesn't have to. Notice that the real device ID isn't used any more (except as a map key).

@MichaelEischer
Copy link
Member

I can't shake the feeling that we've already discussed something similar (dynamically generating a devId map) somewhere above (although not all parts of the id assignment scheme below). But I don't have time to read this novel (aka. this issue) again right now; that will have to wait until restic 0.17.0 is done.

generateUniqueDev(devMap)

If that method sequentially assigns devIds as in the example above, then it's very easy to cause a new mountpoint to renumber all devIds in a new snapshot. Just add a new mountpoint that is backed up before existing mountpoints.

That is, the generated id should yield as few collisions as possible, either by picking completely random IDs or by somehow deriving them from the filepath. The latter variant has the benefit that it would in most cases also result in stable devIds even if no parent snapshot was detected.

As pseudocode the filepath-derived variant would look something like the following.

func generateUniqueDev(devMap, realDev, path):
  id := hash(path)
  for {
    if mappedIdInMap(devMap, id) {
      # use randomId in case of collisions
      id = randomId()
    } else {
      # record the mapping and hand the generated id back to the caller
      devMap[realDev] = id
      return id
    }
  }

That has a few nice properties:

  • deriving the devId from the path likely yields the same mappedIds for a fixed file structure, even without knowing the parent snapshot
  • even more stable if parent snapshot is known
  • additional mountpoints for new devices very likely won't affect the ids for existing devices. That property only exists for the hash/random based implementation of generateUniqueDev, not when generateUniqueDev sequentially hands out device ids.

Downsides:

  • small risk of random collisions and thus devId changes (only hash/random-based devIds)
  • small risk of conflicts between devIds from the parent snapshot and devIds already assigned during the current backup run
  • moving a mountpoint to a new path results in changed devIds
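
For concreteness, here is a runnable Go rendition of the filepath-derived variant sketched above; devMapper and its fields are invented for this sketch and are not restic's API:

package main

import (
    "fmt"
    "hash/fnv"
    "math/rand"
)

// devMapper remaps real device IDs, as seen during one backup run, to the
// IDs that would actually be stored in the snapshot.
type devMapper struct {
    devMap  map[uint64]uint64 // real devId -> stored devId
    usedIDs map[uint64]bool   // stored devIds that are already taken
}

// generateUniqueDev derives a stored devId from the mountpoint path and
// falls back to random IDs on collision, as in the pseudocode above.
func (m *devMapper) generateUniqueDev(realDev uint64, path string) uint64 {
    h := fnv.New64a()
    h.Write([]byte(path))
    id := h.Sum64()
    for m.usedIDs[id] {
        id = rand.Uint64() // collision: pick a random id instead
    }
    m.devMap[realDev] = id
    m.usedIDs[id] = true
    return id
}

func main() {
    m := &devMapper{devMap: map[uint64]uint64{}, usedIDs: map[uint64]bool{}}
    fmt.Println(m.generateUniqueDev(3000, "/backup"))
    fmt.Println(m.generateUniqueDev(3001, "/backup/root"))
}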

The hardlink_target based approach (see the discussion somewhere above) doesn't have those collision/conflict risks. Renaming a mountpoint could also be handled by storing relative paths, not absolute ones (although that might complicate the implementation a bit). However, that approach requires a format change along with quite a few further changes, whereas dynamically remapping the devId doesn't. So, we might just give the dynamic devId remapping a try (initially behind the device-id-for-hardlinks feature flag), and if it doesn't work out, we can still switch gears and use the hardlink_target-based approach.

@stackcoder
Copy link

For all who are looking for a working solution right now, without the need to apply patches:
I created a simple bash-based workaround, restic-zfs.sh, which uses mount (within a private namespace) instead of accessing .zfs/snapshot/.

It still detects the parent directories as changed; not perfect, but much better than metadata changes on absolutely everything.
