Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

restic 0.9.6 with new ctime behaviour causes all files to show as changed #2495

Closed
MarkMielke opened this issue Dec 2, 2019 · 13 comments · Fixed by #2823
Closed

restic 0.9.6 with new ctime behaviour causes all files to show as changed #2495

MarkMielke opened this issue Dec 2, 2019 · 13 comments · Fixed by #2823
Labels
category: backup state: need implementing cause/request established, need work/solution type: feature suggestion suggesting a new feature

Comments

@MarkMielke
Copy link

Github Enterprise backup utilities uses "rsync --link-dest" to create "Copy-on-Write snapshots". Basically they clone the Git repositories as a hard link tree, and then make updates to the tree, minimizing the disk storage requirements.

restic 0.9.6 with new ctime behaviour is no longer able to correctly detect which files have changed, because creating a hard link causes the ctime to be updated for every single file in the tree. To see this for yourself, try something like this in Unix:

### Create the file. The ctime is the time the file was created, or the metadata was last updated.
$ touch abc123
$ ls -l --time=ctime --full-time abc123
-rw-rw-r-- 1 mark mark 0 2019-12-02 18:15:57.649333848 -0500 abc123

### Create a hard link. No changes are being made to the file.
$ ln abc123 abc456
$ ls -l --time=ctime --full-time abc123 abc456
-rw-rw-r-- 2 mark mark 0 2019-12-02 18:16:08.138562962 -0500 abc123
-rw-rw-r-- 2 mark mark 0 2019-12-02 18:16:08.138562962 -0500 abc456

### Notice how the ctime was updated, and reflected on both files!

### Create another hard link.
$ ln abc123 abc789                            
$ ls -l --time=ctime --full-time abc123 abc456 abc789
-rw-rw-r-- 3 mark mark 0 2019-12-02 18:16:28.058998073 -0500 abc123
-rw-rw-r-- 3 mark mark 0 2019-12-02 18:16:28.058998073 -0500 abc456
-rw-rw-r-- 3 mark mark 0 2019-12-02 18:16:28.058998073 -0500 abc789

### Note how the ctime was updated, and relfected on all three files!

"ctime" represents the time that the inode was created, or changed. I'm not following the reasoning why "ctime" is considered the right choice for restic. I think this change should be reverted, or at least made optional and default to off. There are good reasons why rsync does not use ctime after all of these years. This behaviour is really surprising, and basically means we will either have to accept that restic considers all files different every time, re-write the Github Enterprise backup utilities, or locally customize Restic to stop doing this.

@smlx
Copy link
Contributor

smlx commented Dec 3, 2019

restic supports restoring hardlinks, so in fact the file has changed, and restic's behaviour is correct. the contents haven't changed, but that's not the only thing that's relevant.

you can use --ignore-inode to make restic ignore hard links, and treat each file separately.

@dimejo
Copy link
Contributor

dimejo commented Dec 3, 2019

"ctime" represents the time that the inode was created, or changed. I'm not following the reasoning why "ctime" is considered the right choice for restic.

#2179 explains why this change was necessary. Personally, I prefer my backup program cheking everything than potentially missing something important. This might cause a backup to run longer than necessary but deduplication should still work.

I think this change should be reverted, or at least made optional and default to off.

An optional mtime flag might be useful in your (and other) usecases but I'm wondering how common this usecase is.

@smlx
Copy link
Contributor

smlx commented Dec 3, 2019

An optional mtime flag might be useful in your (and other) usecases but I'm wondering how common this usecase is.

That's the effect of --ignore-inode. See here.

@MarkMielke
Copy link
Author

MarkMielke commented Dec 3, 2019

For:

restic supports restoring hardlinks, so in fact the file has changed, and restic's behaviour is correct. the contents haven't changed, but that's not the only thing that's relevant.

I think the new behaviour is a pretty pedantic interpretation. In particular, there could be hardlinks outside the scope of the content being backed up, and these probably should not be restored, and the act of taking a hard link should not always require a re-evaluation of the content of the file. An inode change does not necessarily imply a content change, and in cases such as ours - every file is being considered different every time, which is not a good result.

However, as long as there is an option to turn it off, my concern is resolved. I read in the original change the following comment: "If this change causes problems for you, please open an issue, and we can look into adding a seperate flag to disable just the ctime check.", so that is what I am doing.

But, if you already have such a flag, then perhaps the comment can be removed? :-)

We will try it out, and confirm. Also, it looks like that ignoreInode ignores both ctime changes (good!), and inode number changes (not sure if good!). However, in our particular use case, I think if the inode number changed, it also doesn't necessarily indicate a difference. We only really care about mtime, and the size is typically a cheap additional check which also indicates a definite change in content. The reason here, is that the Github Enterprise Backup utilities are written by a third party (Github!), and without re-writing their scripts entirely, we have limited control over what they do. Today, they use "rsync --link-dest" to create a hard link tree. tomorrow, they might allow some other mechanism, such as "cp --reflink". In all cases, the only guarantee they provide is that once the Github Enterprise backup is complete, that we have a directory that we can backup using a tool such as Restic. We don't know how they built the tree (unless we look under the covers). So, "--ignore-inode" works for this use case, I believe.

Ignoring my comment about pedantic... do you agree with the rest of the above? Thanks.

@MarkMielke
Copy link
Author

For context, this is what it looked like to us... :-)

Output of Mon Dec 1st 2:38 restic backup using restic 0.9.5:
Files:         247 new,    15 changed, 241152 unmodified

Output of Mon Dec 2nd 15:38 restic backup using restic 0.9.6:
Files:         345 new, 241369 changed,     0 unmodified

@fd0
Copy link
Member

fd0 commented Dec 3, 2019

Hey, thanks for raising this issue. It is important to collect such corner cases. Just to confirm:

  • You see that using rsync with hardlinks makes restic consider all files as changed
  • so the backup takes a lot longer because restic re-reads all files

Is that correct? What's your concern exactly? Backup time or repository size or both?

Please be aware that restic's deduplication will take care of the duplicate data, so the repository size is not an issue here.

So, in general I think restic is behaving correctly. If in doubt, it will re-read the file. That's a sane default in my opinion. We have optimizations for subsequent backups and restic prior to 0.9.6 used the mtime, but it turned out that there are programs which change the content but keep the mtime, so that could make restic miss data. That's bad, so we changed it to use ctime because it cannot be changed by programs (in contrast to mtime).

Now we have a case where using ctime makes restic consider the file as changed but it has not. Since restic cannot know that in this case it should use the mtime and not ctime, I think we should add an option like --use-mtime.

I'll consider this as a feature request and not a bug because, in my opinion, restic does the right thing: if in doubt, re-read the file. Backups may take longer in this situation, but losing data because restic decides a modified file hasn't changed is worse. :)

Thoughts?

@fd0 fd0 added the type: feature suggestion suggesting a new feature label Dec 3, 2019
@MarkMielke
Copy link
Author

The original concern is backup time, and more fundamentally, the work required to take the backup. It doubles the time to backup today with 0.9.6 vs 0.9.5, when all files are considered different. I'm sure this varies based upon the type of content. I knew about the de-duplication, and that is what changes the current behaviour from disastrous into just problematic.

The followup concern from me, would be the ability to report on the changes. I wonder what impact this would have on reporting deltas between snapshots and such? From our perspective, the files did not change. Only Restic is now taking twice as long, and doing significantly more work to figure this out, which makes Restic less appealing to use as a backup tool for our use case. From a practical standpoint, if Restic didn't have a flag to stop being so pedantic - I would either customize Restic (and propose a Pull Request, even if it was rejected) or choose a different tool.

I would like a --use-mtime as per the original comment in the Pull Request, and as per what you are describing above. You might also add a --use-ctime to allow it to be explicit either way, even if you choose to use one way by default?

@rawtaz
Copy link
Contributor

rawtaz commented Dec 3, 2019

Let's be really picky about how we name flags. It's very common that people suggest adding this and that flag, and if we're not careful, we'll end up with an inconsistent and big mess of flags for every little thing.

Perhaps a flag named something like --change-detect that can take one of the values ctime(which would be the default) and mtime is better than multiple single-value flags. To make restic use mtime one would add --change-detect=mtime.

@smlx
Copy link
Contributor

smlx commented Dec 3, 2019

I think that a --use-mtime or --change-detect=mtime flag would duplicate the functionality of --ignore-inode. Maybe --ignore-inode should have its documentation expanded to explicitly mention ctime?

A file changing inode but not contents is extremely common. e.g. saving a file in vim results in a changed inode because vim does a write to a temporary file and rename into place. This is common behaviour in many *nix utils because it minimises data loss if something goes wrong with the write.

In @MarkMielke's use-case they want to only use the mtime and file size heuristics (ignoring inode and ctime completely). That matches --ignore-inode functionality precisely, doesn't it?

@MarkMielke
Copy link
Author

Adjustment to my time comment... it looks like with --ignore-inode it is now taking about 1m44s to complete the backup snapshot, whereas without it, it was taking between 6m04s and 8m55s, with the first run after 0.9.6 upgrade taking 15m27s. So, more than double.

For @smlx 's comment, it's correct that for my use case right now that I am reporting, I think --ignore-inode will be what we want, and if I had known about it, I might not have raised this issue in the first place.

I do agree with the proliferation of top-level flags being something worthy of discouraging. The --change-detect flag looks smart to me, and I think it would be very clever if it was a comma separated list of things to check, like: --change-detect=ctime,mtime,size,mode,owner,group,link-count,inode-number,first-block,... that would basically allow exact definition of what you need that would not change from Restic version to Restic version. This might become incredibly powerful. However, it's also true that it might receive very low use compared to use --ignore-inode which should suffice for most edge cases, and --change-detect=mtime for a subset of edge cases. This is an important area of concern for "backup", in that the definition of how to detect changes is something that many people will want somebody else to figure out for them (which leads to the most conservative checks becoming default, to avoid making even a single mistake, even an obscure one!) vs having an efficient backup process, based upon intelligent knowledge of how the files were created in the first place.

@aldem
Copy link

aldem commented Dec 7, 2019

it turned out that there are programs which change the content but keep the mtime, so that could make restic miss data

Actually, this is relatively minor issue, most important is that metadata-only changes (ownership, permissions, ACLs and any attributes in general) update only ctime, thus without checking ctime all such changes will be ignored.

What could be done to optimize things a bit is to lookup cache/archive for files with same dev/inode numbers (hard links all have same inode number), and if we do have those files already then we could avoid copying of content entirely. Even if archive size is not an issue, backup time and IO costs are quite high, especially for big files.

@cmeyer23
Copy link

The ctime thing has screwed up my backup system as well. I was using hardlinks to create local "snapshots" of a large directory, and using restic to perform a remote backup to B2. Essentially this:

rm -rf DIR.03
mv DIR.02 DIR.03
mv DIR.01 DIR.02
cp -al DIR DIR.01
restic backup DIR

Since ctime is updated when creating a new hardlink, restic now marks all files as changed and spends many hours creating new index files. I see that --ignore-inode will effectively ignore the ctime change, but does it have any other effect?

@rawtaz
Copy link
Contributor

rawtaz commented Jan 18, 2020

@cmeyer23 You can see the actual code changes for --ignore-inode at https://github.com/restic/restic/pull/2205/files#diff-3d12e595331d9b4d5272a3350fa8c6e5 - as you can see it only affects whether or not inode changes are used for detecting changes to files. If you don't need that, it'll be safe to use this option when backing up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: backup state: need implementing cause/request established, need work/solution type: feature suggestion suggesting a new feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants