restic 0.9.6 with new ctime behaviour causes all files to show as changed #2495
Comments
restic supports restoring hardlinks, so in fact the file has changed, and restic's behaviour is correct. the contents haven't changed, but that's not the only thing that's relevant. you can use |
#2179 explains why this change was necessary. Personally, I prefer my backup program checking everything rather than potentially missing something important. This might cause a backup to run longer than necessary, but deduplication should still work.
An optional |
That's the effect of |
For:
I think the new behaviour is a pretty pedantic interpretation. In particular, there could be hardlinks outside the scope of the content being backed up, and these probably should not be restored, and the act of taking a hard link should not always require a re-evaluation of the content of the file. An inode change does not necessarily imply a content change, and in cases such as ours, every file is considered different every time, which is not a good result.

However, as long as there is an option to turn it off, my concern is resolved. I read the following comment in the original change: "If this change causes problems for you, please open an issue, and we can look into adding a seperate flag to disable just the ctime check.", so that is what I am doing. But, if you already have such a flag, then perhaps the comment can be removed? :-) We will try it out, and confirm.

Also, it looks like ignoreInode ignores both ctime changes (good!) and inode number changes (not sure if good!). However, in our particular use case, I think if the inode number changed, it also doesn't necessarily indicate a difference. We only really care about mtime, and the size is typically a cheap additional check which also indicates a definite change in content.

The reason here is that the Github Enterprise Backup utilities are written by a third party (Github!), and without re-writing their scripts entirely, we have limited control over what they do. Today, they use "rsync --link-dest" to create a hard link tree. Tomorrow, they might allow some other mechanism, such as "cp --reflink". In all cases, the only guarantee they provide is that once the Github Enterprise backup is complete, we have a directory that we can back up using a tool such as Restic. We don't know how they built the tree (unless we look under the covers). So, "--ignore-inode" works for this use case, I believe.

Ignoring my comment about pedantic... do you agree with the rest of the above? Thanks.
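To make the heuristic under discussion concrete, here is a minimal Python sketch of a "does this file need re-reading?" check. It is an illustration only and not restic's code: `FileMeta`, `needs_reread` and the `ignore_inode` parameter are hypothetical names, and the rule that skipping the inode check also skips the ctime check is taken from the description in this comment.

```python
from typing import NamedTuple


class FileMeta(NamedTuple):
    """Metadata remembered from the previous backup (hypothetical record)."""
    size: int
    mtime_ns: int
    ctime_ns: int
    inode: int


def needs_reread(prev: FileMeta, cur: FileMeta, ignore_inode: bool = False) -> bool:
    """Return True if the file content should be read again."""
    # mtime and size are always compared.
    if prev.mtime_ns != cur.mtime_ns or prev.size != cur.size:
        return True
    if ignore_inode:
        # Assumption from the comment above: ignoring the inode
        # also ignores ctime.
        return False
    return prev.inode != cur.inode or prev.ctime_ns != cur.ctime_ns


before = FileMeta(size=10, mtime_ns=1_000, ctime_ns=1_000, inode=42)
after_link = before._replace(ctime_ns=2_000)  # hard link created: only ctime moved

print(needs_reread(before, after_link))                     # True
print(needs_reread(before, after_link, ignore_inode=True))  # False
```

With the strict check, a ctime bump alone forces a re-read; with `ignore_inode=True`, only mtime and size decide.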
For context, this is what it looked like to us... :-)
Hey, thanks for raising this issue. It is important to collect such corner cases. Just to confirm:
Is that correct? What's your concern exactly? Backup time or repository size or both? Please be aware that restic's deduplication will take care of the duplicate data, so the repository size is not an issue here. So, in general I think restic is behaving correctly. If in doubt, it will re-read the file. That's a sane default in my opinion. We have optimizations for subsequent backups and restic prior to 0.9.6 used the Now we have a case where using I'll consider this as a feature request and not a bug because, in my opinion, restic does the right thing: if in doubt, re-read the file. Backups may take longer in this situation, but losing data because restic decides a modified file hasn't changed is worse. :) Thoughts? |
The original concern is backup time, and more fundamentally, the work required to take the backup. It doubles the time to backup today with 0.9.6 vs 0.9.5, when all files are considered different. I'm sure this varies based upon the type of content. I knew about the de-duplication, and that is what changes the current behaviour from disastrous into just problematic. The followup concern from me, would be the ability to report on the changes. I wonder what impact this would have on reporting deltas between snapshots and such? From our perspective, the files did not change. Only Restic is now taking twice as long, and doing significantly more work to figure this out, which makes Restic less appealing to use as a backup tool for our use case. From a practical standpoint, if Restic didn't have a flag to stop being so pedantic - I would either customize Restic (and propose a Pull Request, even if it was rejected) or choose a different tool. I would like a --use-mtime as per the original comment in the Pull Request, and as per what you are describing above. You might also add a --use-ctime to allow it to be explicit either way, even if you choose to use one way by default? |
Let's be really picky about how we name flags. It's very common that people suggest adding this and that flag, and if we're not careful, we'll end up with an inconsistent and big mess of flags for every little thing. Perhaps a flag named something like |
I think that a A file changing inode but not contents is extremely common. e.g. saving a file in vim results in a changed inode because vim does a write to a temporary file and rename into place. This is common behaviour in many *nix utils because it minimises data loss if something goes wrong with the write. In @MarkMielke's use-case they want to only use the mtime and file size heuristics (ignoring inode and ctime completely). That matches |
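The write-to-temporary-file-then-rename pattern mentioned here can be sketched in Python as follows. This is a minimal illustration, assuming a POSIX filesystem; `atomic_save` is a hypothetical helper, not the code of any particular editor. It shows the inode changing while the bytes stay identical.

```python
import os
import tempfile


def atomic_save(path: str, data: bytes) -> None:
    """Write data to a temporary file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, path)  # atomic rename within one filesystem


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "notes.txt")
    with open(path, "wb") as f:
        f.write(b"hello\n")
    inode_before = os.stat(path).st_ino

    atomic_save(path, b"hello\n")  # same bytes, but a brand-new file
    inode_after = os.stat(path).st_ino
    with open(path, "rb") as f:
        content_after = f.read()

print("inode changed:", inode_before != inode_after)
print("content unchanged:", content_after == b"hello\n")
```

Because the temporary file exists alongside the original until the rename, it necessarily has a different inode number.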
Adjustment to my time comment... it looks like with For @smlx 's comment, it's correct that for my use case right now that I am reporting, I think I do agree with the proliferation of top-level flags being something worthy of discouraging. The |
Actually, this is a relatively minor issue. Most important is that metadata-only changes (ownership, permissions, ACLs and any attributes in general) update only ctime, so without checking ctime all such changes will be ignored. What could be done to optimize things a bit is to look up the cache/archive for files with the same dev/inode numbers (hard links all have the same inode number), and if we already have those files, we could avoid copying the content entirely. Even if archive size is not an issue, backup time and IO costs are quite high, especially for big files.
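A rough sketch of the dev/inode lookup idea, in Python: group paths by `(st_dev, st_ino)` so that content shared between hard links only needs to be read once. `group_hard_links` is a hypothetical helper written for illustration; it says nothing about how restic's cache is actually organised.

```python
import os
import tempfile
from collections import defaultdict
from typing import Dict, List, Tuple


def group_hard_links(root: str) -> Dict[Tuple[int, int], List[str]]:
    """Map (device, inode) to every path under root that refers to it."""
    groups: Dict[Tuple[int, int], List[str]] = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            groups[(st.st_dev, st.st_ino)].append(path)
    return groups


with tempfile.TemporaryDirectory() as root:
    a = os.path.join(root, "a.txt")
    with open(a, "w") as f:
        f.write("shared content\n")
    os.link(a, os.path.join(root, "b.txt"))  # second name, same inode
    with open(os.path.join(root, "c.txt"), "w") as f:
        f.write("separate file\n")

    groups = group_hard_links(root)
    link_counts = sorted(len(paths) for paths in groups.values())

print(link_counts)  # [1, 2]
```

Any group with more than one path is a set of hard links whose content would only have to be processed once.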
The ctime thing has screwed up my backup system as well. I was using hardlinks to create local "snapshots" of a large directory, and using restic to perform a remote backup to B2. Essentially this:
Since ctime is updated when creating a new hardlink, restic now marks all files as changed and spends many hours creating new index files. I see that |
@cmeyer23 You can see the actual code changes for |
The Github Enterprise backup utilities use "rsync --link-dest" to create "Copy-on-Write snapshots". Basically, they clone the Git repositories as a hard link tree and then make updates to the tree, minimizing the disk storage requirements.
restic 0.9.6 with new ctime behaviour is no longer able to correctly detect which files have changed, because creating a hard link causes the ctime to be updated for every single file in the tree. To see this for yourself, try something like this in Unix:
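The effect can be reproduced with a short Python sketch. This is an illustration written for this thread, not the commands from the original report, and it assumes a POSIX filesystem that supports hard links. It creates a file, adds a hard link to it, and compares the `stat` results before and after.

```python
import os
import tempfile
import time

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "file.txt")
    with open(original, "w") as f:
        f.write("unchanged content\n")
    before = os.stat(original)

    # Wait so the new ctime is distinguishable even on filesystems
    # with coarse (one-second) timestamp granularity.
    time.sleep(1.1)
    os.link(original, os.path.join(d, "hardlink.txt"))
    after = os.stat(original)

print("mtime changed:", before.st_mtime_ns != after.st_mtime_ns)
print("ctime changed:", before.st_ctime_ns != after.st_ctime_ns)
print("same inode:   ", before.st_ino == after.st_ino)
print("link count:   ", before.st_nlink, "->", after.st_nlink)
```

On typical Linux filesystems the content, mtime and inode number stay the same, while ctime and the link count change, which is the situation described above.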
"ctime" represents the time that the inode was created, or changed. I'm not following the reasoning why "ctime" is considered the right choice for restic. I think this change should be reverted, or at least made optional and default to off. There are good reasons why rsync does not use ctime after all of these years. This behaviour is really surprising, and basically means we will either have to accept that restic considers all files different every time, re-write the Github Enterprise backup utilities, or locally customize Restic to stop doing this.