Attempt to better handle incomplete transfers #13
dry-run doesn't work currently, as some operations need real output from real commands. In some cases, it's always safe to actually execute the command (e.g. `ls`), so add a "dry_safe" parameter to check_call which can allow these commands to execute even in dry-run mode.
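A minimal sketch of how a "dry_safe" flag could behave, assuming a wrapper class with a `dry_run` attribute. The class name `Operations`, the `executed` list, and the return value are hypothetical and only illustrate the idea; the real `check_call` in snapbtrex may look different.

```python
import subprocess


class Operations:
    """Hypothetical stand-in for the object that owns check_call."""

    def __init__(self, dry_run=False):
        self.dry_run = dry_run
        self.executed = []  # commands that were really run

    def check_call(self, args, shell=False, dry_safe=False):
        # In dry-run mode, skip the command unless the caller marked it
        # as safe to run for real (read-only commands such as `ls`).
        if self.dry_run and not dry_safe:
            print("dry-run, skipping:", args)
            return None
        self.executed.append(args)
        return subprocess.check_output(args, shell=shell)
```

With this shape, callers that need real output in dry-run mode pass `dry_safe=True`, and everything that modifies state keeps the default and is skipped.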
If a btrfs send | receive is interrupted, then an incomplete snapshot can be left on the remote side. This causes two problems:

1) It cannot be used as a parent for any future receives
2) A subsequent snapbtrex clean operation will think that the incomplete snapshot is the most recent valid one, and may delete the most recent actually valid parent - leading to a situation where no valid parent is available on the sender and receiver for incremental backups.

Fixing this is somewhat tricky, as there's no sure-fire way to identify the incomplete snapshots. However, _if_ a snapshot is "received", then it will have a received_uuid, and if it's incomplete, this will be '-'.

This commit attempts to alleviate the problem in a few ways:

- When starting a transfer, the name of the snapshot is stored into a ".snapbtrex_incomplete" file on the remote side. This is used as an indicator that a snapshot with no received_uuid really is incomplete.
- When cleaning a directory, if a ".snapbtrex_incomplete" file is present and contains the name of a snapshot, then abort the clean operation. Manual intervention will be required, but incremental sends won't be inadvertently broken.
- When sending snaps, when looking for parents, if the parent doesn't have a valid received_uuid then try two things:
  - If .snapbtrex_incomplete is present, and contains the name of that snapshot, we can be sure that it previously failed to complete, and delete the broken snapshot. It will be re-transferred.
  - If .snapbtrex_incomplete is not present or doesn't contain that snapshot, then simply ignore the "invalid" parent. This should allow us to transfer the snapshot with an older parent.

This isn't perfect, but it should prevent failures more often than before.

All of the functionality is behind a "--handle-incomplete" option, to preserve the old behaviour and not break anyone's current workflow.
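The decision rules in this commit message can be sketched roughly as below. This is an illustration under assumptions, not the code in the PR: the function names are hypothetical, the marker file is assumed to hold whitespace-separated snapshot names, and `parse_received_uuid` assumes the text output of `btrfs subvolume show` contains a `Received UUID:` line.

```python
def parse_received_uuid(show_output):
    """Extract the received UUID from `btrfs subvolume show` text output.

    Assumes a line of the form 'Received UUID: <value>'; returns '-'
    when no such line is found.
    """
    for line in show_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Received UUID":
            return value.strip()
    return "-"


def classify_parent(name, received_uuid, marker_contents):
    """Decide what to do with a candidate parent on the receiver.

    marker_contents is the text of .snapbtrex_incomplete, or None if
    the file does not exist. Returns 'use', 'delete', or 'ignore'.
    """
    if received_uuid != "-":
        return "use"  # fully received, valid parent
    if marker_contents is not None and name in marker_contents.split():
        return "delete"  # known to be an interrupted transfer
    return "ignore"  # unknown state: fall back to an older parent


def clean_allowed(marker_contents):
    # Cleaning is refused while the marker names any snapshot.
    return marker_contents is None or marker_contents.strip() == ""
```

Keeping the decision in a pure function like `classify_parent` separates it from the ssh and btrfs calls, which makes the three outcomes easy to check in isolation.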
P.S. I'll submit a new PR with a README change describing how it works/doesn't work if the approach looks reasonable.
# Remove incomplete marker
args = ["ssh -p " + ssh_port + " " + receiver + " 'rm " + os.path.join(receiver_path, ".snapbtrex_incomplete") + "'"]
self.check_call(args, shell=True)
self.trace(LOG_REMOTE + "finished sending snapshot")
Should be un-indented.
I did some digging in my data (e.g. snapshots of 5 years) and found one occurrence where this happened! I seem to be lucky with network stability. This was no real problem - it recovered - but you're right, there were actually 3 unusable snapshots lying around. You can actually trace the following snapshot through the
That's interesting - I'm not sure how yours is able to recover. For me it goes like this:
This leads to a further problem - I run a periodic job which prunes snapshots from my "remote" repository, using snapbtrex running locally there. This treats the "broken" snapshot as the most valuable, because it's the newest, and normally deletes the second-newest (which is complete). On the sender side, I only keep the snapshot which was most recently successfully sent to the remote, for use as the parent of the next incremental update. The problem is that the "prune" on the receiver side doesn't distinguish "successful" from "non-successful" snapshots, and it often leaves behind the broken snapshot while deleting the most recent successful one. That breaks incremental backups entirely, because I no longer have a suitable parent on the sender side, and I have to start again with a new full transfer. P.S. I'm not sure it's about network stability so much as e.g. closing my laptop mid transfer and then taking it somewhere else, or shutting down my PC.
I regularly have the issue that an interrupted transfer leaves a broken snapshot on the remote, which prevents all future transfers and also breaks incremental backups when the last "complete" snapshot gets cleaned up on the remote.
This attempts to help, by trying to detect the incomplete snapshots, and ignoring/deleting/working-around them, as well as preventing clean-up on a repository with a broken snapshot.
As the commit message says, it's not perfect, but I think it does help. I'm more than happy to hear better ways to handle it!