Attempt to better handle incomplete transfers #13

Draft
wants to merge 2 commits into
base: main

usedbytes
Contributor

I regularly have the issue that an interrupted transfer leaves a broken snapshot on the remote, which prevents all future transfers and also breaks incremental backups when the last "complete" snapshot gets cleaned up on the remote.

This attempts to help, by trying to detect the incomplete snapshots, and ignoring/deleting/working-around them, as well as preventing clean-up on a repository with a broken snapshot.

As the commit message says, it's not perfect, but I think it does help. I'm more than happy to hear better ways to handle it!

dry-run doesn't work currently, as some operations need real output from
real commands. In some cases, it's always safe to actually execute the
command (e.g. `ls`), so add a "dry_safe" parameter to check_call which
can allow these commands to execute even in dry-run mode.
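A minimal sketch of the idea, assuming a wrapper class with a `dry_run` flag. The `Runner` class, its `skipped` list, and the returned value are hypothetical and only illustrate how a `dry_safe` parameter could gate execution; the real `check_call` in snapbtrex may differ.

```python
import subprocess


class Runner:
    """Hypothetical stand-in for the object that owns check_call."""

    def __init__(self, dry_run=False):
        self.dry_run = dry_run
        self.skipped = []  # commands that were not executed

    def check_call(self, args, shell=False, dry_safe=False):
        # In dry-run mode, only commands flagged as side-effect free are run.
        if self.dry_run and not dry_safe:
            self.skipped.append(args)
            return None
        return subprocess.check_output(args, shell=shell)


runner = Runner(dry_run=True)
# Read-only command: allowed to run even in dry-run mode.
listing = runner.check_call(["ls", "."], dry_safe=True)
# Destructive command: recorded but never executed in dry-run mode.
result = runner.check_call(["btrfs", "subvolume", "delete", "/mnt/backup/example"])
```

Defaulting `dry_safe` to `False` means a caller has to opt in explicitly for each command that is safe to run.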

If a btrfs send | receive is interrupted, then an incomplete snapshot
can be left on the remote side. This causes two problems:

1) It cannot be used as a parent for any future receives
2) A subsequent snapbtrex clean operation will think that the incomplete
   snapshot is the most recent valid one, and may delete the most recent
   actually valid parent - leading to a situation where no valid parent is
   available on the sender and receiver for incremental backups.

Fixing this is somewhat tricky, as there's no sure-fire way to identify
the incomplete snapshots.

However, _if_ a snapshot is "received", then it will have a
received_uuid, and if it's incomplete, this will be '-'.
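As a rough illustration, the check could be done by reading the `Received UUID` field from the text printed by `btrfs subvolume show`. The function names and the sample text (including the UUID values) are made up for this sketch, and a missing or `-` value only signals "incomplete" for snapshots that were supposed to be received.

```python
def parse_received_uuid(show_output):
    """Return the 'Received UUID' value from `btrfs subvolume show` text, or None."""
    for line in show_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Received UUID":
            return value.strip()
    return None


def lacks_received_uuid(show_output):
    # '-' (or a missing field) means no received_uuid was recorded.
    uuid = parse_received_uuid(show_output)
    return uuid is None or uuid == "-"


# Illustrative text only; the UUIDs are invented.
SAMPLE_PARTIAL = """\
    Name:           snap-3
    UUID:           11111111-2222-3333-4444-555555555555
    Parent UUID:    -
    Received UUID:  -
"""
SAMPLE_COMPLETE = """\
    Name:           snap-2
    UUID:           66666666-7777-8888-9999-000000000000
    Parent UUID:    -
    Received UUID:  aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
"""

partial_flag = lacks_received_uuid(SAMPLE_PARTIAL)
complete_flag = lacks_received_uuid(SAMPLE_COMPLETE)
```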

This commit attempts to alleviate the problem in a few ways:

 - When starting a transfer, the name of the snapshot is stored into a
   ".snapbtrex_incomplete" file on the remote side. This is used as an
   indicator that a snapshot with no received_uuid really is incomplete.

 - When cleaning a directory, if a ".snapbtrex_incomplete" file is
   present and contains the name of a snapshot, then abort the clean
   operation. Manual intervention will be required, but incremental
   sends won't be inadvertently broken.

 - When sending snaps, when looking for parents, if the parent doesn't have
   a valid received_uuid then try two things:
   - If .snapbtrex_incomplete is present, and contains the name of that
     snapshot, we can be sure that it previously failed to complete, and
     delete the broken snapshot. It will be re-transferred.
   - If .snapbtrex_incomplete is not present or doesn't contain that
     snapshot, then simply ignore the "invalid" parent. This should
     allow us to transfer the snapshot with an older parent.
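The rules above can be sketched as pure decision logic. This is an assumption-laden outline, not the code in the commit: `read_marker`, `may_clean`, and `pick_parent` are hypothetical names, the marker file is assumed to hold one snapshot name per line, and received_uuid values are passed in as a plain dict.

```python
def read_marker(marker_text):
    """Snapshot names found in the (assumed) contents of .snapbtrex_incomplete."""
    return {line.strip() for line in marker_text.splitlines() if line.strip()}


def may_clean(marker_text):
    # Refuse to clean while the marker names any snapshot.
    return not read_marker(marker_text)


def pick_parent(candidates, received_uuid, marker_text):
    """Pick the newest usable parent.

    candidates: snapshot names, newest first.
    received_uuid: dict of name -> received_uuid string ('-' when unset).
    Returns (parent_or_None, names_to_delete).
    """
    marked = read_marker(marker_text)
    to_delete = []
    for name in candidates:
        if received_uuid.get(name, "-") != "-":
            return name, to_delete
        if name in marked:
            # Known to have failed mid-transfer: delete and re-transfer.
            to_delete.append(name)
        # Otherwise ignore the "invalid" parent and try an older one.
    return None, to_delete


candidates = ["snap-3", "snap-2", "snap-1"]
uuids = {"snap-3": "-", "snap-2": "-", "snap-1": "aaaaaaaa-bbbb"}
parent, to_delete = pick_parent(candidates, uuids, "snap-3\n")
```

In this example `snap-3` is named in the marker and gets queued for deletion, `snap-2` is skipped because nothing confirms it is broken, and `snap-1` becomes the parent.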

This isn't perfect, but it should prevent failures more often than
before.

All of the functionality is behind a "--handle-incomplete" option, to
preserve the old behaviour and not break anyone's current workflow.
@usedbytes
Contributor Author

P.S. I'll submit a new PR with a README change describing how it works/doesn't work if the approach looks reasonable.

# Remove incomplete marker
args = ["ssh -p " + ssh_port + " " + receiver + " 'rm " + os.path.join(receiver_path, ".snapbtrex_incomplete") + "'"]
self.check_call(args, shell=True)
self.trace(LOG_REMOTE + "finished sending snapshot")
Contributor Author

Should be un-indented.

@yoshtec
Owner

yoshtec commented Nov 3, 2019

I did some digging in my data (about 5 years of snapshots) and found one occurrence where this happened! I seem to be lucky with network stability.

This was no real problem - it recovered - but you're right, there were actually 3 unusable snapshots lying around.

You can actually trace the following snapshot through the Snapshot(s): property. If one snapshot lists more than one snapshot there, something is odd; if a snapshot doesn't list another one (and it's not the last), then there's also an issue.
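A rough sketch of that heuristic, assuming the text comes from `btrfs subvolume show` and that the Snapshot(s): list runs until a blank line or the end of the text. The function names and sample paths are hypothetical.

```python
def list_child_snapshots(show_output):
    """Collect the entries listed under 'Snapshot(s):' in the given text."""
    children = []
    in_section = False
    for line in show_output.splitlines():
        text = line.strip()
        if in_section:
            if not text:
                break  # assumed end of the list
            children.append(text)
        elif text.startswith("Snapshot(s):"):
            in_section = True
    return children


def chain_warnings(chain):
    """chain: list of (name, children) tuples, oldest snapshot first."""
    warnings = []
    for index, (name, children) in enumerate(chain):
        is_last = index == len(chain) - 1
        if len(children) > 1:
            warnings.append((name, "more than one snapshot"))
        elif not children and not is_last:
            warnings.append((name, "no following snapshot"))
    return warnings


# Illustrative text only.
SAMPLE = """\
    Name:           snap-2
    Snapshot(s):
                    backups/snap-3
                    backups/snap-3-retry
"""

children = list_child_snapshots(SAMPLE)
warnings = chain_warnings([
    ("snap-1", ["backups/snap-2"]),
    ("snap-2", children),
    ("snap-3", []),
])
```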

@usedbytes
Contributor Author

usedbytes commented Nov 4, 2019

That's interesting - I'm not sure how yours is able to recover.

For me it goes like this:

  1. send/receive is interrupted, leaving a snapshot on the remote side which has no received_uuid
  2. The next time snapbtrex tries to sync, it sees the partial snapshot, and determines that it's the best parent (it's the most recent), and so tries to do a new send/receive against that snapshot
  3. The receiver side fails, because the parent doesn't have a received_uuid
  4. snapbtrex aborts (though since the new error handling it doesn't exit with an error - that caught me out 😆 )
  5. Next time snapbtrex tries - the same thing happens. The "broken" snapshot is still the most recent, gets picked as the parent, and fails.

This leads to a further problem - I run a periodic job which prunes snapshots from my "remote" repository, using snapbtrex running locally there. This treats the "broken" snapshot as the most valuable - because it's the newest - and normally deletes the second-newest (which is complete).

On the sender side, I only keep the snapshot which was most-recently successfully sent to the remote, for use for the next incremental update. The problem is, the "prune" on the receiver side doesn't pay attention to "successful" vs "non-successful" - and often it leaves behind the broken snapshot, while deleting the most-recent successful one; and so that then breaks incremental backups entirely because I don't have a suitable parent on the sender side which I can use - and I have to start again with a new full transfer.
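A toy model of that failure, for illustration only. `naive_prune` is a hypothetical "keep the newest N" rule and is much simpler than snapbtrex's real retention logic; the snapshot names are invented.

```python
def naive_prune(snapshots, keep):
    """snapshots: list of (name, complete) tuples, oldest first.

    Keeps the newest `keep` entries without looking at completeness.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    return snapshots[-keep:], snapshots[:-keep]


# Receiver: two complete snapshots, then one interrupted transfer.
remote = [("snap-1", True), ("snap-2", True), ("snap-3", False)]
# Sender: only the most recent successfully sent snapshot is kept.
sender_parent = "snap-2"

kept, deleted = naive_prune(remote, keep=1)
usable_on_remote = {name for name, complete in kept if complete}
has_common_parent = sender_parent in usable_on_remote
```

Here the prune keeps the broken `snap-3` and deletes `snap-2`, so sender and receiver no longer share a valid parent and the next transfer has to be a full one.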

P.S. I'm not sure it's about network stability so much as e.g. closing my laptop mid transfer and then taking it somewhere else, or shutting down my PC.
