Attempt to better handle incomplete transfers #13

usedbytes · 2019-11-02T12:11:34Z

I regularly have the issue that an interrupted transfer leaves a broken snapshot on the remote, which prevents all future transfers and also breaks incremental backups when the last "complete" snapshot gets cleaned up on the remote.

This attempts to help, by trying to detect the incomplete snapshots, and ignoring/deleting/working-around them, as well as preventing clean-up on a repository with a broken snapshot.

As the commit message says, it's not perfect, but I think it does help. I'm more than happy to hear better ways to handle it!

dry-run doesn't work currently, as some operations need real output from real commands. In some cases, it's always safe to actually execute the command (e.g. `ls`), so add a "dry_safe" parameter to check_call which can allow these commands to execute even in dry-run mode.
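A minimal sketch of the `dry_safe` idea, assuming a small hypothetical `Runner` class in place of the real snapbtrex class that owns `check_call`; only the `dry_safe` parameter name and the `shell` argument come from this PR, the rest is illustrative:

```python
import subprocess


class Runner:
    """Hypothetical stand-in for the class that owns check_call."""

    def __init__(self, dry_run=False):
        self.dry_run = dry_run
        self.log = []  # every command that was requested, executed or not

    def check_call(self, args, shell=False, dry_safe=False):
        """Run a command, or only record it when in dry-run mode.

        dry_safe=True marks read-only commands (e.g. ls) that may run
        even during a dry run, because later steps need their real output.
        """
        self.log.append(args)
        if self.dry_run and not dry_safe:
            return None  # skipped: the command might modify state
        return subprocess.check_output(args, shell=shell)
```

With this shape, a dry run still records destructive commands without executing them, while calls flagged `dry_safe=True` return real output.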

If a btrfs send | receive is interrupted, then an incomplete snapshot can be left on the remote side. This causes two problems:

1) It cannot be used as a parent for any future receives
2) A subsequent snapbtrex clean operation will think that the incomplete snapshot is the most recent valid one, and may delete the most recent actually valid parent - leading to a situation where no valid parent is available on the sender and receiver for incremental backups.

Fixing this is somewhat tricky, as there's no sure-fire way to identify the incomplete snapshots. However, _if_ a snapshot is "received", then it will have a received_uuid, and if it's incomplete, this will be '-'.

This commit attempts to alleviate the problem in a few ways:

- When starting a transfer, the name of the snapshot is stored into a ".snapbtrex_incomplete" file on the remote side. This is used as an indicator that a snapshot with no received_uuid really is incomplete.
- When cleaning a directory, if a ".snapbtrex_incomplete" file is present and contains the name of a snapshot, then abort the clean operation. Manual intervention will be required, but incremental sends won't be inadvertently broken.
- When sending snaps, when looking for parents, if the parent doesn't have a valid received_uuid then try two things:
  - If .snapbtrex_incomplete is present, and contains the name of that snapshot, we can be sure that it previously failed to complete, and delete the broken snapshot. It will be re-transferred.
  - If .snapbtrex_incomplete is not present or doesn't contain that snapshot, then simply ignore the "invalid" parent. This should allow us to transfer the snapshot with an older parent.

This isn't perfect, but it should prevent failures more often than before.

All of the functionality is behind a "--handle-incomplete" option, to preserve the old behaviour and not break anyone's current workflow.
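The decision rules in the commit message above can be sketched as two pure functions. This is an illustration under assumptions, not the code from the PR: `classify_parent` and `may_clean` are hypothetical names, and the marker is assumed to hold exactly one snapshot name.

```python
MARKER = ".snapbtrex_incomplete"  # marker file name from the commit message


def classify_parent(name, received_uuid, marker_content):
    """Decide what to do with a candidate parent on the receiver.

    received_uuid is the value reported for the snapshot ('-' if unset);
    marker_content is the text of the marker file, or None if it is absent.
    Returns 'use', 'delete' or 'ignore'.
    """
    if received_uuid != "-":
        return "use"  # fully received, valid parent
    if marker_content is not None and marker_content.strip() == name:
        return "delete"  # known-interrupted transfer; it will be re-sent
    return "ignore"  # cannot prove it is incomplete; fall back to an older parent


def may_clean(marker_content):
    """Cleaning is refused while the marker names a snapshot."""
    return marker_content is None or marker_content.strip() == ""
```

For example, `classify_parent("snap-2", "-", "snap-2\n")` yields `'delete'`, while the same snapshot with no marker file yields `'ignore'`.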

usedbytes · 2019-11-02T12:13:10Z

P.S. I'll submit a new PR with a README change describing how it works/doesn't work if the approach looks reasonable.

usedbytes · 2019-11-02T12:15:23Z

snapbtrex.py

+            # Remove incomplete marker
+            args = ["ssh -p " + ssh_port + " " + receiver + " 'rm " + os.path.join(receiver_path, ".snapbtrex_incomplete") + "'"]
+            self.check_call(args, shell=True)
+            self.trace(LOG_REMOTE + "finished sending snapshot")


Should be un-indented.

yoshtec · 2019-11-03T21:13:04Z

I did some digging in my data (snapshots going back 5 years) and found one occurrence where this happened! I seem to be lucky with network stability.

This was no real problem - it recovered - but you're right, there were actually 3 unusable snapshots lying around.

You can actually trace the following snapshot through the Snapshot(s): property. If one snapshot lists more than one snapshot there, something is odd; if a snapshot doesn't list another one (and it's not the last), then there's also an issue.
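A rough sketch of that check, assuming text shaped like the output of `btrfs subvolume show` with `Snapshot(s):` as the final section. The sample below is trimmed and made up for illustration, and `parse_subvolume_show` and `looks_odd` are hypothetical helper names:

```python
SAMPLE = """\
snapshots/2019-11-01
    Name:               2019-11-01
    Received UUID:      -
    Flags:              readonly
    Snapshot(s):
                        snapshots/2019-11-02
                        snapshots/2019-11-03
"""


def parse_subvolume_show(text):
    """Return (received_uuid, snapshots) parsed from the given text."""
    received_uuid = None
    snapshots = []
    in_snapshots = False
    for line in text.splitlines():
        stripped = line.strip()
        if in_snapshots:
            # Assumption: everything after "Snapshot(s):" is a snapshot path.
            if stripped:
                snapshots.append(stripped)
            continue
        if stripped.startswith("Received UUID:"):
            received_uuid = stripped.split(":", 1)[1].strip()
        elif stripped.startswith("Snapshot(s):"):
            in_snapshots = True
    return received_uuid, snapshots


def looks_odd(snapshots, is_last):
    """More than one follower is odd; none is odd unless this is the last one."""
    if len(snapshots) > 1:
        return True
    return not snapshots and not is_last
```

Here `parse_subvolume_show(SAMPLE)` reports a received UUID of `-` and two followers, so `looks_odd` flags it.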

usedbytes · 2019-11-04T11:18:00Z

That's interesting - I'm not sure how yours is able to recover.

For me it goes like this:

1. send/receive is interrupted, leaving a snapshot on the remote side which has no received_uuid
2. The next time snapbtrex tries to sync, it sees the partial snapshot, and determines that it's the best parent (it's the most recent), and so tries to do a new send/receive against that snapshot
3. The receiver side fails, because the parent doesn't have a received_uuid
4. snapbtrex aborts (though since the new error handling it doesn't exit with an error - that caught me out 😆 )
5. Next time snapbtrex tries - the same thing happens. The "broken" snapshot is still the most recent, gets picked as the parent, and fails.

This leads to a further problem - I run a periodic job which prunes snapshots from my "remote" repository, using snapbtrex running locally there. This treats the "broken" snapshot as the most valuable - because it's the newest - and normally deletes the second-newest (which is complete).

On the sender side, I only keep the snapshot which was most-recently successfully sent to the remote, for use for the next incremental update. The problem is, the "prune" on the receiver side doesn't pay attention to "successful" vs "non-successful" - and often it leaves behind the broken snapshot, while deleting the most-recent successful one; and so that then breaks incremental backups entirely because I don't have a suitable parent on the sender side which I can use - and I have to start again with a new full transfer.
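To make the failure mode concrete, here is a toy simulation. It assumes a deliberately simplified prune that keeps the N newest snapshots by name, which is not snapbtrex's actual retention logic; `naive_prune` and the snapshot names are made up for illustration:

```python
def naive_prune(snapshots, keep):
    """Keep the `keep` newest snapshots by name; return (kept, deleted).

    snapshots maps a sortable name to True (complete) or False (incomplete).
    The prune ignores that flag, which is the problem described above.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ordered = sorted(snapshots)
    return ordered[-keep:], ordered[:-keep]


remote = {
    "2019-11-01": True,
    "2019-11-02": True,   # last complete snapshot, also kept on the sender
    "2019-11-03": False,  # interrupted transfer
}
kept, deleted = naive_prune(remote, keep=1)

# Only complete snapshots can serve as parents for an incremental send.
usable_parents = [name for name in kept if remote[name]]
```

After this run only the interrupted snapshot survives on the remote, `usable_parents` is empty, and the next backup has to be a full transfer.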

P.S. I'm not sure it's about network stability so much as e.g. closing my laptop mid transfer and then taking it somewhere else, or shutting down my PC.

usedbytes added 2 commits November 2, 2019 11:19
