
sftp backend does not reconnect #353

Open

mika opened this issue Nov 16, 2015 · 39 comments · May be fixed by #4058
mika commented Nov 16, 2015

I'm trying to back up data to Hetzner's backup space (http://wiki.hetzner.de/index.php/Backup/en). It starts nicely with ~15-20 MiB/s, but as soon as I receive a Write failed: Broken pipe or Connection to uXXX.your-backup.de closed by remote host. from Hetzner's backup space, restic permanently drops to <1 MiB/s and doesn't seem to be able to recover from it. While this clearly seems to be a problem with Hetzner's backend storage, I can't find an ssh configuration (TCPKeepAlive, ServerAliveInterval, ...) that works around it. It would be nice if restic could recover from this situation by itself.

Thanks for restic! :)

mika (Author) commented Nov 16, 2015

Sorry, forgot to add, I'm running latest git:

restic 0.1.0 (v0.1.0-145-g34bf70f)
compiled at 2015-11-16 17:34:04 with go1.3.3

on a Debian wheezy amd64 system, using the following command line:

 % ./restic -r sftp://hetzner-backup//odysseus/home/mika backup ~/

fd0 (Member) commented Nov 16, 2015

Thanks for your report. The backend infrastructure does not yet handle connection errors: for sftp, just one persistent connection is made to the backup server, and it is never re-established. That's the underlying problem.

I suspect that after the error is received, the connection is closed and restic does not store anything at all after that. However, you should be able to resume the backup by restarting restic.

Do you know what server the Hetzner backup uses?

fd0 changed the title from "restic performance drastically decreases on sftp connection issues" to "sftp backend does not reconnect" on Nov 16, 2015
mika (Author) commented Nov 16, 2015

Ah that explains it. :) Thanks for your fast reply! Please let me know if I could be of any help with testing, would love to use restic also for the Hetzner use case. :)

@episource (Contributor)

Side note: It seems unlikely to me that Hetzner's ssh backend is broken. However, the following observation might be valuable for you, too.

A few times I observed stalling TCP-over-IPv6 connections that were related to faulty routers on the path to the target host. What happened: some router along the way had Sequence Number Randomization enabled. This should be transparent to both server and client, but it was not done properly: the SACK (selective acknowledgement) TCP header option was not rewritten. Once the first packet loss occurred, there was a high chance for the SACK option to contain an illegal (out of the expected range) value, causing netfilter/conntrack to drop the response packet.

IPv4 could theoretically be affected, too, but I never observed this myself.

You can log packets flagged invalid by the conntrack module:
echo "255">/proc/sys/net/netfilter/nf_conntrack_log_invalid

SACK is used only when both client and server support it. It can be disabled per host using the following netfilter rule (requires netfilter + TCPOPTSTRIP kernel support):
ip6tables -t mangle -I OUTPUT -p tcp --syn -d your.server.tld -j TCPOPTSTRIP --strip-options sack-permitted

You could also try to (temporarily) disable SACK for all connections (IPv4+IPv6):
sysctl -w net.ipv4.tcp_sack=0

mika (Author) commented Nov 17, 2015

Interesting hints, thanks. No matter whether I'm running with net.ipv4.tcp_sack=1 or net.ipv4.tcp_sack=0, this doesn't seem to change anything sadly.

I got runtime panics though, full log available in http://michael-prokop.at/tmp/restic_panic.txt jftr.

fd0 (Member) commented Nov 17, 2015

Thanks for the log, the panic was expected. Error handling and reporting back to the user deserves a lot more work ;)

debe commented Sep 19, 2016

I'd like to note that I receive the same error message when backing up from my linux (debian jessie) systems to hetzner backup space. Is there a workaround yet?

fd0 (Member) commented Sep 19, 2016

Not yet, but you can just run the backup again and it will resume.
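
For example, a tiny wrapper along these lines (just a sketch; repository URL, password file and backup path are placeholders) keeps restarting the backup until a run finishes cleanly:

#!/bin/sh
# Sketch: re-run the backup until restic exits successfully.
# Repository URL, password file and backup path are placeholders.
until restic -r sftp://user@host:/repo --password-file /root/.restic-pass backup /home
do
    echo "backup interrupted, retrying in 60 seconds..." >&2
    sleep 60
done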

debe commented Nov 9, 2016

@fd0 btw, it aborts exactly at the same position over and over again. You cannot continue.
It aborts quite early, so it never finishes my backup, but it's not the same position, you're right.

fd0 (Member) commented Nov 9, 2016

Hm, I find that rather unexpected. Restic starts, reads files, splits them into chunks, bundles the chunks into pack files, and uploads the pack files to the backend. The pack files are never the same (neither in size nor in content), so the server always sees different data.

@debe Did you create a debug log yet? Maybe that contains another hint...

debe commented Nov 9, 2016

fd0 (Member) commented Apr 26, 2017

FYI: It was just reported that Hetzner is still running ProFTPD 1.3.5b. The fix should be in 1.3.6, which was released just a few weeks ago, on 9 April.

fd0 (Member) commented Sep 5, 2017

A user just reported that Hetzner seems to have installed a newer version of the ftp service, so it works much better now. Can anybody else confirm this?

mika (Author) commented Sep 8, 2017

@fd0 - yep, I did some tests today (using v0.7.1) and it looks much better \o/
Being the initial bug reporter, from my PoV you can close this issue.
Thanks for restic!

mika (Author) commented Sep 8, 2017

Well, maybe I was too fast with my reply; I just had the following failure with a 103 GB directory (Hetzner storage box as repository target via sftp, restic v0.7.1 locally):

# restic backup --one-file-system /srv
scan [/srv]
scanned 114020 directories, 1211537 files in 3:13
panic: Write: failed to send packet header: write |1: broken pipe557 items  0 errors  ETA 1:50:26

goroutine 97108 [running]:
restic/archiver.(*Archiver).saveChunk(0xc421c37a90, 0xc213e0, 0xc420012478, 0x8b881b4, 0x12a230, 0x751ece900000, 0xc423c38000, 0x12a230, 0x268000, 0xc421af68f0, ...)
        src/restic/archiver/archiver.go:165 +0x395
created by restic/archiver.(*Archiver).SaveFile
        src/restic/archiver/archiver.go:239 +0x3a4
restic backup --one-file-system /srv  170.64s user 31.84s system 40% cpu 8:24.19 total

It seems different from the problem I initially reported here back then, though.
Rerunning it now; if you have any further suggestions, please let me know. :)

mika (Author) commented Sep 8, 2017

Same problem again, this time after ~37 minutes:

scan [/srv]
scanned 114020 directories, 1211537 files in 3:08
panic: Close: failed to send packet header: write |1: broken pipe325557 items  0 errors  ETA 58:46 

goroutine 536461 [running]:
restic/archiver.(*Archiver).saveChunk(0xc420f66230, 0xc213e0, 0xc420012478, 0x92f6b8d, 0x310175, 0x178b6059e00000, 0xc43e24a000, 0x310175, 0x3c4000, 0xc42195e9a0, ...)
        src/restic/archiver/archiver.go:165 +0x395
created by restic/archiver.(*Archiver).SaveFile
        src/restic/archiver/archiver.go:239 +0x3a4
restic backup --one-file-system /srv  1034.36s user 105.90s system 50% cpu 37:25.20 total

Running restic prune is the appropriate step now to get rid of the data uploaded for the incomplete snapshot, right?

fd0 (Member) commented Sep 10, 2017

Yes, restic prune will take care of that.

mika (Author) commented Sep 12, 2017

FTR, retried with latest restic from git (v0.7.1-183-g0f25ef94) with debug enabled, last lines from debug log:

2017/09/12 19:12:58 repository/master_index.go:39       repository.(*MasterIndex).Lookup        108502  id c79f83fb not found in any index
2017/09/12 19:12:58 repository/repository.go:170        repository.(*Repository).SaveAndEncrypt 108502  save id c79f83fb (data, 1519469 bytes)
2017/09/12 19:12:58 sftp/sftp.go:80     sftp.startClient.func2  12      ssh command exited, err signal: killed
2017/09/12 19:12:58 repository/packer_manager.go:124    repository.(*Repository).savePacker     108483  Save(<data/faaaad262a>) error: OpenFile: failed to send packet header: write |1: file already closed
2017/09/12 19:12:58 archiver/archiver.go:106    archiver.(*Archiver).Save       108483  Save(data, 6165f0e9): error OpenFile: failed to send packet header: write |1: file already closed
2017/09/12 19:12:58 archiver/archiver.go:164    archiver.(*Archiver).saveChunk  108483  Save(6165f0e9) failed: OpenFile: failed to send packet header: write |1: file already closed

(Running it from a Debian jessie system towards Hetzner's your-storagebox.de, IP 88.99.48.239 in my case)

mika (Author) commented Sep 13, 2017

And another update from me :) With the latest restic from git, and running it under ionice -c 3 nice -n 19, it seems to be way more stable so far. (Or maybe the ProFTPD version on my storage box was updated in the meantime?) I'll report if I find out anything more; if there's anything further I could try or investigate, please let me know. :)

@LaszloHont

Is restic prune safe to use over unreliable connections?

fd0 (Member) commented Sep 15, 2017

It is "safe" in terms of safety: It won't break the repo. But you'll have to start from the beginning and the repo will keep growing until you manage to finish one prune run. That's a safety feature (restic will only remove data once it is sure that the new copies of still referenced data are safely saved).

fd0 added the "state: need implementing" label (cause/request established, need work/solution) on May 26, 2018
@dragon2611

It seems to me at least that it might be more reliable to use the rclone backend and let rclone handle the sftp until reconnect support is implemented, particularly if uploading over residential broadband.
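
Roughly like this (a sketch only; the remote name, host and user are placeholders, and authentication via ssh key/agent is assumed): configure an sftp remote in rclone and point restic at it through the rclone backend:

# create an rclone sftp remote; rclone then manages the connection and reconnects
rclone config create hetzner sftp host uXXXXXX.your-storagebox.de user uXXXXXX

# use restic's rclone backend instead of the sftp backend
restic -r rclone:hetzner:restic-repo init
restic -r rclone:hetzner:restic-repo backup /home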

fd0 (Member) commented Apr 27, 2019

For your information: we've recently figured out that some sftp servers disconnect clients after a while without activity. You can prevent that with restic by adding the following settings to the ~/.ssh/config file for the server:

ServerAliveInterval 60
ServerAliveCountMax 240

See: https://restic.readthedocs.io/en/latest/030_preparing_a_new_repo.html#sftp
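
A complete ~/.ssh/config entry could look roughly like this (the host alias, HostName and User are placeholders; the two ServerAlive* settings are the part relevant to this issue):

Host hetzner-backup
    HostName uXXXXXX.your-storagebox.de
    User uXXXXXX
    ServerAliveInterval 60
    ServerAliveCountMax 240

Since restic's sftp backend runs the system ssh binary, it picks these options up automatically, e.g. with -r sftp://hetzner-backup/<path>.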

zacha81 commented Nov 17, 2020

Hello. Is this still being worked on?

I have implemented a repo check job, and from time to time it doesn't finish, it seems because of this problem. It looks like this:

read group #20 of 7086 data packs (out of total 197033 packs in 28 groups)
Load(<data/f351a2355c>, 0, 0) returned error, retrying after 552.330144ms: failed to send packet: write |1: file already closed
Load(<data/67b8cf0bdc>, 0, 0) returned error, retrying after 582.280027ms: failed to send packet: write |1: file already closed
Load(<data/670969ac2e>, 0, 0) returned error, retrying after 720.254544ms: failed to send packet: write |1: file already closed
subprocess ssh: Connection to uXXXXXX.your-storagebox.de closed by remote host.
Load(<data/f30466a2d3>, 0, 0) returned error, retrying after 468.857094ms: failed to send packet: write |1: file already closed
Load(<data/bb270bf82d>, 0, 0) returned error, retrying after 462.318748ms: failed to send packet: write |1: file already closed
Load(<data/bb270bf82d>, 0, 0) returned error, retrying after 890.117305ms: failed to send packet: write |1: file already closed
Load(<data/f30466a2d3>, 0, 0) returned error, retrying after 424.227764ms: failed to send packet: write |1: file already closed
Load(<data/f351a2355c>, 0, 0) returned error, retrying after 492.389441ms: failed to send packet: write |1: file already closed
Load(<data/67b8cf0bdc>, 0, 0) returned error, retrying after 447.727139ms: failed to send packet: write |1: file already closed
Load(<data/670969ac2e>, 0, 0) returned error, retrying after 600.683895ms: failed to send packet: write |1: file already closed

It seems it will keep spilling those error messages forever; the mentioned blob references are shown repeatedly. The problem is:

  • it leaves an exclusive lock
  • it doesn't terminate

So I only notice this has happened when the next backup job is scheduled, which notifies me that the repo is still in use; then I have to manually terminate the job and remove the lock. It would already be a great improvement if restic at least terminated on those errors (so I get the error notice immediately).

regards

fd0 (Member) commented Nov 28, 2020

This is not actively being worked on as far as I know. Somebody would need to step up and implement reconnection/restarting code into the sftp backend.
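
Until someone does, a crude workaround for hanging check/prune jobs (a sketch only; the time limit is an arbitrary placeholder, and it assumes the repository and password are configured via environment variables) is to run them under a hard timeout and clear the stale lock if the run gets killed:

#!/bin/sh
# Sketch: abort a check that hangs on a dead sftp connection after a fixed
# time limit (coreutils timeout), then remove the stale exclusive lock.
if ! timeout 6h restic check
then
    echo "check did not finish, removing stale locks" >&2
    restic unlock
fi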

@johnflan

I assume this issue is still open?

I (not infrequently) get failed prunes like this from my home network to Hetzner. Previously I spun up an AWS instance to run the prune and it finished without difficulties.

will delete 15274 packs and rewrite 16703 packs, this frees 110.817 GiB
subprocess ssh: client_loop: send disconnect: Broken pipe
Load(<data/c63c8807f9>, 0, 0) returned error, retrying after 552.330144ms: failed to send packet: write |1: file already closed
Save(<lock/2c06eee37b>) returned error, retrying after 720.254544ms: OpenFile: EOF
Load(<data/c63c8807f9>, 0, 0) returned error, retrying after 873.42004ms: failed to send packet: write |1: file already closed
Save(<lock/2c06eee37b>) returned error, retrying after 703.28564ms: ssh command exited: exit status 255
Save(<lock/2c06eee37b>) returned error, retrying after 1.040217184s: ssh command exited: exit status 255
Load(<data/c63c8807f9>, 0, 0) returned error, retrying after 1.335175957s: failed to send packet: write |1: file already closed
Save(<lock/2c06eee37b>) returned error, retrying after 954.512469ms: ssh command exited: exit status 255
Load(<data/c63c8807f9>, 0, 0) returned error, retrying after 1.107876242s: failed to send packet: write |1: file already closed
Save(<lock/2c06eee37b>) returned error, retrying after 1.511079094s: ssh command exited: exit status 255
Load(<data/c63c8807f9>, 0, 0) returned error, retrying after 2.027308147s: failed to send packet: write |1: file already closed
Save(<lock/2c06eee37b>) returned error, retrying after 3.854635449s: ssh command exited: exit status 255
Load(<data/c63c8807f9>, 0, 0) returned error, retrying after 4.987726727s: failed to send packet: write |1: file already closed
Save(<lock/2c06eee37b>) returned error, retrying after 4.067955962s: ssh command exited: exit status 255
Load(<data/c63c8807f9>, 0, 0) returned error, retrying after 5.015617898s: failed to send packet: write |1: file already closed
Save(<lock/2c06eee37b>) returned error, retrying after 6.988645419s: ssh command exited: exit status 255
Load(<data/c63c8807f9>, 0, 0) returned error, retrying after 8.277195667s: failed to send packet: write |1: file already closed
Save(<lock/2c06eee37b>) returned error, retrying after 10.034154425s: ssh command exited: exit status 255
Load(<data/c63c8807f9>, 0, 0) returned error, retrying after 10.163166574s: failed to send packet: write |1: file already closed
Save(<lock/2c06eee37b>) returned error, retrying after 22.663987965s: ssh command exited: exit status 255
Load(<data/c63c8807f9>, 0, 0) returned error, retrying after 13.811796615s: failed to send packet: write |1: file already closed
[17:06:00] 45.49%  7598 / 16703 packs rewritten
unable to refresh lock: ssh command exited: exit status 255
Remove(<lock/be44e6cf56>) returned error, retrying after 351.593438ms: ssh command exited: exit status 255
Remove(<lock/be44e6cf56>) returned error, retrying after 645.653563ms: ssh command exited: exit status 255
Remove(<lock/be44e6cf56>) returned error, retrying after 1.204507436s: ssh command exited: exit status 255
Remove(<lock/be44e6cf56>) returned error, retrying after 2.299204301s: ssh command exited: exit status 255
Remove(<lock/be44e6cf56>) returned error, retrying after 2.007570431s: ssh command exited: exit status 255
Remove(<lock/be44e6cf56>) returned error, retrying after 3.026422858s: ssh command exited: exit status 255
Remove(<lock/be44e6cf56>) returned error, retrying after 7.133794867s: ssh command exited: exit status 255
Remove(<lock/be44e6cf56>) returned error, retrying after 6.0363136s: ssh command exited: exit status 255
Remove(<lock/be44e6cf56>) returned error, retrying after 17.496021524s: ssh command exited: exit status 255
Remove(<lock/be44e6cf56>) returned error, retrying after 23.002952479s: ssh command exited: exit status 255
failed to send packet: write |1: file already closed
github.com/pkg/sftp.sendPacket
	github.com/pkg/sftp/packet.go:135
github.com/pkg/sftp.(*conn).sendPacket
	github.com/pkg/sftp/conn.go:31
github.com/pkg/sftp.(*clientConn).dispatchRequest
	github.com/pkg/sftp/conn.go:117
github.com/pkg/sftp.(*clientConn).sendPacket
	github.com/pkg/sftp/conn.go:108
github.com/pkg/sftp.(*Client).open
	github.com/pkg/sftp/client.go:449
github.com/pkg/sftp.(*Client).Open
	github.com/pkg/sftp/client.go:437
github.com/restic/restic/internal/backend/sftp.(*SFTP).openReader
	github.com/restic/restic/internal/backend/sftp/sftp.go:320
github.com/restic/restic/internal/backend.DefaultLoad
	github.com/restic/restic/internal/backend/utils.go:49
github.com/restic/restic/internal/backend/sftp.(*SFTP).Load
	github.com/restic/restic/internal/backend/sftp/sftp.go:307
github.com/restic/restic/internal/limiter.rateLimitedBackend.Load
	github.com/restic/restic/internal/limiter/limiter_backend.go:44
github.com/restic/restic/internal/backend.(*RetryBackend).Load.func1
	github.com/restic/restic/internal/backend/backend_retry.go:79
github.com/cenkalti/backoff.RetryNotify
	github.com/cenkalti/backoff/retry.go:37
github.com/restic/restic/internal/backend.(*RetryBackend).retry
	github.com/restic/restic/internal/backend/backend_retry.go:36
github.com/restic/restic/internal/backend.(*RetryBackend).Load
	github.com/restic/restic/internal/backend/backend_retry.go:77
github.com/restic/restic/internal/cache.(*Backend).Load
	github.com/restic/restic/internal/cache/backend.go:197
github.com/restic/restic/internal/repository.DownloadAndHash
	github.com/restic/restic/internal/repository/repository.go:768
github.com/restic/restic/internal/repository.Repack
	github.com/restic/restic/internal/repository/repack.go:29
main.pruneRepository
	github.com/restic/restic/cmd/restic/cmd_prune.go:242
main.runPrune
	github.com/restic/restic/cmd/restic/cmd_prune.go:62
main.glob..func19
	github.com/restic/restic/cmd/restic/cmd_prune.go:27
github.com/spf13/cobra.(*Command).execute
	github.com/spf13/cobra/command.go:852
github.com/spf13/cobra.(*Command).ExecuteC
	github.com/spf13/cobra/command.go:960
github.com/spf13/cobra.(*Command).Execute
	github.com/spf13/cobra/command.go:897
main.main
	github.com/restic/restic/cmd/restic/main.go:98
runtime.main
	runtime/proc.go:204
runtime.goexit
	runtime/asm_amd64.s:1374
Load
github.com/restic/restic/internal/repository.DownloadAndHash
	github.com/restic/restic/internal/repository/repository.go:784
github.com/restic/restic/internal/repository.Repack
	github.com/restic/restic/internal/repository/repack.go:29
main.pruneRepository
	github.com/restic/restic/cmd/restic/cmd_prune.go:242
main.runPrune
	github.com/restic/restic/cmd/restic/cmd_prune.go:62
main.glob..func19
	github.com/restic/restic/cmd/restic/cmd_prune.go:27
github.com/spf13/cobra.(*Command).execute
	github.com/spf13/cobra/command.go:852
github.com/spf13/cobra.(*Command).ExecuteC
	github.com/spf13/cobra/command.go:960
github.com/spf13/cobra.(*Command).Execute
	github.com/spf13/cobra/command.go:897
main.main
	github.com/restic/restic/cmd/restic/main.go:98
runtime.main
	runtime/proc.go:204
runtime.goexit
	runtime/asm_amd64.s:1374
Repack
github.com/restic/restic/internal/repository.Repack
	github.com/restic/restic/internal/repository/repack.go:31
main.pruneRepository
	github.com/restic/restic/cmd/restic/cmd_prune.go:242
main.runPrune
	github.com/restic/restic/cmd/restic/cmd_prune.go:62
main.glob..func19
	github.com/restic/restic/cmd/restic/cmd_prune.go:27
github.com/spf13/cobra.(*Command).execute
	github.com/spf13/cobra/command.go:852
github.com/spf13/cobra.(*Command).ExecuteC
	github.com/spf13/cobra/command.go:960
github.com/spf13/cobra.(*Command).Execute
	github.com/spf13/cobra/command.go:897
main.main
	github.com/restic/restic/cmd/restic/main.go:98
runtime.main
	runtime/proc.go:204

restic version
restic 0.11.0 compiled with go1.15.9 on linux/amd64

@SimoneLazzaris

Hello there!
Thanks for your great work.

Still having this problem with Hetzner... my backups are failing after some minutes (13 min, 16 min... it varies).
The error is always:

Save(<data/blabla>) returned error, retrying after xxxms: Write: connection lost
Fatal: unable to save snapshot: ssh command exited: exit status 255

No combination of ssh config parameters is able to help me. I also noticed that plain ssh connections are dropped. I don't know if it's a Hetzner problem or a connection issue, but I believe that a simple retry loop would fix (or mask) the issue.

@MichaelEischer (Member)

For me this sounds like the following SFTP issue discussed in the documentation (https://restic.readthedocs.io/en/stable/030_preparing_a_new_repo.html#sftp):

Please be aware that sftp servers close connections when no data is received by the client. This can happen when restic is processing huge amounts of unchanged data. To avoid this issue add the following lines to the client’s .ssh/config file:

ServerAliveInterval 60
ServerAliveCountMax 240

@johnflan

Thanks Michael - I actually have those values already configured in my .ssh/config file

@MichaelEischer (Member)

@johnflan Oh, that response was mainly intended for SimoneLazzaris.

In your case I wonder whether this might be a result of an internet reconnection? With a daily reconnection as for example for DSL, it wouldn't be too unexpected if that interrupted a command running for 17 hours.

SimoneLazzaris commented Oct 22, 2022

@MichaelEischer I've tried that, and also disabling/enabling TCPKeepAlive, but without success. I suspect that something is actually wrong on Hetzner's side, because plain ssh connections are also sometimes dropped.

I managed to make it work using rclone as an intermediate interface: I configured rclone with a backend on Hetzner and then told restic to use it. It's been uploading for 24 hours now; sometimes I get some errors, but the connection is immediately re-established and the backup goes on. (screenshot omitted)

Note that before, with direct sftp, my connection was dropped every 10-20 minutes.

@johnflan

> @johnflan Oh, that response was mainly intended for SimoneLazzaris.
>
> In your case I wonder whether this might be a result of an internet reconnection? With a daily reconnection as for example for DSL, it wouldn't be too unexpected if that interrupted a command running for 17 hours.

No worries @MichaelEischer - I had considered that, but I don't ever have a problem performing the initial backup, just the prune. At a superficial glance it would seem that, in the case of a network drop, restic should attempt to re-establish the sftp stream to the remote target.

mika (Author) commented Oct 23, 2022

@johnflan I also had tons of trouble with restic and Hetzner. What was also relevant for me was https://docs.hetzner.com/robot/dedicated-server/troubleshooting/performance-intel-i218-nic - if you're running a dedicated server at Hetzner, consider giving it a try.

ibash linked a pull request on Nov 29, 2022 that will close this issue
ibash commented Nov 29, 2022

Took a first stab at this in a draft PR; the approach is to separate the ssh connection code from the backend and then establish a new connection when it exits. It's not ready for maintainers to review yet, but if anyone wants to give feedback on the approach, I'd welcome it.

@hvhaugwitz

What is the state of this feature request?

ibash commented May 8, 2023

My PR kind of worked, but it needed cleanup and more extensive manual testing before it'd be merge-ready.

Personally I didn't need this functionality that much, because I made the wrapper script that calls restic simply kill the process on disconnect and retry later.

MichaelEischer linked a pull request on Jul 29, 2023 that will close this issue
jbsilva commented Dec 27, 2023

I'm having the same problem with Hetzner.

... I made my wrapper script that calls restic just kill the process on disconnect and retry later.

@ibash, could you share that script?
I might try this before the rclone alternative.

ibash commented Dec 27, 2023

> I'm having the same problem with Hetzner.
>
> ... I made my wrapper script that calls restic just kill the process on disconnect and retry later.
>
> @ibash, could you share that script? I might try this before the rclone alternative.

Unfortunately my backup code is kind of messy / bespoke. I have a branch where I'm working on a cleanup; at some point later I'll publish that.

But the relevant bit is in here:
https://gist.github.com/ibash/d20e320c5248cc61eaa1cb1248be4bb3

In the code below, child is the restic process. I attach a handler to stderr, and if the text "returned error, retrying after" together with "connection lost" ever appears, I eagerly kill the process. This script is wrapped in a launchd job that tries to back up every few hours, so the retry will happen sometime later.

      child.stderr.on('data', (data) => {
        // unfortunately restic never recovers from this, so end early s.t. the
        // backup can be retried again sooner
        //
        // ref: https://github.com/restic/restic/issues/353
        if (data.includes('returned error, retrying after') && data.includes('connection lost')) {
          isConnectionLost = true
          reject(new Error('connection lost'))
        }
      })

Cantello commented Mar 14, 2024

Similar problems here, apparently also a Hetzner problem (using a 1 TB storage box over there), but it was working until 07 Mar 2024, then it broke. Some minutes into the backup (photos, ~350 GB) I get the usual:

Save(<data/c2a9ae6dd9>) returned error, retrying after 552.330144ms: Write: connection lost
Save(<data/c4bf517c04>) returned error, retrying after 720.254544ms: OpenFile: connection lost
Save(<data/0634acad0f>) returned error, retrying after 582.280027ms: OpenFile: connection lost
Save(<data/87ce95b26b>) returned error, retrying after 468.857094ms: OpenFile: connection lost
Save(<data/447de67d48>) returned error, retrying after 462.318748ms: OpenFile: connection lost
subprocess ssh: Connection to u123456.your-storagebox.de closed by remote host.
Fatal: unable to save snapshot: ssh command exited: exit status 255
error in cleanup handler: ssh command exited: exit status 255

The command line is unspectacular:
restic backup --exclude /mnt/Photos/iOS_Photos --exclude-caches --password-file /etc/resticprofile/password.txt --repo sftp:123456@u123456.your-storagebox.de:restic-backup --tag photos --verbose=3 /mnt/Photos

Did someone already ask over in the Hetzner forum or their tech support? Can't be a restic issue if it was working without any changes before, right?

edit: I changed the backend to rclone and it works now, even with the whole 380 GB. No idea what broke it; I only asked in Hetzner's forum, did anyone ask tech support yet?
