Crazy slow incremental backup over WAN #1420

Closed
cbf305 opened this issue Nov 10, 2017 · 8 comments
Labels
state: need feedback (waiting for feedback, e.g. from the submitter)

Comments


cbf305 commented Nov 10, 2017

restic version:

restic 0.7.3 (v0.7.3-235-gf39f7c76)
compiled with go1.9.1 on linux/amd64

I launched restic with the following:

restic -r sftp:backup:/storage/backups backup -p /root/.restic --files-from /root/.restic.conf

Contents of .restic.conf:

/root
/etc
/home
/opt
/var/log
/mnt/disk1
/mnt/disk2
/mnt/disk3
/mnt/disk4
/mnt/disk5

Output at start of app:

password is correct
scan [/root /etc /home /opt /var/log /mnt/disk1 /mnt/disk2 /mnt/disk3 /mnt/disk4 /mnt/disk5]
scanned 23834 directories, 189873 files in 0:01

Issue:
Restic is a sweet app. I've been using it for local backups for about 2 months. Since CrashPlan announced they are getting rid of their home plan, I wanted to switch to restic, since it's been working so well for local backups. I have approx 4 TB of data, the majority of which are videos and pictures, so very static data. I'm setting up a backup to a remote server, and my home connection is 100Mbps/10Mbps, so I took my machine to the server's location and did the initial backup there, so that my remote incremental backups would be substantially faster since I'm only sending the changes. BTW, the remote server's WAN connection is 400Mbps down/up. The initial backup took ~12 hours to complete over a 1Gbps LAN connection. The following day I ran one incremental manually just to be sure everything was backed up. The incremental took 32 minutes. It was slower than I was expecting for a net data change of zero, but I didn't really think twice about it since my backups will run while I am sleeping.

So I took my machine home and set up the cron job. Its first automated run, and first run over my 10Mbps connection, was last night. Since returning home, my data drives have only had 600MB of data added, so I was interested to compare that 32-minute incremental with the extra time the 600MB of data would take, which should have only added a few minutes. Well, it took a little over 10 hours to run :-( I loaded up htop and nload in a tmux session so I could see if it would saturate that 10Mbps limit (CrashPlan never did, but restic did :-). Once I saw restic kick off in htop, I headed to bed. I checked my screen this morning and when I saw it still running, I thought it was uploading everything again, so I looked at nload and it was hardly showing any I/O. I checked about every half hour until restic was gone from the processes and I had a new snapshot in the repo. I checked nload and the inbound/outbound totals were 701MB/2.17GB, respectively. There was hardly anything sent out, and certainly not enough to justify a 10-hour run time.

Other than the 600MB of data and the slower WAN connection, nothing has changed on my machine. Since it's sending so little data to the remote, it seems silly that the WAN speed would really be the cause of going from 32 minutes to 10 hours. What's even more bizarre is that I started a manual backup about 2 hours ago just to confirm what I saw last night, and it's on target to hit 10 hours again, this time with zero changes to the data, and so far the nload totals have only gone up a few MB. I also checked SMART on my disks since they were transported around and everything is fine, and the system temps were slightly above idle, so I really can't explain it.

I saw that issue #386 led to the inclusion of metadata caching, which was a result of a similar slow-WAN issue. Am I mistaken in comparing restic's incremental backups to CrashPlan's, i.e. in expecting them to complete much faster than they did?

Thanks for any suggestions!


fd0 commented Nov 10, 2017

Thanks for the report, that is really odd. Could you paste the complete output from the restic backup run until it starts saving files? Is there something about a parent snapshot in it?

Do you observe high CPU usage and/or disk I/O during the backup?

Which version of restic did you use to create the initial snapshot?

fd0 added the "state: need feedback" label Nov 10, 2017

cbf305 commented Nov 10, 2017

I saw the --parent option in the docs, but this is the only machine backing up to this repo, so I didn't specify it. I have never seen it list anything about a parent either. I started it manually again just now and snagged it with a screenshot tool so you can see the whole thing. Even though the ETA is only showing 3 hours, it will slowly tick up to 9.5-10 hours if I let it run for a couple of minutes.

The cores bounce around between 7-35%. iotop shows the disks are about idle before restic starts; once it's running, the read speeds average 115M/s, and earlier I observed spikes up to 145M/s and dips as low as 75M/s.

[screenshot: screenshot_20171110_134353]


cbf305 commented Nov 10, 2017

You piqued my curiosity :-) I ran it again, but specified the --parent option with the ID of the last snapshot. It did display "using parent" in the output and started with the best ETA I've seen yet at 2.5 hours, but after about a minute it was back up to the 10-hour mark.

[screenshot: screenshot_20171110_142220]
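For reference, the invocation was essentially the same command as before with --parent added; the snapshot ID below is just a placeholder, not one from my repo:

restic -r sftp:backup:/storage/backups backup -p /root/.restic --files-from /root/.restic.conf --parent 4a1b2c3d  # 4a1b2c3d = placeholder snapshot ID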


fd0 commented Nov 10, 2017

Ok, I think you've run into a version of the bug described in #1290; I'll try to reproduce it.


fd0 commented Nov 10, 2017

Hm, I'm not able to reproduce it. What should happen is that restic finds a parent (= previous) snapshot of the same set of directories. It should then load the metadata from that snapshot, detect that the files haven't been modified since, and only read and save new data. The metadata should also be automatically cached somewhere in ~/.cache/restic.

All that should be pretty robust, besides bugs such as #1290.

It looks like restic is reading all files locally again (which is why it takes so long), but after reading the data it detects that all the data is already saved in the repo, so it continues with the next file. That's not efficient, obviously.

Can you please rebuild restic and create a debug log (see the instructions in the manual) of restic running for a couple of minutes, and attach it to the issue? Please make sure to replace any sensitive details included in there (authentication tokens, passwords, etc.) before posting it.
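Roughly, the steps are something like the following (check the manual for the exact, current instructions, as they may differ):

go run build.go -tags debug  # build a restic binary with debug support
DEBUG_LOG=/tmp/restic-debug.log ./restic -r sftp:backup:/storage/backups backup -p /root/.restic --files-from /root/.restic.conf

Let it run for a few minutes, stop it, and attach /tmp/restic-debug.log after scrubbing anything sensitive.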


fd0 commented Nov 10, 2017

Can you paste the output of restic snapshots?
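For this repository that would be something like:

restic -r sftp:backup:/storage/backups snapshots -p /root/.restic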


cbf305 commented Nov 13, 2017

Sorry, I was away from my machine all weekend. Before I left I started a backup, since I'd been canceling all the runs after that first 10-hour incremental finished and I'd dumped some new stuff on the drives. I ran it with the same command posted in the last screenshot, where I specified --parent. It took 9 hrs and 47 minutes.

This morning I had about 10G of new stuff to dump on the drives, so I thought I'd try once more, but without specifying the --parent option. It fired off and, for the first time, it said "using parent ..." in the output. It started off very strong with an ETA of 10 minutes and a speed of ~6GiB/s, but slowed down when it hit the new stuff, as expected. Still, it maintained a healthy speed and was saturating my 10Mbps upload. Overall the backup completed in 3:23 (h:mm). Not too bad compared to the 10 hours with zero changes ;-)

For kicks I started a second backup, since nothing new had been added. It also displayed "using parent ..." and completed in 9:57 (m:ss) with an average speed of 5983MiB/s! Looks like your hypothesis was right in that it needed to reconcile all the files again for some strange reason. Once it did that, it picks up the parent correctly and the backups are wicked fast.

I guess let's close this issue; if this crops up again I'll grab the debug log and pass it along.

Thanks!


fd0 commented Nov 14, 2017

Nice, so it looks like there's additional room for improvement in the parent detection code. I'm going to close this issue for now; please report back if it recurs!

fd0 closed this as completed Nov 14, 2017