Slow backup speed for large (incremental) backups #386

Closed
yatesco opened this Issue Jan 8, 2016 · 26 comments

@yatesco

yatesco commented Jan 8, 2016

Hi,

I notice that restic doesn't seem to be optimised for unchanged files, which makes backing up very large repos (~500 GB, for example, with some large files) quite time-consuming.

Not sure if this would work internally, but you must be keeping a map from file->chunks, so couldn't you detect that a file is the "same" (e.g. same hash, access date/time and file size) and then simply clone the file->chunks map, which should be significantly quicker?
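
A minimal sketch of the kind of check being suggested here (hypothetical names and types, not restic's actual code): compare the stored size and modification time, and reuse the previous file->chunks mapping when they still match.

```go
package sketch

import (
	"os"
	"time"
)

// CachedEntry is a hypothetical record kept from the previous backup run.
type CachedEntry struct {
	Size    int64
	ModTime time.Time
	Chunks  []string // IDs of the chunks the file was split into last time
}

// chunksFor returns the cached chunk list when size and mtime still match,
// meaning the file does not need to be re-read and re-chunked.
func chunksFor(path string, cache map[string]CachedEntry) ([]string, bool) {
	fi, err := os.Stat(path)
	if err != nil {
		return nil, false
	}
	entry, ok := cache[path]
	if !ok || entry.Size != fi.Size() || !entry.ModTime.Equal(fi.ModTime()) {
		return nil, false // changed or unknown: fall back to reading the file
	}
	return entry.Chunks, true
}
```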

I hesitate to raise this because having seen the calibre of this tool I can't believe this hasn't already occurred to you :-).

Thanks!

@yatesco

yatesco commented Jan 8, 2016

And apologies for the [FR] notation in the title - I notice you add lots of nice labels, but when I raise an issue I don't get an option to add labels.


@fd0

fd0 (Member) commented Jan 8, 2016

That's no problem: only members of the restic organisation can add labels, so I'll gladly do that for you (and remove the redundant bits from the title).

Restic does indeed already check whether files have changed, but since we do not yet hold any local state, the information needs to be fetched from the repository. That's what makes it slow at the moment. I'm planning to add a local metadata cache (file names, modification timestamps, lists of blobs, etc.); that should improve the speed a lot!

Please note that interrupted backups are already resumed, at least concerning transferred data. Restic regularly creates checkpoints for new data that needs to be saved in the repo, so most data is not transferred twice even when a backup is interrupted.

I consider the slow speed for incremental (=not the first) backups a bug.


@fd0 fd0 changed the title from [FR] Optimise unchanged files to Optimise unchanged files Jan 8, 2016

@fd0 fd0 changed the title from Optimise unchanged files to Slow backup speed for large backups Jan 8, 2016

@fd0 fd0 changed the title from Slow backup speed for large backups to Slow backup speed for large (incremental) backups Jan 8, 2016

@fd0 fd0 added the bug label Jan 8, 2016

@yatesco

yatesco commented Jan 8, 2016

Thanks for the clarity - that makes sense.


@viric

viric (Contributor) commented Jan 22, 2016

So, the bottleneck is not the I/O operation on the source disk, but the need to retrieve the index from the repository. Is that right?


@fd0

fd0 (Member) commented Jan 22, 2016

Ah, not exactly. All index files are read in bulk at the start of the backup operation, so that's no problem, and restic already knows which data blobs have been saved.

However, for each directory, a JSON data structure has to be fetched from the repository; I suspect that the latency of this operation causes the slowdown.

I have not yet had the time to look at the problem in detail, but I have the following solutions in mind:

  • Having a local cache of all tree objects that is consulted instead of the backend. When a tree is missing, it is retrieved from the backend and also saved in the local cache (a rough sketch of this idea follows after this list).
  • What's basically needed during an "incremental" backup is a stream of tree objects describing the previous state of the directories, in the correct order. This stream could be created by a backup operation and used in the next backup; this is similar to the local cache, but the list is already sorted and can be fed directly into the archiving code.
  • Tune the archive code to pre-fetch more aggressively and/or add more concurrency.
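
A rough sketch of the first option above, a local tree-object cache consulted before the backend (hypothetical types and method names, not restic's actual API):

```go
package sketch

// Backend stands in for the (possibly remote, high-latency) repository.
type Backend interface {
	LoadTree(id string) ([]byte, error)
}

// CachedTreeLoader remembers every tree object it has seen locally.
type CachedTreeLoader struct {
	be    Backend
	cache map[string][]byte // a real implementation would keep this on disk
}

func NewCachedTreeLoader(be Backend) *CachedTreeLoader {
	return &CachedTreeLoader{be: be, cache: make(map[string][]byte)}
}

// LoadTree serves a tree from the local cache when possible; on a miss it
// fetches the tree from the backend and stores it for the next backup.
func (l *CachedTreeLoader) LoadTree(id string) ([]byte, error) {
	if buf, ok := l.cache[id]; ok {
		return buf, nil
	}
	buf, err := l.be.LoadTree(id)
	if err != nil {
		return nil, err
	}
	l.cache[id] = buf
	return buf, nil
}
```
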
@viric

viric (Contributor) commented Jan 22, 2016

Ah, so the index tells which data blobs have been saved, but says nothing about the files already saved (the file tree metadata). I thought the index was in fact a list of files/sizes/timestamps.


@fd0

fd0 (Member) commented Jan 22, 2016

That explains your confusion ;)

Did you find the design document yet? https://github.com/restic/restic/blob/master/doc/Design.md

It explains the data structures and terminology quite well.


@Bregor

Bregor commented Feb 14, 2016

The repo doesn't even need to be as large as 500 GB.
For example, an initial backup of 50 GB to S3 (server somewhere in Germany, S3 DC in Frankfurt) takes about 15 minutes.
A second backup of the same files takes two hours.


@krzysztofantczak

krzysztofantczak commented Apr 27, 2016

I can confirm that, with an even smaller amount of data - 500 MB here ;-) The first time it took about 39 seconds. Now it's running a second time, and the ETA sometimes shows over 5 hours... ;/


@fd0 fd0 self-assigned this Sep 4, 2016

@Quantum-Studio-iOS

Quantum-Studio-iOS commented Sep 17, 2016

I can confirm this behaviour. I have a Minio server setup and a rather large directory and file tree, ~6 GB.

The initial backup took 12 minutes.
I removed one file and now the restic backup ETA is 13 hours!

[7:10] 0.90%  143.798 KiB/s  60.384 MiB / 6.576 GiB  1693 / 128556 items  ... ETA 13:12:04

Is there anything we can do locally to solve the issue? Or is restic, as of yet, not ready for big directory trees?


@fd0

fd0 (Member) commented Sep 17, 2016

You can try using restic backup --force, which will re-read all data locally, but won't load any metadata of a previous backup from the repo. This helps with high-latency (non-local) repos. If this is much faster, you've run into the known limitation restic has at the moment: metadata is fetched from the repo and not (yet) cached locally.


@aspcartman

aspcartman commented Sep 17, 2016

@fd0
If --force is used, does my backup stop being incremental?
May I suggest downloading the metadata for the whole directory tree instead of making separate requests for each directory - or is this exactly what you mean by 'cached'?


@fd0

fd0 (Member) commented Sep 17, 2016

If --force is used, does my backup stop being incremental?

No, restic won't store duplicate data in the repo, so the backup is still incremental.

May I suggest downloading the metadata for the whole directory tree instead of making separate requests for each directory - or is this exactly what you mean by 'cached'?

That's what I meant. Unfortunately, the way the repo is structured, the tree objects (which store metadata such as filenames, timestamps, etc.) are scattered across small files in the repo. But I'm working on that.


@aspcartman

aspcartman commented Sep 17, 2016

@fd0
Thanks, but then I'm having a hard time understanding what --force does.
I thought that for a backup to be incremental it needs to fetch the metadata of the existing backup, compare, and upload the changes. If --force "will re-read all data locally, but won't load any metadata of a previous backup from the repo", then how can it still be incremental? And if it is still incremental, what is the actual difference with this option, and why do we fetch the metadata if not for "incrementalism"?


@fd0

fd0 (Member) commented Sep 17, 2016

OK, no problem, let me explain: restic does not really distinguish between full and incremental backups.

restic works by splitting files into blobs of data with a CDC algorithm (details). Each blob is identified by its SHA256 hash and is only saved once in the repo. So if you have two files with the same first blob but different endings, the first blob will only be saved once. In order to do this splitting, restic needs to read all data from all files. For each directory, the entries are saved in a tree blob. For files, this means the file name, the list of blobs the file consists of, and metadata such as the mod time, change time and access time are saved.

For the first backup, all data is read and all blobs are stored in the repo.

For the second backup (without --force), restic checks the repo and finds that there already is a snapshot of that particular directory. It will then load the metadata for the directories and files and for each file check the mod time. If it did not change, the old list of blobs is taken and the data stored in the file is not read again (because it was not modified).

For a backup with --force, restic does not load any metadata and will read all data from all files again. It will, however, only store new blobs, so it is still kind of "incremental".

The main difference is that all data is read again with --force. As a byproduct, restic is faster for non-local remotes, because it doesn't need to load (and wait for) any data from the repo during the backup process.

Does that answer your question? Please feel free to post any followup questions.
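
To make the dedup step concrete, here is a minimal Go sketch of the idea (not restic's implementation - restic uses content-defined chunking, while this uses fixed 1 MiB chunks for brevity): each chunk is identified by its SHA-256 hash and only stored if it hasn't been seen before.

```go
package sketch

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
	"os"
)

// storeFile splits a file into chunks, hashes each chunk with SHA-256 and only
// "stores" chunks that have not been seen before. In a real backup, seen would
// be populated from the repository index.
func storeFile(path string, seen map[string]bool, store func(id string, data []byte) error) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var ids []string
	buf := make([]byte, 1<<20) // fixed 1 MiB chunks (simplification)
	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			sum := sha256.Sum256(buf[:n])
			id := hex.EncodeToString(sum[:])
			ids = append(ids, id)
			if !seen[id] {
				// only new blobs are uploaded; duplicates are skipped
				if err := store(id, append([]byte(nil), buf[:n]...)); err != nil {
					return nil, err
				}
				seen[id] = true
			}
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break
		}
		if err != nil {
			return nil, err
		}
	}
	return ids, nil
}
```

In this sketch the seen set is still consulted even when every file is re-read, which mirrors why a --force run stays "incremental" in terms of uploaded data.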


@yhafri

yhafri commented Dec 2, 2016

I'm facing this very same issue with a small 15 MB directory.
The first backup was quick, less than 10 seconds. The second one is still running (a couple of minutes so far).

This is an old issue. Is it fixed?


@fd0

fd0 (Member) commented Dec 3, 2016

@yhafri This issue is not yet fixed. Running restic backup --force should fix it for you.

Can you describe the structure of the small directory? This sounds like a great test case...


@yhafri

yhafri commented Dec 3, 2016

I can confirm that --force fixed my issue.

Here is what I did for testing:

$ mkdir t && cd t
$ for i in `seq 1 15`; do dd if=/dev/urandom of=$i.bin bs=1M count=1; done
$ du -hs .
15M	.
$ restic version
restic 0.3.0 (v0.3.0-38-g505a209)
compiled at 2016-11-30 04:53:47 with go1.7.3 on darwin/amd64

@fd0

fd0 (Member) commented Dec 3, 2016

Hm. I have a hunch that this is something different. As outlined above, the slowdown comes from the metadata that is read from the repo for each directory. For the test directory, this means one tree object is fetched. That shouldn't take long.


@fd0

fd0 (Member) commented Dec 3, 2016

Could you please create a debug log and post it somewhere?


@russelldavies

russelldavies commented Jan 29, 2017

For what it's worth, I cannot replicate the performance problems @yhafri is seeing. My remote repo was using SFTP, not Minio, which leads me to believe it's not the tree-object-fetching problem (in the debug log I only saw one instance of got tree node for..., which makes sense).

On the other hand, creating a snapshot of a directory that had over 12,000 subdirectories took much longer than the initial run, because there were >12k tree node requests to the remote storage.


@viric

viric (Contributor) commented Jan 30, 2017

The remote storage should only be asked for packs, not blobs. 12k subdirectories may fit into just a very few packs, shouldn't they?


@fd0

fd0 (Member) commented Jan 30, 2017

@viric That's not correct, unfortunately. At the moment, there is no local cache for tree objects or other metadata, so each tree is fetched by itself, on demand, from the remote repo. I'm planning to add a local cache, but that hasn't happened yet.


@russelldavies

russelldavies commented Jan 30, 2017

@fd0 Re the local cache work, is there some private branch you've started, or design documentation? I'll probably have some free time this weekend, so I could help out on this feature.


@fd0

fd0 (Member) commented Jan 30, 2017

I don't have anything that's working yet, sorry. I'll let you know, and you can subscribe to #29 to track progress.


@fd0

fd0 (Member) commented Sep 29, 2017

We've recently merged #1040, which adds a local metadata cache. It resolves this issue, so I'm closing it.


@fd0 fd0 closed this Sep 29, 2017
