archive and restore sparse files #79

Open
fd0 opened this Issue Jan 26, 2015 · 12 comments

@fd0
Member

fd0 commented Jan 26, 2015

description from the gnu tar manual: https://www.gnu.org/software/tar/manual/html_node/sparse.html

@jannic

Contributor

jannic commented Apr 10, 2015

At least on Linux, the description in the tar manual is outdated: the part "in order to determine if the file is sparse tar has to read it before trying to archive it" is no longer true. The lseek() whence value SEEK_HOLE allows finding holes without reading the whole file.

Using this from golang would involve some low-level non-portable code, but I guess for files with very large holes it could be an important optimization.

@fd0

Member

fd0 commented Apr 10, 2015

Thanks for the hint. I have some more optimizations in mind that may only work on Linux. In general, I think it's okay to implement these, as long as the other platforms are not degraded in functionality.

@andrewchambers


andrewchambers commented Apr 10, 2015

Long runs of zeros will be compressed perfectly anyway.

@jannic

Contributor

jannic commented Apr 11, 2015

Correct, detecting holes doesn't help reduce the backup size. It can still improve performance, as reading large amounts of zeroes and then compressing them takes longer than just skipping over them.
More importantly: when restoring files which had large holes, it's very useful to actually recreate those holes instead of writing zeroes. It obviously saves space, and in some cases restoring a backup without recreating holes might even lead to disk-full errors.

@andrewchambers


andrewchambers commented Apr 11, 2015

There is a technical tradeoff here: is it worth complicating the code for something that won't matter 99 percent of the time? I also think that actually restoring backups is far rarer than storing them. I don't mind either way; I just want to point out that adding more code hurts reliability as a whole.

@cfcs


cfcs commented Jun 26, 2015

We had some relevant discussion here: #117
(TL;DR: fallocate() is superior to posix_fallocate() - if fallocate() returns EOPNOTSUPP for a file, that result should be cached and regular zeroes should be written instead of attempting additional fallocate() calls)

Documentation for glibc's lseek() - used when detecting sparse ranges

I think the main benefit from supporting sparse files would be in the restore scenario, so I'm going to be dealing with that use-case in this comment. Detecting holes upon backup saves some syscalls, while in the restore scenario we're saving IO, which is usually much slower.

Supporting sparse files would allow us to save space even if the original file was not sparse, or if the filesystem did not support sparse ranges. I suspect there are many cases where files are not correctly made sparse, so this could potentially yield some surprising results upon restoration of a snapshot.
I feel that implementation would be relatively straightforward and that this feature is definitely worth the (relatively low) complexity added.

Documentation for glibc's fallocate() - used when making sparse ranges

Here's an example call to fallocate() on Linux:

#define _GNU_SOURCE   /* fallocate() is Linux-specific */
#include <errno.h>
#include <fcntl.h>

/* FALLOC_FL_PUNCH_HOLE must be ORed with FALLOC_FL_KEEP_SIZE;
   fallocate() returns -1 on failure and sets errno */
int result = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                       current_offset, length_of_zeroes);
if (result == -1) {
    switch (errno) {
    case ENOSYS:     /* stop trying to use fallocate() for this kernel / program invocation */
        break;
    case EOPNOTSUPP: /* stop trying to use fallocate() for this file/filesystem */
        break;
    default:
        break;
    }
}

Making sparse files only makes sense on kernels/filesystems that support them.
The alternative is to simply write the zeroes as usual. If support for restoring a file to stdout is implemented in the future, the fallocate() call would fail with EOPNOTSUPP, consistent with the desired behaviour. The price of detection is one failed syscall per file (or per filesystem, really) -- I personally think that's worth it.

Don't know how relevant this is, but for Windows: StackOverflow discussion on sparse files in NTFS and MSDN documentation on sparse files in NTFS

Side-note: It would be interesting to investigate how many consecutive null blocks are required before a noticeable performance boost is achieved.

@pdf


pdf commented Sep 5, 2015

@andrewchambers file-backed VM images and similar are quite common.

@yatesco


yatesco commented Jan 8, 2016

+1, backing up VM images is my primary use-case for finding a block-level deduplicating tool.

@fd0 fd0 added the feature label Jan 8, 2016

@dsommers


dsommers commented Apr 2, 2016

It is important to be able to restore sparse files correctly. I've seen backup tools fail to do that, causing havoc on restore. I don't recall all the details now, but I believe it was a backup of /var/log where /var/log/lastlog was a sparse file of 265GB. Of course lastlog doesn't use all that space in the vast majority of cases. But restoring /var/log onto an 80GB partition caused a nice explosion, as the restore insisted on writing every single zero in that file.

@abourget

This comment has been minimized.

Copy link

abourget commented Aug 4, 2018

Does restic support sparse files today?

@fd0

Member

fd0 commented Aug 5, 2018

No, it does not.
