archive and restore sparse files #79

Open
fd0 opened this Issue Jan 26, 2015 · 12 comments

@fd0
Member

fd0 commented Jan 26, 2015

description from the gnu tar manual: https://www.gnu.org/software/tar/manual/html_node/sparse.html

@jannic

Contributor

jannic commented Apr 10, 2015

At least on Linux, the description in the tar manual is outdated: the part "in order to determine if the file is sparse tar has to read it before trying to archive it" is no longer true. The lseek() whence value SEEK_HOLE allows finding holes without reading the whole file.

Using this from golang would involve some low-level non-portable code, but I guess for files with very large holes it could be an important optimization.

@fd0

Member

fd0 commented Apr 10, 2015

Thanks for the hint. I have some more optimizations in mind that may only work on Linux. In general, I think it's okay to implement these, as long as the other platforms are not degraded in functionality.

@andrewchambers


andrewchambers commented Apr 10, 2015

Long runs of zeros will be compressed perfectly anyway.

@jannic

Contributor

jannic commented Apr 11, 2015

Correct, detecting holes doesn't help reduce the backup size. It can still improve performance, as reading large amounts of zeroes and then compressing them takes longer than just skipping over them.
More importantly: when restoring files which had large holes, it's very useful to actually recreate those holes instead of writing zeroes. It obviously saves space, and in some cases restoring a backup without recreating holes might even lead to disk-full errors.

@andrewchambers


andrewchambers commented Apr 11, 2015

There is a technical tradeoff here: is it worth complicating the code for something that won't matter 99 percent of the time? I also think that actually restoring backups is far rarer than storing them. I don't mind either way; I just want to point out that adding more code hurts reliability as a whole.

@cfcs


cfcs commented Jun 26, 2015

We had some relevant discussion here: #117
(TL;DR: fallocate() is superior to posix_fallocate() - if fallocate() returns EOPNOTSUPP for a file, that result should be cached and regular zeroes should be written instead of attempting additional fallocate() calls)

Documentation for glibc's lseek() - used when detecting sparse ranges

I think the main benefit from supporting sparse files would be in the restore scenario, so I'm going to be dealing with that use-case in this comment. Detecting holes upon backup saves some syscalls, while in the restore scenario we're saving IO, which is usually much slower.

Supporting sparse files would allow us to save space even if the original file was not sparse, or if the filesystem did not support sparse ranges. I suspect there are many cases where files are not correctly made sparse, so this could potentially yield some surprising results upon restoration of a snapshot.
I feel that implementation would be relatively straightforward and that this feature is definitely worth the (relatively low) complexity added.

Documentation for glibc's fallocate() - used when making sparse ranges

Here's an example call to fallocate() on Linux:

#define _GNU_SOURCE   /* fallocate() is Linux-specific */
#include <errno.h>
#include <fcntl.h>

/* FALLOC_FL_PUNCH_HOLE must be ORed with FALLOC_FL_KEEP_SIZE;
   fallocate() returns -1 on failure and sets errno */
int result = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                       current_offset, length_of_zeroes);
if (result == -1) {
    switch (errno) {
    case ENOSYS:     /* stop trying to use fallocate() for this kernel / program invocation */
        break;
    case EOPNOTSUPP: /* stop trying to use fallocate() for this file/filesystem */
        break;
    default:
        break;
    }
}

Making sparse files only makes sense on kernels/filesystems that support them.
The alternative is to simply write the zeroes as usual. If support for restoring a file to stdout is implemented in the future, the fallocate() call would fail with EOPNOTSUPP, consistent with the desired behaviour. The price of detection is one failed syscall per file (or per filesystem, really) -- I personally think that's worth it.

Don't know how relevant this is, but for Windows: StackOverflow discussion on sparse files in NTFS and MSDN documentation on sparse files in NTFS

Side-note: It would be interesting to investigate how many consecutive null blocks are required before a noticeable performance boost is achieved.

@pdf


pdf commented Sep 5, 2015

@andrewchambers file-backed VM images and similar are quite common.

@yatesco


yatesco commented Jan 8, 2016

+1, backing up VM images is my primary use-case for finding a block-level deduplicating tool.

@fd0 fd0 added the feature label Jan 8, 2016

@dsommers


dsommers commented Apr 2, 2016

It is important to be able to restore sparse files correctly. I've seen backup tools fail to do that, causing havoc on restore. I don't recall all the details now, but I believe it was a backup of /var/log where /var/log/lastlog was a sparse file of 265GB. Of course lastlog doesn't use all that space in the vast majority of cases. But restoring /var/log onto an 80GB partition caused a nice explosion, as the restore insisted on writing every single zero in that file.

@abourget

This comment has been minimized.

Copy link

abourget commented Aug 4, 2018

Does restic support sparse files today?

@fd0

Member

fd0 commented Aug 5, 2018

No, it does not.
