archive and restore sparse files #79

Closed
fd0 opened this issue Jan 26, 2015 · 16 comments · Fixed by #3854

@fd0
Member

fd0 commented Jan 26, 2015

description from the gnu tar manual: https://www.gnu.org/software/tar/manual/html_node/sparse.html

@jannic
Contributor

jannic commented Apr 10, 2015

At least on Linux, the description in the tar manual is outdated: the claim that "in order to determine if the file is sparse tar has to read it before trying to archive it" is no longer true. The lseek() whence value SEEK_HOLE allows finding holes without reading the whole file.

Using this from golang would involve some low-level non-portable code, but I guess for files with very large holes it could be an important optimization.

@fd0
Member Author

fd0 commented Apr 10, 2015

Thanks for the hint. I have some more optimizations in mind that may only work on Linux. In general, I think it's okay to implement such things, as long as functionality on the other platforms is not degraded.

@andrewchambers

Long runs of zeros will be compressed perfectly anyway.

@jannic
Contributor

jannic commented Apr 11, 2015

Correct, detecting holes doesn't help reduce the backup size. It can still improve performance, as reading large amounts of zeroes and then compressing them takes longer than simply skipping over them.
But what's more important: when restoring files which had large holes, it's very useful to actually recreate those holes instead of writing zeroes. It obviously saves space, and in some cases restoring the backups without recreating holes might even lead to disk-full errors.

@andrewchambers

There is a technical tradeoff here: do we complicate the code for something that won't be important 99 percent of the time? I also think actually restoring backups is far rarer than storing them. I don't mind either way, I just want to point out that adding more code hurts reliability as a whole.

@cfcs

cfcs commented Jun 26, 2015

We had some relevant discussion here: #117
(TL;DR: fallocate() is superior to posix_fallocate(); if fallocate() fails with EOPNOTSUPP for a file, that result should be cached and regular zeroes should be written instead of attempting further fallocate() calls)

Documentation for glibc's lseek() - used when detecting sparse ranges

I think the main benefit from supporting sparse files would be in the restore scenario, so I'm going to be dealing with that use-case in this comment. Detecting holes upon backup saves some syscalls, while in the restore scenario we're saving IO, which is usually much slower.

Supporting sparse files would allow us to save space even if the original file was not sparse, or if the filesystem did not support sparse ranges. I suspect there are many cases where files are not correctly made sparse, so this could potentially yield some surprising results upon restoration of a snapshot.
I feel that implementation would be relatively straightforward and that this feature is definitely worth the (relatively low) complexity added.

Documentation for glibc's fallocate() - used when making sparse ranges

Here's an example call to fallocate() on Linux:

#define _GNU_SOURCE
#include <fcntl.h>
#include <errno.h>

/* On Linux, FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE. */
int result = fallocate( fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                        current_offset, length_of_zeroes );
if( result == 0 ){
    /* success */
} else {
    switch( errno ){
        case ENOSYS: break;     /* stop trying to use fallocate() for this kernel / program invocation */
        case EOPNOTSUPP: break; /* stop trying to use fallocate() for this file/filesystem */
        default: break; } }

Making sparse files only makes sense on kernels / filesystems that support them.
The alternative is to simply write the zeroes as usual. If support for restoring a file to stdout were implemented in the future, the fallocate() call would fail with an EOPNOTSUPP error, consistent with what you'd want. The price of detection is one failed syscall per file (or per filesystem, really) -- I personally think that's worth it.

Don't know how relevant this is, but for Windows: StackOverflow discussion on sparse files in NTFS and MSDN documentation on sparse files in NTFS

Side-note: It would be interesting to investigate how many consecutive null blocks are required before a noticeable performance boost is achieved.

@pdf

pdf commented Sep 5, 2015

@andrewchambers file-backed VM images and similar are quite common.

@yatesco

yatesco commented Jan 8, 2016

+1 backing up VM backups/images is my primary use-case for finding a block-level deduplicating tool.

@fd0 fd0 added the feature label Jan 8, 2016
@dsommers

dsommers commented Apr 2, 2016

It is important to be able to restore sparse files correctly. I've experienced backup tools that didn't, causing havoc on restore. I don't recall all the details now, but I believe it was related to a backup of /var/log, where /var/log/lastlog was a sparse file of 265GB. Of course, lastlog doesn't use all that space in the vast majority of cases. But restoring /var/log onto an 80GB partition caused a nice explosion, as the restore insisted on writing every single 0 in that file.

@fd0
Member Author

fd0 commented May 3, 2016

@abourget

abourget commented Aug 4, 2018

Does restic support sparse files today?

@fd0
Member Author

fd0 commented Aug 5, 2018

No, it does not.

@rbott

rbott commented Nov 25, 2019

Just some additional information:
Some services (e.g. Apache Kafka) use sparse files within their storage engines. If you do regular restores for testing or as part of a workflow, restoring all sparse files to their full size makes that rather complicated or impossible (see @dsommers' comment).

Core dump files are also sparse files (although I am not sure if anyone wants to backup/restore these).

Being able to restore a sparse file exactly as backed up seems mandatory from our point of view.


@andy108369

Looks like PR #3854 adds support for the restore operation only.
I think this issue should remain open, as it also covers the backup (archive) part.

Ref. #3914
