store less bytes thanks to backreferences

undup - compress files by consolidating duplicate data

undup tries to compress an input stream by watching for blocks that have
already appeared earlier in the stream and replacing each duplicate with a
backreference.  Integrity is verified by checking a SHA-256 hash of the
entire stream at reconstruction time.
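
The idea described above can be illustrated with a minimal Python sketch.
This is not undup's implementation or on-disk format (see SPEC for that); the
fixed 4096-byte block size, the in-memory list of operations, and the names
dedup and redup are all assumptions made for illustration only.

```python
import hashlib

BLOCK = 4096  # hypothetical block size, not taken from undup


def dedup(data: bytes):
    """Split data into blocks; emit literal data or a backreference per block.

    Returns (ops, digest) where each op is ("data", bytes) or
    ("ref", index_of_earlier_block), and digest is the SHA-256 of the
    whole input stream.
    """
    seen = {}                 # block hash -> index of first block with that content
    ops = []
    whole = hashlib.sha256()  # running hash over the entire stream
    for offset in range(0, len(data), BLOCK):
        block = data[offset:offset + BLOCK]
        whole.update(block)
        key = hashlib.sha256(block).digest()
        if key in seen:
            ops.append(("ref", seen[key]))   # duplicate: store only a reference
        else:
            seen[key] = offset // BLOCK
            ops.append(("data", block))      # first occurrence: store the bytes
    return ops, whole.hexdigest()


def redup(ops, digest: str) -> bytes:
    """Rebuild the stream and verify its SHA-256 against the stored digest."""
    blocks = []
    for kind, value in ops:
        blocks.append(value if kind == "data" else blocks[value])
    out = b"".join(blocks)
    if hashlib.sha256(out).hexdigest() != digest:
        raise ValueError("SHA-256 mismatch: reconstructed stream is corrupt")
    return out
```

For an input such as b"A" * 8192 + b"B" * 4096 + b"A" * 4096, the sketch
stores two literal blocks and two backreferences, and redup raises an error
if the reconstructed stream does not match the recorded hash.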

undup is intended to be pipelined with a general-purpose compressor such as
gzip, bzip2, or xz.

USAGE
-----

tar cf - dir | undup | xz > dir.tar.undup.xz
xzcat dir.tar.undup.xz | undup -d -o dir.tar; tar xf dir.tar

SAMPLE RESULTS
--------------

% for r in 3.0 3.1 3.2 3.3-rc1; do
    git archive --format=tar --prefix=linux-$r/ v$r | tar -C /tmp/linuxes -xf -
done
% tar -C /tmp -cf linuxes.tar linuxes
% du -shc /tmp/linuxes/*
500M    /tmp/linuxes/linux-3.0
504M    /tmp/linuxes/linux-3.1
511M    /tmp/linuxes/linux-3.2
518M    /tmp/linuxes/linux-3.3-rc1
2.0G    total

File sizes:

1833635840   linuxes.tar
 937173504   linuxes.tar.undp
 404399664   linuxes.tar.gz
 316914845   linuxes.tar.bz2
 270460412   linuxes.tar.xz
 203023371   linuxes.tar.undp.gz
 167099750   linuxes.tar.lrz
 159673153   linuxes.tar.undp.bz2
 138929420   linuxes.tar.undp.xz


format   ratio    pipelined w/ undup
------   -----    ------------------
undp      1.95
gzip      4.53       9.03
bzip2     5.78      11.48
xz        6.78      13.19
lrzip    10.97
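
The ratios in the table are the size of linuxes.tar divided by the size of
each compressed file.  The short Python sketch below recomputes three of them
from the byte counts listed under "File sizes"; the helper name ratio is
hypothetical, and the last digit can differ slightly from the table depending
on whether a value is rounded or truncated.

```python
def ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Compression ratio: original size divided by compressed size."""
    return original_bytes / compressed_bytes


ORIGINAL = 1833635840  # linuxes.tar

# Byte counts copied from the "File sizes" listing.
sizes = {
    "linuxes.tar.gz": 404399664,
    "linuxes.tar.undp.gz": 203023371,
    "linuxes.tar.lrz": 167099750,
}

for name, size in sizes.items():
    print(f"{name:22s} {ratio(ORIGINAL, size):6.2f}")
```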

Timings for undup + compressors on Core i7 L 640 @ 2.13 GHz (2.9 GHz Turbo):

First, we time the undup phase.  This consumes a significant amount
of memory (for undup 0.2, about 105 MB of RAM to store hashes for the
1.8 GB linuxes.tar) and can be pipelined, but to get the most
reproducible timing results, we've run each phase separately.

undup linuxes.tar 47.26s user 4.15s system 97% cpu 52.885 total

Second, we compare times for various compressors to compress
linuxes.tar.undp.

gzip   35.81s user 0.72s system 96% cpu 37.817 total
bzip2 117.79s user 0.45s system 99% cpu 1:58.66 total
xz    606.51s user 1.31s system 99% cpu 10:09.72 total

undup + bzip2 achieves an 11.48x compression ratio while consuming only about
165 seconds of user CPU time (47.26s for undup plus 117.79s for bzip2).
Running the two as a pipeline takes a comparable amount of time, about 3:15
elapsed:

undup 59.64s user 3.93s system 32% cpu 3:14.76 total
bzip2 138.65s user 1.05s system 71% cpu 3:14.73 total

This compares favorably to lrzip 0.608, which achieves a 10.97x ratio after
consuming 913 seconds of user CPU time (lrzip is multithreaded by default):

lrzip -v -w 10 linuxes.tar 913.08s user 14.99s system 298% cpu 5:10.78 total