Parallel, indexed xz compressor
C Shell
Pull request Compare This branch is 1 commit ahead, 118 commits behind vasi:master.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.


Pixz (pronounced 'pixie') is a parallel, indexing version of XZ.

The existing XZ Utils ( ) provide great compression in the .xz file format, but they have two significant problems:

* They are single-threaded, while most users nowadays have multi-core computers.
* The .xz files they produce are just one big block of compressed data, rather than a collection of smaller blocks. This makes random access to the original data impossible.

With pixz, both these problems are solved. The most useful commands:

$ pixz foo.tar foo.tpxz         # Compress and index a tarball, multi-core
$ pixz -l foo.tpxz              # Very quickly list the contents of the compressed tarball
$ pixz -d foo.tpxz foo.tar      # Decompress it, multi-core
$ pixz -x dir/file < foo.tpxz | tar x   # Very quickly extract a file, multi-core.
                                        # Also verifies that contents match index.

$ tar -Ipixz -cf foo.tpxz foo           # Create a tarball using pixz for multi-core compression

$ pixz bar bar.xz           # Compress a non-tarball, multi-core
$ pixz -d bar.xz bar        # Decompress it, multi-core

Specifying input and output:

$ pixz < foo.tar > foo.tpxz     # Same as 'pixz foo.tar foo.tpxz'
$ pixz -i foo.tar -o foo.tpxz   # Ditto. These both work for -x, -d and -l too, eg:

$ pixz -x -i foo.tpxz -o foo.tar file1 file2 ... # Extract the files from foo.tpxz into foo.tar

$ pixz foo.tar                  # Compress it to foo.tpxz, removing the original
$ pixz -d foo.tpxz              # Extract it to foo.tar, removing the original

Other flags:

$ pixz -1 foo.tar           # Faster, worse compression
$ pixz -9 foo.tar           # Better, slower compression 

$ pixz -t foo.tar           # Compress but don't treat it as a tarball (don't index it)
$ pixz -d -t foo.tpxz       # Decompress foo, don't check that contents match index
$ pixz -l -t foo.tpxz       # List the xz blocks instead of files

WARNING: Running pixz without the -t flag will cause it to treat the input as a tarball, as long as it looks vaguely tarball-like. This means if the file starts with at least 1024 zero bytes, pixz will assume it's empty, and truncate the output! If your input files aren't tarballs, run with -t or face possible data-loss.

Compare to:
        * About equally complex, efficient
        * lzip format seems less-used
        * Version 1 is theoretically indexable...I think
        * Python, much simpler
        * More flexible, supports arbitrary compression programs
        * Uses streams instead of blocks, not indexable
        * Splits input and then combines output, much higher disk usage 
        * Simpler code
        * Uses OpenMP instead of pthreads
        * Uses streams instead of blocks, not indexable
        * Uses temp files and doesn't combine them until the whole file is compressed, high disk/memory usage

Comparable tools for other compression algorithms:
        * Not indexable
        * Appears slow
        * bzip2 algorithm is non-ideal
        * Not indexable
        * Not parallel

    * libarchive 2.8 or later
    * liblzma 4.999.9-beta-212 or later (from the xz distribution)