# histzip

Hey! Thanks for your interest, but consider using something like brotli or zstd instead. Those modern compressors also compress quickly with long history windows (the problem histzip was written to solve) but are much better supported; you should be able to achieve better speed and compression ratios with them as well.


Compress wiki change histories and similar inputs: anything with long duplicated sections (100+ bytes) repeated up to a few MB apart.

Download binaries for Linux amd64, Linux x86, Windows 64-bit and 32-bit, and Mac 64-bit. For faster compression try the gccgo Linux amd64 build.

Compress by piping text through histzip and bzip2 or similar:

```sh
./histzip < revisions.xml | bzip2 > revisions.xml.hbz
```

Turn that around to decompress:

```sh
bunzip2 < revisions.xml.hbz | ./histzip > revisions.xml
```

Running on dumps of English Wikipedia's history, that pipeline ran at 51 MB/s for the newest chunk and 151 MB/s for the oldest. Compression ratios were comparable to 7zip's: 8% worse for the new chunk and 10% better for the old chunk.

While compressing, histzip decompresses its output and compares checksums as a self-check. There are write-ups of the framing format and the format for compressed data. You can use the same compression engine in other programs via the histzip/lrcompress library.
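That round-trip self-check pattern — decompress what you just compressed and compare checksums against the original input — can be sketched with Go's standard library. This is a minimal illustration using gzip as a stand-in compressor; histzip's actual engine and framing format differ:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"hash/crc32"
	"io"
)

// compressWithCheck compresses data and, as a self-check, immediately
// decompresses the result and compares CRC-32 checksums of the input
// and the round-tripped output before returning the compressed bytes.
func compressWithCheck(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	// Self-check: read back the stream we just wrote.
	zr, err := gzip.NewReader(bytes.NewReader(buf.Bytes()))
	if err != nil {
		return nil, err
	}
	roundTrip, err := io.ReadAll(zr)
	if err != nil {
		return nil, err
	}
	if crc32.ChecksumIEEE(data) != crc32.ChecksumIEEE(roundTrip) {
		return nil, fmt.Errorf("self-check failed: checksum mismatch")
	}
	return buf.Bytes(), nil
}

func main() {
	in := bytes.Repeat([]byte("a long repeated section of revision history\n"), 100)
	out, err := compressWithCheck(in)
	if err != nil {
		panic(err)
	}
	fmt.Printf("compressed %d bytes to %d bytes, self-check passed\n", len(in), len(out))
}
```

The check costs an extra decompression pass, but it catches engine bugs before corrupt output ever reaches disk.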

If you're interested in long-range compression, some related projects are worth a look. rzip is awesome; histzip lifts some implementation tricks directly from it. bm is a Bentley-McIlroy library from CloudFlare, also written in Go, which compresses matches against a fixed dictionary (in essence, performing a binary diff). Git, xdelta, and open-vcdiff each have open-source binary-diff implementations as well. Google's Brotli compressor is a tuned flate variant defaulting to a 4MB window, first used to compress WOFF 2.0 Web fonts.

If you have any trouble or questions, get in touch!

Public domain, Randall Farmer, 2013-4.