This is a small project to implement Philip Gage's Byte Pair Encoding compression, written in contemporaneous C89 for fun. It is based directly on Gage's pseudo-code. I started off looking at his code, but then decided early on to rewrite without looking at it. So it definitely can't be considered original.
I added error-checking to the expand program, and the compress program should be able to handle arbitrary binary input. A voluminous amount of debug logging is generated with a debug print macro if DEBUG
is defined. A few sample files, text and binary, are provided for testing.
This is the first time I've used fuzzing, and AFL++ was extremely useful for finding corner cases, in particular "circular byte expansion". I'm convinced it's pretty robust now, as I don't dynamically allocate memory anywhere.
I picked BPE because it appealed to me as dead-simple compression algorithm. It is in theory - replace two bytes with an unused one - but also fundamentally limited in the given algorithm by the byte as a unit, specifically the number of unused byte values available. If a block happens to start with many different byte values, then the block will be cut short with little compression.
The pair table RLE is a little fiddly (I didn't implement it exactly) and limited in size, while a Huffman code is very elegant and can be as large as needed. Encoding can't be fast either, because my program takes dozens of passes over each variable-length block, while Huffman coding can be done in one or two passes over each fixed-length block. At least I/O is buffered so I can use getc
/putc
byte-by-byte without being horrendously inefficient.