Skip to content

jxu/bpe

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This is a small project to implement Philip Gage's Byte Pair Encoding compression, written in contemporaneous C89 for fun. It is based directly on Gage's pseudo-code. I started off looking at his code, but then decided early on to rewrite without looking at it. So it definitely can't be considered original.

I added error-checking to the expand program, and the compress program should be able to handle arbitrary binary input. A voluminous amount of debug logging is generated with a debug print macro if DEBUG is defined. A few sample files, text and binary, are provided for testing.

This is the first time I've used fuzzing, and AFL++ was extremely useful for finding corner cases, in particular "circular byte expansion". I'm convinced it's pretty robust now, as I don't dynamically allocate memory anywhere.

I picked BPE because it appealed to me as dead-simple compression algorithm. It is in theory - replace two bytes with an unused one - but also fundamentally limited in the given algorithm by the byte as a unit, specifically the number of unused byte values available. If a block happens to start with many different byte values, then the block will be cut short with little compression. The pair table RLE is a little fiddly (I didn't implement it exactly) and limited in size, while a Huffman code is very elegant and can be as large as needed. Encoding can't be fast either, because my program takes dozens of passes over each variable-length block, while Huffman coding can be done in one or two passes over each fixed-length block. At least I/O is buffered so I can use getc/putc byte-by-byte without being horrendously inefficient.

About

Byte Pair Encoding

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages