ZCode is a custom compression algorithm I originally developed for a competition held for the Spring 2019 Data Structures and Algorithms course of Dr. Mahdi Safarnejad-Boroujeni at Sharif University of Technology, in which I took first place. The code is pretty slow and has a lot of room for optimization, but it is pretty readable. It can be an excellent educational resource for anyone starting out with compression algorithms.
The algorithm is a cocktail of classical compression algorithms, mixed and served for Unicode documents. It hinges on
the LZW algorithm to create a finite-size symbol dictionary; the results are then byte-coded into variable-length custom
symbols, which I call zee codes! Finally, the symbol table is truncated accordingly, and the compressed document is
encoded into a byte stream.
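To make the LZW stage concrete, here is a minimal textbook sketch of LZW over Unicode characters with a capped dictionary. It is not ZCode's actual implementation: the function names, the `max_dict_size` parameter, and the choice to seed the dictionary with the input's distinct characters are all assumptions made for illustration.

```python
def lzw_compress(text, max_dict_size=4096):
    """Textbook LZW over Unicode characters with a capped dictionary (illustrative only)."""
    # Seed the dictionary with every distinct character of the input.
    alphabet = sorted(set(text))
    table = {ch: i for i, ch in enumerate(alphabet)}
    output = []
    current = ""
    for ch in text:
        candidate = current + ch
        if candidate in table:
            # Keep extending the current match.
            current = candidate
        else:
            output.append(table[current])
            # Only grow the dictionary while it is below the size cap.
            if len(table) < max_dict_size:
                table[candidate] = len(table)
            current = ch
    if current:
        output.append(table[current])
    return alphabet, output


def lzw_decompress(alphabet, codes, max_dict_size=4096):
    """Inverse of lzw_compress; rebuilds the same dictionary on the fly."""
    if not codes:
        return ""
    table = list(alphabet)
    previous = table[codes[0]]
    pieces = [previous]
    for code in codes[1:]:
        if code < len(table):
            entry = table[code]
        else:
            # The code refers to the entry the encoder had just created.
            entry = previous + previous[0]
        pieces.append(entry)
        if len(table) < max_dict_size:
            table.append(previous + entry[0])
        previous = entry
    return "".join(pieces)
```

Repeated phrases collapse into single dictionary codes, which is why the number of emitted codes is typically much smaller than the number of characters in repetitive text.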
Zee codes are heavily inspired by Huffman trees. However, symbols in normal texts are usually distributed much more
uniformly than the geometric (or exponential) distribution that bit-level Huffman coding assumes in order to be
effective, so the gains of using variable-size byte codes, from both an implementation and a performance perspective,
outweighed those of bit-level Huffman encoding.
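As a rough illustration of the idea of variable-size byte codes, the sketch below gives the most frequent symbols a one-byte code and all others a two-byte code, using the high bit of the first byte to signal the length. This particular scheme, and the names `assign_byte_codes`, `encode`, and `decode`, are assumptions made for the example; the actual zee code layout may differ.

```python
from collections import Counter


def assign_byte_codes(symbols):
    """Assign 1-byte codes to the 128 most frequent symbols and 2-byte codes to the rest."""
    ranked = [sym for sym, _ in Counter(symbols).most_common()]
    if len(ranked) > 128 + 128 * 256:
        raise ValueError("too many distinct symbols for 2-byte codes")
    codes = {}
    for rank, sym in enumerate(ranked):
        if rank < 128:
            # High bit clear: the code is a single byte.
            codes[sym] = bytes([rank])
        else:
            # High bit set on the first byte: the code is two bytes long.
            extra = rank - 128
            codes[sym] = bytes([0x80 | (extra >> 8), extra & 0xFF])
    return codes


def encode(symbols, codes):
    """Concatenate the byte codes of all symbols."""
    return b"".join(codes[sym] for sym in symbols)


def decode(data, codes):
    """Read codes back, using the first byte's high bit to find each code's length."""
    lookup = {code: sym for sym, code in codes.items()}
    out = []
    i = 0
    while i < len(data):
        length = 2 if data[i] & 0x80 else 1
        out.append(lookup[data[i:i + length]])
        i += length
    return out
```

Because every code is a whole number of bytes, decoding needs no bit manipulation beyond one mask per code, which is the kind of implementation simplicity byte codes trade a little compression for.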
Results may vary, but my tests showed a steady ~4-5x compression ratio on Farsi texts, which is pretty nice!
ZCode can be installed with pip and requires only Python 3.6 or higher.
pip install -U zcode
You can run the algorithm on any UTF-8 encoded file using the zcode command. It automatically decompresses files
ending with a .zee extension and compresses others into .zee files, but you can always override the default behavior
by providing optional arguments like:
zcode INPUTFILE [--output OUTPUT_FILE --action compress/decompress --symbol-size SYMBOL_SIZE --code-size CODE_SIZE]
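The default choice between compressing and decompressing described above can be pictured as a simple check on the file extension. The helper below is hypothetical and is not taken from ZCode's source; it only mirrors the documented behavior.

```python
def default_action(input_file):
    """Hypothetical helper: pick the action the way the documented default does."""
    # Files ending in .zee are treated as compressed archives; everything else is compressed.
    return "decompress" if input_file.endswith(".zee") else "compress"
```

Passing --action explicitly overrides this default, for example to compress a file whose name already ends in .zee.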
The symbol-size argument controls the algorithm's buffer size for processing symbols (in bytes). It is set
automatically depending on your input file size, but you can change it as you wish. The code-size argument controls the
maximum length of coded bytes while encoding symbols (it is 2 by default and needs to be provided to the algorithm upon
decompression).
MIT LICENSE, see vahidzee/zcode/LICENSE