Skip to content
Python Module for Huffman Encoding and Decoding
Python Jupyter Notebook
Branch: master
Clone or download
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
dahuffman minor dict comprehension code style tweak Oct 11, 2019
tests fixup! Issue #5: added pre-trained codec for XML Jul 14, 2019
train Issue #5: added pre-trained codec for XML Jul 14, 2019
.deepsource.toml Trying out DeepSource (add .deepsource.toml) Oct 11, 2019
.gitignore Added Jupyter notebook with some more examples and basic comparison o… Jul 14, 2019
.travis.yml Issue #6 Drop support for Python 2 Jul 9, 2019
LICENSE.txt Initial commit of dahuffman module for huffman encoding and decoding Sep 30, 2017
MANIFEST.in
README.rst Issue #1 and #5: add functionality to save/load code table to pickle … Jul 13, 2019
examples.ipynb Added Jupyter notebook with some more examples and basic comparison o… Jul 14, 2019
setup.cfg
setup.py Issue #1 and #5: add functionality to save/load code table to pickle … Jul 13, 2019

README.rst

dahuffman - Python Module for Huffman Encoding and Decoding


dahuffman is a pure Python module for Huffman encoding and decoding, commonly used for lossless data compression.

The name of the module refers to the full name of the inventor of the Huffman code tree algorithm: David Albert Huffman (August 9, 1925 – October 7, 1999).

Features and design

  • Pure Python implementation, only using standard library.
  • Leverages iterators and generators internally, allows to be used in streaming fashion.
  • Not limited to byte/unicode string input, can handle other "symbols" or tokens, for example chess moves or sequences of categorical data, as long as these symbols can be used as keys in dictionaries (meaning they should be hashable).
  • Properly handle end of encoded bit stream if it does not align with byte boundaries
  • For Python 3.5 and up

Installation

pip install dahuffman

Usage

Basic usage example, where the code table is built based on given symbol frequencies:

>>> from dahuffman import HuffmanCodec
>>> codec = HuffmanCodec.from_frequencies({'e': 100, 'n':20, 'x':1, 'i': 40, 'q':3})
>>> encoded = codec.encode('exeneeeexniqneieini')
>>> encoded
b'\x86|%\x13i@'
>>> len(encoded)
6
>>> codec.decode(encoded)
'exeneeeexniqneieini'
>>> codec.print_code_table()
bits  code       (value)  symbol
   5  00000      (    0)  _EOF
   1  1          (    1)  'e'
   2  01         (    1)  'i'
   3  001        (    1)  'n'
   4  0001       (    1)  'q'
   5  00001      (    1)  'x'

You can also "train" the codec by providing it data directly:

>>> codec = HuffmanCodec.from_data('hello world how are you doing today foo bar lorem ipsum')
>>> codec.encode('do lo er ad od')
b'^O\x1a\xc4S\xab\x80'
>>> len(_)
7

Using it with sequences of symbols (country codes in this example):

>>> countries = ['FR', 'UK', 'BE', 'IT', 'FR', 'IT', 'GR', 'FR', 'NL', 'BE', 'DE']
>>> codec = HuffmanCodec.from_data(countries)
>>> encoded = codec.encode(['FR', 'IT', 'BE', 'FR', 'UK'])
>>> encoded
b'L\xca'
>>> len(encoded)
2
>>> codec.decode(encoded)
['FR', 'IT', 'BE', 'FR', 'UK']

Doing it in a streaming fashion (generators):

>>> import random
>>> def sample(n, symbols):
...     for i in range(n):
...             if (n-i) % 5 == 1:
...                     print(i)
...             yield random.choice(symbols)
...
>>> codec = HuffmanCodec.from_data(countries)
>>> encoded = codec.encode_streaming(sample(16, countries))
>>> encoded
<generator object encode_streaming at 0x108bd82d0>
>>> decoded = codec.decode_streaming(encoded)
>>> decoded
<generator object decode_streaming at 0x108bd8370>
>>> list(decoded)
0
5
10
15
['DE', 'BE', 'FR', 'GR', 'UK', 'BE', 'UK', 'IT', 'UK', 'FR', 'DE', 'IT', 'NL', 'IT', 'FR', 'UK']

Pre-trained codecs

The dahuffman.codecs package contains a bunch of pre-trained code tables. The codecs can be loaded as follows:

>>> from dahuffman import load_shakespeare
>>> codec = load_shakespeare()
>>> codec.print_code_table()
Bits Code                     Value Symbol
   4 0000                         0 'n'
   4 0001                         1 's'
   4 0010                         2 'h'
   5 00110                        6 'u'
   7 0011100                     28 'k'
   9 001110100                  116 'Y'
  14 00111010100000            3744 '0'
...
>>> len(codec.encode('To be, or not to be; that is the question;'))
24
You can’t perform that action at this time.