pyDAC (python Directly Addressable Codes) offers a variable-length encoding scheme for (unsigned) integers with random access to any element of the encoded sequence.
In terms of compression, a DAC structure is very likely to outperform standard base-128 compression schemes aka VByte, Varint, VInt, EncInt etc..
As a bonus, a DAC structure gives to random access to each and every sequence element without any decoding.
Install from PyPi using
pip install pyDACfrom pyDAC import DACimports the module.
import random
from pyDAC import DAC
values = random.sample(range(2**32), 10**7)
encoded_values = DAC(iter(values))creates a DAC structure encoded_values for the values sequence.
The ith element from the original values sequence can be retrieved from a DAC structure encoded_values using the subscript operator
for i in range(len(values)):
assert values[i] == encoded_values[i]A DAC structure encoded_values is also iterable.
You can easily loop through the stored elements stored
dac_iter = iter(encoded_values)
while True:
try:
val = next(dac_iter)
except StopIteration:
break # Iterator exhausted: stop the loop
else:
print(val)or return all stored elements at once
assert values == list(iter(encoded_values))A DAC structure can provide compression ratios and space_savings in comparision to the minimal fixed width representation and to the variable byte representation of the original values sequence.
For example,
values = [1, 2, 1, 8, 3, 4, 5, 9, 13, 1024, 262189]
encoded_values = DAC(iter(values))
print(encoded_values.space_savings)
>>> {'vbyte': 0.08214285714285718, 'fixed_width': 0.508133971291866}
print(encoded_values.compression_ratios)
>>> {'vbyte': 1.0894941634241246, 'fixed_width': 2.0330739299610894}@article{
title = {{Algorithms and Compressed Data Structures for Information Retrieval}},
author = {Ladra, Susana},
type = {Phd Thesis},
institution = {Universidade da Coru{\~{n}}a},
pages = {272},
year = {2011},
isbn = {5626895531}
}@inproceedings{
title = {{Directly addressable variable-length codes}},
author = {Brisaboa, Nieves R. and Ladra, Susana and Navarro, Gonzalo},
booktitle = {Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)},
volume = {5721 LNCS},
doi = {10.1007/978-3-642-03784-9_12},
isbn = {3642037836},
issn = {03029743},
pages = {122--130},
publisher = {Springer, Berlin, Heidelberg},
year = {2009}
}
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
