Dynamic succinct/compressed data structures
C++ CMake
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
datasets
include
.gitignore
CMakeLists.txt
LICENSE
README.md
benchmark.cpp
cw-bwt.cpp
debug-alan.cpp
debug.cpp
h0_lz77.cpp
rle_bwt.cpp
rle_lz77_v1.cpp
rle_lz77_v2.cpp

README.md

DYNAMIC: a succinct and compressed dynamic data structures library

Author: Nicola Prezza mail: nicola.prezza@gmail.com

Please cite this library as:

@inproceedings{prezza2017framework,
title={A Framework of Dynamic Data Structures for String Processing},
author={Prezza, Nicola},
booktitle={International Symposium on Experimental Algorithms},
year={2017},
organization={Leibniz International Proceedings in Informatics (LIPIcs)}
}

Data structures

This library offers space- and time-efficient implementations of some basic succinct/compressed dynamic data structures. Note that at the moment the library does not feature delete operations (only inserts). DYNAMIC features:

  • A succinct Searchable Partial Sums with Inserts (SPSI) structure representing a list of integers s_1, s_2, ..., s_m. Space: about 1.2 * m * ( log(M/m) + log log m ) bits, where M = m + s_1 + s_2 + ... + s_m. The structure supports also update operations (i.e. s_i = s_i + delta).
  • A Succinct dynamic bitvector supporting rank/select/access/insert (RSAI) operations. Space: about 1.2 * n bits.
  • A gap-compressed dynamic bitvector supporting RSAI operations. Space: about 1.2 * b * ( log(n/b) + log log b ) bits, b being the number of bits set and n being the bitvector length. All operations take log(b) time.
  • A dynamic sparse vector (of integers).
  • A dynamic string supporting RSAI operations. The user can choose at construction time between fixed-length/gamma/Huffman encoding of the alphabet. All operations take log(n) * log(sigma) time (or log(n) * H0 with Huffman encoding).
  • A run-length encoded dynamic string supporting RSAI operations. Space: approximately R*(1.2 * log(sigma) + 2.4 * (log(n/R)+log log R) ) bits, where R is the number of runs in the string. All operations take log(R) time.
  • A dynamic (left-extend only) entropy/run-length compressed BWT
  • A dynamic (left-extend only) entropy/run-length compressed FM-index. This structure consists in the above BWT + a dynamic suffix array sampling

Algorithms

  • Two algorithms to build LZ77 in repetition-aware RAM working space. Both algorithms use a run-length encoded BWT with sparse Suffix array sampling. The first algorithm stores 2 SA samples per BWT run. The second algorithm (much more space efficient) stores 1 SA sample per LZ factor. From the papers "Computing LZ77 in Run-Compressed Space", Alberto Policriti and Nicola Prezza, DCC2016 and " LZ77 Computation Based on the Run-Length Encoded BWT", Alberto Policriti and Nicola Prezza (accepted for publication in Algorithmica)
  • An algorithm to build the BWT in run-compressed space
  • An algorithm to build LZ77 in nH0(2+o(1)) space and n * log n * H0 time. From the paper "Fast Online Lempel-Ziv Factorization in Compressed Space", Alberto Policriti and Nicola Prezza, SPIRE2015
  • An algorithm to build the BWT in high-order compressed space. The algorithm runs in O(n * H_k * log log n) average-case time (e.g. good for DNA) and O(n * H_k * log n) worst-case time. From the paper "Average linear time and compressed space construction of the Burrows-Wheeler transform" Policriti A., Gigante N. and Prezza N., LATA 2015

The SPSI structure is the building block on which all other structures are based. This structure is implemented with cache-efficient B-trees.

TODO:

  • Implement delete operations
  • Implement a good memory allocator. At the moment the default allocator is used, which results in about 25% of memory being wasted due to fragmentation
  • Geometric data structures (predecessor/2D range search)
  • Batch inserts

Download

git clone https://github.com/nicolaprezza/dynamic

Compile

Thre library features some example executables. To compile them, firstly create and enter a bin/ directory

mkdir bin; cd bin

Then, launch cmake as (default build type is release):

cmake ..

Finally, build the executables:

make

The above command creates the executables in the bin directory.

Usage

The header include/dynamic.hpp contains all type definitions and is all you need to include in your code. The folder algorithms/ contains some algorithms implemented with the library's structures. This is a snapshot of dynamic.hpp:

/*
 * a succinct searchable partial sum with inserts implemented with cache-efficient
 * B trees.
 */
typedef spsi<packed_vector,256,16> packed_spsi;

/*
 * dynamic gap-encoded bitvector
 */
typedef gap_bitvector<packed_spsi> gap_bv;

/*
 * dynamic succinct bitvector (about 1.2n bits)
 */
typedef succinct_bitvector<spsi<packed_vector,8192,16> > suc_bv;

/*
 * succinct/compressed dynamic string implemented with wavelet trees.
 * user can choose (at construction time) between fixed-length / gamma / Huffman encoding of characters.
 */
typedef wt_string<suc_bv> wt_str;

/*
 * run-length encoded (RLE) string. This string uses 1 sparse bitvector
 * for all runs, one dynamic string for run heads, and sigma sparse bitvectors (one per character)
 */
typedef rle_string<gap_bv, wt_str> rle_str;

/*
 * RLE string implemented with a run-length encoded wavelet tree. Each
 * WT node is run-length encoded. 
 */
typedef wt_string<rle_str> wtrle_str;

/*
 * succinct/compressed BWT (see description of wt_str)
 */
typedef bwt<wt_str,rle_str> wt_bwt;

/*
 * run-length encoded BWT
 */
typedef bwt<rle_str,rle_str> rle_bwt;

/*
 * dynamic sparse vector: <= m*k + O(m log n/m) bits of space, where k is the maximum
 * number of bits of any integer > 0 and n is the total number of integers.
 */
typedef sparse_vector<packed_spsi,gap_bv> sparse_vec;

/*
 * dynamic succinct/entropy compressed FM index. BWT positions are
 * marked with a succinct bitvector
 *
 * roughly n*H0 + n + (n/k)*log n bits of space, where k is the SA sample  rate
 *
 */
typedef fm_index<wt_bwt, suc_bv, packed_spsi> wt_fmi;

/*
 * dynamic run-length encoded FM index. BWT positions are
 * marked with a gap-encoded bitvector.
 *
 * roughly 2.4*R*log(n/R) + 2.4 log log R + 1.2*R*log(sigma) + (n/k)*log n bits of  space, where
 * k is the SA sample rate and R is the number of runs in the BWT
 *
 */
typedef fm_index<rle_bwt, gap_bv, packed_spsi> rle_fmi;