Set Encoding Playground

This is a repository for testing various ways to compress sets of integers, particularly Unicode code point sets.

The project comes with a CLI for benchmarking compressors against both real-world corpora and synthetic datasets that vary density and clustering characteristics.

Full disclosure: a lot of the supporting code (dataset parsing, synthetic data generation, parts of the CLI) was AI-written. The encoding strategies themselves were handwritten. None of this code is intended to be production-ready; it's structured to make it easy to compare encoding strategies, not to optimize for things like speed or memory usage.

Corpora

I've included both real and synthetic corpora to test on. The real ones behave very differently from the synthetic ones! There's a full list in the CLI, but here are some of them:

  • nam-files main sets: Taken from the nam-files repository. These are the character subsets that Google Fonts' font files are split up by. Most of them are quite compact, except for CJK.

  • nam-files "slices": Google Fonts' CJK fonts are now split up even further. These "slices" are automatically generated from Google's corpus of webpages and optimized so that as few of them need to be downloaded as necessary. They aren't well-behaved--they contain sparse-ish parts where there are gaps of a couple dozen characters at a time, dense sequences of sequential characters, and large gaps between groups.

  • The Unicode Character Database: a bunch of Unicode properties. A lot of these are numeric properties or enumerations; I've omitted the numeric ones and split the enumerations up by enum value.

  • Google Fonts language membership: for each Google font, the list of languages it supports (is this a good time to mention that neither I nor this project is affiliated with Google?). This was taken from Glypht. I control which numeric ID each language maps to, so for better compressibility I sorted the languages in descending order by font support: languages supported by more fonts get lower IDs, and rarer ones get higher IDs.

  • A couple of different text files pre-transformed into integer sets using the Burrows-Wheeler transform and the move-to-front transform. I got this idea from Binary Interpolative Coding Revisited.

  • Uniformly-random integer sets (each integer is independently present with a fixed probability) with varying density.

  • Autocorrelated integer sets: the presence of each integer either depends on the previous one or is random, depending on the autocorrelation intensity. There are views where the autocorrelation is held constant and the density varies, and views where the density is held constant and the autocorrelation varies.

  • Geometrically-distributed run lengths; that is, there are runs of zeroes and ones, and their length is determined by a geometric distribution parameterized by average run length. Once again, you can slice it by density or by run length.
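
As a concrete illustration of that last corpus, here is one way such a set could be generated: at every position, end the current run with probability 1/(mean run length), which makes run lengths geometrically distributed with that mean. This is just a sketch under my own assumptions (the function name, parameters, and the tiny xorshift PRNG are made up for the example; the repository's actual generator may differ):

// Sketch only: generate a set in [0, universe) with geometrically
// distributed run lengths. Separate mean run lengths for "present" and
// "absent" runs control the overall density.
struct XorShift64(u64);

impl XorShift64 {
    fn next_f64(&mut self) -> f64 {
        // One xorshift64 step, mapped into [0, 1).
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

fn geometric_runs(universe: u32, mean_present: f64, mean_absent: f64, seed: u64) -> Vec<u32> {
    let mut rng = XorShift64(seed | 1); // keep the PRNG state nonzero
    let mut present = false;
    let mut out = Vec::new();
    for i in 0..universe {
        if present {
            out.push(i);
        }
        // Ending the run with probability 1/mean yields a geometric run
        // length with that mean.
        let flip_prob = 1.0 / if present { mean_present } else { mean_absent };
        if rng.next_f64() < flip_prob {
            present = !present;
        }
    }
    out
}

fn main() {
    // Expected density is mean_present / (mean_present + mean_absent) ≈ 0.2.
    let set = geometric_runs(100_000, 8.0, 32.0, 42);
    println!("density = {:.3}", set.len() as f64 / 100_000.0);
}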

Encoding strategies

There are a variety of encoding strategies (and combinations of encoding strategies). Some of them are:

  • Methods that use prefix-coded integers (a few different prefix codes have been implemented):

    • Gap encoding: output the gaps between successive integers. Good for sparse data, bad for long runs of consecutive present integers. (A minimal sketch with LEB128 varints follows this list.)
    • Run-length encoding: output the lengths of runs of present and absent integers. Good for encoding pure ranges; suffers on sparser or mixed data. This is the best performer on the UCD dataset.
    • "Flip coding": a hybrid between gap and run-length encoding. It behaves like gap coding, but when two successive integers are present, we switch to encoding the gaps between absences. After two successive gaps between absences, we switch back, and so on. This is the current frontrunner for the nam-files and language membership real-world corpora.
  • Elias-Fano encoding: the integers' lower bits are packed together, and the upper bits are stored as unary-encoded deltas. I'm not sure whether I've implemented it incorrectly, but it performs quite poorly: it's the best encoding for very sparse random data, but its size skyrockets above ~20% density. I think part of the problem is that the encoding can represent the same number multiple times in a row; we never do that, so that capacity is pure overhead.

  • Binary interpolative coding: we can encode the middle element, then recursively encode the elements below and above it with fewer bits, since we know their ranges are restricted. This is competitive, but can't take much advantage of long runs. It's the best performer on the nam-files "slices" dataset. (A sketch of the recursion also follows this list.)

  • "Triple partitioned" / three data structures in a trenchcoat: roughly partition the data by density, and switch between gap/range encoding and raw bitsets. The output bitstream stores gaps between all individual values, then all ranges, then all bitsets. This feels quite kludgey, but is annoyingly good on many real-world datasets, especially since the integer encoding is LEB128 instead of some fancy prefix code.

  • A regular bitset compressed with zlib. This seems to approach the information-theoretical lower bound on synthetic corpora, but performs worse on real-world data. It's included as a reminder that synthetic benchmarks can be deceiving.
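
To make the gap-encoding idea above concrete, here is a minimal sketch of delta encoding with LEB128 varints (LEB128 is mentioned above as the triple-partitioned encoder's integer code; the prefix-coded methods use other codes). The helper names and the "store gap minus one" convention are my own choices for the example, not necessarily what this repository does:

// Sketch only: encode a sorted, duplicate-free set as LEB128-coded gaps.
fn write_leb128(out: &mut Vec<u8>, mut v: u32) {
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // high bit set = more bytes follow
    }
}

fn encode_gaps(sorted: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0;
    for (i, &v) in sorted.iter().enumerate() {
        // The first value is stored as-is; later ones as (gap - 1), since
        // successive distinct values differ by at least 1.
        let delta = if i == 0 { v } else { v - prev - 1 };
        write_leb128(&mut out, delta);
        prev = v;
    }
    out
}

fn main() {
    let set = vec![3u32, 4, 5, 6, 7, 100, 130, 131, 945];
    let bytes = encode_gaps(&set);
    println!("{} values -> {} bytes", set.len(), bytes.len());
}

Dense runs are the worst case for this particular sketch: every element of a run still costs at least one byte, which is exactly the weakness that run-length and flip coding address.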
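
Similarly, here is a minimal sketch of the binary interpolative recursion. It writes into a plain Vec<bool> and uses a fixed ceil(log2(range)) bit width per value; real implementations typically use a minimal binary code and proper bit I/O, so treat this as an illustration of the recursion rather than of this repository's encoder:

// Sketch only: binary interpolative coding (encoder half).
fn bits_needed(range: u64) -> u32 {
    // Bits needed to represent any value in [0, range - 1].
    if range <= 1 { 0 } else { 64 - (range - 1).leading_zeros() }
}

fn write_bits(out: &mut Vec<bool>, value: u64, nbits: u32) {
    for i in (0..nbits).rev() {
        out.push((value >> i) & 1 == 1); // most significant bit first
    }
}

// Encode a sorted, duplicate-free slice whose values all lie in [lo, hi].
fn encode_interpolative(vals: &[u64], lo: u64, hi: u64, out: &mut Vec<bool>) {
    if vals.is_empty() {
        return;
    }
    let m = vals.len() / 2;
    let v = vals[m];
    // With m distinct values below it and (len - 1 - m) above it, v is
    // confined to [lo + m, hi - (len - 1 - m)]; encode its offset within
    // that narrowed range.
    let low = lo + m as u64;
    let high = hi - (vals.len() - 1 - m) as u64;
    write_bits(out, v - low, bits_needed(high - low + 1));
    // Recurse on both halves with their tightened bounds.
    if m > 0 {
        encode_interpolative(&vals[..m], lo, v - 1, out);
    }
    if m + 1 < vals.len() {
        encode_interpolative(&vals[m + 1..], v + 1, hi, out);
    }
}

fn main() {
    let set = vec![3u64, 4, 5, 6, 7, 100, 130, 131, 945];
    let mut bits = Vec::new();
    encode_interpolative(&set, 0, 1023, &mut bits);
    println!("{} values -> {} bits", set.len(), bits.len());
}

Note how the values inside the dense run 3..=7 become cheap once the recursion's bounds tighten: each needs only a couple of bits, and a value whose bounds collapse to a single possibility costs zero bits.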

Running it

You can get a full list of CLI options via:

cargo run --release --bin compare -- --help

By default, it will generate results for the nam-files corpus and a selection of well-performing strategies.

Besides the table printed to stdout, the run also produces:

  • /tmp/out.csv – raw numeric results for spreadsheets or plotting (consumed by plot_grid.py).
  • /tmp/out.txt – encoded payloads per strategy and corpus entry for deep dives.

Generating charts

To output pretty charts, you can use the chart binary, which renders them into an HTML file (with D3.js); the charts are then extracted using headless Chrome.

For a full list of options, see:

cargo run --release --bin chart -- --help

Visualising grids

For density×autocorrelation or density×run-length sweeps, you can turn the CSV into heatmaps and 3D surfaces:

cargo run --release --bin compare -- --corpus synthetic-grid
python3 plot_grid.py

The script will generate PNGs under /tmp/ (one per strategy plus aggregate views).
