ex_data_sketch v0.8.0 — Deterministic Foundations
Release date: 2026-05-12
v0.8.0 transforms ex_data_sketch from a collection of probabilistic algorithms into a production-grade probabilistic runtime for the BEAM. Zero new sketch families. Five phases of substrate investment.
Highlights
Deterministic Hashing (Phase 1)
Every sketch now shares a documented, validated, byte-stable hash layer. Hash.XXH3, Hash.Murmur3, Hash.Metadata, and Hash.Validation are new. HLL.new/1, ULL.new/1, Theta.new/1, and CMS.new/1 now honor a user-supplied :hash_strategy (:xxhash3 | :murmur3 | :phash2). In v0.7.x the :murmur3 option was silently overridden with :xxhash3, so honoring it is a behavior change: code that passed hash_strategy: :murmur3 will produce different estimates than before (still correct, but now computed with Murmur3).
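A hedged usage sketch of the :hash_strategy option follows. HLL.new/1 and the three strategy atoms come from the note above. The fully qualified module name ExDataSketch.HLL, the :p option, and the update_many/2 and estimate/1 calls are assumptions about the library's API and should be checked against HexDocs. The helper module HashStrategyDemo is hypothetical, and the script skips the calls when the library is not loaded.

```elixir
defmodule HashStrategyDemo do
  # Assumed API: ExDataSketch.HLL.new/1 (named in the release note) plus
  # update_many/2 and estimate/1 (assumptions; verify against HexDocs).
  def estimate(items, strategy) when strategy in [:xxhash3, :murmur3, :phash2] do
    [p: 14, hash_strategy: strategy]
    |> ExDataSketch.HLL.new()
    |> ExDataSketch.HLL.update_many(items)
    |> ExDataSketch.HLL.estimate()
  end
end

if Code.ensure_loaded?(ExDataSketch.HLL) do
  items = Enum.map(1..10_000, &"user-#{&1}")

  # Same input, one sketch per strategy.
  for strategy <- [:xxhash3, :murmur3, :phash2] do
    IO.inspect(HashStrategyDemo.estimate(items, strategy), label: strategy)
  end
else
  IO.puts("ex_data_sketch is not loaded; add it to mix.exs to run this sketch")
end
```

Under these assumptions each strategy should land near the true cardinality, but the three estimates need not be identical. That is why code that switches from the v0.7.x behavior to real Murmur3 sees different numbers for the same input.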
Binary Stability & Corruption Detection (Phase 2)
EXSK serialization bumps from v1 to v2. Every serialize/1 output now carries a CRC32C checksum and an embedded Hash.Metadata block. deserialize/1 transparently reads both v1 and v2. A 200-mutation bit-flip fuzz suite verifies that no single-bit corruption silently propagates. v0.7.x readers cannot read v2 frames.
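To illustrate why a trailing checksum catches single-bit corruption, here is a minimal self-contained sketch. It is not the EXSK v2 layout: the frame shape ("DEMO" magic, version byte, payload, 4-byte checksum) and the module name FrameSketch are hypothetical. It also uses :erlang.crc32/1, the plain CRC-32 from the Erlang standard library, as a stand-in for CRC32C.

```elixir
defmodule FrameSketch do
  # Hypothetical framing, NOT the real EXSK v2 layout:
  # "DEMO" magic, 1 version byte, payload, 4-byte CRC-32 of everything before it.
  def encode(payload) when is_binary(payload) do
    body = <<"DEMO", 2, payload::binary>>
    # :erlang.crc32/1 is plain CRC-32, used here as a stand-in for CRC32C.
    <<body::binary, :erlang.crc32(body)::32>>
  end

  def decode(frame) when is_binary(frame) and byte_size(frame) >= 9 do
    body_size = byte_size(frame) - 4
    <<body::binary-size(body_size), checksum::32>> = frame

    case body do
      <<"DEMO", 2, payload::binary>> ->
        if :erlang.crc32(body) == checksum, do: {:ok, payload}, else: {:error, :checksum_mismatch}

      _ ->
        {:error, :bad_header}
    end
  end

  def decode(_frame), do: {:error, :truncated}

  # Flip the single bit at the given bit offset.
  def flip_bit(bin, bit_index) do
    <<head::bitstring-size(bit_index), bit::1, tail::bitstring>> = bin
    <<head::bitstring, (1 - bit)::1, tail::bitstring>>
  end
end

frame = FrameSketch.encode("register bytes")
{:ok, "register bytes"} = FrameSketch.decode(frame)

# Flip every bit in turn and collect any offset whose corruption goes unnoticed.
undetected =
  for i <- 0..(bit_size(frame) - 1),
      match?({:ok, _}, FrameSketch.decode(FrameSketch.flip_bit(frame, i))),
      do: i

IO.inspect(undetected, label: "undetected single-bit flips")
```

Flips in the magic or version byte fail the header match, and flips in the payload or the checksum fail the CRC comparison, so the list comes back empty. The library's own fuzz suite targets the real v2 frames; this only shows the principle.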
Hot-Path Performance (Phase 3)
8 new Rust NIFs (_raw_h_nif family) dispatch hashing by algorithm byte, extending in-Rust hashing to Murmur3. XXH3 throughput remains 25–34 M items/sec at p=14; Murmur3 is within 8%. Legacy _raw_nif family preserved for v0.7.x binary stability.
Precompiled NIFs (Phase 4)
Windows x86_64 and ARM64 MSVC targets added. The matrix is now 8 targets x 2 NIF versions = 16 artifacts per release. mix test.nif_on / mix test.nif_off aliases handle local NIF-mode flips.
Property-Based Validation (Phase 5)
14 new StreamData properties lock monotonicity (HLL/ULL), error bounds, rank consistency (KLL/REQ), overestimation-only (CMS), and no-false-negative guarantees (Bloom/XorFilter/Cuckoo). Binary v2 corruption never silently produces a valid sketch.
Post-Release Fixes (since tag)
- #240: EXSK v2 Header.decode/1 now rejects non-zero reserved flags with a structured DeserializationError, matching the documented v2 contract. Previously, frames with flags != 0 were silently accepted.
- #238: serialize/1 on any Murmur3-configured sketch no longer crashes with CaseClauseError. The sketch-local hash_strategy_byte/1 encoders across HLL, ULL, Theta, and CMS now map :murmur3 to wire byte 3.
- #239: Release docs and the checksum file are no longer git-tracked; plan docs are excluded from the Hex package.
- Credo lint: replaced chained Enum.map |> Enum.map with single passes, replaced list ++ [item] with prepend-and-reverse, replaced List.foldl with Enum.reduce, replaced List.last with pattern matching, removed fully-qualified module references, removed restatement comments, documented previously @doc false public helpers.
Wire Format
| Sketch (empty unless noted) | v1 size | v2 size | Overhead |
|---|---|---|---|
| HLL p=4 | 18 B | 50 B | +32 (2.8x) |
| HLL p=14 | 16,398 B | 16,430 B | +32 (0.2%) |
| KLL k=200 (populated) | ~3-5 KB | ~3-5 KB | +32 (~1%) |
For any sketch larger than ~1 KB, the 32-byte overhead is roughly 3% or less.
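The percentages in the table follow from dividing the 32-byte difference by the v1 size. A small sketch of that arithmetic is below. It assumes the overhead stays at 32 bytes for other sketch sizes (the table shows it only for these rows), and the Overhead module is hypothetical.

```elixir
defmodule Overhead do
  # Per-frame overhead taken from the table above (v2 size minus v1 size).
  @overhead_bytes 32

  # Relative growth, in percent, of a v1 frame of `v1_bytes` re-serialized as v2.
  def percent(v1_bytes) when is_integer(v1_bytes) and v1_bytes > 0 do
    Float.round(@overhead_bytes / v1_bytes * 100, 1)
  end
end

IO.inspect(Overhead.percent(18))      # HLL p=4, empty   → 177.8 (50 B is about 2.8x of 18 B)
IO.inspect(Overhead.percent(16_398))  # HLL p=14, empty  → 0.2
IO.inspect(Overhead.percent(1_024))   # a 1 KiB frame    → 3.1
```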
Performance
| Path (HLL p=14) | Throughput |
|---|---|
| Pure phash2 | ~1.7 M items/sec |
| Pure xxhash3 | ~1.9 M items/sec |
| Rust raw XXH3 | ~30 M items/sec |
| Rust raw_h Murmur3 | ~28 M items/sec |
Test Suite
| Metric | v0.7.1 | v0.8.0 |
|---|---|---|
| Tests (NIF on) | 1,186 | 1,317 |
| Doctests | 169 | 202 |
| Properties (NIF on) | 152 | 171 |
| Line coverage | 88% | 92.7% |
| Credo issues | 0 | 0 |
Breaking Changes
- EXSK v2 is one-way. v0.7.x readers cannot decode v2 frames. Stage your rollout: deploy v0.8.0 readers first, then producers.
- :murmur3 is no longer silently overridden. Code that passed hash_strategy: :murmur3 (and got :xxhash3 in v0.7.x) will now actually use Murmur3. Estimates are still correct but differ from v0.7.x output for the same input.
- Serialized binary format changes. The version byte shifts from 1 to 2. Tests asserting <<"EXSK", 1, ...>> must update to 2.
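The version-byte change can be sketched with plain binary pattern matching. The frames below are hand-built stand-ins: the bytes after the version byte are placeholders, not real EXSK content, and FrameVersion.version/1 is a hypothetical helper. Only the "EXSK" magic and the position of the version byte come from the note above.

```elixir
defmodule FrameVersion do
  # Hypothetical helper: reads the byte that follows the "EXSK" magic.
  def version(<<"EXSK", version, _rest::binary>>), do: {:ok, version}
  def version(_other), do: {:error, :not_exsk}
end

# Hand-built stand-ins; the trailing bytes are placeholders, not real sketch data.
v1_frame = <<"EXSK", 1, 0, 0, 0>>
v2_frame = <<"EXSK", 2, 0, 0, 0>>

# A v0.7.x-era assertion pinned the version byte to 1 ...
true = match?(<<"EXSK", 1, _::binary>>, v1_frame)
# ... and no longer matches output that carries version byte 2.
false = match?(<<"EXSK", 1, _::binary>>, v2_frame)
# Updated assertion for v0.8.0 output.
true = match?(<<"EXSK", 2, _::binary>>, v2_frame)

IO.inspect(FrameVersion.version(v2_frame))
```

Tests that only care that a frame is EXSK, not which version, can bind the byte to a variable instead of pinning a literal, which avoids the same churn at the next format bump.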
Migration
Most users need no code changes. Full guide: guides/v0.8.0_migration_notes.md (shipped in HexDocs).
# mix.exs
{:ex_data_sketch, "~> 0.8.0"}

For deployments that share persisted sketches across nodes:
- Deploy v0.8.0 to all readers first. v0.8.0 reads both v1 and v2.
- Verify reader stability for one deploy cycle.
- Deploy v0.8.0 to producers. They now emit v2 frames.
Known Issues
- ULL accuracy at low p. Use p >= 12 for production. At p < 12 and high cardinality, estimates diverge significantly.
- HLL memory at high cardinality. Streaming 10M+ items into a single HLL allocates ~1.86 GB of transient BEAM state. Workaround: smaller chunk sizes.
- Windows ARM64 precompiled NIF has limited CI history. Fallback: EX_DATA_SKETCH_BUILD=1 mix deps.compile.
- Backend.default/0 returns Pure even when the Rust NIF is loaded. Opt in explicitly: backend: ExDataSketch.Backend.Rust.
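One possible reading of the chunk-size workaround and the explicit backend opt-in is sketched below. The backend: ExDataSketch.Backend.Rust option comes from the list above. Passing it to ExDataSketch.HLL.new/1 alongside :p, and folding chunks in with update_many/2, are assumptions to verify against HexDocs. ChunkedIngest and the 10,000-item chunk size are hypothetical.

```elixir
defmodule ChunkedIngest do
  # Assumptions: ExDataSketch.HLL.new/1 accepts :backend alongside :p, and
  # update_many/2 folds a list of items into a sketch. Verify against HexDocs.
  def build(stream, chunk_size \\ 10_000) do
    sketch = ExDataSketch.HLL.new(p: 14, backend: ExDataSketch.Backend.Rust)

    stream
    |> Stream.chunk_every(chunk_size)
    # Only one chunk is materialized at a time, which bounds transient memory.
    |> Enum.reduce(sketch, fn chunk, acc -> ExDataSketch.HLL.update_many(acc, chunk) end)
  end
end

if Code.ensure_loaded?(ExDataSketch.HLL) do
  1..1_000_000
  |> Stream.map(&Integer.to_string/1)
  |> ChunkedIngest.build()
  |> IO.inspect(label: "sketch")
else
  IO.puts("ex_data_sketch is not loaded; add it to mix.exs to run this sketch")
end
```

Feeding a lazy stream in fixed-size chunks avoids building one very large intermediate list. Which chunk size actually keeps memory acceptable depends on the workload and is worth measuring.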
Non-Goals for v0.8.0
No new sketch families. No Apache DataSketches interop beyond Theta CompactSketch. No streaming integrations (Broadway/Flow). No persistence layers (ETS/DETS/CubDB). No telemetry. No SIMD. No 6-bit register packing. No raw-NIF path for membership filters.
Full design docs: plans/0.8.0_architectural_summary.md, plans/0.8.0-risks.md, plans/0.8.0_migration_notes.md, plans/0.8.0_serialization_compatibility.md.