v0.8.0 — Deterministic Foundations

@github-actions github-actions released this 12 May 16:16
· 1 commit to main since this release

ex_data_sketch v0.8.0 — Deterministic Foundations

Release date: 2026-05-12

v0.8.0 transforms ex_data_sketch from a collection of probabilistic algorithms into a production-grade probabilistic runtime for the BEAM. Zero new sketch families. Five phases of substrate investment.

Highlights

Deterministic Hashing (Phase 1)

Every sketch now shares a documented, validated, byte-stable hash layer. Hash.XXH3, Hash.Murmur3, Hash.Metadata, and Hash.Validation are new. HLL.new/1, ULL.new/1, Theta.new/1, and CMS.new/1 now honor a user-supplied :hash_strategy (:xxhash3 | :murmur3 | :phash2). This is a behavior change: v0.7.x silently overrode :murmur3 and used :xxhash3 instead, so code that passes hash_strategy: :murmur3 now actually hashes with Murmur3 and produces estimates that are still correct but differ from v0.7.x output.
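As a rough illustration of opting into a hash strategy, the script below builds two HLL sketches that differ only in :hash_strategy. It is a sketch under assumptions: the fully qualified module name ExDataSketch.HLL is inferred from the ExDataSketch.Backend.Rust name used elsewhere in these notes, and every other option is left at its default.

```elixir
# Assumes Hex access and the package coordinates from the Migration section.
Mix.install([{:ex_data_sketch, "~> 0.8.0"}])

# Assumption: HLL lives under the ExDataSketch namespace.
alias ExDataSketch.HLL

# Same sketch family, two hash strategies. In v0.8.0 the second call is
# expected to hash with Murmur3 rather than silently falling back to XXH3.
xxh3_sketch = HLL.new(hash_strategy: :xxhash3)
murmur_sketch = HLL.new(hash_strategy: :murmur3)

IO.inspect(xxh3_sketch, label: "xxhash3")
IO.inspect(murmur_sketch, label: "murmur3")
```

Sketches built with different strategies place the same item in different registers, so they should not be expected to merge or compare meaningfully.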

Binary Stability & Corruption Detection (Phase 2)

EXSK serialization bumps from v1 to v2. Every serialize/1 output now carries a CRC32C checksum and an embedded Hash.Metadata block. deserialize/1 transparently reads both v1 and v2. A 200-mutation bit-flip fuzz suite verifies that no single-bit corruption silently propagates. v0.7.x readers cannot read v2 frames.
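The idea behind the bit-flip fuzz suite can be sketched with the standard library alone. The example below is not the project's test suite: it uses :erlang.crc32/1 (plain CRC32) as a stand-in for CRC32C, and BitFlipDemo is a hypothetical module name. It flips each bit of a payload in turn and checks that the checksum changes every time.

```elixir
defmodule BitFlipDemo do
  # Flip the bit at `index` (0-based, counted from the most significant
  # bit of the first byte) and return the mutated binary.
  def flip_bit(bin, index) do
    <<head::bitstring-size(index), bit::1, tail::bitstring>> = bin
    <<head::bitstring, 1 - bit::1, tail::bitstring>>
  end

  # True when every single-bit mutation of `payload` yields a different
  # CRC32 than the original, i.e. no flip goes unnoticed.
  def all_flips_detected?(payload) when byte_size(payload) > 0 do
    expected = :erlang.crc32(payload)

    Enum.all?(0..(bit_size(payload) - 1), fn index ->
      :erlang.crc32(flip_bit(payload, index)) != expected
    end)
  end
end

IO.inspect(BitFlipDemo.all_flips_detected?("EXSK demo payload"))
```

Any CRC detects all single-bit errors, which is why one checksum comparison is enough to reject a frame with one flipped bit.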

Hot-Path Performance (Phase 3)

Eight new Rust NIFs (the _raw_h_nif family) dispatch hashing by algorithm byte, extending in-Rust hashing to Murmur3. XXH3 throughput remains 25–34 M items/sec at p=14; Murmur3 is within 8% of that. The legacy _raw_nif family is preserved for v0.7.x binary stability.

Precompiled NIFs (Phase 4)

Windows x86_64 and ARM64 MSVC targets added. The matrix is now 8 targets x 2 NIF versions = 16 artifacts per release. mix test.nif_on / mix test.nif_off aliases handle local NIF-mode flips.

Property-Based Validation (Phase 5)

14 new StreamData properties lock monotonicity (HLL/ULL), error bounds, rank consistency (KLL/REQ), overestimation-only (CMS), and no-false-negative guarantees (Bloom/XorFilter/Cuckoo). Binary v2 corruption never silently produces a valid sketch.

Post-Release Fixes (since tag)

  • #240: EXSK v2 Header.decode/1 now rejects non-zero reserved flags with a structured DeserializationError, matching the documented v2 contract. Previously, frames with flags != 0 were silently accepted.
  • #238: serialize/1 on any Murmur3-configured sketch no longer crashes with CaseClauseError. The sketch-local hash_strategy_byte/1 encoders across HLL, ULL, Theta, and CMS now map :murmur3 to wire byte 3.
  • #239: Release docs and the checksum file are no longer git-tracked; plan docs are excluded from the Hex package.
  • Credo lint: replaced chained Enum.map |> Enum.map with single passes, replaced list ++ [item] with prepend-and-reverse, replaced List.foldl with Enum.reduce, replaced List.last with pattern matching, removed fully-qualified module references, removed restatement comments, documented previously @doc false public helpers.

Wire Format

Sketch (empty unless noted)   v1 size     v2 size     Overhead
HLL p=4                       18 B        50 B        +32 B (2.8x)
HLL p=14                      16,398 B    16,430 B    +32 B (0.2%)
KLL k=200 (populated)         ~3-5 KB     ~3-5 KB     +32 B (~1%)

The overhead is a fixed 32 B per frame, so for any sketch larger than ~1 KB it amounts to about 3% or less.
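The relative figures in the table follow from the fixed 32-byte difference between v1 and v2 sizes. A minimal calculation, with Overhead as a hypothetical helper module:

```elixir
defmodule Overhead do
  # Fixed per-frame cost implied by the table: v2 size minus v1 size.
  @fixed_bytes 32

  # Growth of a frame that was `v1_bytes` long under v1, in percent.
  def percent(v1_bytes) when v1_bytes > 0, do: @fixed_bytes / v1_bytes * 100
end

IO.puts(Overhead.percent(18))      # HLL p=4
IO.puts(Overhead.percent(16_398))  # HLL p=14
```

For HLL p=4 the growth is roughly 178%, which is the same thing as the v2 frame being about 2.8 times the v1 size (50 / 18); for HLL p=14 it is roughly 0.2%.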

Performance

Path (HLL p=14)       Throughput
Pure phash2           ~1.7 M items/sec
Pure xxhash3          ~1.9 M items/sec
Rust raw XXH3         ~30 M items/sec
Rust raw_h Murmur3    ~28 M items/sec

Test Suite

Metric                 v0.7.1    v0.8.0
Tests (NIF on)         1,186     1,317
Doctests               169       202
Properties (NIF on)    152       171
Line coverage          88%       92.7%
Credo issues           0         0

Breaking Changes

  1. EXSK v2 is one-way. v0.7.x readers cannot decode v2 frames. Stage your rollout: deploy v0.8.0 readers first, then producers.
  2. :murmur3 is no longer silently overridden. Code that passed hash_strategy: :murmur3 (and got :xxhash3 in v0.7.x) will now actually use Murmur3. Estimates are still correct but differ from v0.7.x output for the same input.
  3. Serialized binary format changes. The version byte shifts from 1 to 2. Tests asserting <<"EXSK", 1, ...>> must update to 2.
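For tests or tooling that need to tell v1 and v2 frames apart, the version byte can be read with a binary pattern match. This is a minimal sketch that relies only on the layout stated above (the "EXSK" magic followed directly by the version byte); FrameVersion is a hypothetical helper, not part of the library.

```elixir
defmodule FrameVersion do
  # Returns {:ok, version} for binaries that start with the "EXSK" magic,
  # and :error for anything else. The rest of the frame is not inspected.
  def peek(<<"EXSK", version, _rest::binary>>), do: {:ok, version}
  def peek(_other), do: :error
end

IO.inspect(FrameVersion.peek(<<"EXSK", 2, 0, 0>>))
IO.inspect(FrameVersion.peek("not a frame"))
```

A test that previously asserted a literal 1 after the magic could assert on the result of a helper like this, so the expected version lives in one place.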

Migration

Most users need no code changes. Full guide: guides/v0.8.0_migration_notes.md (shipped in HexDocs).

# mix.exs
{:ex_data_sketch, "~> 0.8.0"}

For deployments that share persisted sketches across nodes:

  1. Deploy v0.8.0 to all readers first. v0.8.0 reads both v1 and v2.
  2. Verify reader stability for one deploy cycle.
  3. Deploy v0.8.0 to producers. They now emit v2 frames.

Known Issues

  • ULL accuracy at low p. Use p >= 12 for production. At p < 12 and high cardinality, estimates diverge significantly.
  • HLL memory at high cardinality. Streaming 10M+ items into a single HLL allocates ~1.86 GB of transient BEAM state. Workaround: smaller chunk sizes.
  • Windows ARM64 precompiled NIF has limited CI history. Fallback: EX_DATA_SKETCH_BUILD=1 mix deps.compile.
  • Backend.default/0 returns Pure even when the Rust NIF is loaded. Opt in explicitly: backend: ExDataSketch.Backend.Rust.
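The chunking workaround for the HLL memory issue can be sketched generically. The example below assumes nothing about the library's update API: update_fun is a hypothetical callback of the form (sketch, chunk) -> sketch that the caller supplies, and a MapSet stands in for the sketch so the script runs with the standard library alone.

```elixir
defmodule ChunkedIngest do
  # Feed `items` into `sketch` in chunks of at most `chunk_size`, so that
  # only one chunk is materialized at a time. `update_fun` is supplied by
  # the caller and must return the updated sketch.
  def run(items, sketch, chunk_size, update_fun) when chunk_size > 0 do
    items
    |> Stream.chunk_every(chunk_size)
    |> Enum.reduce(sketch, fn chunk, acc -> update_fun.(acc, chunk) end)
  end
end

# Stand-in for a real sketch: a MapSet that absorbs each chunk.
absorb = fn set, chunk -> MapSet.union(set, MapSet.new(chunk)) end

result = ChunkedIngest.run(1..10, MapSet.new(), 3, absorb)
IO.inspect(MapSet.size(result))
```

With a real sketch, the callback would wrap whichever bulk-update function the library exposes; the chunk size then bounds how much input is held in memory per update call.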

Non-Goals for v0.8.0

No new sketch families. No Apache DataSketches interop beyond Theta CompactSketch. No streaming integrations (Broadway/Flow). No persistence layers (ETS/DETS/CubDB). No telemetry. No SIMD. No 6-bit register packing. No raw-NIF path for membership filters.


Full design docs: plans/0.8.0_architectural_summary.md, plans/0.8.0-risks.md, plans/0.8.0_migration_notes.md, plans/0.8.0_serialization_compatibility.md.