Releases: vivekptnk/ProximaKit

v1.5.0

10 Jun 13:43

Correctness fixes from a multi-agent audit (every fix reproduced before patching, re-verified after), INT8 scalar quantization, three new distance metrics, reproducible graph construction, and a CI overhaul.

Added

  • Multiplatform demo app: ProximaDemoApp now targets iPhone, iPad,
    macOS, and visionOS from one SwiftUI target (compact widths get a
    search-first tab layout; AppKit image loading replaced with ImageIO).
    The persisted demo index is validated against the current embedder's
    dimension before reuse — NLEmbedding can pin sentence (512d) or
    word-averaging (300d) mode depending on which language assets the OS
    has, and a stale-dimension index made every search silently empty.
    Dimension mismatches now surface as an actionable error instead.
    A -demoQuery launch argument supports screenshot automation.
  • INT8 scalar quantization (ADR-007). ScalarQuantizer — symmetric
    per-vector scaling (scale = maxAbs / 127, explicit zero-vector handling)
    — plus the ScalarQuantizedHNSWIndex actor. ~4× vector-storage reduction
    (384d: 1,536 B → 388 B per vector), no training phase, and search runs
    through the configured DistanceMetric, so any serialisable metric works
    (contrast with PQ's L2-only ADC). Two-phase build (full-precision graph
    construction, then encode), binary persistence, memory introspection
    (codeStorageBytes / memorySavingsRatio), and acceptance-tested recall
    floors: Recall@10 ≥ 0.95 (euclidean) / ≥ 0.93 (cosine) against brute-force
    ground truth. Design rationale in ADR-007.
  • Three new distance metrics: ChebyshevDistance (L∞),
    BrayCurtisDistance, and MahalanobisDistance (constructible from a
    covariance or inverse-covariance matrix). Chebyshev and Bray-Curtis join
    DistanceMetricType and persist with any index; Mahalanobis is search-only
    (not serialisable), and persistenceSnapshot() reports it as
    PersistenceError.unserializableMetric rather than guessing.
  • HNSWConfiguration.levelSeed — seeds the layer-assignment RNG so graph
    construction is reproducible: the same insertion sequence yields the same
    topology. Build-time knob only; deliberately not persisted.
  • Persistence corruption-hardening test matrix — 42 tests across all four
    binary codecs, covering truncated sections, out-of-range graph indices,
    invalid entry points, and bad configuration values.
  • DocC published to GitHub Pages on every push to main (docs.yml),
    and automatic GitHub Releases with CHANGELOG-extracted notes on version
    tags (release.yml).
  • CI overhaul: SwiftLint job (pinned 0.63.2, strict config), iOS Simulator
    build job for ProximaKit + ProximaEmbeddings, release tag/version/
    changelog consistency check, benchmark regression gate wired to
    compare.py, SIFT1M SHA-256 verification, and fixed SwiftPM caching.
  • ADRs: ADR-007 (INT8 scalar quantization, accepted), ADR-008 (filtered
    search, retrospective), ADR-010 (format evolution policy), and ADR-011
    (PQ codec, retrospective). ADR-006 moved into docs/adr/ with its siblings.
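
The INT8 scheme described in this list (symmetric per-vector scaling, scale = maxAbs / 127, explicit zero-vector handling) can be sketched in a few lines. This is an illustrative Python sketch, not ProximaKit's Swift ScalarQuantizer: the function names, the clamp, and the storage layout (one 4-byte float scale plus one byte per component) are assumptions chosen to reproduce the 384d figure of 1,536 B → 388 B.

```python
import struct

def quantize(vector):
    """Symmetric per-vector INT8 encoding; returns (scale, codes)."""
    max_abs = max((abs(x) for x in vector), default=0.0)
    if max_abs == 0.0:
        # Explicit zero-vector handling: never divide by a zero scale.
        return 0.0, [0] * len(vector)
    scale = max_abs / 127.0
    # Clamp guards against floating-point overshoot past +/-127.
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return scale, codes

def dequantize(scale, codes):
    """Approximate reconstruction of the original components."""
    return [scale * c for c in codes]

def encoded_size(dimension):
    # Assumed layout: one float32 scale followed by one int8 per component.
    return struct.calcsize("<f") + dimension * struct.calcsize("<b")

vector = [0.5, -1.0, 0.25, 0.0]
scale, codes = quantize(vector)
restored = dequantize(scale, codes)

# Rounding to the nearest code keeps each component within half a step.
worst_error = max(abs(a - b) for a, b in zip(vector, restored))
print(codes, worst_error <= scale / 2 + 1e-12)
print(384 * 4, "->", encoded_size(384))
```

Because the scale is stored per vector, no training pass over the dataset is needed, which is consistent with the "no training phase" note above.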

Changed

  • NLEmbeddingProvider sentence embeddings are now L2-normalized, matching
    the word-averaging fallback path (previously only the fallback normalized).
    Every vector the provider returns now has unit magnitude. Migration:
    indexes persisted from pre-1.5 unnormalized sentence vectors will rank
    differently under DotProductDistance/EuclideanDistance when queried with
    the new unit-length vectors — re-embed and rebuild those indexes, or pin to
    v1.4.x until you can. (CosineDistance users are unaffected.)

  • On-disk format v2. autoCompactionThreshold now survives a save/load
    round-trip. Format v1 files still load — see
    ADR-010 for the evolution policy.

  • HNSWConfiguration rejects m < 2 (m == 1 yields an infinite level
    multiplier and trapped on the first add).

  • ProximaKit.version now reports the actual release (was stuck at 1.0.0);
    a consistency test and a release-workflow check keep it that way.
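
The migration note on L2 normalization can be illustrated with a toy ranking experiment. The sketch below is plain Python, not ProximaKit code: it compares a "stale" set of unnormalized stored vectors against the same vectors after re-embedding to unit length, both queried with a unit-length vector. The negated-dot-product convention for the dot-product distance and all names are assumptions made for the example.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot_distance(a, b):
    # Assumed convention: negated dot product, so smaller means closer.
    return -sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0  # zero-magnitude input treated as neutral
    return 1.0 - sum(x * y for x, y in zip(a, b)) / (na * nb)

def ranking(query, stored, distance):
    """Names of stored vectors, closest first."""
    return sorted(stored, key=lambda name: distance(query, stored[name]))

# Stale index: vectors stored before normalization was applied.
stale = {"aligned": [6.0, 8.0], "short_off": [1.0, 0.0], "long_off": [20.0, 1.0]}
# Rebuilt index: the same vectors re-embedded at unit length.
rebuilt = {name: l2_normalize(v) for name, v in stale.items()}
query = l2_normalize([3.0, 4.0])

for metric in (dot_distance, euclidean_distance, cosine_distance):
    print(metric.__name__, ranking(query, stale, metric), ranking(query, rebuilt, metric))
```

In this toy data the dot-product and Euclidean rankings differ between the stale and rebuilt sets because vector length leaks into the score, while the cosine ranking is identical in both, which is why only the first two call for a rebuild.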

Fixed

  • Critical: tombstone liveness is now identity-based. Liveness was
    presence-based (uuidToNode[uuid] != nil), which breaks after re-adding an
    existing UUID: the old tombstoned slot looked live because the UUID resolves
    to the new node. Search could return stale vectors/metadata, entry-point
    recovery could select a disconnected tombstone (collapsing the graph), and
    compact() resurrected deleted vector bodies. Affected HNSWIndex,
    QuantizedHNSWIndex, and SparseIndex; reproduced 20/20 pre-fix and locked
    in by TombstoneLivenessTests.
  • Batch cosine zero-vector parity. The batch fast path returned distance
    0 (perfect match) for zero-magnitude vectors where scalar CosineDistance
    returns 1.0 (neutral) — degenerate embeddings ranked as top hits in batch
    paths. Both zero-query and zero-stored-vector now return 1.0.
  • Store reentrancy. VectorStore.save() no longer loses concurrent
    addChunks dirty-flag updates across its suspension point;
    HybridVectorStore two-leg saves can no longer persist diverged
    dense/sparse files; removeDocument() closed its orphan window; document-map
    writes are atomic.
  • Persistence loaders validate before trusting. Graph indices, entry
    points, levels, and configuration ranges are checked on load, throwing typed
    PersistenceError instead of crashing on corrupt or hostile files.
  • QuantizedHNSWIndex.build no longer misaligns PQ codes/metadata when the
    input contains duplicate ids; HNSW remove() now repairs dangling incoming
    edges; .weightedSum fusion validates alpha ∈ [0, 1].
  • DefaultBM25Tokenizer dropped locale-sensitive lowercasing — tokenization
    is now deterministic regardless of device locale, per its contract.
  • CoreMLEmbeddingProvider now conforms to EmbeddingProvider /
    TextEmbedder as documented, so it plugs into VectorStore directly.
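
The tombstone fix at the top of this list comes down to the difference between "this UUID resolves to something" and "this UUID resolves to this slot". A minimal Python sketch of that distinction follows. The class and its layout are hypothetical and are not ProximaKit's implementation; only the uuidToNode idea is taken from the notes above.

```python
class ToySlotIndex:
    """Hypothetical slot store; not ProximaKit's actual data layout."""

    def __init__(self):
        self.slots = []          # node id -> (uuid, payload); tombstones stay in place
        self.uuid_to_node = {}   # uuid -> node id of the current owner

    def add(self, uuid, payload):
        self.slots.append((uuid, payload))
        self.uuid_to_node[uuid] = len(self.slots) - 1
        return self.uuid_to_node[uuid]

    def remove(self, uuid):
        # Tombstone: drop the mapping but leave the slot for later compaction.
        self.uuid_to_node.pop(uuid, None)

    def is_live_by_presence(self, node):
        uuid, _ = self.slots[node]
        return uuid in self.uuid_to_node            # pre-fix style check

    def is_live_by_identity(self, node):
        uuid, _ = self.slots[node]
        return self.uuid_to_node.get(uuid) == node  # slot must be the current owner

index = ToySlotIndex()
old = index.add("doc-1", "old body")
index.remove("doc-1")
new = index.add("doc-1", "new body")   # re-add the same UUID

print(index.is_live_by_presence(old))  # True: the stale slot looks live
print(index.is_live_by_identity(old))  # False
print(index.is_live_by_identity(new))  # True
```

With the presence check, a search or compaction pass that visits the old slot would treat the deleted body as live, which matches the stale-result and resurrection symptoms described above.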

v1.4.0 — Hybrid retrieval + cross-library benchmarks

09 Jun 22:16

Hybrid BM25 + dense retrieval and a cross-library benchmark harness (FAISS + ScaNN). The core ProximaKit target remains Foundation + Accelerate only — no new external dependencies.

Added

  • Cross-library benchmark harness (Benchmarks/). Standalone SPM package
    ProximaBench that compares ProximaKit HNSW against FAISS HNSW and ScaNN
    on identical datasets and identical brute-force ground truth. The core
    ProximaKit target stays dependency-free — baselines run in Python and
    all harnesses write a flat JSON schema (see Benchmarks/JSON_SCHEMA.md).

    • Swift subcommands: ground-truth (exact k-NN via BruteForceIndex)
      and hnsw (build + timed search + recall@k against GT).
    • Python baselines under Benchmarks/python/: faiss_hnsw.py,
      scann_hnsw.py (auto-skips on unsupported platforms), compare.py
      aggregator that emits a Markdown table.
    • Datasets: SIFT1M 100K subset + MS MARCO passages 50K (MiniLM-L6-v2
      embeddings). Idempotent download scripts under Benchmarks/datasets/.
    • Metrics: recall@10 vs exact GT, p50/p95 query latency, QPS, build time,
      resident memory (mach_task_basic_info on Swift, psutil on Python).
  • docs/BENCHMARKS.md — "Cross-Library Comparison" section with
    design rules, dataset table, metrics table, and end-to-end reproduction
    steps that call the harness binaries directly.

  • docs/adr/ADR-005-benchmark-methodology.md documenting why the
    baselines live out-of-process and why Benchmarks/ is a separate SPM
    package rather than a target of Package.swift.

  • CI: .github/workflows/benchmark.yml. Smoke slice (SIFT1M 10K) runs
    on every PR that touches Sources/ProximaKit/** or the harness. Full
    slice (100K) runs nightly. Results (per-library JSON + aggregated
    compare.md) are uploaded as workflow artifacts.

  • Hybrid retrieval (BM25 + dense). Three new public types in the core
    ProximaKit target, sibling to the existing dense-only stack:

    • SparseIndex — BM25 actor (SparseVectorIndex protocol), Okapi scoring
      with Lucene-style log(1 + (N − df + 0.5) / (df + 0.5)) IDF, configurable
      k1 / b, tombstoning + auto-compaction matching HNSWIndex.
    • HybridIndex — concurrent fan-out over a dense VectorIndex and a
      SparseVectorIndex, with HybridFusionStrategy = .rrf(k:) (default,
      k = 60) or .weightedSum(alpha:).
    • HybridVectorStore — sibling of VectorStore with the same
      addChunks / query / removeDocument / save shape. Persists both
      legs side-by-side (index.pxkt + index.pxbm).
  • BM25Tokenizer protocol with DefaultBM25Tokenizer — Unicode word-break
    segmentation + lowercasing, no NaturalLanguage dependency. Bring-your-own
    tokenizer for language-aware tokenization (e.g. Lumen's NLTokenizer).

  • BM25Configuration with k1, b, autoCompactionThreshold knobs.

  • .pxbm binary persistence for SparseIndex via an extension on
    PersistenceEngine. Same header / offset layout conventions as
    .pxkt; compacts tombstones on save.

  • docs/HYBRID.md — hybrid retrieval design, fusion-strategy rationale,
    Lumen opt-in snippet.

  • 40 new tests across SparseIndexTests, DefaultBM25TokenizerTests,
    HybridIndexTests, and HybridVectorStoreTests, including a 1K-doc BM25
    parity check against an oracle implementation and the RRF
    top-k ⊇ (dense ∩ sparse) invariant on constructed cases.
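
For readers unfamiliar with the .rrf(k:) strategy listed above, reciprocal rank fusion scores each document as the sum of 1 / (k + rank) over every ranked list it appears in. The sketch below is a generic Python rendering of that formula, not HybridIndex itself. The 1-based ranks and the tie-break by id are assumptions; k = 60 is the default quoted in the notes.

```python
def rrf_fuse(rankings, k=60, top_k=10):
    """Fuse ranked id lists with reciprocal rank fusion; best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # assumed 1-based ranks
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_k]

dense = ["a", "b", "c"]    # ids from the dense leg, best first
sparse = ["c", "a", "d"]   # ids from the BM25 leg, best first

fused = rrf_fuse([dense, sparse])
print(fused)  # ['a', 'c', 'b', 'd']

# Documents found by both legs end up at the top of the fused list here.
print(set(dense) & set(sparse) <= set(fused[:2]))  # True
```

Only ranks enter the formula, so the dense distances and BM25 scores never need to be put on a common scale. That is the usual argument for RRF over a weighted sum.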

Changed

  • .gitignore now tracks Benchmarks/ sources but ignores the on-demand
    Benchmarks/datasets/ payloads and Benchmarks/out/ run artifacts.
  • docs/ADR-006-lumen-integration.md — new addendum covering the hybrid
    opt-in path. The v1.1 VectorStore contract is unchanged.

Fixed

  • SparseIndexTests.testBM25ParityAgainstOracle no longer flakes when BM25
    score ties straddle the top-k truncation boundary. Both the oracle and
    SparseIndex are queried with k + 50 and the assertion walks fully
    realized score buckets until it covers the top-k window — BM25 makes no
    tie-break guarantee, so the test now verifies only what parity actually
    demands (score agreement across the top-k window).
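
One way to picture the tie-tolerant assertion described above is to group each over-fetched result list into buckets of equal score and compare whole buckets rather than positions. The Python sketch below is an interpretation of that idea, not the actual test code: the function names, the rounding tolerance, and the choice to compare each complete bucket as an unordered id set are all assumptions.

```python
from itertools import groupby

def tie_buckets(results, digits=6):
    """Group (doc_id, score) pairs into buckets of equal rounded score, best first."""
    ordered = sorted(results, key=lambda pair: -pair[1])
    return [
        (score, {doc_id for doc_id, _ in group})
        for score, group in groupby(ordered, key=lambda pair: round(pair[1], digits))
    ]

def top_k_parity(expected, actual, k):
    """True if both over-fetched lists agree on every complete tie bucket
    needed to cover the first k results."""
    covered = 0
    # The last bucket of each list may be cut off by the over-fetch limit, so skip it.
    pairs = zip(tie_buckets(expected)[:-1], tie_buckets(actual)[:-1])
    for (exp_score, exp_ids), (act_score, act_ids) in pairs:
        if exp_score != act_score or exp_ids != act_ids:
            return False
        covered += len(exp_ids)
        if covered >= k:
            return True
    return False  # over-fetch margin too small to cover the window

oracle = [("a", 3.0), ("b", 2.0), ("c", 2.0), ("d", 1.0), ("e", 0.5)]
index_hits = [("a", 3.0), ("c", 2.0), ("b", 2.0), ("d", 1.0), ("f", 0.5)]

# A positional top-2 comparison would fail (b vs c); the bucket walk does not.
print(top_k_parity(oracle, index_hits, k=2))  # True
```

The over-fetch (k + 50 in the notes) is what makes it likely that every bucket touching the top-k window is complete in both lists.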

v1.1.0

18 Mar 11:05

What's New in v1.1.0

SIMD & Performance

  • SIMD-accelerated batch vector operations: batchDotProducts and batchL2Distances for bulk computation
  • vDSP vs naive loop benchmark suite

CoreML & Embedding

  • WordPiece tokenizer for BERT-compatible CoreML model input

Demo App

  • Image search via VisionEmbeddingProvider
  • Index persistence — index survives app restart
  • efSearch slider for live tuning
  • User note and image input

Docs & Project

  • README rewritten with ASCII architecture diagrams, feature comparison, and performance dashboard
  • CONTRIBUTING.md, CHANGELOG.md, BENCHMARKS.md added

Full Changelog: v1.0.0...v1.1.0