Releases · vivekptnk/ProximaKit

10 Jun 13:43

v1.5.0

d9fd917

v1.5.0 Latest

Latest

Correctness fixes from a multi-agent audit (every fix reproduced before patching, re-verified after), INT8 scalar quantization, three new distance metrics, reproducible graph construction, and a CI overhaul.

Added

Multiplatform demo app: ProximaDemoApp now targets iPhone, iPad,
macOS, and visionOS from one SwiftUI target (compact widths get a
search-first tab layout; AppKit image loading replaced with ImageIO).
The persisted demo index is validated against the current embedder's
dimension before reuse — NLEmbedding can pin sentence (512d) or
word-averaging (300d) mode depending on which language assets the OS
has, and a stale-dimension index made every search silently empty.
Dimension mismatches now surface as an actionable error instead.
A -demoQuery launch argument supports screenshot automation.
INT8 scalar quantization (ADR-007). ScalarQuantizer — symmetric
per-vector scaling (scale = maxAbs / 127, explicit zero-vector handling)
— plus the ScalarQuantizedHNSWIndex actor. ~4× vector-storage reduction
(384d: 1,536 B → 388 B per vector), no training phase, and search runs
through the configured DistanceMetric, so any serialisable metric works
(contrast with PQ's L2-only ADC). Two-phase build (full-precision graph
construction, then encode), binary persistence, memory introspection
(codeStorageBytes / memorySavingsRatio), and acceptance-tested recall
floors: Recall@10 ≥ 0.95 (euclidean) / ≥ 0.93 (cosine) against brute-force
ground truth. Design rationale in
ADR-007.
Three new distance metrics: ChebyshevDistance (L∞),
BrayCurtisDistance, and MahalanobisDistance (constructible from a
covariance or inverse-covariance matrix). Chebyshev and Bray-Curtis join
DistanceMetricType and persist with any index; Mahalanobis is search-only
(not serialisable), and persistenceSnapshot() reports it as
PersistenceError.unserializableMetric rather than guessing.
HNSWConfiguration.levelSeed — seeds the layer-assignment RNG so graph
construction is reproducible: the same insertion sequence yields the same
topology. Build-time knob only; deliberately not persisted.
Persistence corruption-hardening test matrix — 42 tests across all four
binary codecs, covering truncated sections, out-of-range graph indices,
invalid entry points, and bad configuration values.
DocC published to GitHub Pages on every push to main (docs.yml),
and automatic GitHub Releases with CHANGELOG-extracted notes on version
tags (release.yml).
CI overhaul: SwiftLint job (pinned 0.63.2, strict config), iOS Simulator
build job for ProximaKit + ProximaEmbeddings, release tag/version/
changelog consistency check, benchmark regression gate wired to
compare.py, SIFT1M SHA-256 verification, and fixed SwiftPM caching.
ADRs: ADR-007 (INT8
scalar quantization — accepted),
ADR-008 (filtered search —
retrospective), ADR-010 (format
evolution policy), ADR-011 (PQ codec —
retrospective). ADR-006 moved into docs/adr/ with its siblings.

Changed

NLEmbeddingProvider sentence embeddings are now L2-normalized, matching
the word-averaging fallback path (previously only the fallback normalized).
Every vector the provider returns now has unit magnitude. Migration:
indexes persisted from pre-1.5 unnormalized sentence vectors will rank
differently under DotProductDistance/EuclideanDistance when queried with
the new unit-length vectors — re-embed and rebuild those indexes, or pin to
v1.4.x until you can. (CosineDistance users are unaffected.)
On-disk format v2. autoCompactionThreshold now survives a save/load
round-trip. Format v1 files still load — see
ADR-010 for the evolution policy.
HNSWConfiguration rejects m < 2 (m == 1 yields an infinite level
multiplier and trapped on the first add).
ProximaKit.version now reports the actual release (was stuck at 1.0.0);
a consistency test and a release-workflow check keep it that way.

Fixed

Critical: tombstone liveness is now identity-based. Liveness was
presence-based (uuidToNode[uuid] != nil), which breaks after re-adding an
existing UUID: the old tombstoned slot looked live because the UUID resolves
to the new node. Search could return stale vectors/metadata, entry-point
recovery could select a disconnected tombstone (collapsing the graph), and
compact() resurrected deleted vector bodies. Affected HNSWIndex,
QuantizedHNSWIndex, and SparseIndex; reproduced 20/20 pre-fix and locked
in by TombstoneLivenessTests.
Batch cosine zero-vector parity. The batch fast path returned distance
0 (perfect match) for zero-magnitude vectors where scalar CosineDistance
returns 1.0 (neutral) — degenerate embeddings ranked as top hits in batch
paths. Both zero-query and zero-stored-vector now return 1.0.
Store reentrancy. VectorStore.save() no longer loses concurrent
addChunks dirty-flag updates across its suspension point;
HybridVectorStore two-leg saves can no longer persist diverged
dense/sparse files; removeDocument() closed its orphan window; document-map
writes are atomic.
Persistence loaders validate before trusting. Graph indices, entry
points, levels, and configuration ranges are checked on load, throwing typed
PersistenceError instead of crashing on corrupt or hostile files.
QuantizedHNSWIndex.build no longer misaligns PQ codes/metadata when the
input contains duplicate ids; HNSW remove() now repairs dangling incoming
edges; .weightedSum fusion validates alpha ∈ [0, 1].
DefaultBM25Tokenizer dropped locale-sensitive lowercasing — tokenization
is now deterministic regardless of device locale, per its contract.
CoreMLEmbeddingProvider now conforms to EmbeddingProvider /
TextEmbedder as documented, so it plugs into VectorStore directly.

Assets 2

09 Jun 22:16

vivekptnk

v1.4.0

7df4dd9

v1.4.0 — Hybrid retrieval + cross-library benchmarks

Hybrid BM25 + dense retrieval and a cross-library benchmark harness (FAISS + ScaNN). The core ProximaKit target remains Foundation + Accelerate only — no new external dependencies.

Added

Cross-library benchmark harness (Benchmarks/). Standalone SPM package
ProximaBench that compares ProximaKit HNSW against FAISS HNSW and ScaNN
on identical datasets and identical brute-force ground truth. The core
ProximaKit target stays dependency-free — baselines run in Python and
all harnesses write a flat JSON schema (see Benchmarks/JSON_SCHEMA.md).
- Swift subcommands: ground-truth (exact k-NN via BruteForceIndex)
  and hnsw (build + timed search + recall@k against GT).
- Python baselines under Benchmarks/python/: faiss_hnsw.py,
  scann_hnsw.py (auto-skips on unsupported platforms), compare.py
  aggregator that emits a Markdown table.
- Datasets: SIFT1M 100K subset + MS MARCO passages 50K (MiniLM-L6-v2
  embeddings). Idempotent download scripts under Benchmarks/datasets/.
- Metrics: recall@10 vs exact GT, p50/p95 query latency, QPS, build time,
  resident memory (mach_task_basic_info on Swift, psutil on Python).
docs/BENCHMARKS.md — "Cross-Library Comparison" section with
design rules, dataset table, metrics table, and end-to-end reproduction
steps that call the harness binaries directly.
docs/adr/ADR-005-benchmark-methodology.md documenting why the
baselines live out-of-process and why Benchmarks/ is a separate SPM
package rather than a target of Package.swift.
CI: .github/workflows/benchmark.yml. Smoke slice (SIFT1M 10K) runs
on every PR that touches Sources/ProximaKit/** or the harness. Full
slice (100K) runs nightly. Results (per-library JSON + aggregated
compare.md) are uploaded as workflow artifacts.
Hybrid retrieval (BM25 + dense). Three new public types in the core
ProximaKit target, sibling to the existing dense-only stack:
- SparseIndex — BM25 actor (SparseVectorIndex protocol), Okapi scoring
  with Lucene-style log(1 + (N − df + 0.5) / (df + 0.5)) IDF, configurable
  k1 / b, tombstoning + auto-compaction matching HNSWIndex.
- HybridIndex — concurrent fan-out over a dense VectorIndex and a
  SparseVectorIndex, with HybridFusionStrategy = .rrf(k:) (default,
  k = 60) or .weightedSum(alpha:).
- HybridVectorStore — sibling of VectorStore with the same
  addChunks / query / removeDocument / save shape. Persists both
  legs side-by-side (index.pxkt + index.pxbm).
BM25Tokenizer protocol with DefaultBM25Tokenizer — Unicode word-break
segmentation + lowercasing, no NaturalLanguage dependency. Bring-your-own
tokenizer for language-aware tokenization (e.g. Lumen's NLTokenizer).
BM25Configuration with k1, b, autoCompactionThreshold knobs.
.pxbm binary persistence for SparseIndex via an extension on
PersistenceEngine. Same header / offset layout conventions as
.pxkt; compacts tombstones on save.
docs/HYBRID.md — hybrid retrieval design, fusion-strategy rationale,
Lumen opt-in snippet.
40 new tests across SparseIndexTests, DefaultBM25TokenizerTests,
HybridIndexTests, and HybridVectorStoreTests, including a 1K-doc BM25
parity check against an oracle implementation and the RRF
top-k ⊇ (dense ∩ sparse) invariant on constructed cases.

Changed

.gitignore now tracks Benchmarks/ sources but ignores the on-demand
Benchmarks/datasets/ payloads and Benchmarks/out/ run artifacts.
docs/ADR-006-lumen-integration.md — new addendum covering the hybrid
opt-in path. The v1.1 VectorStore contract is unchanged.

Fixed

SparseIndexTests.testBM25ParityAgainstOracle no longer flakes when BM25
score ties straddle the top-k truncation boundary. Both the oracle and
SparseIndex are queried with k + 50 and the assertion walks fully
realized score buckets until it covers the top-k window — BM25 makes no
tie-break guarantee, so the test now verifies only what parity actually
demands (score agreement across the top-k window).

Assets 2

18 Mar 11:05

vivekptnk

v1.1.0

466e79f

v1.1.0

What's New in v1.1.0

SIMD & Performance

SIMD-accelerated batch vector operations — batchDotProducts and batchL2Distances for bulk computation
vDSP vs naive loop benchmark suite

CoreML & Embedding

WordPiece tokenizer for BERT-compatible CoreML model input

Demo App

Image search via VisionEmbeddingProvider
Index persistence — index survives app restart
efSearch slider for live tuning
User note and image input

Docs & Project

README rewritten with ASCII architecture diagrams, feature comparison, and performance dashboard
CONTRIBUTING.md, CHANGELOG.md, BENCHMARKS.md added

Full Changelog: v1.0.0...v1.1.0

Assets 2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Added

Changed

Fixed

Uh oh!

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Added

Changed

Fixed

Uh oh!

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

What's New in v1.1.0

SIMD & Performance

CoreML & Embedding

Demo App

Docs & Project

Uh oh!

Releases: vivekptnk/ProximaKit

v1.5.0

Added

Changed

Fixed

Uh oh!

v1.4.0 — Hybrid retrieval + cross-library benchmarks

Added

Changed

Fixed

Uh oh!

v1.1.0

What's New in v1.1.0

SIMD & Performance

CoreML & Embedding

Demo App

Docs & Project

Uh oh!