v0.2.1

ramannanda9 released this 31 Mar 19:55

· 3 commits to main since this release

fa02618

What's changed

Fix

Replaced naive whitespace tokenization in compute_minhash with word bigrams (unigrams + bigrams) as default — detects word-order and co-occurrence drift, not just bag-of-words changes
Added char_shingle tokenizer option for structured strings and IDs (3-char shingles)
Kept whitespace as a legacy opt-in for backwards compatibility

Consistency guards

build_baseline raises ValueError if minhash records with mixed tokenizers are merged
compute_signals raises ValueError if current sketch and baseline were built with different tokenizers

Chore

Single-sourced version in lakesense/__init__.py via hatchling dynamic; removed duplicate from pyproject.toml
Added .claude/settings.local.json to .gitignore

Breaking change

Existing minhash baselines stored under the old whitespace tokenization must be rebuilt — comparing them against new word_ngram sketches will raise a ValueError.

Assets 2