v0.2.1
What's changed
Fix
- Replaced naive whitespace tokenization in
compute_minhashwith word bigrams (unigrams + bigrams) as default — detects word-order and co-occurrence drift, not just bag-of-words changes - Added
char_shingletokenizer option for structured strings and IDs (3-char shingles) - Kept
whitespaceas a legacy opt-in for backwards compatibility
Consistency guards
build_baselineraisesValueErrorif minhash records with mixed tokenizers are mergedcompute_signalsraisesValueErrorif current sketch and baseline were built with different tokenizers
Chore
- Single-sourced version in
lakesense/__init__.pyvia hatchlingdynamic; removed duplicate frompyproject.toml - Added
.claude/settings.local.jsonto.gitignore
Breaking change
Existing minhash baselines stored under the old whitespace tokenization must be rebuilt — comparing them against new word_ngram sketches will raise a ValueError.