Skip to content

v0.2.1

Choose a tag to compare

@ramannanda9 ramannanda9 released this 31 Mar 19:55
· 3 commits to main since this release
fa02618

What's changed

Fix

  • Replaced naive whitespace tokenization in compute_minhash with word bigrams (unigrams + bigrams) as default — detects word-order and co-occurrence drift, not just bag-of-words changes
  • Added char_shingle tokenizer option for structured strings and IDs (3-char shingles)
  • Kept whitespace as a legacy opt-in for backwards compatibility

Consistency guards

  • build_baseline raises ValueError if minhash records with mixed tokenizers are merged
  • compute_signals raises ValueError if current sketch and baseline were built with different tokenizers

Chore

  • Single-sourced version in lakesense/__init__.py via hatchling dynamic; removed duplicate from pyproject.toml
  • Added .claude/settings.local.json to .gitignore

Breaking change

Existing minhash baselines stored under the old whitespace tokenization must be rebuilt — comparing them against new word_ngram sketches will raise a ValueError.