v1.7.0

thettwe released this 30 Apr 02:58
· 34 commits to develop since this release

v1.7.0 — Compound Detection & Benchmark Hygiene

Two new compound-recovery strategies and a benchmark measurement overhaul. With the new strategies enabled (both are off by default), the pipeline catches compound typos fragmented by whitespace insertion or segmenter over-splitting. Five benchmark hygiene passes reclassified 100+ rows for more accurate scoring. The spelling-only composite moves from 0.6345 to 0.6292, which is a measurement correction rather than a regression (see the Benchmark section).

Full user-facing changes: Release Notes · CHANGELOG.

New validation strategies

  • CompoundMergeProbeStrategy (priority 46, default off) — slides a window of 2–N adjacent segmented tokens, concatenates their raw text, and probes SymSpell for a high-frequency dictionary match at edit distance ≤ 2. Recovers compound typos like စွမ်းဆောင်ရည (missing asat) where each fragment is individually valid. Includes particle exclusion, asat-insertion fast path, name-mask guard, compound affinity scoring, and morphology peek gates.
  • CrossWhitespaceProbeStrategy (priority 47, default off) — recovers broken compounds split by whitespace insertion. Probes SymSpell on concatenations of adjacent whitespace-delimited tokens when fragments are low-frequency and the merged form is a high-frequency dictionary word.
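
The sliding-window probe described above can be sketched roughly as follows. This is a minimal illustration, not the library's code: a linear scan over a toy frequency dictionary stands in for the SymSpell lookup, the names `probe_merges`, `max_window`, and `min_freq` are hypothetical, and the particle, name-mask, affinity, and morphology gates are left out. An English toy example is used in place of Myanmar text.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def probe_merges(
    tokens: List[str],
    freq: Dict[str, int],
    max_window: int = 3,      # hypothetical cap on the window size N
    max_distance: int = 2,    # edit distance bound from the release notes
    min_freq: int = 1000,     # hypothetical "high-frequency" threshold
) -> List[Tuple[int, int, str]]:
    """Return (start, end, suggestion) for windows that merge into a frequent word."""
    hits = []
    for start in range(len(tokens)):
        for size in range(2, max_window + 1):
            end = start + size
            if end > len(tokens):
                break
            merged = "".join(tokens[start:end])
            best = None
            # Linear scan standing in for a SymSpell lookup.
            for word, count in freq.items():
                if count < min_freq:
                    continue
                d = edit_distance(merged, word)
                # Prefer the smaller distance, then the higher frequency.
                if d <= max_distance and (best is None or (d, -count) < best[0]):
                    best = ((d, -count), word)
            if best is not None:
                hits.append((start, end, best[1]))
    return hits


tokens = ["note", "bok", "was", "lost"]
freq = {"notebook": 5000, "was": 90000, "lost": 40000}
print(probe_merges(tokens, freq))  # [(0, 2, 'notebook')]
```

Without extra gates a probe like this over-triggers easily (any window that happens to sit within two edits of a frequent word is flagged), which is presumably why the real strategies layer exclusions and frequency checks on top.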

Segmentation pipeline

  • Lattice decoder — joint decoder for the segmentation pipeline that evaluates multiple segmentation hypotheses simultaneously. Wired into the default segmenter as an opt-in alternative to greedy left-to-right segmentation.
  • Cython Viterbi top-K — K-best path extraction for the word tokenizer, accelerated via Cython. Enables downstream strategies to evaluate alternative segmentations.
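
To illustrate what K-best path extraction gives downstream strategies, here is a pure-Python sketch of a top-K Viterbi segmenter. It assumes a simple unigram log-probability lexicon, and the function name and parameters are hypothetical; the release's actual implementation is in Cython and its scoring model is not described here.

```python
import heapq
import math
from typing import Dict, List, Tuple


def top_k_segmentations(
    text: str, log_prob: Dict[str, float], k: int = 3, max_len: int = 8
) -> List[Tuple[float, List[str]]]:
    """Return up to k best (score, words) segmentations under a unigram model."""
    # best[i] holds up to k (score, words) hypotheses covering text[:i].
    best: List[List[Tuple[float, List[str]]]] = [[] for _ in range(len(text) + 1)]
    best[0] = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word not in log_prob:
                continue
            # Extend every surviving hypothesis that ends at `start`.
            for score, words in best[start]:
                candidates.append((score + log_prob[word], words + [word]))
        # Keep only the k highest-scoring hypotheses at this position.
        best[end] = heapq.nlargest(k, candidates, key=lambda c: c[0])
    return best[-1]


lexicon = {w: math.log(p) for w, p in
           {"ab": 0.4, "a": 0.2, "b": 0.2, "c": 0.1, "bc": 0.1}.items()}
for score, words in top_k_segmentations("abc", lexicon, k=2):
    print(words)
# ['ab', 'c']
# ['a', 'bc']
```

With k = 1 this reduces to ordinary Viterbi decoding; larger k lets a later stage choose among alternative splits instead of committing to a single one.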

Benchmark

  • Row reclassification — 100+ benchmark rows reclassified across five hygiene passes: 58 compound-split tokenization rows, 18 empty-gold detection-only rows, 21 mislabeled M3/M4 rows, 3 byte-identical duplicates, and root-cause empty-gold checks. Added _is_scorable runner predicate to exclude non-scorable rows from composite calculation.
  • Composite rebase — the v1.6.0 composite (0.6345) was measured with the pre-hygiene runner that scored non-scorable rows. Post-hygiene runner produces 0.6292 on the same pipeline — this is a measurement correction, not a regression. The pipeline is behaviorally identical.
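
A toy example of why filtering rows changes the number without any change in pipeline behavior. The row fields, the `tag` values, and the exclusion rule are assumptions made for illustration; the real `_is_scorable` predicate and the composite formula are not shown in these notes, so a plain mean stands in for the composite.

```python
from typing import Dict, List


def is_scorable(row: Dict) -> bool:
    """Hypothetical predicate: keep rows with a gold correction and no exclusion tag."""
    if not row.get("gold"):
        return False  # empty gold: detection-only row
    return row.get("tag") not in {"tokenization", "duplicate"}


def mean_score(rows: List[Dict]) -> float:
    """Average per-row score (a stand-in, not the real composite formula)."""
    return sum(r["score"] for r in rows) / len(rows)


rows = [
    {"id": 1, "gold": "x", "tag": None, "score": 1.0},
    {"id": 2, "gold": "", "tag": None, "score": 1.0},          # empty gold
    {"id": 3, "gold": "y", "tag": "duplicate", "score": 1.0},  # duplicate row
    {"id": 4, "gold": "z", "tag": None, "score": 0.0},
]
scorable = [r for r in rows if is_scorable(r)]

print(mean_score(rows))      # 0.75, every row counted
print(mean_score(scorable))  # 0.5, same outputs, smaller scorable set
```

The per-row outputs are identical in both calls; only the set of rows entering the average differs.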

Code quality

  • Deduplicated _has_confident_symspell_candidate across word_validator.py and error_suppression.py into shared correction_utils.py.
  • Lifted _check_colloquial_variant from SyllableValidator and WordValidator into the Validator base class.
  • Replaced monkey-patched _boosted_by_compound_split and _structural_early_exit boolean flags with proper dataclass fields on Error.
  • Added a thread-safety lock on MinedConfusablePairStrategy._freq_cache.
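
For readers unfamiliar with the pattern, a lock-guarded lookup cache generally looks like the sketch below. The class `FreqCache` and its `loader` callback are hypothetical and only illustrate the idea; they are not the layout of MinedConfusablePairStrategy.

```python
import threading
from typing import Callable, Dict


class FreqCache:
    """Hypothetical lock-guarded memo cache for word-frequency lookups."""

    def __init__(self, loader: Callable[[str], int]) -> None:
        self._loader = loader
        self._cache: Dict[str, int] = {}
        self._lock = threading.Lock()

    def get(self, word: str) -> int:
        # The check-then-fill sequence runs under the lock, so two threads
        # cannot both miss and load the same key.
        with self._lock:
            if word not in self._cache:
                self._cache[word] = self._loader(word)
            return self._cache[word]
```

Holding the lock during the load is the simplest correct choice; it serializes lookups, which is acceptable when the loader is cheap.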

Fixed

  • Error.to_dict() now strips private _-prefixed dataclass fields, preventing internal state from leaking to API consumers.
  • CompoundMergeProbeStrategy and CrossWhitespaceProbeStrategy set source_strategy on emitted errors for meta-fusion scoring.
  • Both new strategies are now registered in STRATEGY_TIER, STRATEGY_RELIABILITY, and INDEPENDENCE_CLUSTERS.
  • Dependency audit workflow switched from system-install pip-audit to uv export | pip-audit -r, avoiding false positives from runner system packages.
  • Test suite segfault and streaming timeout eliminated.
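
The Error.to_dict() fix can be pictured with a small stand-in dataclass. The public field names `text` and `suggestion` are invented for this sketch and the real Error class has a different shape; the two private flags and `source_strategy` are the names mentioned in these notes.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class Error:
    """Hypothetical stand-in for the library's Error class."""
    text: str
    suggestion: str
    source_strategy: str = ""
    # Internal flags as real dataclass fields rather than monkey-patched attributes.
    _boosted_by_compound_split: bool = False
    _structural_early_exit: bool = False

    def to_dict(self) -> Dict[str, Any]:
        # Serialize only public fields; anything starting with "_" stays internal.
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if not f.name.startswith("_")
        }


err = Error("teh", "the", "CompoundMergeProbeStrategy", _boosted_by_compound_split=True)
print(err.to_dict())
# {'text': 'teh', 'suggestion': 'the', 'source_strategy': 'CompoundMergeProbeStrategy'}
```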

Benchmark results

Run on the production database + semantic v2.4, --domain spelling:

Metric         v1.7.0
Composite      0.6292
FPR (clean)    0.1105
Clean FP       83

Install

pip install myspellchecker==1.7.0