v1.7.0
v1.7.0 — Compound Detection & Benchmark Hygiene
Two new compound-recovery strategies and a benchmark measurement overhaul. The pipeline now catches compound typos fragmented by whitespace insertion or segmenter over-splitting. Benchmark infrastructure reclassified 100+ rows for more accurate scoring. Spelling-only composite 0.6345 → 0.6292 (measurement correction — see Benchmark section).
Full user-facing changes: Release Notes · CHANGELOG.
New validation strategies
CompoundMergeProbeStrategy(priority 46, default off) — slides a window of 2–N adjacent segmented tokens, concatenates their raw text, and probes SymSpell for a high-frequency dictionary match at edit distance ≤ 2. Recovers compound typos likeစွမ်းဆောင်ရည(missing asat) where each fragment is individually valid. Includes particle exclusion, asat-insertion fast path, name-mask guard, compound affinity scoring, and morphology peek gates.CrossWhitespaceProbeStrategy(priority 47, default off) — recovers broken compounds split by whitespace insertion. Probes SymSpell on concatenations of adjacent whitespace-delimited tokens when fragments are low-frequency and the merged form is a high-frequency dictionary word.
Segmentation pipeline
- Lattice decoder — joint decoder for the segmentation pipeline that evaluates multiple segmentation hypotheses simultaneously. Wired into the default segmenter as an opt-in alternative to greedy left-to-right segmentation.
- Cython Viterbi top-K — K-best path extraction for the word tokenizer, accelerated via Cython. Enables downstream strategies to evaluate alternative segmentations.
Benchmark
- Row reclassification — 100+ benchmark rows reclassified across five hygiene passes: 58 compound-split tokenization rows, 18 empty-gold detection-only rows, 21 mislabeled M3/M4 rows, 3 byte-identical duplicates, and root-cause empty-gold checks. Added
_is_scorablerunner predicate to exclude non-scorable rows from composite calculation. - Composite rebase — the v1.6.0 composite (0.6345) was measured with the pre-hygiene runner that scored non-scorable rows. Post-hygiene runner produces 0.6292 on the same pipeline — this is a measurement correction, not a regression. The pipeline is behaviorally identical.
Code quality
- Deduplicated
_has_confident_symspell_candidateacrossword_validator.pyanderror_suppression.pyinto sharedcorrection_utils.py. - Lifted
_check_colloquial_variantfromSyllableValidatorandWordValidatorinto theValidatorbase class. - Replaced monkey-patched
_boosted_by_compound_splitand_structural_early_exitboolean flags with proper dataclass fields onError. - Thread-safety lock on
MinedConfusablePairStrategy._freq_cache.
Fixed
Error.to_dict()now strips private_-prefixed dataclass fields, preventing internal state from leaking to API consumers.CompoundMergeProbeStrategyandCrossWhitespaceProbeStrategysetsource_strategyon emitted errors for meta-fusion scoring.- Both new strategies registered in
STRATEGY_TIER,STRATEGY_RELIABILITY, andINDEPENDENCE_CLUSTERS. - Dependency audit workflow switched from system-install
pip-audittouv export | pip-audit -r, avoiding false positives from runner system packages. - Test suite segfault and streaming timeout eliminated.
Benchmark results
Run on the production database + semantic v2.4, --domain spelling:
| Metric | v1.7.0 |
|---|---|
| Composite | 0.6292 |
| FPR (clean) | 0.1105 |
| Clean FP | 83 |
Install
pip install myspellchecker==1.7.0