v1.6.0 — Spelling-First Benchmark & Mined Confusable Detection
Spelling-focused release. The pipeline now catches several error classes the v1.5.0 defaults missed: real-word confusables whose partner is also dictionary-valid, compound typos hidden by segmenter over-splitting, and missing-asat / substitution typos whose fragmented form happens to be piecewise-valid. Spelling-only composite 0.6161 → 0.6345 (+0.0184).
Full user-facing changes: Release Notes · CHANGELOG.
New validation strategies
- `MinedConfusablePairStrategy` (priority 49, default on): flags real-word confusables using a table of 23,970 edit-distance-1 pairs mined from the production dictionary. Both forms are dictionary-valid, so SymSpell cannot surface them on its own; emissions are gated by a semantic MLM logit margin and a frequency ratio between the current word and its partner.
- `PreSegmenterRawProbeStrategy` (priority 23, default on): runs `SymSpell.lookup(raw_token, level='word')` on unsegmented whitespace-delimited tokens before the segmenter fragments them. Recovers compound typos like `ကုန်ကျစရိက်` that would otherwise be split into piecewise-valid subtokens.
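The two-part gate on mined confusable emissions can be pictured as a simple conjunction. The sketch below is illustrative only: `GateConfig`, `should_flag`, the threshold values, and the direction of the frequency comparison are assumptions made for this example, not the library's API or its shipped defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateConfig:
    # Hypothetical thresholds; the real defaults are not stated in these notes.
    min_logit_margin: float = 2.0
    min_freq_ratio: float = 1.5


def should_flag(current_logit, partner_logit, current_freq, partner_freq,
                cfg=GateConfig()):
    """Return True when both gates favor the confusable partner.

    Assumption: the partner must beat the current word by a logit margin
    AND be sufficiently more frequent than the current word.
    """
    margin = partner_logit - current_logit
    # Avoid division by zero for words with no recorded frequency.
    ratio = partner_freq / max(current_freq, 1)
    return margin >= cfg.min_logit_margin and ratio >= cfg.min_freq_ratio
```

Requiring both signals is what keeps a dictionary-valid word from being flagged on frequency alone: a rare but contextually correct word fails the margin test and is left untouched.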
Pipeline improvements
- Compound-split confusable boost (default on): when the long-OOV all-valid-syllable suppressor fires, the structural signal also boosts the confidence of any inner `confusable_error` emission so the combined signal clears the downstream gate.
- Skip-rule confidence gate: the pre-existing "skip tokens of 4+ valid syllables" rule now defers to SymSpell when the top-1 candidate clears a configurable edit-distance / frequency gate. Recovers missing-asat and substitution typos like `စွမ်းဆောင်ရည → စွမ်းဆောင်ရည်`.
- Segmenter post-merge rescue (opt-in, `use_segmenter_post_merge_rescue`): an adjacent-pair merge pass probes variant-map / dictionary / dictionary+asat lookups on concatenated segmenter fragments. Off by default pending FPR calibration.
- Loan-word DB mining: 54 curated transliteration variants mined from the confusable-pairs table, plus a `WordValidator` short-circuit that emits the curated correction before SymSpell's edit-distance filter runs.
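The adjacent-pair merge pass behind the segmenter post-merge rescue can be sketched as a greedy left-to-right scan. This is a minimal illustration, not the library's implementation: `merge_adjacent`, its arguments, and the choice to return a merged fragment list (rather than emit errors) are all hypothetical. Only the probe order (variant map, dictionary, dictionary plus asat) comes from the note above.

```python
ASAT = "\u103a"  # Myanmar sign asat


def merge_adjacent(fragments, dictionary, variant_map=None):
    """Greedily merge adjacent fragment pairs that a lookup probe recognizes."""
    variant_map = variant_map or {}
    merged = []
    i = 0
    while i < len(fragments):
        if i + 1 < len(fragments):
            joined = fragments[i] + fragments[i + 1]
            # Probe 1: curated variant map (maps a variant to its canonical form).
            if joined in variant_map:
                merged.append(variant_map[joined])
                i += 2
                continue
            # Probe 2: the concatenation is itself a dictionary word.
            if joined in dictionary:
                merged.append(joined)
                i += 2
                continue
            # Probe 3: the concatenation becomes a word once an asat is appended.
            if joined + ASAT in dictionary:
                merged.append(joined + ASAT)
                i += 2
                continue
        # No probe matched: keep the fragment as-is and move on.
        merged.append(fragments[i])
        i += 1
    return merged
```

For example, with a toy dictionary `{"ab"}`, the fragments `["a", "b", "c"]` come back as `["ab", "c"]`. Because every successful probe rewrites the token stream, a pass like this can introduce false positives, which is consistent with the feature staying opt-in until FPR is calibrated.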
Normalization & dictionary
- Consonant-gated `normalize_e_vowel_tall_aa`: targeted whitelist `{ပ, ခ, ဒ}` rewrites `ေ + flat-AA + final` sequences to the classical `ေါ` form. Deliberately narrower than the classical MLC round-bottom set; the broader set would corrupt modern gold forms like `ဘောလုံး`, `သဘော`, `ရောဂါ`, `ဖော်`.
- Flat-AA dictionary migration: 17,712 word keys + 68k n-gram foreign-key repoints + 1.5M probability re-normalizations resolve the TALL_AA vs AA divergence on the consonant whitelist.
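A consonant-gated rewrite of this kind can be approximated with a single regular expression. The sketch below is a simplification under stated assumptions: the function name `normalize_e_vowel_tall_aa_sketch` is hypothetical, the pattern requires the whitelisted consonant to be immediately followed by the E vowel (no medials), and it ignores the "final" condition mentioned above because the notes do not define it.

```python
import re

E_VOWEL = "\u1031"  # ေ  MYANMAR VOWEL SIGN E
FLAT_AA = "\u102c"  # ာ  MYANMAR VOWEL SIGN AA
TALL_AA = "\u102b"  # ါ  MYANMAR VOWEL SIGN TALL AA
WHITELIST = "\u1015\u1001\u1012"  # ပ, ခ, ဒ

# In Unicode storage order the E vowel follows the consonant, and AA follows E.
_PATTERN = re.compile(f"([{WHITELIST}]){E_VOWEL}{FLAT_AA}")


def normalize_e_vowel_tall_aa_sketch(text):
    """Rewrite E + flat AA to E + tall AA, but only after whitelisted consonants."""
    return _PATTERN.sub(lambda m: m.group(1) + E_VOWEL + TALL_AA, text)
```

Because the consonant class is the gate, forms built on consonants outside the whitelist, such as `သဘော`, pass through unchanged.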
Benchmark
- Spelling-first labeling: every gold error now carries a `domain` field (`spelling` / `grammar` / `ambiguous`), and `benchmarks/run_benchmark.py` accepts a `--domain` filter so spelling-only regression runs no longer need a sibling YAML.
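The effect of a domain filter on labeled gold errors can be illustrated in a few lines. This is a sketch, not the code in `benchmarks/run_benchmark.py`: the helper names, the dict-based record shape, and treating `--domain` as optional with a fixed choice list are assumptions. Only the flag name and the three domain values come from the notes.

```python
import argparse

DOMAINS = ("spelling", "grammar", "ambiguous")


def build_parser():
    parser = argparse.ArgumentParser(description="Benchmark runner sketch")
    # Assumption: omitting --domain means "evaluate every gold error".
    parser.add_argument("--domain", choices=DOMAINS, default=None)
    return parser


def filter_gold_errors(gold_errors, domain=None):
    """Keep only the gold errors whose 'domain' field matches the filter."""
    if domain is None:
        return list(gold_errors)
    return [err for err in gold_errors if err.get("domain") == domain]


if __name__ == "__main__":
    args = build_parser().parse_args()
    sample = [
        {"id": 1, "domain": "spelling"},
        {"id": 2, "domain": "grammar"},
    ]
    print(filter_gold_errors(sample, args.domain))
```

Keeping the label on each gold error means a single annotated file can serve both the full run and the spelling-only run.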
Changed
- Sentence-level honorific detector normalizes input at entry, resolving silent misses on honorific-plus-casual-particle detection when callers pass unnormalized text.
- `REGISTER_CRITICAL_PRONOUNS` constant consolidated into `validators/base`; `WordValidator` and `SyllableValidator` now import a single source of truth.
- Greedy syllable-reassembly and compound-split predicate extracted into shared helpers (`_greedy_syllable_reassembly`, `_compound_split_reassembly`) to prevent drift between suppressor and boost call sites.
- `error_suppression` imports hoisted to module top.
Fixed
- Honorific-plus-casual-particle regressions (`ဒော်ခင်မာ လာပြီကွာ → ရှင်`, `ဒော်ခင်မာ ထမင်းစားပြီးပြီကွာ → ရှင့်`).
- Defensive bounds guards on `context.word_positions` in `LoanWordValidationStrategy` and `VisargaStrategy`.
- `_ClassifierScorer.score` now anchors `sentence.find(target, position)` on the correct occurrence when the target word repeats (`backend=classifier` only).
- Defensive `max(0, ...)` clamp + warning in `_resolve_sentence_base` for both `PreSegmenterRawProbeStrategy` and `HiddenCompoundStrategy`.
- ByT5 decoder preallocates its buffer instead of per-step `np.concatenate`.
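The repeated-word fix in `_ClassifierScorer.score` rests on the `start` argument of Python's built-in `str.find`. The snippet below only demonstrates that standard-library behavior. The helper `locate_target` is hypothetical and is not the scorer's actual code.

```python
def locate_target(sentence, target, position):
    """Return the index of the first occurrence of target at or after position.

    Returns -1 when there is no such occurrence, mirroring str.find.
    """
    return sentence.find(target, position)


sentence = "go home go"
# Without a start offset, the search always lands on the first occurrence.
first = sentence.find("go")
# Anchoring at the error's own offset finds the intended, later occurrence.
second = locate_target(sentence, "go", 5)
```

Here `first` points at the opening word while `second` points at the repeat, which is the distinction that matters when the flagged token is not the first instance in the sentence.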
Benchmark results
Run on the production database + semantic v2.4, --domain spelling:
| Metric | v1.6.0 |
|---|---|
| Composite | 0.6345 |
| F1 | 0.6462 |
| Precision | 0.8311 |
| Recall | 0.5287 |
| FPR (clean) | 0.0975 |
| Top-1 | 0.4360 |
| MRR | 0.5413 |
| p95 latency | 289.8ms |
Install
`pip install myspellchecker==1.6.0`

Acknowledgements
The release-prep pass included eight small correctness and code-quality fixes identified by a pre-tag review. See commits v16p-14 through v16p-20 on main for details.