v1.6.0

@thettwe thettwe released this 21 Apr 23:51
· 71 commits to develop since this release
71c4844

v1.6.0 — Spelling-First Benchmark & Mined Confusable Detection

Spelling-focused release. The pipeline now catches several error classes the v1.5.0 defaults missed: real-word confusables whose partner is also dictionary-valid, compound typos hidden by segmenter over-splitting, and missing-asat / substitution typos whose fragmented form happens to be piecewise-valid. The spelling-only composite score rose from 0.6161 to 0.6345 (+0.0184).

Full user-facing changes: Release Notes · CHANGELOG.

New validation strategies

  • MinedConfusablePairStrategy (priority 49, default on) — flags real-word confusables using a table of 23,970 edit-distance-1 pairs mined from the production dictionary. Both forms are dictionary-valid, so SymSpell cannot surface them on its own; emissions are gated by a semantic MLM logit margin and a frequency ratio between the current word and its partner.
  • PreSegmenterRawProbeStrategy (priority 23, default on) — runs SymSpell.lookup(raw_token, level='word') on unsegmented whitespace-delimited tokens before the segmenter fragments them. Recovers compound typos like ကုန်ကျစရိက် that would otherwise be split into piecewise-valid subtokens.
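
The two-part gate described for MinedConfusablePairStrategy can be pictured roughly as follows. This is a minimal sketch, not the library's implementation: the function name, the ConfusableDecision type, both threshold values, the direction of the frequency ratio, and the choice to require both conditions at once are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConfusableDecision:
    """Outcome of the illustrative gate (hypothetical type)."""
    flag: bool
    suggestion: Optional[str]


def gate_confusable(
    word: str,
    partner: str,
    word_logit: float,
    partner_logit: float,
    word_freq: int,
    partner_freq: int,
    min_logit_margin: float = 2.0,  # assumed threshold, not the library's value
    min_freq_ratio: float = 1.5,    # assumed threshold, not the library's value
) -> ConfusableDecision:
    """Flag `word` only if its mined partner wins on context AND frequency."""
    # How much more the masked language model prefers the partner in context.
    logit_margin = partner_logit - word_logit
    # How much more common the partner is; max() avoids dividing by zero.
    freq_ratio = partner_freq / max(word_freq, 1)
    if logit_margin >= min_logit_margin and freq_ratio >= min_freq_ratio:
        return ConfusableDecision(True, partner)
    return ConfusableDecision(False, None)
```

Requiring both signals keeps a rare but contextually correct word from being flagged merely because its partner is more frequent.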

Pipeline improvements

  • Compound-split confusable boost (default on) — when the long-OOV all-valid-syllable suppressor fires, the structural signal also boosts the confidence of any inner confusable_error emission so the combined signal clears the downstream gate.
  • Skip-rule confidence gate — the pre-existing "skip tokens of 4+ valid syllables" rule now defers to SymSpell when the top-1 candidate clears a configurable edit-distance / frequency gate. Recovers missing-asat and substitution typos like စွမ်းဆောင်ရည → စွမ်းဆောင်ရည်.
  • Segmenter post-merge rescue (opt-in, use_segmenter_post_merge_rescue) — adjacent-pair merge pass probes variant-map / dictionary / dictionary+asat lookups on concatenated segmenter fragments. Off by default pending FPR calibration.
  • Loan-word DB mining — 54 curated transliteration variants mined from the confusable-pairs table, plus a WordValidator short-circuit that emits the curated correction before SymSpell's edit-distance filter runs.
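
The adjacent-pair merge pass behind use_segmenter_post_merge_rescue might look something like the sketch below. It is written under assumptions: the function names, the plain set and dict standing in for the dictionary and variant map, and the left-to-right greedy merge order are hypothetical, and the real pass presumably applies further conditions before merging.

```python
from typing import Dict, List, Optional, Set

ASAT = "\u103a"  # MYANMAR SIGN ASAT


def probe_merged(merged: str, dictionary: Set[str],
                 variant_map: Dict[str, str]) -> Optional[str]:
    """Try the three lookups in the order the release notes list them."""
    if merged in variant_map:          # 1. variant-map lookup
        return variant_map[merged]
    if merged in dictionary:           # 2. plain dictionary lookup
        return merged
    if merged + ASAT in dictionary:    # 3. dictionary lookup with a trailing asat
        return merged + ASAT
    return None


def rescue_adjacent_pairs(fragments: List[str], dictionary: Set[str],
                          variant_map: Dict[str, str]) -> List[str]:
    """Greedily merge neighbouring fragments when a probe succeeds."""
    out: List[str] = []
    i = 0
    while i < len(fragments):
        if i + 1 < len(fragments):
            hit = probe_merged(fragments[i] + fragments[i + 1],
                               dictionary, variant_map)
            if hit is not None:
                out.append(hit)
                i += 2  # both fragments were consumed by the merge
                continue
        out.append(fragments[i])
        i += 1
    return out
```

For instance, if the two fragments of စွမ်းဆောင်ရည are passed in and the dictionary contains the asat-final form, the third probe returns the merged word with the asat appended.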

Normalization & dictionary

  • Consonant-gated normalize_e_vowel_tall_aa — targeted whitelist {ပ, ခ, ဒ} rewrites ေ + flat-AA + final sequences to the classical ေါ form. Deliberately narrower than the classical MLC round-bottom set; the broader set would corrupt modern gold forms like ဘောလုံး, သဘော, ရောဂါ, ဖော်.
  • Flat-AA dictionary migration — 17,712 word keys + 68k n-gram foreign-key repoints + 1.5M probability re-normalizations resolve the TALL_AA vs AA divergence on the consonant whitelist.
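
The consonant gating can be illustrated with a small regular-expression rewrite. This is only a sketch under assumptions: the function name is hypothetical, it matches a whitelisted consonant immediately followed by ေ and flat AA, and it ignores any extra conditions the real normalize_e_vowel_tall_aa places on the following final or on medials.

```python
import re

E_VOWEL = "\u1031"  # ေ  MYANMAR VOWEL SIGN E
FLAT_AA = "\u102c"  # ာ  MYANMAR VOWEL SIGN AA
TALL_AA = "\u102b"  # ါ  MYANMAR VOWEL SIGN TALL AA

# Whitelist from the release notes: ပ (U+1015), ခ (U+1001), ဒ (U+1012).
WHITELIST = "\u1015\u1001\u1012"

# Whitelisted consonant directly followed by ေ + flat AA.
_PATTERN = re.compile(f"([{WHITELIST}]){E_VOWEL}{FLAT_AA}")


def normalize_e_vowel_tall_aa_sketch(text: str) -> str:
    """Rewrite ေ + flat AA to ေ + tall AA after whitelisted consonants only."""
    return _PATTERN.sub(lambda m: m.group(1) + E_VOWEL + TALL_AA, text)
```

Because ဘ, ရ, and ဖ are outside the whitelist, words such as ဘောလုံး and ရောဂါ pass through this sketch unchanged, which is the behaviour the narrower set is meant to protect.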

Benchmark

  • Spelling-first labeling — every gold error now carries a domain field (spelling / grammar / ambiguous), and benchmarks/run_benchmark.py accepts a --domain filter so spelling-only regression runs no longer need a sibling YAML.
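
Conceptually, a domain filter of this kind reduces to selecting gold errors by their domain field. The sketch below assumes each gold error is a dict with a "domain" key; that layout and the function name are hypothetical and do not reflect the actual internals of benchmarks/run_benchmark.py.

```python
from typing import Dict, Iterable, List, Optional

VALID_DOMAINS = {"spelling", "grammar", "ambiguous"}


def filter_gold_errors(errors: Iterable[Dict],
                       domain: Optional[str] = None) -> List[Dict]:
    """Keep only gold errors labeled with `domain`; None keeps everything."""
    if domain is None:
        return list(errors)
    if domain not in VALID_DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    return [e for e in errors if e.get("domain") == domain]


gold = [
    {"id": 1, "domain": "spelling"},
    {"id": 2, "domain": "grammar"},
    {"id": 3, "domain": "ambiguous"},
]
spelling_only = filter_gold_errors(gold, "spelling")
```

Filtering at load time is what removes the need to maintain a separate spelling-only copy of the benchmark file.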

Changed

  • Sentence-level honorific detector normalizes input at entry, resolving silent misses on honorific-plus-casual-particle detection when callers pass unnormalized text.
  • REGISTER_CRITICAL_PRONOUNS constant consolidated into validators/base; WordValidator and SyllableValidator now import a single source of truth.
  • Greedy syllable-reassembly and compound-split predicate extracted into shared helpers (_greedy_syllable_reassembly, _compound_split_reassembly) to prevent drift between suppressor and boost call sites.
  • error_suppression imports hoisted to module top.

Fixed

  • Honorific-plus-casual-particle regressions (ဒော်ခင်မာ လာပြီကွာ → ရှင်, ဒော်ခင်မာ ထမင်းစားပြီးပြီကွာ → ရှင့်).
  • Defensive bounds guards on context.word_positions in LoanWordValidationStrategy and VisargaStrategy.
  • _ClassifierScorer.score now anchors sentence.find(target, position) on the correct occurrence when the target word repeats (backend=classifier only).
  • Defensive max(0, ...) clamp + warning in _resolve_sentence_base for both PreSegmenterRawProbeStrategy and HiddenCompoundStrategy.
  • ByT5 decoder preallocates its buffer instead of per-step np.concatenate.
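
The _ClassifierScorer.score fix turns on the optional start argument of Python's str.find. The snippet below only illustrates that behaviour with an English sentence; locate_target is a hypothetical helper, not the library's code.

```python
def locate_target(sentence: str, target: str, position: int) -> int:
    """Return the index of the first `target` at or after `position`, or -1."""
    return sentence.find(target, position)


sentence = "the cat saw the dog"

# Without a start offset, find() always lands on the first occurrence.
first = sentence.find("the")

# Anchoring the search at the flagged word's offset selects the repeat instead.
second = locate_target(sentence, "the", 8)
```

Here first is 0 and second is 12, so a scorer that ignored the offset would evaluate the wrong occurrence whenever the target word repeats.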

Benchmark results

Run on the production database + semantic v2.4, --domain spelling:

Metric         v1.6.0
Composite      0.6345
F1             0.6462
Precision      0.8311
Recall         0.5287
FPR (clean)    0.0975
Top-1          0.4360
MRR            0.5413
p95 latency    289.8ms
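
As a quick consistency check, and assuming the F1 row is the standard harmonic mean of the Precision and Recall rows (the notes do not state the formula), the reported figure can be recomputed from the rounded values. The helper name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the results above.
f1 = f1_score(0.8311, 0.5287)
```

The result agrees with the reported 0.6462 to within the rounding of the inputs. The composite score is not recomputed here because its weighting is not given in these notes.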

Install

pip install myspellchecker==1.6.0

Acknowledgements

The release-prep pass included eight small correctness and code-quality fixes identified by a pre-tag review. See commits v16p-14 through v16p-20 on main for details.