Release v1.5.0 · thettwe/myspellchecker

v1.5.0 — Real-World FPR & Segmentation

HiddenCompoundStrategy (priority 23): Detects multi-token compound typos that the segmenter over-splits into individually-valid syllables. Walks curated-vocabulary bigram/trigram windows and checks whether a phonetic/tonal/medial variant of the leading token forms a high-frequency dictionary compound with the following token(s). Enabled by default.
SyllableWindowOOVStrategy (priority 22): Detects multi-syllable OOV typos that the segmenter decomposes into individually-valid syllables. Disabled by default pending per-process SymSpell caching.
Suffix-aware re-segmentation in DefaultSegmenter: New post-processing pass that reassembles tokens where the segmenter left an oversized compound or split a colloquial-locative merge.
Meta-classifier v2 with compound-aware features; threshold retuned from 0.40 → 0.42 based on FP/TP sweep.
Benchmark expansion from 1,146 → 1,304 sentences with annotation corrections.

Ternary compound splits in MorphemeSuggestionStrategy for three-morpheme compound typo corrections.
Formal register benchmark subset for FPR regression testing.
Particle-tone confusable pairs: ခဲ → ခဲ့ and မဲ → မယ် (unidirectional, protects standalone uses).
Curated-pair promotion in StatisticalConfusableStrategy: bigram-ratio detections matching curated homophone/confusable maps receive a confidence boost.
7 unidirectional homophone pairs for homophone-confusion false-negative recovery.
New confusable pairs: ကယာ/ကရာ consonant confusion and 5 false_compound entries.

Fusion arbiter: HiddenCompoundStrategy promoted to Tier 3 for confidence tiebreak against Homophone.
N-gram probability fields now enforce le=1.0 upper bound in Pydantic validation.
SQLite n-gram lookups deduplicated into a single _lookup_ngram_prob helper with shared cache path.
Renamed core.constants.is_myanmar_text → contains_myanmar to disambiguate from ratio-based detection.
Optional-dependency extras consolidated via self-referencing in pyproject.toml.
Training module modernized: lowercase generics, pathlib migration across trainer/exporter/semantic_checker.
Blanket # type: ignore comments narrowed to specific error codes across 20 sites.

Fusion arbiter: untrained error types (hidden_compound_typo, syllable_window_oov) isolated from meta-classifier context features.
Confusable FPs on punctuation boundaries: valid words with attached Myanmar punctuation no longer trigger confusable detection.
Invalid-word FPs from boundary punctuation attachment.
tense_mismatch confidence lowered for data-driven FP filtering.
Dot-below confusable FPs suppressed in error suppression pipeline.
Visarga compound skip threshold lowered to protect established words.
Reduplication guard added to BrokenCompoundStrategy, nominalizer particles excluded from broken-compound detection.
Logger f-string eager evaluation replaced with %-style formatting on hot paths.