v1.5.0 — Real-World FPR & Segmentation
Highlights
HiddenCompoundStrategy (priority 23): Detects multi-token compound typos that the segmenter over-splits into individually valid syllables. Walks bigram and trigram windows of curated-vocabulary tokens and checks whether a phonetic, tonal, or medial variant of the leading token forms a high-frequency dictionary compound with the following token(s). Enabled by default.
SyllableWindowOOVStrategy (priority 22): Detects multi-syllable out-of-vocabulary (OOV) typos that the segmenter decomposes into individually valid syllables. Disabled by default pending per-process SymSpell caching.
Suffix-aware re-segmentation in DefaultSegmenter: New post-processing pass that reassembles tokens where the segmenter left an oversized compound or split a colloquial-locative merge.
Meta-classifier v2 with compound-aware features; threshold retuned from 0.40 to 0.42 based on a false-positive/true-positive (FP/TP) sweep.
Benchmark expanded from 1,146 to 1,304 sentences, with annotation corrections.
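The window walk described for HiddenCompoundStrategy can be illustrated with a minimal sketch. This is not the project's implementation: the function name `find_hidden_compounds`, the `variants` callback, the `min_freq` cutoff, and the Latin placeholder tokens are all hypothetical, chosen only to show the idea of trying variants of the leading token against a compound frequency table.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Finding = Tuple[int, int, str]  # (start index, window size, suggested compound)


def find_hidden_compounds(
    tokens: Sequence[str],
    vocab: Set[str],
    compound_freq: Dict[Tuple[str, ...], int],
    variants: Callable[[str], List[str]],
    min_freq: int = 100,  # hypothetical cutoff for "high-frequency"
) -> List[Finding]:
    """Flag windows whose leading token has a variant that forms a known compound."""
    findings: List[Finding] = []
    for i, lead in enumerate(tokens):
        for size in (3, 2):  # try the trigram window first, then the bigram
            window = tuple(tokens[i:i + size])
            # Only consider full windows made of individually valid tokens.
            if len(window) < size or any(t not in vocab for t in window):
                continue
            # A window that is already a frequent compound is not a typo.
            if compound_freq.get(window, 0) >= min_freq:
                continue
            for candidate in variants(lead):
                fixed = (candidate, *window[1:])
                if compound_freq.get(fixed, 0) >= min_freq:
                    findings.append((i, size, "".join(fixed)))
                    break
            else:
                continue  # no variant matched at this size; try the next one
            break  # a match was found; stop checking smaller windows
    return findings


# Toy data (assumed for illustration only).
vocab = {"ma", "kan", "sa", "lo"}
compound_freq = {("khan", "sa"): 900}
variant_table = {"kan": ["khan"]}

hits = find_hidden_compounds(
    ["ma", "kan", "sa", "lo"],
    vocab,
    compound_freq,
    lambda tok: variant_table.get(tok, []),
)
print(hits)
```

In this toy run, every token is valid on its own, yet the bigram starting at index 1 is flagged because a variant of its leading token forms a frequent compound with the next token.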
Added
Ternary compound splits in MorphemeSuggestionStrategy for three-morpheme compound typo corrections.
Formal register benchmark subset for FPR regression testing.
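As a rough illustration of what a ternary compound split could look like, the sketch below enumerates every way to cut a word into three non-empty parts and proposes a correction when exactly one part is invalid and has a known fix. The helper names, the `corrections` table, and the character-level splitting are assumptions made for this example; the actual MorphemeSuggestionStrategy may split on syllable or morpheme boundaries and rank candidates differently.

```python
from typing import Dict, Iterator, List, Set, Tuple


def ternary_splits(word: str) -> Iterator[Tuple[str, str, str]]:
    """Yield every split of `word` into three non-empty contiguous parts."""
    for i in range(1, len(word) - 1):
        for j in range(i + 1, len(word)):
            yield word[:i], word[i:j], word[j:]


def suggest_ternary(
    word: str,
    morphemes: Set[str],          # hypothetical set of valid morphemes
    corrections: Dict[str, str],  # hypothetical typo -> morpheme table
) -> List[str]:
    """Suggest corrections where exactly one of three parts is a fixable typo."""
    suggestions: List[str] = []
    for parts in ternary_splits(word):
        bad = [p for p in parts if p not in morphemes]
        # Require two valid morphemes plus one part with a known correction.
        if len(bad) != 1 or bad[0] not in corrections:
            continue
        fixed = "".join(p if p in morphemes else corrections[p] for p in parts)
        if fixed not in suggestions:
            suggestions.append(fixed)
    return suggestions


# Toy English data (assumed for illustration only).
result = suggest_ternary(
    "sunflowrseed",
    {"sun", "flower", "seed"},
    {"flowr": "flower"},
)
print(result)
```

Requiring two of the three parts to be valid keeps the candidate set small; a production strategy would likely also score candidates by frequency.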