v1.5.0

@thettwe thettwe released this 12 Apr 03:35
· 133 commits to develop since this release
ed7b813

v1.5.0 — Real-World FPR & Segmentation

Highlights

  • HiddenCompoundStrategy (priority 23): Detects multi-token compound typos that the segmenter over-splits into individually-valid syllables. Walks curated-vocabulary bigram/trigram windows and checks whether a phonetic/tonal/medial variant of the leading token forms a high-frequency dictionary compound with the following token(s). Enabled by default.
  • SyllableWindowOOVStrategy (priority 22): Detects multi-syllable OOV typos that the segmenter decomposes into individually-valid syllables. Disabled by default pending per-process SymSpell caching.
  • Suffix-aware re-segmentation in DefaultSegmenter: New post-processing pass that reassembles tokens where the segmenter left an oversized compound or split a colloquial-locative merge.
  • Meta-classifier v2 with compound-aware features; threshold retuned from 0.40 → 0.42 based on FP/TP sweep.
  • Benchmark expansion from 1,146 → 1,304 sentences with annotation corrections.
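
The window walk described for HiddenCompoundStrategy can be illustrated with a minimal sketch. It is not the library's implementation: the function names, the toy ASCII vocabulary standing in for Myanmar syllables, the variant table, and the frequency cutoff are all assumptions made for illustration.

```python
from collections.abc import Callable, Iterable, Mapping

# Toy ASCII stand-ins for Myanmar syllables; all data here is hypothetical.
COMPOUND_FREQ: dict[str, int] = {"sunflower": 5000, "sunlight": 8000}
MIN_FREQ = 1000  # assumed cutoff for "high-frequency"


def toy_variants(token: str) -> Iterable[str]:
    """Stand-in for phonetic/tonal/medial variant generation."""
    return {"son": ["sun"], "sum": ["sun"]}.get(token, [])


def find_hidden_compounds(
    tokens: list[str],
    variants: Callable[[str], Iterable[str]] = toy_variants,
    freq: Mapping[str, int] = COMPOUND_FREQ,
    min_freq: int = MIN_FREQ,
) -> list[tuple[int, int, str]]:
    """Return (start index, window width, suggested compound) triples."""
    hits: list[tuple[int, int, str]] = []
    for start in range(len(tokens) - 1):
        for width in (3, 2):  # try the trigram window first, then the bigram
            window = tokens[start:start + width]
            if len(window) < width:
                continue  # not enough tokens left for this window size
            tail = "".join(window[1:])
            # Only the leading token is varied; following tokens stay as-is.
            for variant in variants(window[0]):
                candidate = variant + tail
                if freq.get(candidate, 0) >= min_freq:
                    hits.append((start, width, candidate))
    return hits
```

In this toy setup, `find_hidden_compounds(["the", "son", "flower", "grew"])` flags the window starting at index 1, because a variant of the leading token joins the next token into a frequent compound. A real strategy would presumably also weigh the frequency of the original reading before reporting an error.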

Added

  • Ternary compound splits in MorphemeSuggestionStrategy for three-morpheme compound typo corrections.
  • Formal register benchmark subset for FPR regression testing.
  • Particle-tone confusable pairs: ခဲ → ခဲ့ and မဲ → မယ် (unidirectional, protects standalone uses).
  • Curated-pair promotion in StatisticalConfusableStrategy: bigram-ratio detections matching curated homophone/confusable maps receive a confidence boost.
  • 7 unidirectional homophone pairs for homophone-confusion false-negative recovery.
  • New confusable pairs: ကယာ/ကရာ consonant confusion and 5 false_compound entries.

Changed

  • Fusion arbiter: HiddenCompoundStrategy promoted to Tier 3 for confidence tiebreak against Homophone.
  • N-gram probability fields now enforce le=1.0 upper bound in Pydantic validation.
  • SQLite n-gram lookups deduplicated into a single _lookup_ngram_prob helper with shared cache path.
  • Renamed core.constants.is_myanmar_text → contains_myanmar to disambiguate from ratio-based detection.
  • Optional-dependency extras consolidated via self-referencing in pyproject.toml.
  • Training module modernized: lowercase generics, pathlib migration across trainer/exporter/semantic_checker.
  • Blanket # type: ignore comments narrowed to specific error codes across 20 sites.

Fixed

  • Fusion arbiter: untrained error types (hidden_compound_typo, syllable_window_oov) isolated from meta-classifier context features.
  • Confusable FPs on punctuation boundaries: valid words with attached Myanmar punctuation no longer trigger confusable detection.
  • Invalid-word FPs from boundary punctuation attachment.
  • tense_mismatch confidence lowered for data-driven FP filtering.
  • Dot-below confusable FPs suppressed in error suppression pipeline.
  • Visarga compound skip threshold lowered to protect established words.
  • Reduplication guard added to BrokenCompoundStrategy; nominalizer particles excluded from broken-compound detection.
  • Logger f-string eager evaluation replaced with %-style formatting on hot paths.
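
The logger change matters because an f-string is rendered before `logging` checks the level, whereas %-style arguments are only formatted if the record is actually emitted. A small standard-library demonstration, with a hypothetical `Counting` class used purely to observe when formatting happens:

```python
import logging

logger = logging.getLogger("fstring_demo")
logger.setLevel(logging.WARNING)  # DEBUG records are discarded


class Counting:
    """Counts how many times it is rendered to a string."""

    def __init__(self) -> None:
        self.renders = 0

    def __str__(self) -> str:
        self.renders += 1
        return "token"


def demo() -> tuple[int, int]:
    eager, lazy = Counting(), Counting()
    # The f-string is built before debug() is even called.
    logger.debug(f"token={eager}")
    # The %-style argument is formatted only if the record is emitted.
    logger.debug("token=%s", lazy)
    return eager.renders, lazy.renders
```

Here `demo()` returns `(1, 0)`: the f-string argument is rendered once even though nothing is logged, and the %-style argument is never rendered.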

Removed

  • 9 dead ValidationConfig fields with zero consumers.
  • Unused freezegun dev dependency.
  • 15 exact-duplicate entries in rules/confusable_pairs.yaml.
  • 12 unnecessary # type: ignore[arg-type] comments on SQLite cache calls.