v1.4.0

@thettwe thettwe released this 09 Apr 08:34
· 173 commits to main since this release

v1.4.0 — Meta-Classifier & Semantic Confusable Detection

Highlights

  • Meta-classifier post-filter: A learned logistic regression model (41 features including one-hot error type, word frequency, and context signals) replaces manual per-strategy confidence thresholds. The false-positive rate (FPR) dropped from 34.5% to 18.6%.
  • ConfusableSemanticStrategy (priority 48): MLM-enhanced confusable detection using masked language model logit comparison with asymmetric thresholds.
  • Rich Suggestion objects: Suggestion class with confidence and source metadata, backward-compatible (inherits from str).
  • Per-request CheckOptions: Runtime overrides for context_checking, grammar_checking, max_suggestions, and use_semantic.
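
To illustrate how a suggestion can carry metadata and still behave like a plain string, here is a minimal sketch. It is not the library's actual class: the constructor signature, the default values, and the example source name "symspell" are assumptions made for illustration only.

```python
class Suggestion(str):
    """Hypothetical sketch of a str subclass that carries extra metadata."""

    def __new__(cls, text, confidence=1.0, source="unknown"):
        # str is immutable, so the text must be set in __new__, not __init__.
        obj = super().__new__(cls, text)
        obj.confidence = confidence  # assumed attribute name
        obj.source = source          # assumed attribute name
        return obj


s = Suggestion("မြန်မာ", confidence=0.92, source="symspell")

# Existing callers that treat suggestions as strings keep working.
assert s == "မြန်မာ"
assert isinstance(s, str)

# Code that needs an exact str (for example a strictly typed extension
# function) can coerce explicitly; str() returns a plain str here.
plain = str(s)
assert type(plain) is str
```

Subclassing str keeps equality, hashing, and string methods unchanged, which is what makes this kind of change backward-compatible for callers that only expect text.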

Added

  • Kinzi (င်္) and consonant stacking variant support in confusable candidate generation.
  • MLM post-filter to suppress invalid_word and dangling_word false positives.
  • Expanded confusable pairs from 87 to 124+ with 9 linguistics-audit additions.
  • Expanded colloquial variants from 83 to 91 entries; removed 20 standard modern Burmese words incorrectly classified as colloquial.
  • Homophone morphological guard expanded to 2 prefixes and 4 compound suffixes.
  • Error.severity property with computed severity based on action type.
  • Candidate fusion enabled by default.
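
A computed severity property could look roughly like the sketch below. The action names, the severity labels, and the class name `SketchError` are all hypothetical; the release notes only state that severity is derived from the action type.

```python
from dataclasses import dataclass

# Hypothetical mapping from action type to severity label.
_SEVERITY_BY_ACTION = {
    "replace": "error",
    "remove": "warning",
    "inform": "info",
}


@dataclass
class SketchError:
    """Stand-in for the library's Error class (illustration only)."""

    text: str
    action: str

    @property
    def severity(self) -> str:
        # Computed on access instead of stored, so it always follows `action`.
        return _SEVERITY_BY_ACTION.get(self.action, "info")


err = SketchError(text="example", action="replace")
assert err.severity == "error"
```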

Changed

  • Config split: algorithm_configs.py split into 4 focused modules (algorithm_configs.py, text_configs.py, strategy_configs.py, infra_configs.py). All existing imports continue to work.
  • Benchmark consolidated to 1,146 sentences with 18 duplicate IDs fixed.
  • Confidence gates expanded to 15 error types.
  • Detectors with zero true positives (TP) disabled; weak detectors heavily gated.
  • Mutex/override infrastructure fully removed.

Fixed

  • MLM threshold comparison now operates in logit space (was incorrectly using probability space).
  • context_checking=False in CheckOptions preserves word validation instead of skipping all validation.
  • Suggestion objects properly coerced to str for Cython compatibility in normalize functions.
  • Streaming checker no longer dumps tracebacks to stderr for handled exceptions.
  • Word segmentation degrades gracefully when resources are unavailable (offline/CI environments).
  • Resource download timeouts wrapped as clean TokenizationError instead of unhandled exceptions.
  • Long verb chains reassembled correctly; dangling particle scan capped.
  • invalid_word and dangling_word false positives reduced via post-validation compound splitting.
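
The logit-space fix can be illustrated with a small worked example. This is a sketch under stated assumptions, not the library's implementation: the four-token vocabulary, the logit values, and both threshold values are invented. It shows that a difference of two logits equals the log of the ratio of their softmax probabilities, so it does not depend on the rest of the vocabulary, whereas a difference of probabilities shrinks when some unrelated token dominates.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Index 0: the word as written; index 1: its confusable candidate.
# Indices 2 and 3: unrelated vocabulary items (values are made up).
logits = [2.0, 5.0, 9.0, 1.0]
probs = softmax(logits)

logit_margin = logits[1] - logits[0]  # comparison in logit space
prob_gap = probs[1] - probs[0]        # comparison in probability space

# Hypothetical thresholds, for illustration only.
LOGIT_THRESHOLD = 2.0
PROB_THRESHOLD = 0.1

flag_in_logit_space = logit_margin >= LOGIT_THRESHOLD
flag_in_prob_space = prob_gap >= PROB_THRESHOLD

# The logit margin is exactly the log-ratio of the two probabilities.
assert abs(math.log(probs[1] / probs[0]) - logit_margin) < 1e-9

print(flag_in_logit_space, flag_in_prob_space)
```

In this example the candidate is about 20 times more likely than the written word, yet the probability gap stays under 0.02 because a third token takes most of the probability mass. A threshold tuned for logit margins therefore gives a different answer when applied to probabilities.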