Release v1.4.0 · thettwe/myspellchecker

v1.4.0 — Meta-Classifier & Semantic Confusable Detection

Meta-classifier post-filter: A learned logistic regression model (41 features including one-hot error type, word frequency, and context signals) replaces manual per-strategy confidence thresholds. The false positive rate (FPR) dropped from 34.5% to 18.6%.
ConfusableSemanticStrategy (priority 48): MLM-enhanced confusable detection using masked language model logit comparison with asymmetric thresholds.
Rich Suggestion objects: Suggestion class with confidence and source metadata, backward-compatible (inherits from str).
Per-request CheckOptions: Runtime overrides for context_checking, grammar_checking, max_suggestions, and use_semantic.
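
The backward compatibility of a str-derived suggestion type can be illustrated with a minimal sketch. The class below is hypothetical and is not the library's actual implementation: the `confidence` and `source` attribute names follow the note above, while the constructor signature, the default values, and the `"symspell"` source label are assumptions made for illustration.

```python
class Suggestion(str):
    """Hypothetical sketch: a string that also carries suggestion metadata."""

    def __new__(cls, text, confidence=0.0, source="unknown"):
        # str is immutable, so the text is fixed in __new__ rather than __init__.
        obj = super().__new__(cls, text)
        obj.confidence = confidence
        obj.source = source
        return obj


# The source label here is an assumed example value, not a documented one.
s = Suggestion("မြန်မာ", confidence=0.92, source="symspell")

# Code written against plain strings keeps working unchanged.
assert s == "မြန်မာ"
assert isinstance(s, str)
assert [s, "abc"].index("မြန်မာ") == 0

# Newer code can read the extra metadata.
print(s.confidence, s.source)

# Caveat: built-in str methods return a plain str, so metadata is not carried over.
assert type(s.strip()) is str
```

One consequence of this design is that any code path that rebuilds the string (slicing, `strip()`, concatenation) yields a plain `str`, so metadata should be read before such transformations.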

Kinzi (င်္) and consonant stacking variant support in confusable candidate generation.
MLM post-filter to suppress invalid_word and dangling_word false positives.
Expanded confusable pairs from 87 to 124+ with 9 linguistics-audit additions.
Expanded colloquial variants from 83 to 91 entries; removed 20 standard modern Burmese words incorrectly classified as colloquial.
Homophone morphological guard expanded to 2 prefixes and 4 compound suffixes.
Error.severity property with computed severity based on action type.
Candidate fusion enabled by default.

Config split: algorithm_configs.py split into 4 focused modules (algorithm_configs.py, text_configs.py, strategy_configs.py, infra_configs.py). All existing imports continue to work.
Benchmark consolidated to 1,146 sentences with 18 duplicate IDs fixed.
Confidence gates expanded to 15 error types.
Detectors with zero true positives (zero-TP) disabled; weak detectors heavily gated.
Mutex/override infrastructure fully removed.

MLM threshold comparison now operates in logit space (was incorrectly using probability space).
context_checking=False in CheckOptions preserves word validation instead of skipping all validation.
Suggestion objects properly coerced to str for Cython compatibility in normalize functions.
Streaming checker no longer dumps tracebacks to stderr for handled exceptions.
Word segmentation degrades gracefully when resources are unavailable (offline/CI environments).
Resource download timeouts wrapped as clean TokenizationError instead of unhandled exceptions.
Long verb chains reassembled correctly; dangling particle scan capped.
invalid_word and dangling_word false positives reduced via post-validation compound splitting.
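
To see why the logit-space fix for the MLM threshold comparison matters, consider the toy sketch below. It is not the library's code: the function names, the logit values, and the margin of 2.0 are made up for illustration. It shows that a gap between two logits equals the gap between their log-probabilities, whereas a gap between raw probabilities can never exceed 1, so a threshold tuned for one space misbehaves in the other.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prefers_candidate(logit_original, logit_candidate, margin):
    """True when the candidate beats the original by at least `margin` logits."""
    return (logit_candidate - logit_original) >= margin


# Toy logits at the masked position; index 0 is the original word,
# index 1 is the confusable candidate (values are invented).
logits = [2.0, 5.0, 1.0, 0.5]
original_idx, candidate_idx = 0, 1

probs = softmax(logits)
logit_gap = logits[candidate_idx] - logits[original_idx]
prob_gap = probs[candidate_idx] - probs[original_idx]

# The logit gap equals the log-probability gap, independent of the other words.
log_ratio = math.log(probs[candidate_idx]) - math.log(probs[original_idx])
assert math.isclose(log_ratio, logit_gap)

# A margin expressed in logits fires here...
assert prefers_candidate(logits[original_idx], logits[candidate_idx], margin=2.0)

# ...but the same number applied to probabilities can never fire,
# because a difference of two probabilities is at most 1.
assert not (prob_gap >= 2.0)
```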