v0.4.0

raeq released this 29 Mar 19:43


dc54260


Breaking changes

Batch functions removed. transliterate_batch(), slugify_batch(), normalize_batch(), and strip_accents_batch() are gone. The base functions now accept both str and list[str] via @typing.overload:
```
transliterate("café")              # → "cafe"
transliterate(["café", "naïve"])   # → ["cafe", "naive"]
```
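For readers unfamiliar with the pattern, here is a minimal sketch of how one function can accept both str and list[str] through @typing.overload. The names transliterate_sketch and _fold are hypothetical, and _fold only strips accents as a stand-in; it is not the library's transliteration logic.

```python
import unicodedata
from typing import overload


def _fold(text: str) -> str:
    # Stand-in for real transliteration: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The overload stubs exist only for type checkers: str in, str out;
# list[str] in, list[str] out.
@overload
def transliterate_sketch(text: str) -> str: ...
@overload
def transliterate_sketch(text: list[str]) -> list[str]: ...


def transliterate_sketch(text):
    # Single runtime implementation that dispatches on the argument type.
    if isinstance(text, str):
        return _fold(text)
    return [_fold(item) for item in text]


print(transliterate_sketch("café"))             # → cafe
print(transliterate_sketch(["café", "naïve"]))  # → ['cafe', 'naive']
```

With this shape, a type checker infers str for the first call and list[str] for the second, so callers of the removed *_batch() functions should not need casts after migrating.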
strip_obfuscation() no longer transliterates. It now uses the TR39 confusable mapping (visual similarity) instead of phonetic transliteration, and the lang= parameter has been removed. Chain it with transliterate() if romanization is also needed.
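To illustrate the difference between visual-similarity mapping and phonetic transliteration, here is a rough sketch. The CONFUSABLES table is a tiny hand-picked sample rather than the TR39 data, and deobfuscate_sketch is a hypothetical name, not the library's implementation.

```python
import unicodedata

# Hand-picked sample of Cyrillic letters that look like Latin ones.
# The real TR39 confusables table is far larger.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р (phonetically "r", visually "p")
    "\u0441": "c",  # Cyrillic с (phonetically "s", visually "c")
}


def deobfuscate_sketch(text: str) -> str:
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("M"):
            # Drop standalone combining and enclosing marks (zalgo stacks).
            # Assumption: acceptable for a sketch, but this would also strip
            # marks that are meaningful in scripts such as Devanagari.
            continue
        if category == "Cf":
            # Drop format characters: zero-width spaces, bidi overrides.
            continue
        out.append(CONFUSABLES.get(ch, ch))
    return "".join(out)


print(deobfuscate_sketch("p\u0430y\u0440\u0430l"))  # → paypal
```

A phonetic transliteration would turn Cyrillic р into "r", which is why a spoofed "pаyраl" only resolves to "paypal" under a visual mapping.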

New features

strip_obfuscation() — maximum-strength deobfuscation preset. Resolves homoglyph spoofing (Cyrillic р→p, с→c), strips zalgo, invisible chars, bidi attacks, expands emoji.
lang_info() / script_info() — structured metadata for all 83 languages and 57 scripts, with import-time drift assertions.
18 new languages (Balinese, Bamum, Buginese, Cherokee, Cham, Coptic, Tai Lue, Lisu, Meitei, Northern Thai, N'Ko, Santali, Sundanese, Syriac, Tai Le, Tagalog, Tamazight, Vai) and 10 new Script enum members.

Bug fixes

Combining marks and zero-width characters no longer produce [?] (283 new TSV mappings)
TextPipeline confusable ordering fixed (transliterate before confusables)
demojize() spaces adjacent emoji replacements ("🔥🔥" → "fire fire")
SCRIPT_RANGES sort order fix + invariant test
Tibetan documentation corrected (Indic-phonetic, not Wylie)
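The demojize() spacing fix above can be pictured with a small sketch. EMOJI_NAMES is a hypothetical two-entry table and demojize_sketch is not the library's code; it only shows one way to keep adjacent replacements from running together.

```python
import re

# Hypothetical sample table; a real implementation would cover all emoji.
EMOJI_NAMES = {"🔥": "fire", "🎉": "party popper"}
_EMOJI_RE = re.compile("|".join(re.escape(e) for e in EMOJI_NAMES))


def demojize_sketch(text: str) -> str:
    # Pad every replacement with spaces so neighbours cannot fuse
    # ("firefire"), then collapse the resulting whitespace runs.
    padded = _EMOJI_RE.sub(lambda m: f" {EMOJI_NAMES[m.group()]} ", text)
    return " ".join(padded.split())


print(demojize_sketch("🔥🔥"))  # → fire fire
```

Note that this sketch also collapses whitespace elsewhere in the input, which a production implementation may want to avoid.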

Infrastructure

API stability tests (133), mutation testing killers (92)
CI restructured: 10× faster Python tests, path-filtered CodeQL, no duplicate runs
Transliteration provenance documentation
docs/index.md generated from README.md (single source of truth)

See CHANGELOG.md for full details.
