v0.4.0

@raeq raeq released this 29 Mar 19:43
· 166 commits to main since this release
dc54260

Breaking changes

  • Batch functions removed. transliterate_batch(), slugify_batch(), normalize_batch(), and strip_accents_batch() are gone. The base functions now accept both str and list[str] via @typing.overload:

    transliterate("café")              # → "cafe"
    transliterate(["café", "naïve"])   # → ["cafe", "naive"]
  • strip_obfuscation() no longer transliterates. It now uses the TR39 confusable mapping (visual similarity) instead of phonetic transliteration, and the lang= parameter has been removed. Chain it with transliterate() if romanization is also needed.
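
The str / list[str] dispatch that replaces the batch functions can be sketched with @typing.overload. This is a minimal illustration of the pattern, not the library's implementation: strip_marks is a hypothetical stand-in that only removes combining accents using the standard-library unicodedata module, and it assumes Python 3.9+ for the list[str] annotation.

```python
import unicodedata
from typing import overload


# The overload stubs only inform type checkers: str in gives str out,
# list[str] in gives list[str] out.
@overload
def strip_marks(text: str) -> str: ...
@overload
def strip_marks(text: list[str]) -> list[str]: ...


def strip_marks(text):
    """Hypothetical stand-in: drop combining marks after NFKD decomposition."""
    if isinstance(text, str):
        decomposed = unicodedata.normalize("NFKD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything else is treated as a list of strings and handled per item.
    return [strip_marks(item) for item in text]


print(strip_marks("café"))             # cafe
print(strip_marks(["café", "naïve"]))  # ['cafe', 'naive']
```

With this shape, code that used to call a *_batch() function passes the list to the base function directly, and a type checker still infers the matching return type.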

New features

  • strip_obfuscation() — maximum-strength deobfuscation preset. Resolves homoglyph spoofing (Cyrillic р→p, с→c), strips zalgo, invisible chars, bidi attacks, expands emoji.
  • lang_info() / script_info() — structured metadata for all 83 languages and 57 scripts, with import-time drift assertions.
  • 18 new languages (Balinese, Bamum, Buginese, Cherokee, Cham, Coptic, Tai Lue, Lisu, Meitei, Northern Thai, N'Ko, Santali, Sundanese, Syriac, Tai Le, Tagalog, Tamazight, Vai) and 10 new Script enum members.
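
To make the difference between confusable mapping and transliteration concrete, here is a rough sketch of visual deobfuscation. It is not the library's code: deobfuscate_sketch and the five-entry CONFUSABLES table are hypothetical, chosen to illustrate the Cyrillic lookalikes mentioned above, whereas the real TR39 data covers thousands of characters. Emoji expansion is left out.

```python
import unicodedata

# Tiny illustrative subset of visual lookalikes (assumption: hand-picked,
# not loaded from the TR39 confusables data file).
CONFUSABLES = {
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}


def deobfuscate_sketch(text: str) -> str:
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            continue  # drops stacked combining marks (zalgo)
        if unicodedata.category(ch) == "Cf":
            continue  # drops format characters: zero-width, bidi controls
        out.append(CONFUSABLES.get(ch, ch))
    return "".join(out)


print(deobfuscate_sketch("\u0440\u0430y\u0440\u0430l"))  # paypal
print(deobfuscate_sketch("p\u200ba\u202ey"))             # pay
```

Note that this sketch also strips legitimate accents, since it removes every combining mark, and that the mapping is by appearance only: Cyrillic р becomes "p" here, while a phonetic romanization would give "r".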

Bug fixes

  • Combining marks and zero-width characters no longer produce [?] (283 new TSV mappings)
  • TextPipeline confusable ordering fixed (transliterate before confusables)
  • demojize() now inserts a space between adjacent emoji replacements ("🔥🔥" → "fire fire")
  • SCRIPT_RANGES sort order fix + invariant test
  • Tibetan documentation corrected (Indic-phonetic, not Wylie)
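
The notes do not say how SCRIPT_RANGES is laid out or why its order matters. One common reason is that a range table is searched with bisection, which silently returns wrong answers if the table is unsorted. The sketch below assumes a hypothetical list of (start, end, script) tuples and shows the kind of invariant check and lookup that would depend on it; all names are illustrative.

```python
from bisect import bisect_right

# Hypothetical range table: (first code point, last code point, script name).
RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
]


def check_sorted_disjoint(ranges):
    """Raise ValueError unless ranges are well-formed, sorted, and non-overlapping."""
    for start, end, name in ranges:
        if start > end:
            raise ValueError(f"empty range for {name}")
    for prev, cur in zip(ranges, ranges[1:]):
        if prev[1] >= cur[0]:
            raise ValueError(f"{prev[2]} and {cur[2]} overlap or are out of order")


STARTS = [start for start, _, _ in RANGES]


def lookup(code_point):
    """Binary search for the range containing code_point; None if no range matches."""
    i = bisect_right(STARTS, code_point) - 1
    if i >= 0 and RANGES[i][0] <= code_point <= RANGES[i][1]:
        return RANGES[i][2]
    return None


check_sorted_disjoint(RANGES)
print(lookup(ord("я")))  # Cyrillic
print(lookup(0x0300))    # None
```

Running the check in a test means a later edit that breaks the ordering fails immediately, before any lookup can misclassify a character.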

Infrastructure

  • API stability tests (133), mutation testing killers (92)
  • CI restructured: 10× faster Python tests, path-filtered CodeQL, no duplicate runs
  • Transliteration provenance documentation
  • docs/index.md generated from README.md (single source of truth)

See CHANGELOG.md for full details.