Release v0.6.0 — security hardening · raeq/translit

[0.6.0] — 2026-06-07

A hardening and bug-fix release. Two new opt-in helpers (dedup_batch,
make_cached_transliterator) make this a minor bump; no public API was
removed. Several fixes change output for specific inputs — read Upgrade
notes before upgrading if you cache or persist transliterator/normalizer output.

Upgrade notes (output-affecting fixes)

Each of these was a bug; the new output is the correct one. If you store or cache
results that were keyed on the old (buggy) behaviour, regenerate them:

- register_replacements() now actually applies. It was a silent no-op: the
  registered table was never consulted. Registered replacements now take effect
  across transliterate() (scalar, list, and context=True). If you registered
  replacements and (knowingly or not) relied on them being ignored, output changes.
- transliterate(list, tones=True) now returns toned pinyin (was silently
  toneless on the list path); transliterate(list, target=…, tones=True) now
  raises ValueError for the forward-only parameter (was silently ignored).
- normalize_confusables(text, target="cyrillic") no longer maps characters
  onto invisible combining marks (28 such mappings removed).
- strip_obfuscation now folds intra-Latin homoglyphs of ASCII letters (þ→p,
  ſ→f, ı→i, …) and is idempotent; sanitize_user_input is idempotent for
  control/invisible characters between combining marks; demojize no longer
  inserts a stray space after a tab/newline that precedes an emoji.
- Context-aware transliteration (context=True, ar/fa/he): distribution
  changed. The empty arabic/hebrew/context pip extras have been removed
  (they never installed anything). The ~37 MB dictionaries are no longer tracked
  in git and are not shipped in the wheel. Context mode now loads dictionaries
  from $TRANSLIT_DICT_DIR (build them with scripts/bootstrap_dicts.sh);
  alternatively, use the embed-dicts Cargo feature for a self-contained build.
  A packaged pip-installable distribution is tracked in #56/#60.
- decode_to_utf8 default min_confidence changed 0.0 → 0.5. Low-confidence
  encoding guesses are now rejected by default instead of silently accepted; pass
  min_confidence=0.0 to restore the old behaviour. (#66)
- Unknown lang codes now raise instead of silently falling back (#68). A
  typo'd code (lang="RU", lang="russian") used to behave exactly like
  lang=None (quietly wrong output), even though errors=/form= already rejected
  bad values. transliterate, slugify, sanitize_filename, catalog_key,
  search_key, sort_key, and ml_normalize now raise TranslitError listing
  the valid codes. "auto", the nb/nn/da aliases, and register_lang()
  codes are accepted. (target= was already validated.)
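
For the regeneration advice at the top of these notes, one low-effort approach is to fold the library version into every cache key, so entries written under the old behaviour simply stop matching after the upgrade. The following is a minimal sketch, not part of translit: cache_key, the dict-based store, and the "previous-release" version string are all hypothetical placeholders.

```python
import hashlib
import json


def cache_key(lib_version: str, options: dict, text: str) -> str:
    """Build a key that changes whenever the library version or options change."""
    payload = json.dumps(
        {"v": lib_version, "opts": options, "text": text},
        sort_keys=True,       # option order must not affect the key
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical store; a real deployment might use Redis or a database table.
store = {}

# "previous-release" is a placeholder for whatever version wrote the entry.
old_key = cache_key("previous-release", {"lang": "ru"}, "Привет")
store[old_key] = "output computed before the upgrade"

new_key = cache_key("0.6.0", {"lang": "ru"}, "Привет")
# The 0.6.0 key misses, forcing a recompute instead of serving the old value.
needs_recompute = new_key not in store
```

Old entries are never read again under this scheme, so they can be expired lazily rather than purged during the upgrade.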

Changed

- No library-imposed input-size limit (#80, #65). The 10 MiB input cap on
  transliterate, normalize, fold_case, and the preset pipelines has been
  removed. It was paternalistic, inconsistently applied (the ASCII fast
  path bypassed it; slugify/normalize_confusables/strip_zalgo never had it),
  and the threat model already disclaims DoS. All operations are linear in time
  and memory; bounding untrusted input is the caller's responsibility, as
  documented in the threat model and docstrings. The single retained size guard
  is the register_replacements output amplification bound: a tiny input can
  expand to an enormous string via a caller-registered value, an amplification
  a caller's own input check cannot foresee. Backward-compatible: only
  previously rejected large inputs now succeed.
- External wording: capability, not promise. Security-relevant features are now
  described as mechanisms (TR39 confusable mapping, bidi/zalgo stripping, hostname
  analysis) rather than outcome guarantees. Package descriptions, README, and docs
  no longer claim to "prevent"/"neutralize" attacks or achieve "perfect" recovery;
  the XMR benchmark figure is always stated with its tested-pairs scope.
  Engineering rigor is held to a high internal bar (see below); the external
  surface promises nothing it cannot measure.
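
Since bounding untrusted input is now the caller's job, a thin guard at the trust boundary is one way to do it. This is a sketch under stated assumptions: MAX_REQUEST_BYTES, InputTooLarge, and bounded are hypothetical application-side names, and str.lower stands in for a real library call such as transliterate.

```python
from typing import Callable

# Application-chosen limit (an assumption for this sketch, not a library constant).
MAX_REQUEST_BYTES = 1 * 1024 * 1024


class InputTooLarge(ValueError):
    """Raised when untrusted input exceeds the application's own limit."""


def bounded(process: Callable[[str], str],
            limit: int = MAX_REQUEST_BYTES) -> Callable[[str], str]:
    """Wrap a text-processing callable with a UTF-8 byte-length check."""

    def guarded(text: str) -> str:
        size = len(text.encode("utf-8"))
        if size > limit:
            raise InputTooLarge(f"input is {size} bytes; limit is {limit}")
        return process(text)

    return guarded


# str.lower stands in for the real call; the 16-byte limit is only for illustration.
safe_lower = bounded(str.lower, limit=16)
```

Measuring encoded bytes rather than code points keeps the limit aligned with what actually arrives over the wire.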

Added

- dedup_batch(texts, …): transliterates a list, processing each distinct
  value once and mapping the results back (a large win for repeated/categorical
  data; ~146× on a high-locality column). Stateless, so there is no cache to
  invalidate; unique values are chunked at the 100k batch cap. (#31)
- make_cached_transliterator(maxsize=…, …): opt-in LRU-cached single-string
  transliterator with options fixed at construction. Self-invalidating: the next
  call after any register_lang/register_replacements/remove_replacement/
  clear_replacements clears the cache (via an internal table-generation counter),
  so it never serves stale results. Never enabled by default. (#31)
- THREAT_MODEL.md: defines in-scope mechanisms, explicit out-of-scope items
  (confusables outside the bundled TR39 table, whole-script and multi-character
  confusables, Unicode-version skew, semantic attacks, DoS), and a
  vulnerability-vs-known-limitation policy, grounded in the literature
  (Holgers 2006, Deng 2020, BitAbuse 2025).
- SECURITY.md rewritten on real footing: supported-version policy stated, triage
  scope defined, and linked to the threat model.
- Security-invariant property tests + fuzzing. This adds proptest invariants in
  Rust (src/presets.rs) that assert no-panic, idempotence, and "no bidi/format
  control survives" for strip_obfuscation / security_clean / sanitize_user_input /
  strip_bidi across the Unicode input space; a deterministic, CI-gating
  adversarial attack-corpus regression (tests/test_attack_corpus.py:
  homoglyph / zalgo / invisible / bidi / combined, XMR-style); and a cargo-fuzz
  harness (fuzz/) for continuous coverage-guided fuzzing of the defense
  pipelines.
- Confusable coverage for intra-Latin homoglyphs of basic ASCII letters
  (e.g. þ→p, ſ→f, ı→i, ƒ→f, Ɩ→l, ꜱ→s). The TR39 generator previously
  skipped all Latin-script sources for the Latin target, dropping ~83 genuine
  homoglyphs of A–Z/a–z; normalize_confusables/strip_obfuscation now fold
  them. Single-letter Latin confusable coverage of UTS#39 is now complete.
- Pinned data/confusables.txt (UTS#39 17.0.0) as the reproducible,
  version-controlled input for scripts/gen_confusables.py (--download refreshes
  it), and added a tests/test_confusable_coverage.py gate against
  Unicode-version drift.
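
The two helpers from #31 rest on techniques that are easy to show in isolation: deduplicate, compute once, map back; and an LRU cache that empties itself when a generation counter moves. The sketch below illustrates those ideas only. It is not the library's implementation, and dedup_map, GenerationCache, counting_upper, and table_generation are hypothetical names.

```python
from collections import OrderedDict
from typing import Callable, Iterable, List


def dedup_map(fn: Callable[[str], str], texts: Iterable[str]) -> List[str]:
    """Call fn once per distinct value, then map results back in input order."""
    items = list(texts)
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    results = {value: fn(value) for value in dict.fromkeys(items)}
    return [results[value] for value in items]


class GenerationCache:
    """LRU cache that clears itself when an external generation counter moves."""

    def __init__(self, fn, current_generation, maxsize=1024):
        self._fn = fn
        self._current_generation = current_generation  # callable returning an int
        self._maxsize = maxsize
        self._seen_generation = current_generation()
        self._entries = OrderedDict()

    def __call__(self, text):
        generation = self._current_generation()
        if generation != self._seen_generation:
            # Shared tables changed since the last call: drop every entry.
            self._entries.clear()
            self._seen_generation = generation
        if text in self._entries:
            self._entries.move_to_end(text)  # mark as most recently used
            return self._entries[text]
        value = self._fn(text)
        self._entries[text] = value
        if len(self._entries) > self._maxsize:
            self._entries.popitem(last=False)  # evict least recently used
        return value


# Usage with a stand-in function that records how often it runs.
calls: List[str] = []


def counting_upper(text: str) -> str:
    calls.append(text)
    return text.upper()


table_generation = [0]  # bumped by whatever code mutates the shared tables
cached_upper = GenerationCache(counting_upper, lambda: table_generation[0], maxsize=2)
```

Checking the counter on every call costs one integer comparison, which is why invalidation can be lazy: nothing needs to notify the cache when the tables change.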

Fixed

- register_replacements() was a silent no-op: the global table was stored
  but never consulted by transliterate(). It now applies as a longest-match
  pre-pass (no cascade) across the scalar, list, and context=True forward paths,
  including ASCII-keyed replacements that previously bypassed Rust via the Python
  fast path. (#51)
- tones= on the list/batch path was dropped: transliterate(["北京"], tones=True)
  returned toneless pinyin while the scalar path returned toned, and
  transliterate([...], target=…, tones=True) silently ignored the forward-only
  parameter instead of raising. Both now match the scalar path. (#14, #15)
- normalize_confusables(target="cyrillic") emitted invisible combining marks:
  28 mappings folded a visible character onto a combining Cyrillic-Extended mark
  (an obfuscation vector). The generator now excludes combining-mark targets. (#24)
- script_info("CanadianAboriginal")["context_aware"] raised KeyError: the
  entry omitted a required ScriptMeta field; a completeness guard now prevents
  recurrence. (#18)
- Context path skipped strict_iso9/gost7034 mutual-exclusion validation:
  transliterate(text, context=True, strict_iso9=True, gost7034=True) now raises
  ValueError like the non-context path; the missing-dictionary error hint is now
  language-specific (he→hebrew). (#18)
- demojize inserted a stray space after a tab/newline preceding an emoji
  ("a\t😀" → "a\t grinning face"); it now checks for any whitespace. (#12)
- Compatibility digit variants fold to digits, not letters (#89). The
  confusables table mapped Mathematical Alphanumeric digits 𝟎/𝟏 (and the
  other four families, plus superscripts) to the look-alike letters O/l, so
  normalize_confusables("𝟏𝟎") gave "lO" and strip_obfuscation corrupted
  digit runs. The generator now folds any character whose NFKC form is an ASCII
  digit to that digit. They remain detected as confusable (is_confusable),
  but canonicalize to the correct number. (ASCII 0/1 were already unaffected.)
- NFKC-compatible Latin is recovered instead of dropped to [?] (#81).
  Mathematical Alphanumeric Symbols (𝕳𝖊𝖑𝖑𝖔 𝟙𝟚𝟛 → Hello 123), presentation
  ligatures (ﬁ/ﬂ → fi/fl), and superscripts (x² → x2) now
  transliterate: an unmapped non-ASCII char is NFKC-decomposed and re-tried
  before the error fallback. This matches unidecode/anyascii and closes a
  filter-evasion ("fancy text") gap. Purely additive: only chars that were
  previously [?] are affected; emoji (no ASCII decomposition) still map to [?].
- Defense pipelines are now idempotent (bugs found by the property tests):
  - strip_obfuscation: emoji whose CLDR name contains typographic punctuation
    (e.g. 👒 → woman’s hat, U+2019 ’) weren't folded because confusables ran
    before demojize; a second pass folded ’→'. Confusables now runs after demojize.
  - sanitize_user_input: an invisible or control character between combining
    marks (e.g. soft-hyphen, NUL) split a mark-run, so removing it after
    zalgo-capping merged runs that a second pass then capped differently. Bidi,
    zero-width, and control characters are now stripped before zalgo-capping.
- Build-time and doc corrections: build.rs now rejects malformed \u{…} escapes
  in TSV data; embedded-dictionary parse errors are logged (not silently dropped);
  and numerous stale docstrings/comments were corrected (script_to_lang returns
  ISO 639-1 or 639-3; normalize() ASCII fast-path; list single-Rust-call caveats).
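
The NFKC retry described under #81 and #89 can be illustrated with Python's standard unicodedata module. This is a sketch of the general idea only: TABLE is a tiny stand-in for the real mapping tables, to_ascii is a hypothetical name, and the library's Rust implementation may differ in detail.

```python
import unicodedata

# Tiny stand-in mapping table (an assumption; the real tables are far larger).
TABLE = {"ß": "ss", "é": "e"}


def to_ascii(text: str, table: dict = TABLE, fallback: str = "[?]") -> str:
    """Map each character to ASCII, retrying via NFKC before falling back."""
    out = []
    for ch in text:
        if ch.isascii():
            out.append(ch)
        elif ch in table:
            out.append(table[ch])
        else:
            # Retry: apply NFKC, then accept the result only if every
            # resulting character is ASCII or has its own table entry.
            decomposed = unicodedata.normalize("NFKC", ch)
            pieces = [c if c.isascii() else table.get(c) for c in decomposed]
            if all(piece is not None for piece in pieces):
                out.append("".join(pieces))
            else:
                out.append(fallback)
    return "".join(out)


print(to_ascii("𝕳𝖊𝖑𝖑𝖔 𝟙𝟚𝟛"))  # Hello 123
print(to_ascii("𝟏𝟎"))          # 10
print(to_ascii("😀"))           # [?]
```

Because the retry runs only after the table lookup misses, existing mappings keep their output, which is what makes the change purely additive.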

Security

- seal_registrations() / registrations_sealed() (#64, high). The
  register_lang/register_replacements APIs mutate process-global tables
  consulted by every transliterate/slugify/catalog_key/… call, so in a
  multi-tenant or web process one import or request handler could silently alter
  everyone's canonicalization. seal_registrations() is a one-way latch: after
  it is called, register/remove/clear raise TranslitError. The registration
  APIs are now documented as startup-only/single-writer. Separately, a poisoned
  lock no longer resets registrations to defaults (a panic in one thread
  could previously wipe another caller's registered languages); it now recovers
  the data as-is.
- is_safe_hostname now decodes IDN/xn-- labels (#63, high). Previously an
  xn-- ACE label was pure ASCII, hence single-script, hence reported safe, so
  the on-the-wire form of the IDN homograph attack (a Cyrillic
  xn--80ak6aa92e.com "apple" spoof) sailed through: the exact blind spot for a
  library marketing idn/anti-spoofing. ACE labels are now UTS#46-decoded (via
  the idna crate) before script/confusable analysis; a malformed ACE label is
  treated as unsafe. Non-xn-- labels are untouched (no false positives on,
  e.g., my_host.local).
- is_safe_hostname fails closed (#67.1). A confusable-check error no longer
  silently degrades to "not confusable" (unwrap_or(false)) and therefore
  "safe"; it now marks the hostname unsafe.
- strip_bidi/display_clean now also strip deprecated format controls
  (U+206A–U+206F) and interlinear annotation marks (U+FFF9–U+FFFB) (#67.2),
  which were previously only handled as transliteration-table entries.
- NFKC×confusables composition pinned (#67.3). Added a regression test that
  fixes the exact set of NFKC-ASCII results that normalize_confusables re-maps
  (`→', "→'', |→l), so that a data/ordering change (e.g. reintroducing
  digit→letter) fails loudly; it also checks that presets resolve NFKC/TR39
  conflicts (ſ→s) via NFKC.
- Context dictionaries are no longer loaded from a CWD-relative path (#61).
  load_dict_from_fs previously probed ./data/{name}_dict.bin first, so a
  process whose working directory an attacker influences (or where they can drop
  ./data/) could inject a substitute dictionary and silently change ar/fa/he
  output. Dictionaries now load only from $TRANSLIT_DICT_DIR (explicit opt-in)
  or the crate's own absolute data/ path in source builds.
- Supply-chain: corpus inputs are verified/pinned (#62). The Tashkeela corpus
  archive is now checksum-verified before it feeds the builders (fail-closed: an
  unpinned checksum aborts unless ALLOW_UNVERIFIED_CORPUS=1), and the Project
  Ben Yehuda corpus is fetched at a pinned commit instead of an unpinned live HEAD.
- ContextDict::from_bytes is fully bounds-checked. A malformed or truncated
  context dictionary previously caused an out-of-bounds panic (the crate is
  unsafe_code = forbid, so a panic aborts the process). Every read is now
  bounds-checked and section offsets are validated; capacity hints are clamped.
  Added truncation/bogus-offset/u32::MAX-count unit tests. (#18)
- register_replacements expansion is bounded. Replacement values are
  caller-controlled and unbounded; a small input with a large value could expand
  past the transliterate input cap. Output is now bounded during construction and
  rejected once it would exceed MAX_TRANSLITERATE_INPUT_BYTES. (#51)
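
To see why an undecoded xn-- label defeats a single-script check (#63), and what failing closed looks like (#67.1), here is a rough sketch using only Python's standard library. It uses the bare punycode codec rather than full UTS#46 processing, and guesses scripts from Unicode character names, so it approximates the idea and does not reproduce is_safe_hostname. All function names are hypothetical.

```python
import unicodedata


def decode_ace_label(label: str) -> str:
    """Decode one xn-- (ACE) label to Unicode; other labels pass through."""
    if not label.lower().startswith("xn--"):
        return label
    # Raises UnicodeError on malformed input; callers should treat that as unsafe.
    return label[4:].encode("ascii").decode("punycode")


def label_scripts(label: str) -> set:
    """Rough script guess: the first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


def looks_single_script_latin(hostname: str) -> bool:
    """True only if every label decodes cleanly and contains Latin letters only."""
    for label in hostname.split("."):
        try:
            decoded = decode_ace_label(label)
        except UnicodeError:
            return False  # fail closed on malformed ACE labels
        if label_scripts(decoded) - {"LATIN"}:
            return False
    return True


print(looks_single_script_latin("my_host.local"))       # True
print(looks_single_script_latin("xn--80ak6aa92e.com"))  # False
```

Without the decoding step, the second hostname consists entirely of ASCII letters, digits, and hyphens, so any check that looks only at the wire form reports it as plain Latin.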

Internal / tests

- 170 deterministic tests were excluded from CI. A module-level
  pytestmark = pytest.mark.hypothesis in test_filename_regressions.py and
  test_case_folding.py (filename-security and case-folding regressions) deselected
  the entire files under CI's -m "not hypothesis" filter; only ~10 were actual
  property tests. The mark is now scoped to the property-test class in each file,
  so the deterministic tests run in CI. (#12)
- New tests: register_replacements (unit + Hypothesis property), context-dict
  parser robustness, resolve_auto_lang for all 18 scripts added in v0.3.0+, and a
  SCRIPT_META field-completeness guard.
- CI/workflow hygiene: concurrency group on secret-scan, uv.lock in the benchmark
  path filter, and CodeQL no longer triggered by Rust-only changes.
