A feature and architecture release. Headlines: a unified, catchable exception
hierarchy; terminal column-width measurement (terminal_width /
grapheme_width); native errors="strict" transliteration; LLM/RAG
guardrail pipeline presets; and a substantial push of validation and
configuration logic down into the Rust core, so the upcoming multi-language
bindings inherit one behaviour instead of reimplementing it. Most changes are
behaviour-preserving; the exceptions are called out under Upgrade notes.
Upgrade notes
- Exceptions now form a hierarchy. Every library error subclasses
TranslitError, with InvalidArgumentError, ResourceLimitError, and
UnsupportedError beneath it. TranslitError remains a ValueError
subclass, so existing except ValueError keeps working. Several error
message strings were enriched/standardised (#186, #187): code matching
exact message text may need updating; code matching exception types is
unaffected.
- lang= is validated even for ASCII input (#197). A binding-side ASCII
fast path previously skipped language validation, so
transliterate("abc", lang="zz") silently returned the input; it now raises
InvalidArgumentError, matching how non-ASCII input always behaved.
- slugify_filename / Slugify(safe_chars=…) output corrected (see Fixed):
slugify_filename("My Report.pdf") now returns "My_Report.pdf", not
"My.Report_pdf". Output for inputs that use safe_chars may change.
- New modes: errors="strict" for transliterate (#184) and
decode_to_utf8(strict=True) (#189).
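
The compatibility claim in the first note can be illustrated with a small stand-in model. The classes below are local definitions that only mirror the names listed above; they are not imported from the library, and check_lang and KNOWN_LANGS are hypothetical helpers invented for the illustration.

```python
# Stand-in classes mirroring the names in the upgrade note. The real ones
# ship with the library; these exist only to show the subclass relationships.
class TranslitError(ValueError):
    """Base class: still a ValueError, so older handlers keep matching."""


class InvalidArgumentError(TranslitError):
    pass


class ResourceLimitError(TranslitError):
    pass


class UnsupportedError(TranslitError):
    pass


KNOWN_LANGS = {"de", "ru", "uk"}  # hypothetical set, for illustration only


def check_lang(lang):
    """Hypothetical validator that raises the categorised error."""
    if lang not in KNOWN_LANGS:
        raise InvalidArgumentError(f"unknown lang {lang!r}")
    return lang


def legacy_handler(lang):
    """Pre-hierarchy style: catches the broad built-in type."""
    try:
        check_lang(lang)
    except ValueError:
        return "ValueError"
    return "ok"


def precise_handler(lang):
    """Post-hierarchy style: catches only the specific category."""
    try:
        check_lang(lang)
    except InvalidArgumentError:
        return "InvalidArgumentError"
    return "ok"
```

Both handlers catch the same raised error, which is why code that matches on exception types does not need to change, while code that compares message strings might.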
Added
- terminal_width / grapheme_width (#224): terminal column width per
grapheme cluster (UAX #11 East Asian Width). Wide/fullwidth and
emoji-presented clusters are 2 columns; combining marks, controls, and
zero-width characters are 0. Ambiguous characters are 1 by default, or 2 with
ambiguous_wide=True. Width data is generated at build time from the pinned
UCD (no runtime data, no unsafe). Measures cells, not pixels; tabs are not
expanded.
- errors="strict" + find_untranslatable (#184): strict transliteration
raises on the first untranslatable character (reporting it and its byte
offset); find_untranslatable returns all of them without raising.
- Guardrail pipeline presets (#139): TextPipeline gains strip_bidi and
strip_zalgo steps and the llm_guardrail / rag_ingest named profiles for
LLM/RAG input sanitisation.
- get_pipeline / list_profiles (#229): the named policy-profile registry
now lives in the Rust core; the Python helpers are thin wrappers over it.
- decode_to_utf8(strict=True) (#189): raise on lossy/replacement decoding
instead of silently substituting U+FFFD.
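
The width rules listed for terminal_width above can be approximated with the standard library alone. The sketch below is not the library's implementation: approx_width is a hypothetical name, it uses whatever Unicode version the running Python bundles rather than the pinned UCD, and it works per code point rather than per grapheme cluster, so multi-code-point emoji sequences will be overcounted.

```python
import unicodedata


def approx_width(text, ambiguous_wide=False):
    """Approximate terminal columns for text, one code point at a time."""
    total = 0
    for ch in text:
        category = unicodedata.category(ch)
        # Non-spacing/enclosing marks (Mn, Me), controls (Cc), and format
        # characters such as ZERO WIDTH SPACE (Cf) occupy no columns.
        if category in ("Mn", "Me", "Cc", "Cf"):
            continue
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            # Wide and Fullwidth characters take two cells.
            total += 2
        elif eaw == "A":
            # Ambiguous characters depend on the caller's choice.
            total += 2 if ambiguous_wide else 1
        else:
            total += 1
    return total


print(approx_width("abc"))                          # ASCII letters
print(approx_width("日本語"))                        # wide CJK
print(approx_width("e\u0301"))                      # base letter + combining mark
print(approx_width("\u00b1", ambiguous_wide=True))  # PLUS-MINUS SIGN, ambiguous
```

Because the approximation counts cells rather than characters, padding a column of mixed-script text means subtracting this value from the target width instead of using len().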
Changed
- Unified exception hierarchy (#183): the Python error surface is a
TranslitError base with categorised subclasses; sites that previously raised
bare ValueError are unified (foundation laid in 0.6.3 via #181).
- Validation moved into the Rust core (#185, #217, #229, #230, #231): enum
validation, the transliterate() argument-conflict matrix, non-negative
max_length / max_graphemes checks, safe_chars, and min_confidence
range-checking now live in the core, so other bindings enforce the identical
contract without reimplementing it. The Python layer keeps only type guards.
- Actionable error messages (#186, #187): weak messages now name the
offending value, list valid options, and suggest a "did you mean…?" where
applicable; message style is standardised across the surface.
- Error cause chains (#188): wrapped errors surface the underlying cause via
__cause__ rather than flattening it into the message.
- TextPipeline step ordering (#174) is derived from a single source of
truth, removing drift between configuration and execution order.
- All-ASCII preset fast path (#198): presets skip the NFKC pass for pure-ASCII
input (behaviour-preserving).
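
The cause-chain change relies on standard Python exception chaining. As a minimal sketch of the pattern (TranslitError here is a local stand-in class, and load_table is a hypothetical helper, not a library function):

```python
class TranslitError(ValueError):
    """Local stand-in for the library's base error."""


def load_table(raw):
    """Hypothetical helper: parse 'a=b' pairs, wrapping low-level failures."""
    try:
        return dict(pair.split("=") for pair in raw.split(","))
    except ValueError as exc:
        # 'from exc' stores the original error on __cause__ instead of
        # pasting its text into the new message.
        raise TranslitError("malformed mapping table") from exc


try:
    load_table("a=b,broken")
except TranslitError as err:
    print(type(err.__cause__).__name__)  # the wrapped low-level error type
    print(str(err))                      # the short high-level message
```

Callers that log errors can walk __cause__ to recover the low-level detail, while the top-level message stays short and stable.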
Fixed
- slugify_filename / Slugify(safe_chars=…) preserved safe characters at
the wrong positions: slugify_filename("My Report.pdf") returned
"My.Report_pdf" instead of the awesome-slugify-compatible "My_Report.pdf".
safe_chars are now handled natively in the Rust core: kept verbatim and
treated as word characters so they hold their position (#156, #230). The prior
test only covered a dot-free input, so the bug was uncaught; regression tests
now cover filenames with extensions, multiple dots, and UniqueSlugify +
max_length.
- slugify(default=…) is now sanitised through the same slug pipeline (so a
caller-supplied fallback cannot smuggle path-traversal or URL metacharacters
into output documented as URL-safe) and threads through the stateful
Slugifier / UniqueSlugifier forms. A negative max_length now raises a catchable
InvalidArgumentError on both the scalar and batch paths instead of an
uncatchable OverflowError (#193, #169).
- Low-severity hardening bundle (#200): eight small robustness fixes
(bounds, overflow, and edge-case handling) gathered into one pass.
Security
- The RustSec advisory audit (cargo-audit) now blocks merge via the
required "Rust checks passed" gate on every PR, because an advisory can land
on a dependency without any code change here (#195).
Removed
- Docker image build/publish and its Trivy CVE scan (#138). translit is a
pip install-first library; previously published images remain as historical
artifacts, but no new ones are produced. Install the CLI via
pip install translit-rs.
Documentation
- Executable cookbook (#154, #91, #140, #156, #172): a Sybil doc-test harness
with a CI gate, unidecode→translit migration recipes, an "LLM pipelines" page,
a tokenizer-preprocessing page, and an anti-rot lint that turned 307 decorative
# => claims into checked assertions.
- normalize-first canonicalisation recipe (#174) and a formal-verification
assurance taxonomy (#223: proof-by-exhaustion / structural / property-tested,
tagging each I1–I7 invariant), plus grapheme-integrity property tests (#174).
- The project adopted the Developer Certificate of Origin (#165); all commits
are signed off. The custom-emoji-provider 9-codepoint window cap is now
documented (#199).