v0.4.0
·
12 commits
to main
since this release
[0.4.0] - 2026-06-11
Changed
- The
CodePointcontract is now documented as such, on the type itself
(encoding.CodePoint) and in the README: values of this type are presumed
valid Unicode scalars; producers uphold the contract, consumers rely on it
and skip decoding/validation (which is what makes the[]const CodePoint
API variants the cheap path for already-decoded text). The README quick-look
examples were also brought back in sync with the real signatures
(initUTF8Viewout-param,view.iter(),nfcover code points). - Breaking: "unchecked" now means one thing everywhere — the caller
guarantees the documented preconditions; violations are asserted /
safety-checked (trap in Debug/ReleaseSafe, undefined in
ReleaseFast/ReleaseSmall); unchecked functions never return errors or
panic. Accordingly:encoding.utf8.codePointLenReverseUncheckedreturns plainu3
(wasUTF8ValidationError!u3).encoding.utf16.utf16SequenceLenReverseUncheckedreturns plainu2
(wasUTF16ValidationError!u2) and assertsend_index < buf.len
instead of returningZeroLengthUnits.encoding.utf8.decodeCodePointReverseUncheckedno longer documents (or
contains) a panic path; its preconditions are asserted.
Removed
- Breaking: the
Undefinedmember is gone from all six UTF-16/UTF-32
error sets (UTF16ValidationError,UTF16ValidationLossyError,
UTF16EncodeError,UTF32ValidationError,UTF32ValidationLossyError,
UTF32EncodeError). It was never returned by any code path; exhaustive
switches over these error sets can drop their deaderror.Undefinedarm.
Fixed
unicode.segmentationbyte-level iterators (grapheme / word / sentence /
line) and step helpers no longer containcatch @panicshims around lossy
decoding. They decode throughencoding.utf8.decodeCodePointLossy, so the
"lossy never errors" promise now holds structurally: malformed UTF-8 yields
U+FFFD segments and can never trap.lineStepBytesdocuments its
byte_pos < bytes.lencontract (asserted, safety-checked).
Added
- Early-exit collation compare (
@stable-since: v0.4.0):
Collator.compareCodePointsIncremental/compareUtf8Incrementalgenerate
collation elements lazily and stop at the first differing weight of the
shallowest differing level — no sort keys are materialized. Identical
results tocompareCodePoints/compareUtf8for every input and option
set (verified across the strength × variable-weighting matrix). Strings that
differ early at the primary level — the common case — pay for only a few
collation elements; the weighting logic is shared withbuildKeyvia an
internalresolveCEso the two paths cannot diverge. - BOM utilities — new
encoding.bommodule (@stable-since: v0.4.0):
Bomenum (utf8 / utf16_le / utf16_be / utf32_le / utf32_be) with
bytes/len/endian/match, plusdetect(longest-match: the
ambiguousFF FE 00 00reports UTF-32 LE) and zero-copystrip. The
codecs themselves still never consume or produce BOMs; this is the
explicit seam, re-exported at the package root asbom. unicode.normalization.nfkcCaseFold— sequence-level NFKC_Casefold
(@stable-since: v0.4.0), per the UCD definition
NFKC_CF(X) = NFC(toNFKC_Casefold(X)), built on the already-shipped
per-scalarnfkcCaseFoldMaptable. The identifier-caseless form used by
UAX #31 and security profiles: fullwidth compatibility variants fold
(ABC→abc), Default_Ignorables map away, and the result is
idempotent (verified by a BMP-wide sweep).- String-level titlecase in
unicode.casing
(@stable-since: v0.4.0):titlecaseAlloc([]const CodePoint) and
titlecaseUtf8Alloc(UTF-8), implementing the Unicode default algorithm
(R3) — UAX #29 word segmentation, full titlecase mapping on each word's
first cased scalar, full lowercase on the rest, with Final_Sigma context
for U+03A3 ("ΜΕΓΑΣ"→"Μεγας"). Default root-locale mappings; no
Turkic/Lithuanian tailoring. - Case-insensitive search in
unicode.casing
(@stable-since: v0.4.0):indexOfFold/containsFoldover UTF-8 bytes
andindexOfFoldCodePoints/containsFoldCodePointsover scalar slices.
Both sides fold lazily during the scan — no allocation — honoring expanding
folds in.fullmode ("STRASSE"matches"Straße"), with a whole-scalar
boundary rule (needle"s"never matches inside"ß"'s expansion). The
CodePoint variants skip decoding/validation entirely per theCodePoint
contract. - SIMD chunked scanners for UTF-16, mirroring the v0.3.0 UTF-8 set
(@stable-since: v0.4.0, portable@Vectorcompares with scalar tails, no
target intrinsics):utf16.nonSurrogateRunLength(length of the leading
run of standalone scalars — the UTF-16 analogue ofasciiRunLength) and
utf16.countScalarsSimd(unchecked scalar count via the high-surrogate
rule).utf16.validatenow skips surrogate-free runs in SIMD strides and
falls back to scalar pair checks only at actual surrogates. - Bulk encode-direction APIs taking
[]const CodePoint
(@stable-since: v0.4.0):encodeCodePoints{Len,Buffer,Alloc}on all three
codecs, the inverse ofbytesToUTF8String/bufToUTF16String/
bufToUTF32String. Callers holding already-decoded scalars encode without
any decoding or validation, per theCodePointcontract. encoding.utf8.StreamingValidator— incremental, resumable validation over
arbitrarily-chunked input (@stable-since: v0.4.0). The Höhrmann DFA state
carries across chunk boundaries (no buffering, no copies);updatereports
the absolute offset of the first malformed sequence,finishdistinguishes
a truncated trailing scalar from valid end-of-input, and ASCII runs are
skipped in SIMD strides.- Error position reporting (
@stable-since: v0.4.0):invalidIndexon
all three codecs returns the unit/byte offset where the first malformed
sequence starts (nullwhen valid), so diagnostics no longer require a
re-scan. Decoding strictly at the reported offset recovers the fine-grained
error. The UTF-8 variant skips ASCII runs in SIMD strides. - Unchecked forward decode entry points, completing the strict / unchecked
/ lossy matrix in the forward direction (@stable-since: v0.4.0): callers
holding already-validated text can decode without re-validating and without
wrapping the input in a View.encoding.utf8.decodeCodePointUncheckedencoding.utf16.decodeU16CodePointUncheckedencoding.utf32.decodeU32CodePointUnchecked
encoding.utf8.decodeCodePointLossy— infallible lossy decode primitive
(@stable-since: v0.4.0). Malformed sequences yield U+FFFD and are never
reported as errors; the only precondition (offset < bytes.len) is asserted
(safety-checked), not error-returned, and the returnedlenis always >= 1
so forward scans are guaranteed to make progress.UTF8LossyIteratorand
UTF8SimdLossyIteratornow decode through it, removing their internal
catch unreachable/catch breakshims.