Releases: shaik-abdul-thouhid/ezi-code
v0.4.1
[0.4.1] - 2026-06-11
Added
- String-level full-expansion upper/lower case drivers in unicode.casing
(@stable-since: v0.4.1): upperFull{Len,Buffer,Alloc} /
lowerFull{Len,Buffer,Alloc} over []const CodePoint and
upperFullUtf8{Alloc,Writer} / lowerFullUtf8{Alloc,Writer} over UTF-8,
mirroring the existing foldFull* surface. These apply expanding case
mappings the simple drivers cannot: "straße" upper-cases to "STRASSE",
"ﬀ" (U+FB00) to "FF". Default root locale (no Turkic tailoring); the lower
drivers use the context-free mapping (no Greek Final_Sigma; use
titlecaseUtf8Alloc or the per-scalar API for that).
- Grapheme-cluster-aware display width in
unicode.width
(@stable-since: v0.4.1): stringWidthGraphemes / stringWidthGraphemesLossy
/ stringWidthGraphemesCodePoints and the per-cluster graphemeClusterWidth.
Unlike the per-scalar stringWidth* estimators (unchanged), these count each
UAX #29 grapheme cluster once, so a ZWJ emoji family (👨‍👩‍👧) measures 2
columns instead of 6. Emoji-presentation sequences (VS16) and flags (regional
indicator pairs) are counted as 2.
- Infallible lossy decode primitives
encoding.utf16.decodeU16CodePointLossy
and encoding.utf32.decodeU32CodePointLossy (@stable-since: v0.4.1),
mirroring utf8.decodeCodePointLossy: malformed units yield U+FFFD with no
error union, preconditions asserted. The UTF-16/UTF-32 lossy iterators now
decode through them, so the "lossy never errors, structurally" guarantee
(previously UTF-8 only) holds for all three codecs: no catch @panic
remains on any lossy path.
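
Example (illustrative only): the "no error union" shape of a lossy UTF-16 decode can be sketched as below. The names `Decoded` and `decodeLossy` are hypothetical and are not the library's signatures; the sketch only assumes the behaviour described above, that a malformed unit becomes U+FFFD and the precondition is asserted rather than returned as an error.

```zig
const std = @import("std");

/// Hypothetical result type for this sketch (not the library's).
const Decoded = struct { code_point: u21, len: usize };

/// Decode one scalar starting at `units[offset]`. Unpaired surrogates are
/// replaced with U+FFFD, so the return type carries no error union.
fn decodeLossy(units: []const u16, offset: usize) Decoded {
    std.debug.assert(offset < units.len); // precondition: asserted, not returned
    const first = units[offset];
    // Not a surrogate: the unit is the scalar itself.
    if (first < 0xD800 or first > 0xDFFF) {
        return .{ .code_point = first, .len = 1 };
    }
    // High surrogate followed by a low surrogate: combine the pair.
    if (first <= 0xDBFF and offset + 1 < units.len) {
        const second = units[offset + 1];
        if (second >= 0xDC00 and second <= 0xDFFF) {
            const high: u21 = first - 0xD800;
            const low: u21 = second - 0xDC00;
            return .{ .code_point = 0x10000 + (high << 10) + low, .len = 2 };
        }
    }
    // Lone or misordered surrogate: substitute and consume exactly one unit.
    return .{ .code_point = 0xFFFD, .len = 1 };
}
```

Because the returned length is always at least 1, a forward loop that advances by `len` always terminates.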
Changed
- Breaking: the UTF-8 stream's OutputBufferTooSmall error is renamed to
BufferTooSmall, the name the encoding, transcoding, casing, and collation
layers already use, so the whole library reports one error for "output
buffer too small" (UTF8Stream.nextCodePoint / nextCodePointLossy). The
sibling NeedMoreBytes / EOFReached errors keep input-side starvation distinct.
- unicode.properties.isAscii now delegates to encoding.isAscii instead of
hardcoding <= 0x7F; every scalar ASCII check in the library now routes
through the single encoding.isAscii predicate. The hex-digit predicate docs
(isHexDigit, isHexDigitWide, isAsciiHexDigit) were clarified so the
ASCII-vs-Unicode distinction is explicit (no renames).
- Internal: collation-element generation is unified behind one
Collator.recordAt, shared by buildKey and the incremental comparator, so
DUCET record lookup, discontiguous-contraction extension, and implicit
weighting cannot diverge between the sort-key and early-exit paths (closes a
divergence risk from 0.4.0; no API change).
- Internal: build.zig no longer imports library source (src/utils/root.zig)
for its some build-option predicate; it carries a local copy, decoupling
the build graph from module layout.
Removed
- Breaking: the dead BufferIsEmpty error variant is gone from
UTF8Stream.nextCodePoint / nextCodePointLossy; it was declared but
returned by no code path (the same cleanup as 0.4.0's error.Undefined).
Fixed
- The unchecked-decode contract ("never panics … undefined in ReleaseFast") is
now honored uniformly: three internal sites used @panic (which traps in all
modes) instead of unreachable, most visibly decodeCodePointReverseUnchecked,
whose doc promised no panic. Replaced with unreachable so behavior matches
the documented contract.
v0.4.0
[0.4.0] - 2026-06-11
Changed
- The CodePoint contract is now documented as such, on the type itself
(encoding.CodePoint) and in the README: values of this type are presumed
valid Unicode scalars; producers uphold the contract, consumers rely on it
and skip decoding/validation (which is what makes the []const CodePoint
API variants the cheap path for already-decoded text). The README quick-look
examples were also brought back in sync with the real signatures
(initUTF8View out-param, view.iter(), nfc over code points).
- Breaking: "unchecked" now means one thing everywhere: the caller
guarantees the documented preconditions; violations are asserted /
safety-checked (trap in Debug/ReleaseSafe, undefined in
ReleaseFast/ReleaseSmall); unchecked functions never return errors or
panic. Accordingly:
  - encoding.utf8.codePointLenReverseUnchecked returns plain u3
    (was UTF8ValidationError!u3).
  - encoding.utf16.utf16SequenceLenReverseUnchecked returns plain u2
    (was UTF16ValidationError!u2) and asserts end_index < buf.len
    instead of returning ZeroLengthUnits.
  - encoding.utf8.decodeCodePointReverseUnchecked no longer documents (or
    contains) a panic path; its preconditions are asserted.
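
Example (illustrative only): a minimal sketch of what the "unchecked" contract looks like for a reverse sequence-length helper. The function name and signature are assumptions made for this example, not the library's code. Preconditions are expressed with std.debug.assert, which traps in Debug/ReleaseSafe and is undefined behaviour in ReleaseFast/ReleaseSmall, and the return type is a plain u3 with no error union.

```zig
const std = @import("std");

/// Hypothetical helper: length (1..4) of the UTF-8 sequence whose final
/// byte is `bytes[end_index]`. The caller guarantees valid UTF-8 and
/// `end_index < bytes.len`; violations are asserted, never returned.
fn seqLenReverseUnchecked(bytes: []const u8, end_index: usize) u3 {
    std.debug.assert(end_index < bytes.len);
    var len: u3 = 1;
    var i = end_index;
    // Step back over continuation bytes (10xxxxxx) until the lead byte.
    while ((bytes[i] & 0xC0) == 0x80) {
        std.debug.assert(i > 0 and len < 4); // valid input never violates this
        i -= 1;
        len += 1;
    }
    return len;
}
```

A caller that cannot guarantee valid input should use a strict or lossy variant instead; the unchecked form exists only for text that has already been validated.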
Removed
- Breaking: the Undefined member is gone from all six UTF-16/UTF-32
error sets (UTF16ValidationError, UTF16ValidationLossyError,
UTF16EncodeError, UTF32ValidationError, UTF32ValidationLossyError,
UTF32EncodeError). It was never returned by any code path; exhaustive
switches over these error sets can drop their dead error.Undefined arm.
Fixed
- unicode.segmentation byte-level iterators (grapheme / word / sentence /
line) and step helpers no longer contain catch @panic shims around lossy
decoding. They decode through encoding.utf8.decodeCodePointLossy, so the
"lossy never errors" promise now holds structurally: malformed UTF-8 yields
U+FFFD segments and can never trap. lineStepBytes documents its
byte_pos < bytes.len contract (asserted, safety-checked).
Added
- Early-exit collation compare (@stable-since: v0.4.0):
Collator.compareCodePointsIncremental / compareUtf8Incremental generate
collation elements lazily and stop at the first differing weight of the
shallowest differing level; no sort keys are materialized. Identical
results to compareCodePoints / compareUtf8 for every input and option
set (verified across the strength × variable-weighting matrix). Strings that
differ early at the primary level (the common case) pay for only a few
collation elements; the weighting logic is shared with buildKey via an
internal resolveCE so the two paths cannot diverge.
- BOM utilities in a new
encoding.bom module (@stable-since: v0.4.0):
Bom enum (utf8 / utf16_le / utf16_be / utf32_le / utf32_be) with
bytes / len / endian / match, plus detect (longest-match: the
ambiguous FF FE 00 00 reports UTF-32 LE) and zero-copy strip. The
codecs themselves still never consume or produce BOMs; this is the
explicit seam, re-exported at the package root as bom.
- unicode.normalization.nfkcCaseFold: sequence-level NFKC_Casefold
(@stable-since: v0.4.0), per the UCD definition
NFKC_CF(X) = NFC(toNFKC_Casefold(X)), built on the already-shipped
per-scalar nfkcCaseFoldMap table. The identifier-caseless form used by
UAX #31 and security profiles: fullwidth compatibility variants fold
(ＡＢＣ → abc), Default_Ignorables map away, and the result is
idempotent (verified by a BMP-wide sweep).
- String-level titlecase in
unicode.casing
(@stable-since: v0.4.0): titlecaseAlloc ([]const CodePoint) and
titlecaseUtf8Alloc (UTF-8), implementing the Unicode default algorithm
(R3): UAX #29 word segmentation, full titlecase mapping on each word's
first cased scalar, full lowercase on the rest, with Final_Sigma context
for U+03A3 ("ΜΕΓΑΣ" → "Μεγας"). Default root-locale mappings; no
Turkic/Lithuanian tailoring.
- Case-insensitive search in
unicode.casing
(@stable-since: v0.4.0): indexOfFold / containsFold over UTF-8 bytes
and indexOfFoldCodePoints / containsFoldCodePoints over scalar slices.
Both sides fold lazily during the scan, with no allocation, honoring expanding
folds in .full mode ("STRASSE" matches "Straße"), with a whole-scalar
boundary rule (needle "s" never matches inside "ß"'s expansion). The
CodePoint variants skip decoding/validation entirely per the CodePoint
contract.
- SIMD chunked scanners for UTF-16, mirroring the v0.3.0 UTF-8 set
(@stable-since: v0.4.0, portable @Vector compares with scalar tails, no
target intrinsics): utf16.nonSurrogateRunLength (length of the leading
run of standalone scalars, the UTF-16 analogue of asciiRunLength) and
utf16.countScalarsSimd (unchecked scalar count via the high-surrogate
rule). utf16.validate now skips surrogate-free runs in SIMD strides and
falls back to scalar pair checks only at actual surrogates.
- Bulk encode-direction APIs taking
[]const CodePoint
(@stable-since: v0.4.0): encodeCodePoints{Len,Buffer,Alloc} on all three
codecs, the inverse of bytesToUTF8String / bufToUTF16String /
bufToUTF32String. Callers holding already-decoded scalars encode without
any decoding or validation, per the CodePoint contract.
- encoding.utf8.StreamingValidator: incremental, resumable validation over
arbitrarily-chunked input (@stable-since: v0.4.0). The Höhrmann DFA state
carries across chunk boundaries (no buffering, no copies); update reports
the absolute offset of the first malformed sequence, finish distinguishes
a truncated trailing scalar from valid end-of-input, and ASCII runs are
skipped in SIMD strides.
- Error position reporting (
@stable-since: v0.4.0): invalidIndex on
all three codecs returns the unit/byte offset where the first malformed
sequence starts (null when valid), so diagnostics no longer require a
re-scan. Decoding strictly at the reported offset recovers the fine-grained
error. The UTF-8 variant skips ASCII runs in SIMD strides.
- Unchecked forward decode entry points, completing the strict / unchecked
/ lossy matrix in the forward direction (@stable-since: v0.4.0): callers
holding already-validated text can decode without re-validating and without
wrapping the input in a View.
  - encoding.utf8.decodeCodePointUnchecked
  - encoding.utf16.decodeU16CodePointUnchecked
  - encoding.utf32.decodeU32CodePointUnchecked
- encoding.utf8.decodeCodePointLossy: infallible lossy decode primitive
(@stable-since: v0.4.0). Malformed sequences yield U+FFFD and are never
reported as errors; the only precondition (offset < bytes.len) is asserted
(safety-checked), not error-returned, and the returned len is always >= 1
so forward scans are guaranteed to make progress. UTF8LossyIterator and
UTF8SimdLossyIterator now decode through it, removing their internal
catch unreachable / catch break shims.
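
Example (illustrative only): the longest-match rule described for BOM detection can be sketched by testing the four-byte signatures before the two-byte ones. The `Bom` enum below is re-declared locally with the variant names listed above, and `detectBom` is a hypothetical name; neither is claimed to match the encoding.bom module's actual declarations.

```zig
const std = @import("std");

/// Local stand-in for this sketch, using the variant names listed above.
const Bom = enum { utf8, utf16_le, utf16_be, utf32_le, utf32_be };

/// Return the BOM that prefixes `bytes`, or null if there is none.
fn detectBom(bytes: []const u8) ?Bom {
    // Longest match first: FF FE 00 00 starts with the UTF-16 LE
    // signature FF FE, so UTF-32 LE has to be tested before it.
    if (std.mem.startsWith(u8, bytes, "\xFF\xFE\x00\x00")) return .utf32_le;
    if (std.mem.startsWith(u8, bytes, "\x00\x00\xFE\xFF")) return .utf32_be;
    if (std.mem.startsWith(u8, bytes, "\xEF\xBB\xBF")) return .utf8;
    if (std.mem.startsWith(u8, bytes, "\xFF\xFE")) return .utf16_le;
    if (std.mem.startsWith(u8, bytes, "\xFE\xFF")) return .utf16_be;
    return null;
}
```

With this ordering, input beginning FF FE 00 00 reports UTF-32 LE, while FF FE followed by any other bytes still reports UTF-16 LE.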
v0.3.0
[0.3.0] - 2026-06-09
Added
- SIMD chunked scanners in encoding.utf8 (additive: every pre-existing API
keeps its exact signature and behaviour). All are portable @Vector compares
and reductions with a scalar tail (no target intrinsics, no dynamic shuffles),
striding std.simd.suggestVectorLength(u8) bytes at a time. @stable-since: v0.3.0:
  - asciiRunLength: length of the leading ASCII run (<= 0x7F), the shared
    primitive behind the others and usable directly for an ASCII fast path.
  - countScalarsSimd: unchecked scalar count via the non-continuation-byte
    rule ((b & 0xC0) != 0x80); equals countScalars on valid input.
  - simdLossyIterator / UTF8SimdLossyIterator: a buffered lossy decode
    iterator that widens ASCII runs in bulk; output is identical to
    lossyIterator (malformed → U+FFFD, orphaned continuation runs collapse to a
    single replacement).
- Enumerable code-point range tables for Unicode properties, so consumers
can resolve property classes into sorted ranges at comptime (the per-code-point
page tables cannot be enumerated without walking all 1.1M code points). New
zig build generate-ranges step (no network; reuses the committed page tables)
emits:
  - properties.category_runs (CategoryRun{ start, end, category }): a full
    partition of 0..=0x10FFFF by General_Category, including unassigned runs.
  - properties.derived_runs (DerivedRun{ start, end, mask }):
    DerivedCoreProperties runs keyed by the same bitmask as derivedPropertyMask.
  - properties.white_space_ranges and properties.join_control_ranges
    (CodePointRange{ start, end }): PropList bases for \s and \w.
  - scripts.script_runs (ScriptRun{ start, end, script }): Script runs for
    assigned code points.
- properties.isWord: Perl \w / word-boundary predicate
(Alphabetic ∪ Mark ∪ Decimal_Number ∪ Connector_Punctuation ∪ Join_Control).
Resolved from the enumerable range tables (with an ASCII fast path), not the
per-code-point page tries, so a consumer that needs only isWord never links
the page tables. @stable-since: v0.3.0.
- Range-table-backed per-code-point queries: equivalent to the page-table
predicates but linking only the enumerable range tables (no two-level page
tries), so a size-sensitive consumer can drop the tries entirely. Each is
proven equal to its page-table twin for every code point. @stable-since: v0.3.0:
  - properties.categoryFromRuns: General_Category via binary search over
    category_runs (twin of generalCategory).
  - properties.derivedMaskFromRuns: DerivedCoreProperties bitmask via binary
    search over derived_runs (twin of derivedPropertyMask).
  - properties.isIdentifierStartByRanges / isIdentifierContinueByRanges:
    twins of isIdentifierStart / isIdentifierContinue.
- A dedicated unicode.emoji module for the UTS #51 emoji character
properties (emoji-data.txt), promoting the six emoji predicates out of
unicode.segmentation into a first-class property module alongside scripts,
blocks, etc. The generated page/range tables (emoji.generated, regenerated
by zig build generate) now live under unicode/emoji/generated/. All
@stable-since: v0.3.0:
  - Per-code-point predicates emoji.isEmoji, isEmojiPresentation,
    isEmojiModifier, isEmojiModifierBase, isEmojiComponent, and
    isExtendedPictographic (also surfaced as unicode.isEmoji, …).
  - emoji.EmojiProperty (enum of the six properties), emoji.EmojiProperties
    (a packed struct of all six bools with .any()), emoji.emojiProperties
    (resolve all six at once), emoji.hasEmojiProperty (runtime-selected
    dispatch), and emoji.hasAnyEmojiProperty.
  - Enumerable code-point range tables so consumers can resolve \p{Emoji},
    \p{Extended_Pictographic}, etc. into sorted ranges at comptime without
    walking all 1.1M code points (same rationale as scripts.script_runs).
    Emitted by an extended zig build generate-ranges into
    unicode/emoji/generated/emoji_ranges.zig and re-exported as
    emoji.emoji_ranges, emoji.emoji_presentation_ranges,
    emoji.emoji_modifier_ranges, emoji.emoji_modifier_base_ranges,
    emoji.emoji_component_ranges, and emoji.extended_pictographic_ranges
    (EmojiRange{ start, end }), with emoji.rangesFor(property) for
    runtime selection. Each table is proven (test) to enumerate exactly its
    predicate over the whole code space.
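
Example (illustrative only): a per-code-point query backed by a sorted range table reduces to a binary search over inclusive ranges. The `Range` type, the toy `sample_ranges` table (a small subset of White_Space), and `inRanges` are assumptions made for this sketch; the real tables are generated and their element types may differ.

```zig
const std = @import("std");

/// Hypothetical stand-in for a generated range type; bounds are inclusive.
const Range = struct { start: u21, end: u21 };

/// Toy table for the sketch: sorted by `start`, non-overlapping.
const sample_ranges = [_]Range{
    .{ .start = 0x0009, .end = 0x000D },
    .{ .start = 0x0020, .end = 0x0020 },
    .{ .start = 0x2000, .end = 0x200A },
};

/// True if `cp` falls inside one of the sorted, non-overlapping ranges.
fn inRanges(ranges: []const Range, cp: u21) bool {
    var lo: usize = 0;
    var hi: usize = ranges.len;
    while (lo < hi) {
        const mid = lo + (hi - lo) / 2;
        const r = ranges[mid];
        if (cp < r.start) {
            hi = mid; // target lies before this range
        } else if (cp > r.end) {
            lo = mid + 1; // target lies after this range
        } else {
            return true; // start <= cp <= end
        }
    }
    return false;
}
```

A lookup of this shape needs only the range table it searches, which is the property the range-backed twins rely on to avoid linking the page tries.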
Changed
- The emoji predicates moved from unicode.segmentation to the new
unicode.emoji module (see Added). segmentation.isEmoji,
isEmojiPresentation, isEmojiModifier, isEmojiModifierBase,
isEmojiComponent, and isExtendedPictographic remain as deprecated
re-export aliases (so segmentation keeps compiling and UAX #29 grapheme
clustering still resolves Extended_Pictographic); prefer unicode.emoji.*.
unicode.emoji_data now points at emoji.generated rather than
segmentation.emoji_data. All still v0.3.0-unreleased.
- The Unicode range-table re-exports (
properties.category_runs,
properties.derived_runs, properties.white_space_ranges,
properties.join_control_ranges, scripts.script_runs,
casing.case_folding.common_simple_table) are now []const T slices over a
single backing array instead of by-value array re-exports. Iteration,
indexing, slicing and .len are unchanged; this removes a duplicate copy of
each table that the by-value alias materialized in consumer binaries (and the
extra comptime-materialized copy). Still all v0.3.0-unreleased.
- Performance:
encoding.utf8.validate now skips ASCII runs in bulk via SIMD
(asciiRunLength) while the Höhrmann DFA is on a scalar boundary, instead of
feeding every byte through the DFA. ASCII bytes always keep the DFA in the
accept state, so the verdict is identical; only the dominant ASCII case is
faster. Signature and result are unchanged.
- Performance: the UAX #14 line-break steppers (
lineStep, lineStepBytes,
and the LineBreakIterator / CodePointLineBoundaryIterator they drive) now
compute the forward look-ahead only when a look-ahead-dependent rule
(LB15b, LB15c, LB19a, LB25, LB28a) can actually fire, instead of on every
code point. Roughly 25–37% faster line iteration on the benchmark corpora.
- Performance: the streaming sentence iterators (
SentenceIterator,
CodePointSentenceIterator) memoise the SB8 look-ahead across an
ATerm Close* Sp* window, eliminating repeated forward rescans (~10–15%
faster CodePointSentenceIterator) and bounding a previously quadratic
worst case for long ATerm runs.
- Performance: deduplicated the
General_Category lookup shared by LB15b and
LB19 within the line-break rule scan.
- The Regional_Indicator run trackers
BoundaryState.ri_run and
WordStepState.ri_count are now a single parity bit (u1) rather than a
full usize; only the run parity was ever consulted, so the per-step state
structs are smaller. No behavioural change.
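
Example (illustrative only): why a single parity bit is enough for Regional_Indicator runs. UAX #29 pairs RI scalars two at a time, so a boundary inside a run depends only on whether an odd or even number of RIs precede it. The sketch below is a simplified stand-in, not the library's state structs: `isRegionalIndicator` and `countFlagClusters` are hypothetical names, and every non-RI scalar is treated as its own cluster.

```zig
const std = @import("std");

fn isRegionalIndicator(cp: u21) bool {
    return cp >= 0x1F1E6 and cp <= 0x1F1FF;
}

/// Count clusters when only the RI pairing rule is applied.
fn countFlagClusters(cps: []const u21) usize {
    var ri_parity: u1 = 0; // 1 while an unpaired RI is waiting for a partner
    var clusters: usize = 0;
    for (cps) |cp| {
        if (!isRegionalIndicator(cp)) {
            ri_parity = 0; // any other scalar ends the RI run
            clusters += 1;
            continue;
        }
        if (ri_parity == 0) clusters += 1; // this RI opens a new pair
        ri_parity ^= 1; // toggle: paired <-> pending
    }
    return clusters;
}
```

Three consecutive RIs therefore form two clusters (one flag plus one unpaired indicator), and the run length itself is never needed.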
v0.2.0
Added
- Collation module with DUCET (Default Unicode Collation Element Table) support
- Serialization and comparison for collation keys
- Bidi conformance test files
Changed
- Refactored code structure for improved readability and maintainability
- Updated generated Unicode tables and documentation
- Added sources.tar containing documentation sources
- Cleaned up build.zig.zon
Full changelog: v0.1.0...v0.2.0