Releases · shaik-abdul-thouhid/ezi-code

11 Jun 16:48

shaik-abdul-thouhid

v0.4.1

be7fb6b

v0.4.1 Latest

Latest

[0.4.1] - 2026-06-11

Added

String-level full-expansion upper/lower case drivers in unicode.casing
(@stable-since: v0.4.1): upperFull{Len,Buffer,Alloc} /
lowerFull{Len,Buffer,Alloc} over []const CodePoint and
upperFullUtf8{Alloc,Writer} / lowerFullUtf8{Alloc,Writer} over UTF-8,
mirroring the existing foldFull* surface. These apply expanding case
mappings the simple drivers cannot — "straße" upper-cases to "STRASSE",
"ﬀ" to "FF". Default root locale (no Turkic tailoring); the lower
drivers use the context-free mapping (no Greek Final_Sigma — use
titlecaseUtf8Alloc or the per-scalar API for that).
Grapheme-cluster-aware display width in unicode.width
(@stable-since: v0.4.1): stringWidthGraphemes / stringWidthGraphemesLossy
/ stringWidthGraphemesCodePoints and the per-cluster graphemeClusterWidth.
Unlike the per-scalar stringWidth* estimators (unchanged), these count each
UAX #29 grapheme cluster once, so a ZWJ emoji family (👨‍👩‍👧) measures 2
columns instead of 6. Emoji-presentation sequences (VS16) and flags (regional
indicator pairs) are counted as 2.
Infallible lossy decode primitives encoding.utf16.decodeU16CodePointLossy
and encoding.utf32.decodeU32CodePointLossy (@stable-since: v0.4.1),
mirroring utf8.decodeCodePointLossy: malformed units yield U+FFFD with no
error union, preconditions asserted. The UTF-16/UTF-32 lossy iterators now
decode through them, so the "lossy never errors, structurally" guarantee
(previously UTF-8 only) holds for all three codecs — no catch @panic
remains on any lossy path.

Changed

Breaking: the UTF-8 stream's OutputBufferTooSmall error is renamed to
BufferTooSmall, the name the encoding, transcoding, casing, and collation
layers already use, so the whole library reports one error for "output
buffer too small" (UTF8Stream.nextCodePoint / nextCodePointLossy). The
sibling NeedMoreBytes / EOFReached keep input-side starvation distinct.
unicode.properties.isAscii now delegates to encoding.isAscii instead of
hardcoding <= 0x7F; every scalar ASCII check in the library now routes
through the single encoding.isAscii predicate. The hex-digit predicate docs
(isHexDigit, isHexDigitWide, isAsciiHexDigit) were clarified so the
ASCII-vs-Unicode distinction is explicit (no renames).
Internal: collation-element generation is unified behind one
Collator.recordAt, shared by buildKey and the incremental comparator, so
DUCET record lookup, discontiguous-contraction extension, and implicit
weighting cannot diverge between the sort-key and early-exit paths (closes a
divergence risk from 0.4.0; no API change).
Internal: build.zig no longer imports library source (src/utils/root.zig)
for its some build-option predicate — it carries a local copy, decoupling
the build graph from module layout.

Removed

Breaking: the dead BufferIsEmpty error variant is gone from
UTF8Stream.nextCodePoint / nextCodePointLossy — it was declared but
returned by no code path (the same cleanup as 0.4.0's error.Undefined).

Fixed

The unchecked-decode contract ("never panics … undefined in ReleaseFast") is
now honored uniformly: three internal sites used @panic (which traps in all
modes) instead of unreachable — most visibly decodeCodePointReverseUnchecked,
whose doc promised no panic. Replaced with unreachable so behavior matches
the documented contract.

Assets 2

11 Jun 14:22

shaik-abdul-thouhid

v0.4.0

650f12f

v0.4.0

[0.4.0] - 2026-06-11

Changed

The CodePoint contract is now documented as such, on the type itself
(encoding.CodePoint) and in the README: values of this type are presumed
valid Unicode scalars; producers uphold the contract, consumers rely on it
and skip decoding/validation (which is what makes the []const CodePoint
API variants the cheap path for already-decoded text). The README quick-look
examples were also brought back in sync with the real signatures
(initUTF8View out-param, view.iter(), nfc over code points).
Breaking: "unchecked" now means one thing everywhere — the caller
guarantees the documented preconditions; violations are asserted /
safety-checked (trap in Debug/ReleaseSafe, undefined in
ReleaseFast/ReleaseSmall); unchecked functions never return errors or
panic. Accordingly:
- encoding.utf8.codePointLenReverseUnchecked returns plain u3
  (was UTF8ValidationError!u3).
- encoding.utf16.utf16SequenceLenReverseUnchecked returns plain u2
  (was UTF16ValidationError!u2) and asserts end_index < buf.len
  instead of returning ZeroLengthUnits.
- encoding.utf8.decodeCodePointReverseUnchecked no longer documents (or
  contains) a panic path; its preconditions are asserted.

Removed

Breaking: the Undefined member is gone from all six UTF-16/UTF-32
error sets (UTF16ValidationError, UTF16ValidationLossyError,
UTF16EncodeError, UTF32ValidationError, UTF32ValidationLossyError,
UTF32EncodeError). It was never returned by any code path; exhaustive
switches over these error sets can drop their dead error.Undefined arm.

Fixed

unicode.segmentation byte-level iterators (grapheme / word / sentence /
line) and step helpers no longer contain catch @panic shims around lossy
decoding. They decode through encoding.utf8.decodeCodePointLossy, so the
"lossy never errors" promise now holds structurally: malformed UTF-8 yields
U+FFFD segments and can never trap. lineStepBytes documents its
byte_pos < bytes.len contract (asserted, safety-checked).

Added

Early-exit collation compare (@stable-since: v0.4.0):
Collator.compareCodePointsIncremental / compareUtf8Incremental generate
collation elements lazily and stop at the first differing weight of the
shallowest differing level — no sort keys are materialized. Identical
results to compareCodePoints / compareUtf8 for every input and option
set (verified across the strength × variable-weighting matrix). Strings that
differ early at the primary level — the common case — pay for only a few
collation elements; the weighting logic is shared with buildKey via an
internal resolveCE so the two paths cannot diverge.
BOM utilities — new encoding.bom module (@stable-since: v0.4.0):
Bom enum (utf8 / utf16_le / utf16_be / utf32_le / utf32_be) with
bytes / len / endian / match, plus detect (longest-match: the
ambiguous FF FE 00 00 reports UTF-32 LE) and zero-copy strip. The
codecs themselves still never consume or produce BOMs; this is the
explicit seam, re-exported at the package root as bom.
unicode.normalization.nfkcCaseFold — sequence-level NFKC_Casefold
(@stable-since: v0.4.0), per the UCD definition
NFKC_CF(X) = NFC(toNFKC_Casefold(X)), built on the already-shipped
per-scalar nfkcCaseFoldMap table. The identifier-caseless form used by
UAX #31 and security profiles: fullwidth compatibility variants fold
(ＡＢＣ → abc), Default_Ignorables map away, and the result is
idempotent (verified by a BMP-wide sweep).
String-level titlecase in unicode.casing
(@stable-since: v0.4.0): titlecaseAlloc ([]const CodePoint) and
titlecaseUtf8Alloc (UTF-8), implementing the Unicode default algorithm
(R3) — UAX #29 word segmentation, full titlecase mapping on each word's
first cased scalar, full lowercase on the rest, with Final_Sigma context
for U+03A3 ("ΜΕΓΑΣ" → "Μεγας"). Default root-locale mappings; no
Turkic/Lithuanian tailoring.
Case-insensitive search in unicode.casing
(@stable-since: v0.4.0): indexOfFold / containsFold over UTF-8 bytes
and indexOfFoldCodePoints / containsFoldCodePoints over scalar slices.
Both sides fold lazily during the scan — no allocation — honoring expanding
folds in .full mode ("STRASSE" matches "Straße"), with a whole-scalar
boundary rule (needle "s" never matches inside "ß"'s expansion). The
CodePoint variants skip decoding/validation entirely per the CodePoint
contract.
SIMD chunked scanners for UTF-16, mirroring the v0.3.0 UTF-8 set
(@stable-since: v0.4.0, portable @Vector compares with scalar tails, no
target intrinsics): utf16.nonSurrogateRunLength (length of the leading
run of standalone scalars — the UTF-16 analogue of asciiRunLength) and
utf16.countScalarsSimd (unchecked scalar count via the high-surrogate
rule). utf16.validate now skips surrogate-free runs in SIMD strides and
falls back to scalar pair checks only at actual surrogates.
Bulk encode-direction APIs taking []const CodePoint
(@stable-since: v0.4.0): encodeCodePoints{Len,Buffer,Alloc} on all three
codecs, the inverse of bytesToUTF8String / bufToUTF16String /
bufToUTF32String. Callers holding already-decoded scalars encode without
any decoding or validation, per the CodePoint contract.
encoding.utf8.StreamingValidator — incremental, resumable validation over
arbitrarily-chunked input (@stable-since: v0.4.0). The Höhrmann DFA state
carries across chunk boundaries (no buffering, no copies); update reports
the absolute offset of the first malformed sequence, finish distinguishes
a truncated trailing scalar from valid end-of-input, and ASCII runs are
skipped in SIMD strides.
Error position reporting (@stable-since: v0.4.0): invalidIndex on
all three codecs returns the unit/byte offset where the first malformed
sequence starts (null when valid), so diagnostics no longer require a
re-scan. Decoding strictly at the reported offset recovers the fine-grained
error. The UTF-8 variant skips ASCII runs in SIMD strides.
Unchecked forward decode entry points, completing the strict / unchecked
/ lossy matrix in the forward direction (@stable-since: v0.4.0): callers
holding already-validated text can decode without re-validating and without
wrapping the input in a View.
- encoding.utf8.decodeCodePointUnchecked
- encoding.utf16.decodeU16CodePointUnchecked
- encoding.utf32.decodeU32CodePointUnchecked
encoding.utf8.decodeCodePointLossy — infallible lossy decode primitive
(@stable-since: v0.4.0). Malformed sequences yield U+FFFD and are never
reported as errors; the only precondition (offset < bytes.len) is asserted
(safety-checked), not error-returned, and the returned len is always >= 1
so forward scans are guaranteed to make progress. UTF8LossyIterator and
UTF8SimdLossyIterator now decode through it, removing their internal
catch unreachable/catch break shims.

Assets 2

09 Jun 18:14

shaik-abdul-thouhid

v0.3.0

daa2670

v0.3.0

[0.3.0] - 2026-06-09

Added

SIMD chunked scanners in encoding.utf8 (additive — every pre-existing API
keeps its exact signature and behaviour). All are portable @Vector compares
and reductions with a scalar tail (no target intrinsics, no dynamic shuffles),
striding std.simd.suggestVectorLength(u8) bytes at a time. @stable-since: v0.3.0:
- asciiRunLength — length of the leading ASCII run (<= 0x7F), the shared
  primitive behind the others and usable directly for an ASCII fast path.
- countScalarsSimd — unchecked scalar count via the non-continuation-byte
  rule ((b & 0xC0) != 0x80); equals countScalars on valid input.
- simdLossyIterator / UTF8SimdLossyIterator — a buffered lossy decode
  iterator that widens ASCII runs in bulk; output is identical to
  lossyIterator (malformed → U+FFFD, orphaned continuation runs collapse to a
  single replacement).
Enumerable code-point range tables for Unicode properties, so consumers
can resolve property classes into sorted ranges at comptime (the per-code-point
page tables cannot be enumerated without walking all 1.1M code points). New
zig build generate-ranges step (no network; reuses the committed page tables)
emits:
- properties.category_runs (CategoryRun{ start, end, category }) — a full
  partition of 0..=0x10FFFF by General_Category, including unassigned runs.
- properties.derived_runs (DerivedRun{ start, end, mask }) —
  DerivedCoreProperties runs keyed by the same bitmask as derivedPropertyMask.
- properties.white_space_ranges and properties.join_control_ranges
  (CodePointRange{ start, end }) — PropList bases for \s and \w.
- scripts.script_runs (ScriptRun{ start, end, script }) — Script runs for
  assigned code points.
properties.isWord — Perl \w / word-boundary predicate
(Alphabetic ∪ Mark ∪ Decimal_Number ∪ Connector_Punctuation ∪ Join_Control).
Resolved from the enumerable range tables (with an ASCII fast path), not the
per-code-point page tries, so a consumer that needs only isWord never links
the page tables. @stable-since: v0.3.0.
Range-table-backed per-code-point queries — equivalent to the page-table
predicates but linking only the enumerable range tables (no two-level page
tries), so a size-sensitive consumer can drop the tries entirely. Each is
proven equal to its page-table twin for every code point. @stable-since: v0.3.0:
- properties.categoryFromRuns — General_Category via binary search over
  category_runs (twin of generalCategory).
- properties.derivedMaskFromRuns — DerivedCoreProperties bitmask via binary
  search over derived_runs (twin of derivedPropertyMask).
- properties.isIdentifierStartByRanges / isIdentifierContinueByRanges —
  twins of isIdentifierStart / isIdentifierContinue.
A dedicated unicode.emoji module for the UTS #51 emoji character
properties (emoji-data.txt), promoting the six emoji predicates out of
unicode.segmentation into a first-class property module alongside scripts,
blocks, etc. The generated page/range tables (emoji.generated, regenerated
by zig build generate) now live under unicode/emoji/generated/. All
@stable-since: v0.3.0:
- Per-code-point predicates emoji.isEmoji, isEmojiPresentation,
  isEmojiModifier, isEmojiModifierBase, isEmojiComponent, and
  isExtendedPictographic (also surfaced as unicode.isEmoji, … ).
- emoji.EmojiProperty (enum of the six properties), emoji.EmojiProperties
  (a packed struct of all six bools with .any()), emoji.emojiProperties
  (resolve all six at once), emoji.hasEmojiProperty (runtime-selected
  dispatch), and emoji.hasAnyEmojiProperty.
- Enumerable code-point range tables so consumers can resolve \p{Emoji},
  \p{Extended_Pictographic}, etc. into sorted ranges at comptime without
  walking all 1.1M code points (same rationale as scripts.script_runs).
  Emitted by an extended zig build generate-ranges into
  unicode/emoji/generated/emoji_ranges.zig and re-exported as
  emoji.emoji_ranges, emoji.emoji_presentation_ranges,
  emoji.emoji_modifier_ranges, emoji.emoji_modifier_base_ranges,
  emoji.emoji_component_ranges, and emoji.extended_pictographic_ranges
  (EmojiRange{ start, end }), with emoji.rangesFor(property) for
  runtime selection. Each table is proven (test) to enumerate exactly its
  predicate over the whole code space.

Changed

The emoji predicates moved from unicode.segmentation to the new
unicode.emoji module (see Added). segmentation.isEmoji,
isEmojiPresentation, isEmojiModifier, isEmojiModifierBase,
isEmojiComponent, and isExtendedPictographic remain as deprecated
re-export aliases (so segmentation keeps compiling and UAX #29 grapheme
clustering still resolves Extended_Pictographic); prefer unicode.emoji.*.
unicode.emoji_data now points at emoji.generated rather than
segmentation.emoji_data. All still v0.3.0-unreleased.
The Unicode range-table re-exports (properties.category_runs,
properties.derived_runs, properties.white_space_ranges,
properties.join_control_ranges, scripts.script_runs,
casing.case_folding.common_simple_table) are now []const T slices over a
single backing array instead of by-value array re-exports. Iteration,
indexing, slicing and .len are unchanged; this removes a duplicate copy of
each table that the by-value alias materialized in consumer binaries (and the
extra comptime-materialized copy). Still all v0.3.0-unreleased.
Performance: encoding.utf8.validate now skips ASCII runs in bulk via SIMD
(asciiRunLength) while the Höhrmann DFA is on a scalar boundary, instead of
feeding every byte through the DFA. ASCII bytes always keep the DFA in accept,
so the verdict is identical; only the dominant ASCII case is faster. Signature
and result are unchanged.
Performance: the UAX #14 line-break steppers (lineStep, lineStepBytes,
and the LineBreakIterator / CodePointLineBoundaryIterator they drive) now
compute the forward look-ahead only when a look-ahead-dependent rule
(LB15b, LB15c, LB19a, LB25, LB28a) can actually fire, instead of on every
code point. Roughly 25–37% faster line iteration on the benchmark corpora.
Performance: the streaming sentence iterators (SentenceIterator,
CodePointSentenceIterator) memoise the SB8 look-ahead across an
ATerm Close* Sp* window, eliminating repeated forward rescans (~10–15%
faster CodePointSentenceIterator) and bounding a previously quadratic
worst case for long ATerm runs.
Performance: deduplicated the General_Category lookup shared by LB15b and
LB19 within the line-break rule scan.
The Regional_Indicator run trackers BoundaryState.ri_run and
WordStepState.ri_count are now a single parity bit (u1) rather than a
full usize; only the run parity was ever consulted, so the per-step state
structs are smaller. No behavioural change.

Assets 2

02 Jun 17:21

shaik-abdul-thouhid

v0.2.0

321ef6e

v0.2.0

Added

Collation module with DUCET (Default Unicode Collation Element Table) support
Serialization and comparison for collation keys
Bidi conformance test files

Changed

Refactored code structure for improved readability and maintainability
Updated generated Unicode tables and documentation
Added sources.tar containing documentation sources
Cleaned up build.zig.zon

Full changelog: v0.1.0...v0.2.0

Assets 2

Releases: shaik-abdul-thouhid/ezi-code

v0.4.1

[0.4.1] - 2026-06-11

Added

Changed

Removed

Fixed

Uh oh!

v0.4.0

[0.4.0] - 2026-06-11

Changed

Removed

Fixed

Added

Uh oh!

v0.3.0

[0.3.0] - 2026-06-09

Added

Changed

Uh oh!

v0.2.0

Added

Changed

Uh oh!