v0.3.0
[0.3.0] - 2026-06-09
Added
- SIMD chunked scanners in encoding.utf8 (additive: every pre-existing API
  keeps its exact signature and behaviour). All are portable @Vector compares
  and reductions with a scalar tail (no target intrinsics, no dynamic shuffles),
  striding std.simd.suggestVectorLength(u8) bytes at a time. @stable-since: v0.3.0:
  - asciiRunLength: length of the leading ASCII run (<= 0x7F), the shared
    primitive behind the others and usable directly for an ASCII fast path.
  - countScalarsSimd: unchecked scalar count via the non-continuation-byte
    rule ((b & 0xC0) != 0x80); equals countScalars on valid input.
  - simdLossyIterator / UTF8SimdLossyIterator: a buffered lossy decode
    iterator that widens ASCII runs in bulk; output is identical to
    lossyIterator (malformed → U+FFFD, orphaned continuation runs collapse to a
    single replacement).
- Enumerable code-point range tables for Unicode properties, so consumers
  can resolve property classes into sorted ranges at comptime (the per-code-point
  page tables cannot be enumerated without walking all 1.1M code points). New
  zig build generate-ranges step (no network; reuses the committed page tables)
  emits:
  - properties.category_runs (CategoryRun{ start, end, category }): a full
    partition of 0..=0x10FFFF by General_Category, including unassigned runs.
  - properties.derived_runs (DerivedRun{ start, end, mask }):
    DerivedCoreProperties runs keyed by the same bitmask as derivedPropertyMask.
  - properties.white_space_ranges and properties.join_control_ranges
    (CodePointRange{ start, end }): PropList bases for \s and \w.
  - scripts.script_runs (ScriptRun{ start, end, script }): Script runs for
    assigned code points.
- properties.isWord: Perl \w / word-boundary predicate
  (Alphabetic ∪ Mark ∪ Decimal_Number ∪ Connector_Punctuation ∪ Join_Control).
  Resolved from the enumerable range tables (with an ASCII fast path), not the
  per-code-point page tries, so a consumer that needs only isWord never links
  the page tables. @stable-since: v0.3.0.
- Range-table-backed per-code-point queries: equivalent to the page-table
  predicates but linking only the enumerable range tables (no two-level page
  tries), so a size-sensitive consumer can drop the tries entirely. Each is
  proven equal to its page-table twin for every code point. @stable-since: v0.3.0:
  - properties.categoryFromRuns: General_Category via binary search over
    category_runs (twin of generalCategory).
  - properties.derivedMaskFromRuns: DerivedCoreProperties bitmask via binary
    search over derived_runs (twin of derivedPropertyMask).
  - properties.isIdentifierStartByRanges / isIdentifierContinueByRanges:
    twins of isIdentifierStart / isIdentifierContinue.
- A dedicated unicode.emoji module for the UTS #51 emoji character
  properties (emoji-data.txt), promoting the six emoji predicates out of
  unicode.segmentation into a first-class property module alongside scripts,
  blocks, etc. The generated page/range tables (emoji.generated, regenerated
  by zig build generate) now live under unicode/emoji/generated/. All
  @stable-since: v0.3.0:
  - Per-code-point predicates emoji.isEmoji, isEmojiPresentation,
    isEmojiModifier, isEmojiModifierBase, isEmojiComponent, and
    isExtendedPictographic (also surfaced as unicode.isEmoji, …).
  - emoji.EmojiProperty (enum of the six properties), emoji.EmojiProperties
    (a packed struct of all six bools with .any()), emoji.emojiProperties
    (resolve all six at once), emoji.hasEmojiProperty (runtime-selected
    dispatch), and emoji.hasAnyEmojiProperty.
  - Enumerable code-point range tables so consumers can resolve \p{Emoji},
    \p{Extended_Pictographic}, etc. into sorted ranges at comptime without
    walking all 1.1M code points (same rationale as scripts.script_runs).
    Emitted by an extended zig build generate-ranges into
    unicode/emoji/generated/emoji_ranges.zig and re-exported as
    emoji.emoji_ranges, emoji.emoji_presentation_ranges,
    emoji.emoji_modifier_ranges, emoji.emoji_modifier_base_ranges,
    emoji.emoji_component_ranges, and emoji.extended_pictographic_ranges
    (EmojiRange{ start, end }), with emoji.rangesFor(property) for
    runtime selection. Each table is proven (by test) to enumerate exactly its
    predicate over the whole code space.
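
The range-table entries above describe per-code-point queries resolved by binary search over sorted runs. A minimal sketch of that lookup, not taken from the library: the Run type, the demo_runs table, the findRun name, and the assumption that start and end are both inclusive are all hypothetical, chosen only to mirror the documented CategoryRun{ start, end, category } shape.

```zig
const std = @import("std");

// Hypothetical run type; `end` is assumed to be inclusive.
const Run = struct { start: u21, end: u21, category: u8 };

// Tiny made-up table: sorted by `start`, with no overlapping runs.
const demo_runs = [_]Run{
    .{ .start = 0x00, .end = 0x1F, .category = 0 },
    .{ .start = 0x20, .end = 0x20, .category = 1 },
    .{ .start = 0x21, .end = 0x2F, .category = 2 },
    .{ .start = 0x30, .end = 0x39, .category = 3 },
};

// Binary search for the run containing `cp`; returns null when no run covers it.
fn findRun(runs: []const Run, cp: u21) ?Run {
    var lo: usize = 0;
    var hi: usize = runs.len;
    while (lo < hi) {
        const mid = lo + (hi - lo) / 2;
        const run = runs[mid];
        if (cp < run.start) {
            hi = mid;
        } else if (cp > run.end) {
            lo = mid + 1;
        } else {
            return run;
        }
    }
    return null;
}
```

Because a table like this is plain sorted data, the same search works at comptime or at runtime, which is the property the enumerable tables are meant to provide.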
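
The SIMD scanners in the first Added entry are described as portable @Vector compares and reductions with a scalar tail. A minimal sketch of that pattern, not the library's implementation: asciiRunLengthSketch and countScalarsSketch are hypothetical names, the fallback lane count of 16 is an assumption, and the scalar count is written as a plain loop to show only the non-continuation-byte rule.

```zig
const std = @import("std");

// Lane count suggested for the build target; 16 is an assumed fallback.
const lanes = std.simd.suggestVectorLength(u8) orelse 16;
const Chunk = @Vector(lanes, u8);

// Hypothetical illustration of a leading-ASCII-run scan.
fn asciiRunLengthSketch(bytes: []const u8) usize {
    const high_bit: Chunk = @splat(0x80);
    var i: usize = 0;
    // Vector body: stop at the first chunk containing any byte above 0x7F.
    while (i + lanes <= bytes.len) : (i += lanes) {
        const chunk: Chunk = bytes[i..][0..lanes].*;
        if (@reduce(.Or, chunk & high_bit) != 0) break;
    }
    // Scalar tail: handles the last partial chunk and pinpoints the exact byte.
    while (i < bytes.len and bytes[i] < 0x80) : (i += 1) {}
    return i;
}

// Unchecked scalar count: every byte that is not a continuation byte
// (10xxxxxx) starts a scalar. Only meaningful on valid UTF-8.
fn countScalarsSketch(bytes: []const u8) usize {
    var count: usize = 0;
    for (bytes) |b| {
        count += @intFromBool((b & 0xC0) != 0x80);
    }
    return count;
}
```

The vector loop only answers "is this whole chunk ASCII?", so the scalar tail does the precise work at the boundary. That keeps the sketch free of target intrinsics and shuffles, in line with the constraint stated in the entry.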
Changed
- The emoji predicates moved from unicode.segmentation to the new
  unicode.emoji module (see Added). segmentation.isEmoji,
  isEmojiPresentation, isEmojiModifier, isEmojiModifierBase,
  isEmojiComponent, and isExtendedPictographic remain as deprecated
  re-export aliases (so segmentation keeps compiling and UAX #29 grapheme
  clustering still resolves Extended_Pictographic); prefer unicode.emoji.*.
  unicode.emoji_data now points at emoji.generated rather than
  segmentation.emoji_data. All still v0.3.0-unreleased.
- The Unicode range-table re-exports (properties.category_runs,
  properties.derived_runs, properties.white_space_ranges,
  properties.join_control_ranges, scripts.script_runs,
  casing.case_folding.common_simple_table) are now []const T slices over a
  single backing array instead of by-value array re-exports. Iteration,
  indexing, slicing and .len are unchanged; this removes a duplicate copy of
  each table that the by-value alias materialized in consumer binaries (and the
  extra comptime-materialized copy). Still all v0.3.0-unreleased.
- Performance: encoding.utf8.validate now skips ASCII runs in bulk via SIMD
  (asciiRunLength) while the Höhrmann DFA is on a scalar boundary, instead of
  feeding every byte through the DFA. ASCII bytes always keep the DFA in accept,
  so the verdict is identical; only the dominant ASCII case is faster. Signature
  and result are unchanged.
- Performance: the UAX #14 line-break steppers (lineStep, lineStepBytes,
  and the LineBreakIterator / CodePointLineBoundaryIterator they drive) now
  compute the forward look-ahead only when a look-ahead-dependent rule
  (LB15b, LB15c, LB19a, LB25, LB28a) can actually fire, instead of on every
  code point. Roughly 25–37% faster line iteration on the benchmark corpora.
- Performance: the streaming sentence iterators (SentenceIterator,
  CodePointSentenceIterator) memoise the SB8 look-ahead across an
  ATerm Close* Sp* window, eliminating repeated forward rescans (~10–15%
  faster CodePointSentenceIterator) and bounding a previously quadratic
  worst case for long ATerm runs.
- Performance: deduplicated the General_Category lookup shared by LB15b and
  LB19 within the line-break rule scan.
- The Regional_Indicator run trackers BoundaryState.ri_run and
  WordStepState.ri_count are now a single parity bit (u1) rather than a
  full usize; only the run parity was ever consulted, so the per-step state
  structs are smaller. No behavioural change.
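
The last entry relies on the fact that only the parity of a Regional_Indicator run matters to the boundary rules. A minimal sketch of a one-bit tracker, assuming nothing about the library's actual state structs: RiParity, its odd field, and its step method are hypothetical names used only for illustration.

```zig
const std = @import("std");

// Hypothetical one-bit tracker for the current Regional_Indicator run.
const RiParity = struct {
    // 1 when an odd number of RI code points has been seen in the current run.
    odd: u1 = 0,

    // Feed one code point: an RI flips the parity, anything else ends the run.
    fn step(self: *RiParity, is_regional_indicator: bool) void {
        if (is_regional_indicator) {
            self.odd ^= 1;
        } else {
            self.odd = 0;
        }
    }
};
```

A full counter and this bit give the same answer to the question "is the number of preceding RIs odd?", which is why swapping one for the other can leave behaviour unchanged while shrinking the state.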