optimize implementation of utf8ValidateSlice #17329

karlseguin · 2023-09-30T00:34:40Z

I noticed this function showing up while profiling. It's used in the standard library's JSON serializer.

I did some simple measurements: on a simple ASCII string, it was about 6x faster; on a large Chinese lorem ipsum string, it was about 2.5x faster.

Ported Go's utf8.Valid function for improved performance, and added some test cases from Go's tests.

lib/std/unicode.zig

andrewrk · 2023-09-30T03:21:38Z

> I did some simple measurements: on a simple ASCII string, it was about 6x faster; on a large Chinese lorem ipsum string, it was about 2.5x faster.

I'd like to try to reproduce your results - how can I do that?

karlseguin · 2023-09-30T03:49:24Z

Not sure if there's a better way; I used std.time.Timer in a loop. This is the script I used.

https://gist.github.com/karlseguin/4518694523af63f5edec8048049b95ec
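The linked gist is the actual Zig script. As a rough analogue of the "timer in a loop" approach, here is a minimal Go sketch that uses `time.Now` and `time.Since` around repeated calls to `utf8.Valid`. The helper name `timeValid`, the inputs, and the iteration count are assumptions made for illustration and are not taken from the gist.

```go
package main

import (
	"fmt"
	"strings"
	"time"
	"unicode/utf8"
)

// timeValid calls utf8.Valid on input `iterations` times and returns the
// average wall-clock time per call plus the result of the last call.
// The name and signature are hypothetical.
func timeValid(input []byte, iterations int) (time.Duration, bool) {
	if iterations <= 0 {
		return 0, false
	}
	ok := false
	start := time.Now()
	for i := 0; i < iterations; i++ {
		// Keep the result so every call's value is used.
		ok = utf8.Valid(input)
	}
	elapsed := time.Since(start)
	return elapsed / time.Duration(iterations), ok
}

func main() {
	cases := []struct {
		name  string
		input []byte
	}{
		{"short.ascii", []byte("hello world")},
		{"long.ascii", []byte(strings.Repeat("lorem ipsum ", 1000))},
		{"long.chinese", []byte(strings.Repeat("你好世界", 1000))},
	}
	for _, c := range cases {
		avg, ok := timeValid(c.input, 10000)
		fmt.Printf("%s: %v per call (valid=%v)\n", c.name, avg, ok)
	}
}
```

A loop like this is sensitive to the optimization level and to warm-up effects, which is consistent with the different results reported for Debug, ReleaseSafe, and ReleaseFast in this thread.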

I originally only ran in Debug and ReleaseSafe, where this implementation was faster in all cases. But I just tried with ReleaseFast and now I see that this version is slower for long UTF-8 strings (but still much faster for ASCII-only ones). So it's not the slam dunk that I thought it was.

Previous version was slower than the existing implementation in ReleaseFast for long UTF-8 strings. This version is now faster in all tested cases (short/long ASCII/UTF-8), in all release modes.

karlseguin · 2023-09-30T06:05:30Z

With the newest implementation, in ReleaseFast, I now get:

empty.std:  1
empty.go: 1

short.ascii.std:  63
short.ascii.go: 4

long.ascii.std:  7881
long.ascii.go: 332

long.chinese.std:  6322
long.chinese.go: 4183

short.invalid.std:  7
short.invalid.go: 3

long.invalid.std:  6378
long.invalid.go: 415

And in ReleaseSafe, I get:

empty.std:  2
empty.go: 1

short.ascii.std:  187
short.ascii.go: 12

long.ascii.std:  16079
long.ascii.go: 320

long.chinese.std:  11415
long.chinese.go: 5519

short.invalid.std:  13
short.invalid.go: 3

long.invalid.std:  11570
long.invalid.go: 5521

squeek502 · 2023-10-04T23:22:05Z

I'm not sure why Zig should keep the Go naming here, especially with regard to the term 'rune'. What Go calls a rune is called a codepoint in Zig.

Personally, I'd also change things like:

var p = input;

to

var remaining = input;

and

const rune_self = 0x80;

to

const min_non_ascii_codepoint = 0x80;

or something like that.

andrewrk

Thanks for working on this.

lib/std/unicode.zig

andrewrk

Here are a couple ideas I'd like to play with before settling on something:

lib/std/unicode.zig

andrewrk · 2023-10-06T00:06:35Z

andy@ark ~/tmp> zig run test.zig -OReleaseFast
empty: bad benchmark

short.ascii.std:  22
short.ascii.go: 2
short.ascii.usize: 3
short.ascii.vector: 11

long.ascii.std:  8481
long.ascii.go: 356
long.ascii.usize: 366
long.ascii.vector: 160

long.chinese.std:  9673
long.chinese.go: 5520
long.chinese.usize: 6310
long.chinese.vector: 5080

short.invalid.std:  6
short.invalid.go: 2
short.invalid.usize: 2
short.invalid.vector: 3

long.invalid.std:  9770
long.invalid.go: 5518
long.invalid.usize: 6252
long.invalid.vector: 5055

output from my computer (Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz)

emidoots · 2023-10-22T01:52:14Z

Out of curiosity, have you seen https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/ ?

karlseguin · 2023-10-22T02:48:28Z

I was familiar with it, but I'm not sure I'd be able to port it. There's a port of simdjson worth keeping an eye on, but I notice that it only works on x86_64 and AArch64.

lemire · 2023-10-23T00:12:40Z

@karlseguin Ping me if you'd like to implement it.

@travisstaloch has a pretty decent implementation, as you pointed out: https://github.com/travisstaloch/simdjzon/blob/13247baeb681f3c37bfd0bffd959bbc63e9eb0c1/src/dom.zig#L109

> I notice that it only works on x86_64 and AArch64.

The algorithm is general. E.g., in simdjson we have a POWER implementation. It works on x86 (32 bits) as long as you have SSSE3 (x86 processors without SSSE3 belong to museums).

Reference:

Validating UTF-8 In Less Than One Instruction Per Byte, Software: Practice and Experience 51 (5), 2021

travisstaloch · 2023-10-23T23:38:36Z

@Validark has also used the simdjzon UTF-8 validation code in https://github.com/Validark/Accelerated-Zig-Parser. That readme lists some ideas to potentially make it faster using something called SWAR, which I'm not familiar with, so I'll leave it to them to comment if they wish.
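SWAR stands for "SIMD within a register": several bytes are packed into one machine word and tested with a single integer operation. The Go sketch below illustrates that idea for an ASCII fast path, in the spirit of what this PR discusses. It is not the code from simdjzon or Accelerated-Zig-Parser, and the helper names `asciiPrefixLen` and `validSWAR` are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf8"
)

// highBits has the top bit set in each of its eight bytes.
const highBits = 0x8080808080808080

// asciiPrefixLen returns the length of the leading run of ASCII bytes in p,
// counted in whole 8-byte words. (Hypothetical helper for illustration.)
func asciiPrefixLen(p []byte) int {
	n := 0
	for len(p)-n >= 8 {
		word := binary.LittleEndian.Uint64(p[n:])
		if word&highBits != 0 {
			break // at least one byte in this word is >= 0x80
		}
		n += 8
	}
	return n
}

// validSWAR skips the ASCII prefix one word at a time, then hands the
// remainder to utf8.Valid. Skipping is safe because the skipped bytes are
// all ASCII, so the remainder starts on a character boundary.
func validSWAR(p []byte) bool {
	return utf8.Valid(p[asciiPrefixLen(p):])
}

func main() {
	fmt.Println(validSWAR([]byte("plain ASCII text, long enough to use the word loop")))
	fmt.Println(validSWAR([]byte("mixed: 你好")))
	fmt.Println(validSWAR(append([]byte("0123456789abcdef"), 0xff)))
}
```

One AND against the mask tests eight bytes at once, which would explain why the gains are largest on ASCII input and smaller on text made mostly of multi-byte sequences.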

karlseguin added 2 commits September 30, 2023 08:27

Use Go's implementation for utf8ValidateSlice

7b55ff2

Ported Go's utf8.Valid function for improved performance, and added some test cases from Go's tests.

zig fmt

4f5f4d8

The-King-of-Toasters suggested changes Sep 30, 2023

Optimize implementation

c33f2fc

Previous version was slower than the existing implementation in ReleaseFast for long UTF-8 strings. This version is now faster in all tested cases (short/long ASCII/UTF-8), in all release modes.

andrewrk requested changes Oct 4, 2023

remove license statement and other Goisms

5bb557d

andrewrk requested changes Oct 5, 2023

use vector for ASCII fast-path

5b3df76

andrewrk merged commit d68f39b into ziglang:master Oct 7, 2023

karlseguin deleted the faster-utf8-validation branch October 18, 2023 05:27

andrewrk changed the title ~~Use Go's implementation for utf8ValidateSlice~~ optimize implementation of utf8ValidateSlice Oct 29, 2023

squeek502 mentioned this pull request Oct 31, 2023

std.unicode: Add ASCII fast path to UTF-16 <-> UTF-8 conversion functions #17797

Merged
