optimize implementation of utf8ValidateSlice #17329

Merged (5 commits) on Oct 7, 2023

karlseguin
Contributor

I noticed this function showing up while profiling. It's used in the standard library's JSON serializer.

In some simple measurements, it was about 6x faster on a simple ASCII string and about 2.5x faster on a large Chinese lorem ipsum string.

Ported Go's utf8.Valid function for improved performance, and added some
test cases from Go's tests.
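
For readers unfamiliar with the approach, here is a minimal sketch of the ASCII fast path that this style of validator relies on: if no byte in an 8-byte chunk has its high bit set, the whole chunk is ASCII and can be skipped without decoding. This is not the code from the PR. `asciiPrefixLen` and `validateWithFastPath` are hypothetical names, the chunk is loaded as a single `u64` for brevity, and the sketch assumes a Zig version from roughly 0.11 onward (single-argument `@bitCast`).

```zig
const std = @import("std");

/// Hypothetical helper (not from the PR): counts the leading bytes that can be
/// skipped because every 8-byte chunk examined so far is pure ASCII.
fn asciiPrefixLen(input: []const u8) usize {
    var i: usize = 0;
    while (input.len - i >= 8) : (i += 8) {
        // Reinterpret 8 bytes as one integer. Byte order does not matter
        // because the mask tests the high bit of every byte.
        const chunk: u64 = @bitCast(input[i..][0..8].*);
        if ((chunk & 0x8080808080808080) != 0) break;
    }
    return i;
}

/// Hypothetical wrapper: skip the ASCII prefix, then let the standard
/// library validate whatever is left.
fn validateWithFastPath(input: []const u8) bool {
    const skip = asciiPrefixLen(input);
    return std.unicode.utf8ValidateSlice(input[skip..]);
}

test "fast path agrees with the standard library" {
    const cases = [_][]const u8{
        "",
        "hello",
        "0123456789abcdef",
        "01234567\xe4\xb8\xad",
        "0123456\xff89",
    };
    for (cases) |case| {
        try std.testing.expectEqual(
            std.unicode.utf8ValidateSlice(case),
            validateWithFastPath(case),
        );
    }
}
```

Skipping whole chunks is safe because an all-ASCII chunk always ends on a codepoint boundary. The sketch can be checked with `zig test`.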
@andrewrk
Member

> I did some simple measurements, on a simple ASCII string, it was about 6x faster. On a large Chinese lorem ipsum string, it was about 2.5x faster.

I'd like to try to reproduce your results - how can I do that?

@karlseguin
Contributor Author

I'm not sure if there's a better way; I used std.time.Timer in a loop. This is the script I used:

https://gist.github.com/karlseguin/4518694523af63f5edec8048049b95ec

I originally only ran in Debug and ReleaseSafe, where this implementation was faster in all cases. But I just tried with ReleaseFast, and now I see that this version is slower for long UTF-8 strings (but still much faster for ASCII-only ones). So it's not the slam dunk that I thought it was.
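
The linked gist is not reproduced here. As a rough illustration of the "std.time.Timer in a loop" approach, a minimal sketch might look like the following. The `bench` function, the inputs, and the iteration counts are all hypothetical, and the sketch assumes a Zig version where `std.time.Timer` and `std.mem.doNotOptimizeAway` are available (roughly 0.11 onward).

```zig
const std = @import("std");

/// Hypothetical benchmark helper: times `iterations` calls to the standard
/// library validator and prints the average cost per call.
fn bench(name: []const u8, input: []const u8, iterations: usize) !void {
    var timer = try std.time.Timer.start();
    var valid_count: usize = 0;
    for (0..iterations) |_| {
        if (std.unicode.utf8ValidateSlice(input)) valid_count += 1;
    }
    const elapsed_ns = timer.read();
    // Keep the result alive so the loop is less likely to be removed entirely.
    std.mem.doNotOptimizeAway(valid_count);
    std.debug.print("{s}: {d} ns/iter\n", .{ name, elapsed_ns / @as(u64, iterations) });
}

pub fn main() !void {
    try bench("short.ascii", "hello world", 1_000_000);
    try bench("long.chinese", "中文" ** 1000, 10_000);
}
```

As the comment above notes, results can differ sharply between build modes, so it is worth running such a script with each of `-ODebug`, `-OReleaseSafe`, and `-OReleaseFast`. With constant inputs the optimizer may still fold away part of the work, so very small numbers should be treated with suspicion.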

The previous version was slower than the existing implementation in ReleaseFast
for long UTF-8 strings. This version is now faster in all tested cases
(short/long, ASCII/UTF-8) in all release modes.
@karlseguin
Contributor Author

With the newest implementation, in ReleaseFast, I now get:

empty.std:  1
empty.go: 1

short.ascii.std:  63
short.ascii.go: 4

long.ascii.std:  7881
long.ascii.go: 332

long.chinese.std:  6322
long.chinese.go: 4183

short.invalid.std:  7
short.invalid.go: 3

long.invalid.std:  6378
long.invalid.go: 415

And in ReleaseSafe, I get:

empty.std:  2
empty.go: 1

short.ascii.std:  187
short.ascii.go: 12

long.ascii.std:  16079
long.ascii.go: 320

long.chinese.std:  11415
long.chinese.go: 5519

short.invalid.std:  13
short.invalid.go: 3

long.invalid.std:  11570
long.invalid.go: 5521

@squeek502
Collaborator

squeek502 commented Oct 4, 2023

I'm not sure why Zig should keep the Go naming here, especially with regard to the term 'rune'. What Go calls a rune is called a codepoint in Zig.

Personally, I'd also change things like:

var p = input;

to

var remaining = input;

and

const rune_self = 0x80;

to

const min_non_ascii_codepoint = 0x80;

or something like that.

Member

@andrewrk andrewrk left a comment

Thanks for working on this.

Member

@andrewrk andrewrk left a comment

Here are a couple ideas I'd like to play with before settling on something:

@andrewrk
Member

andrewrk commented Oct 6, 2023

andy@ark ~/tmp> zig run test.zig -OReleaseFast
empty: bad benchmark

short.ascii.std:  22
short.ascii.go: 2
short.ascii.usize: 3
short.ascii.vector: 11

long.ascii.std:  8481
long.ascii.go: 356
long.ascii.usize: 366
long.ascii.vector: 160

long.chinese.std:  9673
long.chinese.go: 5520
long.chinese.usize: 6310
long.chinese.vector: 5080

short.invalid.std:  6
short.invalid.go: 2
short.invalid.usize: 2
short.invalid.vector: 3

long.invalid.std:  9770
long.invalid.go: 5518
long.invalid.usize: 6252
long.invalid.vector: 5055

output from my computer (Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz)
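
The `usize` and `vector` variants measured above are not shown in this conversation. Purely as an illustration of what a vectorized ASCII check can look like in Zig, here is a sketch using `@Vector` and `@reduce`. The function name `isAsciiVector` and the 16-byte chunk length are assumptions, and the sketch targets Zig 0.11 or later (single-argument `@splat`).

```zig
const std = @import("std");

// Assumed chunk width; the variant benchmarked above may have used another.
const chunk_len = 16;
const Chunk = @Vector(chunk_len, u8);

/// Hypothetical helper: returns true when every byte of `input` is ASCII.
fn isAsciiVector(input: []const u8) bool {
    var remaining = input;
    while (remaining.len >= chunk_len) {
        const chunk: Chunk = remaining[0..chunk_len].*;
        // Isolate the high bit of each lane, then OR all lanes together.
        const high_bits = chunk & @as(Chunk, @splat(0x80));
        if (@reduce(.Or, high_bits) != 0) return false;
        remaining = remaining[chunk_len..];
    }
    // Scalar tail for the final bytes that do not fill a whole chunk.
    for (remaining) |byte| {
        if (byte >= 0x80) return false;
    }
    return true;
}

test "vector ASCII check" {
    try std.testing.expect(isAsciiVector(""));
    try std.testing.expect(isAsciiVector("0123456789abcdef0123"));
    try std.testing.expect(!isAsciiVector("0123456789abcde\xff"));
}
```

A full validator would fall back to per-codepoint decoding once a non-ASCII byte is found, which may be why the gain in the numbers above is largest for the long ASCII input.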

@andrewrk merged commit d68f39b into ziglang:master on Oct 7, 2023
10 checks passed
@karlseguin deleted the faster-utf8-validation branch on October 18, 2023 05:27
@slimsag
Contributor

slimsag commented Oct 22, 2023

@karlseguin
Contributor Author

I was familiar with it, but I'm not sure I'd be able to port it. There's a port of simdjson worth keeping an eye on, but I notice that it only works on x86_64 and AArch64.

@lemire

lemire commented Oct 23, 2023

@karlseguin Ping me if you'd like to implement it.

@travisstaloch has a pretty decent implementation, as you pointed out: https://github.com/travisstaloch/simdjzon/blob/13247baeb681f3c37bfd0bffd959bbc63e9eb0c1/src/dom.zig#L109

> I notice that it only works on x86_64 and AArch64.

The algorithm is general. E.g., in simdjson we have a POWER implementation. It works on x86 (32 bits) as long as you have SSSE3 (x86 processors without SSSE3 belong to museums).

Reference:

@travisstaloch
Contributor

@Validark has also used the simdjzon UTF-8 validation code in https://github.com/Validark/Accelerated-Zig-Parser. That readme lists some ideas for potentially making it faster using something called SWAR, which I'm not familiar with, so I'll leave it to them to comment if they wish.

@andrewrk changed the title from "Use Go's implementation for utf8ValidateSlice" to "optimize implementation of utf8ValidateSlice" on Oct 29, 2023