optimize implementation of utf8ValidateSlice #17329
Ported Go's utf8.Valid function for improved performance, and added some test cases from Go's tests.
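For readers who have not looked at the Go function, here is a minimal sketch of the general shape being ported: an ASCII fast path that inspects 8 bytes at a time, followed by a per-codepoint slow path. It is written in Go because that is where the original lives. The name validUTF8Sketch is hypothetical, and the slow path simply leans on utf8.DecodeRune rather than reproducing the lookup tables, so treat this as an illustration and not the code in this PR.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf8"
)

// validUTF8Sketch reports whether p is valid UTF-8.
// Hypothetical helper for illustration only; it is not the code from the PR.
func validUTF8Sketch(p []byte) bool {
	// Fast path: skip 8 bytes at a time while no byte has its high bit set.
	for len(p) >= 8 {
		first32 := binary.LittleEndian.Uint32(p[0:4])
		second32 := binary.LittleEndian.Uint32(p[4:8])
		if (first32|second32)&0x80808080 != 0 {
			break // at least one non-ASCII byte in this chunk
		}
		p = p[8:]
	}

	// Slow path: validate whatever is left one codepoint at a time.
	for len(p) > 0 {
		if p[0] < utf8.RuneSelf { // plain ASCII byte
			p = p[1:]
			continue
		}
		r, size := utf8.DecodeRune(p)
		if r == utf8.RuneError && size == 1 {
			return false // invalid or truncated encoding
		}
		p = p[size:]
	}
	return true
}

func main() {
	fmt.Println(validUTF8Sketch([]byte("plain ascii text")))
	fmt.Println(validUTF8Sketch([]byte("日本語のテキスト")))
	fmt.Println(validUTF8Sketch([]byte{0xff, 'a'}))
}
```

The fast path is why ASCII-only input benefits the most: a long ASCII string never reaches the per-codepoint loop except for its last few bytes.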
I'd like to try to reproduce your results - how can I do that?
Not sure if there's a better way, I used https://gist.github.com/karlseguin/4518694523af63f5edec8048049b95ec I originally only ran in Debug and ReleaseSafe, where this implementation was faster in all cases. But I just tried with ReleaseFast and now I see that this version is slower for long UTF-8 strings (but still much faster for ASCII-only ones). So it's not the slam-dunk that I thought it was.
The previous version was slower than the existing implementation in ReleaseFast for long UTF-8 strings. This version is now faster in all tested cases (short/long, ASCII/UTF-8), in all release modes.
With the newest implementation, in ReleaseFast, I now get:
And in ReleaseSafe, I get:
I'm not sure why Zig should keep the Go naming here, especially with regard to the term 'rune.' What Go calls a rune is called a codepoint in Zig. Personally, I'd also change things like: var p = input; to var remaining = input; and const rune_self = 0x80; to const min_non_ascii_codepoint = 0x80; or something like that.
Thanks for working on this.
Here are a couple ideas I'd like to play with before settling on something:
output from my computer ( |
Out of curiosity, have you seen https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/ ?
I was familiar with it, but I'm not sure I'd be able to port it. There's a port of simdjson worth keeping an eye on, but I notice that it only works on x86_64 and AArch64.
@karlseguin Ping me if you'd like to implement it. @travisstaloch has a pretty decent implementation, as you pointed out: https://github.com/travisstaloch/simdjzon/blob/13247baeb681f3c37bfd0bffd959bbc63e9eb0c1/src/dom.zig#L109
The algorithm is general. E.g., in simdjson we have a POWER implementation. It works on x86 (32 bits) as long as you have SSSE3 (x86 processors without SSSE3 belong to museums). Reference:
@Validark has also used the simdjzon UTF-8 validation code in https://github.com/Validark/Accelerated-Zig-Parser. That readme lists some ideas to potentially make it faster using something called SWAR, which I'm not familiar with, so I'll leave it to them to comment if they wish.
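For context on the term: SWAR stands for "SIMD within a register", meaning an ordinary integer register is treated as a pack of small lanes. A minimal sketch of the idea, in Go, is finding the first non-ASCII byte by loading 8 bytes into a uint64 and masking the high bit of every lane at once. The function name asciiPrefixLen is hypothetical, and this only illustrates what the term means; it is not the approach described in that readme.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// highBits has the top bit set in each of the eight byte lanes.
const highBits = 0x8080808080808080

// asciiPrefixLen returns the number of leading ASCII bytes in p.
// Hypothetical helper for illustration only.
func asciiPrefixLen(p []byte) int {
	i := 0
	for ; i+8 <= len(p); i += 8 {
		w := binary.LittleEndian.Uint64(p[i:])
		if m := w & highBits; m != 0 {
			// With a little-endian load, the lowest set bit belongs to the
			// earliest non-ASCII byte; dividing by 8 gives its lane index.
			return i + bits.TrailingZeros64(m)/8
		}
	}
	// Tail: fewer than 8 bytes left, check them one at a time.
	for ; i < len(p); i++ {
		if p[i] >= 0x80 {
			return i
		}
	}
	return i
}

func main() {
	fmt.Println(asciiPrefixLen([]byte("abcdefghij\xc3\xa9")))
}
```

No vector instructions are involved, so the same code runs on any target with 64-bit integer arithmetic.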
I noticed this function showing up while profiling. It's used in the std's JSON serializer.
In some simple measurements, it was about 6x faster on a simple ASCII string and about 2.5x faster on a large Chinese lorem ipsum string.