fix: Reject surrogate pairs with invalid low surrogate #1896
Closes simdjson#1894

Reject low surrogates outside of the range U+DC00-U+DFFF.

Related to https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs

A surrogate pair should consist of a high surrogate followed by a low surrogate. JSON uses such pairs to represent the code points 0x010000-0x10FFFF because the JavaScript specification originally only supported `\uXXXX`.

Previously, simdjson would accept some combinations of a valid high surrogate and an invalid low surrogate due to a bug in the check (e.g. `\uD888\u1234` was accepted).

- U+D800-U+DBFF (1,024 code points): high surrogates
- U+DC00-U+DFFF (1,024 code points): low surrogates
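The range check described above can be sketched as follows. This is a minimal illustration, not simdjson's actual code: the helper name `combine_surrogates` and its signature are hypothetical, and it assumes the two `\uXXXX` escapes have already been decoded into 16-bit values.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper (not part of simdjson's API): combine the values of two
// consecutive \uXXXX escapes into one code point, or reject the pair.
inline std::optional<uint32_t> combine_surrogates(uint32_t high, uint32_t low) {
  // The first unit must be a high surrogate: U+D800..U+DBFF.
  if (high < 0xD800 || high > 0xDBFF) { return std::nullopt; }
  // The second unit must be a low surrogate: U+DC00..U+DFFF.
  if (low < 0xDC00 || low > 0xDFFF) { return std::nullopt; }
  // Each surrogate contributes 10 bits; the result is offset by 0x10000.
  return 0x10000 + (((high - 0xD800) << 10) | (low - 0xDC00));
}
```

With this sketch, `combine_surrogates(0xD83D, 0xDE00)` yields `0x1F600`, while the pair from the example above, `combine_surrogates(0xD888, 0x1234)`, is rejected because `0x1234` is not a low surrogate.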
The performance check job fails for twitter.json, which isn't using
if (((*src_ptr)[0] != '\\') || (*src_ptr)[1] != 'u') {
  return false;
}
Likely spurious.
Do you think you could design a benchmark where this optimization would shine?
For the combining comparisons, a string which had lots of surrogate pairs might have a tiny but consistent performance difference. I'll look into creating a separate PR for that. What are your thoughts on this PR?
Merged.
Load 2 bytes at a time and compare 2 bytes at a time. Compilers with optimizations turned on will turn this into a 16-bit load then 16-bit compare on supported platforms (with smaller compiled code size). Add parse_surrogate_pairs to show the difference exists. See discussion in simdjson#1896.
Load 2 bytes at a time and compare 2 bytes at a time. Compilers with optimizations turned on will turn this into a 16-bit load then 16-bit compare on supported platforms (with smaller compiled code size). Make it obvious to the compiler that it's reading two consecutive bytes of the same pointer. Add parse_surrogate_pairs to show the difference exists. See discussion in simdjson#1896.
Load 2 bytes and compare the 2 bytes against `"\u"`. Compilers with optimizations turned on will turn this into a 16-bit load then 16-bit compare on supported platforms (with smaller compiled code size). Make it obvious to the compiler that it's reading two consecutive bytes of the same pointer. Add parse_surrogate_pairs to show the difference exists. See discussion in simdjson#1896.
Load 2 bytes and compare the 2 bytes against `"\u"`. Compilers with optimizations turned on will turn this into a 16-bit load then 16-bit compare on supported platforms (with smaller compiled code size). Make it obvious to the compiler that it's reading two consecutive bytes of the same pointer. Add parse_surrogate_pairs to show the difference exists. See discussion in #1896.
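One way to express the two-byte comparison these commit messages describe is sketched below. It is an assumption about the general technique, not the code that was committed: the helper name `starts_with_backslash_u` is hypothetical, and the caller is assumed to guarantee that two bytes are readable at `src`.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns true if the next two bytes are the escape
// prefix "\u". The caller must guarantee that src[0] and src[1] are readable.
inline bool starts_with_backslash_u(const uint8_t *src) {
  static const char expected[2] = {'\\', 'u'};
  // A memcmp with a constant length of 2 makes it explicit that two
  // consecutive bytes of the same pointer are read, so an optimizing compiler
  // is free to emit a single 16-bit load and compare where that is supported.
  return std::memcmp(src, expected, 2) == 0;
}
```

Whether this is measurably faster than two separate byte comparisons depends on the compiler and target, which is why a dedicated benchmark over input with many surrogate pairs is needed to show any difference.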