fix: Reject surrogate pairs with invalid low surrogate #1896

TysonAndre · 2022-09-30T12:26:17Z

Reject low surrogates outside of the range U+DC00—U+DFFF

Related to https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs

A surrogate pair should consist of a high surrogate followed by a low surrogate. JSON uses surrogate pairs to represent code points U+10000 through U+10FFFF because the JavaScript specification originally only supported \uXXXX escapes.

Previously, simdjson would accept some combinations of valid high surrogates and invalid low surrogates due to a bug in the check. (e.g. \uD888\u1234 was accepted)

U+D800—U+DBFF (1,024 code points): high surrogates
U+DC00—U+DFFF (1,024 code points): low surrogates
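For illustration, the two ranges above can be checked and combined roughly as follows. This is a minimal sketch, not simdjson's actual implementation: the helper names `is_high_surrogate`, `is_low_surrogate`, and `combine_surrogates` are hypothetical, and the inputs are assumed to be the two already-decoded 16-bit values from consecutive `\uXXXX` escapes.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; these are not simdjson APIs.
constexpr bool is_high_surrogate(uint32_t code_unit) {
  return code_unit >= 0xD800 && code_unit <= 0xDBFF;
}

constexpr bool is_low_surrogate(uint32_t code_unit) {
  return code_unit >= 0xDC00 && code_unit <= 0xDFFF;
}

// Combines a high and a low surrogate into a single code point.
// Returns false (leaving *code_point untouched) when either half is
// outside its range, e.g. for the pair \uD888\u1234.
inline bool combine_surrogates(uint32_t high, uint32_t low,
                               uint32_t *code_point) {
  if (!is_high_surrogate(high) || !is_low_surrogate(low)) {
    return false;
  }
  *code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
  return true;
}
```

With this sketch, `\uD83D\uDE00` combines to U+1F600, while `\uD888\u1234` is rejected because 0x1234 is not in the low-surrogate range.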

Closes simdjson#1894

TysonAndre · 2022-09-30T12:51:22Z

The performance check job fails for twitter.json, which isn't using \u escapes https://github.com/simdjson/simdjson/actions/runs/3158742870/jobs/5141198629 - maybe the code size increased, or is this spurious?

Would using the simdjson_likely/simdjson_unlikely macros help in benchmarks for the code change?
Unrelatedly, this is a padded string, so I'm assuming (*src_ptr)[1] always points to valid memory. Can the check below be combined into something along the lines of `(((*src_ptr)[0] << 8) | (*src_ptr)[1]) == (('\\' << 8) | 'u')` (which would get compiled to loading an int16 and comparing an int16 on supported platforms), or is there a reason to compare byte by byte?

    if (((*src_ptr)[0] != '\\') || (*src_ptr)[1] != 'u') {
      return false;
    }
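As a rough sketch of the combined comparison being asked about, the two forms could look like the following. The function names are hypothetical and this is not the code that was merged; it assumes at least two readable bytes at `p`, which the padded-string assumption above would guarantee. Using `std::memcpy` for both the input and the expected bytes keeps the comparison independent of byte order.

```cpp
#include <cstdint>
#include <cstring>

// Byte-by-byte form, equivalent to the snippet above (hypothetical name).
inline bool starts_with_u_escape_bytewise(const uint8_t *p) {
  return p[0] == '\\' && p[1] == 'u';
}

// Combined form (hypothetical name): copy two bytes into a 16-bit integer
// and compare once. Assumes p[0] and p[1] are both readable.
inline bool starts_with_u_escape_combined(const uint8_t *p) {
  const char expected[2] = {'\\', 'u'};
  uint16_t got;
  uint16_t want;
  std::memcpy(&got, p, sizeof(got));          // read two consecutive bytes
  std::memcpy(&want, expected, sizeof(want)); // same byte order as the input
  return got == want;
}
```

Whether an optimizing compiler actually emits a single 16-bit load and compare for the second form depends on the compiler and target, so it would need to be checked in the generated assembly or in a benchmark.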

lemire · 2022-09-30T13:50:05Z

> is this spurious?

Likely spurious.

> or is there a reason to compare byte by byte.

Do you think you could design a benchmark where this optimization would shine?

TysonAndre · 2022-09-30T16:05:23Z

> Do you think you could design a benchmark where this optimization would shine?

For the combining comparisons, a string which had lots of surrogate pairs might have a tiny but consistent performance difference.

I'll look into creating a separate PR for that.
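One way to build such an input, sketched under assumptions: the helper name `make_surrogate_heavy_json` is hypothetical, and the document is a single JSON string made only of escaped surrogate pairs (each pair here encodes U+1F600). It is not necessarily the benchmark that ended up in the follow-up PR.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns a JSON document consisting of one string
// that contains `pairs` escaped surrogate pairs and nothing else.
inline std::string make_surrogate_heavy_json(std::size_t pairs) {
  std::string json = "\"";
  json.reserve(2 + pairs * 12);  // each escaped pair is 12 characters
  for (std::size_t i = 0; i < pairs; i++) {
    json += "\\uD83D\\uDE00";    // high surrogate, then low surrogate
  }
  json += "\"";
  return json;
}
```

Parsing a large document of this shape would exercise the surrogate-pair path on almost every character, which is where a small per-pair difference is most likely to be measurable.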

What are your thoughts on this PR?

lemire · 2022-09-30T16:12:59Z

> What are your thoughts on this PR?

Merged.

Load 2 bytes and compare them against `"\u"`. Compilers with optimizations turned on will turn this into a 16-bit load followed by a 16-bit compare on supported platforms (with smaller compiled code size). Make it obvious to the compiler that it's reading two consecutive bytes from the same pointer.

Add parse_surrogate_pairs to show the difference exists.

See discussion in #1896

lemire merged commit 5809e51 into simdjson:master Sep 30, 2022

TysonAndre mentioned this pull request Sep 30, 2022

Micro-optimization for parsing surrogate pairs #1897

Merged

FourierTransformer mentioned this pull request Oct 2, 2022

v2.2.3 FourierTransformer/lua-simdjson#43

Closed

TysonAndre deleted the fix-1894 branch October 12, 2022 23:15
