Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Reject surrogate pairs with invalid low surrogate #1896

Merged
merged 1 commit into from
Sep 30, 2022

Conversation

TysonAndre
Copy link
Contributor

@TysonAndre TysonAndre commented Sep 30, 2022

Closes #1894

Reject low surrogates outside of the range U+DC00—U+DFFF

Related to https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs

A surrogate pair should consist of a high surrogate and low surrogate. They're used to represent 0x010000-0x10FFFF in the JSON spec because the JavaScript specification originally only supported \uXXXX.

Previously, simdjson would accept some combinations of valid high surrogates and invalid low surrogates due to a bug in the check. (e.g. \uD888\u1234 was accepted)

U+D800—U+DBFF (1,024 code points): high surrogates
U+DC00—U+DFFF (1,024 code points): low surrogates

Closes simdjson#1894

Reject low surrogates outside of the range U+DC00—U+DFFF

Related to https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs

A surrogate pair should consist of a high surrogate and low surrogate.
They're used to represent 0x010000-0x10FFFF in the JSON spec because
the JavaScript specification originally only supported `\uXXXX`.

Previously, simdjson would accept some combinations of valid high
surrogates and invalid low surrogates due to a bug in the check.
(e.g. `\uD888\u1234` was accepted)

U+D800—U+DBFF (1,024 code points): high surrogates
U+DC00—U+DFFF (1,024 code points): low surrogates
@TysonAndre
Copy link
Contributor Author

TysonAndre commented Sep 30, 2022

The performance check job fails for twitter.json, which isn't using \u escapes https://github.com/simdjson/simdjson/actions/runs/3158742870/jobs/5141198629 - maybe the code size increased, or is this spurious?

  1. Would using the simdjson_likely/simdjson_unlikely macros help in benchmarks for the code change?
  2. Unrelatedly, this is a padded string so I'm assuming (*src_ptr)[1] points to valid memory all the time. Can the check above be combined into something along the lines of (((*src_ptr)[0] << 8) | (*src_ptr)[1]) == ('\\' << 8) | 'u') (which would get combined to loading an int16 and comparing an int64 on supported platforms?) or is there a reason to compare byte by byte.
    if (((*src_ptr)[0] != '\\') || (*src_ptr)[1] != 'u') {
      return false;
    }

@lemire
Copy link
Member

lemire commented Sep 30, 2022

is this spurious?

Likely spurious.

or is there a reason to compare byte by byte.

Do you think you could design a benchmark where this optimization would shine?

@TysonAndre
Copy link
Contributor Author

Do you think you could design a benchmark where this optimization would shine?

For the combining comparisons, a string which had lots of surrogate pairs might have a tiny but consistent performance difference.

I'll look into creating a separate PR for that.

What are your thoughts on this PR?

@lemire
Copy link
Member

lemire commented Sep 30, 2022

What are your thoughts on this PR?

Merged.

@lemire lemire merged commit 5809e51 into simdjson:master Sep 30, 2022
TysonAndre added a commit to TysonAndre/simdjson that referenced this pull request Sep 30, 2022
Load 2 bytes at a time and compare 2 bytes at a time.
Compilers with optimizations turned on will turn this into a 16-bit load
then 16-bit compare on supported platforms
(with smaller compiled code size).

Add parse_surrogate_pairs to show the difference exists.
See discussion in simdjson#1896
TysonAndre added a commit to TysonAndre/simdjson that referenced this pull request Sep 30, 2022
Load 2 bytes at a time and compare 2 bytes at a time.
Compilers with optimizations turned on will turn this into a 16-bit load
then 16-bit compare on supported platforms
(with smaller compiled code size).

Make it obvious to the compiler that it's reading two
consecutive bytes of the same pointer

Add parse_surrogate_pairs to show the difference exists.
See discussion in simdjson#1896
TysonAndre added a commit to TysonAndre/simdjson that referenced this pull request Sep 30, 2022
Load 2 bytes and compare the 2 bytes against `"\u"`
Compilers with optimizations turned on will turn this into a 16-bit load
then 16-bit compare on supported platforms
(with smaller compiled code size).

Make it obvious to the compiler that it's reading two
consecutive bytes of the same pointer

Add parse_surrogate_pairs to show the difference exists.
See discussion in simdjson#1896
TysonAndre added a commit to TysonAndre/simdjson that referenced this pull request Sep 30, 2022
Load 2 bytes and compare the 2 bytes against `"\u"`
Compilers with optimizations turned on will turn this into a 16-bit load
then 16-bit compare on supported platforms
(with smaller compiled code size).

Make it obvious to the compiler that it's reading two
consecutive bytes of the same pointer

Add parse_surrogate_pairs to show the difference exists.
See discussion in simdjson#1896
TysonAndre added a commit to TysonAndre/simdjson that referenced this pull request Oct 1, 2022
Load 2 bytes and compare the 2 bytes against `"\u"`
Compilers with optimizations turned on will turn this into a 16-bit load
then 16-bit compare on supported platforms
(with smaller compiled code size).

Make it obvious to the compiler that it's reading two
consecutive bytes of the same pointer

Add parse_surrogate_pairs to show the difference exists.
See discussion in simdjson#1896
lemire pushed a commit that referenced this pull request Oct 2, 2022
Load 2 bytes and compare the 2 bytes against `"\u"`
Compilers with optimizations turned on will turn this into a 16-bit load
then 16-bit compare on supported platforms
(with smaller compiled code size).

Make it obvious to the compiler that it's reading two
consecutive bytes of the same pointer

Add parse_surrogate_pairs to show the difference exists.
See discussion in #1896
@TysonAndre TysonAndre deleted the fix-1894 branch October 12, 2022 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

simdjson does not check that the value after high surrogate is in the range of a low surrogate?
2 participants