Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

json: disallow overlong and out-of-range UTF-8 #4097

Merged
merged 1 commit into from Jan 7, 2020

Conversation

hryx
Copy link
Sponsor Contributor

@hryx hryx commented Jan 7, 2020

Fixes #2379

Today we done learn a thing or two about UTF-8 and JSON! If you are curious about the encoding-related details, please see my commit message 馃拃 馃敤

Changes

  • Overhaul the std/json parser states related to multibyte sequences, using this official table of valid UTF-8 byte sequences as the guiding map. The parser now correctly disallows "overlong" encoding and codepoints beyond UTF-16 range.
  • Add several test cases to cover the above as well as some preexisting functionality.

Notes

and pre毛mption of awkward feels

Fixes ziglang#2379

= Overlong (non-shortest) sequences

UTF-8's unique encoding scheme allows for some Unicode codepoints
to be represented in multiple ways. For any of these characters,
the spec forbids all but the shortest form. These disallowed longer
sequences are called "overlong". As an interesting side effect of
this rule, the bytes C0 and C1 never appear in valid UTF-8.

= Codepoint range

UTF-8 disallows representation of codepoints beyond U+10FFFF,
which is the highest character which can be encoded in UTF-16.
Because a 4-byte sequence is capable of resulting in such characters,
they must be explicitly rejected. This rule also has an interesting
side effect, which is that bytes F5 to FF never appear.

= References

Detecting an overlong version of a codepoint could get gnarly, but
luckily The Unicode Consortium did the hard work by creating this
handy table of valid byte sequences:

https://unicode.org/versions/corrigendum1.html

I thought this mapped nicely to the parser's state machine, so I
rearranged the relevant states to make use of it.
@tiehuis
Copy link
Member

tiehuis commented Jan 7, 2020

Regarding the naming convention in tests, these correspond to the tests found in this repository: https://github.com/nst/JSONTestSuite

@hryx
Copy link
Sponsor Contributor Author

hryx commented Jan 7, 2020

Regarding the naming convention in tests, these correspond to the tests found in this repository: https://github.com/nst/JSONTestSuite

Ahhh, I see, thanks. I'm happy to rename the tests or move them to std/json.zig if you'd prefer.

@tiehuis
Copy link
Member

tiehuis commented Jan 7, 2020

I think it is fine to add these to the existing file. A few have already been done. A comment indicating what part of the spec they refer to would be more than enough.

@andrewrk andrewrk merged commit 2933a82 into ziglang:master Jan 7, 2020
@hryx hryx deleted the json-invalid-utf8 branch January 7, 2020 20:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

std.json failing test cases
3 participants