Json serialization doesn't support UTF-8 Encoding & UTF-16 surrogate pairs #782

Merged: 2 commits into vibe-d:master from etcimon:appender-fix on Aug 14, 2014

Conversation

@etcimon (Contributor) commented Aug 13, 2014

I'm having issues with UTF-8 in JSON documents; I never managed to find a way around it, even when using readAllUTF8. The error I get is an invalid UTF-8 surrogate character.

@etcimon (Contributor, Author) commented Aug 13, 2014

Actually, this is a problem with UTF-8 support. Here's an example: \ud83d\udcec. That escape sequence should be interpreted as 📬.

However, the error is: Invalid UTF sequence: d83d - Encoding a surrogate code point in UTF-8
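
For reference, the arithmetic that should happen here, as a minimal standalone sketch of the standard UTF-16 surrogate-pair combination (not vibe.d's code):

```d
import std.stdio : writefln;

void main()
{
    // The two \uXXXX escapes from the failing document:
    enum int hi = 0xD83D; // high (lead) surrogate
    enum int lo = 0xDCEC; // low (trail) surrogate

    // Standard UTF-16 surrogate-pair combination:
    int cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);

    writefln("U+%05X", cp); // prints U+1F4EC, i.e. 📬
}
```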

@etcimon (Contributor, Author) commented Aug 13, 2014

Looks like the Unicode parser/encoder is incomplete. I'm going to work on one based on this:

https://github.com/akheron/jansson/blob/master/src/load.c#L402

@etcimon changed the title from "Added try/catch on the .put call" to "Json serialization doesn't support UTF-8 Encoding & UTF-16 surrogate pairs" on Aug 13, 2014
@etcimon (Contributor, Author) commented Aug 13, 2014

Should be good now

@s-ludwig (Member) commented

That's a lot of logic for code that was supposed to be encoding-agnostic. One of the goals was to avoid any overhead from UTF decoding or encoding (Andrei also mentioned this as a primary goal for a potential std.data.json module). Also, this logic would have to be duplicated in almost every other function that processes strings as ranges, so it can't be the right solution.

On top of that, surrogate pairs are, AFAIK, simply invalid in UTF-8 (they should be replaced by a single Unicode code point), even though they could be handled. So the correct way to handle this would be to pre-process the input with the logic of this PR before sending it to the JSON parser (i.e. there should theoretically be an in contract for writeJsonString requiring that all input strings pass validate()). So maybe that would be a good feature for vibe.utils.string.sanitizeUTF8?
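
To make that point concrete, a small standalone check against Phobos (not vibe.d code): a surrogate code point encoded directly as UTF-8 bytes fails D's own validation, which is why the input would need a pre-processing/sanitizing step before an encoding-agnostic parser ever sees it.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    // U+D83D (a lone high surrogate) written out as the three bytes a naive
    // UTF-8 encoder would produce; the result is not valid UTF-8.
    ubyte[] raw = [0xED, 0xA0, 0xBD];
    auto s = cast(string) raw;

    // Phobos agrees: validation throws.
    assertThrown!UTFException(validate(s));
}
```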

@etcimon (Contributor, Author) commented Aug 14, 2014

> One of the goals was to avoid any overhead from UTF decoding or encoding (Andrei also mentioned this as a primary goal for a potential std.data.json module).

That's unfortunate, because it's in the JSON spec: http://www.ietf.org/rfc/rfc4627.txt


> Any character may be escaped. If the character is in the Basic
> Multilingual Plane (U+0000 through U+FFFF), then it may be
> represented as a six-character sequence: a reverse solidus, followed
> by the lowercase letter u, followed by four hexadecimal digits
> [..]
> To escape an extended character that is not in the Basic Multilingual
> Plane, the character is represented as a twelve-character sequence,
> encoding the UTF-16 surrogate pair.


This feature should be automatic and available out of the box for quotation-enclosed strings. It was actually breaking an e-commerce site that consumes an external API, and it's a very unexpected misfeature.
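
For illustration, the kind of document that was failing, as a sketch assuming vibe.data.json's parseJsonString:

```d
import vibe.data.json;

void main()
{
    // RFC 4627's twelve-character form: a non-BMP character escaped as a
    // UTF-16 surrogate pair inside an otherwise plain-ASCII JSON document.
    auto doc = `{"emoji": "\ud83d\udcec"}`;

    // Before this PR the parser raised "Invalid UTF sequence: d83d ...";
    // with the fix, the pair decodes to the single code point U+1F4EC (📬).
    auto j = parseJsonString(doc);
    assert(j["emoji"].get!string == "\U0001F4EC");
}
```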

@s-ludwig (Member) commented

Hm... then it seems like the JSON spec contradicts itself. The next section says that the default encoding is UTF-8, but UTF-8 quite clearly disallows surrogate pairs. I'm actually surprised that the spec goes into so much detail w.r.t. character encoding.

Okay, but it seems like this stuff is out in the wild. My suggested solution would still be to preprocess the string so that it is valid UTF-8 (after all, D defines strings to be UTF-8, and it's also D that validates the encoding on various occasions). What we could do is add a flag to parseJson that optionally causes the input to be wrapped in a surrogate decoder range. But I don't want to work against D/Phobos. An alternative would be to change std.utf.validate to accept such sequences by default.

@etcimon (Contributor, Author) commented Aug 14, 2014

> What we could do is add a flag to parseJson that optionally causes the input to be wrapped in a surrogate decoder range. But I don't want to work against D/Phobos. An alternative would be to change std.utf.validate to accept such sequences by default.

I can see that happening, although it should be a flag to disable the UTF-8/surrogate-pair decoder. Disabling it is mostly a performance concern, whereas the bugs that appear when it isn't enabled are hard to track down and disastrous (a 500 internal server error). jsonEscape should also have a flag to emit code points rather than UTF-8 encoding.

@s-ludwig (Member) commented

You can still get 500 errors when the user sends other malformed UTF-8 sequences, and technically that's all this is. I'd really be interested in who was responsible for this part of the JSON spec.

What exactly do you mean by jsonEscape returning code points rather than UTF-8 encoding? Just a mode where it outputs to a range of dchars instead of chars? Currently the whole parser is tailored to UTF-8, so that would just be a performance hog, but once UTF-16 and UTF-32 support gets built in, it would make sense.

@etcimon (Contributor, Author) commented Aug 14, 2014

> What exactly do you mean by jsonEscape returning code points rather than UTF-8 encoding? Just a mode where it outputs to a range of dchars instead of chars?

I meant the option of escaping the sequence as \uXXXX, or as \uXXXX\uXXXX, when it falls outside 7-bit ASCII.
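
Roughly like this, as a sketch of the surrogate-pair split (my own hypothetical helper, not vibe.d's actual jsonEscape):

```d
import std.format : format;

// Escape a single code point the way the RFC describes: pass 7-bit ASCII
// through, use \uXXXX inside the BMP, and a \uXXXX\uXXXX surrogate pair
// beyond it. (Illustrative only; real code also escapes '"', '\\', controls.)
string escapeCodePoint(dchar c)
{
    if (c < 0x80)
        return format("%c", c);
    if (c <= 0xFFFF)
        return format("\\u%04x", cast(uint) c);
    uint v = c - 0x10000;
    return format("\\u%04x\\u%04x", 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF));
}

unittest
{
    assert(escapeCodePoint('A') == "A");
    assert(escapeCodePoint('\U0001F4EC') == "\\ud83d\\udcec");
}
```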

> I'd really be interested in who was responsible for this part of the JSON spec.

I'd say let the majority decide. This is how JSON documents are returned by Facebook, Instagram, and Google, for example:

http://instagram.com/mileycyrus (search for \ud83d\udc85 in the page source)

@s-ludwig (Member) commented

Wait, I misunderstood the spec there. So this is really only allowed for explicitly escaped UTF-16 code points ("\u1234"). In that case, please just forget what I just said ;)

@s-ludwig (Member) commented

So the parser change would be OK, and the jsonEscape change is also OK functionally. But let's make a small optimization: test ch < 0x80 instead of performing a full stride computation, so that the performance impact stays as small as possible.

Sorry for the confusion.
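
For reference, the shape of that fast path, as a rough sketch over a generic output range (hypothetical name escapeSketch, not the actual vibe.d escaper):

```d
import std.range.primitives : put;
import std.utf : decode;

// Sketch of the suggested fast path: bytes below 0x80 skip UTF-8 decoding
// entirely, and only multi-byte sequences pay for a full decode.
void escapeSketch(Output)(ref Output dst, string s)
{
    size_t i = 0;
    while (i < s.length)
    {
        if (s[i] < 0x80)              // ASCII: no stride/decode needed
        {
            put(dst, s[i]);           // (real code still escapes '"', '\\', controls)
            i++;
        }
        else
        {
            dchar cp = decode(s, i);  // decode advances i past the sequence
            put(dst, cp);             // real code would emit \uXXXX escapes here
        }
    }
}
```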

@etcimon (Contributor, Author) commented Aug 14, 2014

XD

Thanks :)

@s-ludwig (Member) commented

Thanks! Looks good indeed (I'm gonna trust you on those Unicode magic numbers ;)

s-ludwig added a commit referencing this pull request on Aug 14, 2014: "Json serialization doesn't support UTF-8 Encoding & UTF-16 surrogate pairs"

@s-ludwig merged commit df80d24 into vibe-d:master on Aug 14, 2014
@etcimon (Contributor, Author) commented Aug 14, 2014

Yep, don't trust me though; I just ripped it off from here and here. Although it's heavily maintained, I'm passing the blame on to him if anything goes wrong ;)

@etcimon deleted the appender-fix branch on August 14, 2014 at 16:30