Json serialization doesn't support UTF-8 Encoding & UTF-16 surrogate pairs #782

Merged: 2 commits into vibe-d:master from etcimon:appender-fix on Aug 14, 2014

Conversation

@etcimon (Contributor) commented Aug 13, 2014

I'm having issues with UTF-8 in JSON documents; I never managed to find a way around it, even when using readAllUTF8. The error I get is an invalid UTF-8 surrogate character.

@etcimon (Contributor, Author) commented Aug 13, 2014

Actually, this is a problem with UTF-8 support. Here's an example: \ud83d\udcec. That escape sequence should be interpreted as 📬.

However, the error is: Invalid UTF sequence: d83d - Encoding a surrogate code point in UTF-8
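
For reference, the arithmetic that should happen here, as a minimal standalone sketch of the standard UTF-16 surrogate-pair combination (not vibe.d's code):

```d
import std.stdio : writefln;

void main()
{
    // The two \uXXXX escapes from the failing document:
    enum int hi = 0xD83D; // high (lead) surrogate
    enum int lo = 0xDCEC; // low (trail) surrogate

    // Standard UTF-16 surrogate-pair combination:
    int cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);

    writefln("U+%05X", cp); // prints U+1F4EC, i.e. 📬
}
```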

@etcimon (Contributor, Author) commented Aug 13, 2014

Looks like the Unicode parser/encoder is incomplete. I'm going to work on one based on this:

https://github.com/akheron/jansson/blob/master/src/load.c#L402

@etcimon changed the title from "Added try/catch on the .put call" to "Json serialization doesn't support UTF-8 Encoding & UTF-16 surrogate pairs" on Aug 13, 2014
@etcimon (Contributor, Author) commented Aug 13, 2014

Should be good now

@s-ludwig (Member) commented

That's a lot of logic for code that was supposed to be encoding-agnostic. One of the goals was to avoid any overhead from UTF decoding or encoding (Andrei also mentioned this as a primary goal for a potential std.data.json module). Also, this logic would have to be duplicated in almost every other function that processes strings as ranges, so it can't be the right solution.

On top of that, surrogate pairs are, AFAIK, simply invalid in UTF-8 (they should be replaced by a single Unicode code point), even though they could be handled. So the correct way to handle this would be to pre-process the input with the logic of this PR before sending it to the JSON parser (i.e. there should theoretically be an in contract for writeJsonString requiring that all input strings pass validate()). So maybe that would be a good feature for vibe.utils.string.sanitizeUTF8?
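
To make that point concrete, a small standalone check against Phobos (not vibe.d code): a surrogate code point encoded directly as UTF-8 bytes fails D's own validation, which is why the input would need a pre-processing/sanitizing step before an encoding-agnostic parser ever sees it.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    // U+D83D (a lone high surrogate) written out as the three bytes a naive
    // UTF-8 encoder would produce; the result is not valid UTF-8.
    ubyte[] raw = [0xED, 0xA0, 0xBD];
    auto s = cast(string) raw;

    // Phobos agrees: validation throws.
    assertThrown!UTFException(validate(s));
}
```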

@etcimon (Contributor, Author) commented Aug 14, 2014

> One of the goals was to avoid any overhead from UTF decoding or encoding (Andrei also mentioned this as a primary goal for a potential std.data.json module).

That's unfortunate, because it's in the JSON spec: http://www.ietf.org/rfc/rfc4627.txt


> Any character may be escaped. If the character is in the Basic
> Multilingual Plane (U+0000 through U+FFFF), then it may be
> represented as a six-character sequence: a reverse solidus, followed
> by the lowercase letter u, followed by four hexadecimal digits
> [..]
> To escape an extended character that is not in the Basic Multilingual
> Plane, the character is represented as a twelve-character sequence,
> encoding the UTF-16 surrogate pair.


This feature should be automatic and available out of the box for quotation-enclosed strings. It was actually breaking an e-commerce site that consumes an external API, and it's a very unexpected misfeature.
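
For illustration, the kind of document that was failing, as a sketch assuming vibe.data.json's parseJsonString:

```d
import vibe.data.json;

void main()
{
    // RFC 4627's twelve-character form: a non-BMP character escaped as a
    // UTF-16 surrogate pair inside an otherwise plain-ASCII JSON document.
    auto doc = `{"emoji": "\ud83d\udcec"}`;

    // Before this PR the parser raised "Invalid UTF sequence: d83d ...";
    // with the fix, the pair decodes to the single code point U+1F4EC (📬).
    auto j = parseJsonString(doc);
    assert(j["emoji"].get!string == "\U0001F4EC");
}
```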

@s-ludwig (Member) commented

Hm... then it seems like the JSON spec contradicts itself. The next section says that the default encoding is UTF-8, but UTF-8 quite clearly disallows surrogate pairs. I'm actually surprised that the spec goes into so much detail w.r.t. character encoding.

Okay, but it seems like this stuff is out in the wild. My suggested solution would still be to preprocess the string so that it is valid UTF-8 (after all, D defines strings to be UTF-8, and it's also D that validates the encoding on various occasions). What we could do is add a flag to parseJson that optionally causes the input to be wrapped in a surrogate decoder range. But I don't want to work against D/Phobos. An alternative would be to change std.utf.validate to accept such sequences by default.

@etcimon (Contributor, Author) commented Aug 14, 2014

> What we could do is add a flag to parseJson that optionally causes the input to be wrapped in a surrogate decoder range. But I don't want to work against D/Phobos. An alternative would be to change std.utf.validate to accept such sequences by default.

I can see that happening, although it should be a flag to disable the UTF-8/surrogate-pair decoder. Disabling it is mostly a performance concern, whereas the bugs that appear when it isn't enabled are hard to track down and disastrous (a 500 internal server error). jsonEscape should also have a flag to emit code points rather than UTF-8 encoding.

@s-ludwig (Member) commented

You can still get 500 errors when the user sends other malformed UTF-8 sequences, and technically that's all this is. I'd really be interested in who was responsible for this part of the JSON spec.

What exactly do you mean by jsonEscape returning code points rather than UTF-8 encoding? Just a mode where it outputs to a range of dchars instead of chars? Currently the whole parser is tailored to UTF-8, so that would just be a performance hog, but once UTF-16 and UTF-32 support gets built in, it would make sense.

@etcimon (Contributor, Author) commented Aug 14, 2014

> What exactly do you mean by jsonEscape returning code points rather than UTF-8 encoding? Just a mode where it outputs to a range of dchars instead of chars?

I meant the option of escaping the sequence as \uXXXX, or as \uXXXX\uXXXX, when it falls outside 7-bit ASCII.
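
Roughly like this, as a sketch of the surrogate-pair split (my own hypothetical helper, not vibe.d's actual jsonEscape):

```d
import std.format : format;

// Escape a single code point the way the RFC describes: pass 7-bit ASCII
// through, use \uXXXX inside the BMP, and a \uXXXX\uXXXX surrogate pair
// beyond it. (Illustrative only; real code also escapes '"', '\\', controls.)
string escapeCodePoint(dchar c)
{
    if (c < 0x80)
        return format("%c", c);
    if (c <= 0xFFFF)
        return format("\\u%04x", cast(uint) c);
    uint v = c - 0x10000;
    return format("\\u%04x\\u%04x", 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF));
}

unittest
{
    assert(escapeCodePoint('A') == "A");
    assert(escapeCodePoint('\U0001F4EC') == "\\ud83d\\udcec");
}
```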

> I'd really be interested in who was responsible for this part of the JSON spec.

I'd say let the majority decide. This is how JSON documents are returned by Facebook, Instagram, and Google, for example:

http://instagram.com/mileycyrus (search for \ud83d\udc85 in the page source)

@s-ludwig (Member) commented

Wait, I misunderstood the spec there. So this is really only allowed for explicitly escaped UTF-16 code points ("\u1234"). In that case, please just forget what I just said ;)

@s-ludwig (Member) commented

So the parser change would be OK, and the jsonEscape change is also OK functionally. But let's make a small optimization: test ch < 0x80 instead of performing a full stride computation, so that the performance impact stays as small as possible.

Sorry for the confusion.
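
For reference, the shape of that fast path, as a rough sketch over a generic output range (hypothetical name escapeSketch, not the actual vibe.d escaper):

```d
import std.range.primitives : put;
import std.utf : decode;

// Sketch of the suggested fast path: bytes below 0x80 skip UTF-8 decoding
// entirely, and only multi-byte sequences pay for a full decode.
void escapeSketch(Output)(ref Output dst, string s)
{
    size_t i = 0;
    while (i < s.length)
    {
        if (s[i] < 0x80)              // ASCII: no stride/decode needed
        {
            put(dst, s[i]);           // (real code still escapes '"', '\\', controls)
            i++;
        }
        else
        {
            dchar cp = decode(s, i);  // decode advances i past the sequence
            put(dst, cp);             // real code would emit \uXXXX escapes here
        }
    }
}
```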

@etcimon (Contributor, Author) commented Aug 14, 2014

XD

Thanks :)

@s-ludwig (Member) commented

Thanks! Looks good indeed (I'm gonna trust you on those Unicode magic numbers ;)

s-ludwig added a commit referencing this pull request on Aug 14, 2014: "Json serialization doesn't support UTF-8 Encoding & UTF-16 surrogate pairs"

@s-ludwig merged commit df80d24 into vibe-d:master on Aug 14, 2014
@etcimon (Contributor, Author) commented Aug 14, 2014

Yep, don't trust me though; I just ripped it off from here and here. Although it's heavily maintained, I'm passing the blame on to him if anything goes wrong ;)

@etcimon deleted the appender-fix branch on August 14, 2014 at 16:30