Json serialization doesn't support UTF-8 Encoding & UTF-16 surrogate pairs #782

Merged

merged 2 commits into vibe-d:master from etcimon:appender-fix on Aug 14, 2014

Conversation
@etcimon
Contributor

etcimon commented Aug 13, 2014

I'm having issues with UTF-8 in JSON documents; I never managed to find a way around it, even when using readAllUTF8. The error I get is: invalid UTF-8 surrogate character.

@etcimon
Contributor

etcimon commented Aug 13, 2014

Actually, this is a problem with UTF-8 support. Here's an example: \ud83d\udcec. The character should be interpreted as: 📬

However, the error is: Invalid UTF sequence: d83d - Encoding a surrogate code point in UTF-8
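For reference, a high/low surrogate pair combines into a single code point via the standard UTF-16 formula. A minimal standalone D sketch of the example above (illustrative, not vibe.d code):

```d
import std.stdio : writefln;

void main()
{
    // The escaped pair from the example: \ud83d\udcec
    immutable uint hi = 0xD83D; // high (lead) surrogate
    immutable uint lo = 0xDCEC; // low (trail) surrogate

    // Standard UTF-16 combination:
    // code point = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    dchar cp = cast(dchar)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));

    writefln("U+%05X -> %s", cp, cp); // prints: U+1F4EC -> 📬
}
```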

@etcimon
Contributor

etcimon commented Aug 13, 2014

Looks like the Unicode parser/encoder is incomplete. I'm going to work on one based on this:

https://github.com/akheron/jansson/blob/master/src/load.c#L402

@etcimon etcimon changed the title from "Added try/catch on the .put call" to "Json serialization doesn't support UTF-8 Encoding & UTF-16 surrogate pairs" Aug 13, 2014

@etcimon
Contributor

etcimon commented Aug 13, 2014

Should be good now.

@s-ludwig
Member

s-ludwig commented Aug 14, 2014

That's a lot of logic for code that was supposed to be encoding agnostic. It was one of the goals to not have any overhead due to UTF decoding or encoding (Andrei also mentioned this as a primary goal for a potential std.data.json module). Also, this logic would have to go into almost every other function that processes strings as ranges, too, so that can't be the right solution.

On top of that, surrogate pairs, AFAIK, are simply invalid in UTF-8 (they should be replaced by a single Unicode code point), even though they could be handled. So the correct way to handle this would be to pre-process the input with the logic of this PR before sending it to the JSON parser (i.e. there should theoretically be an in contract for writeJsonString requiring that all input strings pass validate()). So maybe that would be a good feature for vibe.utils.string.sanitizeUTF8?
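A minimal sketch of what such a pre-processing step could look like, using Phobos' std.encoding.sanitize as a stand-in (the sanitizeUTF8 name and placement are hypothetical, and this only repairs malformed sequences with U+FFFD — the surrogate-decoding logic of this PR would slot into the same place):

```d
import std.encoding : sanitize;
import std.utf : validate, UTFException;

// Hypothetical helper: ensure a string is valid UTF-8 before it reaches
// the JSON parser, replacing malformed sequences with U+FFFD.
string sanitizeUTF8(string input)
{
    try
    {
        validate(input); // throws UTFException on malformed UTF-8
        return input;    // already valid: no copy, no overhead
    }
    catch (UTFException)
    {
        return sanitize(input); // repair by substituting U+FFFD
    }
}
```

The validate-first structure keeps the common (already valid) case allocation-free, which matches the stated goal of avoiding UTF overhead.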

@etcimon
Contributor

etcimon commented Aug 14, 2014

It was one of the goals to not have any overhead due to UTF decoding or encoding (Andrei also mentioned this as a primary goal for a potential std.data.json module).

That's unfortunate, because it's in the JSON spec: http://www.ietf.org/rfc/rfc4627.txt

    Any character may be escaped. If the character is in the Basic
    Multilingual Plane (U+0000 through U+FFFF), then it may be
    represented as a six-character sequence: a reverse solidus, followed
    by the lowercase letter u, followed by four hexadecimal digits
    [..]
    To escape an extended character that is not in the Basic Multilingual
    Plane, the character is represented as a twelve-character sequence,
    encoding the UTF-16 surrogate pair.

This feature should be automatic and available out of the box for quoted strings. It was actually breaking an e-commerce site through an external API, and it's a very unexpected misfeature.
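For concreteness, here is how the twelve-character sequence from the RFC comes about for the 📬 example above — a D sketch (illustrative, not the vibe.d implementation):

```d
import std.format : format;

// Splits a non-BMP code point into its UTF-16 surrogate pair and renders
// the twelve-character JSON escape described in RFC 4627.
string escapeNonBmp(dchar cp)
{
    assert(cp > 0xFFFF, "only needed outside the Basic Multilingual Plane");
    uint v = cp - 0x10000;          // 20-bit offset above the BMP
    uint hi = 0xD800 + (v >> 10);   // high (lead) surrogate: top 10 bits
    uint lo = 0xDC00 + (v & 0x3FF); // low (trail) surrogate: bottom 10 bits
    return format(`\u%04x\u%04x`, hi, lo);
}

unittest
{
    assert(escapeNonBmp('\U0001F4EC') == `\ud83d\udcec`); // 📬
}
```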

@s-ludwig
Member

s-ludwig commented Aug 14, 2014

Hm... then it seems like the JSON spec contradicts itself. The next section says that the default encoding is UTF-8, but UTF-8 quite clearly disallows surrogate pairs. I'm actually surprised that the spec goes into so much detail w.r.t. character encoding.

Okay, but it seems like this stuff is out in the wild. So my suggested solution would still be to preprocess the string so that it is valid UTF-8 (after all, D defines strings to be UTF-8, and it's also D which validates the encoding on various occasions). What we could maybe do is add a flag to parseJson which optionally causes the input to be wrapped in a surrogate decoder range. But I don't want to work against D/Phobos. An alternative would be to change std.utf.validate to accept such sequences by default.

@etcimon
Contributor

etcimon commented Aug 14, 2014

What we could maybe do is add a flag to parseJson which optionally causes the input to be wrapped in a surrogate decoder range. But I don't want to work against D/Phobos. An alternative would be to change std.utf.validate to accept such sequences by default.

I can see that happening, although it should be a flag to disable the UTF-8/surrogate pair decoder. It seems like disabling it is mostly a performance concern, whereas the bugs that appear when it's not enabled are hard to track down and disastrous (500 internal server errors). jsonEscape should also have a flag to emit code points rather than UTF-8 encoding.

@s-ludwig
Member

s-ludwig commented Aug 14, 2014

You can still get 500 errors when the user sends other malformed UTF-8 sequences, and technically it's just that. I'd really be interested in who was responsible for this part of the JSON spec.

What exactly do you mean by jsonEscape returning code points rather than UTF-8 encoding? Just a mode where it outputs to a range of dchars vs. chars? Currently the whole parser is tailored for UTF-8, so that would just be a performance hog, but when UTF-16 and UTF-32 support gets built in, it would make sense.

@etcimon
Contributor

etcimon commented Aug 14, 2014

What exactly do you mean by jsonEscape returning code points rather than UTF-8 encoding? Just a mode where it outputs to a range of dchars vs. chars?

I meant the option of escaping the sequence as \uXXXX or \uXXXX\uXXXX when it goes out of bounds of 7-bit ASCII.

I'd really be interested in who was responsible for this part of the JSON spec.

I'd say let the majority decide. This is how JSON documents are returned by Facebook, Instagram, and Google, e.g.:

http://instagram.com/mileycyrus (search for \ud83d\udc85 in the page source)

@s-ludwig
Member

s-ludwig commented Aug 14, 2014

Wait, I misunderstood the spec there. So this is really only allowed for explicitly escaped UTF-16 code points ("\u1234") - in that case, please just forget what I've just said ;)

@s-ludwig
Member

s-ludwig commented Aug 14, 2014

So the change for the parser would be OK, and the escapeJSON change is also OK functionally. But let's make a little optimization by testing ch < 0x80 instead of performing a full stride computation, so that the performance impact is as small as possible.

Sorry for the confusion.
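A rough illustration of the suggested fast path, assuming the escape loop walks a UTF-8 string (function and variable names are illustrative, not the actual vibe.d internals):

```d
import std.utf : decode;

// Escape loop with an ASCII fast path: code units below 0x80 are complete
// characters in UTF-8, so no stride/decode computation is needed for them.
void escapeLoop(Output)(ref Output dst, string s)
{
    size_t i = 0;
    while (i < s.length)
    {
        immutable char ch = s[i];
        if (ch < 0x80)   // ASCII fast path
        {
            dst.put(ch); // (real code would still escape '"', '\\', controls)
            ++i;
        }
        else
        {
            immutable dchar cp = decode(s, i); // full decode only when needed
            dst.put(cp);
        }
    }
}
```

Since the vast majority of JSON payloads are ASCII-dominated, this keeps the per-character cost of the common case to a single comparison.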

@etcimon
Contributor

etcimon commented Aug 14, 2014

XD

Thanks :)

@s-ludwig
Member

s-ludwig commented Aug 14, 2014

Thanks! Looks good indeed (I'm gonna trust you on those Unicode magic numbers ;)

s-ludwig added a commit that referenced this pull request Aug 14, 2014

Merge pull request #782 from etcimon/appender-fix
Json serialization doesn't support UTF-8 Encoding & UTF-16 surrogate pairs

@s-ludwig s-ludwig merged commit df80d24 into vibe-d:master Aug 14, 2014

1 check passed: continuous-integration/travis-ci - The Travis CI build passed
@etcimon
Contributor

etcimon commented Aug 14, 2014

Yep, don't trust me though - I just ripped it off here and here, and since it's heavily maintained, I'm passing him the blame if anything goes wrong ;)

@etcimon etcimon deleted the etcimon:appender-fix branch Aug 14, 2014
