Json serialization doesn't support UTF-8 Encoding & UTF-16 surrogate pairs #782
Actually, this is a problem with UTF-8 support. Here's an example: However, the error is:
Looks like the Unicode parser/encoder is incomplete. I'm going to work on one based on this: https://github.com/akheron/jansson/blob/master/src/load.c#L402
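For readers following along, the core arithmetic such a decoder needs can be sketched in a few lines of Python. This is only an illustration of the standard UTF-16 surrogate formula, not the code from this PR or from jansson; the function name `combine_surrogates` is made up for the example.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single Unicode code point.

    Hypothetical helper for illustration; raises ValueError when either
    half is outside its surrogate range.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each half carries 10 bits; the result lies above the BMP.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair \ud83d\udc85 that comes up later in this thread:
code_point = combine_surrogates(0xD83D, 0xDC85)
utf8_bytes = chr(code_point).encode("utf-8")
```

A parser would call something like this only after reading two consecutive `\uXXXX` escapes, and treat a high surrogate without a following low surrogate as an error.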
removed example
Should be good now.
That's a lot of logic for code that was supposed to be encoding-agnostic. One of the goals was to avoid any overhead from UTF decoding or encoding (Andrei also mentioned this as a primary goal for a potential On top of that, surrogate pairs, AFAIK, are simply invalid in UTF-8 (they should be replaced by a single Unicode code point), even though they could be handled. So the correct way to handle this would be to pre-process the input with the logic of this PR before sending it to the JSON parser (i.e. there should theoretically be an
That's unfortunate, because it's in the JSON spec: http://www.ietf.org/rfc/rfc4627.txt Any character may be escaped. If the character is in the Basic This full feature should be automatic and available out of the box with quotation-enclosed strings. This was actually breaking an e-commerce site that uses an external API, and it's a very unexpected misfeature.
Hm, then it seems like the JSON spec contradicts itself. The next section says that the default encoding is UTF-8, but UTF-8 quite clearly disallows surrogate pairs. I'm actually surprised that the spec goes into so much detail w.r.t. character encoding. Okay, but it seems like this stuff is out in the wild. So my suggested solution would still be to preprocess the string so that it is valid UTF-8 (after all, D defines strings to be UTF-8, and it's also D that validates the encoding on various occasions). What we could maybe do would be to add a flag to
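To make the preprocessing idea concrete, here is a rough Python sketch of a pass that rewrites escaped surrogate pairs in raw JSON text into the real character before any parser sees it. It is an assumption about what such a step could look like, not what the PR does; `merge_escaped_surrogates` and `_ESCAPE` are hypothetical names.

```python
import re

# Match either an escaped surrogate pair (groups 1 and 2) or any other
# backslash escape. Consuming every escape keeps "\\ud83d" (an escaped
# backslash followed by plain text) from being misread as a \u escape.
_ESCAPE = re.compile(
    r"\\(?:u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})|.)",
    re.DOTALL,
)


def merge_escaped_surrogates(raw: str) -> str:
    """Replace each escaped surrogate pair with the character it encodes."""

    def repl(match: re.Match) -> str:
        if match.group(1) is None:
            return match.group(0)  # some other escape, leave untouched
        high = int(match.group(1), 16)
        low = int(match.group(2), 16)
        return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))

    return _ESCAPE.sub(repl, raw)


cleaned = merge_escaped_surrogates(r'"\ud83d\udc85"')
```

The trade-off discussed above still applies: this costs an extra pass over the input, which is why making it optional came up.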
I can see that happening, although it should be a flag to disable the UTF-8/surrogate pair decoder. Disabling it seems to be mostly a performance concern, whereas the bugs that appear when it's not enabled are hard to track down and disastrous (500 internal error).
You can still get 500 errors when the user sends other malformed UTF-8 sequences, and technically it's just that. I'd really be interested to know who was responsible for this part of the JSON spec. What exactly do you mean with
I meant the option of escaping the sequence in
I'd say let the majority decide. This is how JSON documents are returned by Facebook, Instagram and Google, e.g. http://instagram.com/mileycyrus (search for \ud83d\udc85 in the source code).
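As a point of comparison, Python's standard `json` module accepts this escaped pair and yields a single character, and by default it writes the same escaped form back out. The snippet only shows how one widely used parser treats the sequence; it says nothing about how vibe.d behaves.

```python
import json

# The escaped pair quoted above, as it would appear in a JSON document.
escaped = '"\\ud83d\\udc85"'

decoded = json.loads(escaped)      # one character, U+1F485
utf8 = decoded.encode("utf-8")     # valid four-byte UTF-8 sequence
reencoded = json.dumps(decoded)    # ensure_ascii=True escapes it again
```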
Wait, I misunderstood the spec there. So this is really only allowed for explicitly escaped UTF-16 code points (
So the change for the parser would be OK, the Sorry for the confusion.
XD Thanks :)
Thanks! Looks good indeed (I'm gonna trust you on those Unicode magic numbers ;)
Json serialization doesn't support UTF-8 Encoding & UTF-16 surrogate pairs
I'm having issues with UTF-8 in JSON documents, and I never managed to find a way around them, even when I use readAllUTF8. The error I get is invalid UTF-8 surrogate character.
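One plausible way to hit an error like this (an assumption, since the failing code is not shown here) is a decoder that converts each `\uXXXX` escape on its own, without pairing surrogates. The Python sketch below reproduces that situation; `naive_unescape` and `is_valid_utf8_text` are hypothetical helpers.

```python
def naive_unescape(hex_units: list[str]) -> str:
    """Turn each escaped 16-bit unit into a character, with no pairing."""
    return "".join(chr(int(unit, 16)) for unit in hex_units)


def is_valid_utf8_text(text: str) -> bool:
    """Return True if the text can be encoded as well-formed UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Two lone surrogate code points instead of one character.
broken = naive_unescape(["d83d", "dc85"])

# Forcing the encoding shows the bytes a strict UTF-8 validator rejects.
forced = broken.encode("utf-8", "surrogatepass")
```

Here `is_valid_utf8_text(broken)` is false, while the properly combined character U+1F485 encodes without complaint.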