JSON.stringify produces invalid UTF-16 #944
Comments
mathiasbynens
Jun 30, 2017
Member
Prior discussion: https://esdiscuss.org/topic/code-points-vs-unicode-scalar-values#content-14
I’m in favor of making JSON.stringify() produce ASCII-safe output through the use of escape sequences if this change doesn’t break the Web. Sadly, we don’t know if it does.
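For illustration, ASCII-safe output could be approximated in userland by escaping every code unit above U+007E (`asciiSafeStringify` is a hypothetical helper for this sketch, not a proposed API):

```javascript
// Hypothetical sketch: escape every code unit above U+007E (including
// unpaired surrogates) as \uXXXX, so the output is pure ASCII and
// survives transcoding even when the input is ill-formed UTF-16.
function asciiSafeStringify(value) {
  return JSON.stringify(value).replace(
    /[\u007f-\uffff]/g,
    (ch) => "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

console.log(asciiSafeStringify("café"));   // prints "caf\u00e9"
console.log(asciiSafeStringify("\ud800")); // prints "\ud800" (ASCII-only)
```

The escaped form parses back to the original string, lone surrogates included, which is the interoperability win being argued for.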
allenwb
Jun 30, 2017
Member
A couple of follow-up thoughts on the old discussion:
-
Perhaps it's time to consider extending JSON.stringify to accept an options object. Note that under the current spec (JSON.stringify step 4) the second argument is ignored if it is an object that is neither an array nor a function, so the API could be extended to accept an options object in that position.
-
This issue extends beyond JSON.stringify. Perhaps we should consider adding a String.prototype.escapeForUTF8 method (or escapeUnmatchedSurrogates).
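Both ideas can be sketched in userland. The ignored-second-argument behaviour is per the current spec; the escaping function is an assumption about what `escapeUnmatchedSurrogates` might do, not a concrete proposal (`asciiOnly` is likewise a hypothetical option name):

```javascript
// Per JSON.stringify step 4, a second argument that is an object but
// neither an array nor a function is ignored, leaving that position
// free for a future options bag (asciiOnly is hypothetical):
console.log(JSON.stringify({ a: 1 }, { asciiOnly: true })); // prints {"a":1}

// Sketch of the suggested method as a plain function: escape only
// *unpaired* surrogate code units, leaving valid pairs intact.
function escapeUnmatchedSurrogates(s) {
  return s.replace(
    /[\ud800-\udbff](?![\udc00-\udfff])|(?<![\ud800-\udbff])[\udc00-\udfff]/g,
    (ch) => "\\u" + ch.charCodeAt(0).toString(16)
  );
}
```

Well-formed input passes through untouched, so `escapeUnmatchedSurrogates("\u{10000}")` returns its argument unchanged.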
domenic
Jun 30, 2017
Member
I think it's at least worth pointing out the easy fix, which is to change the phrase "String in UTF-16 encoded JSON format" to "String in JSON format".
annevk
Jun 30, 2017
Contributor
If you need UTF-8, you need to replace unpaired surrogates in a string with U+FFFD. Most platform APIs will do this for you. Perhaps that's a primitive we should expose somehow...
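That replacement can be written in userland (engines have since exposed essentially this as String.prototype.toWellFormed, which was not available at the time; the function below is a sketch):

```javascript
// Replace every unpaired surrogate code unit with U+FFFD, yielding a
// string that can be losslessly encoded as UTF-8 (or UTF-16).
function replaceUnpairedSurrogates(s) {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff) {
      const next = s.charCodeAt(i + 1); // NaN at end of string
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += s[i] + s[i + 1]; // valid surrogate pair: keep both units
        i++;
      } else {
        out += "\ufffd"; // unpaired high surrogate
      }
    } else if (c >= 0xdc00 && c <= 0xdfff) {
      out += "\ufffd"; // unpaired low surrogate
    } else {
      out += s[i];
    }
  }
  return out;
}
```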
bathos
Jun 30, 2017
Contributor
I’m curious, does escaping not just move the problem around? ES permits invalid Unicode, and even if unpaired surrogate code units are escaped, the result still describes an invalid Unicode string, doesn’t it? Maybe that’s still better than nothing since other agents can then safely decode. Is that the only goal?
annevk
Jun 30, 2017
Contributor
If you replace unpaired surrogates with U+FFFD, you no longer have code points that cannot be represented by UTF-8 (or UTF-16). I'm not really sure what you mean with "invalid Unicode".
bathos
Jun 30, 2017
Contributor
@annevk Oh, I misunderstood — I thought the idea was to replace with \u{SOME_SURROGATE}. (Got mixed up when following a link into a related thread).
Maxdamantus
Jul 1, 2017
@bathos It describes a Unicode string that is not well-formed, but as mentioned in 3.9 of the standard (http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf), "Unicode strings need not contain well-formed code unit sequences under all conditions". It gives an example there of concatenating ill-formed UTF-16 sequences to create a well-formed UTF-16 sequence, like in my split/join examples.
bathos
Jul 1, 2017
Contributor
Thanks, @Maxdamantus. "Ill-formed" is probably the right term for what I meant, though either way, my comment was an error — I got this confused with something proposed elsewhere.
Though re-reading the initial issue here, I guess my comment was still applicable after all? This does seem to be about escaping bad surrogate sequences to \u notation as well. I don’t know what higher-level goals inform this so my two cents are pretty worthless, but my naive expectation would be U+FFFD conversions to maximize interoperability, since JSON is largely a data interchange format; on the other hand there are reasons I can think of why one might not want it to ever be lossy (e.g. perhaps there are string encodings used to transmit binary data that might produce these sequences).
Maxdamantus
Jul 3, 2017
I imagine it would be undesirable at this point to produce replacement characters, since parsing the string that is currently produced works perfectly well for the relevant cases, as long as it never gets encoded into another UTF (or, more accurately, interpreted as UTF-16) in the middle. That is, JSON.parse(JSON.stringify(s)) === s regardless of whether s is well-formed UTF-16, and there must be at least some code somewhere that relies on this.
I think generally it should be perfectly fine to encode arbitrary ES strings, and that since the JSON encoding is supposedly a textual one based on Unicode characters (e.g. {, }, \, u, ", [, …), it should produce an actual well-formed Unicode string. If JSON.stringify could be respecified without regard for existing applications, I'd actually be approximately in the middle between thinking it should be well-formed Unicode and thinking it should be printable ASCII (as @mathiasbynens mentioned). I'd probably lean slightly towards Unicode, due to space efficiency and the fact that if something can't pass around a well-formed Unicode string nowadays, it's probably due to a bug that should be fixed.
As it currently is, encodeURIComponent(JSON.stringify("\u{10000}"[0])) is specified to throw a URIError.
Maxdamantus commented Jun 30, 2017 • edited
JSON.stringify is described as returning a "String in UTF-16 encoded JSON format representing an ECMAScript value", but because ECMAScript strings can contain sequences of code units that are not well-formed UTF-16, and QuoteJSONString does not account for this, the ill-formedness of the value being encoded is carried over to the output. The effect is that the JSON string cannot be correctly converted to another UTF such as UTF-8: the conversion might fail or involve inserting replacement characters.
Example strange behaviour when invoking js (SpiderMonkey) or node:
I propose something similar to the fragment below (not currently written in formal ECMAScript style) to be added to the specification for QuoteJSONString, before the final "Else" that currently exists.
Note that this change would affect only the encoding of strings that are not well-formed UTF-16 code unit sequences (which will most likely fail to translate to other UTFs anyway). Any well-formed UTF-16 code unit sequence (a sequence where every member is either a high surrogate code unit followed by a low surrogate code unit, or a non-surrogate code unit) is encoded in the same way as currently specified.