JSON.stringify produces invalid UTF-16 #944

Open
Maxdamantus opened this Issue Jun 30, 2017 · 10 comments

@Maxdamantus commented Jun 30, 2017

JSON.stringify is described as returning a "String in UTF-16 encoded JSON format representing an ECMAScript value", but because ECMAScript strings can contain sequences of code units that are not well-formed UTF-16, and QuoteJSONString does not account for this, the ill-formedness of the encoded value is carried over into the output.

The effect of this is that the JSON string cannot be correctly converted to another UTF such as UTF-8, so the conversion might fail or involve inserting replacement characters.
Example of the strange behaviour when invoking js (SpiderMonkey) or node:

$ js -e 'print(JSON.stringify("\u{10000}".split("")));' | js -e 'print(JSON.parse(readline()).join("") === "\u{10000}");'
false

$ node -e 'console.log(JSON.stringify("\u{10000}".split("")));' | node -e 'require("readline").createInterface({ input: process.stdin, terminal: false }).on("line", l => { console.log(JSON.parse(l).join("") === "\u{10000}"); });'
false

I propose something similar to the fragment below (not currently written in formal ECMAScript style) to be added to the specification for QuoteJSONString, before the final "Else" that currently exists.

else if C is a low (trailing) surrogate (C >= 0xDC00 && C < 0xE000),
	concatenate the \uHHHH notation of C to product
else if C is a high (leading) surrogate (C >= 0xD800 && C < 0xDC00),
	if C is the last code unit,
		concatenate the \uHHHH notation of C to product
	else,
		let N be the next code unit
		if N is a low (trailing) surrogate (N >= 0xDC00 && N < 0xE000),
			concatenate C to product
			concatenate N to product
			progress iteration to the next code unit (N)
		else,
			concatenate the \uHHHH notation of C to product

Note that this change would only have a minimal effect on the encoding of strings that are not well-formed UTF-16 code unit sequences (and will most likely fail to translate to other UTFs). Any well-formed UTF-16 code unit sequence (a sequence where every member is either a high surrogate code unit followed by a low surrogate code unit or is a non-surrogate code unit) is encoded in the same way as is currently specified.
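For illustration, the proposed escaping can be sketched in JavaScript as a post-processing pass over JSON.stringify output (the helper name is mine, not spec text). On engines that emit raw lone surrogates it makes the output well-formed; on engines that already escape them it is a no-op:

```javascript
// Escape every unpaired surrogate code unit in a JSON text as \uHHHH,
// leaving well-formed surrogate pairs (and everything else) untouched.
// The function name is illustrative; it mirrors the proposal above.
function escapeLoneSurrogates(json) {
  return json.replace(/[\uD800-\uDFFF]/g, (c, i, s) => {
    const code = c.charCodeAt(0);
    if (code < 0xDC00) {
      // High (leading) surrogate: keep it only if a low surrogate follows.
      const next = s.charCodeAt(i + 1); // NaN at end of string
      if (next >= 0xDC00 && next < 0xE000) return c;
    } else {
      // Low (trailing) surrogate: keep it only if a high surrogate precedes.
      const prev = s.charCodeAt(i - 1); // NaN at start of string
      if (prev >= 0xD800 && prev < 0xDC00) return c;
    }
    return "\\u" + code.toString(16).padStart(4, "0");
  });
}

// A lone surrogate is escaped, so the JSON text becomes pure ASCII here:
console.log(escapeLoneSurrogates('"a\uD800b"')); // "a\ud800b"
// A well-formed pair passes through unchanged:
console.log(escapeLoneSurrogates('"\u{10000}"') === '"\u{10000}"'); // true
```

With this pass, JSON.parse(escapeLoneSurrogates(JSON.stringify(s))) === s still holds for every string s, while the intermediate JSON text can be safely transcoded to UTF-8.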

@mathiasbynens (Member) commented Jun 30, 2017

Prior discussion: https://esdiscuss.org/topic/code-points-vs-unicode-scalar-values#content-14

I’m in favor of making JSON.stringify() produce ASCII-safe output through the use of escape sequences if this change doesn’t break the Web. Sadly, we don’t know if it does.

@allenwb (Member) commented Jun 30, 2017

A couple of follow-up thoughts on the old discussion:

  1. Perhaps it's time to consider extending JSON.stringify to accept an options object. Note that under the current spec (JSON.stringify step 4) the second argument is ignored if it is an object that is neither an array nor a function, so the API could be extended to accept an options object in that position.

  2. This issue extends beyond JSON.stringify. Perhaps we should consider adding a String.prototype.escapeForUTF8 method (or escapeUnmatchedSurrogates).
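The compatibility claim in point 1 is easy to check: a plain object in the replacer position is ignored today, so an options object there would be backwards-compatible. The escapeSurrogates option name below is purely hypothetical:

```javascript
// Per JSON.stringify step 4, a second argument that is neither callable
// nor an array is ignored, so passing an options object in that position
// currently has no effect at all (and could later be given meaning).
// "escapeSurrogates" is a hypothetical option, not a real one.
const plain = JSON.stringify({ a: "\uD800" });
const withOptions = JSON.stringify({ a: "\uD800" }, { escapeSurrogates: true });
console.log(withOptions === plain); // true: the "options" are ignored
```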

@domenic (Member) commented Jun 30, 2017

I think it's at least worth pointing out the easy fix, which is to change the phrase "String in UTF-16 encoded JSON format" to "String in JSON format".

@annevk (Contributor) commented Jun 30, 2017

If you need UTF-8, you need to replace unpaired surrogates in a string with U+FFFD. Most platform APIs will do this for you. Perhaps that's a primitive we should expose somehow...

@bathos (Contributor) commented Jun 30, 2017

I’m curious, does escaping not just move the problem around? ES permits invalid Unicode, and even if unpaired surrogate code units are escaped, the result still describes an invalid Unicode string, doesn’t it? Maybe that’s still better than nothing since other agents can then safely decode. Is that the only goal?

@annevk (Contributor) commented Jun 30, 2017

If you replace unpaired surrogates with U+FFFD, you no longer have code points that cannot be represented by UTF-8 (or UTF-16). I'm not really sure what you mean by "invalid Unicode".

@bathos (Contributor) commented Jun 30, 2017

@annevk Oh, I misunderstood — I thought the idea was to replace with \u{SOME_SURROGATE}. (Got mixed up when following a link into a related thread.)

@Maxdamantus commented Jul 1, 2017

@bathos It describes a Unicode string that is not well-formed, but as mentioned in 3.9 of the standard (http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf), "Unicode strings need not contain well-formed code unit sequences under all conditions". It gives an example there of concatenating ill-formed UTF-16 sequences to create a well-formed UTF-16 sequence, like in my split/join examples.

@bathos (Contributor) commented Jul 1, 2017

Thanks, @Maxdamantus. "Ill-formed" is probably the right term for what I meant, though either way, my comment was an error — I got this confused with something proposed elsewhere.

Though re-reading the initial issue here, I guess my comment was still applicable after all? This does seem to be about escaping bad surrogate sequences to \u notation as well. I don’t know what higher-level goals inform this so my two cents are pretty worthless, but my naive expectation would be U+FFFD conversions to maximize interoperability, since JSON is largely a data interchange format; on the other hand there are reasons I can think of why one might not want it to ever be lossy (e.g. perhaps there are string encodings used to transmit binary data that might produce these sequences).

@Maxdamantus commented Jul 3, 2017

I imagine it would be undesirable at this point to produce replacement characters: parsing the string that is currently produced works perfectly well for the relevant cases, as long as it never gets encoded into another UTF (or, more accurately, interpreted as UTF-16) somewhere in the middle. In other words, JSON.parse(JSON.stringify(s)) === s regardless of whether s is well-formed UTF-16, and there must be at least some code somewhere that relies on this.

I think it should generally be perfectly fine to encode arbitrary ES strings, and since the JSON encoding is supposedly a textual one based on Unicode characters (e.g. {, }, \, u, ", [, …), it should produce an actual well-formed Unicode string. If JSON.stringify could be respecified without regard for existing applications, I'd be roughly in the middle between thinking it should be well-formed Unicode and thinking it should be printable ASCII (as @mathiasbynens mentioned). I'd lean slightly towards Unicode, due to space efficiency and the fact that if something can't pass around a well-formed Unicode string nowadays, it's probably due to a bug that should be fixed.

As it currently is, encodeURIComponent(JSON.stringify("\u{10000}"[0])) is specified to throw a URIError.
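Both claims above are easy to check. (Note that engines which later adopted escaping of lone surrogates in JSON.stringify produce ASCII output here, so the URIError is demonstrated on the raw string rather than on the stringify result.)

```javascript
const lone = "\u{10000}"[0]; // "\uD800", a lone leading surrogate

// The stringify/parse round trip is lossless even for ill-formed input:
console.log(JSON.parse(JSON.stringify(lone)) === lone); // true

// But handing the ill-formed string itself to a UTF-8-based API throws:
try {
  encodeURIComponent(lone);
} catch (e) {
  console.log(e instanceof URIError); // true
}
```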
