JSON.stringify produces invalid UTF-16 #944

Closed
Maxdamantus opened this issue Jun 30, 2017 · 11 comments
Labels
proposal This is related to a specific proposal, and will be closed/merged when the proposal reaches stage 4.

Comments

@Maxdamantus

Maxdamantus commented Jun 30, 2017

JSON.stringify is described as returning a "String in UTF-16 encoded JSON format representing an ECMAScript value", but because ECMAScript strings can contain sequences of code units that are not well-formed UTF-16, and QuoteJSONString does not account for this, the ill-formedness of the encoded value is carried over into the output.

The effect is that the JSON string cannot be correctly converted to another UTF such as UTF-8; the conversion might fail or insert replacement characters.
Example of the strange behaviour when invoking js (SpiderMonkey) or node:

$ js -e 'print(JSON.stringify("\u{10000}".split("")));' | js -e 'print(JSON.parse(readline()).join("") === "\u{10000}");'
false

$ node -e 'console.log(JSON.stringify("\u{10000}".split("")));' | node -e 'require("readline").createInterface({ input: process.stdin, terminal: false }).on("line", l => { console.log(JSON.parse(l).join("") === "\u{10000}"); });'
false

I propose adding something like the fragment below (not written in formal ECMAScript spec style) to the specification for QuoteJSONString, before the final "Else" that currently exists.

else if C is a low surrogate (C >= 0xDC00 && C < 0xE000),
	concatenate the \uHHHH notation of C to product
else if C is a high surrogate (C >= 0xD800 && C < 0xDC00),
	if it is the last code unit,
		concatenate the \uHHHH notation of C to product
	else,
		let N be the next code unit
		if N is a low surrogate (N >= 0xDC00 && N < 0xE000),
			concatenate C to product
			concatenate N to product
			progress iteration to the next code unit (N)
		else,
			concatenate the \uHHHH notation of C to product

Note that this change only affects the encoding of strings that are not well-formed UTF-16 code unit sequences (exactly the strings that will most likely fail to translate to other UTFs). Any well-formed UTF-16 code unit sequence (a sequence where every member is either a non-surrogate code unit or a high surrogate code unit immediately followed by a low surrogate code unit) is encoded in the same way as is currently specified.
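
For concreteness, a rough JavaScript rendering of the proposed surrogate handling (a sketch only; the function name is mine, and the rest of QuoteJSONString's escaping of quotes, backslashes and control characters is omitted for brevity):

function escapeUnpairedSurrogates(s) {
    let product = "";
    for (let i = 0; i < s.length; i++) {
        const c = s.charCodeAt(i);
        if (c >= 0xDC00 && c < 0xE000) {
            // A low surrogate seen here is unpaired: paired ones are
            // consumed together with their high surrogate below.
            product += "\\u" + c.toString(16).padStart(4, "0");
        } else if (c >= 0xD800 && c < 0xDC00) {
            const n = i + 1 < s.length ? s.charCodeAt(i + 1) : -1;
            if (n >= 0xDC00 && n < 0xE000) {
                product += s[i] + s[i + 1]; // well-formed pair: emit verbatim
                i++;
            } else {
                product += "\\u" + c.toString(16).padStart(4, "0"); // unpaired high surrogate
            }
        } else {
            product += s[i];
        }
    }
    return product;
}

So escapeUnpairedSurrogates("\u{10000}"[0]) yields the six characters \ud800, while any well-formed input is returned unchanged.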

@mathiasbynens
Member

Prior discussion: https://esdiscuss.org/topic/code-points-vs-unicode-scalar-values#content-14

I’m in favor of making JSON.stringify() produce ASCII-safe output through the use of escape sequences if this change doesn’t break the Web. Sadly, we don’t know if it does.

@allenwb
Member

allenwb commented Jun 30, 2017

A couple of follow-up thoughts on the old discussion.

  1. Perhaps it's time to consider extending JSON.stringify to accept an options object. Note that under the current spec (JSON.stringify step 4) the second argument is ignored if it is an object that is neither an array nor a function, so the API could be extended to accept an options object in that position (see the sketch after this list).

  2. This issue extends beyond JSON.stringify. Perhaps we should consider adding a String.prototype.escapeForUTF8 method (or escapeUnmatchedSurrogates).
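
A hypothetical sketch of what (1) could look like; the asciiOnly option name is invented here purely for illustration and is not part of any proposal:

// Hypothetical only: under the current spec, a second argument that is
// neither an array nor a function is ignored, so an options bag could
// occupy that position without changing the behaviour of existing code.
JSON.stringify({ s: "\u{10000}"[0] }, { asciiOnly: true });
// hypothetically: '{"s":"\\ud800"}'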

@domenic
Member

domenic commented Jun 30, 2017

I think it's at least worth pointing out the easy fix, which is to change the phrase "String in UTF-16 encoded JSON format" to "String in JSON format".

@annevk
Member

annevk commented Jun 30, 2017

If you need UTF-8, you need to replace unpaired surrogates in a string with U+FFFD. Most platform APIs will do this for you. Perhaps that's a primitive we should expose somehow...
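
For reference, that replacement fits in a few lines of JavaScript (a sketch; the lookbehind requires a newer engine):

// Replace unpaired surrogates with U+FFFD: a high surrogate not followed
// by a low surrogate, or a low surrogate not preceded by a high surrogate.
// Without the /u flag, the regex operates on individual code units.
const replaceUnpairedSurrogates = s =>
    s.replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g, "\uFFFD");

For what it's worth, this is the primitive the language did eventually expose, much later, as String.prototype.toWellFormed (ES2024).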

@bathos
Contributor

bathos commented Jun 30, 2017

I’m curious, does escaping not just move the problem around? ES permits invalid Unicode, and even if unpaired surrogate code units are escaped, the result still describes an invalid Unicode string, doesn’t it? Maybe that’s still better than nothing since other agents can then safely decode. Is that the only goal?

@annevk
Member

annevk commented Jun 30, 2017

If you replace unpaired surrogates with U+FFFD, you no longer have code points that cannot be represented by UTF-8 (or UTF-16). I'm not really sure what you mean by "invalid Unicode".

@bathos
Contributor

bathos commented Jun 30, 2017

@annevk Oh, I misunderstood — I thought the idea was to replace with \u{SOME_SURROGATE}. (Got mixed up when following a link into a related thread).

@Maxdamantus
Author

@bathos It describes a Unicode string that is not well-formed, but as mentioned in 3.9 of the standard (http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf), "Unicode strings need not contain well-formed code unit sequences under all conditions". It gives an example there of concatenating ill-formed UTF-16 sequences to create a well-formed UTF-16 sequence, like in my split/join examples.
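
Concretely, with the string from the original examples:

// Each half of U+10000 is ill-formed on its own, but concatenating the
// two halves restores a well-formed surrogate pair:
const [hi, lo] = "\u{10000}".split("");
console.log(hi + lo === "\u{10000}"); // true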

@bathos
Contributor

bathos commented Jul 1, 2017

Thanks, @Maxdamantus. "Ill-formed" is probably the right term for what I meant, though either way, my comment was an error — I got this confused with something proposed elsewhere.

Though re-reading the initial issue here, I guess my comment was still applicable after all? This does seem to be about escaping bad surrogate sequences to \u notation as well. I don’t know what higher-level goals inform this so my two cents are pretty worthless, but my naive expectation would be U+FFFD conversions to maximize interoperability, since JSON is largely a data interchange format; on the other hand there are reasons I can think of why one might not want it to ever be lossy (e.g. perhaps there are string encodings used to transmit binary data that might produce these sequences).

@Maxdamantus
Author

I imagine it would be undesirable at this point to produce replacement characters, since parsing the string that is currently produced works perfectly well for the relevant cases, as long as it never gets encoded into another UTF (or, more accurately, never gets interpreted as UTF-16) along the way. That is, JSON.parse(JSON.stringify(s)) === s regardless of whether s is well-formed UTF-16, and there must be at least some code somewhere that relies on this.
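
For example (assuming the JSON text is kept as UTF-16 code units end to end):

// Round-tripping a lone surrogate is lossless as long as the intermediate
// JSON text is never re-encoded:
const s = "\u{10000}"[0]; // a lone high surrogate
console.log(JSON.parse(JSON.stringify(s)) === s); // true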

I think it should generally be perfectly fine to encode arbitrary ES strings, and since the JSON encoding is supposedly a textual one based on Unicode characters (e.g., {, }, \, u, ", [, …), it should produce an actual well-formed Unicode string. If JSON.stringify could be respecified without regard for existing applications, I'd be roughly in the middle between thinking it should be well-formed Unicode and thinking it should be printable ASCII (as @mathiasbynens mentioned). I'd probably lean slightly towards Unicode, due to space efficiency and the fact that if something can't pass around a well-formed Unicode string nowadays, that's probably a bug that should be fixed.

As it currently is, encodeURIComponent(JSON.stringify("\u{10000}"[0])) is specified to throw a URIError.
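
That is, under the behaviour specified at the time of this discussion:

// The lone surrogate survives JSON.stringify verbatim, and
// encodeURIComponent must UTF-8-encode its argument, which is impossible
// for a lone surrogate:
try {
    encodeURIComponent(JSON.stringify("\u{10000}"[0]));
} catch (e) {
    console.log(e.name); // "URIError"
}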

@bakkot
Contributor

bakkot commented Nov 22, 2018

There is now a stage 3 proposal fixing this. Once it lands, this issue can be closed, I expect.
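
This appears to refer to the well-formed JSON.stringify proposal, under which lone surrogates are emitted as escape sequences, so the output is always well-formed UTF-16:

// With well-formed JSON.stringify, unpaired surrogates come out as \u
// escapes, so the result converts cleanly to UTF-8:
JSON.stringify("\u{10000}"[0]); // '"\\ud800"'
JSON.stringify("\u{10000}");    // '"𐀀"' (well-formed pairs pass through verbatim)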
