JSON.stringify produces invalid UTF-16 #944

Open
Maxdamantus opened this Issue Jun 30, 2017 · 10 comments

@Maxdamantus commented Jun 30, 2017

JSON.stringify is described as returning a "String in UTF-16 encoded JSON format representing an ECMAScript value", but because ECMAScript strings can contain sequences of code units that are not well-formed UTF-16, and QuoteJSONString does not account for this, the ill-formedness of the encoded value is carried over into the output.

The effect of this is that the JSON string cannot be correctly converted to another UTF such as UTF-8, so the conversion might fail or involve inserting replacement characters.
Example of the strange behaviour when invoking js (SpiderMonkey) or node:

$ js -e 'print(JSON.stringify("\u{10000}".split("")));' | js -e 'print(JSON.parse(readline()).join("") === "\u{10000}");'
false

$ node -e 'console.log(JSON.stringify("\u{10000}".split("")));' | node -e 'require("readline").createInterface({ input: process.stdin, terminal: false }).on("line", l => { console.log(JSON.parse(l).join("") === "\u{10000}"); });'
false

I propose something similar to the fragment below (not currently written in formal ECMAScript style) to be added to the specification for QuoteJSONString, before the final "Else" that currently exists.

else if C is a low (trailing) surrogate (C >= 0xDC00 && C < 0xE000),
	concatenate the \uHHHH notation of C to product
else if C is a high (leading) surrogate (C >= 0xD800 && C < 0xDC00),
	if C is the last code unit,
		concatenate the \uHHHH notation of C to product
	else,
		let N be the next code unit
		if N is a low (trailing) surrogate (N >= 0xDC00 && N < 0xE000),
			concatenate C to product
			concatenate N to product
			progress iteration to the next code unit (N)
		else,
			concatenate the \uHHHH notation of C to product

Note that this change would only have a minimal effect on the encoding of strings that are not well-formed UTF-16 code unit sequences (and will most likely fail to translate to other UTFs). Any well-formed UTF-16 code unit sequence (a sequence where every member is either a high surrogate code unit followed by a low surrogate code unit or is a non-surrogate code unit) is encoded in the same way as is currently specified.
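For illustration, the proposed escaping can be sketched in JavaScript as a post-processing pass over JSON.stringify output (the helper name is mine, not spec text). On engines that emit raw lone surrogates it makes the output well-formed; on engines that already escape them it is a no-op:

```javascript
// Escape every unpaired surrogate code unit in a JSON text as \uHHHH,
// leaving well-formed surrogate pairs (and everything else) untouched.
// The function name is illustrative; it mirrors the proposal above.
function escapeLoneSurrogates(json) {
  return json.replace(/[\uD800-\uDFFF]/g, (c, i, s) => {
    const code = c.charCodeAt(0);
    if (code < 0xDC00) {
      // High (leading) surrogate: keep it only if a low surrogate follows.
      const next = s.charCodeAt(i + 1); // NaN at end of string
      if (next >= 0xDC00 && next < 0xE000) return c;
    } else {
      // Low (trailing) surrogate: keep it only if a high surrogate precedes.
      const prev = s.charCodeAt(i - 1); // NaN at start of string
      if (prev >= 0xD800 && prev < 0xDC00) return c;
    }
    return "\\u" + code.toString(16).padStart(4, "0");
  });
}

// A lone surrogate is escaped, so the JSON text becomes pure ASCII here:
console.log(escapeLoneSurrogates('"a\uD800b"')); // "a\ud800b"
// A well-formed pair passes through unchanged:
console.log(escapeLoneSurrogates('"\u{10000}"') === '"\u{10000}"'); // true
```

With this pass, JSON.parse(escapeLoneSurrogates(JSON.stringify(s))) === s still holds for every string s, while the intermediate JSON text can be safely transcoded to UTF-8.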

@mathiasbynens (Member) commented Jun 30, 2017

Prior discussion: https://esdiscuss.org/topic/code-points-vs-unicode-scalar-values#content-14

I’m in favor of making JSON.stringify() produce ASCII-safe output through the use of escape sequences if this change doesn’t break the Web. Sadly, we don’t know if it does.

@allenwb (Member) commented Jun 30, 2017

A couple of follow-up thoughts on the old discussion:

  1. Perhaps it's time to consider extending JSON.stringify to accept an options object. Note that under the current spec (JSON.stringify step 4) the second argument is ignored if it is an object that is neither an array nor a function, so the API could be extended to accept an options object in that position.

  2. This issue extends beyond JSON.stringify. Perhaps we should consider adding a String.prototype.escapeForUTF8 method (or escapeUnmatchedSurrogates).
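The compatibility claim in point 1 is easy to check: a plain object in the replacer position is ignored today, so an options object there would be backwards-compatible. The escapeSurrogates option name below is purely hypothetical:

```javascript
// Per JSON.stringify step 4, a second argument that is neither callable
// nor an array is ignored, so passing an options object in that position
// currently has no effect at all (and could later be given meaning).
// "escapeSurrogates" is a hypothetical option, not a real one.
const plain = JSON.stringify({ a: "\uD800" });
const withOptions = JSON.stringify({ a: "\uD800" }, { escapeSurrogates: true });
console.log(withOptions === plain); // true: the "options" are ignored
```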

@domenic (Member) commented Jun 30, 2017

I think it's at least worth pointing out the easy fix, which is to change the phrase "String in UTF-16 encoded JSON format" to "String in JSON format".

@annevk (Contributor) commented Jun 30, 2017

If you need UTF-8, you need to replace unpaired surrogates in a string with U+FFFD. Most platform APIs will do this for you. Perhaps that's a primitive we should expose somehow...

@bathos (Contributor) commented Jun 30, 2017

I’m curious, does escaping not just move the problem around? ES permits invalid Unicode, and even if unpaired surrogate code units are escaped, the result still describes an invalid Unicode string, doesn’t it? Maybe that’s still better than nothing since other agents can then safely decode. Is that the only goal?

@annevk (Contributor) commented Jun 30, 2017

If you replace unpaired surrogates with U+FFFD, you no longer have code points that cannot be represented by UTF-8 (or UTF-16). I'm not really sure what you mean by "invalid Unicode".

@bathos (Contributor) commented Jun 30, 2017

@annevk Oh, I misunderstood — I thought the idea was to replace with \u{SOME_SURROGATE}. (Got mixed up when following a link into a related thread.)

@Maxdamantus commented Jul 1, 2017

@bathos It describes a Unicode string that is not well-formed, but as mentioned in 3.9 of the standard (http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf), "Unicode strings need not contain well-formed code unit sequences under all conditions". It gives an example there of concatenating ill-formed UTF-16 sequences to create a well-formed UTF-16 sequence, like in my split/join examples.

@bathos (Contributor) commented Jul 1, 2017

Thanks, @Maxdamantus. "Ill-formed" is probably the right term for what I meant, though either way, my comment was an error — I got this confused with something proposed elsewhere.

Though re-reading the initial issue here, I guess my comment was still applicable after all? This does seem to be about escaping bad surrogate sequences to \u notation as well. I don’t know what higher-level goals inform this so my two cents are pretty worthless, but my naive expectation would be U+FFFD conversions to maximize interoperability, since JSON is largely a data interchange format; on the other hand there are reasons I can think of why one might not want it to ever be lossy (e.g. perhaps there are string encodings used to transmit binary data that might produce these sequences).

@Maxdamantus commented Jul 3, 2017

I imagine it would be undesirable at this point to produce replacement characters: parsing the string that is currently produced works perfectly well for the relevant cases, as long as it never gets encoded into another UTF (or, more accurately, interpreted as UTF-16) somewhere in the middle. In other words, JSON.parse(JSON.stringify(s)) === s regardless of whether s is well-formed UTF-16, and there must be at least some code somewhere that relies on this.

I think it should generally be perfectly fine to encode arbitrary ES strings, and since the JSON encoding is supposedly a textual one based on Unicode characters (e.g. {, }, \, u, ", [, …), it should produce an actual well-formed Unicode string. If JSON.stringify could be respecified without regard for existing applications, I'd be roughly in the middle between thinking it should be well-formed Unicode and thinking it should be printable ASCII (as @mathiasbynens mentioned). I'd lean slightly towards Unicode, due to space efficiency and the fact that if something can't pass around a well-formed Unicode string nowadays, it's probably due to a bug that should be fixed.

As it currently is, encodeURIComponent(JSON.stringify("\u{10000}"[0])) is specified to throw a URIError.
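Both claims above are easy to check. (Note that engines which later adopted escaping of lone surrogates in JSON.stringify produce ASCII output here, so the URIError is demonstrated on the raw string rather than on the stringify result.)

```javascript
const lone = "\u{10000}"[0]; // "\uD800", a lone leading surrogate

// The stringify/parse round trip is lossless even for ill-formed input:
console.log(JSON.parse(JSON.stringify(lone)) === lone); // true

// But handing the ill-formed string itself to a UTF-8-based API throws:
try {
  encodeURIComponent(lone);
} catch (e) {
  console.log(e instanceof URIError); // true
}
```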
