Clarify that encoding tokens are scalar values #195

andreubotella · 2020-01-14T07:57:34Z

It seems like the fact that Unicode tokens are scalar values, rather than code points, is far from clear in the spec. Other than the usage of USVString and related algorithms in the API section, the only mention of scalar values in the normative text is in the definition of encoding. In fact, the definition of token refers explicitly to code points, rather than scalar values.

This could actually result in a security issue if a specification weren't careful when using the encoding hooks: encoding handlers based on indices would raise an error on surrogate code points, but the UTF-8 handler would accept them and return a byte sequence which would then fail on decoding.
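To make the failure mode concrete, here is a minimal Python sketch. It assumes a hand-written encoder, `naive_utf8_encode` (a hypothetical name, not taken from the spec text), that applies the usual UTF-8 bit layout without checking for surrogates, and it uses Python's strict built-in UTF-8 decoder as a stand-in for a conforming decoder.

```python
def naive_utf8_encode(code_point: int) -> bytes:
    """Apply the UTF-8 bit layout to one code point WITHOUT rejecting surrogates.

    Hypothetical helper for illustration only.
    """
    if code_point <= 0x7F:
        return bytes([code_point])
    if code_point <= 0x7FF:
        count, offset = 1, 0xC0
    elif code_point <= 0xFFFF:
        count, offset = 2, 0xE0
    else:
        count, offset = 3, 0xF0
    # Lead byte carries the top bits plus the length marker.
    out = [(code_point >> (6 * count)) + offset]
    # Each continuation byte carries the next six bits.
    while count > 0:
        temp = code_point >> (6 * (count - 1))
        out.append(0x80 | (temp & 0x3F))
        count -= 1
    return bytes(out)


def strict_decode_fails(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder rejects the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# U+D800 is a surrogate: the naive encoder happily produces three bytes,
# but a strict decoder refuses to read them back.
encoded = naive_utf8_encode(0xD800)
print(encoded.hex(), strict_decode_fails(encoded))
```

The encoder and decoder disagree about what counts as valid, which is the kind of mismatch the issue is worried about.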

I propose adding some text to the note in the hooks section informing specs that they should only invoke the encoding algorithms with streams built from a USVString, as well as adding an assertion in the process algorithm that, if encoderDecoderInstance is an encoder instance, input must not be a surrogate.
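The proposed assertion could look roughly like the following Python sketch. The names `is_surrogate` and `process_for_encoder` are hypothetical and the handler is reduced to a plain callable; the real process algorithm in the spec has more inputs and more steps than shown here.

```python
from typing import Callable


def is_surrogate(code_point: int) -> bool:
    """Surrogates are the code points U+D800 to U+DFFF, inclusive."""
    return 0xD800 <= code_point <= 0xDFFF


def process_for_encoder(token: int, handler: Callable[[int], bytes]) -> bytes:
    """Hypothetical stand-in for the encoder branch of a process step.

    Rejects surrogate tokens up front so every handler sees only scalar values.
    """
    if is_surrogate(token):
        raise ValueError(f"encoder input must not be a surrogate: U+{token:04X}")
    return handler(token)


def ascii_only_handler(token: int) -> bytes:
    """Toy handler used only to exercise the check above."""
    return bytes([token]) if token <= 0x7F else b"?"
```

With a check like this in one shared place, index-based handlers and the UTF-8 handler would see the same set of valid inputs.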


annevk · 2020-01-14T08:34:12Z

I support both suggestions, though rather than explicitly mentioning USVString I would suggest we stick to https://infra.spec.whatwg.org/#scalar-value-string or equivalent, as it's more widely applicable.
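For callers that cannot guarantee a scalar value string up front, one option is to sanitize before encoding. The Python sketch below assumes a hypothetical helper, `to_scalar_value_string`, that replaces every surrogate code point with U+FFFD. Note the caveat: Python strings are sequences of code points rather than 16-bit code units, so unlike Infra's conversion, which only replaces unpaired surrogates, this sketch replaces any surrogate it sees.

```python
REPLACEMENT = "\uFFFD"


def to_scalar_value_string(text: str) -> str:
    """Replace every surrogate code point with U+FFFD (hypothetical helper)."""
    return "".join(
        REPLACEMENT if 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in text
    )


# A string with a lone surrogate cannot be encoded strictly...
unsafe = "a\ud800b"
# ...but the sanitized version can.
safe = to_scalar_value_string(unsafe)
safe_bytes = safe.encode("utf-8")
```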

(Anyone that's able to sign https://participate.whatwg.org/agreement should feel free to pick this up and work on a PR.)

andreubotella · 2020-01-17T23:42:38Z

I'm working on a pull request for this issue.

This replaces some occurrences of "code point" with "scalar value", since it might not be clear to passing readers that output streams from decoding and input streams to encoding cannot contain tokens which are surrogates. Fixes #195.

annevk added the clarification (Standard could be clearer) and security/privacy (There are security or privacy implications) labels Jan 14, 2020

annevk added the good first issue (Ideal for someone new to a WHATWG standard or software project) label Jan 14, 2020

andreubotella mentioned this issue Jan 18, 2020

Clarify that encoding tokens are scalar values. #196

Merged

annevk closed this as completed in #196 Jan 20, 2020
