Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify that encoding tokens are scalar values #195

Closed
andreubotella opened this issue Jan 14, 2020 · 2 comments · Fixed by #196
Closed

Clarify that encoding tokens are scalar values #195

andreubotella opened this issue Jan 14, 2020 · 2 comments · Fixed by #196
Labels
clarification Standard could be clearer good first issue Ideal for someone new to a WHATWG standard or software project security/privacy There are security or privacy implications

Comments

@andreubotella
Copy link
Member

andreubotella commented Jan 14, 2020

It seems like the fact that Unicode tokens are scalar values, rather than code points, is far from clear in the spec. Other than the usage of USVString and related algorithms in the API section, the only mention of scalar values in the normative text is in the definition of encoding. In fact, the definition of token refers explicitly to code points, rather than scalar values.

This could actually result in a security issue if a specification weren't careful when using the encoding hooks – encoding handlers based on indices would raise an error on surrogate code points, but the UTF-8 handler would go along with it, returning a byte sequence which would fail on decoding.

I propose adding some text to the note in the hooks section informing specs that they should only invoke the encoding algorithms with streams built from a USVString, as well as adding an assertion in the process algorithm that, if encoderDecoderInstance is an encoder instance, input must not be a surrogate.

@annevk annevk added clarification Standard could be clearer security/privacy There are security or privacy implications labels Jan 14, 2020
@annevk
Copy link
Member

annevk commented Jan 14, 2020

I support both those suggestions, though would suggest that rather than explicitly mentioning USVString we stick to https://infra.spec.whatwg.org/#scalar-value-string or equivalent as it's more widely applicable.

(Anyone that's able to sign https://participate.whatwg.org/agreement should feel free to pick this up and work on a PR.)

@annevk annevk added the good first issue Ideal for someone new to a WHATWG standard or software project label Jan 14, 2020
@andreubotella
Copy link
Member Author

I'm working on a pull request for this issue.

annevk pushed a commit that referenced this issue Jan 20, 2020
This replaces some occurrences of "code point" with "scalar value", since it might not be clear to passing readers that output streams from decoding and input streams to encoding cannot contain tokens which are surrogates.

Fixes #195.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clarification Standard could be clearer good first issue Ideal for someone new to a WHATWG standard or software project security/privacy There are security or privacy implications
Development

Successfully merging a pull request may close this issue.

2 participants