Clarify that encoding tokens are scalar values #195
Labels
clarification
Standard could be clearer
good first issue
Ideal for someone new to a WHATWG standard or software project
security/privacy
There are security or privacy implications
It seems like the fact that Unicode tokens are scalar values, rather than code points, is far from clear in the spec. Other than the usage of USVString and related algorithms in the API section, the only mention of scalar values in the normative text is in the definition of encoding. In fact, the definition of token refers explicitly to code points, rather than scalar values.

This could actually result in a security issue if a specification weren't careful when using the encoding hooks – encoding handlers based on indices would raise an error on surrogate code points, but the UTF-8 handler would go along with it, returning a byte sequence which would fail on decoding.
I propose adding some text to the note in the hooks section informing specs that they should only invoke the encoding algorithms with streams built from a USVString, as well as adding an assertion in the process algorithm that, if encoderDecoderInstance is an encoder instance, input must not be a surrogate.
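
A rough sketch of what the two proposed safeguards amount to, again in Python and not as spec text. The function names are hypothetical; to_scalar_values imitates a USVString-style conversion by replacing each surrogate code point with U+FFFD, and process_encoder_token imitates the proposed assertion. Because a Python str is a sequence of code points rather than UTF-16 code units, every surrogate it contains is treated as unpaired here.

```python
def is_surrogate(code_point: int) -> bool:
    """Surrogates are the code points U+D800 to U+DFFF inclusive."""
    return 0xD800 <= code_point <= 0xDFFF


def to_scalar_values(text: str) -> str:
    """USVString-style conversion: replace each surrogate with U+FFFD."""
    return "".join("\ufffd" if is_surrogate(ord(ch)) else ch for ch in text)


def process_encoder_token(code_point: int) -> None:
    """Hypothetical stand-in for the assertion proposed for encoder instances."""
    assert not is_surrogate(code_point), "encoder input must be a scalar value"


# Callers that convert first can never trip the assertion.
for ch in to_scalar_values("a\ud800b"):
    process_encoder_token(ord(ch))
```

With the conversion applied at the call site, the assertion in the process step only documents an invariant; it fires solely when a caller skips the conversion.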