
Improve representation guidelines for bytes and code points

Apart from that, apply those guidelines to this document and adjust the range for byte sequences that can be represented as strings slightly, to account for 0x7F (DEL).

See whatwg/url#47.
annevk committed Mar 23, 2017
1 parent c17d8e3 commit 543b2c48b6e1babb50d77e5a3ba5b86be8b56eb9
Showing with 44 additions and 29 deletions.
  1. +44 −29 infra.bs
@@ -210,14 +210,20 @@ iteration.
<p>A <dfn export>byte</dfn> is a sequence of eight bits, represented as a double-digit hexadecimal
number in the range 0x00 to 0xFF, inclusive.

<p>An <dfn export>ASCII byte</dfn> is a <a>byte</a> in the range 0x00 to 0x7F, inclusive.
<p>An <dfn export>ASCII byte</dfn> is a <a>byte</a> in the range 0x00 (NUL) to 0x7F (DEL),
inclusive. As illustrated, an <a>ASCII byte</a> may be followed by the representation outlined in
the <a href=https://tools.ietf.org/html/rfc20#section-2>Standard Code</a> section of
<cite>ASCII format for Network Interchange</cite>, between parentheses. [[!RFC20]]

<p class=example id=example-byte-notation>0x49 (I) when <a>UTF-8 decoded</a> becomes the
<a>code point</a> U+0049 (I).

<h3 id=byte-sequences>Byte sequences</h3>

<p>A <dfn export>byte sequence</dfn> is a sequence of <a>bytes</a>, represented as a space-separated
sequence of bytes. Byte sequences with bytes in the range 0x20 to 0x7F, inclusive, can alternately
be written as a string, but using backticks instead of quotation marks, to avoid confusion with an
actual <a>string</a>.
sequence of bytes. Byte sequences with bytes in the range 0x20 (SP) to 0x7E (~), inclusive, can
alternately be written as a string, but using backticks instead of quotation marks, to avoid
confusion with an actual <a>string</a>.
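The notation rule above can be sketched in Python. This is only an illustration under stated assumptions: the function name `represent_byte_sequence` is hypothetical and not part of the standard, and the sketch does not try to resolve the case where the sequence itself contains 0x60 (`).

```python
def represent_byte_sequence(data: bytes) -> str:
    """Render a byte sequence in the notation described above (hypothetical helper)."""
    # Bytes in the range 0x20 (SP) to 0x7E (~), inclusive, can be written
    # as a string delimited by backticks instead of quotation marks.
    if data and all(0x20 <= b <= 0x7E for b in data):
        return "`" + data.decode("ascii") + "`"
    # Otherwise fall back to space-separated, double-digit hexadecimal bytes.
    return " ".join(f"0x{b:02X}" for b in data)
```

With the narrowed range, a sequence containing 0x7F (DEL) no longer qualifies for the string form and is written out byte by byte.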

<div class=example id=example-byte-sequence-notation>
<p>0x48 0x49 can also be represented as `<code>HI</code>`.
@@ -229,10 +235,10 @@ actual <a>string</a>.
<a>UTF-8 encode</a> from the Encoding Standard. [[ENCODING]]

<p>To <dfn export>byte-lowercase</dfn> a <a>byte sequence</a>, increase each <a>byte</a> it
contains, in the range 0x41 to 0x5A, inclusive, by 0x20.
contains, in the range 0x41 (A) to 0x5A (Z), inclusive, by 0x20.

<p>To <dfn export>byte-uppercase</dfn> a <a>byte sequence</a>, subtract each <a>byte</a> it
contains, in the range 0x61 to 0x7A, inclusive, by 0x20.
contains, in the range 0x61 (a) to 0x7A (z), inclusive, by 0x20.
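As a minimal sketch of the two operations above, assuming Python `bytes` as the byte sequence type (the function names are chosen here for illustration only):

```python
def byte_lowercase(data: bytes) -> bytes:
    """Increase each byte in the range 0x41 (A) to 0x5A (Z), inclusive, by 0x20."""
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in data)


def byte_uppercase(data: bytes) -> bytes:
    """Decrease each byte in the range 0x61 (a) to 0x7A (z), inclusive, by 0x20."""
    return bytes(b - 0x20 if 0x61 <= b <= 0x7A else b for b in data)
```

Bytes outside the two ranges, including the bytes of multi-byte UTF-8 sequences, pass through unchanged.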

<p>A <a>byte sequence</a> <var>A</var> is a <dfn export>byte-case-insensitive</dfn> match for a
<a>byte sequence</a> <var>B</var>, if the <a>byte-lowercase</a> of <var>A</var> is the
@@ -241,54 +247,63 @@ contains, in the range 0x61 to 0x7A, inclusive, by 0x20.
<h3 id=code-points>Code points</h3>

<p>A <dfn export>code point</dfn> is a Unicode code point and is represented as a four-to-six digit
hexadecimal number, typically prefixed with "U+". Often the name of the <a>code point</a> is also
included in capital letters afterward, potentially with the rendered form of the <a>code point</a>
in parentheses. [[!UNICODE]]
hexadecimal number, typically prefixed with "U+". A <a>code point</a> may be followed by its name,
by its rendered form between parentheses, or both. Documents using the Infra Standard are encouraged
to follow <a>code points</a> by their name when they cannot be rendered, and rendered form between
parentheses otherwise, for legibility.

<p>A <a>code point</a>'s name is defined in the Unicode Standard and represented in
<a>ASCII uppercase</a>. [[!UNICODE]]

<div class=example id=example-code-point-notation>
<p>The <a>code point</a> rendered as 🤔 is represented as U+1F914.

<p>When referring to that <a>code point</a>, we might instead say "U+1F914 THINKING FACE (🤔)",
instead of just "U+1F914", to provide extra context.
<p>When referring to that <a>code point</a>, we might say "U+1F914 (🤔)", to provide extra context.
Documents are allowed to use "U+1F914 THINKING FACE (🤔)" as well, though this is somewhat verbose.
</div>

<p class=example id=example-code-point-notation-hard-to-render><a>Code points</a> that are difficult
to render unambiguously, such as U+000A, can be referred to as "U+000A LF".
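The notation guidelines above could be approximated as follows. This is a sketch, not something the standard prescribes: `format_code_point` and the `SHORT_NAMES` table are hypothetical, the table only covers the short names that appear in this diff, and Python's `str.isprintable()` is used as a stand-in for "can be rendered".

```python
# Hypothetical table of short names for code points that are hard to render.
SHORT_NAMES = {
    0x0009: "TAB",
    0x000A: "LF",
    0x000C: "FF",
    0x000D: "CR",
    0x0020: "SPACE",
}


def format_code_point(cp: int) -> str:
    """Format a code point as "U+XXXX", followed by a name or rendered form."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Four-to-six digit uppercase hexadecimal, prefixed with "U+".
    label = f"U+{cp:04X}"
    if cp in SHORT_NAMES:
        # Hard-to-render code points are followed by their name.
        return f"{label} {SHORT_NAMES[cp]}"
    ch = chr(cp)
    if ch.isprintable():
        # Renderable code points are followed by their form in parentheses.
        return f"{label} ({ch})"
    return label
```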

<p>In certain contexts <a>code points</a> are prefixed with "0x" instead of "U+".

<p>A <dfn export>scalar value</dfn> is a <a>code point</a> that is not in the range
U+D800 to U+DFFF, inclusive.

<p>An <dfn export>ASCII code point</dfn> is a <a>code point</a> in the range U+0000 to U+007F,
inclusive.
<p>An <dfn export>ASCII code point</dfn> is a <a>code point</a> in the range U+0000 NULL to
U+007F DELETE, inclusive.

<p>An <dfn export lt="ASCII tab or newline|ASCII tabs or newlines">ASCII tab or newline</dfn> is
U+0009, U+000A, or U+000D.
U+0009 TAB, U+000A LF, or U+000D CR.

<p>An <dfn export>ASCII whitespace</dfn> is U+0009, U+000A, U+000C, U+000D, or U+0020.
<p>An <dfn export>ASCII whitespace</dfn> is U+0009 TAB, U+000A LF, U+000C FF, U+000D CR, or
U+0020 SPACE.

<p class=note>"Whitespace" is a mass noun.

<p>A <dfn export>C0 control</dfn> is a <a>code point</a> in the range U+0000 to U+001F, inclusive.
<p>A <dfn export>C0 control</dfn> is a <a>code point</a> in the range U+0000 NULL to
U+001F INFORMATION SEPARATOR ONE, inclusive.

<p>A <dfn export lt="C0 control or space|C0 controls or spaces">C0 control or space</dfn> is a
<a>C0 control</a> or U+0020.
<a>C0 control</a> or U+0020 SPACE.

<p>An <dfn export>ASCII digit</dfn> is a <a>code point</a> in the range U+0030 to U+0039,
<p>An <dfn export>ASCII digit</dfn> is a <a>code point</a> in the range U+0030 (0) to U+0039 (9),
inclusive.

<p>An <dfn export>ASCII upper hex digit</dfn> is an <a>ASCII digit</a> or a <a>code point</a> in the
range U+0041 to U+0046, inclusive.
range U+0041 (A) to U+0046 (F), inclusive.

<p>An <dfn export>ASCII lower hex digit</dfn> is an <a>ASCII digit</a> or a <a>code point</a> in the
range U+0061 to U+0066, inclusive.
range U+0061 (a) to U+0066 (f), inclusive.

<p>An <dfn export>ASCII hex digit</dfn> is an <a>ASCII upper hex digit</a> or
<a>ASCII lower hex digit</a>.

<p>An <dfn export>ASCII upper alpha</dfn> is a <a>code point</a> in the range U+0041 to U+005A,
inclusive.
<p>An <dfn export>ASCII upper alpha</dfn> is a <a>code point</a> in the range U+0041 (A) to
U+005A (Z), inclusive.

<p>An <dfn export>ASCII lower alpha</dfn> is a <a>code point</a> in the range U+0061 to U+007A,
inclusive.
<p>An <dfn export>ASCII lower alpha</dfn> is a <a>code point</a> in the range U+0061 (a) to
U+007A (z), inclusive.

<p>An <dfn export>ASCII alpha</dfn> is an <a>ASCII upper alpha</a> or <a>ASCII lower alpha</a>.
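The code point classes defined above translate directly into range checks. The following sketch assumes code points are passed as Python integers; all function names are illustrative, and the upper bound U+10FFFF for code points comes from Unicode rather than from this diff.

```python
def is_scalar_value(cp: int) -> bool:
    # A code point that is not in the range U+D800 to U+DFFF, inclusive.
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

def is_ascii_code_point(cp: int) -> bool:
    return 0x0000 <= cp <= 0x007F

def is_ascii_whitespace(cp: int) -> bool:
    # U+0009 TAB, U+000A LF, U+000C FF, U+000D CR, or U+0020 SPACE.
    return cp in (0x0009, 0x000A, 0x000C, 0x000D, 0x0020)

def is_c0_control(cp: int) -> bool:
    return 0x0000 <= cp <= 0x001F

def is_ascii_digit(cp: int) -> bool:
    return 0x0030 <= cp <= 0x0039

def is_ascii_upper_hex_digit(cp: int) -> bool:
    return is_ascii_digit(cp) or 0x0041 <= cp <= 0x0046

def is_ascii_lower_hex_digit(cp: int) -> bool:
    return is_ascii_digit(cp) or 0x0061 <= cp <= 0x0066

def is_ascii_hex_digit(cp: int) -> bool:
    return is_ascii_upper_hex_digit(cp) or is_ascii_lower_hex_digit(cp)

def is_ascii_upper_alpha(cp: int) -> bool:
    return 0x0041 <= cp <= 0x005A

def is_ascii_lower_alpha(cp: int) -> bool:
    return 0x0061 <= cp <= 0x007A

def is_ascii_alpha(cp: int) -> bool:
    return is_ascii_upper_alpha(cp) or is_ascii_lower_alpha(cp)
```

For example, `is_ascii_hex_digit(ord("f"))` holds while `is_ascii_hex_digit(ord("g"))` does not.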

@@ -317,8 +332,8 @@ the <a>string</a> with their corresponding <a>code point</a> in <a>ASCII upper a

<hr>

<p>To <dfn export>strip newlines</dfn> from a <a>string</a>, remove any U+000A LINE FEED and
U+000D CARRIAGE RETURN <a>code points</a> from the <a>string</a>.
<p>To <dfn export>strip newlines</dfn> from a <a>string</a>, remove any U+000A LF and U+000D CR
<a>code points</a> from the <a>string</a>.

<p>To <dfn export>strip leading and trailing ASCII whitespace</dfn> from a <a>string</a>, remove all
<a>ASCII whitespace</a> that are at the start or the end of the <a>string</a>.
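Both stripping operations can be sketched with Python's built-in string methods. The names below are illustrative. Note that calling `str.strip()` without arguments would also remove non-ASCII whitespace such as U+00A0, so the sketch passes the ASCII whitespace set explicitly.

```python
# U+0009 TAB, U+000A LF, U+000C FF, U+000D CR, and U+0020 SPACE.
ASCII_WHITESPACE = "\t\n\x0c\r "


def strip_newlines(s: str) -> str:
    """Remove any U+000A LF and U+000D CR code points from the string."""
    return s.replace("\n", "").replace("\r", "")


def strip_leading_and_trailing_ascii_whitespace(s: str) -> str:
    """Remove all ASCII whitespace at the start or the end of the string."""
    return s.strip(ASCII_WHITESPACE)
```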
@@ -443,7 +458,7 @@ interspersed <a>ASCII whitespace</a>.
<ol>
<li>
<p>Let <var>token</var> be the result of <a>collecting a sequence of code points</a> that are
not U+002C COMMA (,) from <var>input</var>, given <var>position</var>.
not U+002C (,) from <var>input</var>, given <var>position</var>.

<p class=note><var>token</var> might be the empty string.
</li>
@@ -456,8 +471,8 @@ interspersed <a>ASCII whitespace</a>.
<p>If <var>position</var> is not past the end of <var>input</var>, then:

<ol>
<li><p>Assert: the <a>code point</a> at <var>position</var> within <var>input</var> is U+002C
COMMA (,).
<li><p>Assert: the <a>code point</a> at <var>position</var> within <var>input</var> is
U+002C (,).

<li><p>Advance <var>position</var> by 1.
</ol>
