
[ct] (2) Make surrogates in UTF-8 and character references turn into U+FFFD to prevent UTF-16 environments having hard-to-handle bugs.

git-svn-id: http://svn.whatwg.org/webapps@3871 340c8d12-0b0e-0410-8428-c7bf67bfef74
Hixie committed Sep 16, 2009
1 parent 18f4c73 commit 6db21943d024e774d2aa52573981c130767034e9
Showing with 56 additions and 48 deletions.
  1. +28 −24 index
  2. +28 −24 source
52 index
motivated by a desire to increase the resilience of user agents in
the face of na&iuml;ve transcoders.</p>

<p>All U+0000 NULL characters in the input must be replaced by
U+FFFD REPLACEMENT CHARACTERs. Any occurrences of such characters is
a <a href=#parse-error>parse error</a>.</p>
<p>All U+0000 NULL characters and characters in the range U+D800 to
U+DFFF<!-- surrogates not allowed e.g. in UTF-8, and we don't want
them to suddenly turn into codepoints when they go through a UTF-16
pipe --> in the input must be replaced by U+FFFD REPLACEMENT
CHARACTERs. Any occurrences of such characters is a <a href=#parse-error>parse
error</a>.</p>

<p>Any occurrences of any characters in the ranges U+0001 to U+0008,
<!-- HT, LF allowed --> <!-- U+000B is in the next list --> <!-- FF,
CR allowed --> U+000E to U+001F, <!-- ASCII allowed --> U+007F
<!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+D800
to U+DFFF<!-- surrogates not allowed -->, U+FDD0 to U+FDEF, and
characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE,
U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF,
U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE,
U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF,
U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and
U+10FFFF are <a href=#parse-error title="parse error">parse errors</a>. (These
are all control characters or permanently undefined Unicode
characters.)</p>
<!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+FDD0
to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF,
U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE,
U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF,
U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE,
U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF,
U+10FFFE, and U+10FFFF are <a href=#parse-error title="parse error">parse
errors</a>. (These are all control characters or permanently
undefined Unicode characters.)</p>

<p>U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
characters are treated specially. Any CR characters that are
<tr><td>0x9D <td>U+009D <td>&lt;control&gt;
<tr><td>0x9E <td>U+017E <td>LATIN SMALL LETTER Z WITH CARON ('&#382;')
<tr><td>0x9F <td>U+0178 <td>LATIN CAPITAL LETTER Y WITH DIAERESIS ('&Yuml;')
</table><p>Otherwise, if the number is greater than 0x10FFFF, then this is
a <a href=#parse-error>parse error</a>. Return a U+FFFD REPLACEMENT
CHARACTER.</p>
</table><p>Otherwise, if the number is in the range 0xD800 to 0xDFFF<!--
surrogates not allowed; see the comment in the "preprocessing the
input stream" section for details --> or is greater than 0x10FFFF,
then this is a <a href=#parse-error>parse error</a>. Return a U+FFFD
REPLACEMENT CHARACTER.</p>

<p>Otherwise, return a character token for the Unicode character
whose code point is that number.
If the number is in the range 0x0001 to 0x0008, <!-- HT, LF
allowed --> <!-- U+000B is in the next list --> <!-- FF, CR
allowed --> 0x000E to 0x001F, <!-- ASCII allowed --> 0x007F <!--to
0x0084, (0x0085 NEL not allowed), 0x0086--> to 0x009F, 0xD800 to
0xDFFF<!-- surrogates not allowed -->, 0xFDD0 to 0xFDEF, or is one
of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF,
0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE,
0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF,
0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or
0x10FFFF, then this is a <a href=#parse-error>parse error</a>.</p>
0x0084, (0x0085 NEL not allowed), 0x0086--> to 0x009F, 0xFDD0 to
0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF,
0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF,
0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE,
0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF,
0x10FFFE, or 0x10FFFF, then this is a <a href=#parse-error>parse
error</a>.</p>

</dd>
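
The revised input-stream preprocessing rule in the hunks above can be illustrated with a minimal Python sketch. It is not taken from the commit: the names `preprocess` and `REPLACEMENT` are hypothetical, and the sketch assumes the input has already been decoded into a sequence of code points. It covers only the characters that are replaced (U+0000 and U+D800 to U+DFFF); the longer list of characters that are parse errors without being replaced is left out.

```python
REPLACEMENT = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER


def preprocess(text):
    """Replace U+0000 and surrogate code points with U+FFFD.

    Returns the cleaned text and the number of parse errors raised by
    the replaced characters (hypothetical return shape, for illustration).
    """
    out = []
    parse_errors = 0
    for ch in text:
        cp = ord(ch)
        if cp == 0x0000 or 0xD800 <= cp <= 0xDFFF:
            # Each replaced character is also a parse error.
            out.append(REPLACEMENT)
            parse_errors += 1
        else:
            out.append(ch)
    return "".join(out), parse_errors
```

For example, under these assumptions `preprocess("a\x00b\ud800c")` yields the text `a`, U+FFFD, `b`, U+FFFD, `c` together with a parse error count of 2.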

52 source
motivated by a desire to increase the resilience of user agents in
the face of na&iuml;ve transcoders.</p>

<p>All U+0000 NULL characters in the input must be replaced by
U+FFFD REPLACEMENT CHARACTERs. Any occurrences of such characters is
a <span>parse error</span>.</p>
<p>All U+0000 NULL characters and characters in the range U+D800 to
U+DFFF<!-- surrogates not allowed e.g. in UTF-8, and we don't want
them to suddenly turn into codepoints when they go through a UTF-16
pipe --> in the input must be replaced by U+FFFD REPLACEMENT
CHARACTERs. Any occurrences of such characters is a <span>parse
error</span>.</p>

<p>Any occurrences of any characters in the ranges U+0001 to U+0008,
<!-- HT, LF allowed --> <!-- U+000B is in the next list --> <!-- FF,
CR allowed --> U+000E to U+001F, <!-- ASCII allowed --> U+007F
<!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+D800
to U+DFFF<!-- surrogates not allowed -->, U+FDD0 to U+FDEF, and
characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE,
U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF,
U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE,
U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF,
U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and
U+10FFFF are <span title="parse error">parse errors</span>. (These
are all control characters or permanently undefined Unicode
characters.)</p>
<!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+FDD0
to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF,
U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE,
U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF,
U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE,
U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF,
U+10FFFE, and U+10FFFF are <span title="parse error">parse
errors</span>. (These are all control characters or permanently
undefined Unicode characters.)</p>

<p>U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
characters are treated specially. Any CR characters that are
<tr><td>0x9F <td>U+0178 <td>LATIN CAPITAL LETTER Y WITH DIAERESIS ('&#x0178;')
</table>

<p>Otherwise, if the number is greater than 0x10FFFF, then this is
a <span>parse error</span>. Return a U+FFFD REPLACEMENT
CHARACTER.</p>
<p>Otherwise, if the number is in the range 0xD800 to 0xDFFF<!--
surrogates not allowed; see the comment in the "preprocessing the
input stream" section for details --> or is greater than 0x10FFFF,
then this is a <span>parse error</span>. Return a U+FFFD
REPLACEMENT CHARACTER.</p>

<p>Otherwise, return a character token for the Unicode character
whose code point is that number.
If the number is in the range 0x0001 to 0x0008, <!-- HT, LF
allowed --> <!-- U+000B is in the next list --> <!-- FF, CR
allowed --> 0x000E to 0x001F, <!-- ASCII allowed --> 0x007F <!--to
0x0084, (0x0085 NEL not allowed), 0x0086--> to 0x009F, 0xD800 to
0xDFFF<!-- surrogates not allowed -->, 0xFDD0 to 0xFDEF, or is one
of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF,
0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE,
0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF,
0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or
0x10FFFF, then this is a <span>parse error</span>.</p>
0x0084, (0x0085 NEL not allowed), 0x0086--> to 0x009F, 0xFDD0 to
0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF,
0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF,
0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE,
0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF,
0x10FFFE, or 0x10FFFF, then this is a <span>parse
error</span>.</p>

</dd>
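
The revised handling of numeric character references can be sketched the same way. This is an illustration under stated assumptions, not the spec's algorithm verbatim: `resolve_numeric_reference` is a hypothetical name, and the sketch assumes the earlier table lookup (the rows ending with 0x9F) has already been tried and did not match. It returns the resulting character together with a flag saying whether a parse error occurred.

```python
def resolve_numeric_reference(number):
    """Map a numeric character reference to (character, is_parse_error).

    Assumes the number was not found in the preceding lookup table.
    """
    # Surrogates and out-of-range numbers become U+FFFD (parse error).
    if 0xD800 <= number <= 0xDFFF or number > 0x10FFFF:
        return "\uFFFD", True

    # Control characters listed in the text: still returned as-is,
    # but flagged as a parse error.
    is_control = (
        0x0001 <= number <= 0x0008
        or number == 0x000B
        or 0x000E <= number <= 0x001F
        or 0x007F <= number <= 0x009F
    )
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of
    # every plane (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
    is_noncharacter = (
        0xFDD0 <= number <= 0xFDEF
        or (number & 0xFFFE) == 0xFFFE
    )
    return chr(number), is_control or is_noncharacter
```

The bit mask is a compact stand-in for the explicit list of 34 plane-final code points in the text; it relies on the range check above having already rejected anything beyond 0x10FFFF.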
