Change query state slightly to better deal with non-UTF-8 encodings
If the input to the URL parser contains code points outside the non-UTF-8 encoding's value space and the URL parser was invoked using a non-UTF-8 encoding, then those code points end up as &#...;.

The problem is that &, #, and ; are also URL separators, but the previous algorithm would only encode #. This ensures that & and ; are also encoded, as some browsers already do, but only if they came about as the result of the encode operation.

Tests: web-platform-tests/wpt#10915.

Fixes whatwg/encoding#139.
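
The per-code-point handling described above can be sketched roughly as follows. This is an illustration only, not the specification's algorithm text: the function name `encode_query_code_point` is hypothetical, and Python's `xmlcharrefreplace` error handler is assumed to be a close enough stand-in for the Encoding Standard's fallback that produces `&#...;` (Python's codec tables may also differ from the standard's in edge cases).

```python
def encode_query_code_point(c: str, encoding: str = "windows-1252") -> str:
    """Return the text appended to a URL's query for one code point (sketch)."""
    # Assumption: "xmlcharrefreplace" approximates the encoder fallback that
    # emits &#NNNN; for code points the target encoding cannot represent.
    data = c.encode(encoding, errors="xmlcharrefreplace")

    # The fallback fired: escape the &, # and ; it introduced so they cannot
    # be mistaken for URL separators, and keep the digits as they are.
    if data.startswith(b"&#") and data.endswith(b";"):
        data = b"%26%23" + data[2:-1] + b"%3B"
        return data.decode("latin-1")  # isomorphic decode: one byte, one code point

    # Otherwise apply the usual query percent-encoding rules byte by byte.
    out = []
    for byte in data:
        if byte < 0x21 or byte > 0x7E or byte in (0x22, 0x23, 0x3C, 0x3E):
            out.append(f"%{byte:02X}")
        else:
            out.append(chr(byte))
    return "".join(out)
```

In this sketch a literal `&` or `;` typed by the author passes through unchanged, because a single code point never encodes to bytes that both start with `&#` and end with `;` unless the fallback produced them.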
annevk committed May 23, 2018
1 parent fb623cf commit 2518aa4
Showing 1 changed file with 37 additions and 22 deletions.
59 changes: 37 additions & 22 deletions url.bs
@@ -2116,43 +2116,58 @@ string <var>input</var>, optionally with a <a>base URL</a> <var>base</var>, opti
      <p>then set <var>encoding</var> to <a>UTF-8</a>.
      <!-- https://simon.html5.org/test/url/url-encoding.html -->

+     <li><p>If <var>state override</var> is not given and <a>c</a> is U+0023 (#), then set
+     <var>url</var>'s <a for=url>fragment</a> to the empty string and state to
+     <a>fragment state</a>.
+
      <li>
-      <p>If <a>c</a> is the <a>EOF code point</a>, or <var>state override</var> is not given and
-      <a>c</a> is U+0023 (#), then:
+      <p>Otherwise, if <a>c</a> is not the <a>EOF code point</a>:

       <ol>
-       <li><p>Set <var>buffer</var> to the result of <a lt=encode>encoding</a> <var>buffer</var>
-       using <var>encoding</var>.
+       <li><p>If <a>c</a> is not a <a>URL code point</a> and not U+0025 (%),
+       <a>validation error</a>.
+
+       <li><p>If <a>c</a> is U+0025 (%) and <a>remaining</a> does not start with two
+       <a>ASCII hex digits</a>, <a>validation error</a>.
+
+       <li><p>Let <var>bytes</var> be the result of <a lt=encode>encoding</a> <a>c</a> using
+       <var>encoding</var>.

        <li>
-        <p>For each <var>byte</var> in <var>buffer</var>:
+        <p>If <var>bytes</var> starts with `<code>&amp;#</code>` and ends with 0x3B (;), then:

         <ol>
-         <li><p>If <var>byte</var> is less than 0x21 (!), greater than 0x7E (~), or is 0x22 ("),
-         0x23 (#), 0x3C (&lt;), or 0x3E (>), append <var>byte</var>,
-         <a lt="percent encode">percent encoded</a>, to <var>url</var>'s <a for=url>query</a>.
+         <li><p>Replace `<code>&amp;#</code>` at the start of <var>bytes</var> with
+         `<code>%26%23</code>`.

-         <li><p>Otherwise, append a code point whose value is <var>byte</var> to
-         <var>url</var>'s <a for=url>query</a>.
+         <li><p>Replace 0x3B (;) at the end of <var>bytes</var> with `<code>%3B</code>`.
+
+         <li><p>Append <var>bytes</var>, <a>isomorphic decoded</a>, to <var>url</var>'s
+         <a for=url>query</a>.
         </ol>

-       <li><p>Set <var>buffer</var> to the empty string.
+        <p class="note no-backref">This can happen when <a lt=encode>encoding</a> code points using
+        a non-<a>UTF-8</a> <a for=/>encoding</a>.

-       <li><p>If <a>c</a> is U+0023 (#), then set <var>url</var>'s <a for=url>fragment</a> to the
-       empty string and state to <a>fragment state</a>.
-      </ol>
+       <li>
+        <p>Otherwise, for each <var>byte</var> in <var>bytes</var>:

-     <li>
-      <p>Otherwise:
+        <ol>
+         <li>
+          <p>If one of the following is true

-      <ol>
-       <li><p>If <a>c</a> is not a <a>URL code point</a> and not U+0025 (%),
-       <a>validation error</a>.
+          <ul class=brief>
+           <li><p><var>byte</var> is less than 0x21 (!)
+           <li><p><var>byte</var> is greater than 0x7E (~)
+           <li><p><var>byte</var> is 0x22 ("), 0x23 (#), 0x3C (&lt;), or 0x3E (>)
+          </ul>

-       <li><p>If <a>c</a> is U+0025 (%) and <a>remaining</a> does not start with two
-       <a>ASCII hex digits</a>, <a>validation error</a>.
+          <p>then append <var>byte</var>, <a lt="percent encode">percent encoded</a>, to
+          <var>url</var>'s <a for=url>query</a>.

-       <li><p>Append <a>c</a> to <var>buffer</var>.
+         <li><p>Otherwise, append a code point whose value is <var>byte</var> to
+         <var>url</var>'s <a for=url>query</a>.
        </ol>
+      </ol>
     </ol>
