From 2518aa4ae09244eba67fbb4823014ef065492b76 Mon Sep 17 00:00:00 2001
From: Anne van Kesteren <annevk@annevk.nl>
Date: Wed, 9 May 2018 10:55:15 +0200
Subject: [PATCH] Change query state slightly to better deal with non-UTF-8
 encodings

If the URL parser is invoked with a non-UTF-8 encoding and its input contains code points that the encoding cannot represent, those code points end up as &#...;.

The problem is that &, #, and ; are also URL separators, but the previous algorithm only percent-encoded #. This change ensures that & and ; are percent-encoded too, as some browsers already do, but only when they were produced by the encode operation.

Tests: https://github.com/w3c/web-platform-tests/pull/10915.

Fixes https://github.com/whatwg/encoding/issues/139.
---
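Illustration only, not part of the change: a minimal Python sketch of the
per-code-point query handling described above. The function names
encode_query_code_point and encode_query are hypothetical, and Python's
"xmlcharrefreplace" error handler is assumed here as a stand-in for the
encoder emitting &#...; for unrepresentable code points. Validation errors
and the fragment (#) transition are left out.

```python
def encode_query_code_point(c: str, encoding: str = "utf-8") -> str:
    """Return the query-string form of a single code point (sketch)."""
    # Assumption: xmlcharrefreplace approximates the encoder producing
    # a decimal character reference for unrepresentable code points.
    data = c.encode(encoding, errors="xmlcharrefreplace")

    # A single code point only encodes to "&#...;" when the encoder had to
    # fall back to a character reference, so escape its delimiters.
    if data.startswith(b"&#") and data.endswith(b";"):
        data = b"%26%23" + data[2:-1] + b"%3B"
        return data.decode("latin-1")  # isomorphic decode

    out = []
    for byte in data:
        if byte < 0x21 or byte > 0x7E or byte in (0x22, 0x23, 0x3C, 0x3E):
            out.append(f"%{byte:02X}")  # percent encode the byte
        else:
            out.append(chr(byte))
    return "".join(out)


def encode_query(query: str, encoding: str = "utf-8") -> str:
    """Apply the per-code-point step to every code point of a query."""
    return "".join(encode_query_code_point(c, encoding) for c in query)


if __name__ == "__main__":
    # U+2603 is not representable in windows-1252.
    print(encode_query("a&b;\u2603", "windows-1252"))
```

A literal & or ; typed by the author is left alone because each code point is
encoded on its own; only the &, #, and ; that the encoder itself produced are
escaped.
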
 url.bs | 59 ++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 37 insertions(+), 22 deletions(-)
diff --git a/url.bs b/url.bs
index 59e1e816..dd5d571d 100644
--- a/url.bs
+++ b/url.bs
@@ -2116,43 +2116,58 @@ string <var>input</var>, optionally with a <a>base URL</a> <var>base</var>, opti
       <p>then set <var>encoding</var> to <a>UTF-8</a>.
       <!-- https://simon.html5.org/test/url/url-encoding.html -->
 
+     <li><p>If <var>state override</var> is not given and <a>c</a> is U+0023 (#), then set
+     <var>url</var>'s <a for=url>fragment</a> to the empty string and state to
+     <a>fragment state</a>.
+
      <li>
-      <p>If <a>c</a> is the <a>EOF code point</a>, or <var>state override</var> is not given and
-      <a>c</a> is U+0023 (#), then:
+      <p>Otherwise, if <a>c</a> is not the <a>EOF code point</a>:
 
       <ol>
-       <li><p>Set <var>buffer</var> to the result of <a lt=encode>encoding</a> <var>buffer</var>
-       using <var>encoding</var>.
+       <li><p>If <a>c</a> is not a <a>URL code point</a> and not U+0025 (%),
+       <a>validation error</a>.
+
+       <li><p>If <a>c</a> is U+0025 (%) and <a>remaining</a> does not start with two
+       <a>ASCII hex digits</a>, <a>validation error</a>.
+
+       <li><p>Let <var>bytes</var> be the result of <a lt=encode>encoding</a> <a>c</a> using
+       <var>encoding</var>.
 
        <li>
-        <p>For each <var>byte</var> in <var>buffer</var>:
+        <p>If <var>bytes</var> starts with `<code>&amp;#</code>` and ends with 0x3B (;), then:
 
         <ol>
-         <li><p>If <var>byte</var> is less than 0x21 (!), greater than 0x7E (~), or is 0x22 ("),
-         0x23 (#), 0x3C (&lt;), or 0x3E (>), append <var>byte</var>,
-         <a lt="percent encode">percent encoded</a>, to <var>url</var>'s <a for=url>query</a>.
+         <li><p>Replace `<code>&amp;#</code>` at the start of <var>bytes</var> with
+         `<code>%26%23</code>`.
 
-         <li><p>Otherwise, append a code point whose value is <var>byte</var> to
-         <var>url</var>'s <a for=url>query</a>.
+         <li><p>Replace 0x3B (;) at the end of <var>bytes</var> with `<code>%3B</code>`.
+
+         <li><p>Append <var>bytes</var>, <a>isomorphic decoded</a>, to <var>url</var>'s
+         <a for=url>query</a>.
         </ol>
 
-       <li><p>Set <var>buffer</var> to the empty string.
+        <p class="note no-backref">This can happen when <a lt=encode>encoding</a> code points using
+        a non-<a>UTF-8</a> <a for=/>encoding</a>.
 
-       <li><p>If <a>c</a> is U+0023 (#), then set <var>url</var>'s <a for=url>fragment</a> to the
-       empty string and state to <a>fragment state</a>.
-      </ol>
+       <li>
+        <p>Otherwise, for each <var>byte</var> in <var>bytes</var>:
 
-     <li>
-      <p>Otherwise:
+        <ol>
+         <li>
+          <p>If one of the following is true
 
-      <ol>
-       <li><p>If <a>c</a> is not a <a>URL code point</a> and not U+0025 (%),
-       <a>validation error</a>.
+          <ul class=brief>
+           <li><p><var>byte</var> is less than 0x21 (!)
+           <li><p><var>byte</var> is greater than 0x7E (~)
+           <li><p><var>byte</var> is 0x22 ("), 0x23 (#), 0x3C (&lt;), or 0x3E (>)
+          </ul>
 
-       <li><p>If <a>c</a> is U+0025 (%) and <a>remaining</a> does not start with two
-       <a>ASCII hex digits</a>, <a>validation error</a>.
+          <p>then append <var>byte</var>, <a lt="percent encode">percent encoded</a>, to
+          <var>url</var>'s <a for=url>query</a>.
 
-       <li><p>Append <a>c</a> to <var>buffer</var>.
+         <li><p>Otherwise, append a code point whose value is <var>byte</var> to
+         <var>url</var>'s <a for=url>query</a>.
+        </ol>
       </ol>
     </ol>