From 2518aa4ae09244eba67fbb4823014ef065492b76 Mon Sep 17 00:00:00 2001 From: Anne van Kesteren Date: Wed, 9 May 2018 10:55:15 +0200 Subject: [PATCH] Change query state slightly to better deal with non-UTF-8 encodings If the input to the URL parser contains code points outside the non-UTF-8 encoding's value space and the URL parser was invoked using a non-UTF-8 encoding, then those code points end up as &#...;. The problem is that &, #, and ; are also URL separators, but the previous algorithm would only encode #. This ensures that & and ; are also encoded, as some browsers already do, but only if they came about as the result of the encode operation. Tests: https://github.com/w3c/web-platform-tests/pull/10915. Fixes https://github.com/whatwg/encoding/issues/139. --- url.bs | 59 ++++++++++++++++++++++++++++++++++++---------------------- 1 file changed, 37 insertions(+), 22 deletions(-) diff --git a/url.bs b/url.bs index 59e1e816..dd5d571d 100644 --- a/url.bs +++ b/url.bs @@ -2116,43 +2116,58 @@ string input, optionally with a base URL base, opti

then set encoding to UTF-8. +

  • If state override is not given and c is U+0023 (#), then set + url's fragment to the empty string and state to + fragment state. +

  • -

    If c is the EOF code point, or state override is not given and - c is U+0023 (#), then: +

    Otherwise, if c is not the EOF code point:

      -
    1. Set buffer to the result of encoding buffer - using encoding. +

    2. If c is not a URL code point and not U+0025 (%), + validation error. + +

    3. If c is U+0025 (%) and remaining does not start with two + ASCII hex digits, validation error. + +

    4. Let bytes be the result of encoding c using + encoding.

    5. -

      For each byte in buffer: +

      If bytes starts with `&#` and ends with 0x3B (;), then:

        -
      1. If byte is less than 0x21 (!), greater than 0x7E (~), or is 0x22 ("), - 0x23 (#), 0x3C (<), or 0x3E (>), append byte, - percent encoded, to url's query. +

      2. Replace `&#` at the start of bytes with + `%26%23`. -

      3. Otherwise, append a code point whose value is byte to - url's query. +

      4. Replace 0x3B (;) at the end of bytes with `%3B`. + +

      5. Append bytes, isomorphic decoded, to url's + query.

      -
    6. Set buffer to the empty string. +

      This can happen when encoding code points using + a non-UTF-8 encoding. -

    7. If c is U+0023 (#), then set url's fragment to the - empty string and state to fragment state. -

    +
  • +

    Otherwise, for each byte in bytes: -

  • -

    Otherwise: +

      +
    1. +

      If one of the following is true -

        -
      1. If c is not a URL code point and not U+0025 (%), - validation error. +

          +
        • byte is less than 0x21 (!) +

        • byte is greater than 0x7E (~) +

        • byte is 0x22 ("), 0x23 (#), 0x3C (<), or 0x3E (>) +

        -
      2. If c is U+0025 (%) and remaining does not start with two - ASCII hex digits, validation error. +

        then append byte, percent encoded, to + url's query. -

      3. Append c to buffer. +

      4. Otherwise, append a code point whose value is byte to + url's query. +