The text of this standard appears vulnerable to mismatching other standards #74

Closed
ilatypov opened this issue Nov 26, 2015 · 4 comments

@ilatypov

RFC 3986 suggests relying only on the smallest set of reserved characters necessary to split the URL into its 5 components (Section 5.2.1, Pre-parse the Base URI). Assuming the RFC implies left-to-right parsing, that would mean encoding only the terminator the parser expects in each component. The query component has the hash mark as its terminator.
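As a rough illustration of the five-component split, here is a minimal Python sketch built on the regular expression that RFC 3986 gives in its Appendix B. The function name `split_uri` and the example.com URLs are assumptions made for this example only.

```python
import re

# Regular expression from RFC 3986, Appendix B. Capture groups 2, 4, 5, 7
# and 9 hold scheme, authority, path, query and fragment respectively.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri):
    """Split a URI reference into (scheme, authority, path, query, fragment).

    Components that are absent come back as None (the path may be empty).
    """
    m = URI_RE.match(uri)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


if __name__ == "__main__":
    # The query runs up to the first "#"; raw ":", "/" and "?" inside it
    # do not confuse the split.
    print(split_uri("http://example.com/a/b?x=1&y=http://other/?z#frag"))
```

In this sketch the only character that ends the query is the hash mark, which matches the left-to-right reading described above.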

The RFC goes as far as recommending, in Section 3.4 (Query), that as many characters as possible be kept raw:

as [..] one frequently used [query] value is a 
reference to another URI, it is sometimes better 
for usability to avoid percent-encoding those 
characters.

On the other hand, the following part of the RFC implies encoding of many characters.

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
   query         = *( pchar / "/" / "?" )
   pct-encoded   = "%" HEXDIG HEXDIG

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved      = gen-delims / sub-delims
   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

https://www.ietf.org/rfc/rfc3986.txt

  • (a) Encode the special characters (sub-delims, :, @, /, ?) in (name value) pairs when generating the query component, so that they remain available as delimiters in the resulting query string. (In the RFC I found a special interpretation of only one sub-delim, =, as the name=value separator. The RFC remains silent about the role of the other special characters, even &, as delimiters of the resulting query string.)
  • (b) Encode characters that are not allowed in the resulting query string at all. (Despite this, browsers emit raw back-ticks and similar characters in their HTTP requests, as mentioned in a Mozilla bug referenced from issue #17, "Stop escaping byte 0x60 in query state".)
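The grammar quoted above can be checked mechanically. The following is a minimal Python sketch, not a full validator; the helper names `is_rfc3986_query` and `disallowed_chars` are hypothetical and exist only for this illustration.

```python
import re

# query = *( pchar / "/" / "?" ), with
# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
QUERY_CHAR = r"[A-Za-z0-9\-._~!$&'()*+,;=:@/?]"
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"
QUERY_RE = re.compile(r"(?:" + QUERY_CHAR + r"|" + PCT_ENCODED + r")*")


def is_rfc3986_query(text):
    """Return True if text matches the RFC 3986 `query` production."""
    return QUERY_RE.fullmatch(text) is not None


def disallowed_chars(text):
    """List the distinct characters that would need percent-encoding."""
    # Drop well-formed pct-encoded triplets first, then collect whatever
    # falls outside the allowed character set (a stray "%" included).
    rest = re.sub(PCT_ENCODED, "", text)
    return sorted(set(re.findall(r"[^A-Za-z0-9\-._~!$&'()*+,;=:@/?]", rest)))


if __name__ == "__main__":
    print(is_rfc3986_query("a=1&b=http://x/?y"))
    print(disallowed_chars("a=`b c%20"))
```

A raw back-tick or space fails the check, which is the category (b) case; the characters of category (a) pass it, because the grammar permits them in a query.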

Moreover, Appendix C (Delimiting a URI in Context) seems to imply that double quotes 22", whitespace 20SP, hyphens 2D- and angle brackets 3C<, 3E> need encoding when URLs are embedded in a text message directed at a human reader. It would be nice to stay strict about keeping parsers that are external to the URL parser out of scope, and to let additional encoders protect against specific external parsers. On the other hand, not every message reader applies a parser to line breaks, so percent-encoding the URL's own hyphens seems a reasonable option when URLs containing hyphens are split at line breaks. The RFC requirement mentioned in (b) above already protects double quotes 22", whitespace 20SP and angle brackets 3C<, 3E> through percent-encoding.

So far I see the following algorithms for encoding and decoding (name value) pairs as satisfying the RFC's musts and following its shoulds. I guess this should agree with https://github.com/tkem/uritools. (The RFC does not mention the vestige of the isindex HTML tag, which submits a request with words separated by plus characters: a plus character in the query part of the URL decodes to the space character.)

GetURLFromClient(network) -> URL
  (Because the network accepts only byte arrays, we 
  receive URLs as byte arrays).

SendHTTPGet(URL, network) -> response
  (Because the network accepts only byte arrays, we
  send URLs as byte arrays).

GetURLStringFromUserOrPage(browser) -> unicodeURL
  (Because input and rendering interact with humans, URL 
  parsers should accept strings with some special characters
  and Unicode characters that can be percent-encoded 
  without sacrificing the parsing of the URL's structure.
  For this, parsers may allow a mix of UTF-8 byte arrays and 
  UTF-16 code units when parsing percent-encoded strings).

RenderURLStringForUserOrPage(unicodeURL, browser) 
    --> browserShowingURLString
 (Because rendering special and Unicode characters raw
 in URL strings improves their interpretation by humans,
 we may need to show some pct-encoded special and
 non-ASCII characters as raw special and Unicode
 characters).

URLParser(URL) -> (scheme authority path query fragment):
  Split the string URL based on the structure:
      scheme ":" hier-part [ "?" query ] [ "#" fragment ]
  (The parser will split hier-part into authority and path 
  expecting an optional leading double-slash and a slash 
  indicating the beginning of the path).
  ==> query should hide its own ASCII hex 23#. The 
  encoder will provide that.

QueryParser(query, delimiters="&") -> *(name value)
  (a) Expect query to comply with the spec (no reasoning
  except protecting against the fragment 23# search).
    query = * (ALPHA / DIGIT / pct-encoded 
          / one-of 2D- 2E. 5F_ 7E~ 21! 24$ 26& 27' 28( 
                       29) 2A* 2B+ 2C, 3B; 3D= 3A: 40@ 2F/ 
                       3F?)
  ==> query must hide the following characters found 
  in *(name value):
    ASCII 00-1F 20SP 22" 23# 25% 2B+ 3C< 3E>
    5B[ 5C\ 5D] 5E^ 60` 7B{ 7C| 7D} 7FDEL
    non-ASCII.

  (b) Split the string query expecting a separator 26& 
  into *nameValuePlus elements.  To comply with an earlier 
  HTML4 suggestion on avoiding confusion in developers, 
  optionally allow other delimiters from sub-delims such as 3B; 
  2C, 21! 24$ 27' 28( 29) 2A*, as well as special pchars 
  3A: 40@ and special query characters 2F/ 3F?, if they 
  reside in the delimiters argument.

    http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.2
    http://stackoverflow.com/a/7287629/80772

  Split *nameValuePlus elements into pairs *(namePlus, valuePlus) 
  using the sub-delim 3D=.

  ==> names and values must protect own 26& 3D= 3B; 2C, 
  21! 24$ 27' 28( 29) 2A* 3A: 40@ 2F/ 3F?.

  (c) Always decode the "+" vestige to 20SP, resulting in
  *(namePct, valuePct) pairs.  (The inverse encoding of
  20SP may be done with either percent- or plus-encoding,
  and the latter appears clearer).

  (d) Decode percent-encoded UTF-8 in *(namePct, valuePct) 
  pairs to UTF-16 code units, resulting in *(name value) pairs.

==>
EncodeQuery(*(name value), vestigeSep=true) --> query
  For each character in each element of each pair *(name value):

  i. Encode the character if it falls into one of the following categories,
    using percent-encoding unless mentioned otherwise:
    26& 3D= (sub-delims from QueryParser.b to satisfy the standard
      query composer and name, value separator)

    3B; 2C, 21! 24$ 27' 28( 29) 2A* 3A: 40@  2F/ 3F? (sub-delims 
      and special characters from QueryParser.b to parse results of 
      unusual query composers; what's optional in the composer 
      becomes mandatory in the parser)

    00-1F (as not allowed by QueryParser.a)

    20SP, encoded as "+" instead of percent-encoding if vestigeSep
      (from QueryParser.c), percent-encoded otherwise

    22" 23# 25% 2B+ 3C< 3E> 5B[ 5C\ 5D] 5E^ 60` 7B{ 7C| 7D} 
      7FDEL (as not allowed by QueryParser.a)

    non-ASCII (using their UTF-8 presentation; from QueryParser.d)

  ii.  Add any other character unmodified.

  Join the encoded pairs (encodedPairs holds the results of steps i and ii):
  query = "&".join(["%s=%s" % (name, value) for (name, value) in encodedPairs])

EncodeURL(scheme, authority, path, query, fragment, delimiters="") -> URL
  URL = path
  if authority: URL = "//" + authority + URL
  if scheme: URL = scheme + ":" +  URL
  if query: URL += "?" + query 
  if fragment: URL += "#" + fragment
  if delimiters:
    For each character in URL:
      Percent-encode the character if it is found in delimiters.
      (Appendix C appears to demand protection of 2D- at line 
      breaks in the message context because the context lacks
      a standard URL delimiter parser.  Protecting quotes 27' in 
      HTML attributes containing URLs using this function does 
      not suffice because browsers will attempt to interpret 
      ampersands 26& that separate the query elements, 
        http://www.w3.org/TR/html5/syntax.html#before-attribute-value-state
      An HTML attribute encoder will protect any attribute value, 
      including URLs).
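The QueryParser and EncodeQuery prototypes above could be sketched in Python roughly as follows. This is one reading of the pseudocode, not a reference implementation: the snake_case names and the example.com URL are assumptions for this illustration. After all the listed categories are encoded, only the RFC's unreserved set stays raw, so the sketch tests for that set directly.

```python
import re
from urllib.parse import unquote

# Everything EncodeQuery step i lists gets encoded, which leaves only the
# RFC 3986 "unreserved" characters to pass through unmodified (step ii).
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def encode_component(text, vestige_sep=True):
    """Encode one name or value character by character."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            out.append(ch)
        elif ch == " " and vestige_sep:
            out.append("+")  # the isindex vestige: space as plus
        else:
            # Percent-encode each byte of the character's UTF-8 form.
            out.extend("%%%02X" % b for b in ch.encode("utf-8"))
    return "".join(out)


def encode_query(pairs, vestige_sep=True):
    """Join encoded (name, value) pairs with "=" and "&"."""
    return "&".join(
        "%s=%s" % (encode_component(n, vestige_sep),
                   encode_component(v, vestige_sep))
        for n, v in pairs
    )


def parse_query(query, delimiters="&"):
    """Split a query into decoded (name, value) pairs (steps b, c, d)."""
    splitter = "[" + re.escape(delimiters) + "]"
    pairs = []
    for element in re.split(splitter, query):
        if not element:
            continue  # skip empty elements such as a trailing "&"
        name, _, value = element.partition("=")
        # Step c: "+" to space. Step d: percent-decode as UTF-8.
        pairs.append((unquote(name.replace("+", " ")),
                      unquote(value.replace("+", " "))))
    return pairs


if __name__ == "__main__":
    pairs = [("q", "a b&c=d"), ("url", "http://example.com/?x=1#top")]
    query = encode_query(pairs)
    print(query)
    print(parse_query(query) == pairs)
```

Because every optional delimiter is encoded on the way out, the same encoded query can be parsed with `delimiters="&"` or with a wider set such as `"&;"` without the values being split by accident.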
@ilatypov changed the title from "The text of these standard appears vulnerable to mismatching other standards" to "The text of this standard appears vulnerable to mismatching other standards" on Nov 26, 2015
@annevk
Member

annevk commented Nov 26, 2015

This standard supersedes the RFC. What exactly is the problem you are trying to identify here?

@ilatypov
Author

When we change standards because the browsers do that (as in #17), and change browsers because standards say so, we risk deteriorating backward compatibility without adding usability. I wish the URL standard could include the parsing and the composing algorithms that could be easily proven to derive from a common-sense syntax.

For now, it would be nice to spell out the API for encoding (name value) pairs into the query string and decoding the string back, as in the above prototypes. I guess the other components already have their composing and parsing algorithms spelled out in the standard.

Another pair of pseudocode functions could write and read Unicode or UTF-16 presentations of URLs. This would need more investigation as to which special characters would be safe to reveal from their percent-encoding. So far I see no conflict if a subset of the non-query, non-delimiter characters

    5B[ 5C\ 5D] 5E^ 60` 7B{ 7C| 7D}

could be displayed raw in an enhanced-usability form and read back into a byte array, query-characters-only URL (I would avoid displaying the message context delimiter characters raw in the URL, as the presentation URL may be cut and pasted into messages). Also, %20 could be safely replaced with + for presentation.
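A minimal Python sketch of that presentation round trip follows, assuming that only the eight characters listed above are revealed. The names `to_display` and `from_display` and the example.com URL are hypothetical, and the %20 to + substitution is left out because it would apply to the query component only.

```python
import re

# The subset proposed above: 5B[ 5C\ 5D] 5E^ 60` 7B{ 7C| 7D}
REVEALABLE = "[\\]^`{|}"


def to_display(url):
    """Show pct-encoded revealable characters raw; keep other triplets."""
    def reveal(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in REVEALABLE else match.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", reveal, url)


def from_display(text):
    """Percent-encode the revealable characters again."""
    return "".join(
        "%%%02X" % ord(ch) if ch in REVEALABLE else ch for ch in text
    )


if __name__ == "__main__":
    wire = "http://example.com/?q=%5Ba%7Cb%5D%20c%26d"
    shown = to_display(wire)
    print(shown)
    print(from_display(shown) == wire)
```

The round trip is exact when the wire form uses uppercase hex digits; lowercase triplets such as %5b come back as %5B, which RFC 3986 treats as equivalent.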

@annevk
Member

annevk commented Nov 27, 2015

I don't really understand what you're asking for.

@annevk
Member

annevk commented Feb 11, 2016

Closing since the issue is unclear. Please let me know if you want to further discuss this.

@annevk annevk closed this as completed Feb 11, 2016