helpers for RFC 7230 productions #24

njsmith · 2016-07-02T05:13:05Z

Hello! I'm looking at improving h11's handling of URI-related stuff, and of course RFC 7230 delegates a bunch of the heavy lifting to RFC 3986. And I'd kinda rather not have to implement an RFC 3986 parser from scratch.

Annoyingly, though, RFC 7230 likes to refer directly to some of the intermediate productions inside RFC 3986. Specifically, I need to be able to:

check if a string matches the production origin-form = absolute-path [ "?" query] (where absolute-path and query are from RFC 3986)
check if a string matches the production authority, with empty user-info (I guess checking for "no user-info" is easy if you can check for authority, since an authority has a "@" in it iff it has a non-empty user-info)
check if a string matches the production host ":" port
check if a string is a valid absolute-URI, and if so, split it into scheme, authority, and everything else (path + query).
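
For illustration, the first item in the list above can be sketched with nothing but the standard library. This is a hand transcription of the relevant RFC 3986 / RFC 7230 ABNF, not code taken from rfc3986; every constant name and the helper `is_origin_form` are assumptions made up for this example.

```python
import re

# Hand-transcribed character classes from RFC 3986. These names are
# illustrative only and are not attributes of the rfc3986 package.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = "(?:[" + UNRESERVED + SUB_DELIMS + ":@]|" + PCT_ENCODED + ")"
# absolute-path = 1*( "/" segment ), where segment = *pchar
ABSOLUTE_PATH = "(?:/" + PCHAR + "*)+"
# query = *( pchar / "/" / "?" )
QUERY = "(?:" + PCHAR + "|[/?])*"
# origin-form = absolute-path [ "?" query ]
ORIGIN_FORM_RE = re.compile(ABSOLUTE_PATH + r"(?:\?" + QUERY + ")?")


def is_origin_form(target):
    """Return True if the whole string matches origin-form."""
    return ORIGIN_FORM_RE.fullmatch(target) is not None
```

Because the pattern is assembled from text fragments, each line can be checked against the ABNF rule quoted in the comment above it.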

AFAICT rfc3986 has all the stuff it needs for doing these things, but most of it's not exposed. (Except for the last one -- I think I could implement that using parsed = rfc3986.uri_reference(purported_url); assert parsed.fragment is None; everything_else = parsed.path + "?" + parsed.query.)
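
A rough standard-library sketch of that last item, assuming the non-validating splitting regex from RFC 3986 Appendix B is acceptable as a starting point. The helper name `split_absolute_uri` is hypothetical, and the sketch only splits; it does not check that each component is well-formed. It also covers the case where the query is absent, which a plain `path + "?" + query` concatenation would not handle.

```python
import re

# Splitting regex adapted from RFC 3986 Appendix B (non-capturing groups
# added). It separates components but does not validate their contents.
URI_SPLIT_RE = re.compile(
    r"(?:([^:/?#]+):)?"   # scheme
    r"(?://([^/?#]*))?"   # authority
    r"([^?#]*)"           # path
    r"(?:\?([^#]*))?"     # query
    r"(?:#(.*))?",        # fragment
    re.DOTALL,
)


def split_absolute_uri(text):
    """Split into (scheme, authority, path-plus-query).

    Raises ValueError if there is no scheme or if a fragment is present,
    since absolute-URI = scheme ":" hier-part [ "?" query ].
    """
    match = URI_SPLIT_RE.fullmatch(text)
    if match is None:
        raise ValueError("not splittable as a URI reference")
    scheme, authority, path, query, fragment = match.groups()
    if scheme is None:
        raise ValueError("absolute-URI requires a scheme")
    if fragment is not None:
        raise ValueError("absolute-URI must not carry a fragment")
    rest = path if query is None else path + "?" + query
    return scheme, authority, rest
```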

Ideally what I'd want is the regex text (as opposed to compiled regex objects) for each of those productions. Is that something rfc3986 could easily provide?

(P.S.: what's going on with unicode handling? It seems like on py3, if I pass a byte-string to uri_reference then I get regular strs back?)


sigmavirus24 · 2016-07-02T14:47:50Z

Ideally what I'd want is the regex text (as opposed to compiled regex objects)

Why do you want the regular expression text instead of functions on rfc3986 or methods on URIReferences? (Or if you're replacing urlparse, the compatibility layer there)

I'm not sure I see the benefit to providing the regular expression text versus an API for this. I'm open to being convinced though.

what's going on with unicode handling? It seems like on py3, if I pass a byte-string to uri_reference then I get regular strs back?

This library was originally designed with OpenStack/Requests in mind. Requests doesn't like bytestrings for URLs for reasons and it's simpler to reason about everything internally if it's all exactly one type.

njsmith · 2016-07-02T15:04:43Z

Thanks for the response!

Why do you want the regular expression text instead of functions on rfc3986 or methods on URIReferences?

Because I want to be able to write a single regex that checks for quirky RFC 7230 productions like host ":" port, which is easy if you can hand me the regexes that match host and port. RFC 7230's parsing requirements can't obviously be expressed in terms of parsing whole URIs. (And I'd rather transcribe the RFC text as directly as possible instead of trying to get cute with writing code that I think is probably equivalent.)
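
To make the "hand me the regex text" idea concrete, here is a minimal sketch of composing host ":" port from text fragments. The fragments below are stand-ins written for this example, not values exported by rfc3986, and the IP-literal pattern is deliberately loose (it does not validate IPv6 grouping or IPvFuture).

```python
import re

# Stand-in fragments (assumptions for this sketch, NOT rfc3986 attributes).
# reg-name = *( unreserved / pct-encoded / sub-delims )
REG_NAME = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=]|%[0-9A-Fa-f]{2})*"
# Loose placeholder for IP-literal; a real pattern would be much stricter.
IP_LITERAL = r"\[[0-9A-Fa-f:.]+\]"
# IPv4address is syntactically covered by reg-name, so it is not listed.
HOST = "(?:" + IP_LITERAL + "|" + REG_NAME + ")"
# port = *DIGIT
PORT = r"[0-9]*"

# The RFC 7230-style production is a direct concatenation of the pieces.
HOST_PORT_RE = re.compile(HOST + ":" + PORT)


def is_host_port(value):
    """Return True if the whole string matches host ":" port."""
    return HOST_PORT_RE.fullmatch(value) is not None
```

With compiled pattern objects only, this kind of concatenation would need access to each object's `.pattern` attribute; with text it is a one-liner that mirrors the RFC.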

This library was originally designed with OpenStack/Requests in mind. Requests doesn't like bytestrings for URLs for reasons and it's simpler to reason about everything internally if it's all exactly one type.

Possibly I'm just being dense and this is documented somewhere, but I meant more, what are the semantics? What happens if I pass in a bytestring, and how do I interpret the strings I get back in edge cases? HTTP is resolutely defined in terms of bytestrings, for better or worse...

sigmavirus24 · 2016-07-02T15:37:23Z

Because I want to be able to write a single regex that checks for quirky RFC 7230 productions like host ":" port, which is easy if you can hand me the regexes that match host and port. RFC 7230's parsing requirements can't obviously be expressed in terms of parsing whole URIs. (And I'd rather transcribe the RFC text as directly as possible instead of trying to get cute with writing code that I think is probably equivalent.)

So, it sounds like the only real use for rfc3986 is the regular expression text then, yes? In that case, I'm not sure why you couldn't copy the regular expression strings into h11. I tried to name them descriptively. If you're concerned about copyright/licensing, I'd be happy to either give you permission or contribute it myself to h11.

Possibly I'm just being dense and this is documented somewhere, but I meant more, what are the semantics?

I think I'm the dense one in this case. URIs aren't resolutely defined as ASCII only (regardless of the fact that ABNF is typically US-ASCII only). See: Section 2. Queries can have unicode in them; e.g., many sites add utf8=✓ (or some variation thereof). If you passed that to rfc3986, you'd get:

~/s/rfc3986 git:master ❯❯❯ python3.5
Python 3.5.1 (default, Jun 15 2016, 20:14:41)
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import rfc3986
>>> uri = 'https://example.com/path?utf8=✓'
>>> rfc3986.uri_reference(uri)
URIReference(scheme='https', authority='example.com', path='/path', query='utf8=%e2%9c%93', fragment=None)
>>> _.unsplit()
'https://example.com/path?utf8=%e2%9c%93'

Which would be safe to just immediately .encode('utf-8'). You can also specify your own encoding if you want that with the URIReference object.

njsmith · 2016-07-03T06:45:50Z

copy the regular expression strings into h11

It's true, I could. How confident are you that you've gotten all the regexes exactly correct and they'll never need updates? (Serious question!) Also, I do actually need to parse absolute-URIs in some cases (see the last entry in my list above), and I suppose normalizing them might be polite too, so a dependency on rfc3986 makes some sense regardless, at which point it seems a little silly to have two copies of the same regexes in every install?

...at least assuming that RFC 3986-compliant regexes are actually sufficient for handling HTTP in practice. Having done a bit more reading, this appears to be more of a can of worms than I realized :-/.

unicode

So it looks like the rule is that browsers are supposed to and generally do percent-encode that unicode checkmark before sending it over the wire, so in theory h11 shouldn't need to know about utf-8. Hooray for theory!

That said: if I do pass in a byte-string to rfc3986, then it will decode it as utf-8 (or whatever I ask for, but whatwg/url says that decoding as anything besides utf-8 is insecure so that's fine), parse it in unicode-land, and then the fields of URIReference are stored as unicode strings, but are guaranteed to contain only ascii codepoints. Is that all right?
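
The semantics described in that paragraph can be mimicked with the standard library alone. This is a sketch of the idea, not rfc3986's implementation: the helper name `bytes_to_ascii_text` is hypothetical, and `urllib.parse.quote` emits uppercase hex digits, whereas the session earlier in the thread shows lowercase ones.

```python
import string
from urllib.parse import quote


def bytes_to_ascii_text(raw, encoding="utf-8"):
    """Decode bytes, then percent-encode anything outside ASCII.

    All ASCII punctuation (including "%") is marked safe so that existing
    delimiters and percent-escapes pass through unchanged. Spaces and
    non-ASCII characters are percent-encoded as UTF-8 bytes.
    """
    text = raw.decode(encoding)
    return quote(text, safe=string.punctuation)
```

The result is a `str` containing only ASCII code points, so calling `.encode("ascii")` on it to get back to bytes for the wire cannot fail.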

sigmavirus24 · 2016-07-03T11:52:20Z

How confident are you that you've gotten all the regexes exactly correct and they'll never need updates?

Fairly confident.

so a dependency on rfc3986 makes some sense regardless

So then would you object to helping me figure out how to expose these as helper functions or methods on a URIReference instead? I don't like exposing things like regexp text (or really the compiled regexps) because some people will start just by using those. I don't want to get into the game of yakshaving regexp performance with people or golfing regexps.

Having done a bit more reading, this appears to be more of a can of worms than I realized :-/.

If you're thinking of resolving URIs via RFC 3986, our URI Reference object can do that for you.

That said: if I do pass in a byte-string to rfc3986, then it will decode it as utf-8 (or whatever I ask for, but whatwg/url says that decoding as anything besides utf-8 is insecure so that's fine), parse it in unicode-land, and then the fields of URIReference are stored as unicode strings, but are guaranteed to contain only ascii codepoints. Is that all right?

That is correct. I also want to make sure that you understand the difference between URLs, URIs, and IRIs. There are subtle differences, and while this library can handle the first two, it does not (yet) handle IRIs (where there can be unicode in the authority). I've always planned to add that support but never had anyone else really need it, nor have I needed it, and so I haven't allocated time to it. So you could have someone provide you with a URL like

http://☃.net

(Which is actually a valid host.) That would need to be properly encoded to a URI. On Python 3, you can encode a string with idna but Python 2 lacks that. If h11 is server-side only, then you should not have to deal with that either. That said, if it's also for clients, I'm not sure you can rely on every client to be as good as urllib3 and requests.
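
As a small illustration of the idna encoding mentioned above, here is a sketch for Python 3 using the built-in `idna` codec, which implements the older IDNA 2003 rules. The helper name `host_to_ascii` is made up for this example.

```python
def host_to_ascii(host):
    """Encode an internationalized host name to its ASCII (punycode) form."""
    # The built-in "idna" codec converts each label separately and leaves
    # labels that are already ASCII untouched.
    return host.encode("idna").decode("ascii")


print(host_to_ascii("☃.net"))  # prints: xn--n3h.net
```

Only the host portion should be passed through this step; the rest of the URL is handled by percent-encoding instead.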

sigmavirus24 · 2017-03-09T13:46:38Z

Hey @njsmith, so I'm also doing a non-backwards compatible 1.0 version of RFC3986. If you can give me some more pointers on the regular expression bits that you want, that'd be awesome.

Also, probably worth noting that the library grew a replacement for urlparse so you can have a ParseResult or ParseResultBytes object if you only ever want to deal with bytes.

I'd love your feedback on this, if you have time to provide it.

sigmavirus24 · 2017-03-16T14:57:26Z

One last friendly ping, @njsmith

njsmith · 2017-03-16T21:01:45Z

Hey, sorry. I've been procrastinating on dealing with this because it turns out that no-one knows how URLs work and ugh, plus got distracted from writing my own HTTP server by writing my own networking library :-).

I'm not sure from your question what exactly you're looking for? The first message in this thread has a list of the RFC 3986 productions that show up in the HTTP RFCs, and having a bytes-only option sounds very nice. Is there something somewhere you wanted me to look at?

Add UseExisting and begin using it in the API. Also rename some now public attributes in rfc3986.abnf_regexp. Refs #24

sigmavirus24 modified the milestone: 1.0.0 Mar 28, 2017

sigmavirus24 mentioned this issue Apr 30, 2017

Create 1.0 Release #25

Merged

7 tasks

sigmavirus24 added a commit that referenced this issue May 7, 2017

Document misc and abnf_regexp submodules

18a689d

Add UseExisting and begin using it in the API. Also rename some now public attributes in rfc3986.abnf_regexp. Refs #24

sigmavirus24 closed this as completed May 16, 2017
