helpers for RFC 7230 productions #24

Closed
njsmith opened this issue Jul 2, 2016 · 8 comments

@njsmith
Member

njsmith commented Jul 2, 2016

Hello! I'm looking at improving h11's handling of URI-related stuff, and of course RFC 7230 delegates a bunch of the heavy lifting to RFC 3986. And I'd kinda rather not have to implement an RFC 3986 parser from scratch.

Annoyingly, though, RFC 7230 likes to refer directly to some of the intermediate productions inside RFC 3986. Specifically, I need to be able to:

  • check if a string matches the production origin-form = absolute-path [ "?" query] (where absolute-path and query are from RFC 3986)
  • check if a string matches the production authority, with empty user-info (I guess checking for "no user-info" is easy if you can check for authority, since an authority has a "@" in it iff it has a non-empty user-info)
  • check if a string matches the production host ":" port
  • check if a string is a valid absolute-URI, and if so, split it into scheme, authority, and everything else (path + query).

AFAICT rfc3986 has all the stuff it needs for doing these things, but most of it's not exposed. (Except for the last one -- I think I could implement that using parsed = rfc3986.uri_reference(purported_url); assert parsed.fragment is None; everything_else = parsed.path + "?" + parsed.query.)
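For illustration, here is a minimal sketch of that last item using only the standard library. It relies on the generic splitting regex from RFC 3986 Appendix B rather than on rfc3986's own API, and `split_absolute_uri` is a hypothetical name. It checks only the overall shape (scheme present, no fragment), not the character set of each component.

```python
import re

# Generic URI-splitting regex from RFC 3986, Appendix B.
URI_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_absolute_uri(uri):
    """Return (scheme, authority, rest) or raise ValueError.

    Hypothetical helper: validates shape only, not per-component syntax.
    """
    # Every part of the regex is optional, so match() always succeeds.
    match = URI_SPLIT.match(uri)
    scheme, authority = match.group(2), match.group(4)
    path, query, fragment = match.group(5), match.group(7), match.group(9)
    if scheme is None:
        raise ValueError("absolute-URI requires a scheme")
    if fragment is not None:
        raise ValueError("absolute-URI must not carry a fragment")
    # Append "?" only when a query is present; concatenating
    # path + "?" + query would fail when query is None.
    rest = path if query is None else path + "?" + query
    return scheme, authority, rest
```

One detail this makes visible: the one-liner `parsed.path + "?" + parsed.query` needs a guard for URIs that have no query component.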

Ideally what I'd want is the regex text (as opposed to compiled regex objects) for each of those productions. Is that something rfc3986 could easily provide?
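To make the request concrete, this is roughly what composing from regex text could look like for origin-form. The fragments below are hand-transcribed from the RFC 3986 ABNF for this sketch and are not taken from rfc3986; all names are hypothetical. Note that absolute-path is written here as `1*( "/" segment )`, following RFC 7230 section 2.7, with segment and query from RFC 3986.

```python
import re

# Hand-transcribed RFC 3986 character classes (assumed names, not rfc3986's).
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = r"(?:[" + UNRESERVED + SUB_DELIMS + r":@]|" + PCT_ENCODED + r")"
# segment = *pchar
SEGMENT = PCHAR + r"*"
# absolute-path = 1*( "/" segment )
ABSOLUTE_PATH = r"(?:/" + SEGMENT + r")+"
# query = *( pchar / "/" / "?" )
QUERY = r"(?:" + PCHAR + r"|[/?])*"
# origin-form = absolute-path [ "?" query ]
ORIGIN_FORM = ABSOLUTE_PATH + r"(?:\?" + QUERY + r")?"

ORIGIN_FORM_RE = re.compile(ORIGIN_FORM)


def is_origin_form(target):
    """Return True if the whole string matches origin-form."""
    return ORIGIN_FORM_RE.fullmatch(target) is not None
```

Because each production is a plain string, the code can follow the ABNF line by line, which is the property being asked for.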

(P.S.: what's going on with unicode handling? It seems like on py3, if I pass a byte-string to uri_reference then I get regular strs back?)

@sigmavirus24
Collaborator

Ideally what I'd want is the regex text (as opposed to compiled regex objects)

Why do you want the regular expression text instead of functions on rfc3986 or methods on URIReferences? (Or if you're replacing urlparse, the compatibility layer there)

I'm not sure I see the benefit to providing the regular expression text versus an API for this. I'm open to being convinced though.

what's going on with unicode handling? It seems like on py3, if I pass a byte-string to uri_reference then I get regular strs back?

This library was originally designed with OpenStack/Requests in mind. Requests doesn't like bytestrings for URLs for reasons and it's simpler to reason about everything internally if it's all exactly one type.

@njsmith
Member Author

njsmith commented Jul 2, 2016

Thanks for the response!

Why do you want the regular expression text instead of functions on rfc3986 or methods on URIReferences?

Because I want to be able to write a single regex that checks for quirky RFC 7230 productions like host ":" port, which is easy if you can hand me the regexes that match host and port. RFC 7230's parsing requirements can't obviously be expressed in terms of parsing whole URIs. (And I'd rather transcribe the RFC text as directly as possible instead of trying to get cute with writing code that I think is probably equivalent.)
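A sketch of the kind of composition meant here, assuming the library handed over `HOST` and `PORT` as regex text. The strings below are stand-ins written for this example, not rfc3986's actual patterns. In particular, `IP_LITERAL` is a deliberately loose approximation of the real production, and IPv4address is covered by reg-name because digits and dots are unreserved characters.

```python
import re

# Stand-in regex text; hypothetical, not what rfc3986 ships.
# reg-name = *( unreserved / pct-encoded / sub-delims )
REG_NAME = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=]|%[0-9A-Fa-f]{2})*"
# Loose approximation of IP-literal; the real ABNF is far stricter.
IP_LITERAL = r"\[[0-9A-Fa-f:.]+\]"
HOST = r"(?:" + IP_LITERAL + r"|" + REG_NAME + r")"
# port = *DIGIT
PORT = r"[0-9]*"

# host ":" port, transcribed directly from the production.
HOST_PORT_RE = re.compile(r"(?P<host>" + HOST + r"):(?P<port>" + PORT + r")")


def parse_host_port(value):
    """Return (host, port) as strings, or None if the input does not match."""
    match = HOST_PORT_RE.fullmatch(value)
    if match is None:
        return None
    return match.group("host"), match.group("port")
```

Since "@" is not a legal reg-name character, a value that carries user-info is rejected without any extra check.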

This library was originally designed with OpenStack/Requests in mind. Requests doesn't like bytestrings for URLs for reasons and it's simpler to reason about everything internally if it's all exactly one type.

Possibly I'm just being dense and this is documented somewhere, but I meant more, what are the semantics? What happens if I pass in a bytestring, and how do I interpret the strings I get back in edge cases? HTTP is resolutely defined in terms of bytestrings, for better or worse...

@sigmavirus24
Collaborator

Because I want to be able to write a single regex that checks for quirky RFC 7230 productions like host ":" port, which is easy if you can hand me the regexes that match host and port. RFC 7230's parsing requirements can't obviously be expressed in terms of parsing whole URIs. (And I'd rather transcribe the RFC text as directly as possible instead of trying to get cute with writing code that I think is probably equivalent.)

So, it sounds like the only real use for rfc3986 is the regular expression text then, yes? In that case, I'm not sure why you couldn't copy the regular expression strings into h11. I tried to name them descriptively. If you're concerned about copyright/licensing, I'd be happy to either give you permission or contribute it myself to h11.

Possibly I'm just being dense and this is documented somewhere, but I meant more, what are the semantics?

I think I'm the dense one in this case. URIs aren't resolutely defined as ASCII only (regardless of the fact that ABNF is typically US-ASCII only). See: Section 2. Queries can have unicode in them (e.g., many sites add utf8=✓ or some variation thereof), and if you passed that to rfc3986, you would get:

~/s/rfc3986 git:master ❯❯❯ python3.5
Python 3.5.1 (default, Jun 15 2016, 20:14:41)
[GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import rfc3986
>>> uri = 'https://example.com/path?utf8=✓'
>>> rfc3986.uri_reference(uri)
URIReference(scheme='https', authority='example.com', path='/path', query='utf8=%e2%9c%93', fragment=None)
>>> _.unsplit()
'https://example.com/path?utf8=%e2%9c%93'

That result would be safe to immediately .encode('utf-8'). You can also specify your own encoding if you want, using the URIReference object.

@njsmith
Member Author

njsmith commented Jul 3, 2016

copy the regular expression strings into h11

It's true, I could. How confident are you that you've gotten all the regexes exactly correct and they'll never need updates? (Serious question!) Also, I do actually need to parse absolute-URIs in some cases (see the last entry in my list above), and I suppose normalizing them might be polite too, so a dependency on rfc3986 makes some sense regardless, at which point it seems a little silly to have two copies of the same regexes in every install?

...at least assuming that RFC 3986-compliant regexes are actually sufficient for handling HTTP in practice. Having done a bit more reading, this appears to be more of a can of worms than I realized :-/.

unicode

So it looks like the rule is that browsers are supposed to and generally do percent-encode that unicode checkmark before sending it over the wire, so in theory h11 shouldn't need to know about utf-8. Hooray for theory!

That said: if I do pass in a byte-string to rfc3986, then it will decode it as utf-8 (or whatever I ask for, but whatwg/url says that decoding as anything besides utf-8 is insecure so that's fine), parse it in unicode-land, and then the fields of URIReference are stored as unicode strings, but are guaranteed to contain only ascii codepoints. Is that all right?
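The pipeline described above can be sketched with the standard library. This uses `urllib.parse.quote` as a stand-in for whatever rfc3986 does internally, so it is an assumption about behavior rather than the library's code; `to_ascii_uri` and the choice of `safe` characters are made up for this example. `quote` emits uppercase hex digits, whereas the session earlier in the thread shows lowercase.

```python
from urllib.parse import quote

# Characters left untouched: RFC 3986 gen-delims and sub-delims, plus "%"
# so that already percent-encoded octets are not encoded twice.
SAFE = ":/?#[]@!$&'()*+,;=%"


def to_ascii_uri(raw, encoding="utf-8"):
    """Decode bytes, percent-encode non-ASCII, and return an ASCII-only str."""
    text = raw.decode(encoding)
    return quote(text, safe=SAFE, encoding=encoding)


raw = "https://example.com/path?utf8=\u2713".encode("utf-8")
encoded = to_ascii_uri(raw)
# The result is a unicode string holding only ASCII codepoints,
# so it can go back onto the wire as bytes without loss.
wire = encoded.encode("ascii")
```

Here `encoded` is `'https://example.com/path?utf8=%E2%9C%93'`.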

@sigmavirus24
Collaborator

How confident are you that you've gotten all the regexes exactly correct and they'll never need updates?

Fairly confident.

so a dependency on rfc3986 makes some sense regardless

So then would you object to helping me figure out how to expose these as helper functions or methods on a URIReference instead? I don't like exposing things like regexp text (or really the compiled regexps) because some people will start using just those. I don't want to get into the game of yakshaving regexp performance with people or golfing regexps.

Having done a bit more reading, this appears to be more of a can of worms than I realized :-/.

If you're thinking of resolving URIs via RFC 3986, our URI Reference object can do that for you.

That said: if I do pass in a byte-string to rfc3986, then it will decode it as utf-8 (or whatever I ask for, but whatwg/url says that decoding as anything besides utf-8 is insecure so that's fine), parse it in unicode-land, and then the fields of URIReference are stored as unicode strings, but are guaranteed to contain only ascii codepoints. Is that all right?

That is correct. I also want to make sure that you understand the difference between URLs, URIs, and IRIs. There are subtle differences, and while this library can handle the first two, it does not (yet) handle IRIs (where there can be unicode in the authority). I've always planned to add that support but never had anyone else really need it, nor have I needed it, and so I haven't allocated time to it. So you could have someone provide you with a URL like

http://☃.net

(which is actually a valid host). That would need to be properly encoded to a URI. On Python 3, you can encode a string with idna but Python 2 lacks that. If h11 is server-side only, then you should not have to deal with that either. That said, if it's also for clients, I'm not sure you can rely on every client to be as good as urllib3 and requests.
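As a small illustration of the idna step on Python 3, here is a sketch using the built-in `idna` codec, which implements the older IDNA 2003 rules. `host_to_ascii` is a hypothetical helper name, and a real client might prefer a dedicated IDNA library.

```python
def host_to_ascii(host):
    """Convert a unicode host name to its ASCII (punycode) form.

    Uses Python 3's built-in "idna" codec (IDNA 2003).
    """
    return host.encode("idna").decode("ascii")


snowman = host_to_ascii("\u2603.net")   # the snowman host from the example
plain = host_to_ascii("example.com")    # ASCII hosts pass through unchanged
```

For the snowman host this gives `'xn--n3h.net'`, which can then be placed in the authority of an ordinary URI.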

@sigmavirus24
Collaborator

Hey @njsmith, so I'm also doing a non-backwards compatible 1.0 version of RFC3986. If you can give me some more pointers on the regular expression bits that you want, that'd be awesome.

Also, probably worth noting that the library grew a replacement for urlparse so you can have a ParseResult or ParseResultBytes object if you only ever want to deal with bytes.

I'd love your feedback on this, if you have time to provide it.

@sigmavirus24
Collaborator

One last friendly ping, @njsmith

@njsmith
Member Author

njsmith commented Mar 16, 2017

Hey, sorry. I've been procrastinating on dealing with this because it turns out that no-one knows how URLs work and ugh, plus got distracted from writing my own HTTP server by writing my own networking library :-).

I'm not sure from your question what exactly you're looking for? The first message in this thread has a list of the RFC 3986 productions that show up in the HTTP RFCs, and having a bytes-only option sounds very nice. Is there something somewhere you wanted me to look at?

@sigmavirus24 sigmavirus24 modified the milestone: 1.0.0 Mar 28, 2017
@sigmavirus24 sigmavirus24 mentioned this issue Apr 30, 2017
7 tasks
sigmavirus24 added a commit that referenced this issue May 7, 2017
Add UseExisting and begin using it in the API. Also rename some now
public attributes in rfc3986.abnf_regexp.

Refs #24