helpers for RFC 7230 productions #24
Comments
Why do you want the regular expression text instead of functions on rfc3986 or methods on URIReferences? (Or if you're replacing urlparse, the compatibility layer there) I'm not sure I see the benefit to providing the regular expression text versus an API for this. I'm open to being convinced though.
This library was originally designed with OpenStack/Requests in mind. Requests doesn't like bytestrings for URLs for reasons and it's simpler to reason about everything internally if it's all exactly one type.
Thanks for the response!
Because I want to be able to write a single regex that checks for quirky RFC 7230 productions like
Possibly I'm just being dense and this is documented somewhere, but I meant more, what are the semantics? What happens if I pass in a bytestring, and how do I interpret the strings I get back in edge cases? HTTP is resolutely defined in terms of bytestrings, for better or worse...
So, it sounds like the only real use for rfc3986 is the regular expression text then, yes? In that case, I'm not sure why you couldn't copy the regular expression strings into h11. I tried to name them descriptively. If you're concerned about copyright/licensing, I'd be happy to either give you permission or contribute it myself to h11.
I think I'm the dense one in this case. URIs aren't resolutely defined as ASCII only (regardless of the fact that ABNF is typically US-ASCII only). See: Section 2. Since queries can have unicode in them, e.g., many sites add
Which would be safe to just immediately |
It's true, I could. How confident are you that you've gotten all the regexes exactly correct and they'll never need updates? (Serious question!) Also, I do actually need to parse absolute-URIs in some cases (see the last entry in my list above), and I suppose normalizing them might be polite too, so a dependency on rfc3986 makes some sense regardless, at which point it seems a little silly to have two copies of the same regexes in every install? ...at least assuming that RFC 3986-compliant regexes are actually sufficient for handling HTTP in practice. Having done a bit more reading, this appears to be more of a can of worms than I realized :-/.
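For the absolute-URI case, here is a minimal standard-library sketch. It uses the generic splitting regex printed in RFC 3986, Appendix B, not anything exposed by rfc3986 itself, and the helper name `split_absolute_uri` is made up for illustration. It only splits the string; it does not check that each component uses legal characters.

```python
import re

# Component-splitting regex as given in RFC 3986, Appendix B.
URI_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_absolute_uri(text):
    """Split text into (scheme, authority, path-plus-query).

    Hypothetical helper: raises ValueError when there is no scheme or
    when a fragment is present, since absolute-URI allows neither.
    """
    scheme, authority, path, query, fragment = URI_SPLIT.match(text).group(
        2, 4, 5, 7, 9
    )
    if scheme is None:
        raise ValueError("no scheme, so not an absolute-URI")
    if fragment is not None:
        raise ValueError("absolute-URI does not allow a fragment")
    # Only append "?" when a query component was actually present.
    rest = path if query is None else path + "?" + query
    return scheme, authority, rest


print(split_absolute_uri("http://example.com/a/b?x=1"))
```

Joining path and query conditionally matters: concatenating them unconditionally would fail (or add a stray `?`) for a URI that has no query at all.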
So it looks like the rule is that browsers are supposed to and generally do percent-encode that unicode checkmark before sending it over the wire, so in theory h11 shouldn't need to know about utf-8. Hooray for theory! That said: if I do pass in a byte-string to rfc3986, then it will decode it as utf-8 (or whatever I ask for, but whatwg/url says that decoding as anything besides utf-8 is insecure so that's fine), parse it in unicode-land, and then the fields of |
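To make the percent-encoding point concrete, here is a small sketch using `urllib.parse.quote` from the standard library. The request target is a made-up example, and `quote` only approximates what a browser does (browsers follow the WHATWG URL rules), so treat it as an illustration of the resulting bytes rather than of any particular client.

```python
from urllib.parse import quote, unquote

# Hypothetical request target containing a non-ASCII check mark (U+2713).
target = "/search?q=\u2713"

# Percent-encode everything outside the "safe" set, using UTF-8 for the
# non-ASCII character. The safe set is chosen just for this example.
wire = quote(target, safe="/?=&")
print(wire)  # /search?q=%E2%9C%93

# After encoding, the target is pure ASCII and can go on the wire as bytes.
wire_bytes = wire.encode("ascii")

# Decoding the percent-escapes as UTF-8 recovers the original text.
assert unquote(wire) == target
```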
Fairly confident.
So then would you object to helping me figure out how to expose these as helper functions or methods on a
If you're thinking of resolving URIs via RFC 3986, our URI Reference object can do that for you.
That is correct. I also want to make sure that you understand the difference between URLs, URIs, and IRIs. There are subtle differences, and while this library can handle the first two, it does not (yet) handle IRIs (where there can be unicode in the authority). I've always planned to add that support but never had anyone else really need it, nor have I needed it, and so I haven't allocated time to it. So you could have someone provide you with a URL like
(Which is actually a valid host) That would need to be properly encoded to a URI. On Python 3, you can encode a string with |
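As one illustration of turning a non-ASCII host into an ASCII one, here is a sketch using Python's built-in `idna` codec. This is an assumption about a reasonable approach, not necessarily the call meant above; the host name is made up, and the built-in codec implements the older IDNA 2003 rules.

```python
# Hypothetical host name with a non-ASCII label.
host = "b\u00fccher.example"

# The built-in "idna" codec converts each label to its ASCII (Punycode) form.
ascii_host = host.encode("idna")
print(ascii_host)  # b'xn--bcher-kva.example'

# Decoding with the same codec restores the Unicode form.
assert ascii_host.decode("idna") == host
```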
Hey @njsmith, so I'm also doing a non-backwards compatible 1.0 version of RFC3986. If you can give me some more pointers on the regular expression bits that you want, that'd be awesome. Also, probably worth noting that the library grew a replacement for I'd love your feedback on this, if you have time to provide it.
One last friendly ping, @njsmith
Hey, sorry. I've been procrastinating on dealing with this because it turns out that no-one knows how URLs work and ugh, plus got distracted from writing my own HTTP server by writing my own networking library :-). I'm not sure from your question what exactly you're looking for? The first message in this thread has a list of the RFC 3986 productions that show up in the HTTP RFCs, and having a bytes-only option sounds very nice. Is there something somewhere you wanted me to look at?
Add UseExisting and begin using it in the API. Also rename some now public attributes in rfc3986.abnf_regexp. Refs #24
Hello! I'm looking at improving h11's handling of URI-related stuff, and of course RFC 7230 delegates a bunch of the heavy lifting to RFC 3986. And I'd kinda rather not have to implement an RFC 3986 parser from scratch.
Annoyingly, though, RFC 7230 likes to refer directly to some of the intermediate productions inside RFC 3986. Specifically, I need to be able to:
- check for `origin-form = absolute-path [ "?" query]` (where `absolute-path` and `query` are from RFC 3986)
- check for `authority`, with empty `user-info` (I guess checking for "no user-info" is easy if you can check for `authority`, since an `authority` has a `"@"` in it iff it has a non-empty `user-info`)
- check for `host ":" port`
- check for `absolute-URI`, and if so, split it into scheme, authority, and everything else (path + query).

AFAICT rfc3986 has all the stuff it needs for doing these things, but most of it's not exposed. (Except for the last one -- I think I could implement that using `parsed = rfc3986.uri_reference(purported_url); assert parsed.fragment is None; everything_else = parsed.path + "?" + parsed.query`.)

Ideally what I'd want is the regex text (as opposed to compiled regex objects) for each of those productions. Is that something rfc3986 could easily provide?
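To show what composing "regex text" for these productions could look like, here is a rough sketch using only the standard library. The constant names are invented for this example and are not rfc3986's attributes. The fragments are transcribed by hand from the RFC 3986 ABNF, and the host part is simplified to `reg-name` only, so bracketed IPv6 literals are deliberately not handled.

```python
import re

# Character sets and fragments transcribed from the RFC 3986 ABNF.
UNRESERVED = r"A-Za-z0-9._~\-"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"
PCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|{PCT_ENCODED})"

# query = *( pchar / "/" / "?" )
QUERY = rf"(?:{PCHAR}|[/?])*"
# absolute-path = 1*( "/" segment ), with segment = *pchar
ABSOLUTE_PATH = rf"(?:/{PCHAR}*)+"
# origin-form = absolute-path [ "?" query ]
ORIGIN_FORM = rf"{ABSOLUTE_PATH}(?:\?{QUERY})?"

# Simplification: host is treated as reg-name only (no IP-literal).
REG_NAME = rf"(?:[{UNRESERVED}{SUB_DELIMS}]|{PCT_ENCODED})*"
PORT = r"[0-9]*"
# host ":" port
HOST_PORT = rf"{REG_NAME}:{PORT}"
# authority with no user-info: "@" is not a legal reg-name character,
# so any string containing "@" is rejected.
AUTHORITY_NO_USERINFO = rf"{REG_NAME}(?::{PORT})?"


def matches(pattern_text, candidate):
    """Return True when the whole candidate matches the pattern text."""
    return re.fullmatch(pattern_text, candidate) is not None


print(matches(ORIGIN_FORM, "/a/b?x=1"))
print(matches(AUTHORITY_NO_USERINFO, "user@example.com"))
```

Because the pieces are plain strings, they can be embedded in a larger pattern (for example a whole request line) before anything is compiled.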
(P.S.: what's going on with unicode handling? It seems like on py3, if I pass a byte-string to `uri_reference` then I get regular `str`s back?)
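On the bytes-versus-str question, the sketch below is only about Python's `re` module and says nothing about what `uri_reference` does internally. It shows one reason pattern text is convenient: the same ASCII-only text can be compiled once for `str` input and once for `bytes` input. The `SCHEME_TEXT` name is made up for the example; the pattern is the `scheme` production from RFC 3986.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_TEXT = r"[A-Za-z][A-Za-z0-9+.\-]*"

# Same pattern text, compiled for each input type.
str_re = re.compile(SCHEME_TEXT)
bytes_re = re.compile(SCHEME_TEXT.encode("ascii"))

# Each compiled pattern returns matches of its own type.
assert str_re.fullmatch("https").group() == "https"
assert bytes_re.fullmatch(b"https").group() == b"https"

# Mixing types is an error rather than a silent conversion.
try:
    str_re.fullmatch(b"https")
    mixed_types_rejected = False
except TypeError:
    mixed_types_rejected = True
```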