
Notes on "high-level" HTTP support #124

Open
njsmith opened this issue Sep 8, 2019 · 14 comments

Comments

@njsmith
Member

njsmith commented Sep 8, 2019

I've heard repeatedly that requests is a "high-level" HTTP API and urllib3 is a "low-level" HTTP API, but I wasn't super clear on what exactly that meant in practice. So I read requests today in hopes of learning more about what secret sauce it's adding. These are some notes/thoughts I wrote down while doing that.


Things that requests does:

  • Convenience functions: get, post, etc., plus all the helpers for encoding and decoding bodies. Including:

    • the thing where you can pass in a bunch of different formats of request body and requests will try to DWIM

    • ability to read the response body as bytes, text, json, streaming bytes, streaming text, lines, ...

    • charset autodetection when reading text/json

    • adding http:// on the front of URLs that are missing a scheme

    • choosing between Content-Length and Transfer-Encoding: chunked framing

    • measures time from starting the request until response headers are received (or something like that)

    • lots of fiddling with URLs and URL encodings. Seems to normalize URL encoding even. Uses urlparse, which is highly dubious. Not sure what this is actually about.

  • a "hooks" system that lets you register callbacks on particular events: https://2.python-requests.org//en/master/user/advanced/#event-hooks

    • In practice, it just lets you pass some functions that get to inspect/replace the Response object before it's returned to the user
  • automatic redirect handling. this has a lot of interesting complexities

    • there's some redirect defaults in requests/sessions.py for get, options, head... but not the other methods for some reason? I'm not sure what this actually does. (they're also in requests/api.py, but AFAICT that's totally redundant)

    • requests accepts file objects as request bodies, and streams them. This requires seeking back to the beginning on redirects

    • switching proxy configuration when redirected

    • saves the list of response objects from intermediate requests. I think the body is unconditionally read and stashed in memory on the response object! this is plausibly correct, since I guess redirect bodies are always small, and you want to consume the body so the connection can be released back to the pool? though there's some kind of DoS opportunity here if the server gives back a large body – normally a client can protect themselves against large response bodies by choosing to use the streaming body reading APIs, but in this case requests unconditionally fetches and stores the whole body and there's no way to stop it. Not the scariest thing, but probably should be considered a bug.

    • there's a lower-level "incremental" API: if you set allow_redirects=False, you get back the response object for the first request in the chain, and that object has some extra state on it that tracks where it is in the request chain, which you can access with requests.Response.next. However, I'm not 100% sure how you use this... seems like it would be more useful for next() to go ahead and perform the next request and return the response? maybe if the API were that easy, that could be what gets recommended to everyone who wants to see the detailed history of the redirect chain, and avoid some of the awkwardness about intermediate bodies mentioned in the previous bullet point?

    • the redirect code uses a ton of mutable state and is entangled with a lot of the other features here, which makes it hard to understand and reason about

  • authorization

    • digest auth handling is really complex: thread-local variables! relies on the hook system to integrate into redirect handling! reimplements the file seek on redirect handling (I have no idea why)! some silly code for generating nonces that combines os.urandom and ad hoc entropy sources!

    • if the user passed https://username:password@host/ URLs, requests removes them and converts them into a HTTPBasicAuth setup

  • cookies: you can pass in a cookiejar of cookies to use on requests, and it will automatically update it across a Session

    • also interacts with redirects
  • mounting different adapters

  • setting up TLS trust via certifi

  • proxy autoconfiguration from environment (reading .netrc, $http_proxy, $NO_PROXY, etc.)

  • forces all HTTP verbs to be uppercase -- interesting choice. the HTTP RFCs say that method names are case-sensitive, so in principle GET and get are both valid HTTP verbs, and they're different from each other. In practice, I've never heard of anyone using anything except all-uppercase, and apparently the requests devs never have either.

  • pickling!


Dealing with cookies is really messy! AFAICT there isn't a modern WHATWG-style standard yet, so no-one entirely knows how they're supposed to work. In Python there's the venerable http.cookies, whose API is super gnarly to work with, and I'm skeptical that its semantics are really suitable for the modern world, given that it comes from the same era as http.client. There's this cookies package on PyPI, which hasn't had a release since 2014, but the README has a bunch of important-sounding critiques of http.client. And... that's about it! Eek.


I've been thinking about architectures like: a package.Session with a highlevel API (similar to requests), and package.lowlevel.ConnectionPool with a lowlevel API (similar to what urllib3 has now, but simplified), where the low-level object is "mounted" onto the highlevel object, so we can use the mounting feature as an API for users to inject quirky configuration stuff. This is along the same lines as httpx's middleware system; httpx wants to use this for redirects/auth/etc. Looking at the notes above, I'm not sure they're that easily splittable? But need to think about it more. What parts would people reasonably need to be able to mix and match?

Note: I literally just mean I've been thinking about these kinds of architectures, like as a design puzzle. I'm not saying that we should adopt an architecture like that for this project. But I am pondering what we should do.

@madsmtm

madsmtm commented Sep 8, 2019

A few notes from an outsider 😉:

  • Personally, I like the "low-level" aspect of urllib3 - requests is way too "smart", which is rarely a good idea in software! The only thing preventing me from switching to urllib3 (and in the future, this async version) is that I haven't figured out how to use http.cookies ergonomically...

    So in my opinion, that's the primary improvement you could make, especially if you can make it thread-safe (which is something requests haven't figured out, see Document threading contract for Session class psf/requests#2766).

  • HTTP RFCs say that method names are case-sensitive, so in principle GET and get are both valid HTTP verbs, and they're different from each other. In practice, I've never heard of anyone using anything except all-uppercase, and apparently the requests devs never have either.

    I had this issue a while ago with some corporate proxies that didn't like the lowercase put as a method... So this does exist in the wild, but not something I'd have urllib3 correct... Users should just use "GET" if they want to be compatible.

Oh, and while I'm at it, I'd like to thank you and the other people working on this – this project must meet a lot of difficult resistance, and just be really difficult in general. I really hope you succeed eventually!

@njsmith
Member Author

njsmith commented Sep 8, 2019

A few notes from an outsider 😉

Whoa, thank you! I had no idea anyone was even paying attention to this repo. Getting feedback is very exciting :-)

(Btw, if you want to get more involved and stop being an outsider, we can make that happen.)

Personally, I like the "low-level" aspect of urllib3 - requests is way too "smart", which is rarely a good idea in software!

I totally get the feeling of what you're saying. But I have trouble mapping it back to specific details in the requests code. This is one of my major motivations for reading through it in the first place – to figure out what requests actually does, and hopefully figure out which parts are obvious and uncontroversial, and which parts are too smart for their own good. So I'm hoping you could elaborate a bit more?

Is it the automatic redirect handling that gives you problems? Do you hate having the option to call resp.json() to decode json bodies? Did you get bitten somehow by the URL normalization?

Oh, and while I'm at it, I'd like to thank you and the other people working on this – this project must meet a lot of difficult resistance, and just be really difficult in general. I really hope you succeed eventually!

Thanks!

@njsmith
Member Author

njsmith commented Sep 9, 2019

More unstructured brain dump:

So one natural-ish design would be to say that the low-level API exposes a single operation:

from abc import ABC, abstractmethod
from typing import AsyncIterator, List, Tuple

class AbstractAsyncHTTPTransport(ABC):
    @abstractmethod
    async def request(self, verb: bytes, headers: List[Tuple[bytes, bytes]], body: AsyncIterator[bytes]):
        ...
        # returns (status_code, status_message, headers, response_body)

And response_body is... well, I guess basically what trio calls a ReceiveStream. The idea of this interface is that it just does a single HTTP request/response with exactly the given headers – no header munging, it only supports "streaming mode", etc.

(And of course we'd have a sync version of this interface too.)

That API is missing some options/configuration, especially for TLS and proxy settings. I guess users expect to be able to set these per-request, which creates some interesting challenges for pooling. (If a connection was originally created with cert verification disabled, then you'd better not pull that connection out of the pool later when cert verification is turned on!) Maybe we should make them always set per-request, to simplify things. Let the higher-level code worry about proxy configuration policy.

Our default implementation of AbstractHTTPTransport would handle connection pooling, whatever HTTP versions we support, and instances would need to be configured with the networking backend and the policy about how many idle connections to cache.
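
To make that a bit more concrete, here's a hypothetical constructor shape for such a default transport – every class name and parameter below is made up for illustration, not an existing API:

class ConnectionPool:
    """Hypothetical default AbstractHTTPTransport: owns the connections and
    is configured with a networking backend plus a pooling policy."""
    def __init__(self, backend, max_idle_connections=10, http_versions=("HTTP/1.1",)):
        self.backend = backend                            # which networking backend to use
        self.max_idle_connections = max_idle_connections  # how many idle connections to cache
        self.http_versions = http_versions                # which protocol versions to offer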

Then the higher-level API has a Session object, which is a concrete class that holds one or more AbstractHTTPTransport. This is what handles:

  • user-friendly APIs for specifying requests and accessing responses
  • redirect handling
  • auth handling
  • cookies
  • proxy configuration, I guess?
  • picking which transport to use for each request

Redirects have to be somewhat tightly integrated with the user-friendly APIs, because the redirect code has to know how to rewind file objects, and to be able to report the history of redirects back in the final Response object. So that's a kind of outer loop that issues some arbitrary number of actual HTTP requests. The auth/cookies/proxies handling happens "inside" this redirect handling loop, and only cares about a single request at a time. Cookies and proxies are pretty simple I guess, because they really do only care about the individual request's target; there's no interaction with anything outside that.

But auth interacts with redirects: both requests and urllib3 have some code to strip auth headers from requests number 2-through-N (in some cases). This makes things a bit fragile, e.g. I think I noticed requests has code to add a Proxy-Authorization: header in some cases, but urllib3 only strips Authorization:. And requests' code for these interactions involves some gnarly in-place mutation of a PreparedRequest object. Maayyybe another way to handle this would be to convert authentication into a pure function-of-the-target, like cookies and proxies? Like instead of saying "you can set a username/password for this request" and then having to do fiddly things to guess which "request" that refers to, we could treat it as "you can set a mapping of targets -> username/passwords to use", and we treat the per-request setting as a degenerate case of this that maps just that request's original target to the given username/password. But.... I dunno if this actually works. Need to study the exact rules for when you keep the auth info and when you don't.
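
As a hedged sketch of what "auth as a pure function of the target" could look like: a mapping keyed by (scheme, host, port) that gets consulted once per individual request inside the redirect loop. All names below are hypothetical, not requests/urllib3 API:

from typing import Dict, Optional, Tuple
from urllib.parse import urlsplit

# (scheme, host, port) -> (username, password)
AuthMap = Dict[Tuple[str, str, int], Tuple[str, str]]

def auth_for(url: str, auth_map: AuthMap) -> Optional[Tuple[str, str]]:
    parts = urlsplit(url)
    default_port = 443 if parts.scheme == "https" else 80
    return auth_map.get((parts.scheme, parts.hostname or "", parts.port or default_port))

# The per-request setting becomes a degenerate one-entry map for the original target:
#   auth_map = {("https", "example.com", 443): ("user", "hunter2")}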

There's also handling of responses with code 401 Gimme Some Authentication Please. I think maybe this can be kind-of like a redirect? requests has some code for this but I don't understand the details. Need to look into this more.

Exotic cases that we might want to make possible, or at least not accidentally rule out?

Upgrade/CONNECT: these are interesting because you either get a regular response back (if the upgrade/CONNECT is denied), or else you get a new bidirectional byte stream. There's also an annoying complication: HTTP/1.1 has both Upgrade and CONNECT, but in HTTP/2 they dropped Upgrade and instead extended CONNECT (see RFC 8441). So if you want to abstract over HTTP/1.1 versus HTTP/2, then you have to translate between these two approaches or... something. This has some interesting use cases: websockets are the most prominent (and it would be neat to be able to multiplex websocket+regular HTTP to the same server over a single HTTP/2 connection! That's actually supported by the RFCs), but there are also some cool things you can do if you can negotiate a new protocol when talking to an HTTP server. E.g. the docker API uses this for docker exec – what starts out as an HTTP request to the API port gets "upgraded" into a telnet session (or something like telnet).

It wouldn't be too wacky to support websockets in the same library, at the Session layer, first using the HTTPTransport to negotiate the upgraded connection and then using wsproto to speak websocket over it.

Maybe the thing to do here is to have a second method on the HTTPTransport that's specifically for this case, that uses a somewhat richer input API to handle the Upgrade/CONNECT version differences, and returns Union[ReceiveStream, Stream] (the former when the upgrade/CONNECT is denied and you just get a regular response, the latter when it succeeds and you get a raw bidirectional stream).
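
As a strawman, the signature for that second method might look something like this – the method name, the target parameter, and the ReceiveStream/Stream stand-ins are all invented for illustration:

from abc import ABC, abstractmethod
from typing import List, Tuple, Union

class ReceiveStream:          # stand-in: read-only byte stream (a regular response body)
    ...

class Stream(ReceiveStream):  # stand-in: bidirectional byte stream (successful upgrade)
    ...

class AbstractAsyncHTTPTransport(ABC):
    @abstractmethod
    async def connect_or_upgrade(
        self, verb: bytes, target: bytes, headers: List[Tuple[bytes, bytes]]
    ) -> Union[Tuple[int, List[Tuple[bytes, bytes]], ReceiveStream], Stream]:
        """Return (status, headers, body) if the upgrade/CONNECT was denied,
        or a raw bidirectional Stream if it succeeded."""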

Bidirectional communication over vanilla HTTP/2 (without CONNECT): grpc is a bit weird: it uses the HTTP/2 wire format but with slightly different semantics than real standard-compliant HTTP/2. In particular:

  • minor point: it specifically requires HTTP/2, not HTTP/1.1
  • major point: the HTTP/2 protocol provides an arbitrary-bidi-byte-stream interface, but the abstract HTTP semantics say that you send the request body, and then receive the response body, basically in that order – there's no concept of sending some request, then reading some response, then sending a bit more request, etc. However, grpc does do this.

And actually, HTTP/1.1 has a similar situation: there's nothing in the wire protocol that prevents doing bidi communication like this, but it's not really part of the HTTP semantics and isn't well-supported by middleboxes. And in our current HTTP/1.1 code we actually abort sending the request as soon as we start seeing a response, because that's the most correct thing to do for regular HTTP semantics.

So... do we need a way to enable bidi incremental request/response bodies? Do we need it in general, or just for HTTP/2? What would the semantics and interface even look like? I'm really not sure about any of this.

Trailers: Also needed by grpc, as well as some real HTTP use cases. These are an optional block of headers that can arrive after reading the response body. So I guess the interface for response bodies isn't quite a ReceiveStream – we need a bit more for this. Maybe it's just a ReceiveStream plus some .trailing_headers attribute that starts out as None but might change after the stream is exhausted? Or a closeable async iterable that yields Union[bytes, HeaderList]? (In that case we lose the option of setting a max read size, which is maybe a bit awkward since most streaming APIs do support that.)
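
One possible shape for that, assuming a trio-style receive_some() body stream: a thin wrapper whose .trailing_headers starts as None and gets filled in once the body is exhausted. Everything here, including how the underlying stream exposes trailers, is hypothetical:

from typing import List, Optional, Tuple

class ResponseBody:
    def __init__(self, stream):
        self._stream = stream  # underlying body stream from the transport
        self.trailing_headers: Optional[List[Tuple[bytes, bytes]]] = None

    async def receive_some(self, max_bytes: int = 65536) -> bytes:
        data = await self._stream.receive_some(max_bytes)
        if data == b"":
            # end of body: whatever trailers the transport collected are now visible
            self.trailing_headers = getattr(self._stream, "trailers", [])
        return data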

Lifecycles

How do we manage the lifetime of session objects and individual response objects? Traditionally, libraries like requests/urllib3 have allowed users to use context managers, but this is optional – if you don't, then the GC will eventually clean things up. This also matches how Python handles most resources, like files.

Async and new HTTP protocols make this more complicated.

Let's talk about session lifetimes first.

HTTP/2 and HTTP/3 assume that there's some kind of continuously-running background-task to handle stuff like PINGs. Tom Christie has argued there might be some way to finesse this – encode/httpx#258 (comment) – so I guess we should double-check – but I think we'll need a background task anyway to avoid nasty issues with cancellation, so the most we could hope for is to make tasks that live as long as individual response objects. Anyway, the reason all this matters is that we believe in structured-concurrency so we think that background tasks require an async with.

The simplest way to handle this would be to simply make it mandatory to use async with whenever creating a lowlevel HTTP transport. And sessions have a transport by default, so they would also need an async with.
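
A minimal sketch of that, assuming trio: the session opens a nursery on entry so HTTP/2 connections have somewhere to run their background tasks. The class and attribute names are placeholders:

import trio

class Session:
    async def __aenter__(self):
        self._nursery_mgr = trio.open_nursery()
        self.nursery = await self._nursery_mgr.__aenter__()  # background tasks (PINGs, etc.) live here
        return self

    async def __aexit__(self, *exc_info):
        return await self._nursery_mgr.__aexit__(*exc_info)

# usage:
#   async with Session() as session:
#       resp = await session.get("https://example.com")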

Things to think about:

  • We might not be able to support HTTP/2 for sync users at all, in which case maybe we should somehow allow direct construction of session objects without a with block, but only for the sync API? (Or can we support HTTP/2 by writing a nursery that uses background threads? sounds dicey...)

  • Or we could make the with block optional in all cases, but disable HTTP/2 support unless you use with, and document that? This seems like a subtle trap for users though – if you use a bare Session then everything seems to work but you're actually running in a degraded mode.

The actual resource release for a session/connection pool is pretty trivial... basically just close all the sockets and you're done. So async doesn't make that part any more complicated – the GC works as well for resource cleanup as it ever does.

Then there are responses. If we scope background tasks to the session level, then we don't need background tasks for individual responses. So in that regard, no with block is necessary. However, releasing the resources from a response is somewhat complicated in async mode: for HTTP/2, you might need to send a message to the background task handling the connection. To do this, you need to do shutdown from inside async context. And that's tricky to do from __del__: you can't send the message directly, but rather have to re-enter the async loop through some kind of call_soon_threadsafe functionality. That would require extending our networking backend API, but it's a pretty simple extension and I think all our backends do support this. So, maybe that's fine and we can go ahead and tell people that using with on streaming responses is recommended but not mandatory? (And on non-streaming responses it's really not an issue either way...)

[Note to self: check how aiohttp handles this. I think they at least STRONGLY RECOMMEND with blocks, but I don't remember the details.]

Also related: Should we have a stream=True/False kwarg like requests does, or should we unconditionally consume the response body lazily, so you choose whether you're streaming or not by whether you call await response.get_whole_body() versus await response.iter_body_chunks()? (not their actual names)
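
Side by side, the two shapes under discussion (the accessor names come from the paragraph above, which already flags them as made up):

# Option A (requests-style): a stream= flag at request time decides
#   resp = await session.get(url, stream=True)

# Option B (always lazy): the accessor you call afterwards decides
async def read_buffered(resp):
    return await resp.get_whole_body()            # buffers the whole body in memory

async def read_streaming(resp):
    total = 0
    async for chunk in resp.iter_body_chunks():   # consume the body incrementally
        total += len(chunk)
    return total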

This has an interesting interaction with resource management: if you do it lazily in general, then you have to decide what policy to use when someone closes the response without accessing the body. Do you close the connection, forcing it to be re-opened? Or do you try to read the response body, so the connection can be returned to the pool? This is also closely related to my questions in my first post, about how to handle the body of redirect responses – do we read it, discard it, what?

(Well, if you have a stream= kwarg, then of course you have the same question when someone does stream=True and then closes the response without consuming the body. But in this case we know that the user explicitly said they didn't want to fetch the whole body, so it's probably fine to trust them on that and immediately close the connection. If you don't have a stream= kwarg, then there's an ambiguity between the relatively-rare (I think?) cases where the user explicitly wanted to avoid fetching the whole body, versus the relatively-common (I think?) cases where the user was fine with fetching the whole body, but just didn't happen to touch it. So that makes it much more of a puzzler.)

Oh hmm we also need to avoid the situation where the server can cause __del__ to block forever by not sending any data, because __del__ has a nasty habit of escaping from trio's normal timeout control. Maybe it's simplest all around if we say with is optional on responses, we have a stream= flag that defaults to False, and when a stream=True response is closed or GCed then we kill it immediately without messing around with trying to drain the body. And for redirects, when stream=False we could feel free to capture the bodies of redirect responses, and when stream=True we could discard the bodies (or do something cleverer, like reading a bit of data and then discard).

Another question: Should closing a session immediately close all responses, even ones that are currently "alive"? It makes a lot of sense for HTTP/2, since there responses are relatively thin wrappers around some shared state that's most easily-managed at the session level. And probably we want HTTP/2 and HTTP/1.1 to act the same, to avoid surprising bugs. So maybe tentatively, yes, we should track "checked out" connections and close them when the session is closed.

@njsmith
Member Author

njsmith commented Sep 9, 2019

Another thing to think about: "retry handling", which is similar to "redirect handling" in some ways, but different. Retry = automatically responding to a failed request by re-issuing it, versus redirects which are explicitly initiated by the server. You can't do retry on all failure modes, but there are some where it makes sense – in particular early failures like DNS resolution or a timeout on connection establishment, where you know you never started submitting the request to the server. It's safe, you know these kinds of problems are often transient, just do it. There are also server-initiated retries, involving the Retry-After header.
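
A hedged sketch of the safe-to-retry case, assuming trio, and assuming the transport signals connection-stage failures (DNS, connect timeout) as OSError/trio.TooSlowError – those exception types are stand-ins for whatever the real transport would raise:

import trio

async def request_with_retries(transport, *args, attempts=3, **kwargs):
    for attempt in range(attempts):
        try:
            return await transport.request(*args, **kwargs)
        except (OSError, trio.TooSlowError):
            # Only safe because these are assumed to mean "we never started
            # sending the request to the server".
            if attempt == attempts - 1:
                raise
            await trio.sleep(0.1 * 2 ** attempt)  # simple backoff before retrying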

Do you want to handle retries at the highlevel Session level, or at the lowlevel HTTPTransport level? Doing it at the Session level lets you share some code with redirects (esp. for Retry-After, which may involve re-sending the body!). It also means you need some reliable way for the HTTPTransport to say "that failed, but it was the kind of failure where a retry makes sense." It's also not so great for folks who want to use HTTPTransport directly, who probably end up having to re-implement their own retry support every time. OTOH we could do it at the HTTPTransport level. That means more configuration to pass down into the HTTPTransport from the Session.

Currently when using requests, I think requests handles redirects while urllib3 handles retries, so it's like the HTTPTransport-does-retries approach. [Edit: and requests doesn't provide any way to configure the retries per-request, the knobs are only exposed as state on the "adapter" object. And maybe this is a good thing, because those knobs are actually super complicated!] [Edit 2: in case it's not clear: the nice thing about making retries configurable only as object state is that it lets us take them out of the abstract transport interface, and make them part of the concrete default transport implementation. Abstract interfaces are tricky because you have multiple users on both sides, so it's nice to keep them as simple as possible.]

@njsmith
Member Author

njsmith commented Sep 9, 2019

I should also look through urllib3 to think about what stuff is happening there and whether it makes sense as the place to do it. For example, I know urllib3 handles automatic decompression of gzipped responses – is that something you would want at the low level or the high level?

@njsmith
Member Author

njsmith commented Sep 10, 2019

wacky idea for dealing with draining connections: put the connection back into the pool undrained, and then if we pull it out again, try to drain it then (with some limit on how much to read). The idea is that this moves the potentially-blocking draining into another operation that also blocks waiting for the same server, so if the user is using timeouts properly then it should Just Work.

This is probably Too Clever By Half, but I figured I'd capture the thought...
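
For the record, roughly what that checkout-time drain might look like – the pool and connection methods here are invented:

MAX_DRAIN_BYTES = 64 * 1024  # past this, give up and open a fresh connection

async def checkout(pool, key):
    while True:
        conn = pool.pop_idle(key)
        if conn is None:
            return await pool.open_new_connection(key)
        try:
            # Drain whatever body the previous user left behind, bounded, and
            # under the caller's own timeout since we're already blocking on this server.
            await conn.drain_leftover_body(max_bytes=MAX_DRAIN_BYTES)
        except Exception:
            await conn.aclose()  # too much leftover data, or the drain failed
            continue
        return conn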

@njsmith
Member Author

njsmith commented Sep 10, 2019

Dealing with cookies is really messy

Huh, apparently aiohttp actually implemented its own cookie jar code! It still uses some of the code from http.cookies, but they have an abstract cookie storage interface: https://github.com/aio-libs/aiohttp/blob/master/aiohttp/cookiejar.py

@njsmith
Member Author

njsmith commented Sep 11, 2019

I figured out what's going on with the 401 handling in requests' auth code. If you're using digest authentication, then you have to do two requests: the first one gets denied and in the denial it includes a nonce; then the nonce is used to generate the actual authentication token; and then we repeat the request with our new token. I guess this is one reason that digest auth has always been unpopular... if you want to upload a large file with authentication, then you have to upload it twice!

It seems like there's a fundamental similarity between redirects, the Retry-After style of retries, and digest auth: they all involve rewinding and re-sending the original request.


On another topic: does it even make sense to keep a split between "high-level" and "low-level" APIs? I'm looking at that list of value-added features that requests gives, and it's like... well, you probably want redirect handling almost always, and if not there will be a way to disable it. Auth doesn't do anything unless you ask for it. And people probably do want cookies in pretty much all situations, or if not then we'd want a way to disable it anyway. And if you don't want to use helpers like .json(), then you can just not use them. So it seems like from a user's point of view, the only difference between a "low level" and "high level" API is passing redirect=False, disable_cookies=True or something, and that's hardly enough to justify a major API change.

So it sounds like the remaining reason people rely on this split is because they're using requests's "mounting" capability as an extension point so they can interpose and mess with things. I looked around to find real examples of how people use this:

  • requests-unixsocket: adds support for http+unix:// URLs.
  • requests-toolbelt has a bunch:
    • GAE support
    • option to specify a specific TLS certificate fingerprint
    • option to restrict which versions of TLS the client offers (e.g. force TLS 1.2 only)
    • option to connect to a URL containing a raw IP, but accept an explicit Host: header and use it to set TLS hostname
    • option to pick which source address to bind to when making outgoing connections
    • option to call setsockopt on the socket
    • option to set up TCP keepalive

Actually in general requests-toolbelt is a treasure trove of info on exotic extensions people have asked for. Some more interesting extensions:

  • A bunch of auth methods (NTLM, a system to map target -> auth, like I suggested above...)
  • A subclass of Session where you give a base URL when you create it, and then you can pass relative URLs and they're resolved relative to that base

We can group these into three categories:

  1. APIs to manipulate the socket and TLS layer – unix domain sockets, TCP socket options, TLS tweaks, etc. For us, this stuff all lives down inside our backend layer, so it's really unclear how to expose it. I guess we might have to though. Explicit support for some cases in our high-level API? Backend-specific extension points? aiohttp has a plugin API specifically for this kind of thing. In any case, none of the design stuff I mentioned above helps with this; it's a separate issue.

  2. High-level tweaks to the request itself: authentication methods, the custom Session with a default base URL. Probably there are folks out there who want to do other wacky stuff, like automatically setting headers on every request or something. For this, the high-level/low-level split I described above could maybe handle it, but it would be extremely awkward – e.g. you have to know the actual request target for cookie handling, so the default base URL thing has to be resolved early. Authentication interacts with retries, so also needs to be handled at a higher level. Probably we can just support the default base URL thing as a regular feature (given a decent URL library it's like 1 extra line of code – see the sketch after this list), and I guess we might want a specific extension API for handling authentication?

  3. The Google App Engine support. This is the only case that's a natural fit for the high-level/low-level design I described. It's still a bit awkward because GAE's "fetch API" is inherently awkward (e.g. it always sets some headers you can't control, and can't do streaming). And from a quick look at the GAE page, it sounds like the fetch API is in the process of being phased out – in fact it doesn't seem to be supported in any of their Python 3 environments. And even if we did want to support GAE, by itself it probably doesn't justify creating a whole elaborate system for interposing transport backends.
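
Re: the base-URL case in category 2, the "1 extra line" is basically urljoin. A sketch of a thin wrapper (the wrapped session's .request() method is assumed, not a real API):

from urllib.parse import urljoin

class BaseURLSession:
    def __init__(self, session, base_url):
        self._session = session
        self.base_url = base_url

    async def request(self, method, url, **kwargs):
        # the one extra line: resolve relative URLs against the base
        return await self._session.request(method, urljoin(self.base_url, url), **kwargs)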

@njsmith
Member Author

njsmith commented Sep 11, 2019

both requests and urllib3 have some code to strip auth headers from requests number 2-through-N (in some cases). This makes things a bit fragile, e.g. I think I noticed requests has code to add a Proxy-Authorization: header in some cases, but urllib3 only strips Authorization:. And requests' code for these interactions involves some gnarly in-place mutation of a PreparedRequest object. Maayyybe another way to handle this would be to convert authentication into a pure function-of-the-target, like cookies and proxies? Like instead of saying "you can set a username/password for this request" and then having to do fiddly things to guess which "request" that refers to, we could treat it as "you can set a mapping of targets -> username/passwords to use", and we treat the per-request setting as a degenerate case of this that maps just that request's original target to the given username/password. But.... I dunno if this actually works. Need to study the exact rules for when you keep the auth info and when you don't.

I think the concept is sound in some sense – the rules for keeping auth are basically "if and only if you keep the same destination scheme+host+port". But there's a problem: the user can also set their own manual Auth headers (and in fact this is super common, since the normal HTTP auth mechanisms are so limited). And on redirect, you want to strip those too.

So for simple kinds of auth, maybe you want to set up the headers just once, before entering the redirect handling loop. And then inside the redirect loop, as a separate matter, you want to strip the headers, because that way you treat them uniformly regardless of how exactly the user set them.

For auth that requires a 401 back-and-forth, like Digest and I think some of the other fancy ones like Kerberos or NTLM, it's a little trickier. I guess those need to go "inside" the redirect loop, because if you get redirected from A -> B, and B sends a 401, and you have credentials for B, then you should use those credentials to retry the request to B, not go back to A and send it credentials that it might not even know what to do with. But, of course, the redirect loop needs to cancel this entirely if it crosses between sites. So actually I'm not so sure... maybe you want to handle all "structured" auth at this stage, and strip user headers as a separate thing.
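
A sketch of the "keep auth iff scheme+host+port are unchanged" rule for the redirect loop (hedged – this is not requests' or urllib3's actual code):

from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def should_strip_auth(old_url: str, new_url: str) -> bool:
    old, new = urlsplit(old_url), urlsplit(new_url)
    old_origin = (old.scheme, old.hostname, old.port or DEFAULT_PORTS.get(old.scheme))
    new_origin = (new.scheme, new.hostname, new.port or DEFAULT_PORTS.get(new.scheme))
    return old_origin != new_origin  # crossing origins -> drop Authorization and friends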

(It's important to think carefully about these cases... requests has had 4 CVEs over its life, and every single one involved bad interactions between redirects+auth.)

Another bit of fun: proxies can also require authentication. For http requests, you have to send a Proxy-Authorization: mixed in with the actual request headers (and the proxy will strip it off before forwarding). For https requests that use CONNECT, you have to send the Proxy-Authorization: as a header on the CONNECT request. In both cases, the proxy can return a 407 code, which is just like a 401 code except for proxies.

@njsmith
Member Author

njsmith commented Sep 12, 2019

I realized Go's net/http package is also a good thing to look at, as a recent, heavily-used client that's gotten a lot of attention from HTTP experts. Here's some notes from reading through https://golang.org/pkg/net/http/

The overall structure is actually eerily similar to the straw-man Session-vs-Transport design that I sketched out above. They have a Client as the high-level object. Requests are reified as a Request object, and responses are reified as a Response object. Bodies are always streamed – the objects just hold a reference to a Reader object (= the go version of a Python file object). The Client handles high-level stuff like redirects and cookies, then delegates to an implementation of the RoundTripper interface, which simply takes a Request and returns a Response. The default implementation of RoundTripper is called Transport, and has a ton of configuration options.

There's even a dispatch mechanism similar to requests's "mounting". Except, it's not on the Client, it's on the default Transport. And you can only mount other RoundTrippers to handle specific schemas, not arbitrary URL prefixes like requests allows.

Their Client.Head helper does follow redirects by default. That's different from what requests does. I wonder what urllib3 does.

Their multipart parser has a mandatory (!) maxMemory argument. If the data being parsed exceeds that, then it automatically spills onto disk (!).

There's a trace hook mechanism, to register callbacks that get called at various points in the request lifecycle (https://golang.org/pkg/net/http/httptrace/). I think this must be the inspiration for aiohttp's similar feature?

You can do Upgrade: requests, and if it succeeds, then the body object isn't just a Reader, but actually a ReaderWriter (= the go version of a bidirectional streaming connection), basically the raw socket. They don't support CONNECT though... not sure why not, since they're like basically the same thing. There's also a "hijacking" API, but I get the feeling that it's an older hack that's obsolete now that there's proper Upgrade: support?

They support sending trailers on requests, via a somewhat awkward API (you fill in a request.trailers attribute with placeholder values, then mutate the value while you're writing the request body, and then at the end the trailers get snapshotted and sent). For responses, trailers are an attribute on the response object that gets automagically filled in after the body is consumed.

Redirects: they have an interesting linked-list kind of representation – each Response contains a pointer to the Request that triggered it, and if that Request was an automatic redirect, then it contains a pointer to the Response that triggered it, so you can walk back to the beginning of the chain examining all the requests and responses.

In their redirect chain, all the previous responses don't have a body attached – it's automatically closed. Normally this would kill the connection, which is inefficient since you might want to reuse the connection for the redirect. In order to avoid this, they have some special-case code that runs on redirects to drain and discard up to 2 << 10 bytes of body (= 2 KiB, weird way to write it). This makes it look like there's no timeout, but in fact the Body object auto-closes itself if a global timeout for the request gets exceeded, so it is possible to prevent it hanging forever:

			// Close the previous response's body. But
			// read at least some of the body so if it's
			// small the underlying TCP connection will be
			// re-used. No need to check for errors: if it
			// fails, the Transport won't reuse it anyway.
			const maxBodySlurpSize = 2 << 10
			if resp.ContentLength == -1 || resp.ContentLength <= maxBodySlurpSize {
				io.CopyN(ioutil.Discard, resp.Body, maxBodySlurpSize)
			}
			resp.Body.Close()

There's lots of interesting undocumented tidbits like this in the source. Here's another example:

			// If the caller specified a custom Host header and the
			// redirect location is relative, preserve the Host header
			// through the redirect. See issue #22233.

You can register a callback that gets consulted before following each redirect, and it gets the option to stop following the chain at any point. The callback gets a reference to the next Request, and since that has a Response in it, I think it can read the last response body if it wants to before deciding? Or it can return a special value that means "stop here, return the last response as the final response".

Unrecognized 1xx responses are mostly ignored, except that they trigger a trace event. The exceptions are 101 (which indicates a successful Upgrade, as noted above), and 100 Continue. For 100 Continue, they have a setting on the Client that says how long to wait for a 100 Continue before going ahead sending the body. If you don't set it, then it defaults to sending the body immediately without waiting.

There's a mechanism for setting a header to nil to mean "don't send this header at all, even if normally you would automatically send it". (Maybe only for servers?)

The details of how they handle headers on redirects are subtle and interesting to compare to what the Python-land libs do:

When following redirects, the Client will forward all headers set on the initial Request except:

  • when forwarding sensitive headers like "Authorization", "WWW-Authenticate", and "Cookie" to untrusted targets. These headers will be ignored when following a redirect to a domain that is not a subdomain match or exact match of the initial domain. For example, a redirect from "foo.com" to either "foo.com" or "sub.foo.com" will forward the sensitive headers, but a redirect to "bar.com" will not.

  • when forwarding the "Cookie" header with a non-nil cookie Jar. Since each redirect may mutate the state of the cookie jar, a redirect may possibly alter a cookie set in the initial request. When forwarding the "Cookie" header, any mutated cookies will be omitted, with the expectation that the Jar will insert those mutated cookies with the updated values (assuming the origin matches). If Jar is nil, the initial cookies are forwarded without change.
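
For comparison with the Python-land libs, the "subdomain match or exact match" test is easy to state in Python (illustrative only):

def forwards_sensitive_headers(initial_host: str, redirect_host: str) -> bool:
    initial_host, redirect_host = initial_host.lower(), redirect_host.lower()
    return redirect_host == initial_host or redirect_host.endswith("." + initial_host)

# forwards_sensitive_headers("foo.com", "sub.foo.com")  -> True
# forwards_sensitive_headers("foo.com", "bar.com")      -> False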

Proxy configuration: the default Transport handles proxies internally (http, https, socks5), and you configure it by passing a function that maps URL -> proxy that should be used. There's a special config setting for "headers to pass with CONNECT requests". I don't understand how proxy authentication is handled... is that a thing? Some googling suggests that go does not handle this case very well, e.g. you're just expected to manually put Proxy-Authorization: headers in all your requests (but only for http:// requests, don't do that for https:// requests or you'll send your credentials off to some random third-party!).

To find out how people take advantage of the RoundTripper extension point, I went searching on github, and I found lots of examples:

I'd say that overall, these are more compelling use cases than GAE. Test fakes and debugging tools are especially compelling, maybe enough to justify their own dedicated API. For several of the rest, I wonder if it would be better to wrap something around the Client instead of interposing underneath the Client, but we'll see... probably any well-factored HTTP client is going to end up with something like the Client/Transport split at least internally, and maybe the thing to do is to start hacking on that and see how complex that API boundary is before deciding whether to make it a public interposition point.

@tomchristie

Something that might be a useful pointer here is that httpx's "Dispatch" interface is the equivalent of the low-level part here, and its "Client" (which handles auth, redirects, cookie persistence) is the equivalent of the high-level part here.

I'd be super keen to see a dispatch class implementation based on your work here, or discover if there are any blockers that make that infeasible for any reason.

Concretely: Implementing a dispatch class means subclassing the AsyncDispatcher interface, and implementing the .send() method.

@njsmith
Member Author

njsmith commented Jan 8, 2020

Some critique of Go's net/http:

https://github.com/gorilla/http

And fasthttp is mostly focused on speed, not ergonomics, but from their README:

@sethmlarson
Contributor

sethmlarson commented Jan 19, 2020

IMO it's preferable to not have a "PreparedRequest" API, and instead support all requests via hip.get(), hip.post(), etc.

I'm also a fan of only exposing a string interface for URLs on Response.request objects so we don't have to contain a whole URL parsing library.

However, that presents some unique challenges for supporting all of HTTP without PreparedRequests. One I can think of right away is "asterisk URL form" requests when using the OPTIONS method.

A solution I quickly thought of would be a flag in hip.options() with something like hip.options("http://example.com", asterisk_form_url=True)?

@njsmith
Member Author

njsmith commented Jan 20, 2020

Yeah, OPTIONS * is complicated. I opened #192 to collect more info on it.
