This repository has been archived by the owner on Apr 14, 2022. It is now read-only.

[discussion] [potential heresy] global connection pools and global tasks #125

njsmith opened this issue Oct 1, 2019 · 19 comments


njsmith commented Oct 1, 2019

Here's a convenient thing about current Python HTTP clients: when you do create a Session, you can just do session = Session(). You can use with Session() as session, but it's not required. This flexibility can be very convenient – in fact trio's own github bot uses it.
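
For concreteness, both spellings work with requests today:

import requests

# Bare constructor: no cleanup required; connections are pooled until you
# call session.close() or the process exits.
session = requests.Session()
r = session.get("https://example.com")

# Context-manager form: also supported, closes pooled connections on exit.
with requests.Session() as session:
    r = session.get("https://example.com")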

Here's an annoying thing about HTTP/2 and HTTP/3: you really need a background task to handle the shared connection. (And that issue doesn't even talk about cancellation safety, which is also a huge issue that I don't see how to solve without background tasks.) According to the principles of structured concurrency, that means that if you want to support HTTP/2 or HTTP/3, then you must use a with block to manage sessions, which is a usability regression compared to older HTTP/1.1-only clients.

We could potentially restrict this regression to just async clients, since sync clients will probably never support HTTP/2 or HTTP/3. But that's pretty awkward, since our goal is to keep the async and sync APIs matching as closely as possible. Another option: we could support both async_session = AsyncSession() and async with AsyncSession() as async_session... but the first form would be restricted to HTTP/1.1, and you'd have to write it the second way if you want your session to use HTTP/2 or HTTP/3. But this would suck, because it's a really subtle distinction that would create a really surprising behavioral difference.

Here's an annoying thing about current Python HTTP clients: you have to explicitly create and use a Session to make things efficient. If you just call requests.get(...), then it can't reuse connections. Wouldn't it be nice if things were magically efficient for everyone? Golang handles this by having a global Client object (= their version of a Session) that gets used by bare functions like http.get(...). We could do that too! (Well, one global session for the sync API, and one session-per-async-loop for the async API.) However... you can't have an implicitly constructed global session, if sessions require a with. So if you want to make HTTP efficient by default out-of-the-box, then your sessions have to support the bare constructor form.
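
A rough sketch of that Go-style arrangement on the sync side, using requests.Session as a stand-in for whatever session type this library ends up with (the helper names are made up):

import requests

_global_session = None

def _get_global_session() -> requests.Session:
    # Created lazily on first use, via the bare constructor -- which is
    # exactly why the bare-constructor form has to keep working.
    global _global_session
    if _global_session is None:
        _global_session = requests.Session()
    return _global_session

def get(url, **kwargs):
    # Unlike requests.get(), which builds a throwaway Session per call,
    # this reuses the global session so connections can be shared.
    return _get_global_session().get(url, **kwargs)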

So... I'm wondering if this is a place where we should commit heresy against the structured concurrency dogma, and use globally-scoped tasks to manage HTTP/2 and HTTP/3 connections. (And then if the answer is yes, we should also consider whether to have a global connection pool.)

Heresy: historical context

Back in the 1970s, the folks arguing for structured programming were originally super dogmatic about it. Structured programs weren't allowed to include constructs like break or continue or return, because those can jump out of multiple nested blocks at once. That means they broke Dijkstra's black-box rule, and were considered gotos in disguise. Instead you were supposed to make sure your loop exit condition covered all possible ways the loop could exit, and arrange your functions so that you always fell off the bottom eventually.

Of course today this sounds pretty silly. It's certainly possible to write some confusing code using these constructs, but they're nowhere near as bad as goto: their effects are local and pretty restricted, and avoiding them is often a cure worse than the disease. E.g. if you encounter a fatal error inside a nested loop, you'd have to write a bunch of duplicated boilerplate conditions to carefully ferry it out of the loop and down to the bottom of your function, and that's also super error-prone and confusing.

But I'm sympathetic to those early structured programming evangelists. The only reason we know that these constructs are worth it is because they made a serious attempt to go without them. And I'm sure there were dozens of other cases where goto proponents insisted that some kind of goto was the right solution, and only a super-dogmatic insistence on structure could force them to actually try the structured version... and discover that in fact it was better.

So the lessons I take from this are:

  • A dogmatic insistence on "structured concurrency" is probably only, like 98% correct, not 100% correct, and we'll eventually look back and think we were silly for insisting on that 2%
  • But we don't yet know which 2% is the silly part!
  • And if we're going to make exceptions, we have to be really careful to analyze and justify them, because otherwise we'll lose the moral high ground
  • ...and besides, that's the only way to move the theory forward.

So let's try to put those lessons to work.

So how bad would this heresy be exactly?

There are a few reasons to suspect that using background tasks to pump HTTP/2 and HTTP/3 connections might not be that bad.

First, the idea is to keep the exact same public API as we had back in the old days with HTTP/1.1 and purely synchronous operations. So right there we know that this shouldn't mess up the control flow that our users experience, because the control flow shouldn't change at all. (If we can do it right.)

And second, the housekeeping work that HTTP/2 and HTTP/3 connections need is basically the same as the housekeeping work that the kernel already does on TCP connections. Every TCP connection has its own concurrent tasks running in the kernel to manage the TCP state machine, track ACKs, manage flow control, etc. When we call send or recv on a socket, that's actually just passing some request off to these concurrent tasks, which they perform on their own schedule. But we mostly don't have to think about this – it doesn't cause any horrible concurrency problems. So hopefully HTTP/2 and HTTP/3 should work the same way.

So that's promising, but let's work through the details and try to distill out some principles.

I think the fundamental reason it works for TCP is that the send/recv API is very careful to preserve the illusion that these really are happening immediately when you call them. That's what we want here too: users shouldn't even be able to tell whether their requests are using HTTP/1.1 or HTTP/2.

Crucially, this includes error handling. If the kernel detects an error, like a TCP RST packet or something, it doesn't crash; it holds onto the error and reports it the next time we do a call on the socket. That's what we want for HTTP requests too. And for TCP, the semantics of send/recv are such that if there's an error on a socket that we never look at, then it's OK for the kernel to discard that error. Fortunately, that also applies to HTTP/2 and HTTP/3: if some background connection dies at a moment when we aren't actually using it for an HTTP request, then that's not really an error; we just silently discard that connection and open a new one later if we need to. The fact that multiple requests might share the same connection is designed to be entirely invisible, an implementation detail.

And the kernel is written very carefully to make sure it never ever crashes. Furthermore: if the kernel does crash, it takes down our program with it. However, the reverse isn't true: if our program crashes, then the kernel keeps going. Since the kernel is strictly more reliable than our code, that makes it safe for us to hand off work to it to happen later. For example, after you close a socket, the kernel might keep working on it for a while, to send any data that you passed to send earlier. If our program exits after calling close, that data isn't lost. OTOH if the kernel does lose the data, it's probably because the whole system died, in which case we would have lost the data no matter what we did. So the hand-off doesn't affect reliability.

Twisted and asyncio also like to pump data in the background, but they don't have this property: if you hand them some data to send, and then your program exits or crashes, their event loop stops, and the data that you thought you sent might not actually be sent. For our background tasks, we could try to force the program to keep running until they finish their work, but probably an easier way to handle this is to make sure that if a user sends something, that call doesn't report success until the background task has handed the data off to the kernel. (This is also the simplest way to handle backpressure.) And of course we'd need to be very very careful that they can never ever crash, and we'd need a backup plan for if that fails and they crash anyway.
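
As a hedged sketch of that "don't report success until the kernel has the data" idea, Trio-flavored and with made-up names: the caller hands the background pump a chunk plus an event, and only returns once the pump has passed the chunk to send_all.

import trio

async def connection_pump(stream, receive_channel):
    # Background task that owns the socket for writing.
    async for data, done in receive_channel:
        await stream.send_all(data)   # handed off to the kernel
        done.set()                    # only now may the sender be told "sent"

async def send_via_pump(send_channel, data):
    done = trio.Event()
    await send_channel.send((data, done))   # bounded channel doubles as backpressure
    await done.wait()

# Wiring, inside whatever owns the connection:
#   send_channel, receive_channel = trio.open_memory_channel(10)
#   nursery.start_soon(connection_pump, stream, receive_channel)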

But if we do all that..... I actually can't think of any problems that these background tasks would cause? This isn't exactly a tight list of reusable principles yet, but it's a bit closer than we were.

So what kind of architecture are we talking about?

I'm imagining something like Trio's "system task" concept, which gives a way to spawn a task into a global "system" nursery. They have some special properties that are all appropriate here:

  • If they raise an exception, then the whole program is cancelled (basically they follow the normal nursery rules, but running as a sibling of the entire program) – so if you have some horrible programming bug then at least you'll notice, but you're strongly encouraged to make sure that they never ever raise an exception
  • They're protected from receiving KeyboardInterrupt, which is kinda important for that "never raise an exception" thing (but means they have to be careful not to ever get stuck in an infinite loop or anything)
  • They're automatically cancelled when the main program exits.

Emulating some of these details on top of asyncio is probably a hassle, but I'm going to assume the Trio semantics for now and figure out how to compromise for other libraries later.
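
For reference, here's roughly what "silently create a background task per connection" looks like with Trio's system-task machinery (trio.lowlevel.spawn_system_task; it lived in trio.hazmat when this was written). The pump function itself is hypothetical:

import trio

async def pump_http2_connection(conn):
    # Hypothetical: read frames off the shared connection and dispatch
    # them to whichever requests are waiting, until cancelled.
    ...

def start_background_pump(conn):
    # Runs as a sibling of the whole program: crashes cancel the run,
    # and it's cancelled automatically when trio.run() exits.
    trio.lowlevel.spawn_system_task(pump_http2_connection, conn)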

So given that, we can silently create a background task whenever an HTTP/2 or HTTP/3 connection is made, and automatically kill it again when the connection is closed. That means that in principle, we could implement pretty much any of the classic Session patterns.

However, I think the best architecture if we can manage it would be to have a single loop-global connection pool. The idea would be that it stores as little configuration state as possible; all it should hold is:

  • The cache of already-open connections
  • The minimal configuration needed to manage that cache (e.g. maximum number of idle connections)
  • A cache of which origins support HTTP/3 (since HTTP/3 is served over UDP, it has a weird negotiation system where you never use it by itself; you have to first make a connection using HTTP/1.1 or HTTP/2, and in the response the server mentions "hey btw, I also have an HTTP/3 server on this port, you can use that for future requests")

This is the stuff that you want to share as globally as possible, because the more widely it's shared, the better the chances that any given request in your program can skip expensive connection setup or use a more efficient protocol.

But that means we can't associate configuration with the pool. Instead, we'd associate configuration with individual requests. For example: you don't disable certificate validation on the connection pool, like urllib3 does; instead you disable it on individual requests. And part of the connection pool's job is to keep track of which connections have certificate validation enabled or disabled, and match up requests with an appropriate connection. (urllib3 already does something similar, but slightly more convoluted: it stores configuration on individual ConnectionPools, but then the PoolManager class does matchmaking between requests and pools based on the per-request configuration settings – see _key_fields and associated stuff in poolmanager.py. So this is taking the PoolManager approach and doubling down on it.)
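
To make that matchmaking concrete, here's a hypothetical sketch of a pool key built from per-request configuration; the pool only hands out connections whose key matches exactly (all names are placeholders):

from typing import NamedTuple, Optional

class PoolKey(NamedTuple):
    scheme: str
    host: str
    port: int
    verify: bool                  # certificate validation on or off
    ca_certs: Optional[str]       # trust configuration
    client_cert: Optional[str]

def compute_pool_key(url, request_config) -> PoolKey:
    # Per-request settings that affect the *connection* end up in the key,
    # so a verify=False request can never be served by a verified connection.
    return PoolKey(
        scheme=url.scheme,
        host=url.host,
        port=url.port,
        verify=request_config.get("verify", True),
        ca_certs=request_config.get("ca_certs"),
        client_cert=request_config.get("client_cert"),
    )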

Of course it's still useful to be able to set up configuration on a Session object once, and then re-use it multiple times. But in this approach, the Session would only be a collection of default configuration that it applies to requests made through it, before delegating the actual request handling to the global pool. That would make Session lifecycle management super trivial :-).

We might also want to expose the option of making new pools, for weird cases and for testing. So maybe the layering would be:

  • A Session holds configuration, including a reference to which low-level HTTP transport object to use to fulfill individual requests.
  • By default this points to the global connection pool
  • But you can make new pools, and explicitly configure a Session to use them

I'm not 100% sure whether we'd want to expose the low-level transport object as part of the public API – #124 has lots of musing on this topic. But that seems like the basic architecture we'd want internally, at the least.
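
A minimal sketch of that layering, with placeholder names (get_global_connection_pool and the transport's request method are assumptions, not a settled API):

class Session:
    def __init__(self, transport=None, **defaults):
        # By default, delegate actual connection handling to the shared pool.
        self._transport = transport if transport is not None else get_global_connection_pool()
        self._defaults = defaults   # headers, auth, timeouts, ...

    async def request(self, method, url, **kwargs):
        merged = {**self._defaults, **kwargs}
        return await self._transport.request(method, url, **merged)

# Testing or special cases can opt out of the global pool:
#   session = Session(transport=ConnectionPool(max_idle_connections=2))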

Like any global cache, we'd have to be very careful to make sure that the pool can't accidentally leak state in unexpected ways. For example, for the sync version of the pool, we'd have to make sure that it was both thread-safe and fork-safe. That seems pretty doable though.

Another interesting challenge will be figuring out where to stash the "global" connection pool. For the sync backend, we can just use a global variable. For Trio, we can use a RunVar. For asyncio and Twisted... maybe we need a WeakKeyDictionary mapping loops/reactors to connection pools, or something? I'm sure we can figure something out.
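
For the Trio case, a RunVar works out to something like this (spelled trio.hazmat.RunVar in 2019-era Trio; ConnectionPool is a placeholder):

import trio

_pool_var = trio.lowlevel.RunVar("http_connection_pool")

def get_global_connection_pool():
    # One pool per trio.run() call, created lazily on first use.
    try:
        return _pool_var.get()
    except LookupError:
        pool = ConnectionPool()   # placeholder for the real pool type
        _pool_var.set(pool)
        return pool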

That sounds elegant and all, but does reality agree?

Currently libraries like requests and urllib3 don't have the kind of careful-and-systematic separation of configuration from the connection pool that I'm envisioning above. Could it actually work in practice?

Here's the state that requests keeps on Session objects:

  • defaults for extra headers, auth, params to attach to every request
  • default for stream=True/False (!?)
  • TLS settings: trust configuration, client cert
  • maximum redirect chain length
  • proxy configuration
  • "hooks" = a list of arbitrary callables to run on responses before returning them
  • "adapters" = which low-level "raw" HTTP transports to use for different protocols
  • cookies

Other things I can think of:

  • default timeout settings
  • default retry settings
  • default settings for transport-level configuration, like TCP keepalive or whether to use HTTP/2

A lot of this is assorted "configuration" stuff that would fit very naturally in my description above.

We'd also want to think hard about whether to make any of this stuff mutable, because if there's a global default Session and its configuration is mutable then that creates all kinds of potential for folks to get themselves into trouble.

The tricky cases:

Connection caching policy obviously has to be configured on the connection cache. Arguably this is a good thing... the main reason for policy settings here is to avoid allocating too many fds, and fds are a process-global resource, so it kinda makes sense to set this policy at a process-global level. But is that good enough in practice, or do people need the option to create private connection pools with their own policy, inside particular modules?

I'm not quite sure what to do with cookies. They can mostly fit into the "configuration" model, if we expose a "cookie jar" type, and let you specify which cookie jar to use on a per-request basis. But what should we do when there's no cookie jar specified: use a process-global jar? Don't store cookies at all? And when you create a Session without specifying a jar, does it default to using the global settings, or should it create a new cookie jar?

Go's approach is that when you make an http.Client, you can either specify a CookieJar, or set it to their version of None. If it's None, then all the default cookie handling is disabled (so not only does it not automatically add cookies to requests or save cookies from responses, it doesn't even strip cookies when following redirects! not sure what the reasoning there was). And the default global Client has the CookieJar set to None.

I can see arguments either way! On the one hand, having a global CookieJar would be a vector for state to leak between different parts of your program in potentially-surprising ways. On the other hand, a lot of Python programs are quick scripts to mess with HTTP APIs and we want to make that beginner path as smooth as possible – and how often do you really want to throw away cookies? Browsers have a global cookie jar and it works out OK for them. Though it wouldn't be terrible to say that by default cookies are ignored, and if you want automatic cookie handling then you have to create a Session. And I guess browsers do have stuff like incognito windows and facebook containers. And if you have two tasks interacting with some API on behalf of different users, you definitely don't want to accidentally mix up their cookies. So I'm leaning towards making the global session use a null cookiejar that's always empty.

@RonnyPfannschmidt

Why not generally make the with statement a requirement? Even for HTTP/1 and 1.1 it's nice to clean up the residual resource pool and drop connections/caches.


njsmith commented Oct 1, 2019

@RonnyPfannschmidt Good question :-). Cleaning up resources is generally a good thing! Though... do you actually make a habit of closing your requests.Sessions before your program exits, and if so, why? I have some trouble thinking of cases where this is a good idea.

But even if it's a good idea in general, there are also some reasons why it might be nicer to keep it optional, rather than mandatory:

  • That's how the existing libraries work, and making it mandatory will feel like a regression to lots of folks. If you want your new library to succeed, you want to minimize the number of excuses people can use for sticking to the status quo :-)

  • While I'm generally in favor of cleaning up resources, realistically it's often a waste of time in small one-off scripts, and that's an important use case for Python. Especially scripts written by non-experts. It's hard for me to honestly say that someone working through the first few chapters of Automate the Boring Stuff would really have their life improved by being forced to learn about with statements before they could fetch some web pages.

  • The with requirement is "viral" – if creating a session requires a with statement, then every object that wants to make HTTP requests has to also be created with with, or else take a mandatory Session argument. So for example, here's the first example in the docker-py documentation:

    import docker
    client = docker.from_env()

    And here's the first example in the botocore docs:

    import botocore.session
    session = botocore.session.get_session()
    client = session.create_client('ec2', region_name='us-west-2')

    If we made with mandatory, and these libraries wanted to switch to our new library, then that would force them to make backwards-incompatible changes in their public API and break these examples. In practice, this would probably mean that no-one actually adopted our library.

@RonnyPfannschmidt

Conceptually, a divide could be drawn between configuring a resource manager/pool and acquiring one, and in many of the easy cases people only care about obtaining already-acquired resources.

I believe this can play nicely with atexit cleanup and async-aware contextual variables.

So the core library would use the explicit style and surround it with a convenient shell; application and library authors could then pick their style, with the suggestion that applications typically use the convenience shell, while libraries build on the more detailed core in addition to offering a custom shell.


oremanj commented Oct 1, 2019

$ python -m this | grep purity

I'm a fan of the approach @njsmith has proposed here. I think it's appropriate to think long and hard before relying on an implicit background task, and to design that task very carefully for robustness, but I believe that as long as those things are done (which in this discussion they sure seem to be) an implicit background task isn't inherently evil or complicating.

@RonnyPfannschmidt

Well, does anyone have a good answer for when to start it, what to do with it when the main loop isn't running, and which nursery it would belong to?


oremanj commented Oct 1, 2019

You could start it on first use. As described, in Trio this would be a system task, cancelled automatically when the trio.run() call exits. In asyncio it might have to hang around until the program exits, but that's no worse than the status quo for current synchronous HTTP libraries. As for "to which nursery would it belong" -- technically it goes in the "system nursery" which is the top-level nursery whose children are the main task (function passed to trio.run()) and any system tasks. But from a user perspective it appears to be outside of all nurseries.

@RonnyPfannschmidt

lovely, thanks


jtrakk commented Oct 1, 2019

can take advantage of the

@njsmith Did some words fall off in your original post?


njsmith commented Oct 1, 2019

@jtrakk whoops, thanks, fixed


jtrakk commented Oct 1, 2019

FWIW I do rather like one feature of requiring with which hasn't been mentioned -- namely, it teaches me about how the system works. When I see with, I learn "aha, there's a background task in this operation", which is actually a nice feature for network novices, rather than a solely negative thing.


njsmith commented Oct 2, 2019

Browsers have a global cookie jar and it works out OK for them.

Heh, I looked into the Fetch API today, and it turns out that this comment I wrote was hilariously naive. Browser cookie handling logic has special cases everywhere to figure out when to use which cookies. For example, see: https://developer.mozilla.org/en-US/docs/Web/API/Request/credentials

@sethmlarson
Contributor

Here's a brain dump from me for this issue:

Some initial components of a "pool key":

  • Scheme
  • Host
  • Port
  • Trust Store (CA cert(s) / fingerprint)
  • TLS Config (alpn_protocols, server_hostname)
  • Client Cert, Key, Password
  • Proxy URL
  • Proxy Headers
  • Proxy Config (eg remote/local DNS for SOCKS)

Proxying is a pain because the different proxying methods being supported (HTTP->HTTP, HTTP->HTTPS, HTTPS->HTTP, HTTPS->HTTPS, SOCKS) each have the proxy logic kick in at a different point.

  • HTTP->HTTP and HTTPS->HTTP: Proxy logic kicks in at the request level, request URL target is in absolute form.
  • HTTP->HTTPS: Need to call start_tls() after the initial request has succeeded, so the logic kicks in after a single response has been received on the connection. The proxy's response also needs to be captured in the case of a non-2XX status code to report to the user. So either the connection pool needs to know when to send the proxy request and somehow report an invalid response upstream or the session needs to somehow signal to the connection pool that the "origin"
  • HTTPS->HTTPS: Same as above except have to call start_tls() twice on the stream so logic is both before the first request and after the first request.

All CONNECT tunnel proxy methods need to be able to react to an HTTP version change via TLS ALPN. I'd recommend all (X)->HTTPS be X=HTTP/1.1 since the connection is so short lived and the logic is simpler with HTTP/1.X for upgrading to a TCP tunnel. I guess it doesn't really matter that the version changes because the HTTP connection even if it's the same version of HTTP has to be restarted anyways. (I think h11 goes into a h11.PAUSE state?)


njsmith commented Nov 19, 2019

Yeah, proxies are going to have some special case logic; no way around it.

Let's imagine we have functions pool.get_connection(config) -> conn and pool.put_connection(config, conn).

HTTP->HTTP and HTTPS->HTTP: Proxy logic kicks in at the request level, request URL target is in absolute form.

Yeah, for these I think you treat it as a plain HTTP/HTTPS connection to the proxy when looking it up in the pool/establishing it: conn = pool.get_connection(proxy_config), make an absolute request over conn, pool.put_connection(proxy_config, conn). So the pool key won't even mention any attributes of the actual target host.

HTTP->HTTPS, HTTPS->HTTPS

I think these work the same for connection establishment: conn = pool.get_connection(proxy_config). The tricky part comes afterward; we take that conn, issue a CONNECT, and hopefully it succeeds and now we basically have a plain TCP connection to the target host, and can use our regular TCP setup code and make our request. Then when we put it back in the pool, we have to put it back under a new key that describes both the proxy config + the target host, pool.put_connection(target_config_including_proxy_info, conn).

SOCKS is kind of similar: since it's a connection-level thing, it needs to be described in the pool key.
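
Putting those two paragraphs together as a hedged sketch (all method names here are hypothetical, riffing on the get_connection/put_connection idea):

async def request_via_connect_tunnel(pool, proxy_config, target_config, request):
    # 1. Key the lookup on the *proxy* and issue CONNECT over that connection.
    conn = await pool.get_connection(proxy_config)
    await conn.send_connect(target_config.host, target_config.port)

    # 2. The tunnel now acts like a raw TCP connection to the target, so the
    #    normal TLS / HTTP setup runs on top of it.
    await conn.start_tls(target_config.tls_config)

    # 3. Use it, then return it under a key that records both the proxy and
    #    the target -- it's only reusable for that exact combination.
    response = await conn.send_request(request)
    combined_key = (proxy_config, target_config)
    await pool.put_connection(combined_key, conn)
    return response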

All CONNECT tunnel proxy methods need to be able to react to an HTTP version change via TLS ALPN. I'd recommend all (X)->HTTPS be X=HTTP/1.1 since the connection is so short lived and the logic is simpler with HTTP/1.X for upgrading to a TCP tunnel. I guess it doesn't really matter that the version changes because the HTTP connection even if it's the same version of HTTP has to be restarted anyways. (I think h11 goes into a h11.PAUSE state?)

The state's called h11.SWITCHED_PROTOCOL, but yeah.

But: if your proxy speaks HTTP/2, then it's possible, and I guess desirable, to establish a single HTTP/2 connection, and then tunnel multiple CONNECT requests through that connection simultaneously. This is probably not too hard once we have the infrastructure, because everything composes: we do pool.get_connection(proxy_config), and that returns either a HTTP/1.1 or HTTP/2 connection. If it's a HTTP/1.1 connection, then we use CONNECT and then let our regular connection establishment logic figure out what to do with it after that (as described above). If it's a HTTP/2 connection, then we allocate a HTTP/2 stream, issue a CONNECT on that stream, and then establish a regular connection over that stream.

This does lead to a weird situation where you may end up having some "connections" in the pool that actually represent tunnels over other "connections" in the pool. But I think that's like... fine?

I haven't quite thought through how the lifecycle management should work to abstract over HTTP/1.1 vs HTTP/2, since "returning a connection to the pool" isn't really a thing in HTTP/2. It might be trivial though; maybe put_connection on HTTP/1.1 puts a connection back in the pool, and on HTTP/2 it closes the stream, and that's all there is to it.

@sethmlarson
Contributor

I think these work the same for connection establishment: conn = pool.get_connection(proxy_config). The tricky part comes afterward; we take that conn, issue a CONNECT, and hopefully it succeeds and now we basically have a plain TCP connection to the target host, and can use our regular TCP setup code and make our request. Then when we put it back in the pool, we have to put it back under a new key that describes both the proxy config + the target host, pool.put_connection(target_config_including_proxy_info, conn).

So for this case how would the .get_connection() call look? Would we have to issue .get_connection(target_config_including_proxy_info) first, while also asking the connection pool not to create a new connection if none fits that config, and then, if that get_connection() call fails, issue get_connection(proxy_config)? That would work, I think.

If it's a HTTP/2 connection, then we allocate a HTTP/2 stream, issue a CONNECT on that stream, and then establish a regular connection over that stream.

So then we'd have to hand back a "SocketStream" that's really just an interface containing an HTTP/2 stream. Lots of moving parts there. Have to worry about writes "timing out" if the proxy doesn't give back enough credits to the HTTP/2 stream.

This does lead to a weird situation where you may end up having some "connections" in the pool that actually represent tunnels over other "connections" in the pool. But I think that's like... fine?

I'm worried about handling cases where a single HTTP/2 proxy connection ends up representing tunnels to different endpoints. Have to think about all the scenarios that the pool can get locked up over, especially when available resources are low. Which connection do we evict to cause the least damage?

I haven't quite thought through how the lifecycle management should work to abstract over HTTP/1.1 vs HTTP/2, since "returning a connection to the pool" isn't really a thing in HTTP/2. It might be trivial though; maybe put_connection on HTTP/1.1 puts a connection back in the pool, and on HTTP/2 it closes the stream, and that's all there is to it.

We'll have to worry about resources within the HTTP2Connection as that will affect whether that connection can be used for a "new" request ie tracking how many concurrent streams are being used. Will somehow have to signal to the pool which stream_id was allocated to a given request so it knows which stream to close when the connection is "returned". Maybe the Request needs to be a parameter to get_connection() and put_connection()?


njsmith commented Nov 19, 2019

Maybe the abstraction we want to represent a single request/response cycle at a low-level is something like:

from abc import ABC
from typing import Tuple

# Placeholder aliases so the sketch hangs together:
Headers = list   # e.g. a list of (name, value) pairs
Code = int
Reason = str

class OneHTTPTransaction(ABC):
    async def send_request_data(self, data) -> None:
        ...

    async def send_eof(self, trailers: Headers):
        ...

    async def receive_response_headers(self) -> Tuple[Code, Reason, Headers]:
        ...

    async def receive_response_data(self, max_nbytes) -> bytes:
        ...

    def get_response_trailers(self) -> Headers:
        ...

    async def aclose(self):
        ...

class ConnManager(ABC):
    async def start_one_http_transaction(self, method, target_url, request_headers, conn_config) -> OneHTTPTransaction:
        ...

(Obviously all these names are placeholders. Also, we'd probably actually want to make OneHTTPTransaction implement Python's standard async bidirectional byte stream interface, once that exists (see python-trio/trio#1208), but that's a whole other discussion so I just put some placeholders in there for now.)

So the idea is that for HTTP/1.1, closing or exhausting the OneHTTPTransaction doesn't actually close the underlying connection; instead it releases it back to the pool. For HTTP/2, the underlying connection stays in the pool throughout, and closing or exhausting the OneHTTPTransaction releases the HTTP/2 stream.
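
In other words, something like this (placeholder names, and eliding all the bookkeeping):

class HTTP11Transaction(OneHTTPTransaction):
    async def aclose(self):
        if self._response_complete and self._conn.is_reusable():
            # The socket outlives the transaction: hand it back to the pool.
            await self._pool.put_connection(self._pool_key, self._conn)
        else:
            await self._conn.aclose()

class HTTP2Transaction(OneHTTPTransaction):
    async def aclose(self):
        # The shared connection never left the pool; just retire our stream.
        await self._conn.end_stream(self._stream_id)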

And then inside start_one_http_transaction we'd do some fancy footwork:

async def start_one_http_transaction(self, method, target_url, request_headers, conn_config):
    return await self._start_one(method, target_url, request_headers, conn_config)

async def _start_one(self, method, target_url, request_headers, conn_config, *, override_target=None):
    if is_http(target_url) and wants_http_style_proxy(conn_config):
        assert override_target is None
        proxy_url, proxy_headers, proxy_config = get_http_style_proxy_info(conn_config)
        # Recurse, and presumably next time we'll hit the branch below.
        return await self._start_one(method, proxy_url, combine(proxy_headers, request_headers), proxy_config, override_target=target_url)
    else:
        pool_key = compute_pool_key(target_url, conn_config)
        try:
            return await self._start_transaction_using_existing_key(pool_key, ...)
        except KeyError:
            # Need to set up a new connection
            if wants_http_style_proxy(conn_config):
                assert is_https(target_url)  # http URLs w/ http proxies shouldn't make it here
                proxy_url, proxy_headers, proxy_config = get_http_style_proxy_info(conn_config)
                # Recurse to get the transport object
                connect_transaction = await self._start_one("CONNECT", proxy_url, proxy_headers, proxy_config)
                ... = await connect_transaction.receive_response_headers()
                check_for_connect_error(...)
                # Run any initial handshake, and add the connection to the appropriate pool
                await self._setup_new_connection(pool_key, connect_transaction, ...)
            else:
                tcp_conn = await self._make_tcp_connection(...)  # I guess this handles SOCKS internally
                await self._setup_new_connection(pool_key, tcp_conn, ...)
            # Now we've added the connection to the pool, so we can try this again and it will succeed:
            return await self._start_transaction_using_existing_key(pool_key, ...)

async def _start_transaction_using_existing_key(self, pool_key, ...):
    if pool_key in self._http2_conns:
        conn = self._http2_conns[pool_key]
        return await conn.start_transaction(target_url, method, request_headers)
    else:
        while pool_key in self._http11_conns:
            # Remove the connection from the pool entirely while it's in use
            conn = self._http11_conns.pop(pool_key)
            if not still_alive(conn):
                conn.close()
                continue
            await conn.start_transaction(target_url, method, request_headers)
            return conn
        raise KeyError

It's not simple, and I'm sure it could be expressed more clearly with some judicious refactoring. But I think that basic flow can handle all the wacky combinations of different proxy types, layering of HTTP/X on top of HTTP/Y, etc., and it's only about a page of code or so.


njsmith commented Nov 19, 2019

I'm worried about handling cases where a single HTTP/2 proxy connection ends up representing tunnels to different endpoints. Have to think about all the scenarios that the pool can get locked up over, especially when available resources are low. Which connection do we evict to cause the least damage?

Hmm, I'm not sure this is actually our problem? If the proxy decides they want to reject some of our connections, they can do that equally well over HTTP/1.1 or HTTP/2. And the whole point of HTTP/2's flow control stuff is to make sure that one stream can't "lock up" other streams by hogging the underlying TCP connection.

@sethmlarson
Contributor

Hmm, I'm not sure this is actually our problem? If the proxy decides they want to reject some of our connections, they can do that equally well over HTTP/1.1 or HTTP/2. And the whole point of HTTP/2's flow control stuff is to make sure that one stream can't "lock up" other streams by hogging the underlying TCP connection.

I'm saying that if we constrain resources in the pool to be TCP streams instead of "HTTP lifecycles" then using LRU/LFU won't be optimal because a single TCP stream could be housing many HTTP lifecycles. Do we go the route of "you can't evict an HTTPConnection while a lifecycle is active"?


njsmith commented Nov 19, 2019

Ohh, yeah, cache eviction policy is a whole other topic, right.

For HTTP/1.1 it's simpler: either a connection is in use, or it's sitting idle in the pool. If it's in use, we obviously aren't going to evict it. If it's sitting idle in the pool, we might decide to evict it based on some policy like a timeout or based on a limit on how many idle connections we're willing to retain.

I'm actually not sure what the best eviction policy is... urllib3's is a bit quirky (IIRC it's separate pools per request target, each pool has its own LRU policy, and then there's a second layer of LRU on the pools themselves?). It works, but it seems more like an accidental side-effect of some implementation choices, not necessarily something that needs to be copied exactly. That's kind of a separate discussion though. Let's worry about it later.

I think we can say though that whatever eviction policy we use, for HTTP/1.1 it will only consider idle connections. For active connections, the user has both the knowledge and control they need to manage resources appropriately for their use case.
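
A toy sketch of that idle-only policy (names and numbers are made up):

import time

class IdleConnectionReaper:
    def __init__(self, max_idle=10, idle_timeout=60.0):
        self.max_idle = max_idle
        self.idle_timeout = idle_timeout
        self._idle = []   # (returned_at, pool_key, conn), oldest first

    def connection_returned(self, pool_key, conn):
        self._idle.append((time.monotonic(), pool_key, conn))
        self._evict()

    def _evict(self):
        now = time.monotonic()
        while self._idle and (
            len(self._idle) > self.max_idle
            or now - self._idle[0][0] > self.idle_timeout
        ):
            _, _, conn = self._idle.pop(0)
            conn.close()   # active connections never appear in this list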

So then generalizing to HTTP/2... I guess the same principle applies? Obviously the bookkeeping will be a bit more complicated, because the underlying connection only becomes "idle" when there are no live streams. But I think it works out. For example, consider the most complex case, of HTTP/2 tunneled through an HTTP/2 proxy:

  • First we establish the HTTP/2 connection to the proxy, and put it in our pool, call this connection A
  • Then we open a stream on top of connection A, issue a CONNECT, get a tunneled connection to the actual target host, and put that in our pool. Call this connection B. As long as connection B survives, connection A will be considered active/non-idle, because it has an active stream on it.
  • Then we open some streams over connection B, and use them to make requests against the actual target host
  • Eventually, we stop making requests against this target host, all the streams get closed, and so connection B gets marked idle.
  • Eventually, the idle connection cleanup logic notices that connection B is idle, and closes it.
  • Now that connection B was closed, this means connection A has no active streams, so it gets marked as idle.
  • Eventually, the idle connection cleanup logic notices that connection A is idle, and closes it.

So I think once we have the mechanisms to handle regular non-proxied HTTP/2, then those same mechanisms will be sufficient to handle the complicated nesting cases too.
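
The bookkeeping for that could be as simple as a live-stream counter per connection, where a CONNECT tunnel counts as a live stream on the proxy connection for as long as the tunneled connection exists (sketch, placeholder names):

class ConnectionRecord:
    def __init__(self):
        self.live_streams = 0     # includes CONNECT tunnels carrying nested connections
        self.idle_since = None    # None while active

    def stream_opened(self):
        self.live_streams += 1
        self.idle_since = None

    def stream_closed(self, now):
        self.live_streams -= 1
        if self.live_streams == 0:
            self.idle_since = now   # now a candidate for the idle reaper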


njsmith commented Nov 19, 2019

The biggest difference between HTTP/1.1 and HTTP/2 here is that for HTTP/1.1, you don't really need to keep track of active connections, so you can use "in the pool" as your marker of idle-ness. For HTTP/2, though, you have to use some kind of central data structure to track all connections, whether they're currently active or not. So idle-vs-active needs to be tracked separately.

It might make sense to do this for HTTP/1.1 too though – e.g. maybe when the pool shuts down you want to forcibly close all connections, even the active ones.
