This paper on Metastable Failures in Distributed Systems has an excellent war story about connection pooling logic that unintentionally favors slow connections. Quoting from section 2.4 on Link Imbalance:

[I]f one of the network links is congested, then the queries that get sent across that link will reliably complete last. This interacts with the connection pool's MRU policy, where the most recently used connection is preferred for the next query. As a result, each spike of misses rearranges the connection pool so that the highest-latency links are at the top of the stack.
We experienced something similar using okhttp as a client: disabling okhttp's connection pooling entirely (at around 15:20) had a very obvious effect on load distribution. The metric we looked at is from the downstream server's perspective.
Our setup involves kube-proxy, which provides a virtual IP and round-robins TCP connections across the pods backing the service, i.e. all okhttp knows about is a single IP. The nature of the service is that some requests are naturally slow while most are fast, i.e. a significant difference between p95 and p50 is expected.
Before ultimately disabling connection pooling (as a stop-gap, and to confirm our suspicions), we reduced the keep-alive timeout, though not as aggressively as suggested in this post about a similar problem with okhttp's connection pooler: Kubernetes network load balancing using OkHttp client. That had some effect, but the bias remained.
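For reference, both mitigations can be expressed through OkHttp's ConnectionPool; this is a minimal sketch (assuming OkHttp 4.x, with illustrative values rather than the exact ones we used):

```kotlin
import okhttp3.ConnectionPool
import okhttp3.OkHttpClient
import java.util.concurrent.TimeUnit

// Mitigation 1: shorten the keep-alive so idle connections are evicted sooner.
// (The default pool keeps up to 5 idle connections for 5 minutes.)
val shortKeepAliveClient = OkHttpClient.Builder()
    .connectionPool(ConnectionPool(5, 30, TimeUnit.SECONDS))
    .build()

// Mitigation 2, the stop-gap: keep no idle connections at all, so every call
// opens a fresh TCP connection and kube-proxy gets to round-robin it again.
val noReuseClient = OkHttpClient.Builder()
    .connectionPool(ConnectionPool(0, 1, TimeUnit.SECONDS))
    .build()
```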
When the pool selects a connection, it picks the first available one, and the idle-connection cleanup removes the idle connections that have gone unused the longest.

Our theory is this: when cleanup happens, the connections to slow pods are more likely to still be active, and thus not eligible for cleanup. Over time, their connections gravitate towards the front of the queue and get selected more and more frequently, which makes them even slower and even more likely to be selected, much like the story in the linked paper.
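Since we don't have a runnable reproduction against okhttp itself, here is a small self-contained toy model of the dynamic we suspect (this is not OkHttp code; the pod count, latencies, burst pattern and keep-alive are made-up numbers). It only models "reuse the first idle pooled connection, evict connections idle past the keep-alive, open new connections round-robin", yet the slow pod ends up serving the large majority of the traffic:

```kotlin
// One pod is slow, traffic arrives in bursts. Fast pods' connections go idle
// early and age out before the next burst; the slow pod's connections are
// still within the keep-alive, so they are reused first, and the slow pod's
// share of traffic keeps growing.
class Conn(val pod: Int, var busyUntil: Long, var idleSince: Long)

fun main() {
    val latencyMs = longArrayOf(10, 10, 10, 10, 300)  // pod 4 is the slow one
    val burstEveryMs = 400L
    val burstSize = 20
    val keepAliveMs = 300L

    val pool = ArrayDeque<Conn>()
    val served = IntArray(latencyMs.size)
    var nextPod = 0                                    // kube-proxy style round-robin for new connections
    var now = 0L

    repeat(50) {                                       // 50 bursts of traffic
        // Cleanup: evict connections that have been idle longer than the keep-alive.
        pool.removeAll { it.busyUntil <= now && now - it.idleSince > keepAliveMs }

        repeat(burstSize) {
            // Reuse the first idle pooled connection, otherwise open a new one
            // and let kube-proxy pick the pod round-robin.
            val conn = pool.firstOrNull { it.busyUntil <= now }
                ?: Conn(nextPod, 0L, now).also {
                    nextPod = (nextPod + 1) % latencyMs.size
                    pool.addLast(it)
                }
            conn.busyUntil = now + latencyMs[conn.pod]
            conn.idleSince = conn.busyUntil            // idle once the response completes
            served[conn.pod]++
        }
        now += burstEveryMs
    }

    println("Requests served per pod: ${served.toList()}")
    // The slow pod (index 4) ends up serving several hundred of the 1000
    // requests, while each fast pod serves only a few dozen.
}
```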
Thanks a lot for making okhttp, and sorry I don't have a running test, but we wanted to at least share our suspicions.
In the short term, what might be viable is to explore a strictly internal strategy API, with the default implementation doing what we do now. That would give us something against which to test different strategies.
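To make that concrete, here is a purely hypothetical sketch of the shape such an internal hook could take (none of these types exist in OkHttp today; all names are invented), with the default strategy preserving the current first-available behaviour and a randomized one as an example of something to test against:

```kotlin
import kotlin.random.Random

// Stand-in for the pool's internal connection type (hypothetical).
internal class PooledConnection(val route: String)

internal interface ConnectionSelectionStrategy {
    /** Index of the idle connection to reuse, or -1 to open a new one. */
    fun select(idleConnections: List<PooledConnection>): Int
}

/** Default: matches current behaviour, take the first available connection. */
internal object FirstAvailable : ConnectionSelectionStrategy {
    override fun select(idleConnections: List<PooledConnection>): Int =
        if (idleConnections.isEmpty()) -1 else 0
}

/** One alternative worth testing: pick a random idle connection to spread load. */
internal object RandomAvailable : ConnectionSelectionStrategy {
    override fun select(idleConnections: List<PooledConnection>): Int =
        if (idleConnections.isEmpty()) -1 else Random.nextInt(idleConnections.size)
}
```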