
Should we automatically increase the file descriptor limit? #2016

Open
njsmith opened this issue May 20, 2021 · 3 comments

njsmith commented May 20, 2021

Just read this interesting article about the file descriptor limit on Linux: http://0pointer.net/blog/file-descriptor-limits.html

The short version is:

  • Unixes enforce a limit on how many file descriptors a process can have open
  • This limit is kind of archaic. Back in the day I guess this was needed to stop processes overloading the kernel by opening too many file descriptors. But these days we have things like "dynamic memory allocation", and a file descriptor is really just a bit of kernel memory, and the kernel already accounts for how much memory a process is using.
  • In fact, the only reason the limit is useful at all today is that some programs might corrupt memory if they blindly use select and have more than 1024 file descriptors open. This is because as part of select's API, file descriptor values are effectively used as unchecked indices into a fixed-size memory buffer. So if an fd has value 1024 or higher, you end up scribbling on random memory. Sigh.
  • So on modern Linux systems, processes all start with a "soft" limit of 1024 to protect against this memory corruption, but any program that knows it's not doing stupid things with select can freely remove this limit. (On my laptop with stock Ubuntu, ulimit -Ha says that my actual per-process limit is 2**20 file descriptors.)
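For concreteness, here's a minimal sketch (an illustration, not Trio's actual behavior) of what "removing" the soft limit would look like from Python on Linux, using the stdlib resource module:

```python
import resource

# Read the current soft and hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# The soft limit can be raised without privileges, but only up to the hard
# limit the administrator already allows (2**20 on a stock Ubuntu box).
if soft < hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

print("fd limit is now:", resource.getrlimit(resource.RLIMIT_NOFILE))
```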

Arguments in favor of Trio silently raising the limit:

  • Hitting the fd limit is a common source of surprise problems. 1024 concurrent clients is pretty easy to miss in testing but hit in production. We have special hacks to try to degrade gracefully when this happens:

    (There are several different errors that trigger the graceful degradation, but EMFILE is by far the easiest to hit.)

    Generally speaking, Trio tries to deal with this kind of arcane nonsense so that our users don't have to become experts in low-level trivia.

  • It actually isn't possible to cause the stupid memory corruption problem from Python directly; Python's select wrapper correctly checks for too-large file descriptor values, and errors out if you try to pass them (there's a quick demonstration after this list). (Though it is possible for a C extension to mess this up, if it uses select or calls into some C library that uses select.) Also, hopefully no-one is using select and Trio together anyway?

  • Also, with glibc, if you build with -D_FORTIFY_SOURCE=1 or higher, then glibc tweaks the select API to detect the buffer overrun and crash your program if you try. This is extremely standard for distribution-built binaries. But this may not apply to locally-built binaries, and musl doesn't have any similar protection.

  • The limit doesn't even fix select; it just makes it so that processes that use too many fds will give an error, instead of silently corrupting memory. Now, don't get me wrong, giving an error is way better than silently corrupting memory.
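As a quick demonstration of the CPython check mentioned above — a sketch, assuming a Linux box where the soft fd limit can be raised past 1024 as in the earlier snippet:

```python
import os
import resource
import select

# Bump the soft fd limit to the hard limit so we can actually get fds >= 1024.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Duplicate stdin until we hold an fd whose value would index past the end
# of a fixed-size C fd_set.
fds = [os.dup(0)]
while fds[-1] < 1024:
    fds.append(os.dup(0))

try:
    # CPython validates fd values before building the fd_set, so this raises
    # ValueError instead of corrupting memory.
    select.select([fds[-1]], [], [], 0)
except ValueError as exc:
    print("select refused the fd:", exc)
finally:
    for fd in fds:
        os.close(fd)
```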

Arguments against Trio silently raising the limit:

  • It's a process-wide limit, and we're just a library – we have no idea what other code might be running in our process.

  • If someone does somehow manage to use select in the broken way, raising the limit could in theory produce a remotely-exploitable vulnerability. (It's not a slam dunk by any means, but "how many file descriptors the program has open" is something that a remote attacker could exert some control over, and maybe if they're clever enough they could cause a controlled number of file descriptors to be opened, and then a certain one to be passed to the code that's using select, to flip a controlled bit in memory.)

@PiotrCzapla

Interesting article; there might be another argument against transparently changing the limit. The sooner you run into such a limit, the more likely you are to start thinking about other correlated limits and production tuning. It might be better to educate devs about the limits instead of silently tweaking them.

Consider the number of ephemeral ports (the ports used to track TCP connections): the limit is system-wide, roughly 28k on Linux and 16k on macOS. A closed port cannot be reused for 60s (the default TIME_WAIT value). So once a process hits the limit, nothing can spawn new connections any more. What's worse, you might get errors in different processes than the one causing the issue.
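For anyone who wants to check the numbers on their own box: on Linux the ephemeral port range lives in procfs, and this is where the ~28k figure comes from (the range is tunable per system):

```python
# The default range on most Linux distros is 32768-60999, i.e. ~28k ports.
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    low, high = map(int, f.read().split())
print(f"ephemeral ports: {low}-{high} ({high - low + 1} available)")
```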

It is much easier to hit such a limit in an async setting, and I guess people trying the async approach won't be aware of this kind of issue. I hadn't thought about how likely it is to trigger this until I read this issue, thank you.

This article on k6 (a load-testing tool) has a nice summary of the limits one may hit when spawning many connections on Linux and macOS.

I hope it helps.
Btw, I really enjoyed 'go considered harmful'; it's a shame this pattern doesn't seem to be implemented in JavaScript yet.


njsmith commented Jun 1, 2021

@PiotrCzapla Huh, I hadn't thought about ephemeral ports. That is a genuine resource limit that is in fact related to the number of file descriptors, and isn't just memory in disguise. The connection is a little complicated though, so let's think it through.

IIUC:

For server / listening ports, they consume an entire local port while active, and then keep TIME_WAIT for a while afterwards. However, for these ports, I think literally everyone uses SO_REUSEADDR, and that fixes the TIME_WAIT issue. Plus, normal programs aren't binding thousands of different listening ports simultaneously -- that would be a very unusual way to hit fd limits. Usually it's from accepting or making too many connections.
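For reference, the standard SO_REUSEADDR idiom in plain socket code looks like this (the address and port here are just arbitrary examples):

```python
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Without SO_REUSEADDR, re-binding this port while old connections are still
# sitting in TIME_WAIT fails with EADDRINUSE; with it, the rebind succeeds.
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 8080))
listener.listen()
```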

Connected sockets don't consume an entire local port. You can't have two different sockets that are using the same "5 tuple" of (protocol, local ip, local port, remote ip, remote port) at the same time, and maybe not within TIME_WAIT seconds of each other (depending on which side initiates the close). But you can have lots of sockets sharing the same local port as long as they're connected to different remote destinations. So ephemeral port exhaustion is usually only an issue when you want to make tons and tons of connections to a single remote destination, which is common for benchmarking tools but probably only for benchmarking tools? This also reduces the chances that one runaway process will affect unrelated processes on the system, because those unrelated processes are probably connecting to different remote destinations.

Also, because of the TIME_WAIT issue, putting a limit on the number of fds that can be open simultaneously doesn't actually prevent ephemeral port exhaustion: a program that creates and then immediately closes thousands of connections can easily exhaust your ephemeral ports, without ever having more than a handful of sockets open at the same time.
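An easy way to watch this happen on Linux is to count TIME_WAIT sockets directly from /proc/net/tcp (state code 06); a runaway connect-and-close loop will push this number up even while the process's open-fd count stays tiny:

```python
def count_time_wait(path="/proc/net/tcp"):
    # Column 3 ("st") holds the TCP state in hex; 06 is TIME_WAIT.
    # (IPv6 sockets live in /proc/net/tcp6 with the same layout.)
    with open(path) as f:
        next(f)  # skip the header line
        return sum(1 for line in f if line.split()[3] == "06")

print(count_time_wait(), "sockets in TIME_WAIT")
```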

And, the standard fd limit is at least an order of magnitude lower than even the most conservative ephemeral port limit anyway.

So my feeling is that fd limits and ephemeral port limits just aren't correlated enough to provide a meaningful benefit.


PiotrCzapla commented Jun 2, 2021

But you can have lots of sockets sharing the same local port as long as they're connected to different remote destinations.

I was missing this bit, thank you! You are right.
It explains why I've encountered this issue only twice: when a db connection pool was missing, and during stress tests.
