
bump RLIMIT_NOFILE #10244

Merged
merged 14 commits into from Oct 17, 2018

Conversation

@poettering (Member) commented Oct 2, 2018

After discussions with some kernel devs at All Systems Go! it appears that we should bump the RLIMIT_NOFILE hard limit substantially these days, as fds are nowadays much cheaper in memory and performance than they used to be. (Or in other words: by bumping the hard limit we have little to lose but a lot to win.)

This PR universally bumps the hard limit to 256K, overriding the kernel's default of 4K. The soft limit's default of 1K is left as it is (for compatibility with select()). For all services and tools using the journal this also bumps the soft limit to 256K, and for the PID 1 (and systemd --user) instances to the highest value permitted.
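Concretely, under this scheme a journal-heavy service opts in through its unit file while everything else keeps the select()-safe 1024 soft limit. A hypothetical drop-in illustrating the idea (the path and service name are illustrative; 262144 is the 256K value this PR picks):

```ini
# /etc/systemd/system/example.service.d/nofile.conf  (illustrative path)
[Service]
# Sets both the soft and the hard limit to 256K. Only safe for services
# known not to use select(), which cannot handle fds >= 1024.
LimitNOFILE=262144
```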

@poettering

/cc @htejun

@keszybz (Member) left a comment

Reviewed. Suggestions rather than issues really.


r = setrlimit_closest(RLIMIT_NOFILE, &RLIMIT_MAKE_CONST(limit));
if (r < 0)
return log_debug_errno(r, "Failed to bump RLIMIT_NOFILE: %m");

"bump" → "set", because it can be lower than the current value.

src/core/main.c Outdated
* select(), the way everybody should) can explicitly opt into high fds by bumping their soft limit
* beyond 1024, to the hard limit we pass. */
if (arg_system)
arg_default_rlimit[RLIMIT_NOFILE]->rlim_max = MIN((rlim_t) nr, MAX(arg_default_rlimit[RLIMIT_NOFILE]->rlim_max, (rlim_t) HIGH_RLIMIT_NOFILE));

The fact that the "global" arg_default_rlimit is used here is a bit confusing. Why not use rl here, and then assign at the end when the structure is ready? That way the reader immediately knows that global state is not used in the calculation:

if (arg_system)
    rl->rlim_max = MIN((rlim_t) nr, MAX(rl->rlim_max, (rlim_t) HIGH_RLIMIT_NOFILE));
arg_default_rlimit[RLIMIT_NOFILE] = rl;

# access them all and combine
LimitNOFILE=16384
# If there are many split up journal files we need a lot of fds to access them
# all and combine them.

Maybe drop "all and combine them", since that conveys no information and/or is incorrect (we don't use the fds for combining).

LimitNOFILE=16384
# If there are many split up journal files we need a lot of fds to access them
# all and combine them.
LimitNOFILE=262144

I'd prefer to see the HIGH_RLIMIT_NOFILE number exported as a configuration variable (with a fixed value, not configurable) and inserted here.


keszybz commented Oct 3, 2018

I'll set the needs-update flag, but feel free to drop it.

@keszybz added the reviewed/needs-rework 🔨 (PR has been reviewed and needs another round of reworks) label Oct 3, 2018
@poettering

I force-pushed a new version now, with all of @keszybz's items addressed except for the requested config variable thing. I didn't see a nice way to implement that without also having to remove the define from def.h and move it into meson, which I didn't like...

So, dunno, let's maybe do that part next time changing this rlimit comes up, for now the current patches should be good enough I think, if that makes sense?

no other changes

ptal!

@poettering removed the reviewed/needs-rework 🔨 label Oct 4, 2018


keszybz commented Oct 11, 2018

As discussed on Telegram, we also need to bump fs.file-max to avoid a trivial DoS.

@keszybz added the reviewed/needs-rework 🔨 label Oct 11, 2018
@poettering removed the reviewed/needs-rework 🔨 label Oct 11, 2018
@poettering

I force-pushed a new version now that also bumps the fs.file-max and fs.nr_open sysctls to their maximums. The code is frickin' ugly because the APIs are so weird, but we have to do what we have to do. Also, I commented it with emojis, so who can be mad?

PTAL!

This simply adds a new constant we can use for bumping RLIMIT_NOFILE to
a "high" value. It defaults to 256K for now, which is pretty high, but
smaller than the kernel built-in limit of 1M.

Previously, some tools that needed a higher RLIMIT_NOFILE bumped it to
16K. This new define goes substantially higher than this, following the
discussion with the kernel folks.
Following discussions with some kernel folks at All Systems Go! it
appears that file descriptors are not really as expensive as they used
to be (both memory and performance-wise) and it should thus be OK to allow
programs (including unprivileged ones) to have more of them without ill
effects.

Unfortunately we can't just raise the RLIMIT_NOFILE soft limit
globally for all processes, as select() and friends can't handle fds
>= 1024, and thus unsuspecting programs might fail if they accidentally get
an fd outside of that range. We can however raise the hard limit, so
that programs that need a lot of fds can opt-in into getting fds beyond
the 1024 boundary, simply by bumping the soft limit to the now higher
hard limit.

This is useful for all our client code that accesses the journal, as the
journal merging logic might need a lot of fds. Let's add a unified
function for bumping the limit in a robust way.
…the journal

This makes use of rlimit_nofile_bump() in all tools that access the
journal. In some cases this replaces older code to achieve this; in
others we add it where it was missing.
Following the discussions with the kernel folks, let's substantially
increase the hard limit (but not the soft limit) of RLIMIT_NOFILE to
256K for all services we start.

Note that PID 1 itself bumps the limit even further, to the max the
kernel allows. We can deal with that after all.
… the journal

This updates the unit files of all our services that deal with journal
stuff to use a higher RLIMIT_NOFILE soft limit by default. The new value
is the same as used for the new HIGH_RLIMIT_NOFILE we just added.

With this we ensure all code that accesses the journal has a higher
RLIMIT_NOFILE: the code that runs as a daemon gets it via the unit
files, and the code run from the user's command line gets it via C code
internal to the relevant tools. In some cases this means we'll
redundantly bump the limits, as some tools are run both from the
command line and as a service.
Previously we'd do this for PID 1 only. Let's do this when running in
user mode too, because we know we can handle it.
…anything

Just a tiny tweak to avoid generating an error if there's no need to.
@keszybz (Member) left a comment

Seems to all work fine.

meson.build Outdated
@@ -229,6 +231,8 @@ conf.set_quoted('USER_KEYRING_PATH', join_paths(pkgsysc
conf.set_quoted('DOCUMENT_ROOT', join_paths(pkgdatadir, 'gatewayd'))
conf.set('MEMORY_ACCOUNTING_DEFAULT', memory_accounting_default ? 'true' : 'false')
conf.set_quoted('MEMORY_ACCOUNTING_DEFAULT_YES_NO', memory_accounting_default ? 'yes' : 'no')
conf.set('BUMP_PROC_SYS_FS_FILE_MAX', bump_proc_sys_fs_file_max ? 'true' : 'false')
conf.set('BUMP_PROC_SYS_FS_NR_OPEN', bump_proc_sys_fs_nr_open ? 'true' : 'false')

This should be conf.set10 without the ternary operator. I'd also move it right under conf.set10('HAVE_SYSV_COMPAT', ...).

I'll push a fixup for this.

After discussions with kernel folks, a system with memcg really
shouldn't need extra hard limits on file descriptors anymore, as they
are properly accounted for by memcg anyway. Hence, let's bump these
values to their maximums.

This also adds a build time option to turn this off, to cover those
users who do not want to use memcg.

keszybz commented Oct 17, 2018

I pushed an amended commit and two more commits on top.

@keszybz added the good-to-merge/waiting-for-ci 👍 (PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed) label Oct 17, 2018
@poettering

additional changes look good to me...

btw, it might make sense to also export the high memlock thing via meson now, just to align this with the high fileno thing

@@ -53,7 +53,7 @@
#DefaultLimitSTACK=
#DefaultLimitCORE=
#DefaultLimitRSS=
#DefaultLimitNOFILE=
#DefaultLimitNOFILE=@HIGH_RLIMIT_NOFILE@

hmm, actually, this should be DefaultLimitNOFILE=1024:@HIGH_RLIMIT_NOFILE@


Updated.

Let's just use the simplest form, it doesn't really matter how the define
looks after preprocessing.
@poettering

lgtm

@poettering merged commit 8aeb1d3 into systemd:master Oct 17, 2018

aeikum commented Oct 22, 2018

Apologies that I missed this conversation by just days. Cool to see this increased by default. I don't know if Proton's esync patches were the main motivator for this, but I just wanted to add a note that we have seen real applications use over 300,000 eventfd objects via that implementation (Google Earth VR). I know Debian-derived distros have used a limit over 1,000,000 for a while. Unless there's some reason to prefer a lower limit, may I suggest just unifying around the number they picked?

dannyauble pushed a commit to SchedMD/slurm that referenced this pull request Dec 6, 2018
The Linux kernel default hard limit of 4096 for the number of file
descriptors is quite small. Debian/Ubuntu have for a long time
overridden this, increasing it to 1M. Recently systemd also bumped the
default to 512k.

https://github.com/systemd/systemd/blob/master/NEWS

systemd/systemd#10244

https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/ZN5TK3D6L7SE46KGXICUKLKPX2LQISVX/

systemd/systemd@09dad04

Here the limits are increased as follows:

- slurmd: 128k; some workloads like Hadoop/Spark need a lot of fd's,
  and recommend that the limit is increased to at least 64k.

- slurmctld: 64k; per the Slurm high throughput and big system guides
  which recommend a file-max of at least 32k.

- slurmdbd: 64k, matching slurmctld; while slurmdbd shouldn't need
  that many fd's, bumping the limit shouldn't hurt either.

Bug 6171

mimmus commented Aug 30, 2019

Just a note to report an issue with a Java application, which fails to start with an "unable to allocate file descriptor table" error, probably due to this PR and this bug:
https://bugs.openjdk.java.net/browse/JDK-8150460

Labels
good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed journal pid1 tree-wide units
4 participants