
bump RLIMIT_NOFILE #10244

Merged
merged 14 commits into from Oct 17, 2018

Conversation

@poettering (Member) commented Oct 2, 2018

After discussions with some kernel devs at All Systems Go! it appears that we should bump the RLIMIT_NOFILE hard limit substantially these days, as fds are nowadays much cheaper in memory and performance than they used to be. (Or in other words: by bumping the hard limit we have little to lose but a lot to win.)

This PR universally bumps the hard limit to 256K, overriding the kernel's default of 4K. The soft limit's default of 1K is left as it is (for compatibility with select()). For all services and tools using the journal this also bumps the soft limit to 256K, and for the PID 1 (and systemd --user) instances to the highest value permitted.
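Concretely, under this scheme a journal-heavy service opts in through its unit file while everything else keeps the select()-safe 1024 soft limit. A hypothetical drop-in illustrating the idea (the path and service name are illustrative; 262144 is the 256K value this PR picks):

```ini
# /etc/systemd/system/example.service.d/nofile.conf  (illustrative path)
[Service]
# Sets both the soft and the hard limit to 256K. Only safe for services
# known not to use select(), which cannot handle fds >= 1024.
LimitNOFILE=262144
```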

@poettering

/cc @htejun

@keszybz (Member) left a comment

Reviewed. Suggestions rather than issues really.


r = setrlimit_closest(RLIMIT_NOFILE, &RLIMIT_MAKE_CONST(limit));
if (r < 0)
return log_debug_errno(r, "Failed to bump RLIMIT_NOFILE: %m");

"bump" → "set", because it can be lower than the current value.

src/core/main.c Outdated
* select(), the way everybody should) can explicitly opt into high fds by bumping their soft limit
* beyond 1024, to the hard limit we pass. */
if (arg_system)
arg_default_rlimit[RLIMIT_NOFILE]->rlim_max = MIN((rlim_t) nr, MAX(arg_default_rlimit[RLIMIT_NOFILE]->rlim_max, (rlim_t) HIGH_RLIMIT_NOFILE));

The fact that the "global" arg_default_rlimit is used here is a bit confusing. Why not use rl here, and then assign at the end when the structure is ready? That way the reader immediately knows that global state is not used in the calculation:

if (arg_system)
    rl->rlim_max = MIN((rlim_t) nr, MAX(rl->rlim_max, (rlim_t) HIGH_RLIMIT_NOFILE));
arg_default_rlimit[RLIMIT_NOFILE] = rl;

# access them all and combine
LimitNOFILE=16384
# If there are many split up journal files we need a lot of fds to access them
# all and combine them.

Maybe drop "all and combine them", since that conveys no information and/or is incorrect (we don't use the fds for combining).

LimitNOFILE=16384
# If there are many split up journal files we need a lot of fds to access them
# all and combine them.
LimitNOFILE=262144

I'd prefer to see the HIGH_RLIMIT_NOFILE number exported as a configuration variable (with a fixed value, not configurable) and inserted here.


keszybz commented Oct 3, 2018

I'll set the needs-update flag, but feel free to drop it.

@keszybz added the reviewed/needs-rework 🔨 (PR has been reviewed and needs another round of reworks) label Oct 3, 2018
@poettering

I force-pushed a new version now, with all of @keszybz's items addressed except for the requested config variable thing. I didn't see a nice way to implement that without also having to remove the define from def.h and move it into meson, which I didn't like...

So, dunno, let's maybe do that part next time changing this rlimit comes up, for now the current patches should be good enough I think, if that makes sense?

no other changes

ptal!

@poettering removed the reviewed/needs-rework 🔨 label Oct 4, 2018


keszybz commented Oct 11, 2018

As discussed on Telegram, we also need to bump fs.file-max to avoid a trivial DoS.

@keszybz added the reviewed/needs-rework 🔨 label Oct 11, 2018
@poettering removed the reviewed/needs-rework 🔨 label Oct 11, 2018
@poettering

I force-pushed a new version now that also bumps the fs.file-max and fs.nr_open sysctls to their maximums. The code is frickin' ugly because the APIs are so weird, but we have to do what we have to do. Also, I commented it with emojis, so who can be mad?

PTAL!

This simply adds a new constant we can use for bumping RLIMIT_NOFILE to
a "high" value. It defaults to 256K for now, which is pretty high, but
smaller than the kernel built-in limit of 1M.

Previously, some tools that needed a higher RLIMIT_NOFILE bumped it to
16K. This new define goes substantially higher than this, following the
discussion with the kernel folks.
Following discussions with some kernel folks at All Systems Go! it
appears that file descriptors are not really as expensive as they used
to be (both memory and performance-wise) and it should thus be OK to allow
programs (including unprivileged ones) to have more of them without ill
effects.

Unfortunately we can't just raise the RLIMIT_NOFILE soft limit
globally for all processes, as select() and friends can't handle fds
>= 1024, and thus unsuspecting programs might fail if they accidentally get
an fd outside of that range. We can however raise the hard limit, so
that programs that need a lot of fds can opt-in into getting fds beyond
the 1024 boundary, simply by bumping the soft limit to the now higher
hard limit.

This is useful for all our client code that accesses the journal, as the
journal merging logic might need a lot of fds. Let's add a unified
function for bumping the limit in a robust way.
…the journal

This makes use of rlimit_nofile_bump() in all tools that access the
journal. In some cases this replaces older code to achieve this; in
others we add it where it was missing.
Following the discussions with the kernel folks, let's substantially
increase the hard limit (but not the soft limit) of RLIMIT_NOFILE to
256K for all services we start.

Note that PID 1 itself bumps the limit even further, to the max the
kernel allows. We can deal with that after all.
… the journal

This updates the unit files of all our services that deal with journal
stuff to use a higher RLIMIT_NOFILE soft limit by default. The new value
is the same as used for the new HIGH_RLIMIT_NOFILE we just added.

With this we ensure all code that accesses the journal has a higher
RLIMIT_NOFILE: the code that runs as a daemon gets it via the unit
files, and the code run from the user's command line gets it via C code
internal to the relevant tools. In some cases this means we'll
redundantly bump the limits, as some tools are run both from the
command line and as a service.
Previously we'd do this for PID 1 only. Let's do this when running in
user mode too, because we know we can handle it.
…anything

Just a tiny tweak to avoid generating an error if there's no need to.
@keszybz (Member) left a comment

Seems to all work fine.

meson.build Outdated
@@ -229,6 +231,8 @@ conf.set_quoted('USER_KEYRING_PATH', join_paths(pkgsysc
conf.set_quoted('DOCUMENT_ROOT', join_paths(pkgdatadir, 'gatewayd'))
conf.set('MEMORY_ACCOUNTING_DEFAULT', memory_accounting_default ? 'true' : 'false')
conf.set_quoted('MEMORY_ACCOUNTING_DEFAULT_YES_NO', memory_accounting_default ? 'yes' : 'no')
conf.set('BUMP_PROC_SYS_FS_FILE_MAX', bump_proc_sys_fs_file_max ? 'true' : 'false')
conf.set('BUMP_PROC_SYS_FS_NR_OPEN', bump_proc_sys_fs_nr_open ? 'true' : 'false')

This should be conf.set10 without the ternary operator. I'd also move it right under conf.set10('HAVE_SYSV_COMPAT', ...).

I'll push a fixup for this.

After discussions with kernel folks, a system with memcg really
shouldn't need extra hard limits on file descriptors anymore, as they
are properly accounted for by memcg anyway. Hence, let's bump these
values to their maximums.

This also adds a build time option to turn this off, to cover those
users who do not want to use memcg.

keszybz commented Oct 17, 2018

I pushed an amended commit and two more commits on top.

@keszybz added the good-to-merge/waiting-for-ci 👍 (PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed) label Oct 17, 2018
@poettering

additional changes look good to me...

btw, it might make sense to also export the high memlock thing via meson now, just to align this with the high fileno thing

@@ -53,7 +53,7 @@
#DefaultLimitSTACK=
#DefaultLimitCORE=
#DefaultLimitRSS=
#DefaultLimitNOFILE=
#DefaultLimitNOFILE=@HIGH_RLIMIT_NOFILE@

hmm, actually, this should be DefaultLimitNOFILE=1024:@HIGH_RLIMIT_NOFILE@


Updated.

Let's just use the simplest form, it doesn't really matter how the define
looks after preprocessing.
@poettering

lgtm

@poettering merged commit 8aeb1d3 into systemd:master Oct 17, 2018

aeikum commented Oct 22, 2018

Apologies that I missed this conversation by just days. Cool to see this increased by default. I don't know if Proton's esync patches were the main motivator for this, but I just wanted to add a note that we have seen real applications use over 300,000 eventfd objects via that implementation (Google Earth VR). I know Debian-derived distros have used a limit over 1,000,000 for a while. Unless there's some reason to prefer a lower limit, may I suggest just unifying around the number they picked?

dannyauble pushed a commit to SchedMD/slurm that referenced this pull request Dec 6, 2018
The Linux kernel default hard limit of 4096 for the number of file
descriptors is quite small. Debian/Ubuntu have for a long time
overridden this, increasing it to 1M. Recently systemd also bumped the
default to 512k.

https://github.com/systemd/systemd/blob/master/NEWS

systemd/systemd#10244

https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/ZN5TK3D6L7SE46KGXICUKLKPX2LQISVX/

systemd/systemd@09dad04

Here the limits are increased as follows:

- slurmd: 128k; some workloads like Hadoop/Spark need a lot of fd's,
  and recommend that the limit is increased to at least 64k.

- slurmctld: 64k; per the Slurm high throughput and big system guides
  which recommend a file-max of at least 32k.

- slurmdbd: 64k, matching slurmctld; while slurmdbd shouldn't need
  that many fd's, bumping the limit shouldn't hurt either.

Bug 6171

mimmus commented Aug 30, 2019

Just a note to report an issue with a Java application, which fails to start with an "unable to allocate file descriptor table" error, probably due to this PR and this bug:
https://bugs.openjdk.java.net/browse/JDK-8150460

Labels
good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed journal pid1 tree-wide units
4 participants