bump RLIMIT_NOFILE #10244
Conversation
/cc @htejun
Reviewed. Suggestions rather than issues really.
src/basic/rlimit-util.c
Outdated
r = setrlimit_closest(RLIMIT_NOFILE, &RLIMIT_MAKE_CONST(limit));
if (r < 0)
        return log_debug_errno(r, "Failed to bump RLIMIT_NOFILE: %m");
"bump" → "set", because it can be lower than the current value.
src/core/main.c
Outdated
* select(), the way everybody should) can explicitly opt into high fds by bumping their soft limit
* beyond 1024, to the hard limit we pass. */
if (arg_system)
        arg_default_rlimit[RLIMIT_NOFILE]->rlim_max = MIN((rlim_t) nr, MAX(arg_default_rlimit[RLIMIT_NOFILE]->rlim_max, (rlim_t) HIGH_RLIMIT_NOFILE));
The fact that the "global" arg_default_rlimit is used here is a bit confusing. Why not use rl here, and then assign at the end when the structure is ready? That way the reader immediately knows that global state is not used in the calculation:
if (arg_system)
rl->rlim_max = MIN((rlim_t) nr, MAX(rl->rlim_max, (rlim_t) HIGH_RLIMIT_NOFILE));
arg_default_rlimit[RLIMIT_NOFILE] = rl;
# access them all and combine
LimitNOFILE=16384
# If there are many split up journal files we need a lot of fds to access them
# all and combine them.
Maybe drop "all and combine them", since that conveys no information and/or is incorrect (we don't use the fds for combining).
LimitNOFILE=16384
# If there are many split up journal files we need a lot of fds to access them
# all and combine them.
LimitNOFILE=262144
I'd prefer to see the HIGH_RLIMIT_NOFILE number exported as a configuration variable (with a fixed value, not configurable) and inserted here.
I'll set the needs-update flag, but feel free to drop it.
Force-pushed aa3d73a to 0c7ed3d.
I force pushed a new version now, with all of @keszybz's items addressed, except for the requested config variable thing. I didn't see a nice way to implement that without also having to remove the define from def.h and move it into meson, which I didn't like... So, dunno, let's maybe do that part next time changing this rlimit comes up; for now the current patches should be good enough, I think, if that makes sense? No other changes, ptal!
Also relevant, this discussion:
As discussed on Telegram, we also need to bump
Force-pushed 0c7ed3d to 4a60402.
I force pushed a new version now that also bumps the fs.file-max and fs.nr_open sysctls to their maximums. The code is frickin' ugly because the APIs are so weird, but we have to do what we have to do. Also, I commented it with emojis, so who can be mad? PTAL!
This simply adds a new constant we can use for bumping RLIMIT_NOFILE to a "high" value. It defaults to 256K for now, which is pretty high, but smaller than the kernel built-in limit of 1M. Previously, some tools that needed a higher RLIMIT_NOFILE bumped it to 16K. This new define goes substantially higher than that, following the discussion with the kernel folks.
Following discussions with some kernel folks at All Systems Go! it appears that file descriptors are not really as expensive as they used to be (both memory and performance-wise) and it should thus be OK to allow programs (including unprivileged ones) to have more of them without ill effects. Unfortunately we can't just raise the RLIMIT_NOFILE soft limit globally for all processes, as select() and friends can't handle fds >= 1024, and thus unsuspecting programs might fail if they accidentally get an fd outside of that range. We can however raise the hard limit, so that programs that need a lot of fds can opt in to getting fds beyond the 1024 boundary, simply by bumping the soft limit to the now higher hard limit. This is useful for all our client code that accesses the journal, as the journal merging logic might need a lot of fds. Let's add a unified function for bumping the limit in a robust way.
…the journal This makes use of rlimit_nofile_bump() in all tools that access the journal. In some cases this replaces older code to achieve this; in others we add it where it was missing.
Following the discussions with the kernel folks, let's substantially increase the hard limit (but not the soft limit) of RLIMIT_NOFILE to 256K for all services we start. Note that PID 1 itself bumps the limit even further, to the max the kernel allows. We can deal with that, after all.
… the journal This updates the unit files of all our services that deal with journal stuff to use a higher RLIMIT_NOFILE soft limit by default. The new value is the same as used for the new HIGH_RLIMIT_NOFILE we just added. With this we ensure all code that accesses the journal has a higher RLIMIT_NOFILE: the code that runs as a daemon gets it via the unit files, the code run from the user's command line via C code internal to the relevant tools. In some cases this means we'll redundantly bump the limits, as there are tools run both from the command line and as a service.
Previously we'd do this for PID 1 only. Let's do this when running in user mode too, because we know we can handle it.
…anything Just a tiny tweak to avoid generating an error if there's no need to.
Force-pushed 4a60402 to f95ebff.
Seems to all work fine.
meson.build
Outdated
@@ -229,6 +231,8 @@ conf.set_quoted('USER_KEYRING_PATH', join_paths(pkgsysc
conf.set_quoted('DOCUMENT_ROOT', join_paths(pkgdatadir, 'gatewayd'))
conf.set('MEMORY_ACCOUNTING_DEFAULT', memory_accounting_default ? 'true' : 'false')
conf.set_quoted('MEMORY_ACCOUNTING_DEFAULT_YES_NO', memory_accounting_default ? 'yes' : 'no')
conf.set('BUMP_PROC_SYS_FS_FILE_MAX', bump_proc_sys_fs_file_max ? 'true' : 'false')
conf.set('BUMP_PROC_SYS_FS_NR_OPEN', bump_proc_sys_fs_nr_open ? 'true' : 'false')
This should be conf.set10 without the ternary operator. I'd also move it right under conf.set10('HAVE_SYSV_COMPAT', ...).
I'll push a fixup for this.
After discussions with kernel folks, a system with memcg really shouldn't need extra hard limits on file descriptors anymore, as they are properly accounted for by memcg anyway. Hence, let's bump these values to their maximums. This also adds a build-time option to turn this off, to cover those users who do not want to use memcg.
I pushed an amended commit and two more commits on top.
Additional changes look good to me... btw, it might make sense to also export the high memlock thing via meson now, just to align this with the high fileno thing.
src/core/system.conf.in
Outdated
@@ -53,7 +53,7 @@
#DefaultLimitSTACK=
#DefaultLimitCORE=
#DefaultLimitRSS=
#DefaultLimitNOFILE=
#DefaultLimitNOFILE=@HIGH_RLIMIT_NOFILE@
hmm, actually, this should be DefaultLimitNOFILE=1024:@HIGH_RLIMIT_NOFILE@
Updated.
Let's just use the simplest form, it doesn't really matter how the define looks after preprocessing.
lgtm
Apologies that I missed this conversation by just days. Cool to see this increased by default. I don't know if Proton's esync patches were the main motivator for this, but I just wanted to add a note that we have seen real applications use over 300,000 eventfd objects via that implementation (Google Earth VR). I know Debian-derived distros have used a limit over 1,000,000 for a while. Unless there's some reason to prefer a lower limit, may I suggest just unifying around the number they picked?
The Linux kernel default hard limit of 4096 for the number of file descriptors is quite small. Debian/Ubuntu have for a long time overridden this, increasing it to 1M. Recently systemd also bumped the default to 512k.
https://github.com/systemd/systemd/blob/master/NEWS
systemd/systemd#10244
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/ZN5TK3D6L7SE46KGXICUKLKPX2LQISVX/
systemd/systemd@09dad04
Here the limits are increased as follows:
- slurmd: 128k; some workloads like Hadoop/Spark need a lot of fds, and recommend that the limit is increased to at least 64k.
- slurmctld: 64k; per the Slurm high throughput and big system guides, which recommend a file-max of at least 32k.
- slurmdbd: 64k, matching slurmctld; though slurmdbd shouldn't need that many fds, bumping the limit shouldn't hurt either.
Bug 6171
Just a note to report an issue with a Java application failing to start with an "unable to allocate file descriptor table" error, probably due to this PR and this bug:
After discussions with some kernel devs at All Systems Go! it appears that we should bump the RLIMIT_NOFILE hard limit substantially these days, as fds are nowadays cheap in memory and performance, much cheaper than they used to be. (or in other words: by bumping the hard limit we have little to lose but a lot to win)
This PR universally bumps the hard limit to 256K, overriding the kernel's default of 4K. The soft limit's default of 1K is left as it is (for compat with select()). For all services and tools using the journal this also bumps the soft limit to 256K, and for the PID 1 (and systemd --user) instances to the highest value permitted.