Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

systemd 240 fails to start mariadb + galera on Arch Linux #11510

Closed
cbredi opened this issue Jan 21, 2019 · 32 comments
Closed

systemd 240 fails to start mariadb + galera on Arch Linux #11510

cbredi opened this issue Jan 21, 2019 · 32 comments
Labels
needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer

Comments

@cbredi
Copy link

cbredi commented Jan 21, 2019

systemd version the issue has been seen with

systemd 240

Used distribution

Arch Linux

Expected behaviour you didn't see

Start mariadb + galera

Unexpected behaviour you saw

systemd[1]: Failed to start MariaDB 10.3.12 database server.

Steps to reproduce the problem
Upgrade from systemd 239 to 240 and mariadb + galera will not start after reboot

journalctl -u mariadb output:
Jan 20 18:30:36 black systemd[1]: Starting MariaDB 10.3.12 database server...
Jan 20 18:31:03 black sh[13666]: WSREP: Failed to start mysqld for wsrep recovery: '2019-01-20 18:30:37 0 [Note] ./bin/mysqld (mysqld 10.3.12-MariaDB-log) start>
Jan 20 18:31:03 black sh[13666]: /usr/bin/galera_recovery: line 71: 13797 Killed ./bin/mysqld --user=mysql --wsrep_recover --disable-log-error'
Jan 20 18:31:03 black systemd[1]: mariadb.service: Control process exited, code=exited, status=1/FAILURE
Jan 20 18:31:03 black systemd[1]: mariadb.service: Failed with result 'exit-code'.
Jan 20 18:31:03 black systemd[1]: Failed to start MariaDB 10.3.12 database server.

dmesg output:
[125744.312898] Out of memory: Kill process 13797 (mysqld) score 608 or sacrifice child
[125744.312907] Killed process 13797 (mysqld) total-vm:21035008kB, anon-rss:6903752kB, file-rss:4kB, shmem-rss:0kB
[125744.488378] oom_reaper: reaped process 13797 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

mariadb + galera starts successfully on systemd 239

@cbredi
Copy link
Author

cbredi commented Jan 21, 2019

I have also raised the same issue on https://bugs.archlinux.org/task/61433

@poettering
Copy link
Member

hmm, if you lower RLIMIT_NOFILE (both hard and soft) to something low, e.g. 4096, do things work then?

@poettering poettering added the needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer label Jan 21, 2019
@cbredi
Copy link
Author

cbredi commented Jan 21, 2019

LimitNOFILE=16364 is already in mariadb.service.
With LimitNOFILE=4096 mariadb + galera still won’t start under systemd 240.

@eworm-de
Copy link
Contributor

Just bisected this and found a8b627a to be the first bad commit.

@eworm-de
Copy link
Contributor

BTW, kernel is built with CONFIG_MEMCG=y.

@cbredi
Copy link
Author

cbredi commented Jan 23, 2019

Rebooting with cgroup_disable=memory kernel parameter does not fix this issue.

@eworm-de
Copy link
Contributor

With systemd 239:

# grep . /proc/sys/fs/{file-max,nr_open}
/proc/sys/fs/file-max:400965
/proc/sys/fs/nr_open:1048576

... and 240:

# grep . /proc/sys/fs/{file-max,nr_open}
/proc/sys/fs/file-max:18446744073709551615
/proc/sys/fs/nr_open:1073741816

So does mariadb start if we set the values to what we had before?

# sysctl -w fs.file-max=400965
fs.file-max = 400965
# sysctl -w fs.nr_open=1048576
fs.nr_open = 1048576

And even more interesting... Does it start if we restore just one of both?

@eworm-de
Copy link
Contributor

eworm-de commented Jan 23, 2019

Does not help, even building systemd with -Dbump-proc-sys-fs-file-max=false and -Dbump-proc-sys-fs-nr-open=false does not change anything. Wondering if I messed with bisect... Or do we have another issue on top?
Scratch this. I built from master and did not notice mariadb did not start due to permission denied. (Let's hope this does not rise another issue.)

@eworm-de
Copy link
Contributor

Building with -Dbump-proc-sys-fs-file-max=false and -Dbump-proc-sys-fs-nr-open=false works around the issue.

Setting the values manually I can still restart mariadb successfully:

# sysctl -w fs.file-max=18446744073709551615
fs.file-max = 18446744073709551615
# sysctl -w fs.nr_open=1073741816          
fs.nr_open = 1073741816

So what else does this influence?

@eworm-de
Copy link
Contributor

Looks like after changing the values we need a daemon-reexec and/or daemon-reload to make the issue happen.

Booted systemd built with -Dbump-proc-sys-fs-file-max=false and -Dbump-proc-sys-fs-nr-open=false => OK

Run sysctl -w fs.file-max=18446744073709551615, daemon-re{exec,load}, restart mariadb => OK

Run sysctl -w fs.nr_open=1073741816, daemon-re{exec,load}, restart mariadb => Killed with OOM

@poettering
Copy link
Member

so that suggests mariadb proabably uses fs.nr_open for sizing a memory allocation. It really shouldn't do this, and this needs to be fixed in mariadb, so that it doesn't do this anymore.

THere's nothing really actionable on systemd's side I figure. There are two ways out for arch I guess:

  1. for now, build systemd with -Dbump-proc-sys-fs-nr-open=false and avoid the problem. Then work with mariadb to fix it properly.
  2. include a sysctl snippet in the arch mariadb package to lower nr_open if it's installed. Then work with mariadb to fix it properly.

@cbredi
Copy link
Author

cbredi commented Jan 23, 2019

Thank you for your efforts. I can confirm that with fs.nr_open = 1048576 mariabd will start only after daemon-reexec on systemd 240.

How to set fs.nr_open = 1048576 for mariadb without having to recompile systemd? I added fs.nr_open = 1048576 to /etc/sysctl.d but mariadb won't start unless systemctl daemon-reexec is run, not even after a reboot.

@poettering
Copy link
Member

the reexec shouldn't be necessary. I figure you need to combine LimitNOFILE= with the fs.nr_open sysctl change.

(background: when systemd determines which RLIMIT_NOFILE to default to for its services it takes the fs.nr_open into account, and won't set the RLIMIT_NOFILE larger than the sysctl. Hence, if you set the sysctl explicitly and set LimitNOFILE= explicitly for that specific service, you should be good)

@cbredi
Copy link
Author

cbredi commented Jan 23, 2019

I have added

[Service]
ExecStartPre=/usr/bin/sysctl -w fs.nr_open=1048576

to /etc/systemd/system/mariadb.service.d/extra.conf but it looks that this won't help start mariadb. Apparently this code is executed after mariadb.service ExecStartPre galera_recovery

@cbredi
Copy link
Author

cbredi commented Jan 23, 2019

mariadb.service has LimitNOFILE=16364

@eworm-de
Copy link
Contributor

eworm-de commented Jan 23, 2019

so that suggests mariadb proabably uses fs.nr_open for sizing a memory allocation. It really shouldn't do this, and this needs to be fixed in mariadb, so that it doesn't do this anymore.

I could not find anything in mariadb or galera code base. Where could the value be read other than /proc?

@eworm-de
Copy link
Contributor

These are the last lines form strace for the mysqld process:

[pid  1287] prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1073741815, rlim_max=1073741815}) = 0
[pid  1287] mmap(NULL, 21474836480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fa754c00000
[pid  1287] mmap(NULL, 6291456, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fa754600000
[pid  1287] +++ killed by SIGKILL +++

@floppym
Copy link
Contributor

floppym commented Jan 23, 2019

This function looks pretty suspicious. It checks/sets the maximum number of open files, and then immediately calls malloc().

https://github.com/MariaDB/server/blob/5abc79dd7ab2fccb4b05ca38a512ec816d2f8e52/mysys/my_file.c#L95

@eworm-de
Copy link
Contributor

Created a bug report for MariaDB:
https://jira.mariadb.org/browse/MDEV-18360

@poettering
Copy link
Member

hmm, if mariadb.service indeed has LimitNOFILE=16364, how come that in the strace @eworm-de posted the limit is 1073741815? Something is strange there. Is it possible that somewhere mariadb internally bumps RLIMIT_NOFILE to fs.nr_open first?

@cbredi
Copy link
Author

cbredi commented Jan 23, 2019

This is mariadb.service file:

[Unit]
Description=MariaDB 10.3.12 database server
Documentation=man:mysqld(8)
Documentation=https://mariadb.com/kb/en/library/systemd/
After=network.target
[Install]
WantedBy=multi-user.target
[Service]
Type=notify
PrivateNetwork=false
User=mysql
Group=mysql
CapabilityBoundingSet=CAP_IPC_LOCK
ProtectSystem=full
PrivateDevices=true
ProtectHome=true
PermissionsStartOnly=true
ExecStartPre=/bin/sh -c "systemctl unset-environment _WSREP_START_POSITION"
ExecStartPre=/bin/sh -c "[ ! -e /usr/bin/galera_recovery ] && VAR= || \
 VAR=`/usr/bin/galera_recovery`; [ $? -eq 0 ] \
 && systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1"
ExecStart=/usr/bin/mysqld $MYSQLD_OPTS $_WSREP_NEW_CLUSTER $_WSREP_START_POSITION
ExecStartPost=/bin/sh -c "systemctl unset-environment _WSREP_START_POSITION"
KillSignal=SIGTERM
SendSIGKILL=no
Restart=on-abort
RestartSec=5s
UMask=007
PrivateTmp=true
LimitNOFILE=16364
Environment="LD_PRELOAD=/usr/lib/libjemalloc.so"

@eworm-de
Copy link
Contributor

Does LimitNOFILE=16364 go for ExecStart= only, not for ExecStartPre=?
That would explain why standalone mariadb is ok but mariadb with galera fails.

@poettering
Copy link
Member

Does LimitNOFILE=16364 go for ExecStart= only, not for ExecStartPre=?
That would explain why standalone mariadb is ok but mariadb with galera fails.

nope, should apply to all processes forked off for the service.

@cvicentiu
Copy link

@eworm-de Thanks a lot for your investigation. This helped me a lot in reproducing. Note that it doesn't crash for me as my workstation has sufficient memory to allocate the mess :)

I've investigated this on MariaDB side. The too-big alloc happens here:
https://github.com/MariaDB/server/blob/5abc79dd7ab2fccb4b05ca38a512ec816d2f8e52/mysys/my_file.c#L106

This happens because of the call from above of set_max_open_files, which looks for the magic number RLIM_INFINITY. If the rlim_cur is not equal to the magic number, but greater than the requested max_file_limit, we'll return the current rlim_cur value. That's bad as it means we can return a very large number.

I will push a fix into MariaDB soon, to 5.5, which will end up in all future releases. In the meantime, one can patch MariaDB like so:

diff --git a/mysys/my_file.c b/mysys/my_file.c
index a23ab487d00..23226595b2e 100644
--- a/mysys/my_file.c
+++ b/mysys/my_file.c
@@ -52,10 +52,9 @@ static uint set_max_open_files(uint max_file_limit)
     DBUG_PRINT("info", ("rlim_cur: %u  rlim_max: %u",
                        (uint) rlimit.rlim_cur,
                        (uint) rlimit.rlim_max));
-    if ((ulonglong) rlimit.rlim_cur == (ulonglong) RLIM_INFINITY)
-      rlimit.rlim_cur = max_file_limit;
-    if (rlimit.rlim_cur >= max_file_limit)
-      DBUG_RETURN(rlimit.rlim_cur);            /* purecov: inspected */
+    if ((ulonglong) rlimit.rlim_cur == (ulonglong) RLIM_INFINITY ||
+        rlimit.rlim_cur >= max_file_limit)
+      DBUG_RETURN(max_file_limit);
     rlimit.rlim_cur= rlimit.rlim_max= max_file_limit;
     if (setrlimit(RLIMIT_NOFILE, &rlimit))
       max_file_limit= old_cur;                 /* Use original value */

Or set max open files within systemd script to a sane number.

@eworm-de
Copy link
Contributor

Thanks @cvicentiu !

I was looking at that code myself, but it was too late for me to get the facts. :-p
It's kind of funny that systemd makes code break that lasted for nearly two decades.

This change looks promising, I pushed new packages with this patch to [testing].

@eworm-de
Copy link
Contributor

The permission denied issue I mentioned above is caused by enabling fs.protected_regular by default. Already opened a pull request for MariaDB, perhaps interesting for @cvicentiu ?
MariaDB/server#1137

@poettering
Copy link
Member

I'd still be really keen on figuring out why the LimitNOFILE= line didn't work to fix this. If you temporarily replace the mysqld binary with a shell script that just dumps ulimit -a -H and ulimit -a -S somewhere and thengoes on and exec()s the realy mysqld, what do you see there? is the rlimit properly set?

@eworm-de
Copy link
Contributor

Looks like the limits are not applied to ExecStartPre... Had this in my service:

ExecStartPre=/usr/bin/bash -c "ulimit -a -H"
ExecStartPre=/usr/bin/bash -c "ulimit -a -S"
ExecStart=/usr/bin/bash -c "ulimit -a -S"

... and the result:

bash[6290]: core file size          (blocks, -c) unlimited
bash[6290]: data seg size           (kbytes, -d) unlimited
bash[6290]: scheduling priority             (-e) 0
bash[6290]: file size               (blocks, -f) unlimited
bash[6290]: pending signals                 (-i) 15706
bash[6290]: max locked memory       (kbytes, -l) 65536
bash[6290]: max memory size         (kbytes, -m) unlimited
bash[6290]: open files                      (-n) 1073741816
bash[6290]: pipe size            (512 bytes, -p) 8
bash[6290]: POSIX message queues     (bytes, -q) 819200
bash[6290]: real-time priority              (-r) 0
bash[6290]: stack size              (kbytes, -s) unlimited
bash[6290]: cpu time               (seconds, -t) unlimited
bash[6290]: max user processes              (-u) 15706
bash[6290]: virtual memory          (kbytes, -v) unlimited
bash[6290]: file locks                      (-x) unlimited
bash[6291]: core file size          (blocks, -c) unlimited
bash[6291]: data seg size           (kbytes, -d) unlimited
bash[6291]: scheduling priority             (-e) 0
bash[6291]: file size               (blocks, -f) unlimited
bash[6291]: pending signals                 (-i) 15706
bash[6291]: max locked memory       (kbytes, -l) 65536
bash[6291]: max memory size         (kbytes, -m) unlimited
bash[6291]: open files                      (-n) 1073741816
bash[6291]: pipe size            (512 bytes, -p) 8
bash[6291]: POSIX message queues     (bytes, -q) 819200
bash[6291]: real-time priority              (-r) 0
bash[6291]: stack size              (kbytes, -s) 8192
bash[6291]: cpu time               (seconds, -t) unlimited
bash[6291]: max user processes              (-u) 15706
bash[6291]: virtual memory          (kbytes, -v) unlimited
bash[6291]: file locks                      (-x) unlimited
bash[6436]: core file size          (blocks, -c) unlimited
bash[6436]: data seg size           (kbytes, -d) unlimited
bash[6436]: scheduling priority             (-e) 0
bash[6436]: file size               (blocks, -f) unlimited
bash[6436]: pending signals                 (-i) 15706
bash[6436]: max locked memory       (kbytes, -l) 65536
bash[6436]: max memory size         (kbytes, -m) unlimited
bash[6436]: open files                      (-n) 16364
bash[6436]: pipe size            (512 bytes, -p) 8
bash[6436]: POSIX message queues     (bytes, -q) 819200
bash[6436]: real-time priority              (-r) 0
bash[6436]: stack size              (kbytes, -s) 8192
bash[6436]: cpu time               (seconds, -t) unlimited
bash[6436]: max user processes              (-u) 15706
bash[6436]: virtual memory          (kbytes, -v) unlimited
bash[6436]: file locks                      (-x) unlimited

The limit on open files is applied to last invocation only.

@cvicentiu
Copy link

FYI:
https://jira.mariadb.org/browse/MDEV-18360

Was fixed. Each subsequent release from this point on should have it incorporated.

@eworm-de
Copy link
Contributor

Well, the service file has PermissionsStartOnly=true. So yes, LimitNOFILE= is ignored for ExecStartPre=...

Wondering if having this in the service file has a good reason... But that's not systemd's issue, I guess we can close here.

@eworm-de
Copy link
Contributor

Created another pull request for MariaDB... Running commands with root privileges just to pass an environment value is not that smart. Another one for @cvicentiu ?
MariaDB/server#1143

This would have prevented the complete issue.

@poettering
Copy link
Member

Well, the service file has PermissionsStartOnly=true. So yes, LimitNOFILE= is ignored for ExecStartPre=...

Oh, right. That makes sense. Of course, one might argue whether LimitXYZ= should apply to ExecStartPre= or not... (I do wonder though why ExecStartPre= is used at all here, maybe whatever is started there should actually be started in a separate service ordered before.)

but yeah, i think the behaviour in systemd does make some sense the way it is hence. Given that mariadb is fixing this already upstream I guess we can close this here as @eworm-de suggested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-reporter-feedback ❓ There's an unanswered question, the reporter needs to answer
Development

No branches or pull requests

5 participants