xrootd dies with RTMIN+29 #1113

bockjoo · 2020-01-14T21:45:58Z

Process: 3140 ExecStart=/usr/bin/xrootd -l /var/log/xrootd/xrootd.log -c /etc/xrootd/xrootd-%i.cfg -k fifo -s /var/run/xrootd/xrootd-%i.pid -n %i -R xrootd (code=killed, signal=RTMIN+29)
Main PID: 3140 (code=killed, signal=RTMIN+29)

bockjoo · 2020-01-14T22:13:41Z

This is the systemctl status of the priviledged xrootd

echo Checking xrootd-privileged status
● xrootd-privileged@clustered.service - XRootD xrootd deamon instance clustered
Loaded: loaded (/usr/lib/systemd/system/xrootd-privileged@.service; disabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Tue 2020-01-14 17:10:59 EST; 2s ago
Docs: man:xrootd(8)
https://urldefense.proofpoint.com/v2/url?u=http-3A__xrootd.org_docs.html&d=DwIDaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=KD6WLuIFdDP9pLkLtsx3jFJEuhhZthNRKW96sbeIehg&m=OjiGgxMCbMLg9KuyUWzTY2sUyZAlj384KtoRE0U9DYQ&s=QYf39kknWNvzgKBcezC3t32VfWe_BcHxdUJYNz5oguA&e=
Process: 21951 ExecStart=/usr/bin/xrootd -l /var/log/xrootd/xrootd.log -c /etc/xrootd/xrootd-%i.cfg -k fifo -s /var/run/xrootd/xrootd-%i.pid -n %i -R xrootd (code=killed, signal=RTMIN+29)
Main PID: 21951 (code=killed, signal=RTMIN+29)

bbockelm · 2020-01-15T18:48:21Z

This is not familiar to me. Is there anything "obvious" about the behavior of the server right before the crash? Is there a particular sequence of events or user who triggers the problem?

bockjoo · 2020-01-15T18:57:31Z

There is nothing much.
Two users were accessing the xrootd a lot.
One of them was accessing a directory that does not exist using
ofs_opendir
I did not notice anything special with the other guy.
Also, it looked like the xrootd was restarting by itself at a high frequency, thus the message?

osschar · 2020-01-15T18:59:56Z

Real-time-min + 29 is not even a thing for any thread implementation, as far as I could see. And I also suspect nobody here is running real-time kernel so this must be completely spurious.

Do you have a core file?

bockjoo · 2020-01-15T19:07:04Z

I can not find any core file. It's not segfaulted according to the syslog:
Jan 14 23:11:46 cmsio4 systemd[1]: Stopped XRootD xrootd deamon instance clustered.
Jan 14 23:11:46 cmsio4 systemd[1]: Started XRootD xrootd deamon instance clustered.
Jan 14 23:11:52 cmsio4 systemd[1]: xrootd-privileged@clustered.service: main process exited, code=killed, status=63/RTMIN+29
Jan 14 23:11:52 cmsio4 systemd[1]: Unit xrootd-privileged@clustered.service entered failed state.
Jan 14 23:11:52 cmsio4 systemd[1]: xrootd-privileged@clustered.service failed.
Jan 14 23:11:52 cmsio4 systemd[1]: xrootd-privileged@clustered.service has no holdoff time, scheduling restart.
Jan 14 23:11:52 cmsio4 systemd[1]: Stopped XRootD xrootd deamon instance clustered.
Jan 14 23:11:52 cmsio4 systemd[1]: Started XRootD xrootd deamon instance clustered.
Jan 14 23:12:16 cmsio4 systemd[1]: xrootd-privileged@clustered.service: main process exited, code=killed, status=63/RTMIN+29
Jan 14 23:12:16 cmsio4 systemd[1]: Unit xrootd-privileged@clustered.service entered failed state.
Jan 14 23:12:16 cmsio4 systemd[1]: xrootd-privileged@clustered.service failed.
Jan 14 23:12:16 cmsio4 systemd[1]: xrootd-privileged@clustered.service has no holdoff time, scheduling restart.

osschar · 2020-01-15T19:17:39Z

It says killed ... I suspect you have core generation off. I'm not sure how to do it with systemd / cgroups ... but I know Terrence does, he set it up for our XCache cluster.

bbockelm · 2020-01-15T20:09:48Z

@osschar - the "real time" signals in POSIX are simply just an extension to have more signals. They tend to have nothing to with anything "real time" and more to do with the POSIX 2001 def (see http://man7.org/linux/man-pages/man7/signal.7.html for more info...).

The reason I asked about behavior is that I started seeing a similar signal (RT 32, not RT 29 like Bockjoo) sent internally from Xrootd when aio_writes were being used.

osschar · 2020-01-15T20:54:59Z

Right, from XrdOss/XrdOssAio.cc:

#ifdef _POSIX_ASYNCHRONOUS_IO
#ifdef SIGRTMAX
const int OSS_AIO_READ_DONE  = SIGRTMAX-1;
const int OSS_AIO_WRITE_DONE = SIGRTMAX;
#else
#define OSS_AIO_READ_DONE  SIGUSR1
#define OSS_AIO_WRITE_DONE SIGUSR2
#endif
#endif

So I'd assume the signal handlers do not get setup correctly ... or somebody spawns a thread in some plugin in a way that does not give xrootd a chance to block them for that thread. IIRC, signals can get delivered to any thread that does not block them (so one usually sets up a special signal handling thread where a specific set of signals gets unblocked).

We need a core for sure :)

bbockelm · 2020-01-15T21:00:57Z

Oh - 100% there are libraries that Xrootd links against which might spawn threads.

Why would it start triggering now though? Is something unexpectedly enabling AIO?

xrootd-dev · 2020-01-15T21:03:04Z

The crash indicates that a signal was directed to the process (by all accounts this is SIGRTMIN+29 which calculate to SIG 61). The OSS AIO implementation uses SIG 64 and SIG 63). I have seen this kind of thing before and it's been due to some plugin using signals but forgetting to set a handler. Unfortunately, this always takes a while to figure out.

…

On Wed, 15 Jan 2020, Matev? Tadel wrote: Real-time-min + 29 is not even a thing for any thread implementation, as far as I could see. And I also suspect nobody here is running real-time kernel so this must be completely spurious. Do you have a core file? -- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: #1113 (comment) ######################################################################## Use REPLY-ALL to reply to list To unsubscribe from the XROOTD-DEV list, click the following link: https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1

abh3 · 2020-01-15T21:16:30Z

Like I said, there is likely a plugin that enables these signals when it should not. It's an easy mistake to make. Normally, xroot blocks these except for the thred that fields them.

…

On Wed, 15 Jan 2020, Brian P Bockelman wrote: Oh - 100% there are libraries that Xrootd links against which might spawn threads. Why would it start triggering now though? Is something unexpectedly enabling AIO? -- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: #1113 (comment)

osschar · 2020-01-15T21:16:34Z

Xrootd does not use SIG 64 and SIG 63, it uses SIGRTMAX and SIGRTMAX - 1 :)

On fedora 30 these are:

root [2] SIGRTMIN -> (int) 34
root [3] SIGRTMAX ->(int) 64
root [4] SIGRTMAX - SIGRTMIN -> (int) 30

so SIGRTMAX - 1 = SIGRTMIN + 29.

How about a plugin using signals and not blocking the RT signals (or unblocking them through ignorance) so they get delivered there from XRootd AIO code, even though they should not? Although I must admit I can not imagine the mess when signals are used for internal communication by two different sub-systems in a single multi-threaded program :)

abh3 · 2020-01-15T21:18:55Z

Indeex, I used the numbers from RH6. Yeah, min/max change based on the distro, sigh. Anyway, some thread is unblocking these signals.

…

On Wed, 15 Jan 2020, Matev? Tadel wrote: Xrootd does not use SIG 64 and SIG 63, it uses SIGRTMAX and SIGRTMAX - 1 :) On fedora 30 these are:``` root [2] SIGRTMIN -> (int) 34 root [3] SIGRTMAX ->(int) 64 root [4] SIGRTMAX - SIGRTMIN -> (int) 30 ``` so `SIGRTMAX - 1 = SIGRTMIN + 29`. How about a plugin using signals and not blocking the RT signals (or unblocking them through ignorance) so they get delivered there from XRootd AIO code, even though they should not? Although I must admit I can not imagine the mess when signals are used for internal communication by two different sub-systems in a single multi-threaded program :) -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: #1113 (comment)

bbockelm · 2020-01-17T12:02:12Z

@bockjoo - does the problem go away if you turn off asynchronous IO? That is, try adding:

xrootd.async off

Has this been ongoing for awhile or is it something new?

I suspect it's probably due to the fact that XrdLcmaps links against Globus to do the proxy verification and that is somewhere causing Globus to try and take over signal handlers. Nothing else I know of besides Globus would touch that from a plug-in point-of-view.

bockjoo · 2020-01-17T12:56:50Z

I will try that.
The symptom disappeared. It appears from time to time.

Yes, this is only a recent phenomenon.

So, your guess might be correct in saying that Globus might be taking over the signal handling because
there were a lot of users that were being authorized at the time.
Thanks for looking into this.

abh3 · 2020-01-17T20:49:14Z

Indeed, linking against Globus usually (though not always) screws up signal handling. That has been an ongoing problem for at least two decades now. They fix it then break it again and again.

…

On Fri, 17 Jan 2020, Brian P Bockelman wrote: @bockjoo - does the problem go away if you turn off asynchronous IO? That is, try adding: ``` xrootd.async off ``` Has this been ongoing for awhile or is it something new? I suspect it's probably due to the fact that `XrdLcmaps` links against Globus to do the proxy verification and that is somewhere causing Globus to try and take over signal handlers. Nothing else I know of besides Globus would touch that from a plug-in point-of-view. -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: #1113 (comment)

abh3 · 2020-01-22T23:27:29Z

So, was this the problem? That is, some plugin enabling itself for all signals?

bockjoo · 2020-01-22T23:55:36Z

I think so.

Without xrootd.async off, RTMIN+29 still shows up:
[root@cmsio4 ~]# grep async.off /etc/xrootd/xrootd-clustered.cfg
#xrootd.async off
[root@cmsio4 ~]# printf "$(cat /var/log/messages | sed 's#%#%%#g')\n" | grep RTMIN | tail -2
Jan 21 20:31:55 cmsio4 systemd[1]: xrootd-privileged@clustered.service: main process exited, code=killed, status=63/RTMIN+29
Jan 21 20:53:08 cmsio4 systemd[1]: xrootd-privileged@clustered.service: main process exited, code=killed, status=63/RTMIN+29

On the other hand, with xrootd.async off, RTMIN+29 does not show up anymore:
[root@cmsio7 ~]# grep async.off /etc/xrootd/xrootd-clustered.cfg
xrootd.async off
[root@cmsio7 ~]# printf "$(cat /var/log/messages | sed 's#%#%%#g')\n" | grep RTMIN | tail -2
Jan 20 19:09:48 cmsio7 systemd[1]: Received SIGRTMIN+21 from PID 296 (plymouthd).

abh3 · 2020-01-22T23:59:03Z

Yes, that's to be expected but you also turned off a performance feature in XRootD so it won't run as fast as it could. It would be nice to get a core file so we can see which plugin is the offending one (not that we could necessarily fix it). I'll leave this open for a while hoping tat you'll post a trace back from a core file. Eventually, I will close this issue.

bockjoo · 2020-01-23T00:49:13Z

I have configured something. I will let you know if any core file appears in /var/lib/systemd/coredump/

abh3 · 2020-02-18T14:07:02Z

I am closing this as there has been no activity for almost a month. It known that using signals in a plugin that conflict with XRootD's use of the same signals is an issue. The ony long-term solution is to not use signal-based async I/O. If this occurs in the future, the bypass is to turn async I/O off (not ideal).

abh3 closed this as completed Feb 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

xrootd dies with RTMIN+29 #1113

xrootd dies with RTMIN+29 #1113

bockjoo commented Jan 14, 2020

bockjoo commented Jan 14, 2020

bbockelm commented Jan 15, 2020

bockjoo commented Jan 15, 2020

osschar commented Jan 15, 2020

bockjoo commented Jan 15, 2020

osschar commented Jan 15, 2020

bbockelm commented Jan 15, 2020

osschar commented Jan 15, 2020

bbockelm commented Jan 15, 2020

xrootd-dev commented Jan 15, 2020 via email

abh3 commented Jan 15, 2020 via email

osschar commented Jan 15, 2020 •

edited

abh3 commented Jan 15, 2020 via email

bbockelm commented Jan 17, 2020

bockjoo commented Jan 17, 2020

abh3 commented Jan 17, 2020 via email

abh3 commented Jan 22, 2020

bockjoo commented Jan 22, 2020

abh3 commented Jan 22, 2020

bockjoo commented Jan 23, 2020

abh3 commented Feb 18, 2020

xrootd dies with RTMIN+29 #1113

xrootd dies with RTMIN+29 #1113

Comments

bockjoo commented Jan 14, 2020

bockjoo commented Jan 14, 2020

bbockelm commented Jan 15, 2020

bockjoo commented Jan 15, 2020

osschar commented Jan 15, 2020

bockjoo commented Jan 15, 2020

osschar commented Jan 15, 2020

bbockelm commented Jan 15, 2020

osschar commented Jan 15, 2020

bbockelm commented Jan 15, 2020

xrootd-dev commented Jan 15, 2020 via email

abh3 commented Jan 15, 2020 via email

osschar commented Jan 15, 2020 • edited

abh3 commented Jan 15, 2020 via email

bbockelm commented Jan 17, 2020

bockjoo commented Jan 17, 2020

abh3 commented Jan 17, 2020 via email

abh3 commented Jan 22, 2020

bockjoo commented Jan 22, 2020

abh3 commented Jan 22, 2020

bockjoo commented Jan 23, 2020

abh3 commented Feb 18, 2020

osschar commented Jan 15, 2020 •

edited