Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xrootd dies with RTMIN+29 #1113

Closed
bockjoo opened this issue Jan 14, 2020 · 21 comments
Closed

xrootd dies with RTMIN+29 #1113

bockjoo opened this issue Jan 14, 2020 · 21 comments

Comments

@bockjoo
Copy link

bockjoo commented Jan 14, 2020

Process: 3140 ExecStart=/usr/bin/xrootd -l /var/log/xrootd/xrootd.log -c /etc/xrootd/xrootd-%i.cfg -k fifo -s /var/run/xrootd/xrootd-%i.pid -n %i -R xrootd (code=killed, signal=RTMIN+29)
Main PID: 3140 (code=killed, signal=RTMIN+29)

@bockjoo
Copy link
Author

bockjoo commented Jan 14, 2020

This is the systemctl status of the priviledged xrootd

echo Checking xrootd-privileged status
xrootd-privileged@clustered.service - XRootD xrootd deamon instance clustered
Loaded: loaded (/usr/lib/systemd/system/xrootd-privileged@.service; disabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Tue 2020-01-14 17:10:59 EST; 2s ago
Docs: man:xrootd(8)
https://urldefense.proofpoint.com/v2/url?u=http-3A__xrootd.org_docs.html&d=DwIDaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=KD6WLuIFdDP9pLkLtsx3jFJEuhhZthNRKW96sbeIehg&m=OjiGgxMCbMLg9KuyUWzTY2sUyZAlj384KtoRE0U9DYQ&s=QYf39kknWNvzgKBcezC3t32VfWe_BcHxdUJYNz5oguA&e=
Process: 21951 ExecStart=/usr/bin/xrootd -l /var/log/xrootd/xrootd.log -c /etc/xrootd/xrootd-%i.cfg -k fifo -s /var/run/xrootd/xrootd-%i.pid -n %i -R xrootd (code=killed, signal=RTMIN+29)
Main PID: 21951 (code=killed, signal=RTMIN+29)

@bbockelm
Copy link
Contributor

This is not familiar to me. Is there anything "obvious" about the behavior of the server right before the crash? Is there a particular sequence of events or user who triggers the problem?

@bockjoo
Copy link
Author

bockjoo commented Jan 15, 2020

There is nothing much.
Two users were accessing the xrootd a lot.
One of them was accessing a directory that does not exist using
ofs_opendir
I did not notice anything special with the other guy.
Also, it looked like the xrootd was restarting by itself at a high frequency, thus the message?

@osschar
Copy link
Contributor

osschar commented Jan 15, 2020

Real-time-min + 29 is not even a thing for any thread implementation, as far as I could see. And I also suspect nobody here is running real-time kernel so this must be completely spurious.

Do you have a core file?

@bockjoo
Copy link
Author

bockjoo commented Jan 15, 2020

I can not find any core file. It's not segfaulted according to the syslog:
Jan 14 23:11:46 cmsio4 systemd[1]: Stopped XRootD xrootd deamon instance clustered.
Jan 14 23:11:46 cmsio4 systemd[1]: Started XRootD xrootd deamon instance clustered.
Jan 14 23:11:52 cmsio4 systemd[1]: xrootd-privileged@clustered.service: main process exited, code=killed, status=63/RTMIN+29
Jan 14 23:11:52 cmsio4 systemd[1]: Unit xrootd-privileged@clustered.service entered failed state.
Jan 14 23:11:52 cmsio4 systemd[1]: xrootd-privileged@clustered.service failed.
Jan 14 23:11:52 cmsio4 systemd[1]: xrootd-privileged@clustered.service has no holdoff time, scheduling restart.
Jan 14 23:11:52 cmsio4 systemd[1]: Stopped XRootD xrootd deamon instance clustered.
Jan 14 23:11:52 cmsio4 systemd[1]: Started XRootD xrootd deamon instance clustered.
Jan 14 23:12:16 cmsio4 systemd[1]: xrootd-privileged@clustered.service: main process exited, code=killed, status=63/RTMIN+29
Jan 14 23:12:16 cmsio4 systemd[1]: Unit xrootd-privileged@clustered.service entered failed state.
Jan 14 23:12:16 cmsio4 systemd[1]: xrootd-privileged@clustered.service failed.
Jan 14 23:12:16 cmsio4 systemd[1]: xrootd-privileged@clustered.service has no holdoff time, scheduling restart.

@osschar
Copy link
Contributor

osschar commented Jan 15, 2020

It says killed ... I suspect you have core generation off. I'm not sure how to do it with systemd / cgroups ... but I know Terrence does, he set it up for our XCache cluster.

@bbockelm
Copy link
Contributor

@osschar - the "real time" signals in POSIX are simply just an extension to have more signals. They tend to have nothing to with anything "real time" and more to do with the POSIX 2001 def (see http://man7.org/linux/man-pages/man7/signal.7.html for more info...).

The reason I asked about behavior is that I started seeing a similar signal (RT 32, not RT 29 like Bockjoo) sent internally from Xrootd when aio_writes were being used.

@osschar
Copy link
Contributor

osschar commented Jan 15, 2020

Right, from XrdOss/XrdOssAio.cc:

#ifdef _POSIX_ASYNCHRONOUS_IO
#ifdef SIGRTMAX
const int OSS_AIO_READ_DONE  = SIGRTMAX-1;
const int OSS_AIO_WRITE_DONE = SIGRTMAX;
#else
#define OSS_AIO_READ_DONE  SIGUSR1
#define OSS_AIO_WRITE_DONE SIGUSR2
#endif
#endif

So I'd assume the signal handlers do not get setup correctly ... or somebody spawns a thread in some plugin in a way that does not give xrootd a chance to block them for that thread. IIRC, signals can get delivered to any thread that does not block them (so one usually sets up a special signal handling thread where a specific set of signals gets unblocked).

We need a core for sure :)

@bbockelm
Copy link
Contributor

Oh - 100% there are libraries that Xrootd links against which might spawn threads.

Why would it start triggering now though? Is something unexpectedly enabling AIO?

@xrootd-dev
Copy link

xrootd-dev commented Jan 15, 2020 via email

@abh3
Copy link
Member

abh3 commented Jan 15, 2020 via email

@osschar
Copy link
Contributor

osschar commented Jan 15, 2020

Xrootd does not use SIG 64 and SIG 63, it uses SIGRTMAX and SIGRTMAX - 1 :)

On fedora 30 these are:

root [2] SIGRTMIN -> (int) 34
root [3] SIGRTMAX ->(int) 64
root [4] SIGRTMAX - SIGRTMIN -> (int) 30

so SIGRTMAX - 1 = SIGRTMIN + 29.

How about a plugin using signals and not blocking the RT signals (or unblocking them through ignorance) so they get delivered there from XRootd AIO code, even though they should not? Although I must admit I can not imagine the mess when signals are used for internal communication by two different sub-systems in a single multi-threaded program :)

@abh3
Copy link
Member

abh3 commented Jan 15, 2020 via email

@bbockelm
Copy link
Contributor

@bockjoo - does the problem go away if you turn off asynchronous IO? That is, try adding:

xrootd.async off

Has this been ongoing for awhile or is it something new?

I suspect it's probably due to the fact that XrdLcmaps links against Globus to do the proxy verification and that is somewhere causing Globus to try and take over signal handlers. Nothing else I know of besides Globus would touch that from a plug-in point-of-view.

@bockjoo
Copy link
Author

bockjoo commented Jan 17, 2020

I will try that.
The symptom disappeared. It appears from time to time.

Yes, this is only a recent phenomenon.

So, your guess might be correct in saying that Globus might be taking over the signal handling because
there were a lot of users that were being authorized at the time.
Thanks for looking into this.

@abh3
Copy link
Member

abh3 commented Jan 17, 2020 via email

@abh3
Copy link
Member

abh3 commented Jan 22, 2020

So, was this the problem? That is, some plugin enabling itself for all signals?

@bockjoo
Copy link
Author

bockjoo commented Jan 22, 2020

I think so.

Without xrootd.async off, RTMIN+29 still shows up:
[root@cmsio4 ~]# grep async.off /etc/xrootd/xrootd-clustered.cfg
#xrootd.async off
[root@cmsio4 ~]# printf "$(cat /var/log/messages | sed 's#%#%%#g')\n" | grep RTMIN | tail -2
Jan 21 20:31:55 cmsio4 systemd[1]: xrootd-privileged@clustered.service: main process exited, code=killed, status=63/RTMIN+29
Jan 21 20:53:08 cmsio4 systemd[1]: xrootd-privileged@clustered.service: main process exited, code=killed, status=63/RTMIN+29

On the other hand, with xrootd.async off, RTMIN+29 does not show up anymore:
[root@cmsio7 ~]# grep async.off /etc/xrootd/xrootd-clustered.cfg
xrootd.async off
[root@cmsio7 ~]# printf "$(cat /var/log/messages | sed 's#%#%%#g')\n" | grep RTMIN | tail -2
Jan 20 19:09:48 cmsio7 systemd[1]: Received SIGRTMIN+21 from PID 296 (plymouthd).

@abh3
Copy link
Member

abh3 commented Jan 22, 2020

Yes, that's to be expected but you also turned off a performance feature in XRootD so it won't run as fast as it could. It would be nice to get a core file so we can see which plugin is the offending one (not that we could necessarily fix it). I'll leave this open for a while hoping tat you'll post a trace back from a core file. Eventually, I will close this issue.

@bockjoo
Copy link
Author

bockjoo commented Jan 23, 2020

I have configured something. I will let you know if any core file appears in /var/lib/systemd/coredump/

@abh3
Copy link
Member

abh3 commented Feb 18, 2020

I am closing this as there has been no activity for almost a month. It known that using signals in a plugin that conflict with XRootD's use of the same signals is an issue. The ony long-term solution is to not use signal-based async I/O. If this occurs in the future, the bypass is to turn async I/O off (not ideal).

@abh3 abh3 closed this as completed Feb 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants