xrootd dies with RTMIN+29 #1113
This is the systemctl status of the privileged xrootd echo Checking xrootd-privileged status |
This is not familiar to me. Is there anything "obvious" about the behavior of the server right before the crash? Is there a particular sequence of events or user who triggers the problem? |
There is nothing much. |
Real-time-min + 29 is not even a thing for any thread implementation, as far as I could see. And I also suspect nobody here is running a real-time kernel, so this must be completely spurious. Do you have a core file? |
I can not find any core file. It's not segfaulted according to the syslog: |
It says killed ... I suspect you have core generation off. I'm not sure how to do it with systemd / cgroups ... but I know Terrence does, he set it up for our XCache cluster. |
@osschar - the "real time" signals in POSIX are simply an extension to have more signals. They tend to have nothing to do with anything "real time" and more to do with the POSIX 2001 definition (see http://man7.org/linux/man-pages/man7/signal.7.html for more info...). The reason I asked about behavior is that I started seeing a similar signal (RT 32, not RT 29 like Bockjoo) sent internally from Xrootd when aio_writes were being used. |
Right, from XrdOss/XrdOssAio.cc:
So I'd assume the signal handlers do not get set up correctly ... or somebody spawns a thread in some plugin in a way that does not give xrootd a chance to block them for that thread. IIRC, signals can get delivered to any thread that does not block them (so one usually sets up a special signal handling thread where a specific set of signals gets unblocked). We need a core for sure :) |
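The pattern described above (block the signals everywhere, field them in one dedicated thread) can be sketched as follows. This is only an illustration in Python on Linux, not XRootD's actual code; the function name `field_signal_in_dedicated_thread` is hypothetical, and the sketch fields the signal with `sigwait()` rather than by unblocking it and installing a handler.

```python
import os
import signal
import threading

def field_signal_in_dedicated_thread(signo):
    """Block signo everywhere, then let one thread collect it with sigwait()."""
    # Block before creating any thread: each new thread inherits this mask.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signo})

    collected = []

    def fielder():
        # sigwait() takes a pending, blocked signal without running a handler.
        collected.append(int(signal.sigwait({signo})))

    worker = threading.Thread(target=fielder)
    worker.start()
    # Process-directed signal: it stays pending until the fielder takes it.
    os.kill(os.getpid(), signo)
    worker.join(timeout=5)
    return collected

if __name__ == "__main__":
    print(field_signal_in_dedicated_thread(signal.SIGRTMAX - 1))
```

As long as every thread keeps the signal blocked, it is never delivered with its default action (which terminates the process for real-time signals); it simply waits for the fielding thread.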
Oh - 100% there are libraries that Xrootd links against which might spawn threads. Why would it start triggering now though? Is something unexpectedly enabling AIO? |
The crash indicates that a signal was directed to the process (by all
accounts this is SIGRTMIN+29, which calculates to SIG 61). The OSS AIO
implementation uses SIG 64 and SIG 63. I have seen this kind of thing
before and it's been due to some plugin using signals but forgetting to
set a handler. Unfortunately, this always takes a while to figure out.
|
Like I said, there is likely a plugin that enables these signals when it
should not. It's an easy mistake to make. Normally, xroot blocks these
except for the thread that fields them.
|
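To illustrate how a thread can end up with these signals enabled "when it should not": a new thread inherits the signal mask of the thread that created it, and any thread can later unblock a signal for itself. A minimal sketch in Python on Linux, with hypothetical names (`observe_thread_masks`, `well_behaved`, `careless`); no signal is actually sent, the code only inspects the masks.

```python
import signal
import threading

def observe_thread_masks(signo):
    """Report whether signo is still blocked inside two kinds of threads."""
    # The creating thread blocks the signal first.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signo})
    seen = {}

    def well_behaved():
        # SIG_BLOCK with an empty set changes nothing and returns the mask.
        mask = signal.pthread_sigmask(signal.SIG_BLOCK, set())
        seen["well_behaved"] = signo in mask

    def careless():
        # Stand-in for a plugin thread that re-enables the signal.
        signal.pthread_sigmask(signal.SIG_UNBLOCK, {signo})
        mask = signal.pthread_sigmask(signal.SIG_BLOCK, set())
        seen["careless"] = signo in mask

    for target in (well_behaved, careless):
        t = threading.Thread(target=target)
        t.start()
        t.join()
    return seen

if __name__ == "__main__":
    print(observe_thread_masks(signal.SIGRTMAX - 1))
```

A thread in the `careless` state is eligible to receive a process-directed signal, and if no handler is installed for it the default action applies.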
Xrootd does not use SIG 64 and SIG 63, it uses SIGRTMAX and SIGRTMAX - 1 :) On Fedora 30, SIGRTMIN is 34 and SIGRTMAX is 64, so `SIGRTMAX - 1 = SIGRTMIN + 29`.
How about a plugin using signals and not blocking the RT signals (or unblocking them through ignorance) so they get delivered there from XRootd AIO code, even though they should not? Although I must admit I can not imagine the mess when signals are used for internal communication by two different sub-systems in a single multi-threaded program :) |
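The arithmetic can be checked with a few lines of Python. The helper `rtmin_label` is hypothetical and only mimics the `RTMIN+n` notation seen in the systemctl output; the two constants are the Fedora 30 values quoted in this thread and may differ on other systems.

```python
def rtmin_label(signo, sigrtmin):
    """Express a real-time signal number as an offset from SIGRTMIN."""
    return f"RTMIN+{signo - sigrtmin}"

# Values reported in this thread for Fedora 30 (assumed, not queried here).
FEDORA30_SIGRTMIN = 34
FEDORA30_SIGRTMAX = 64

print(rtmin_label(FEDORA30_SIGRTMAX - 1, FEDORA30_SIGRTMIN))  # RTMIN+29
print(rtmin_label(FEDORA30_SIGRTMAX, FEDORA30_SIGRTMIN))      # RTMIN+30
```

So the signal that killed the process matches one of the two signals used by the AIO code.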
Indeed, I used the numbers from RH6. Yeah, min/max change based on the
distro, sigh. Anyway, some thread is unblocking these signals.
…On Wed, 15 Jan 2020, Matevž Tadel wrote:
Xrootd does not use SIG 64 and SIG 63, it uses SIGRTMAX and SIGRTMAX - 1 :)
On fedora 30 these are:
```
root [2] SIGRTMIN -> (int) 34
root [3] SIGRTMAX ->(int) 64
root [4] SIGRTMAX - SIGRTMIN -> (int) 30
```
so `SIGRTMAX - 1 = SIGRTMIN + 29`.
How about a plugin using signals and not blocking the RT signals (or unblocking them through ignorance) so they get delivered there from XRootd AIO code, even though they should not? Although I must admit I can not imagine the mess when signals are used for internal communication by two different sub-systems in a single multi-threaded program :)
|
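@bockjoo - does the problem go away if you turn off asynchronous IO? That is, try adding `xrootd.async off`.
Has this been ongoing for a while or is it something new? I suspect it's probably due to the fact that `XrdLcmaps` links against Globus to do the proxy verification and that is somewhere causing Globus to try and take over signal handlers. Nothing else I know of besides Globus would touch that from a plug-in point-of-view. |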
I will try that. Yes, this is only a recent phenomenon. So, your guess might be correct in saying that Globus might be taking over the signal handling because |
Indeed, linking against Globus usually (though not always) screws up
signal handling. That has been an ongoing problem for at least two decades
now. They fix it then break it again and again.
…On Fri, 17 Jan 2020, Brian P Bockelman wrote:
@bockjoo - does the problem go away if you turn off asynchronous IO? That is, try adding:
```
xrootd.async off
```
Has this been ongoing for awhile or is it something new?
I suspect it's probably due to the fact that `XrdLcmaps` links against Globus to do the proxy verification and that is somewhere causing Globus to try and take over signal handlers. Nothing else I know of besides Globus would touch that from a plug-in point-of-view.
|
So, was this the problem? That is, some plugin enabling itself for all signals? |
I think so. Without `xrootd.async off`, RTMIN+29 still shows up. With `xrootd.async off`, on the other hand, RTMIN+29 does not show up anymore. |
Yes, that's to be expected, but you also turned off a performance feature in XRootD, so it won't run as fast as it could. It would be nice to get a core file so we can see which plugin is the offending one (not that we could necessarily fix it). I'll leave this open for a while hoping that you'll post a traceback from a core file. Eventually, I will close this issue. |
I have configured something. I will let you know if any core file appears in /var/lib/systemd/coredump/ |
I am closing this as there has been no activity for almost a month. It is known that using signals in a plugin that conflict with XRootD's use of the same signals is an issue. The only long-term solution is to not use signal-based async I/O. If this occurs in the future, the bypass is to turn async I/O off (not ideal). |
Process: 3140 ExecStart=/usr/bin/xrootd -l /var/log/xrootd/xrootd.log -c /etc/xrootd/xrootd-%i.cfg -k fifo -s /var/run/xrootd/xrootd-%i.pid -n %i -R xrootd (code=killed, signal=RTMIN+29)
Main PID: 3140 (code=killed, signal=RTMIN+29)