Traffic stops when threads saturate #1799
Hi Andrea,
Yes, the key here is that you have 2K connections but also limit threads to
2K. That means you are likely getting a passive deadlock, where some
critical action cannot be done because there is no thread to do it. So,
try increasing the thread count to 4K and see what happens.
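For example, keeping the other values from the xrd.sched line quoted below
and raising only maxt (a sketch, not a tested configuration):
```
xrd.sched mint 16 maxt 4096 avlt 8 idle 60s
```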
Andy
…On Wed, 12 Oct 2022, andrearendina wrote:
Hi all,
at INFN-T1, we have been experiencing an issue with thread saturation for some months on the XrootD instances dedicated to the CMS experiment.
In particular, we have 4 XrootD servers, running XrootD 4.12.4-1.el7, used only for XrootD and equipped with 2x100Gbps Ethernet, which remote-mount a shared (GPFS) filesystem (they are not used as GPFS NSD servers).
At the peak load, each server is doing 4GB/s (40Gbps) of I/O (in and out) via network, serving ~2k connections.
On average, more than 400 files are transferred per minute (assuming a line with "read" in the log file corresponds to a transferred file).
As you can see from a snapshot of our monitoring system, the threads used by the XrootD daemons (collected by a sensu plugin running "cat /proc/<pid>/status | grep Threads") tend to periodically increase up to the limit (xrd.sched mint 16 maxt 2048 avlt 8 idle 60s):
![sat_threads_NU_in](https://user-images.githubusercontent.com/73068549/195327020-fe9ce8a6-8e86-4bfc-8006-6ae56fd87491.png)
When the thread saturation is reached, the throughput goes to 0 and the service stops logging to its log file. The last lines of the log file show a long sequence of the following errors:
```
221011 11:43:53 6104 XrdLink: Unable to send to ***@***.***; broken pipe
221011 11:43:53 6104 XrootdXeq: ***@***.*** disc 0:35:48 (send failure)
221011 11:44:03 6104 XrdLink: Unable to send to ***@***.***; broken pipe
221011 11:44:03 6104 XrootdXeq: ***@***.*** disc 0:28:21 (send failure)
221011 11:44:26 6104 XrdLink: Unable to send to ***@***.***; broken pipe
221011 11:44:26 6104 XrootdXeq: ***@***.*** disc 0:25:06 (send failure)
221011 11:44:28 6104 XrdLink: Unable to send to ***@***.***; broken pipe
221011 11:44:28 6104 XrootdXeq: ***@***.*** disc 0:41:29 (send failure)
221011 11:44:29 6104 XrdLink: Unable to send to cms001.5134:3786@[::159.93.224.95]; broken pipe
221011 11:44:29 6104 XrootdXeq: cms001.5134:3786@[::159.93.224.95] disc 0:22:18 (send failure)
```
By restarting the XrootD service, the thread count resets and the servers start serving traffic again, but once the threads saturate we are back in the same situation. You can also find the configuration file attached.
[xrootd-cms.cfg.txt](https://github.com/xrootd/xrootd/files/9764257/xrootd-cms.cfg.txt)
Do you have any suggestions?
Thank you very much,
Andrea
The big question is where all this traffic is coming from and when this started happening. It's hard to imagine there are 4000 simultaneously active users. However, there is one other possibility: something is hanging in one of the plugins, causing clients to abandon the session and reconnect. We've seen these kinds of problems before. What are you actually running? That is, what's in your config file and what kind of file system?
Hi, the traffic is simply from CMS jobs; we don't think this traffic is weird. What looks weird is threads hanging. Is there a way we could monitor hanging threads? Such threads do not make traffic. As already said, we are running 4 XrootD servers dedicated to CMS, running XrootD 4.12.4-1.el7, used only for XrootD and equipped with 2x100Gbps Ethernet, which remote-mount a shared (GPFS) filesystem (they are not used as GPFS NSD servers). Thanks for your help,
The only thing I see in the config file is that you are exporting GPFS. Is it possible that GPFS gets overloaded at some point due to activity other than xroot? For instance, we do know that GPFS can become very slow when backups happen or some other job (other than xroot) is writing a lot of small files. The log appears to imply this is what is happening.
We can do a deeper analysis by looking at what each thread is doing. This can be done online (by attaching gdb to the running process) or offline (by first generating a core file with gcore). In either case, please make sure you have installed the debug xroot RPM package so that we get statement numbers in the following steps.
If you have generated a gcore, then use gdb to look at it; if you've done the online step, then you are already in gdb. Either way, do the following:
(gdb) set logging on
This will generate a full back trace for each thread and write it to gdb.txt. This will take some time if there are many threads, and the file could be very large. Once this is done, please send a link to where we can download the resulting file and then we can see what is going on.
One more question: is this GPFS instance backed by tape, with GPFS able to recall files from tape?
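The gdb step above appears truncated ("set logging on" by itself does not emit backtraces). A plausible reconstruction of the full sequence being described, with `<pid>` and the binary path as placeholder assumptions (gdb's "set logging on" writes to gdb.txt by default):
```
# offline: capture a core from the running process, then open it in gdb
gcore -o /tmp/xrootd.core <pid>
gdb /usr/bin/xrootd /tmp/xrootd.core.<pid>
# online alternative: attach gdb directly to the live process
gdb -p <pid>
# either way, dump one backtrace per thread; the output goes to gdb.txt
(gdb) set logging on
(gdb) thread apply all bt
(gdb) set logging off
(gdb) quit
```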
Thanks for this reply. Following your suggestions, we installed xrootd-debuginfo, set the maximum thread count to 2048 again, and are waiting for hanging threads to happen :-) Thank you very much for your help,
Dear Andy, we tried to follow your instructions for troubleshooting the xrootd thread saturation. When the problem occurred, we started gdb on the corresponding server and attached to the running XrootD process, but we got the error "try debuginfo-install xrootd-server-4.12.4-1.el7". Thank you again for your help, Andrea
So, the trace shows that 2048 threads are waiting for link serialization, either for open() or close() (about 50% each). I don't have a reason for that, but it usually means that there is an outstanding operation somewhere in the file system, and until that completes neither an open nor a close can proceed. Could you post the log file that corresponds to this trace record?
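As a quick way to verify that breakdown, the stuck frames can be tallied directly from the gdb.txt dump. A hypothetical check, assuming frame names like those in the trace excerpt quoted later in this thread (do_Open is assumed by analogy with the do_Close frames actually shown):
```
grep -c 'XrdLink::Serialize' gdb.txt   # threads blocked waiting for serialization
grep -c 'do_Open'  gdb.txt             # ...arriving from open()
grep -c 'do_Close' gdb.txt             # ...arriving from close()
```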
Hi Andy, I attach the xrootd.log recorded from the first observed "broken pipe" error until the thread saturation. Moreover, we do not see any outstanding operations on the file system ("gpfs waiters"), neither on this server nor on the entire GPFS cluster. Thank you again!
Hi, it is pretty hard to believe that some outstanding activity is going on this often somewhere in the CMS filesystem while completely hiding from our monitoring: we see no blocking, no pending operations, nothing weird on the GPFS side. In case it is useful, we attach a file containing the output of "netstat -an" during thread saturation on one of the servers. Let us know how we can debug further. Thanks for your help,
The stack trace simply shows everyone waiting for things to complete. It does not tell me what is waiting for what, nor how many different clients are in this state. I must admit that release 4.12.4 is very old (2020-09-03). Any chance you can upgrade to a supported release? Otherwise, the only thing left is running the server with additional tracing enabled and, after that, taking a core file. Frankly, something weird is going on with that machine. When did this start happening? What changed roughly at the time this started happening?
It is not something going on with a specific machine: it affects every CMS XrootD instance at INFN Tier1.
If you don't use HTTP third party copy (TPC), then 5.5.1 is the latest release and looks stable. Otherwise, we will have 5.5.2 available late this month.
As for additional directives you may wish to add: xrootd.trace request response stall. Since this will generate relatively large log files, you may wish to first try turning async I/O off. In the 4.12 series we did not track lost I/O operations, so should one happen, a thread-stall condition could arise. So, if you haven't disabled it, please add: xrootd.async off. If that solves the problem, then we are done. Otherwise, also add the trace directive so we can see what causes this issue.
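In config-file form, the suggested additions would look like the following sketch (try the async directive first; add the trace line only if the problem persists):
```
xrootd.async off
xrootd.trace request response stall
```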
Hi Andy, one year ago (19/11/2021) Elvin Sindrilaru from CERN, who was helping us debug precisely this issue, wrote to us: "After having a chat with Michal and Andy Hanushevsky (xrootd server side developer), we realized there is one easy configuration change that you could do to avoid having too many small read requests done against the back-end. Therefore, could you please set the following in your configuration and restart the daemon: xrootd.async nosf". At that time we had xrootd.async off; following that suggestion, we changed xrootd.async off to xrootd.async nosf, and saw great improvements. Thanks,
Ah, no, for GPFS you really need xrootd.async nosf, so in this case specify: xrootd.async off nosf. OK?
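That is, assuming the usual reading of these options (off disables async I/O, nosf disables the use of sendfile), both go on a single line:
```
xrootd.async off nosf
```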
Ok, we will set xrootd.async off nosf.
In case it is not solved, and thread saturation happens again, then we also add the trace directive (xrootd.trace request response stall).
Am I correct?
Quite correct :-)
Since I have not seen additional comments, is it the case that the problem went away when you turned off async I/O?
Hi Andy,
we did turn off async I/O in two of the 4 CMS servers 20 days ago. Since then, we have never observed thread saturation, neither in those 2 servers nor in the 2 with async I/O on. So unfortunately we cannot comment further at the moment.
Lucia
Hi Lucia,
I think that was your solution. So, two questions: a) do any of the plugins
you use also use Globus, and b) what is the version of your Lustre?
If the answer to (a) is yes, then this is indeed your solution, as it is
known that Globus is incompatible with async I/O. Also, depending on the
version of the Lustre being used, it may also be incompatible with async
I/O.
Andy
…On Thu, 1 Dec 2022, lmorganti wrote:
Hi Andy,
we did turn off async i/o in two of the 4 CMS servers 20 days ago.
Since then, we never observed thread saturation, neither in those 2 servers nor in the 2 with async i/o on.
So unfortunately we cannot comment further at the moment.
Lucia
Hi Andy,
unfortunately it was not a solution, as I was trying to say: currently we don't observe thread saturation in the servers where async I/O is on, and we don't observe it in the servers where async I/O is off. So we cannot conclude anything. We do not use Lustre, we use GPFS.
Cheers,
lucia
Hi Lucia,
Ah, misunderstood. The Globus question remains, and the second part should
have said GPFS, but technically it is still a valid question.
Andy
…On Thu, 1 Dec 2022, lmorganti wrote:
Hi Andy,
unfortunately it was not a solution, as I was trying to say: currently we don't observe thread saturation in those servers where async I/O is on and we don't observe it in those servers where async I/O is off.
So we cannot conclude anything.
We do not use Lustre, we use GPFS.
Cheers,
lucia
Hi Andy,
GPFS 5.0.5-9.
I cannot find Globus:
# rpm -qa | grep -i globus
[root@xs-001 ~]#
but please tell me how to look for it.
Thanks,
lucia
It's probably easier if you just provide your config file so I can see
what plugins are being used. Likely you don't want to post the config in a
public place, so just email it directly to me or email a link to it.
Andy
…On Thu, 1 Dec 2022, lmorganti wrote:
Hi Andy,
GPFS 5.0.5-9
I cannot find Globus
# rpm -qa | grep -i globus
***@***.*** ~]#
but please tell me how to look for it.
Thanks,
lucia
Hi Andy,
the xrootd config file is attached in the first message of this thread.
lucia
Hi Lucia,
Right, I should have looked there first. One more thing: is GPFS being
accessed via the native client or via an NFS mount?
Andy
…On Thu, 1 Dec 2022, lmorganti wrote:
Hi Andy,
the xrootd config file is attached in the first message of this thread.
lucia
Hi Andy,
Thanks. There are several possibilities, none of which looks impossible, but none that I could point to as the smoking gun. I see what is happening but can't explain why.
I do know that there has been significant work on trying to keep GPFS from slowing down when multiple threads hit it at once (which happens in this particular case). You can see all fixes to GPFS since your release here: https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars_505x.html#APARs You may want to consider upgrading GPFS.
Also, there have been numerous improvements in how async I/O is handled by the server since 4.12.4. As this release is no longer supported, I would strongly urge you to upgrade to the latest release, 5.5.1.
In any case, the strong implication is that turning off async I/O on several servers keeps GPFS from getting overloaded, and that avoids triggering the deficiencies in xrootd that wind up effectively deadlocking it (client recovery then quickly uses up all available threads).
So, we can wait, and I suspect we won't see a problem, but let's not make that call until a couple more weeks go by. It could simply be that the load has dropped off due to the holidays.
@abh3 - it looks like there's a potential deadlock whenever there's an I/O operation queued in the Scheduler that holds a reference count to the XrdLink object. When there are no remaining idle threads, the queued jobs that hold a reference count will keep the reference count non-zero indefinitely. However, some of the running jobs cannot complete until the ref count drops to zero. There are ~1500 stuck threads in @lmorganti's stack trace with the following pattern (the trace itself is quoted in Andy's reply below):
Assuming there's a queued I/O request corresponding to those links, those 1500 threads will never finish. The queued job can't run until something finishes; the running job can't finish until the queued job runs. Classic deadlock. Now, in what circumstances do queued jobs hold a reference to the link? I count two cases:
1. Async I/O (both normal reads and pgreads)
2. When non-default sockets are used for a read. I think this might occur if the client requests encrypted control and unencrypted data.
I'm not 100% sure, but perhaps (2) is rare (is it really rare in the case of caches as well?). That might give some hope that (1) can be avoided simply by turning off async I/O. Is it possible to automatically turn off async I/O when thread counts are limited?
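To make the cycle concrete, here is a hypothetical, self-contained C++ model of the pattern. The Link, Scheduler, Serialize, and Ref/UnRef names only loosely mirror XrdLink/XrdScheduler; this is a minimal sketch of the failure mode, not the actual xrootd code:
```
#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// Simplified stand-in for XrdLink: queued async I/O jobs hold a reference,
// and close() must wait (Serialize) until that reference count reaches zero.
struct Link {
    std::mutex m;
    std::condition_variable cv;
    int refCount = 0;

    void Ref()   { std::lock_guard<std::mutex> g(m); ++refCount; }
    void UnRef() {
        std::lock_guard<std::mutex> g(m);
        if (--refCount == 0) cv.notify_all();
    }
    void Serialize() {                         // what do_Close() waits on
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return refCount == 0; });  // pins a worker thread
    }
};

// Simplified stand-in for XrdScheduler: a fixed-size pool, like xrd.sched maxt.
struct Scheduler {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::function<void()>> jobs;

    explicit Scheduler(int maxt) {
        for (int i = 0; i < maxt; ++i)
            std::thread([this] {
                for (;;) {
                    std::function<void()> job;
                    {
                        std::unique_lock<std::mutex> lk(m);
                        cv.wait(lk, [this] { return !jobs.empty(); });
                        job = std::move(jobs.front());
                        jobs.pop();
                    }
                    job();                     // a blocked job occupies this thread
                }
            }).detach();
    }
    void Schedule(std::function<void()> j) {
        { std::lock_guard<std::mutex> g(m); jobs.push(std::move(j)); }
        cv.notify_one();
    }
};

int main() {
    Scheduler sched(2);                        // tiny "maxt" to force the issue
    Link link;

    link.Ref();                                // a queued async read holds a ref
    // Two close requests occupy both workers, each blocked in Serialize()...
    sched.Schedule([&] { link.Serialize(); std::cout << "close 1 done\n"; });
    sched.Schedule([&] { link.Serialize(); std::cout << "close 2 done\n"; });
    // ...so the completion that would drop the ref can never be dispatched:
    sched.Schedule([&] { link.UnRef(); });     // stuck in the queue forever

    std::this_thread::sleep_for(std::chrono::seconds(2));
    std::cout << "still deadlocked: no close ever completed\n";
    return 0;                                  // demo exits; workers stay blocked
}
```
With all maxt worker threads blocked in Serialize(), the one queued job that would release the reference can never be dispatched, which matches the open/close pile-up visible in the stack trace.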
Hi Brian,
That certainly is possible, but async I/O is supposed to automatically
stop when it reaches a level that should be below the available threads
(though that's not the way it's computed). We can certainly add that to
the formulae. Today, you can manually tune that.
That said, the whole async architecture was redone to avoid such problems
in R5. Even then, we automatically refuse to use async I/O for local
devices as it makes no sense to do so until io_uring becomes widely
available. Those are strong reasons to recommend an upgrade to the latest
release.
Finally, we would have suggested something more than 8K max threads to
avoid such a problem, but the implication was that the deployed hardware
couldn't handle such a load. So, if the trigger is async I/O, then it's
completely correct to turn it off, as it's not helping anyway.
Andy
…On Thu, 1 Dec 2022, Brian P Bockelman wrote:
@abh3 - it looks like there's a potential deadlock whenever there's an I/O operation queued in the Scheduler that holds a reference count to the XrdLink object.
When there are no remaining idle threads, then the queued jobs that hold a reference count will keep the reference count non-zero indefinitely.
However, some of the running jobs cannot complete until the ref count drops to zero. There are ~1500 stuck threads in @lmorganti's stack trace of the following pattern:
```
#4 0x00007fae9f87c1fe in Wait (this=<optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:419
#5 XrdLink::Serialize (this=0x7fac6401b978) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdLink.cc:1086
#6 0x00007fae9fb002a8 in XrdXrootdProtocol::do_Close (this=0x7fadbc0402e0) at /usr/src/debug/xrootd/xrootd/src/XrdXrootd/XrdXrootdXeq.cc:532
#7 0x00007fae9f87ce49 in XrdLink::DoIt (this=0x7fac6401b978) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdLink.cc:441
#8 0x00007fae9f8801df in XrdScheduler::Run (this=0x610e58 <XrdMain::Config+440>) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:357
#9 0x00007fae9f880329 in XrdStartWorking (carg=<optimized out>) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:87
#10 0x00007fae9f845be7 in XrdSysThread_Xeq (myargs=0x7fad100449f0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.cc:86
```
Assuming there's a queued I/O request corresponding to those links, those 1500 threads will never finish.
Queued job can't run until something finishes. Running job can't finish until queued job runs. Classic deadlock.
Now, in what circumstances do queued jobs hold a reference to the link? I count two cases:
1. Async I/O (both normal reads and pgreads)
2. When non-default sockets are used for a read. I think this might occur if the client requests encrypted control and unencrypted data.
I'm not 100% sure but perhaps (2) is rare (is it really rare in the case of caches as well?)? That might give some hope that (1) can be avoided simply by turning off async IO.
Is it possible to automatically turn off async IO when thread counts are limited?
Thanks Andy and thanks Brian for looking at the stack trace!
It looks like the issue has been solved using the suggested config settings, so I am closing this. Please reopen if this is incorrect.
Hi Andy,
the issue is not solved. We upgraded to the xrootd version you suggested, and we left two servers with async I/O off and two with async I/O on. We have been monitoring all 4 of them over the last month, but traffic has never been high and the threads have never saturated, independent of configuration. Sorry about that.
OK, should it appear that the problem occurs on the servers with async I/O
off, please reopen the ticket.
…On Mon, 30 Jan 2023, lumorganti wrote:
Hi Andy,
issue is not solved.
We upgraded to the xrootd version you suggested, and we left two servers with async I/O off and two with async I/O on.
We have been monitoring the 4 of them over the last month but traffic has never been high and threads never saturated, independent of configuration. Sorry about that.