
Traffic stops when threads saturate #1799

Closed
andrearendina opened this issue Oct 12, 2022 · 34 comments

@andrearendina

andrearendina commented Oct 12, 2022

Hi all,

at INFN-T1, we have for some months been experiencing an issue with the XrootD instances dedicated to the CMS experiment, concerning thread saturation.
In particular, we have 4 XrootD servers, running XrootD 4.12.4-1.el7, used only for XrootD and equipped with 2x100Gbps Ethernet, which remotely mount a shared (GPFS) filesystem (they are not used as GPFS NSD servers).
At the peak load, each server is doing 4GB/s (40Gbps) of I/O (in and out) via network, serving ~2k connections.
On average, more than 400 files are transferred per minute (assuming a line with "read" in the log file corresponds to a transferred file).

As you can see from a snapshot of our monitoring system, the threads used by the XrootD daemons (collected by a sensu plugin running "cat /proc/<pid>/status | grep Threads") tend to periodically increase up to the limit (xrd.sched mint 16 maxt 2048 avlt 8 idle 60s):

[monitoring snapshot: xrootd thread counts rising to the 2048 limit]
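
(For reference, a concrete version of that per-daemon check, assuming the daemons run as processes named "xrootd", would be something like:

for pid in $(pgrep -x xrootd); do grep Threads /proc/$pid/status; done

which prints the current thread count of each xrootd process.)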

When thread saturation is reached, the throughput goes to 0 and the service stops logging to its log file. The last lines of the log files show a long sequence of the following errors:

221011 11:43:53 6104 XrdLink: Unable to send to job0022.9702:2382@wm26.grid.hep.ph.ic.ac.uk; broken pipe
221011 11:43:53 6104 XrootdXeq: job0022.9702:2382@wm26.grid.hep.ph.ic.ac.uk disc 0:35:48 (send failure)
221011 11:44:03 6104 XrdLink: Unable to send to cmspl1.7320:512@nid001499.alps.cscs.ch; broken pipe
221011 11:44:03 6104 XrootdXeq: cmspl1.7320:512@nid001499.alps.cscs.ch disc 0:28:21 (send failure)
221011 11:44:26 6104 XrdLink: Unable to send to pilcms00.8780:3034@t2-fw-02-w.lnl.infn.it; broken pipe
221011 11:44:26 6104 XrootdXeq: pilcms00.8780:3034@t2-fw-02-w.lnl.infn.it disc 0:25:06 (send failure)
221011 11:44:28 6104 XrdLink: Unable to send to job0001.12124:3856@wh00.grid.hep.ph.ic.ac.uk; broken pipe
221011 11:44:28 6104 XrootdXeq: job0001.12124:3856@wh00.grid.hep.ph.ic.ac.uk disc 0:41:29 (send failure)
221011 11:44:29 6104 XrdLink: Unable to send to cms001.5134:3786@[::159.93.224.95]; broken pipe
221011 11:44:29 6104 XrootdXeq: cms001.5134:3786@[::159.93.224.95] disc 0:22:18 (send failure)

By restarting the XrootD service, the threads reset and the servers start serving traffic again, but once the threads saturate we are back in the same situation. You can also find the configuration file attached.
xrootd-cms.cfg.txt

Do you have any suggestions?

Thank you very much,

Andrea

@xrootd-dev

xrootd-dev commented Oct 12, 2022 via email

@lmorganti

Hi Andy,
we did follow your suggestion and doubled the thread count:
xrd.sched mint 16 maxt 4096 avlt 8 idle 60s
Unfortunately we have the same problem, this time saturating 4096 threads, as you can see here:
[monitoring snapshot: thread counts saturating at 4096]
Again, the throughput goes to 0 and the service stops logging to its log file. By restarting the XrootD service, the threads reset and the servers start serving traffic again.
Do you have any suggestions?
Is it expected to get a passive deadlock, where some critical action cannot be performed because there is no thread left to do it?
Thanks,
lucia

@abh3
Member

abh3 commented Nov 4, 2022

The big question is where is all this traffic coming from and when did this start happening? It's hard to imagine there are 4000 simultaneously active users. However, there is one other possibility and that is something is hanging in one of the plugins causing clients to abandon the session and reconnect. We've seen these kinds of problems before. What are you actually running? That is, what's in your config file and what kind of file system?

@lmorganti

Hi, the traffic is simply from CMS jobs; we don't think this traffic is weird. What looks weird is the threads hanging. Is there a way we could monitor hanging threads? Such threads do not move any traffic.

As already said, we are running 4 XrootD servers dedicated to CMS, running XrootD 4.12.4-1.el7, used only for XrootD and equipped with 2x100Gbps Ethernet, which remotely mount a shared (GPFS) filesystem (they are not used as GPFS NSD servers).
Our configuration file is attached to the original message from Andrea, please see above (now with xrd.sched mint 16 maxt 4096 avlt 8 idle 60s, following your suggestion).

Thanks for your help,
lucia

abh3 self-assigned this on Nov 7, 2022
@abh3
Member

abh3 commented Nov 7, 2022

The only thing I see in the config file is that you are exporting gpfs. Is it possible that gpfs gets overloaded at some point due to activity other than xroot? For instance, we do know that gpfs can become very slow when backups happen or some other job (other than xroot) is writing a lot of small files. The log appears to imply this is what is happening. We can do a deeper analysis by looking at what each thread is doing. This can be done online or offline.
offline: Generate a core file of the stalled xroot process using the gcore command.
online: You will have to attach gdb to the stalled process.

In either case, please make sure you have installed the debug xroot RPM package so that we get statement numbers in the following steps:

If you have generated a core with gcore, then use gdb to look at it; if you've done the online step, then you are already in gdb. Either way, do the following:

(gdb) set logging on
(gdb) set logging redirect on
(gdb) thread apply all bt

This will generate a full back trace for each thread and write it to gdb.txt. This will take some time if there are many threads and the file could be very large.
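
(If it is easier to script, the same capture can also be done non-interactively; a sketch, with <pid> standing in for the PID of the stalled xrootd process:

gcore -o /tmp/xrootd-stalled <pid>
gdb -p <pid> -batch -ex "thread apply all bt" > gdb.txt

The first line writes a core file for offline analysis; the second attaches, dumps all backtraces to gdb.txt, and detaches.)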

Once this is done, please send a link to where we can download the resulting file and then we can see what is going on.

One more question: is this gpfs instance backed by tape, and does gpfs have the ability to recall files from tape?

@lmorganti

Thanks for this reply.
As far as gpfs is concerned, these 4 hosts are not doing anything apart from accessing the filesystem for the XrootD service. We do monitor the gpfs filesystem, and there was no overload that could block or slow down access to the filesystem when the XrootD threads were hanging.
gpfs has the ability to recall files from tape, but not via xrootd; the XrootD instances for CMS have no plugin to interact with tape.

Following your suggestions, we installed xrootd-debuginfo, set the maximum thread count back to 2048, and are now waiting for the hanging threads to happen again :-)
Then we'll use gdb as you said and send you a link.

Thank you very much for your help,
lucia

@andrearendina
Author

Dear Andy,

we tried to follow your instructions concerning the troubleshooting of xrootd thread saturation.

When the problem occurred, we started gdb on the corresponding server and attached it to the running XrootD process, but we got the error "try debuginfo-install xrootd-server-4.12.4-1.el7".
We were not able to find the xrootd-server-debuginfo-4.12.4-1.el7 package anywhere, but we noticed that the version of xrootd-debuginfo installed on the server was 5.3.1-1. Thus we removed xrootd-debuginfo-5.3.1-1.el7 and installed the 4.12.4-1.el7 version. After that, we encountered a new error message in gdb related to the vomsxrd-debuginfo-0.3.0-1.el7.cern package.
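(For anyone hitting the same mismatch, a quick sanity check before attaching gdb is to confirm that the installed debuginfo version matches the running server, e.g.:

rpm -qa | grep -i xrootd

and compare the xrootd-server and xrootd-debuginfo versions in the output.)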
Anyway, we were able to obtain a readable backtrace log for xrootd threads.
You can find it attached hereafter.
gdb.txt

Thank you again for your help,
let us know if any other information is needed.

Andrea

@abh3
Member

abh3 commented Nov 9, 2022

So, the trace shows that 2048 threads are waiting for link serialization, either for open() or for close() (about 50% each). I don't have a reason for that, but it usually means that there is an outstanding operation somewhere in the file system and, until that completes, neither an open nor a close can proceed. Could you post the log file that corresponds to this trace record?

@andrearendina
Author

Hi Andy,

I attach the xrootd.log recorded from the first observed "broken pipe" error until the thread saturation.
xrootd.log
I hope that helps.

Moreover, we do not see any outstanding operations on the file system ("gpfs waiters"), neither on this server nor on the entire gpfs cluster.

Thank you again!

@lmorganti

Hi,
did you manage to get some information from the xrootd.log?
On our side, we are experiencing thread saturation with no traffic more than once per day, and every time a restart of the service on each of the 4 servers is needed to resume traffic.

It is pretty hard to believe that some outstanding activity is going on this often somewhere in the CMS filesystem while completely hiding from our monitoring: we see no blocking, no pending operations, nothing weird gpfs-side.

In case it is useful, we attach a file containing the output of "netstat -an" during thread saturation on one of the servers.

Let us know how we can debug further.

Thanks for your help,
lucia
netstat_an.txt

@abh3
Member

abh3 commented Nov 11, 2022

The stack trace simply shows everyone waiting for things to complete. It does not tell me what is waiting for what, nor how many different clients are in this state. I must admit that release 4.12.4 is very old (2020-09-03). Any chance you can upgrade to a supported release? Otherwise, the only thing left is to run the server with additional tracing enabled and, after that, take a core file. Frankly, something weird is going on with that machine. When did this start happening? What changed roughly at the time this started happening?

@lmorganti

It is not something going on with a specific machine: it's every CMS XrootD instance at the INFN Tier1.
It's not something recent: we have been restarting the service from time to time, following thread saturation, for one year.
We started debugging this (I mean thread saturation, no traffic and so on) together with colleagues from CERN one year ago.
They suggested we install the 4.12.4 release, so we upgraded to that version.
Now I've asked CMS whether it's fine to upgrade further; which version would you suggest?
While waiting for a reply from CMS, we would be very happy to run with additional tracing enabled and produce a core file. If you provide instructions, we are happy to try.
Thank you very much for your help,
lucia

@abh3
Member

abh3 commented Nov 11, 2022

If you don't use HTTP third party copy (TPC) then 5.5.1 is the latest release and looks stable. Otherwise, we will have 5.5.2 available late this month. As for what additional directives you may wish to add:

xrootd.trace request response stall

Since this will generate relatively large log files, you may wish to first try turning async I/O off. In the 4.12 series we did not track lost I/O operations, so should one happen, a thread stall condition could occur. So, if you haven't disabled it already, please add:

xrootd.async off

If that solves the problem then we are done. Otherwise, also add the trace directive so we can see what causes this issue.
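
In config-file terms, the additions would look roughly like this (the trace line only needed if disabling async I/O alone does not help):

xrootd.async off
xrootd.trace request response stall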

@lmorganti

Hi Andy,
sorry I am a bit confused here.

One year ago (19/11/2021) Elvin Sindrilaru from CERN, who was helping us with debugging precisely this issue, wrote us:

"After having a chat with Michal and Andy Hanushevsky (xrootd server side developer), we realized there is one easy configuration change that you could do to avoid having too many small read requests done against the back-end. Therefore, could you please set the following in your configuration and restart the daemon:

xrootd.async nosf"

At that time, we had

xrootd.async off

Following that suggestion, we changed xrootd.async off to xrootd.async nosf, and saw great improvements.
Now, are you saying we should set it back to
xrootd.async off
?

Thanks,
lucia

@abh3
Member

abh3 commented Nov 11, 2022

Ah, no, for gpfs you really need xrootd.async nosf, so in this case specify:

xrootd.async off nosf

OK?

@lmorganti

Ok.
So let's see if I have understood:

  • xrootd.async off nosf
  • restart
  • see whether the problem is solved

In case it is not, and thread saturation happens again, then:

  • xrootd.trace request response stall
  • xrootd.async off nosf
  • restart
  • wait for the problem to happen, then send you a big log file

Am I correct?
Thanks,
lucia

@abh3
Member

abh3 commented Nov 11, 2022

Quite correct :-)

@abh3
Member

abh3 commented Dec 1, 2022

Since I have not seen additional comments, is it the case that the problem went away when you turned off async I/O?

@lmorganti

Hi Andy,
we did turn off async I/O on two of the 4 CMS servers 20 days ago.
Since then, we have never observed thread saturation, neither on those 2 servers nor on the 2 with async I/O still on.
So unfortunately we cannot comment further at the moment.
Lucia

@abh3
Member

abh3 commented Dec 1, 2022 via email

@lmorganti

Hi Andy,
unfortunately we cannot call it a solution; as I was trying to say, currently we don't observe thread saturation on the servers where async I/O is on, and we don't observe it on the servers where async I/O is off.
So we cannot conclude anything.
We do not use Lustre, we use GPFS.
Cheers,
lucia

@abh3
Member

abh3 commented Dec 1, 2022 via email

@lmorganti

Hi Andy,
GPFS 5.0.5-9
I cannot find Globus (the command below returns nothing):

rpm -qa | grep -i globus
[root@xs-001 ~]#

But please tell me how to look for it.
Thanks,
lucia

@xrootd-dev

xrootd-dev commented Dec 1, 2022 via email

@lmorganti

Hi Andy,
the xrootd config file is attached in the first message of this thread.
lucia

@xrootd-dev

xrootd-dev commented Dec 1, 2022 via email

@lmorganti

Hi Andy,
via native client.

@abh3
Member

abh3 commented Dec 1, 2022

Thanks. There are several possibilities; none of them looks impossible, but none that I could point to as the smoking gun. I see what is happening but can't explain why. I do know that there has been significant work on trying to keep GPFS from slowing down when multiple threads hit it at once (which happens in this particular case). You can see all the fixes to GPFS since your release here:

https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars_505x.html#APARs

You may want to consider upgrading GPFS. Also, there have been numerous improvements in how async I/O is handled by the server since 4.12.4. As this release is no longer supported, I would strongly urge you to upgrade to the latest release, 5.5.1.

In any case, it would strongly imply that turning off async on several servers keeps GPFS from getting overloaded, and so does not trigger the deficiencies in xrootd that wind up effectively deadlocking. That deadlock, because of client recovery, quickly uses up all available threads.

So, we can wait and I suspect we won't see a problem, but let's not make that call until a couple more weeks go by. It could simply be that the load has dropped off due to the holidays.

@bbockelm
Contributor

bbockelm commented Dec 1, 2022

@abh3 - it looks like there's a potential deadlock whenever there's an I/O operation queued in the Scheduler that holds a reference count to the XrdLink object.

When there are no remaining idle threads, then the queued jobs that hold a reference count will keep the reference count non-zero indefinitely.

However, some of the running jobs cannot complete until the ref count drops to zero. There are ~1500 stuck threads in @lmorganti's stack trace with the following pattern:

#4  0x00007fae9f87c1fe in Wait (this=<optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:419
#5  XrdLink::Serialize (this=0x7fac6401b978) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdLink.cc:1086
#6  0x00007fae9fb002a8 in XrdXrootdProtocol::do_Close (this=0x7fadbc0402e0) at /usr/src/debug/xrootd/xrootd/src/XrdXrootd/XrdXrootdXeq.cc:532
#7  0x00007fae9f87ce49 in XrdLink::DoIt (this=0x7fac6401b978) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdLink.cc:441
#8  0x00007fae9f8801df in XrdScheduler::Run (this=0x610e58 <XrdMain::Config+440>) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:357
#9  0x00007fae9f880329 in XrdStartWorking (carg=<optimized out>) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:87
#10 0x00007fae9f845be7 in XrdSysThread_Xeq (myargs=0x7fad100449f0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.cc:86

Assuming there's a queued I/O request corresponding to those links, those 1500 threads will never finish.

Queued job can't run until something finishes. Running job can't finish until queued job runs. Classic deadlock.

Now, in what circumstances do queued jobs hold a reference to the link? I count two cases:

  1. Async I/O (both normal reads and pgreads)
  2. When non-default sockets are used for a read. I think this might occur if the client requests encrypted control and unencrypted data.

I'm not 100% sure, but perhaps (2) is rare (is it really rare in the case of caches as well?). That might give some hope that (1) can be avoided simply by turning off async I/O.

Is it possible to automatically turn off async IO when thread counts are limited?

@xrootd-dev

xrootd-dev commented Dec 1, 2022 via email

@lmorganti

Thanks Andy and thanks Brian for having looked at the stack trace!
We'll upgrade xrootd as soon as possible (we are waiting for CMS feedback and then we'll do that: https://ggus.eu/index.php?mode=ticket_info&ticket_id=159539).
After upgrading, we'll leave two servers with async I/O off and two with async I/O on, and hopefully we'll see what happens.
Thanks,
lucia

@abh3
Member

abh3 commented Jan 30, 2023

Looks like the issue has been solved using the suggested config settings. So, I am closing this. Please reopen if this is incorrect.

abh3 closed this as completed on Jan 30, 2023
@lmorganti

Hi Andy,
the issue is not solved.
We upgraded to the xrootd version you suggested, and we left two servers with async I/O off and two with async I/O on.
We have been monitoring all 4 of them over the last month, but traffic has never been high and the threads have never saturated, regardless of the configuration. Sorry about that.

@xrootd-dev

xrootd-dev commented Jan 30, 2023 via email
