Traffic stops when threads saturate #1799

andrearendina · 2022-10-12T11:31:17Z

Hi all,

at INFN-T1, we have been experiencing for some months an issue with the XrootD istances dedicated to the CMS experiment, concerning thread saturation.
In particular, we have 4 XrootD servers, running XrootD 4.12.4-1.el7, only used for XrootD and equipped with 2x100Gbps Ethernet which are remote mounting shared (gpfs) filesystem (they are not used as GPFS NSD servers).
At the peak load, each server is doing 4GB/s (40Gbps) of I/O (in and out) via network, serving ~2k connections.
On average, more than 400 files are transferred per minute (assuming a line with "read" in the log file corresponds to a transferred file).

As you can see from a snap-shot of our monitoring system, the threads used by the XrootD daemons (collected by a sensu plugin running "cat /proc//status | grep Threads") tend to periodically increase up to the limit (xrd.sched mint 16 maxt 2048 avlt 8 idle 60s):

When the thread saturation is reached, the throughput goes to 0 and the service stops logging in its log file. The last lines of the log files show a long sequence of the following errors

221011 11:43:53 6104 XrdLink: Unable to send to job0022.9702:2382@wm26.grid.hep.ph.ic.ac.uk; broken pipe
221011 11:43:53 6104 XrootdXeq: job0022.9702:2382@wm26.grid.hep.ph.ic.ac.uk disc 0:35:48 (send failure)
221011 11:44:03 6104 XrdLink: Unable to send to cmspl1.7320:512@nid001499.alps.cscs.ch; broken pipe
221011 11:44:03 6104 XrootdXeq: cmspl1.7320:512@nid001499.alps.cscs.ch disc 0:28:21 (send failure)
221011 11:44:26 6104 XrdLink: Unable to send to pilcms00.8780:3034@t2-fw-02-w.lnl.infn.it; broken pipe
221011 11:44:26 6104 XrootdXeq: pilcms00.8780:3034@t2-fw-02-w.lnl.infn.it disc 0:25:06 (send failure)
221011 11:44:28 6104 XrdLink: Unable to send to job0001.12124:3856@wh00.grid.hep.ph.ic.ac.uk; broken pipe
221011 11:44:28 6104 XrootdXeq: job0001.12124:3856@wh00.grid.hep.ph.ic.ac.uk disc 0:41:29 (send failure)
221011 11:44:29 6104 XrdLink: Unable to send to cms001.5134:3786@[::159.93.224.95]; broken pipe
221011 11:44:29 6104 XrootdXeq: cms001.5134:3786@[::159.93.224.95] disc 0:22:18 (send failure)

By restarting the XrootD service, the threads reset and the servers start again to make traffic, but once the threads saturate we come back to the same previous situation. You can find also the configuration file attached.
xrootd-cms.cfg.txt

Do you have any suggestion?

Thank you very much,

Andrea

The text was updated successfully, but these errors were encountered:

xrootd-dev · 2022-10-12T20:00:32Z

Ho Andrea, Yes, the key here is that you have 2K connections but limit threads to 2K as well. That means likely you are getting a passive deadlock where some critical action cannot be done because there is no thread to do it. So, try to increase the thread count to 4K abd see what happens. Andy

…

On Wed, 12 Oct 2022, andrearendina wrote: Hi all, at INFN-T1, we have been experiencing for some months an issue with the XrootD istances dedicated to the CMS experiment, concerning thread saturation. In particular, we have 4 XrootD servers, running XrootD 4.12.4-1.el7, only used for XrootD and equipped with 2x100Gbps Ethernet which are remote mounting shared (gpfs) filesystem (they are not used as GPFS NSD servers). At the peak load, each server is doing 4GB/s (40Gbps) of I/O (in and out) via network, serving ~2k connections. On average, more than 400 files are transferred per minute (assuming a line with "read" in the log file corresponds to a transferred file). As you can see from a snap-shot of our monitoring system, the threads used by the XrootD daemons (collected by a sensu plugin running "cat /proc/<pid>/status | grep Threads") tend to periodically increase up to the limit (xrd.sched mint 16 maxt 2048 avlt 8 idle 60s): ![sat_threads_NU_in](https://user-images.githubusercontent.com/73068549/195327020-fe9ce8a6-8e86-4bfc-8006-6ae56fd87491.png) When the thread saturation is reached, the throughput goes to 0 and the service stops logging in its log file. The last lines of the log files show a long sequence of the following errors ``` 221011 11:43:53 6104 XrdLink: Unable to send to ***@***.***; broken pipe 221011 11:43:53 6104 XrootdXeq: ***@***.*** disc 0:35:48 (send failure) 221011 11:44:03 6104 XrdLink: Unable to send to ***@***.***; broken pipe 221011 11:44:03 6104 XrootdXeq: ***@***.*** disc 0:28:21 (send failure) 221011 11:44:26 6104 XrdLink: Unable to send to ***@***.***; broken pipe 221011 11:44:26 6104 XrootdXeq: ***@***.*** disc 0:25:06 (send failure) 221011 11:44:28 6104 XrdLink: Unable to send to ***@***.***; broken pipe 221011 11:44:28 6104 XrootdXeq: ***@***.*** disc 0:41:29 (send failure) 221011 11:44:29 6104 XrdLink: Unable to send to cms001.5134:3786@[::159.93.224.95]; broken pipe 221011 11:44:29 6104 XrootdXeq: cms001.5134:3786@[::159.93.224.95] disc 0:22:18 (send failure) ``` By restarting the XrootD service, the threads reset and the servers start again to make traffic, but once the threads saturate we come back to the same previous situation. You can find also the configuration file attached. [xrootd-cms.cfg.txt](https://github.com/xrootd/xrootd/files/9764257/xrootd-cms.cfg.txt) Do you have any suggestion? Thank you very much, Andrea -- Reply to this email directly or view it on GitHub: #1799 You are receiving this because you are subscribed to this thread. Message ID: ***@***.***> ######################################################################## Use REPLY-ALL to reply to list To unsubscribe from the XROOTD-DEV list, click the following link: https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1

lmorganti · 2022-11-04T08:56:11Z

Hi Andy,
we did follow your suggestion and doubled thread count:
xrd.sched mint 16 maxt 4096 avlt 8 idle 60s
Unfortunately we have the same problem, this time saturating 4096 threads, as you can see here:

Again, the throughput goes to 0 and the service stops logging in its log file. By restarting the XrootD service, the threads reset and the servers start again to make traffic.
Do you have any suggestion?
Getting a passive deadlock where some critical action cannot be done because there is no thread to do it is expected?
Thanks,
lucia

abh3 · 2022-11-04T19:22:11Z

The big question is where is all this traffic coming from and when did this start happening? It's hard to imagine there are 4000 simultaneously active users. However, there is one other possibility and that is something is hanging in one of the plugins causing clients to abandon the session and reconnect. We've seen these kinds of problems before. What are you actually running? That is, what's in your config file and what kind of file system?

lmorganti · 2022-11-07T09:35:27Z

Hi, traffic is simply from CMS jobs, we don't think this trafficis weird. What looks weird is threads hanging. Is there a way we could monitor hanging threads? Such threads do not make traffic.

As already said, we are running 4 XrootD servers dedicated to CMS, running XrootD 4.12.4-1.el7, only used for XrootD and equipped with 2x100Gbps Ethernet which are remote mounting shared (GPFS) filesystem (they are not used as GPFS NSD servers).
Our configuration file is attached to the original message from Andrea, please see above
(xrd.sched mint 16 maxt 4096 avlt 8 idle 60s
following your suggestion)

Thanks for your help,
lucia

abh3 · 2022-11-07T16:32:55Z

The only thing I see in the config file is that you are exporting gpfs. Is it possible that gpfs gets overloaded at some point due to activity other than xroot? For instance, we do know that gps can become very slow when backups happen or some other job (other than xroot) is writing a lot of small files. The log appears to imply this is what is happening. We can do a deeper analysis by looking at what each thread is doing. This can be done online or offline.
offline: Generate a core file of stalled xroot using the gcore command.
online: You will have to attach gdb to the stalled process

In either case, please make sure you have installed the debug xroot RPM package so that we get statement numbers in the following steps:

If you have generated a gcore then use gdb to look at it if you've done the online step then you ae already in gdb. Either way, do the following:

(gdb) set logging on
{gdb} set logging redirect on
(gdb) thread apply all bt

This will generate a full back trace for each thread and write it to gdb.txt. This will take some time if there are many threads and the file could be very large.

Once this is done, please send a link to where we can download the resulting file and then we can see what is going on.

One more question, is this gfs instance backed by tape and gpfs has the ability to recall files from tape?

lmorganti · 2022-11-08T15:04:35Z

Thanks for this reply.
For what concerns gpfs, these 4 hosts are not doing anything apart from accessing filesystem for XrootD service. We do monitor gpfs filesystem and there was no overload which could block or slow down access to filesystem when XrootD threads were hanging.
gpfs has the ability to recall files from tape but not via xrootd; the Xrootd instances of CMS have no plugin to interact with tapes.

Following your suggestions, we installed xrootd-debuginfo, set again 2048 as maximum threads, and we are waiting for hanging threads to happen :-)
Then we'll use gdb as you said and send you a link.

Thank you very much for your help,
lucia

andrearendina · 2022-11-09T14:40:55Z

Dear Andy,

we tried to follow your instructions concerning the troubleshooting of xrootd thread saturation.

When the problem occured, we started gdb on the corresponding server and attached to the running XrootD process, but we experienced the error "try debuginfo-install xrootd-server-4.12.4-1.el7".
We were not able to find the xroot-server-debuginfo-4.12.4-1.el7 package anywhere, so we noticed that the current version of xrootd-debuginfo on the server was 5.3.1-1. Thus we removed xrootd-debuginfo-5.3.1-1.el7 and installed the 4.12.4-1.el7 version. After that, we encountered a new error message on gdb related to the vomsxrd-debuginfo-0.3.0-1.el7.cern package.
Anyway, we were able to obtain a readable backtrace log for xrootd threads.
You can find it attached hereafter.
gdb.txt

Thank you again for your help,
let us know if any other information is needed.

Andrea

abh3 · 2022-11-09T16:44:01Z

So, the trace shows that 2048 threads are waiting forlink serialization either for open() or close() (about 50%each). I don't have a reason for that but it usually means that there is an outstanding operation somewhere in the file system and until that completes neither and open nor a close can proceed. Could you post the log file that corresponds to this trace record?

andrearendina · 2022-11-09T17:07:13Z

Hi Andy,

I attach the xrootd.log recorded from the first "broken pipe" observed error until the thread saturation.
xrootd.log
I hope that helps.

Moreover, we do not see any outstanding operations on the file system ("gpfs waiters"), nor on this server nor on the entire gpfs cluster.

Thank you again!

lmorganti · 2022-11-11T11:37:58Z

Hi,
did you manage to get some information from the xrootd.log?
On our side, we are experiencing thread saturation with no traffic more than once per day, and every time a restart of the service in each of the 4 servers is needed to resume traffic.

It is pretty hard to believe some outstanding activity is undergoing this often somewhere in the CMS filesystem and completely hiding from our monitoring: we see no blocking, no pending operations, nothing weird gpfs-side.

In case it is useful, we attach a file containing the output of "netstat -an" during thread saturation in one of the server.

Let us know how we can debug further.

Thanks for your help,
lucia
netstat_an.txt

abh3 · 2022-11-11T12:00:49Z

The stack trace simply shows everyone waiting for things to complete. It does not tell me what is waiting for what nor how many different clients are in this state. I must admit that release 4.12.4 is very old (2020-09-03). Any chance you can upgrade to a supported release? Otherwise, the only thing left is with running the sever with additional tracing enabled and after that a core file. Frankly, something weird is going on with that machine. When did this start happening? What changed roughly at the time this started happening?

lmorganti · 2022-11-11T12:45:21Z

It is not something going on with a specific machine: it's every CMS XrootD instance at INFN Tier1.
It's not something recent, we have been restarting the service from time to time following thread saturation for one year.
We started debugging this together with colleagues from CERN one year ago (I mean thread saturation, no traffic and so on).
They suggested we installed 4.12.4 release, therefore we upgraded to this version.
Now I've asked CMS whether it's fine to upgrade futher; which version would you suggest?
While waiting for a reply from CMS, we would be very happy to run with additional tracing enabled and produce a core file. If you provide instructions, we are happy to try.
Thank you very much for your help,
lucia

abh3 · 2022-11-11T13:40:06Z

If you don't use HTTP third party copy (TPC) then 5.5.1 is the latest release and looks stable. Otherwise, we will have 5.5.2 available late this month. As for what additional directives you may wish to add:

xrootd.trace request response stall

Since this will generate relatively large log files, you may wish to first try turning async I/O off. In the 4.12 series we did not track lost I/O operations; so, should one happen a thread stall condition could happen. So, if you haven't disabled it please add:

xrootd.async off

If that solves the problem then we are done. Otherwise, also add the trace directive so we can see what causes this issue.

lmorganti · 2022-11-11T14:11:29Z

Hi Andy,
sorry I am a bit confused here.

One year go (19/11/2021) Elvin Sindrilaru from CERN, who was helping us with debugging precisely this issue, wrote us:

"After having a chat with Michal and Andy Hanushevsky (xrootd server side developer), we realized there is one easy configuration change that you could do to avoid having too many small read requests done against the back-end. Therefore, could you please set the following in your configuration and restart the daemon:

xrootd.async nofs"

At that time, we add

xrootd.async off

Following that suggestion, we changed xrootd.async off to xrootd.async nosf, and saw great improvements.
Now, are you saying we should set it back to
xrootd.async off
?

Thanks,
lucia

abh3 · 2022-11-11T14:34:35Z

Ah, no for gpfs you really need xrootd.async nosf so in this case specify:

xrootd.async off nosf

OK?

lmorganti · 2022-11-11T14:37:47Z

Ok.
So let's see if I have understood:

xrootd.async off nosf
restart
see whether the problem is solved

In case it is not, and thread saturation happens again, then:

xrootd.trace request response stall
xrootd.async off nosf
restart
wait for the problem to happen, then send you a big log file

Am I correct?
Thanks,
lucia

abh3 · 2022-11-11T14:52:29Z

Quite correct :-)

abh3 · 2022-12-01T01:18:39Z

Since I have not seen additional comments, is it the case the problem went away when you turned off ASYNC I/O?

lmorganti · 2022-12-01T09:11:15Z

Hi Andy,
we did turn off async i/o in two of the 4 CMS servers 20 days ago.
Since then, we never observed thread saturation, neither in those 2 servers nor in the 2 with async i/o on.
So unfortunately we cannot comment further at the moment.
Lucia

abh3 · 2022-12-01T16:24:50Z

Hi Lucia, I think that was your solution. So, two questions a) do any of the plugins you use also use Globus and b) what is the version of your Lustre? Is the answer to (a) is yes then this is indeed your solution as it is known that Globus is incompatible with async I/O. Also, depending on the version of the Lustre being used, it may also be incompatible with async I/O. Andy

…

On Thu, 1 Dec 2022, lmorganti wrote: Hi Andy, we did turn off async i/o in two of the 4 CMS servers 20 days ago. Since then, we never observed thread saturation, neither in those 2 servers nor in the 2 with async i/o on. So unfortunately we cannot comment further at the moment. Lucia -- Reply to this email directly or view it on GitHub: #1799 (comment) You are receiving this because you were assigned. Message ID: ***@***.***>

lmorganti · 2022-12-01T16:30:36Z

Hi Andy,
unfortunately it was not a solution, as I was trying to say: currently we don't observe thread saturation in those servers where async I/O is on and we don't observe it in those servers where async I/O is off.
So we cannot conclude anything.
We do not use Lustre, we use GPFS.
Cheers,
lucia

abh3 · 2022-12-01T16:33:30Z

Hi Lucia, Ah, misunderstood. The Globus question remains and the second part should have GPFS but technically it is still a valid question. Andy

…

On Thu, 1 Dec 2022, lmorganti wrote: Hi Andy, unfortunately it was not a solution, as I was trying to say: currently we don't observe thread saturation in those servers where async I/O is on and we don't observe it in those servers where async I/O is off. So we cannot conclude anything. We do not use Lustre, we use GPFS. Cheers, lucia -- Reply to this email directly or view it on GitHub: #1799 (comment) You are receiving this because you were assigned. Message ID: ***@***.***>

lmorganti · 2022-12-01T16:35:45Z

Hi Andy,
GPFS 5.0.5-9
I cannot find Globus

rpm -qa | grep -i globus

[root@xs-001 ~]#
but please tell me how to look for it.
Thanks,
lucia

xrootd-dev · 2022-12-01T16:46:59Z

It's probably easier if you just provide your config file so I can see what plugins are being used. Likely you don't want to post the config in a public place so just directly email it to me or email a link to it. Andy

…

On Thu, 1 Dec 2022, lmorganti wrote: Hi Andy, GPFS 5.0.5-9 I cannot find Globus # rpm -qa | grep -i globus ***@***.*** ~]# but please tell me how to look for it. Thanks, lucia -- Reply to this email directly or view it on GitHub: #1799 (comment) You are receiving this because you commented. Message ID: ***@***.***> ######################################################################## Use REPLY-ALL to reply to list To unsubscribe from the XROOTD-DEV list, click the following link: https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1

lmorganti · 2022-12-01T16:48:24Z

Hi Andy,
the xrootd config file is attached in the first message of this thread.
lucia

xrootd-dev · 2022-12-01T17:10:44Z

Hi Lucia, Right, should have looked there first. One more thing, is GPFS being acessed via native client or via an NFS mount? Andy

…

On Thu, 1 Dec 2022, lmorganti wrote: Hi Andy, the xrootd config file is attached in the first message of this thread. lucia -- Reply to this email directly or view it on GitHub: #1799 (comment) You are receiving this because you commented. Message ID: ***@***.***> ######################################################################## Use REPLY-ALL to reply to list To unsubscribe from the XROOTD-DEV list, click the following link: https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1

lmorganti · 2022-12-01T18:53:56Z

Hi Andy,
via native client.

abh3 · 2022-12-01T19:22:02Z

Thanks. There are several possibilities; none of which look impossible but none that I could point to the smoking gun. I see what is happening but can't explain why. I do know hat there have been significant work on trying to keep GPFS from slowing down when multiple threads hit it at once (this happens in this particular case). You can see all fixed to GPFS since your release here:

https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars_505x.html#APARs

You may want to consider upgrading GPFS. Also, there have been numerous improvements on how async I/I is handled by the server since 4.12.4. As this release is no longer supported I would strongly urge you to upgrade to the latest release, 5.5.1.

In any case, it would strongly imply that turning off async on several servers keeps GPFS from getting overloaded and that does no trigger deficiencies in xrootd that wind up effectively deadlocking. Tha, because client recovery, quickly uses up all available threads.

So, we can wait and I suspect we won't see a problem but let's not make that call until a couple of more weeks go by. It could simply be that the load has dropped off due to the holidays.

bbockelm · 2022-12-01T21:38:32Z

@abh3 - it looks like there's a potential deadlock whenever there's a I/O operation queued in the Scheduler that holds a reference count to the XrdLink object.

When there are no remaining idle threads, then the queued jobs that hold a reference count will keep the reference count non-zero indefinitely.

However, some of the running jobs cannot complete until the ref count drops to zero. There's ~1500 stuck threads In @lmorganti's stack trace of the following pattern:

#4  0x00007fae9f87c1fe in Wait (this=<optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:419
#5  XrdLink::Serialize (this=0x7fac6401b978) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdLink.cc:1086
#6  0x00007fae9fb002a8 in XrdXrootdProtocol::do_Close (this=0x7fadbc0402e0) at /usr/src/debug/xrootd/xrootd/src/XrdXrootd/XrdXrootdXeq.cc:532
#7  0x00007fae9f87ce49 in XrdLink::DoIt (this=0x7fac6401b978) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdLink.cc:441
#8  0x00007fae9f8801df in XrdScheduler::Run (this=0x610e58 <XrdMain::Config+440>) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:357
#9  0x00007fae9f880329 in XrdStartWorking (carg=<optimized out>) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:87
#10 0x00007fae9f845be7 in XrdSysThread_Xeq (myargs=0x7fad100449f0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.cc:86

Assuming there's a queued I/O request corresponding to those links, those 1500 threads will never finish.

Queued job can't run until something finishes. Running job can't finish until queued job runs. Classic deadlock.

Now, in what circumstances do queued jobs hold a reference to the link? I count two cases:

Async I/O (both normal reads and pgreads)
When non-default sockets are used for a read. I think this might occur if the client requests for encrypted control and unencrypted data.

I'm not 100% sure but perhaps (2) is rare (is it really rare in the case of caches as well?)? That might give some hope that (1) can be avoided simply by turning off async IO.

Is it possible to automatically turn off async IO when thread counts are limited?

xrootd-dev · 2022-12-01T22:12:19Z

Hi Brian, That certainly is possible but async I/O is supposed to automatically stop when it reaches a level that should be below the available threads (though that's not the way it's computed). We can certainly add that to the formulae. Today, you can manually tune that. That said, the whole async architecture was redone to avoid such problems in R5. Even then, we automatically refuse to use async I/O for local devices as it makes no sense to do so until io_uring becomes widely available. Those are strong reasons to recommend an upgrade to the latest release. Finally, we would have suggested something more than 8K max threads to avoid such a problem but the implication was that the deployed hardware couldn't handle such a load. So, if the trigger is async I/O then it's completely correct to turn it off as it's not helping anyway. Andy

…

On Thu, 1 Dec 2022, Brian P Bockelman wrote: @abh3 - it looks like there's a potential deadlock whenever there's a I/O operation queued in the Scheduler that holds a reference count to the XrdLink object. When there are no remaining idle threads, then the queued jobs that hold a reference count will keep the reference count non-zero indefinitely. However, some of the running jobs cannot complete until the ref count drops to zero. There's ~1500 stuck threads In @lmorganti's stack trace of the following pattern: ``` #4 0x00007fae9f87c1fe in Wait (this=<optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:419 #5 XrdLink::Serialize (this=0x7fac6401b978) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdLink.cc:1086 #6 0x00007fae9fb002a8 in XrdXrootdProtocol::do_Close (this=0x7fadbc0402e0) at /usr/src/debug/xrootd/xrootd/src/XrdXrootd/XrdXrootdXeq.cc:532 #7 0x00007fae9f87ce49 in XrdLink::DoIt (this=0x7fac6401b978) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdLink.cc:441 #8 0x00007fae9f8801df in XrdScheduler::Run (this=0x610e58 <XrdMain::Config+440>) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:357 #9 0x00007fae9f880329 in XrdStartWorking (carg=<optimized out>) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:87 #10 0x00007fae9f845be7 in XrdSysThread_Xeq (myargs=0x7fad100449f0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.cc:86 ``` Assuming there's a queued I/O request corresponding to those links, those 1500 threads will never finish. Queued job can't run until something finishes. Running job can't finish until queued job runs. Classic deadlock. Now, in what circumstances do queued jobs hold a reference to the link? I count two cases: 1. Async I/O (both normal reads and pgreads) 2. When non-default sockets are used for a read. I think this might occur if the client requests for encrypted control and unencrypted data. I'm not 100% sure but perhaps (2) is rare (is it really rare in the case of caches as well?)? That might give some hope that (1) can be avoided simply by turning off async IO. Is it possible to automatically turn off async IO when thread counts are limited? -- Reply to this email directly or view it on GitHub: #1799 (comment) You are receiving this because you commented. Message ID: ***@***.***> ######################################################################## Use REPLY-ALL to reply to list To unsubscribe from the XROOTD-DEV list, click the following link: https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1

lmorganti · 2022-12-02T09:55:06Z

Thanks Andy and thanks Brian for having looked at the stack trace!
We'll upgrade xrootd as soon as possible (we are waiting for CMS feedback and then we'll do that: https://ggus.eu/index.php?mode=ticket_info&ticket_id=159539).
After upgrading, we'll leave two servers with async I/O off and two with async I/O on, and hopefully we'll see what happens.
Thanks,
lucia

abh3 · 2023-01-30T10:28:14Z

Looks like the issue has been solved using the suggested config settings. So, I am closing this. Please reopen if this is incorrect.

lumorganti · 2023-01-30T10:32:06Z

Hi Andy,
issue is not solved.
We upgraded to the xrootd version you suggested, and we left two servers with async I/O off and two with async I/O on.
We have been monitoring the 4 of them over the last month but traffic has never been high and threads never saturated, independent of configuration. Sorry about that.

xrootd-dev · 2023-01-30T10:37:40Z

OK, should it appear that the problem occurs on the servers with async I/O off, please reopen the ticket.

…

On Mon, 30 Jan 2023, lumorganti wrote: Hi Andy, issue is not solved. We upgraded to the xrootd version you suggested, and we left two servers with async I/O off and two with async I/O on. We have been monitoring the 4 of them over the last month but traffic has never been high and threads never saturated, independent of configuration. Sorry about that. -- Reply to this email directly or view it on GitHub: #1799 (comment) You are receiving this because you commented. Message ID: ***@***.***> ######################################################################## Use REPLY-ALL to reply to list To unsubscribe from the XROOTD-DEV list, click the following link: https://listserv.slac.stanford.edu/cgi-bin/wa?SUBED1=XROOTD-DEV&A=1

abh3 self-assigned this Nov 7, 2022

abh3 added the Under Investigation label Nov 7, 2022

abh3 closed this as completed Jan 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Traffic stops when threads saturate #1799

Traffic stops when threads saturate #1799

andrearendina commented Oct 12, 2022 •

edited

xrootd-dev commented Oct 12, 2022 via email

lmorganti commented Nov 4, 2022

abh3 commented Nov 4, 2022

lmorganti commented Nov 7, 2022

abh3 commented Nov 7, 2022 •

edited

lmorganti commented Nov 8, 2022

andrearendina commented Nov 9, 2022

abh3 commented Nov 9, 2022

andrearendina commented Nov 9, 2022

lmorganti commented Nov 11, 2022

abh3 commented Nov 11, 2022

lmorganti commented Nov 11, 2022

abh3 commented Nov 11, 2022

lmorganti commented Nov 11, 2022

abh3 commented Nov 11, 2022

lmorganti commented Nov 11, 2022

abh3 commented Nov 11, 2022

abh3 commented Dec 1, 2022

lmorganti commented Dec 1, 2022

abh3 commented Dec 1, 2022 via email

lmorganti commented Dec 1, 2022

abh3 commented Dec 1, 2022 via email

lmorganti commented Dec 1, 2022

xrootd-dev commented Dec 1, 2022 via email

lmorganti commented Dec 1, 2022

xrootd-dev commented Dec 1, 2022 via email

lmorganti commented Dec 1, 2022

abh3 commented Dec 1, 2022

bbockelm commented Dec 1, 2022

xrootd-dev commented Dec 1, 2022 via email

lmorganti commented Dec 2, 2022

abh3 commented Jan 30, 2023

lumorganti commented Jan 30, 2023

xrootd-dev commented Jan 30, 2023 via email

Traffic stops when threads saturate #1799

Traffic stops when threads saturate #1799

Comments

andrearendina commented Oct 12, 2022 • edited

xrootd-dev commented Oct 12, 2022 via email

lmorganti commented Nov 4, 2022

abh3 commented Nov 4, 2022

lmorganti commented Nov 7, 2022

abh3 commented Nov 7, 2022 • edited

lmorganti commented Nov 8, 2022

andrearendina commented Nov 9, 2022

abh3 commented Nov 9, 2022

andrearendina commented Nov 9, 2022

lmorganti commented Nov 11, 2022

abh3 commented Nov 11, 2022

lmorganti commented Nov 11, 2022

abh3 commented Nov 11, 2022

lmorganti commented Nov 11, 2022

abh3 commented Nov 11, 2022

lmorganti commented Nov 11, 2022

abh3 commented Nov 11, 2022

abh3 commented Dec 1, 2022

lmorganti commented Dec 1, 2022

abh3 commented Dec 1, 2022 via email

lmorganti commented Dec 1, 2022

abh3 commented Dec 1, 2022 via email

lmorganti commented Dec 1, 2022

rpm -qa | grep -i globus

xrootd-dev commented Dec 1, 2022 via email

lmorganti commented Dec 1, 2022

xrootd-dev commented Dec 1, 2022 via email

lmorganti commented Dec 1, 2022

abh3 commented Dec 1, 2022

bbockelm commented Dec 1, 2022

xrootd-dev commented Dec 1, 2022 via email

lmorganti commented Dec 2, 2022

abh3 commented Jan 30, 2023

lumorganti commented Jan 30, 2023

xrootd-dev commented Jan 30, 2023 via email

andrearendina commented Oct 12, 2022 •

edited

abh3 commented Nov 7, 2022 •

edited