We are running on one instance (CMS) the patched xrootd-3.3.6-3.CERN.slc5 version, which has a back-port of the xrootd-4.0 semaphore implementation.
Since we changed the xrootd version we observe regular lock contention (a lock-up roughly once a day): a synchronous call to a running xrootd on localhost hangs in XrdSys::LinuxSemaphore::Wait() after issuing an XrdCl::FileSystem::Query call, i.e. it either does not receive a response or it misses the semaphore post. Has the new LinuxSemaphore implementation been stress-tested enough in multi-threaded applications to rule out a problem there?
The ATLAS instance with xrootd-3.3.4 never showed these hangs, and the CMS instance before the update, using the GLIBC semaphore implementation, had only very rare lock-ups, at a lower frequency than with the new implementation.
#0 0x00000036bc4d1c69 in syscall () from /lib64/libc.so.6
#1 0x00002b2ce73b0e28 in XrdSys::LinuxSemaphore::Wait() () from /usr/lib64/libXrdCl.so.1
#2 0x00002b2ce73c56d7 in XrdCl::XRootDStatus XrdCl::MessageUtils::WaitForResponse<XrdCl::Buffer>(XrdCl::SyncResponseHandler*, XrdCl::Buffer*&) () from /usr/lib64/libXrdCl.so.1
#3 0x00002b2ce73c03c5 in XrdCl::FileSystem::Query(XrdCl::QueryCode::Code, XrdCl::Buffer const&, XrdCl::Buffer*&, unsigned short) () from /usr/lib64/libXrdCl.so.1
#4 0x00002b2ce70bb9f3 in XrdMqClient::SendMessage (this=0x2b2ce73706e0, msg=<value optimized out>, receiverid=<value optimized out>, sign=<value optimized out>,
encrypt=<value optimized out>) at /usr/src/debug/eos-0.3.25-1
...
We have two stack traces with exactly the same signature. The next time it happens we will also stack-trace the destination xrootd to verify that the problem does not originate there; however, the coincidence of the xrootd core change and the lock-ups looks suspicious.
We will provide more information when available.
Yes, it has been pretty well stress-tested with a testing app running hundreds of threads doing constant posting and waiting on tens of semaphores. This has been run for weeks, producing the expected results. Also, Matevz and Alja ran their proxy stress tests with this code in, so I am fairly confident it's OK. The fact that ATLAS works fine and that you also saw lock-ups with the native semaphore implementation is a strong indicator that there is nothing wrong with this particular part of the code.
As for the stack traces: the code is doing what it is expected to do, i.e. waiting for an answer. There are two possibilities:
1. The answer arrived but was not matched with the waiting handler. This is extremely unlikely, because this code has been running in production for almost two years now and we haven't seen anything like this.
2. The answer never arrives, in which case the call should eventually time out and return an error. So the question is: do you see these calls time out?