
Possible bug in new XrdSys::LinuxSemaphore::Wait()? #110

Closed
apeters1971 opened this issue Apr 29, 2014 · 2 comments

@apeters1971 (Contributor)

On one instance (CMS) we are running the patched xrootd-3.3.6-3.CERN.slc5 version, which has a backport of the xrootd-4.0 semaphore implementation.

Since we changed the xrootd version we have observed regular lock contention (a lock-up roughly once a day): a synchronous call to an xrootd running on localhost hangs in XrdSys::LinuxSemaphore::Wait() after issuing an XrdCl::FileSystem::Query call, i.e. it either never receives a response or misses the semaphore post. Has the new LinuxSemaphore implementation been stress-tested enough in multi-threaded applications to rule out a problem there?

The ATLAS instance with xrootd-3.3.4 never showed these hangs, and before the update the CMS instance had only very rare lock-ups with the GLIBC semaphore implementation, at a much lower frequency than with the new implementation.

#0  0x00000036bc4d1c69 in syscall () from /lib64/libc.so.6
#1  0x00002b2ce73b0e28 in XrdSys::LinuxSemaphore::Wait() () from /usr/lib64/libXrdCl.so.1
#2  0x00002b2ce73c56d7 in XrdCl::XRootDStatus XrdCl::MessageUtils::WaitForResponse<XrdCl::Buffer>(XrdCl::SyncResponseHandler*, XrdCl::Buffer*&) () from /usr/lib64/libXrdCl.so.1
#3  0x00002b2ce73c03c5 in XrdCl::FileSystem::Query(XrdCl::QueryCode::Code, XrdCl::Buffer const&, XrdCl::Buffer*&, unsigned short) () from /usr/lib64/libXrdCl.so.1
#4  0x00002b2ce70bb9f3 in XrdMqClient::SendMessage (this=0x2b2ce73706e0, msg=<value optimized out>, receiverid=<value optimized out>, sign=<value optimized out>, 
    encrypt=<value optimized out>) at /usr/src/debug/eos-0.3.25-1
...
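
For reference, this is a minimal sketch of the call pattern on our side (hypothetical URL and query argument, for illustration only; the real call is issued from XrdMqClient::SendMessage):

#include <XrdCl/XrdClFileSystem.hh>
#include <iostream>

int main()
{
  // Hypothetical endpoint and query argument, for illustration only
  XrdCl::FileSystem fs( XrdCl::URL( "root://localhost:1094" ) );

  XrdCl::Buffer  arg;
  XrdCl::Buffer *response = 0;
  arg.FromString( "dummy.query" );

  // The synchronous overload blocks the calling thread in
  // XrdSys::LinuxSemaphore::Wait() (frame #1 in the trace above)
  // until the response handler posts the semaphore
  XrdCl::XRootDStatus st = fs.Query( XrdCl::QueryCode::Opaque,
                                     arg, response );

  std::cout << st.ToString() << std::endl;
  delete response;
  return 0;
}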

We have two stack traces with exactly the same signature. When it happens the next time we will also stack-trace the destination xrootd to verify that the problem does not originate there; however, the coincidence of the xrootd core change and the lock-ups looks suspicious.

Will provide more information when available.

@ljanyst (Contributor)

ljanyst commented May 5, 2014

Yes, it has been stress-tested pretty well, with a test application running hundreds of threads constantly posting and waiting on tens of semaphores. This ran for weeks, producing the expected results. Also, Matevz and Alja ran their proxy stress tests with this code in, so I am fairly confident it's OK. The fact that ATLAS works fine, and that you saw lock-ups with the native semaphore implementation as well, is a strong indicator that there is nothing wrong with this particular part of the code.

As for the stack traces: the code is doing what it is expected to do, i.e. waiting for an answer. There are two possibilities:

  1. The answer arrived but was not matched with the waiting handler. This is extremely unlikely, because this code has been running in production for almost two years now and we haven't seen anything like it.
  2. The answer does not arrive, in which case the call should eventually time out and return an error (see the sketch below). So the question is: do you see these calls time out?
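
To illustrate the pattern (a generic sketch of sync-over-async waiting with made-up names, not the actual XrdCl code): the handler completes the waiter from its callback, and a transport-level timeout must still invoke the callback with an error status so the waiting thread wakes up.

#include <condition_variable>
#include <mutex>

// Generic sketch, not the XrdCl implementation
struct SyncWaiter
{
  std::mutex              mtx;
  std::condition_variable cv;
  bool                    done   = false;
  bool                    failed = false;

  // Invoked by the messaging layer, either with the real response
  // or, crucially, with an error status when the request times out
  void HandleResponse( bool ok )
  {
    std::lock_guard<std::mutex> lock( mtx );
    failed = !ok;
    done   = true;
    cv.notify_one();
  }

  // Invoked by the user thread; this is where your trace blocks.
  // It can only return once HandleResponse() has run.
  bool WaitForResponse()
  {
    std::unique_lock<std::mutex> lock( mtx );
    cv.wait( lock, [this]{ return done; } );
    return !failed;
  }
};

// If the timeout path never calls HandleResponse(), or the wakeup is
// lost inside the synchronization primitive, WaitForResponse() blocks
// forever -- which is what your stack traces would then show.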

@ljanyst (Contributor)

ljanyst commented May 6, 2014

You were right: there was a race, and a silly one too. Fixed in 1154edc.
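
For anyone curious where such races tend to hide, the general shape of a futex-based semaphore looks like this (a simplified illustration with made-up names, not the actual XrdSysLinuxSemaphore code and not the actual bug):

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Simplified illustration only -- not the real XrdSys code
class FutexSemaphore
{
  std::atomic<int> value;

public:
  FutexSemaphore( int initial = 0 ): value( initial ) {}

  void Wait()
  {
    while( true )
    {
      int v = value.load();
      while( v > 0 )
      {
        // Try to grab a unit; on failure v is reloaded by the CAS
        if( value.compare_exchange_weak( v, v - 1 ) )
          return;
      }
      // Sleep only while the value is still 0; the kernel re-checks
      // the expected value atomically, which is what guards against
      // lost wakeups. Races in code like this typically hide in
      // getting this expected value, or the waiter bookkeeping
      // around it, wrong.
      syscall( SYS_futex, &value, FUTEX_WAIT, 0, nullptr, nullptr, 0 );
    }
  }

  void Post()
  {
    value.fetch_add( 1 );
    syscall( SYS_futex, &value, FUTEX_WAKE, 1, nullptr, nullptr, 0 );
  }
};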

ljanyst closed this as completed May 6, 2014