
Possible bug in new XrdSys::LinuxSemaphore::Wait()? #110

Closed
apeters1971 opened this issue Apr 29, 2014 · 2 comments

@apeters1971 (Contributor)

On one instance (CMS) we are running the patched xrootd-3.3.6-3.CERN.slc5 version, which has a backport of the xrootd-4.0 semaphore implementation.

Since we changed the xrootd version we have observed regular lock contention (a lock-up roughly once a day): a synchronous call to an xrootd running on localhost hangs in XrdSys::LinuxSemaphore::Wait() after issuing an XrdCl::FileSystem::Query call, i.e. it either never receives a response or misses the semaphore post. Has the new LinuxSemaphore implementation been stress-tested enough in multi-threaded applications to rule out a problem there?

The ATLAS instance with xrootd-3.3.4 never showed these hangs, and before the update the CMS instance had only very rare lock-ups with the GLIBC semaphore implementation, at a much lower frequency than with the new implementation.

#0  0x00000036bc4d1c69 in syscall () from /lib64/libc.so.6
#1  0x00002b2ce73b0e28 in XrdSys::LinuxSemaphore::Wait() () from /usr/lib64/libXrdCl.so.1
#2  0x00002b2ce73c56d7 in XrdCl::XRootDStatus XrdCl::MessageUtils::WaitForResponse<XrdCl::Buffer>(XrdCl::SyncResponseHandler*, XrdCl::Buffer*&) () from /usr/lib64/libXrdCl.so.1
#3  0x00002b2ce73c03c5 in XrdCl::FileSystem::Query(XrdCl::QueryCode::Code, XrdCl::Buffer const&, XrdCl::Buffer*&, unsigned short) () from /usr/lib64/libXrdCl.so.1
#4  0x00002b2ce70bb9f3 in XrdMqClient::SendMessage (this=0x2b2ce73706e0, msg=<value optimized out>, receiverid=<value optimized out>, sign=<value optimized out>, 
    encrypt=<value optimized out>) at /usr/src/debug/eos-0.3.25-1
...
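
For reference, this is a minimal sketch of the call pattern on our side (hypothetical URL and query argument, for illustration only; the real call is issued from XrdMqClient::SendMessage):

#include <XrdCl/XrdClFileSystem.hh>
#include <iostream>

int main()
{
  // Hypothetical endpoint and query argument, for illustration only
  XrdCl::FileSystem fs( XrdCl::URL( "root://localhost:1094" ) );

  XrdCl::Buffer  arg;
  XrdCl::Buffer *response = 0;
  arg.FromString( "dummy.query" );

  // The synchronous overload blocks the calling thread in
  // XrdSys::LinuxSemaphore::Wait() (frame #1 in the trace above)
  // until the response handler posts the semaphore
  XrdCl::XRootDStatus st = fs.Query( XrdCl::QueryCode::Opaque,
                                     arg, response );

  std::cout << st.ToString() << std::endl;
  delete response;
  return 0;
}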

We have two stack traces with exactly the same signature. When it happens the next time we will also stack-trace the destination xrootd to verify that the problem does not originate there; however, the coincidence of the xrootd core change and the lock-ups looks suspicious.

Will provide more information when available.

@ljanyst (Contributor)

ljanyst commented May 5, 2014

Yes, it has been stress-tested pretty well, with a test application running hundreds of threads constantly posting and waiting on tens of semaphores. This ran for weeks, producing the expected results. Also, Matevz and Alja ran their proxy stress tests with this code in, so I am fairly confident it's OK. The fact that ATLAS works fine, and that you saw lock-ups with the native semaphore implementation as well, is a strong indicator that there is nothing wrong with this particular part of the code.

As for the stack traces: the code is doing what it is expected to do, i.e. waiting for an answer. There are two possibilities:

  1. The answer arrived but was not matched with the waiting handler. This is extremely unlikely, because this code has been running in production for almost two years now and we haven't seen anything like it.
  2. The answer does not arrive, in which case the call should eventually time out and return an error (see the sketch below). So the question is: do you see these calls time out?
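
To illustrate the pattern (a generic sketch of sync-over-async waiting with made-up names, not the actual XrdCl code): the handler completes the waiter from its callback, and a transport-level timeout must still invoke the callback with an error status so the waiting thread wakes up.

#include <condition_variable>
#include <mutex>

// Generic sketch, not the XrdCl implementation
struct SyncWaiter
{
  std::mutex              mtx;
  std::condition_variable cv;
  bool                    done   = false;
  bool                    failed = false;

  // Invoked by the messaging layer, either with the real response
  // or, crucially, with an error status when the request times out
  void HandleResponse( bool ok )
  {
    std::lock_guard<std::mutex> lock( mtx );
    failed = !ok;
    done   = true;
    cv.notify_one();
  }

  // Invoked by the user thread; this is where your trace blocks.
  // It can only return once HandleResponse() has run.
  bool WaitForResponse()
  {
    std::unique_lock<std::mutex> lock( mtx );
    cv.wait( lock, [this]{ return done; } );
    return !failed;
  }
};

// If the timeout path never calls HandleResponse(), or the wakeup is
// lost inside the synchronization primitive, WaitForResponse() blocks
// forever -- which is what your stack traces would then show.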

@ljanyst (Contributor)

ljanyst commented May 6, 2014

You were right: there was a race, and a silly one too. Fixed in 1154edc.
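
For anyone curious where such races tend to hide, the general shape of a futex-based semaphore looks like this (a simplified illustration with made-up names, not the actual XrdSysLinuxSemaphore code and not the actual bug):

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Simplified illustration only -- not the real XrdSys code
class FutexSemaphore
{
  std::atomic<int> value;

public:
  FutexSemaphore( int initial = 0 ): value( initial ) {}

  void Wait()
  {
    while( true )
    {
      int v = value.load();
      while( v > 0 )
      {
        // Try to grab a unit; on failure v is reloaded by the CAS
        if( value.compare_exchange_weak( v, v - 1 ) )
          return;
      }
      // Sleep only while the value is still 0; the kernel re-checks
      // the expected value atomically, which is what guards against
      // lost wakeups. Races in code like this typically hide in
      // getting this expected value, or the waiter bookkeeping
      // around it, wrong.
      syscall( SYS_futex, &value, FUTEX_WAIT, 0, nullptr, nullptr, 0 );
    }
  }

  void Post()
  {
    value.fetch_add( 1 );
    syscall( SYS_futex, &value, FUTEX_WAKE, 1, nullptr, nullptr, 0 );
  }
};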

ljanyst closed this as completed May 6, 2014