Built-in poller deadlock #4
It looks like the built-in poller is deadlocking in XrdSys::IOEvents::Channel::Enable vs XrdSys::IOEvents::Poller::CbkTMO, in threads 26 and 24.

Comments
More feedback from Andreas: The owner of the lock in question is here:
owner=17344 ... unfortunately this thread is not alive anymore :-(
It looks like an internal mutex of the poller is not released cleanly in some circumstances.
Here is another case where it is clearer:
Now looking at the owner of the lock (which in this case still exists):
Even the owner wants to lock the same thing again .... looks like a reentrant lock ...
In Lukasz's example, the thread holding the lock is no longer there. So, it would seem that the thread was cancelled while it was holding that lock. I can't do anything about that; the thread should not have been cancelled. In Andreas' second example, had he gone to thread 253 and printed the lock information for that thread, he would likely have seen that the thread holding the lock which 253 is waiting for is gone as well. Both the timeout queue lock and the callback lock are recursive, so there is no problem with multiple locking by the same thread. So, who is cancelling these threads? Certainly not the poller.
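As a side note, here is a minimal standalone sketch (plain pthreads, not the actual XrdSys wrappers) of both points above: a recursive mutex happily accepts repeated locking by its owner, but if the owning thread is cancelled while holding it, the mutex is never released and every later lock attempt blocks forever:

```cpp
#include <pthread.h>
#include <unistd.h>
#include <cstdio>

// Hypothetical demo, not XRootD code.
static pthread_mutex_t recMtx;

void *holder(void *)
{
    pthread_mutex_lock(&recMtx);
    pthread_mutex_lock(&recMtx);   // fine: recursive lock, same owner thread
    sleep(5);                      // cancellation point while holding the lock
    pthread_mutex_unlock(&recMtx); // never reached if the thread is cancelled
    pthread_mutex_unlock(&recMtx);
    return nullptr;
}

int main()
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&recMtx, &attr);

    pthread_t t;
    pthread_create(&t, nullptr, holder, nullptr);
    sleep(1);                      // let holder acquire the lock
    pthread_cancel(t);             // kill the owner while it holds recMtx
    pthread_join(t, nullptr);

    printf("trying to lock the orphaned mutex...\n");
    pthread_mutex_lock(&recMtx);   // blocks forever: the owner is gone
    printf("never printed\n");
    return 0;
}
```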
I have another case, and it looks like a deadlock triggered by lock inversion in the poller implementation. These are the two threads blocking each other:
Thread 26 is stuck here:
Thread 21 is stuck here:
In Thread 26 the owner of chMutex is ID=2391 => Thread 21
===> deadlock between both threads.
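For illustration only, a minimal ABBA sketch of the reported inversion, with hypothetical names standing in for the poller's channel mutex and timeout lock (the real XrdSys::IOEvents code is more involved):

```cpp
#include <mutex>
#include <thread>
#include <chrono>

// Hypothetical stand-ins for the poller's channel and timeout-queue locks.
std::mutex chMutex;   // per-channel lock
std::mutex tmoMutex;  // timeout-queue lock

void enablePath()     // e.g. an Enable()-like path: channel lock first
{
    std::lock_guard<std::mutex> ch(chMutex);
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::lock_guard<std::mutex> tmo(tmoMutex);   // waits on callbackPath
}

void callbackPath()   // e.g. a CbkTMO()-like path: timeout lock first
{
    std::lock_guard<std::mutex> tmo(tmoMutex);
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::lock_guard<std::mutex> ch(chMutex);     // waits on enablePath
}

int main()
{
    std::thread a(enablePath), b(callbackPath);
    a.join(); b.join();   // never returns: classic ABBA deadlock
}
```

The usual cures are a single global acquisition order for the two locks, or releasing one lock before taking the other.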
OK, I see the issue now. It's how timeouts are handled. It seems that just as there was a timeout, the Enable() method was called from another thread for the same object that is in the process of calling the timeout callback. It seems rather suspicious that someone should do that, because it means the channel is being re-enabled while it was already enabled for timeouts. That means it is quite likely that the timeout callback will be called in any case. Was this really intended? I pushed a patch that should hopefully fix the deadlock.
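A generic pattern for this class of problem, sketched here under assumed names and not necessarily what the pushed patch does, is to release the internal lock before calling out into the timeout callback, so a concurrent Enable() on the same channel cannot wedge:

```cpp
#include <mutex>
#include <functional>
#include <utility>

// Hypothetical fragment; identifiers are illustrative, not XRootD's.
class Channel {
    std::mutex            chMutex;
    std::function<void()> tmoCallback;

public:
    void setCallback(std::function<void()> cb)
    {
        std::lock_guard<std::mutex> lk(chMutex);
        tmoCallback = std::move(cb);
    }

    void dispatchTimeout()
    {
        std::unique_lock<std::mutex> lk(chMutex);
        auto cb = tmoCallback;   // copy the callback while locked
        lk.unlock();             // drop the lock before calling out
        if (cb) cb();            // Enable() elsewhere can now take chMutex
    }
};
```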
Thanks for the patch! We'll test it. The timeout is related to the reading subscription, which never gets disabled; here, the writing subscription was being enabled. Yes, this is intended.
OK, since I have to review the timeout section to see if I can get it to be more efficient, I may have to split the read and write queues. At the moment there is a minimization algorithm that tries to use only one of the two timeout values to keep the timeout queues short.
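As a rough illustration of that minimization, under assumed semantics rather than the actual algorithm: a channel with both read and write timeouts gets a single queue entry keyed on the nearer deadline, which keeps the timeout queue short but couples the two paths, hence the idea of splitting the queues:

```cpp
#include <algorithm>
#include <ctime>

// Hypothetical sketch of one timeout-queue entry per channel.
struct ChannelTO {
    time_t rdDeadline;   // absolute read-timeout deadline (0 = none)
    time_t wrDeadline;   // absolute write-timeout deadline (0 = none)

    // The single queue entry uses the earlier of the two deadlines,
    // so only one of the two timeout values drives the queue.
    time_t queueDeadline() const
    {
        if (!rdDeadline) return wrDeadline;
        if (!wrDeadline) return rdDeadline;
        return std::min(rdDeadline, wrDeadline);
    }
};
```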
I think that it's good enough for me for the moment. |
Yes, but I still have to go through the cpugrind output to see where I am inefficient, sigh. |