Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release Stream lock before invoking callbacks. #216 #247

Merged
merged 1 commit into from
Jul 9, 2015

Conversation

bbockelm
Copy link
Contributor

@bbockelm bbockelm commented Jul 5, 2015

OnReadTimeout will call OnError with the Stream's lock held.
This causes OnError to invoke the user's callback with the
Stream lock held, possibly causing a lock ordering issue.

This fixes an observed deadlock within CMSSW over the following
objects (originally reported by Chris Jones):

Thread 1
    FileStateHandler::VectorRead takes FileStateHandler lock
    Stream::Send tries to take Stream lock

Thread 5
    FileTimer::Run takes FileTimer lock
    FileStateHandler::Tick tries to take FileStateHandler

Thread 6
    Stream::OnReadTimeout takes Stream lock
    DelayedClose::HandleResponseWithHosts  (this is callback code in CMSSW); this deletes a File object, which eventually calls
    FileTimer::UnRegisterFileObject tries to take FileTimer lock

OnReadTimeout will call OnError with the Stream's lock held.
This causes OnError to invoke the user's callback with the
Stream lock held, possibly causing a lock ordering issue.

This fixes an observed deadlock within CMSSW over the following
objects:

Thread 1
    FileStateHandler::VectorRead takes FileStateHandler lock
    Stream::Send tries to take Stream lock

Thread 5
    FileTimer::Run takes FileTimer lock
    FileStateHandler::Tick tries to take FileStateHandler

Thread 6
    Stream::OnReadTimeout takes Stream lock
    DelayedClose::HandleResponseWithHosts  (this is callback code in CMSSW); this deletes a File object, which eventually calls
    FileTimer::UnRegisterFileObject tries to take FileTimer lock
@ljanyst
Copy link
Contributor

ljanyst commented Jul 6, 2015

To avoid these you need to delegate all the callback processing to the worker thread pool. Currently error callbacks are processed in the poller thread.

@abh3
Copy link
Member

abh3 commented Jul 6, 2015

Indeed, I am hesitant to apply the patch because it may not fix all
possible deadlocks. I'd say Lukasz is right, the error callbacks (indeed
all the callbacks) should never have been done from poller thread. What's
the size of the fix to change the callback routing?

Andy

On Mon, 6 Jul 2015, Lukasz Janyst wrote:

To avoid these you need to delegate all the callback processing to the worker thread pool. Currently error callbacks are processed in the poller thread.


Reply to this email directly or view it on GitHub:
#247 (comment)

@ljanyst
Copy link
Contributor

ljanyst commented Jul 7, 2015

This patch is a no-op really because the first thing OnError does is lock the mutex again. The mutex is recursive because some of these methods may be called from both locked and unlocked contexts.

I don't think the proper fix will be much code if done right, but it may be fairly complex. I would have to look at the code to say for sure though.

@bbockelm
Copy link
Contributor Author

bbockelm commented Jul 7, 2015

@ljanyst - OnError does lock the mutex, but then it unlocks it before calling the handlers.

When called from OnReadTimeout, the mutex is already locked by the callee so on that call path, the handlers are called with the Stream mutex held (resulting in deadlock for us). All other call paths I could find drop that particular mutex before invoking handlers.

I'd be super-happy to have a more thorough patch, but I'd be sufficiently happy to have the Xrootd client not deadlock the Tier-0.

@ljanyst
Copy link
Contributor

ljanyst commented Jul 7, 2015

I did not look at the code and it looks like my memory of it is fading :(

That's definitely a strong point for applying the patch.

simonmichal added a commit that referenced this pull request Jul 9, 2015
Release Stream lock before invoking callbacks. #216
@simonmichal simonmichal merged commit bf3bfd6 into xrootd:master Jul 9, 2015
@simonmichal
Copy link
Contributor

I will merge the pull request because it fixes the problem in this particular case. It seems obvious to me that the mutex should be unlocked before calling the handlers, that's why it is done so (line 778), but since the mutex is recursive it has no effect in this particular case. Unlocking the mutex in the callee class wont have any side effects because it will anyway go out of scope.

I will also create a separate issue because the current implementation is error prone and problem has to be addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants