[XrdCl] Avoid race condition in AsyncSocketHandler on use of reader/writer objects after link is re-enabled #1722
Hi,
This patch is intended to fix a crash seen on EOS MGMs. The crash is consistent with a race between reset() and use of the unique_ptr members rspreader and reqwriter. A trace we obtained indicated that a reset was happening in AsyncSocketHandler::OnReadTimeout() while the unique_ptr was being used concurrently in AsyncMsgWriter::Write().
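To illustrate the class of race described above, here is a minimal toy sketch, not the XrdCl code and not necessarily the approach this patch takes. `ToyHandler`, `ToyWriter` and `RunToyRace` are hypothetical names invented for the example. One thread repeatedly resets and re-creates a `unique_ptr` member while another thread uses it; the mutex is an assumed remedy chosen only to make the sketch safe to run. Without the lock, the write path could dereference the pointer while the timeout path is destroying the object.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical stand-in for a message writer; not the XrdCl class.
struct ToyWriter {
  std::size_t bytes = 0;
  void Write(const std::string &msg) { bytes += msg.size(); }
};

// Hypothetical handler that owns the writer through a unique_ptr.
class ToyHandler {
 public:
  ToyHandler() : writer_(std::make_unique<ToyWriter>()) {}

  // Timeout path: drops the writer, as a reset() on close would.
  void OnTimeout() {
    std::lock_guard<std::mutex> lock(mtx_);
    writer_.reset();
  }

  // Reconnect path: installs a fresh writer.
  void Connect() {
    std::lock_guard<std::mutex> lock(mtx_);
    writer_ = std::make_unique<ToyWriter>();
  }

  // Write path: returns false if no writer is currently installed.
  bool Write(const std::string &msg) {
    std::lock_guard<std::mutex> lock(mtx_);
    if (!writer_) return false;
    writer_->Write(msg);
    return true;
  }

 private:
  std::mutex mtx_;
  std::unique_ptr<ToyWriter> writer_;
};

// Runs the timeout/reconnect cycle against concurrent writes and returns
// the number of Write() calls that were attempted.
inline int RunToyRace(int iterations) {
  ToyHandler handler;
  int attempts = 0;  // touched only by write_thread, read after join
  std::thread timeout_thread([&] {
    for (int i = 0; i < iterations; ++i) {
      handler.OnTimeout();
      handler.Connect();
    }
  });
  std::thread write_thread([&] {
    for (int i = 0; i < iterations; ++i) {
      handler.Write("ping");
      ++attempts;
    }
  });
  timeout_thread.join();
  write_thread.join();
  return attempts;
}
```

Calling `RunToyRace(1000)` from a test program exercises both paths concurrently. Removing the `lock_guard` lines turns the sketch into the unsafe pattern, which a tool such as ThreadSanitizer would be expected to flag.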
The race is understood to start in AsyncSocketHandler::OnReadTimeout(), triggered by the pStream->OnReadTimeout call (XrdClAsyncSocketHandler.cc:698). On this timeout condition the XrdClStream may first Close() the AsyncSocketHandler. Close() calls pPoller->RemoveSocket(), which eventually calls XrdSys::IOEvents::Channel::Delete(), which in turn removes the socket from the IOEvents poller with XrdSys::IOEvents::Poller::Detach().
The above calls are made from OnReadTimeout(), which itself runs inside a callback from the poller's event loop. Detach() correctly handles the case where it is called from the poller's event loop, without blocking.
A later call by XrdClStream (in the same thread) re-enables the link with EnableLink() in Stream::OnError(), which calls AsyncSocketHandler::Connect(). This then re-adds the socket to a poller by calling PollerBuiltIn::AddSocket(). AddSocket() assigns the socket's XrdSys::IOEvents::Channel to an XrdSys::IOEvents::Poller from its poller pool. Since EOS uses XRD_PARALLELEVTLOOP=4, the new poller may not be the same as the original (assuming the closed stream was the only one registered for the channelId). AddSocket() is still being called from within the original poller's event loop, but if the socket is re-added to a different poller the AsyncSocketHandler event callbacks can now be called concurrently with the ongoing execution of the original