Built-in poller occasionally not invoking callbacks on OSX #5
The async socket handler in the client issues a connect(2) on a non-blocking socket via XrdNetConnect::Connect (which returns EINPROGRESS) and subscribes to write notifications for the socket in order to get notified when the connect completes. This is done in accordance with the manual, which says:
The write notification arrives erratically when using the built-in poller on MacOSX. We haven't observed this issue using the built-in poller on Linux, nor with the libevent poller on any platform.
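For readers unfamiliar with the pattern, a non-blocking connect on a POSIX system can be sketched as follows. This is a minimal illustration and not the xrootd code: `ConnectOutcome`, `classifyConnect`, `startConnect`, and `pendingConnectError` are hypothetical names. The sketch assumes only standard POSIX behaviour: connect(2) on a non-blocking socket may return 0 right away, may fail with EINPROGRESS (in which case the caller waits for writability and then reads SO_ERROR), or may fail outright.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical names for illustration; not part of xrootd.
enum class ConnectOutcome { Connected, InProgress, Failed };

// Map the return value and errno of a non-blocking connect(2) to an outcome.
inline ConnectOutcome classifyConnect(int rc, int err) {
    if (rc == 0) return ConnectOutcome::Connected;              // finished immediately
    if (err == EINPROGRESS) return ConnectOutcome::InProgress;  // wait for writability
    return ConnectOutcome::Failed;                              // any other errno
}

// Put the socket into non-blocking mode and start the connect.
inline ConnectOutcome startConnect(int fd, const sockaddr* addr, socklen_t len) {
    assert(addr != nullptr);
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return ConnectOutcome::Failed;
    int rc = connect(fd, addr, len);
    return classifyConnect(rc, rc == 0 ? 0 : errno);
}

// After a write notification for an InProgress socket, SO_ERROR reports
// whether the connect succeeded (0) or failed (an errno value).
inline int pendingConnectError(int fd) {
    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == -1) return errno;
    return err;
}
```

A caller would typically subscribe to write notifications only for `InProgress`, and proceed directly for `Connected`.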
To reproduce on OSX 10.8.3 with a clean xrootd build from master, run any client command such as
After some more debugging, I have some more information: In
This obviously then returns without invoking the callback. Note that I'm not sure why this happens - I'll keep digging. Edit: The effective poller is still set to a
So it looks like there is a race condition in the poller initialisation. I am working on localhost, so what is happening is that the first handshake request comes back before the poller can initialise itself (see my edited post above). The issue is with the recursive mutex ( The channel mutex gets locked and unlocked all over the place, and I'm finding it a little difficult to see a way of avoiding this at the moment. |
Thanks Jason! That saves me a huge amount of work. I think I can come up with something here to avoid this problem. Of course, the nagging question is why this doesn't show up on Linux. Anyway, I will work on it tomorrow. Oh yes, the reason that the channel mutex gets unlocked all over the place is that there was a desire to be able to call any of the channel methods while you were in a callback, including deleting the channel itself. It becomes rather messy when you liberalize the calls that way.
OK, this is clearly a client bug. The client does not properly handle the case when XrdNet::Connect() immediately returns with success instead of EINPROGRESS. You can see this in the following two traces (I added some print statements). This only happens on Solaris and MacOS when connecting to a local socket (Linux seems to always return EINPROGRESS). Everything works fine as long as you get EINPROGRESS status. It appears that there is a logic issue when testing for connection status on the first callback (which always happens). The other issue I saw is that the callback does not handle the case when more than one event is reflected (e.g. readytoread along with readytowrite). This also seems to cause problems. In any case, the builtin poller is working exactly as advertised. Non-working trace:
Working trace:
The problem is a race on poller init.
I respectfully disagree and can show you the exact trace that shows why. After calling connect, the client explicitly enables write notifications. The poller happily obliges and sends a "readytowrite" notification because, in fact, this is true. There is no race condition at all. I can show you explicit traces that show that. The sequence is absolutely correct. The only difference is that when the Connect method indicates that the connect immediately succeeded, the client does not correctly handle the readytowrite notification. This is likely because that notification should not have been enabled at that point (but always is).
Let me explain further what the logic problem is in the client's builtin poller method. When a socket is immediately connected, the socket status is marked as connected. The readytowrite callback is invoked after the enable for write notification is called. The enable is called regardless of whether or not the connect immediately succeeded. The callback method checks if there is a write outstanding or if the status is connecting. However, in this case we have a status of connected but no handshake outstanding, because the logic is that the handshake will be driven via the callback. In this instance, the callback ignores that notification because there is no write outstanding. So we have a condition where the handshake is never consummated when the connect immediately succeeds. I can produce explicit traces that show this. This is clearly an issue in the client. There is no race condition. Everything happens in the exact sequence that the client enables. It is merely a wrong assumption on the client's part when the connect immediately succeeds.
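The scenario described in this comment can be modelled with a small state machine. This is a sketch under stated assumptions, not the actual client code: `SocketState`, `Channel`, `onEvents`, and the event flags are hypothetical names, and the handshake is reduced to a boolean. It shows one possible handler that starts the handshake on the first write notification even when the socket is already marked connected, and that inspects every event bit instead of only the first one that matches.

```cpp
#include <cassert>

// Hypothetical model for illustration; none of these names come from xrootd.
enum class SocketState { Connecting, Connected };

using EventMask = unsigned;
constexpr EventMask kReadyToRead  = 0x1u;
constexpr EventMask kReadyToWrite = 0x2u;

struct Channel {
    SocketState state = SocketState::Connecting;
    bool handshakeSent = false;      // has the first handshake request gone out?
    bool writeOutstanding = false;   // is queued data waiting to be written?
    bool writeEventsEnabled = true;  // enabled right after connect, as in the thread
    int  reads = 0;                  // number of read events handled
};

inline void onEvents(Channel& ch, EventMask events) {
    assert((events & ~(kReadyToRead | kReadyToWrite)) == 0u);

    if (events & kReadyToWrite) {
        if (ch.state == SocketState::Connecting) {
            // Deferred connect finished (a real handler would check SO_ERROR).
            ch.state = SocketState::Connected;
        }
        if (!ch.handshakeSent) {
            // Covers both the deferred and the immediate connect case.
            ch.handshakeSent = true;
        } else if (ch.writeOutstanding) {
            ch.writeOutstanding = false;    // flush pending data
        } else {
            ch.writeEventsEnabled = false;  // nothing to write: stop notifications
        }
    }
    if (events & kReadyToRead) {  // deliberately not "else if": both bits may be set
        ++ch.reads;
    }
}
```

The key design choice in this sketch is that the decision to start the handshake depends on `handshakeSent` rather than on the socket still being in the `Connecting` state, so an immediate connect does not leave the channel idle until timeout.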
Well, it's a small consolation, but it appears that we were both correct. Indeed, there was a race condition that would affect MacOS. That problem was masked in my initial testing by the immediate connect problem. Adding sufficient tracing via cerr changed the timing enough to make the race issue obvious. So, the fix has been pushed. However, the immediate connect problem remains and can best be seen on Solaris when connecting to 'localhost' (or its equivalent). So, as a byproduct I now have a complete trace that points out the issue. When the client gets an immediate connect, it fails to start the handshake but enables write events. When the event callback is invoked, it sees that there is no outstanding write, so it simply disables write notifications. That means the handshake is never performed and the connection eventually times out. Below is the trace (note that lines starting with 'Cl' are in the client code and correspond to the first few characters of the associated file). The code was complicated enough that I couldn't see how to fix it. Again, you need to try this on Solaris, as it always immediately connects to a server running on the same machine; MacOS sometimes does and sometimes does not, and Linux never does.
sysdev4500> ./xrdfs sysdev4500 ls /tmp/abh
On the server side:
130427 22:43:24 001 XrdInet: Accepted connection from 22@sysdev4500.slac.stanford.edu
Well, I respectfully disagree :) From the client perspective, three things can happen after XrdNetConnect::Connect returns:
So there is indeed a logic error. It is in the poller.
Hi,
When using the built-in poller on OSX, there is an occasional condition whereby a client thread (thread 1) which attempts to connect a socket asynchronously does not get called back. This locks up the client until the connect operation times out.
This doesn't happen with the libevent poller. The frequency of occurrence of the problem can be increased by adding time delays in the client code, so it's more than likely a timing issue (maybe the poller invoking the callback too early?).
Here are the stack traces:
Cheers,
Justin