C++ API to XRootD should allow the "open file" retry logic to be turned off #673
After a quick glance I would say you should increase the value of the stream timeout. Just to clarify: the problem is that the server responds to the second open with an error because the file has already been opened by the first one, right? Let me think about it, maybe there's a better way ;-) |
This problem should become moot in 4.8.2. Assuming that multiple opens for reading cause no problems, then in 4.8.2 we make sure that only one open for writing will occur regardless of how many open requests were sent to the server during recovery. This corrects an optimization that allowed the server to have the file open multiple times (without immediate writing) to speed file access in 4.8.1. The new patches won't much affect normal opens but will slow, somewhat, error recovery opens. Fortunately, those shouldn't happen often. I'd close this ticket but I would surmise you might want to test 4.8.2 to make sure it actually works that way before I do. |
To reply to Michal. When the second "open file and truncate" request reaches the XRootD server it truncates the file and then "goes away". The data transfer of the first "open file and truncate" unknowingly continues writing data to the now corrupted file. When the file is finally written, there is a hole of zero valued bytes at the very beginning of the file. |
To reply to Andy. I'm not sure if XRootD 4.8.2 RC2 will fix the problem. My guess (and I could well be wrong) is that instead of ending up with files with zero valued bytes at their beginnings, we'll end up with cleanly truncated, zero length files instead. |
I don't think so from my looking at the code.
|
Hi Michael,
Well, we could enhance the endsess request. This is a request that the client sends to make sure that the previous session gets killed; it is not necessarily always successful, but when it's not, there should not be a pending duplicate open. The original intent was to avoid lock conflicts. I can enhance it to deep six any outstanding messages by simply issuing a shutdown on the server's side. That should take care of the problem. You think that will convince Eric?
Andy
From: murrayc3
Sent: Tuesday, March 20, 2018 4:25 AM
To: xrootd/xrootd
Cc: Andrew Hanushevsky ; Comment
Subject: Re: [xrootd/xrootd] C++ API to XRootD should allow the "open file" retry logic to be turned off (#673)
Hi Andy,
We have discussed a couple of concepts/ideas on our side of the pond and I am going to try to write them down for you here in this ticket/issue.
Briefly the two concepts/ideas are:
a. Your server side fix of checking for mutual exclusivity before truncating a file won't work for small files.
b. We're trying to find a TCP/IP option that will guarantee on the client side that a connection to the server is truly "dead" before creating a brand new TCP/IP connection.
Going into more details with your server side fix in XRootD 4.8.2 RC2. The server will now reject a concurrent "open the file and truncate" request when there is an ongoing file transfer. This is great for relatively long files. Eric has pointed out that very small files will not be protected however. Here is the scenario with XRootD 4.8.2 RC2:
1. The XRootD client sends an "open file and truncate" request to the server.
2. The corresponding TCP/IP packet gets dropped/lost.
3. The XRootD client library times out and closes the TCP/IP connection.
4. The XRootD client library opens a brand new connection to the server.
5. The XRootD client completes the entire transmission of the small file to the server (open, write, write... and then close).
6. The lost "open file and truncate" TCP/IP packet gets retransmitted by the TCP/IP stack underneath the XRootD client library.
7. The server lets the request take the mutual exclusion lock on the file for writing.
8. The server truncates the file.
9. The server receives the FIN packet of the close request on the connection made ages ago by the XRootD client library.
We're done, the small file has now been truncated.
Going into more details about making sure the TCP/IP connection is really closed on the client side before creating a new one. We believe the first request to "open and truncate" the file is being replayed by the TCP/IP stack underneath the XRootD client library, even after the library has called close() on the connection's file descriptor. Eric has looked at some TCP/IP socket options and thinks there is a way to guarantee the connection is really closed before opening a new one. Eric sent the following e-mail to us here at CERN:
Maybe we could play with this (man 7 socket):
SO_LINGER
    Sets or gets the SO_LINGER option. The argument is a linger structure.

        struct linger {
            int l_onoff;    /* linger active */
            int l_linger;   /* how many seconds to linger for */
        };

    When enabled, a close(2) or shutdown(2) will not return until all queued messages for the socket have been successfully sent or the linger timeout has been reached. Otherwise, the call returns immediately and the closing is done in the background. When the socket is closed as part of exit(2), it always lingers in the background.
Cheers,
Eric
|
@abh3 : that sounds promising :-) I was thinking about issuing shutdown when there is a close-on-error on the client side, and then waiting for the EOF before opening a new connection (but I would need to do this within a reasonable timeout so it's imperfect). Issuing shutdown on your side on the endsess request sounds even better :-) |
Just a bit wider context: |
@abh3 : Your solution will solve the real problem (making sure the current connection will not retransmit anything before starting a new connection), but the timing will be very dependent on retransmissions in the TCP stack. As we saw, it sometimes takes minutes before the 1st open gets retransmitted and corrupts the file, so the retry would take that amount of time in those situations (the open will have to go through on the 1st TCP connection anyway before the endsess command goes through). My proposed direction of investigation (as this linger option has not been tested in any way besides RTFMing) has the drawback that it might not be portable to all client architectures. Overall, the client must never have two paths to the server open at the same time, so both solutions would fit the bill. I have a slight preference for the linger solution as it will allow more aggressive retries (when we recall from tape within the computer center, we could have an open timeout of only tens of seconds if not seconds, as the usual RTT is very small). In our case, this would prevent wasting tape drive bandwidth. We only have ~3 minutes of in-memory buffer in the tape servers, so blocking for that amount of time will eventually waste drive resources. |
Please keep in mind that the "lingering" packet might not only be sent by the local client kernel, but could also come from some intermediate piece of network equipment (routers, multipath routing with one congested leg?). As such, a client-side fix such as SO_LINGER might not reliably address the issue? |
Agreed, but this will already severely cut the time window during which the 1st open can rise from the dead (from TCP retry timeouts down to the practical packet time of flight on the network). In Linux, the default TTL is 64, so a mourning period of 64s would ensure that an in-flight packet will not pop up. As this number is used for both seconds and hop count, we can hardly lower it. I suspect the practical value of this mourning period is much lower, of the order of a few seconds if not less (but that's not fully rigorous). |
A trace of such an incident from the XrdCl client's perspective
Then there is a continuous stream of writes and the close:
|
These are the corresponding CASTOR logs courtesy of Vlado:
|
Here are the XRootD server logs on
|
Holy shit, that's a nasty one! IMHO the only reliable solution to this kind of problem is to have a variation of Lamport Timestamps. This is how I would see it:
This works because TCP guarantees ordering per connection. Recovery connections would be different TCP connections but would have the same client/channel ID. |
@jmuf : regarding the data corruption at CERN, is it possible to check whether it was triggered by the fuse client or xrdcp? |
The corruptions reported e.g. on EOSATLAS around Aug 2017 would have been via "xrdcp" (they got reported here as #552 - the "workaround on the EOS side" mentioned there prevented files from being deleted after such a truncate). The issues seen recently (March 2018) in CASTOR repack also would have been via "xrdcp". The recent (March 2018) ATLAS T0 job description test relied on FUSE for writing the files, and also saw spurious empty files without errors being reported to the application. |
@jmuf : the xrootd client will always try to recover at the redirector if one is available, so we are dealing here most likely with two different problems:
|
The endsess indeed does terminate the previous session as long as it's not still executing in a plugin (the server has no knowledge of the side-effects of terminating something outside of its scope). Part of termination is to close the socket. My proposal was to simply close the socket irrespective of where the session is executing. Since there is explicit ordering here, that should solve this problem as the second open cannot occur before the first is either executed or thrown away. If the first is executed then the second is guaranteed to fail.
Andy
…On Wed, 21 Mar 2018, simonmichal wrote:
@abh3 : could you put some light on how endsess is handled on the server side?
On the client side when there is a stream timeout, the original connection gets closed, and then a new connection is opened. Once the connection is established a handshake happens, the last part of the handshake is the kXR_endsess request that is supposed to terminate the pre-existing connection. The "second open" from Steve/Eric's scenario is guaranteed to happen after the client receives the response to the kXR_endsess request.
I noticed in the XrdXrootdProtocol::do_Endsess():
```cpp
// Extract out the FD and Instance from the session ID
//
sp = (XrdXrootdSessID *)Request.endsess.sessid;
memcpy((void *)&sessID.Pid, &sp->Pid, sizeof(sessID.Pid));
memcpy((void *)&sessID.FD, &sp->FD, sizeof(sessID.FD));
memcpy((void *)&sessID.Inst, &sp->Inst, sizeof(sessID.Inst));
// Trace this request
//
TRACEP(LOGIN, "endsess " <<sessID.Pid <<':' <<sessID.FD <<'.' <<sessID.Inst);
// If this session id does not refer to us, ignore the request
//
if (sessID.Pid != myPID) return Response.Send();
// Terminate the indicated session, if possible. This could also be a self-termination.
//
if ((sessID.FD == 0 && sessID.Inst == 0)
|| !(rc = Link->Terminate(Link, sessID.FD, sessID.Inst))) return -1;
```
and then further in XrdLink::Terminate(...):
```cpp
// We can now disable the link and schedule a close
//
snprintf(buff, sizeof(buff), "ended by %s", ID);
buff[sizeof(buff)-1] = '\0';
lp->Poller->Disable(lp, buff);
lp->opMutex.UnLock();
// Now wait for the link to shutdown. This avoids lock problems.
//
if (killDone.Wait(int(killWait))) wTime += killWait;
else wTime = -EPIPE;
killDone.UnLock();
```
At a first glance it seems to me that the server will terminate the original connection while handling the kXR_endsess request. Could you comment on that?
|
But given this (endsess) mechanism and the fact that we never get to call any OFS plugin method in the error scenario described earlier, why didn't this already protect us from the double open? |
From what I see in XrdXrootdProtocol::do_Endsess() it can happen that the server actually responds with kXR_wait to a kXR_endsess request (@abh3 : could you confirm?): https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdXeq.cc#L711-L712 I think the kXR_wait response is not taken into account in the client side logic: https://github.com/xrootd/xrootd/blob/master/src/XrdCl/XrdClXRootDTransport.cc#L2012-L2036 If the response is not an error the client declares the handshake successful (@ljanyst : could you give me another 'holy shit' if you think this might be true? ;-). This could be the source of our problem. |
From what I recall, the server was not supposed to stall the login sequence. This and the fact that you may break third-party plugins by force-closing sockets on running sessions is why I suggested using Lamport Timestamps in the first place, even though it's a disruptive change to the protocol. These are just my random thoughts on a problem that I find interesting. Last time I looked at this code was three years ago, so you should definitely double-check on everything I say. |
As you can see in the code, it only prints an error and succeeds anyways even if an error has been returned by the server. Since there's a comment there saying that we disregard the error intentionally, I am sure there was a good reason for it. I just don't remember what it was. |
Thanks for pointing it out, I actually somehow did not realize that the return statement was commented out. |
From what I have seen so far in these discussions, the current XRootD protocol relies on only one session being active per client at a time. If this rule is broken then a brand new XRootD protocol would need to be defined, which could of course use Lamport Timestamps. Assuming that the XRootD protocol will not be changed any time soon, then keeping only one session active at a time per client must be enforced at all costs. This means the XRootD client must handle a kXR_wait response to a kXR_endsess request. Have I simplified this too far and am I missing something? |
For what it's worth, I think changing the client would be a reasonable course of action. However, it appears that the current handling of |
The other problem was a sequencing bug in the server that allowed duplicate opens during recovery. That's corrected for 4.8.2 and I'm pretty confident that is sufficient for most cases. Steven does point out an aberrant case (very small file writes) that could escape that fix. That would require fixing the client's handling of endsess. So, I think we will be doing that for 4.8.2 as well.
Andy
|
The server does not stall the login sequence. The endsess request is not considered part of login as it can be used any time you want. It just so happens that the only time you would use it is post-login as the first legitimate request.
|
To Andy, Michal and Lukasz. It is clear that there are some very good ideas on how to improve the session handling of XRootD in the event of an XRootD client re-connecting to the server. However, will it be possible to modify the XRootD client so that we can switch off the connection retry logic altogether? CASTOR's repack and the file transfer frameworks of the LHC experiments have fully functional policies for retrying failed file transfers. Given the fact that we are currently experiencing corrupted files, switching off the XRootD client reconnect logic would immediately prevent these corrupted files from being created. |
I think that comment applied to the "notfound" error which, actually, indicates success. The 4.8.2 fix makes success return kXR_ok which provides less information but at least it's not confusing.
|
In XRootD you can have any number of sessions from the same client and we fully support those. The only thing you should not have is phantom sessions from the same client, because the client then thinks all is going well when, in fact, it's violating its own rules of protecting data. Think of it as creating a thread that goes around modifying data and not trying to synchronize with it from another thread doing the same thing. The endsess request was introduced to allow the client to kill any phantom sessions that it may have introduced. Which, as Steven so well pointed out, may occur during error recovery. So, yes, it's required that the client properly handle endsess to make this work. On the other hand, the new fixes for handling open requests will also fix this problem for most, but not all, cases. Fortunately, the ones that evade detection are not part of a HEP workflow from what I can tell.
Andy
|
@abh3, regarding the phantom session, I extrapolated that if the client gets `kXR_wait` as a reply to `endsess`, it should wait and try again, i.e. re-issue the endsess? Would that explain completely the problem? If so, it is still worrying that the link could not be terminated in XrdXrootdProtocol::do_Endsess() one minute after establishment, and that more than one other minute after, the initial open finally got delivered. Would this mean the retry timing is dependent on the TCP session dying on its own? Could you kill the phantom/previous connection more drastically ("deep six" it as you were suggesting) so we can ensure the server will disregard anything coming from the phantom connection? I think this is needed to open the possibility for aggressive retries in low latency environments (or simply have control of our retry timings). |
The client needs to reissue a request that got a `kXR_wait` response. |
But in this very example, given that the open has not reached the server, do we have any plugin call in flight? Andy once mentioned something possibly badly handled in the Castor plug-in. Would this be a contributor to the problem? The open did not make it through yet at this time in the 1st connection. |
Perhaps that's the bug that Andy has fixed. |
On Fri, 23 Mar 2018, Eric Cano wrote:
@abh3, regarding the phantom session, I extrapolated that if the client gets `kXR_wait` as a reply to `endsess`, it should wait and try again, i.e. re-issue the endsess? Would that explain completely the problem?

We narrowed down the problem to the client-side not completely implementing the required strategy for endsess handling. Yes, the server is asking the client to wait and retry later.

If so, it is still worrying that the link could not be terminated in XrdXrootdProtocol::do_Endsess() one minute after establishment, and that more than one other minute after, the initial open finally got delivered. Would this mean the retry timing is dependent on the TCP session dying on its own? Could you kill the phantom/previous connection more drastically ("deep six" it as you were suggesting) so we can ensure the server will disregard anything coming from the phantom connection? I think this is needed to open the possibility for aggressive retries in low latency environments (or simply have control of our retry timings).

Yes, I am investigating this. It's not as easy as one would think because doing this introduces other side-effects. So, I'm proceeding with an abundance of caution. At the moment, we believe that implementing the full endsess sequence in the client should alleviate the problem. That doesn't mean you can't get a zero length file, but that is no different than the client issuing an open and then crashing. That you cannot solve other than turning on POSC mode, which comes with not insignificant overhead.
Andy
|
The "badly handled" thing is that Castor can sometimes take a very long time to do an open, which causes the retry mechanism to kick in. However, we've solved that part in 4.8.2. We're trying to solve the other edge case of bad TCP connections.
Andy
|
Yes, that is the bug I fixed.
|
The handling of error/wait response for endsess request has been fixed in: 814e8b1 |
@abh3 Out of curiosity, would it be possible for the server to send `kXR_waitresp` instead of `kXR_wait`? It would alleviate Eric's concerns about retry latency due to the wait time overestimation and the client-side implementation would be simpler. |
That would be a fairly big change in the server. From the server's perspective, a kXR_wait is far simpler to handle because it's stateless. A kXR_waitresp is more complicated because it has to keep state. The current endsess path has no provision to keep state so that would have to be added. I think Michal found it rather easy to fix the client to properly handle the various return cases for endsess.
Andy
|
Makes sense. Thanks for explaining. I am not trying to suggest that Michal found it problematic, just saying that from the client's perspective `kXR_waitresp` would be easier. |
Yeah, funny isn't it. You're right, waitresp is easy for the client but difficult for the server. Wait is easier for the server but more complicated for the client. When you squeeze this in one direction it simply expands in another :-)
Andy
|
I'll complement the thread with Elvin's findings as I think this is crucial to understanding the problem:
|
I've been able to reproduce the scenario described by Elvin with XRootD 4.8.1 and indeed I observed file corruption. Doing the same test with the 814e8b1 fix in place confirmed that the problem has been addressed and that the patch prevents file corruption. |
Just to be on the safe side I've added an envar to disable the open recovery mechanism (for open+create and open+truncate) in case the request is retried on the same data node: 6ef8e30 |
The XRootD client will retry a request to open a file if the first attempt takes too long. Unfortunately this can result in two open requests being sent from the client to the server. The way this can happen is as follows:
Please could the XRootD client and its API be modified so that the calling application can explicitly switch off the "open file" retry logic?