C++ API to XRootD should allow the "open file" retry logic to be turned off #673

Closed

murrayc3 opened this issue Mar 19, 2018 · 48 comments

@murrayc3
Contributor

The XRootD client will retry a request to open a file if the first attempt takes too long. Unfortunately, this can result in two open requests being sent from the client to the server. This can happen as follows:

  1. The client application calls XrdCl::File::Open().
  2. The XRootD client library sends an "open file" request to the server over a TCP/IP connection.
  3. The XRootD client library times out the "open file" request.
  4. The XRootD client sends a second "open file" request over a different TCP/IP connection.
  5. The XRootD server eventually receives two "open file" requests because the first was just delayed on the network.

Could the XRootD client and its API please be modified so that the calling application can explicitly switch off the "open file" retry logic?

@simonmichal
Contributor

After a quick glance I would say you should increase the value of the stream timeout.
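
For instance, a minimal sketch of how an application could raise that timeout through the XrdCl default environment (equivalent to exporting XRD_STREAMTIMEOUT; the key name should be checked against the client version in use):

#include <XrdCl/XrdClDefaultEnv.hh>
#include <XrdCl/XrdClEnv.hh>

int main()
{
  // Raise the stream timeout before any XrdCl::File is opened, so that a
  // slow but healthy connection is not declared dead and retried too early.
  XrdCl::Env *env = XrdCl::DefaultEnv::GetEnv();
  env->PutInt( "StreamTimeout", 300 ); // seconds; the shipped default is 60
  // ... proceed with XrdCl::File::Open(), writes, Close(), etc.
  return 0;
}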

Just to clarify: the problem is that the server responds to the second open with an error because the file has already been opened by the first one, right?

Let me think about it, maybe there's a better way ;-)

@abh3
Member

abh3 commented Mar 19, 2018

This problem should become moot in 4.8.2. Assuming that multiple opens for reading cause no problems, then in 4.8.2 we make sure that only one open for writing will occur, regardless of how many open requests were sent to the server during recovery. This corrects an optimization in 4.8.1 that allowed the server to have the file open multiple times (without immediate writing) to speed up file access. The new patches won't affect normal opens much but will somewhat slow error-recovery opens. Fortunately, those shouldn't happen often.

I'd close this ticket, but I surmise you might want to test 4.8.2 to make sure it actually works that way before I do.
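
For what it's worth, a purely conceptual sketch of the single-writer guarantee described above (hypothetical names, not the actual XrdXrootd/XrdOfs code): a second open-for-write of a path that already has an active writer is denied instead of being allowed to truncate the file.

#include <mutex>
#include <string>
#include <unordered_map>

class WriterRegistry
{
public:
  // Returns true if the caller may open 'path' for writing.
  bool TryOpenForWrite(const std::string &path)
  {
    std::lock_guard<std::mutex> lock(mMutex);
    if (mWriters[path] > 0)
      return false;            // already opened by a writer: open denied
    ++mWriters[path];
    return true;
  }

  // Must be called when a write-open is closed.
  void CloseWrite(const std::string &path)
  {
    std::lock_guard<std::mutex> lock(mMutex);
    if (mWriters[path] > 0)
      --mWriters[path];
  }

private:
  std::mutex mMutex;
  std::unordered_map<std::string, int> mWriters;
};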

@murrayc3
Contributor Author

To reply to Michal: when the second "open file and truncate" request reaches the XRootD server, it truncates the file and then "goes away". The data transfer started by the first "open file and truncate" unknowingly continues writing data to the now corrupted file. When the file is finally written, there is a hole of zero-valued bytes at the very beginning of the file.

@murrayc3
Contributor Author

To reply to Andy: I'm not sure if XRootD 4.8.2 RC2 will fix the problem. My guess (and I could well be wrong) is that instead of ending up with files with zero-valued bytes at their beginnings, we'll end up with cleanly truncated, zero-length files instead.

@xrootd-dev

xrootd-dev commented Mar 20, 2018 via email

@murrayc3
Contributor Author

Hi Andy,

We have discussed a couple of concepts/ideas on our side of the pond and I am going to try to write them down for you here in this ticket/issue.

Briefly the two concepts/ideas are:

  • Your server-side fix of checking for mutual exclusivity before truncating a file won't work for small files.
  • We're trying to find a TCP/IP option that will guarantee, on the client side, that a connection to the server is truly "dead" before a brand new TCP/IP connection is created.

Going into more detail on your server-side fix in XRootD 4.8.2 RC2: the server will now reject a concurrent "open the file and truncate" request when there is an ongoing file transfer. This is great for relatively long files. Eric has pointed out, however, that very small files will not be protected. Here is the scenario with XRootD 4.8.2 RC2:

  1. The XRootD client sends an "open file and truncate" request to the server.
  2. The corresponding TCP/IP packet gets dropped/lost.
  3. The XRootD client library times out and closes the TCP/IP connection.
  4. The XRootD client library opens a brand new connection to the server.
  5. The XRootD client completes the entire transmission of the small file to the server (open, write, write... and then close).
  6. The lost "open file and truncate" TCP/IP packet gets retransmitted by the TCP/IP stack underneath the XRootD client library.
  7. The server lets the request take the mutual exclusion lock on the file for writing.
  8. The server truncates the file.
  9. The server receives the FIN packet of the close request on the connection made ages ago by the XRootD client library.

We're done; the small file has now been truncated.

Going into more detail about making sure the TCP/IP connection is really closed on the client side before creating a new one: we believe the first request to "open and truncate" the file is being replayed by the TCP/IP stack underneath the XRootD client library, even after the library has called close() on the connection's file descriptor. Eric has looked at some TCP/IP socket options and thinks there is a way to guarantee the connection is really closed before opening a new one. Eric sent the following e-mail to us here at CERN:

Maybe we could play with this (man 7 socket):
 
 
       SO_LINGER
              Sets or gets the SO_LINGER option.  The argument is a linger
              structure.
 
                  struct linger {
                      int l_onoff;    /* linger active */
                      int l_linger;   /* how many seconds to linger for */
                  };
 
              When enabled, a close(2) or shutdown(2) will not return until
              all queued messages for the socket have been successfully sent
              or the linger timeout has been reached.  Otherwise, the call
              returns immediately and the closing is done in the background.
              When the socket is closed as part of exit(2), it always
              lingers in the background.
 
 
Cheers,
Eric
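
For illustration, a minimal sketch of the option Eric is referring to, applied to a plain POSIX socket descriptor before close() (untested; this is not how XrdCl currently closes its sockets):

#include <sys/socket.h>
#include <unistd.h>

// Make close() on 'fd' either flush pending data within 'timeout_seconds'
// or, when timeout_seconds is 0, abort the connection (RST) so the kernel
// does not keep retransmitting the old request in the background.
static int close_with_linger(int fd, int timeout_seconds)
{
  struct linger lin;
  lin.l_onoff  = 1;               // enable lingering on close
  lin.l_linger = timeout_seconds; // 0 => abortive close, no background retransmits
  if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lin, sizeof(lin)) != 0)
    return -1;
  return close(fd);
}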

@abh3
Member

abh3 commented Mar 20, 2018 via email

@simonmichal
Contributor

@abh3 : that sounds promising :-)

I was thinking about issuing a shutdown when there is a close-on-error on the client side, and then waiting for the EOF before opening a new connection (but I would need to do this within a reasonable timeout, so it's imperfect). Issuing the shutdown on your side on the endsess request sounds even better :-)
I will discuss this tomorrow with Steve and Eric.
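
A minimal sketch of that "shutdown, then wait for EOF" idea on a raw POSIX descriptor (hypothetical helper, not the actual XrdCl code): half-close the socket, then drain it until the peer closes its side or the timeout expires, and only then allow a new connection to be opened.

#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Returns true if the peer acknowledged the close (EOF seen), false if we
// gave up after waiting 'timeout_ms' without any data arriving -- which is
// exactly the "imperfect" case mentioned above.
static bool drain_until_eof(int fd, int timeout_ms)
{
  shutdown(fd, SHUT_WR);                     // we will not send anything more
  char buf[4096];
  struct pollfd pfd = { fd, POLLIN, 0 };
  while (poll(&pfd, 1, timeout_ms) > 0)
  {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n == 0) { close(fd); return true; }  // EOF: peer closed, safe to reconnect
    if (n < 0) break;                        // read error: give up
  }
  close(fd);
  return false;
}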

@jmuf
Contributor

jmuf commented Mar 21, 2018

Just a bit of wider context:
Spurious xrootd-internal retries have been blamed for data loss at CERN (truncated/empty files) since at least last summer, with the LHC experiments struggling to turn this off. I understand that the internal retry mechanism (i.e. drop the TCP connection, open a new connection, resend) is generic and that turning it off just for writes or open+truncate is currently not an option.
I would advocate disabling this mechanism by default for now, and perhaps re-enabling it once it can be safely limited to a subset of idempotent operations (read?).
All LHC experiments have mechanisms for higher-level retries in case "xrdcp" gives some error. What they cannot really cope with is a silently corrupted file and a "successful" exit code.

@ericcano
Contributor

@abh3 : Your solution will solve the real problem (making sure the current connection will not retransmit anything before starting a new connection), but the timing will be very dependent on retransmissions in the TCP stack. As we saw, it sometimes takes minutes before the 1st open gets retransmitted and corrupts the file, so the retry would take that amount of time in those situations (the open will have to go through on the 1st TCP connection anyway before the endsess command goes through).

My proposed direction of investigation (as this linger option has not been tested in any way besides RTFMing) has the drawback that it might not be portable to all client architectures.

Overall, the client must never have 2 paths to the server open at the same time, so both solutions would fit the bill. I have a slight preference for the linger solution as it will allow more aggressive retries (when we recall from tape within the computer center, we could have an open timeout of only tens of seconds, if not seconds, as the usual RTT is very small). In our case, this would prevent wasting tape drive bandwidth. We only have ~3 minutes of in-memory buffer in the tape servers, so blocking for that amount of time will eventually waste drive resources.

@jmuf
Contributor

jmuf commented Mar 21, 2018

Please keep in mind that the "lingering" packet might not only be sent by the local client kernel, but could also come from some intermediate piece of network equipment (routers, multipath routing with one congested leg?). As such, a client-side fix like SO_LINGER might not reliably address the issue?

@ericcano
Contributor

Agreed, but this will already severely cut the time window during which the 1st open can rise from the dead (from TCP retry timeouts down to the practical packet time of flight on the network). In Linux, the default TTL is 64, so a mourning period of 64 s would ensure that an in-flight packet will not pop up. As this number is used both as seconds and as a hop count, we can hardly lower it. I suspect the practical value of this mourning period is much lower, of the order of a few seconds if not less (but that's not fully rigorous).

@esindril
Contributor

A trace of such an incident from the XrdCl client's perspective:

[2018-03-16 17:16:21.366568 +0100][Dump   ][Utility           ] URL: root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.exptime=1521220581&castor.txtype=tape&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==
[2018-03-16 17:16:21.366568 +0100][Dump   ][Utility           ] Protocol:  root
[2018-03-16 17:16:21.366568 +0100][Dump   ][Utility           ] User Name: 
[2018-03-16 17:16:21.366568 +0100][Dump   ][Utility           ] Password:  
[2018-03-16 17:16:21.366568 +0100][Dump   ][Utility           ] Host Name: tpsrv452.cern.ch
[2018-03-16 17:16:21.366568 +0100][Dump   ][Utility           ] Port:      1095
[2018-03-16 17:16:21.366568 +0100][Dump   ][Utility           ] Path:      /diskserver/74/1633530374@castorns.6889850064
[2018-03-16 17:16:21.366640 +0100][Dump   ][Utility           ] URL: root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.exptime=1521220581&castor.txtype=tape&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==
[2018-03-16 17:16:21.366640 +0100][Dump   ][Utility           ] Protocol:  root
[2018-03-16 17:16:21.366640 +0100][Dump   ][Utility           ] User Name: 
[2018-03-16 17:16:21.366640 +0100][Dump   ][Utility           ] Password:  
[2018-03-16 17:16:21.366640 +0100][Dump   ][Utility           ] Host Name: tpsrv452.cern.ch
[2018-03-16 17:16:21.366640 +0100][Dump   ][Utility           ] Port:      1095
[2018-03-16 17:16:21.366640 +0100][Dump   ][Utility           ] Path:      /diskserver/74/1633530374@castorns.6889850064
[2018-03-16 17:16:21.366657 +0100][Debug  ][File              ] [0x7000f0a0@root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape] Sending an open command
[2018-03-16 17:16:21.366680 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Sending message kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat )
[2018-03-16 17:16:21.366699 +0100][Dump   ][PostMaster        ] [tpsrv452.cern.ch:1095 #0] Sending message kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x70005e60) through substream 0 expecting answer at 0
[2018-03-16 17:16:21.366745 +0100][Dump   ][AsyncSock         ] [tpsrv452.cern.ch:1095 #0.0] Wrote a message: kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x70005e60), 335 bytes
[2018-03-16 17:16:21.366762 +0100][Dump   ][AsyncSock         ] [tpsrv452.cern.ch:1095 #0.0] Successfully sent message: kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x70005e60).
[2018-03-16 17:16:21.366785 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Message kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) has been successfully sent.
[2018-03-16 17:17:22.009295 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Stream event reported for msg kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat )
[2018-03-16 17:17:22.009306 +0100][Debug  ][XRootD            ] [tpsrv452.cern.ch:1095] Handling error while processing kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ): [ERROR] Socket error.
[2018-03-16 17:17:22.009309 +0100][Debug  ][File              ] [0x68008a40@root://tpsrv452.cern.ch:1095//diskserver/48/1633520448@castorns.6889849983?castor.exptime=1521220467&castor.pfn1=/diskserver/48/1633520448@castorns.6889849983&castor.pfn2=0:15511:ae8aa0b2-2588-42d5-b758-66526f24d8cc&castor.signature=QZLFVGTqrdWgudfJxCSROSDY/Se446SI26b4WjozzPe5olJk38zqr/OEZSSC1bgHTytJTjtjCBmqCvqFRlp7Mg==&castor.txtype=tape] Close returned from tpsrv452.cern.ch:1095 with: [ERROR] Socket error
[2018-03-16 17:17:22.009317 +0100][Dump   ][PostMaster        ] [tpsrv452.cern.ch:1095 #0] Sending message kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x70005e60) through substream 0 expecting answer at 0
[2018-03-16 17:17:22.013987 +0100][Dump   ][AsyncSock         ] [tpsrv452.cern.ch:1095 #0.0] Wrote a message: kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x70005e60), 335 bytes
[2018-03-16 17:17:22.013999 +0100][Dump   ][AsyncSock         ] [tpsrv452.cern.ch:1095 #0.0] Successfully sent message: kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x70005e60).
[2018-03-16 17:17:22.014008 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Message kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) has been successfully sent.
[2018-03-16 17:17:22.019252 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Got a kXR_ok response to request kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat )
[2018-03-16 17:17:22.019267 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Parsing the response to kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) as OpenInfo
[2018-03-16 17:17:22.019276 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Parsing StatInfo in response to kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat )
[2018-03-16 17:17:22.019326 +0100][Debug  ][File              ] [0x7000f0a0@root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape] Open has returned with status [SUCCESS] 
[2018-03-16 17:17:22.019336 +0100][Debug  ][File              ] [0x7000f0a0@root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape] successfully opened at tpsrv452.cern.ch:1095, handle: 0x2, session id: 6
[2018-03-16 17:17:22.019471 +0100][Debug  ][File              ] [0x7000f0a0@root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape] Sending a write command for handle 0x2 to tpsrv452.cern.ch:1095

Then there is a continuous stream of writes and the close:

[2018-03-16 17:19:18.271335 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Got a kXR_ok response to request kXR_write (handle: 0x02000000, offset: 3046113280, size: 8316)
[2018-03-16 17:19:18.271347 +0100][Dump   ][File              ] [0x7000f0a0@root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011
674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape] Got state response for message kXR_write (handle: 0x02000000, offset: 3046113280, size: 8316)
[2018-03-16 17:19:18.271403 +0100][Debug  ][File              ] [0x7000f0a0@root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011
674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape] Sending a close command for handle 0x2 to tpsrv452.cern.ch:1095
[2018-03-16 17:19:18.285227 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Got a kXR_ok response to request kXR_close (handle: 0x02000000)
[2018-03-16 17:19:18.285248 +0100][Debug  ][File              ] [0x7000f0a0@root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011
674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape] Close returned from tpsrv452.cern.ch:1095 with: [SUCCESS] 
[2018-03-16 17:19:18.285302 +0100][Dump   ][File              ] [0x7000f0a0@root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape] Items in the fly 0, queued for recovery 0

@murrayc3
Contributor Author

These are the corresponding CASTOR logs courtesy of Vlado:

CASTOR file 1633530374 corrupted on tpsrv452:

180316 17:19:00 time=1521217140.404038 func=close                    level=ERROR logid=bc2d2eae-2935-11e8-a8c6-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140280503043840 source=XrdxCastor2Ofs:974   tident=stage.27009:30@tpsrv210 path=/diskserver/74/1633530374@castorns.6889850064 msg="closed via destructor"
180316 17:32:25 time=1521217945.688662 func=close                    level=ERROR logid=94173868-2937-11e8-a9a8-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140280526202624 source=XrdxCastor2Ofs:981   tident=stage.87764:22@tpsrv416 path=/diskserver/74/1633530374@castorns.6889850064 msg="checksum mismatch"

@murrayc3
Contributor Author

murrayc3 commented Mar 21, 2018

Here are the XRootD server logs on tpsrv452 showing the double open from the same client tpsrv210. We clearly see the unwanted open being followed by an immediate close with ERROR (closed socket).

[root@tpsrv452 ~]# zgrep -E '1633530374@castorns|81a998c6-2935-11e8-8389-a4bf0112f566' /var/log/xrootd/server/xrootd.log-20180317.gz | egrep 'func=open|func=close' | grep tpsrv210
180316 17:17:22 time=1521217042.018172 func=open                     level=INFO  logid=81a998c6-2935-11e8-8389-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140281188312832 source=XrdxCastor2Ofs:697   tident=stage.27009:35@tpsrv210 path=/diskserver/74/1633530374@castorns.6889850064, opaque=castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, isRW=1, open_mode=4200, file_ptr=0x7f95ac105d30
180316 17:19:00 time=1521217140.188374 func=open                     level=INFO  logid=bc2d2eae-2935-11e8-a8c6-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140280503043840 source=XrdxCastor2Ofs:697   tident=stage.27009:30@tpsrv210 path=/diskserver/74/1633530374@castorns.6889850064, opaque=castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, isRW=1, open_mode=4200, file_ptr=0x7f95740033d0
180316 17:19:00 time=1521217140.404038 func=close                    level=ERROR logid=bc2d2eae-2935-11e8-a8c6-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140280503043840 source=XrdxCastor2Ofs:974   tident=stage.27009:30@tpsrv210 path=/diskserver/74/1633530374@castorns.6889850064 msg="closed via destructor"
180316 17:19:18 time=1521217158.283782 func=close                    level=INFO  logid=81a998c6-2935-11e8-8389-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140281188312832 source=XrdxCastor2Ofs:1007  tident=stage.27009:35@tpsrv210 msg="send to diskmanager" errc=0 errmsg=""
[root@tpsrv452 ~]# 

@ljanyst
Contributor

ljanyst commented Mar 21, 2018

Holy shit, that's a nasty one!

IMHO the only reliable solution to this kind of problem is to have a variation of Lamport Timestamps. This is how I would see it:

  1. Each client sends a random identifier during login (per channel).
  2. The server keeps a hash table with the identifiers and a 64 bit integer associated with each.
  3. The server purges this hash table once in a while so that it does not grow too large.
  4. The client has a 64 bit counter starting at 0.
  5. Before sending any message, the client increments the counter by 1.
  6. The client embeds the current counter value in every message.
  7. Every time the server receives a message from the client on a given channel it considers the counter.
  8. If the counter value in the message is lower than what the server has, it declares an error to the client.
  9. If the value is higher, the server performs the operation and associates the new counter value with the client id in its hash table.

This works because TCP guarantees ordering per connection. Recovery connections would be different TCP connections but would have the same client/channel ID.
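
A minimal illustration of that counter check on the server side (hypothetical class, not part of any XRootD code; purging of idle entries is omitted):

#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Decides whether a message should be processed or rejected as stale,
// based on the per-client 64-bit counter described above.
class RequestOrdering
{
public:
  bool Accept(const std::string &clientId, uint64_t counter)
  {
    std::lock_guard<std::mutex> lock(mMutex);
    uint64_t &last = mLast[clientId];  // defaults to 0 for a new client
    if (counter <= last)
      return false;                    // replayed or out-of-date request: error out
    last = counter;                    // remember the highest counter seen
    return true;
  }

private:
  std::mutex mMutex;
  std::unordered_map<std::string, uint64_t> mLast;
};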

@simonmichal
Contributor

@abh3 : could you shed some light on how endsess is handled on the server side?

On the client side, when there is a stream timeout, the original connection gets closed and then a new connection is opened. Once the connection is established, a handshake takes place; the last part of the handshake is the kXR_endsess request, which is supposed to terminate the pre-existing connection. The "second open" from Steve/Eric's scenario is guaranteed to happen after the client receives the response to that kXR_endsess request.

I noticed in the XrdXrootdProtocol::do_Endsess():

// Extract out the FD and Instance from the session ID
//
   sp = (XrdXrootdSessID *)Request.endsess.sessid;
   memcpy((void *)&sessID.Pid,  &sp->Pid,  sizeof(sessID.Pid));
   memcpy((void *)&sessID.FD,   &sp->FD,   sizeof(sessID.FD));
   memcpy((void *)&sessID.Inst, &sp->Inst, sizeof(sessID.Inst));

// Trace this request
//
   TRACEP(LOGIN, "endsess " <<sessID.Pid <<':' <<sessID.FD <<'.' <<sessID.Inst);

// If this session id does not refer to us, ignore the request
//
   if (sessID.Pid != myPID) return Response.Send();

// Terminate the indicated session, if possible. This could also be a self-termination.
//
   if ((sessID.FD == 0 && sessID.Inst == 0) 
   ||  !(rc = Link->Terminate(Link, sessID.FD, sessID.Inst))) return -1;

and then further in XrdLink::Terminate(...):

// We can now disable the link and schedule a close
//
   snprintf(buff, sizeof(buff), "ended by %s", ID);
   buff[sizeof(buff)-1] = '\0';
   lp->Poller->Disable(lp, buff);
   lp->opMutex.UnLock();

// Now wait for the link to shutdown. This avoids lock problems.
//
   if (killDone.Wait(int(killWait))) wTime += killWait;
      else wTime = -EPIPE;
   killDone.UnLock();

At first glance it seems to me that the server will terminate the original connection while handling the kXR_endsess request. Could you comment on that?

@simonmichal
Contributor

@jmuf : regarding the data corruption at CERN, is it possible to check whether it was triggered by the FUSE client or by xrdcp?

@jmuf
Contributor

jmuf commented Mar 21, 2018

The corruptions reported e.g. on EOSATLAS around Aug 2017 would have been via "xrdcp" (they got reported here as #552 - the "workaround on the EOS side" mentioned there prevented files from being deleted after such a truncate). The issues seen recently (March 2018) in CASTOR repack would also have been via "xrdcp". The recent (March 2018) ATLAS T0 job description test relied on FUSE for writing the files, and also saw spurious empty files without errors being reported to the application.

@simonmichal
Contributor

@jmuf : the xrootd client will always try to recover at the redirector if one is available, so we are most likely dealing here with two different problems:

  • In the case of EOSATLAS, xrdcp issues an open against the MGM and then gets redirected to an FST, so if a stream timeout occurs the client will always retry at the MGM.

  • In the two remaining cases it's different. CASTOR does a direct copy from data server to data server, so there is no redirector. The FUSE client queries the MGM for the file location and then issues a direct copy to the FST, so from the client's perspective there's no redirector either.

@abh3
Member

abh3 commented Mar 21, 2018 via email

@esindril
Contributor

But given this (endsess) mechanism and the fact that we never get to call any OFS plugin method in the error scenario described earlier, why didn't this already protect us from the double open?

@simonmichal
Contributor

From what I see in XrdXrootdProtocol::do_Endsess(), it can happen that the server actually responds with kXR_wait to a kXR_endsess request (@abh3 : could you confirm?):

https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdXeq.cc#L711-L712

I think the kXR_wait response is not taken into account in the client-side logic:

https://github.com/xrootd/xrootd/blob/master/src/XrdCl/XrdClXRootDTransport.cc#L2012-L2036

If the response is not an error, the client declares the handshake successful (@ljanyst : could you give me another 'holy shit' if you think this might be true? ;-).

This could be the source of our problem.

@ljanyst
Contributor

ljanyst commented Mar 22, 2018

From what I recall, the server was not supposed to stall the login sequence. This, and the fact that you may break third-party plugins by force-closing sockets on running sessions, is why I suggested using Lamport Timestamps in the first place, even though it's a disruptive change to the protocol.

These are just my random thoughts on a problem that I find interesting. Last time I looked at this code was three years ago, so you should definitely double-check on everything I say.

@ljanyst
Contributor

ljanyst commented Mar 22, 2018

As you can see in the code, it only prints an error and succeeds anyway, even if an error has been returned by the server. Since there's a comment there saying that we disregard the error intentionally, I am sure there was a good reason for it. I just don't remember what it was.

@simonmichal
Contributor

Thanks for pointing it out; I somehow did not realize that the return statement was commented out.

@murrayc3
Contributor Author

murrayc3 commented Mar 22, 2018

From what I have seen so far in these discussions, the current XRootD protocol relies on only one session being active per client at a time. If this rule is broken, then a brand new XRootD protocol would need to be defined, which could of course use Lamport Timestamps.

Assuming that the XRootD protocol will not be changed any time soon, keeping only one session active at a time per client must be enforced at all costs. This means the XRootD client must handle a kXR_wait response to a kXR_endsess request. Have I simplified this too far and am I missing something?
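
A sketch of the client-side behaviour I mean, with hypothetical names (this is not the real XrdCl handshake code): a kXR_wait answer to kXR_endsess should delay and re-issue the request rather than let the new session proceed while the old one may still be alive on the server.

#include <unistd.h>

enum class EndSessStatus { Ok, Wait, Error };

struct EndSessReply
{
  EndSessStatus status;
  int           waitSeconds;  // only meaningful when status == Wait
};

// Returns true once the old session is confirmed terminated.
bool TerminateOldSession(EndSessReply (*sendEndSess)())
{
  while (true)
  {
    EndSessReply rsp = sendEndSess();
    if (rsp.status == EndSessStatus::Ok)
      return true;                 // old session gone: safe to resend requests
    if (rsp.status == EndSessStatus::Wait)
    {
      sleep(rsp.waitSeconds);      // back off for as long as the server asked
      continue;                    // re-issue kXR_endsess
    }
    return false;                  // hard error: give up on this connection
  }
}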

@ljanyst
Contributor

ljanyst commented Mar 22, 2018

For what it's worth, I think changing the client would be a reasonable course of action. However, it appears that the current handling of kXR_endsess is not a bug, so I would investigate whether the reasons it's been done this way are still good reasons. :)

@abh3
Member

abh3 commented Mar 22, 2018 via email

@abh3
Member

abh3 commented Mar 22, 2018 via email

@murrayc3
Contributor Author

To Andy, Michal and Lukasz: it is clear that there are some very good ideas on how to improve the session handling of XRootD in the event of an XRootD client re-connecting to the server. However, will it be possible to modify the XRootD client so that we can switch off the connection retry logic altogether?

CASTOR's repack and the file transfer frameworks of the LHC experiments have fully functional policies for retrying failed file transfers. Given the fact that we are currently experiencing corrupted files, switching off the XRootD client reconnect logic would immediately prevent these corrupted files from being created.

@xrootd-dev

xrootd-dev commented Mar 22, 2018 via email

@abh3
Member

abh3 commented Mar 22, 2018 via email

@ericcano
Contributor

@abh3, regarding the phantom session: I extrapolated that if the client gets kXR_wait as a reply to endsess, it should wait and try again, i.e. re-issue the endsess? Would that completely explain the problem?

If so, it is still worrying that the link could not be terminated in XrdXrootdProtocol::do_Endsess() one minute after establishment, and that more than another minute later the initial open finally got delivered. Would this mean the retry timing is dependent on the TCP session dying on its own? Could you kill the phantom/previous connection more drastically ("deep six" it, as you were suggesting) so we can ensure the server will disregard anything coming from the phantom connection? I think this is needed to open up the possibility of aggressive retries in low-latency environments (or simply to have control of our retry timings).

@ljanyst
Contributor

ljanyst commented Mar 23, 2018

The client needs to reissue a request that got a kXR_wait response. As far as I recall, the idea here is not to wait for the TCP connection to die off on its own, but to wait for the plug-in code to stop acting on the requests that came through that channel. Perhaps kXR_waitresp would be a better option here since it would eliminate the wait time estimation guesswork and just notify the client asynchronously that it can proceed. Certainly it would be easier to implement in the client.

@ericcano
Contributor

But in this very example, given that the open has not reached the server, do we have any plugin call in flight? Andy once mentioned something possibly badly handled in the CASTOR plug-in. Would this be a contributor to the problem? At that point the open had not yet made it through on the 1st connection.

@ljanyst
Contributor

ljanyst commented Mar 23, 2018

Perhaps that's the bug that Andy has fixed.

@abh3
Member

abh3 commented Mar 23, 2018 via email

@abh3
Member

abh3 commented Mar 23, 2018 via email

@abh3
Member

abh3 commented Mar 23, 2018 via email

@simonmichal
Contributor

The handling of error/wait responses to the endsess request has been fixed in 814e8b1.

@ljanyst
Contributor

ljanyst commented Mar 27, 2018

@abh3 Out of curiosity, would it be possible for the server to send kXR_waitresp instead of kXR_wait? It would alleviate Eric's concerns about retry latency due to wait-time overestimation, and the client-side implementation would be simpler.

@abh3
Member

abh3 commented Mar 27, 2018 via email

@ljanyst
Contributor

ljanyst commented Mar 27, 2018

Makes sense. Thanks for explaining. I am not trying to suggest that Michal found it problematic, just saying that from the client's perspective kXR_waitresp would be easier.

@xrootd-dev

xrootd-dev commented Mar 28, 2018 via email

@simonmichal
Contributor

I'll complement the thread with Elvin's findings, as I think they are crucial to understanding the problem:

Below I've retraced what happened from both the client and the server perspective.

First there is the open of file id 1633520448 for writing:

[2018-03-16 17:14:27.601768 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Message kXR_open (file: /diskserver/48/1633520448@castorns.6889849983?castor.exptime=1521220467&castor.pfn1=/diskserver/48/1633520448@castorns.6889849983&castor.pfn2=0:15511:ae8aa0b2-2588-42d5-b758-66526f24d8cc&castor.signature=QZLFVGTqrdWgudfJxCSROSDY/Se446SI26b4WjozzPe5olJk38zqr/OEZSSC1bgHTytJTjtjCBmqCvqFRlp7Mg==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) has been successfully sent.

After this the client continues to write and then at 17:14:37 sends a close request for this file:

[2018-03-16 17:14:37.138279 +0100][Debug  ][File              ] [0x68008a40@root://tpsrv452.cern.ch:1095//diskserver/48/1633520448@castorns.6889849983?castor.exptime=1521220467&castor.pfn1=/diskserver/48/
1633520448@castorns.6889849983&castor.pfn2=0:15511:ae8aa0b2-2588-42d5-b758-66526f24d8cc&castor.signature=QZLFVGTqrdWgudfJxCSROSDY/Se446SI26b4WjozzPe5olJk38zqr/OEZSSC1bgHTytJTjtjCBmqCvqFRlp7Mg==&castor.txt
ype=tape] Sending a close command for handle 0x0 to tpsrv452.cern.ch:1095

On the server side we have a perfect match of the previous operations:

180316 17:14:27 time=1521216867.602143 func=open                     level=INFO  logid=19b3d1aa-2935-11e8-a8c6-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140280503043840 source=XrdxCastor2Ofs:697   ti
dent=stage.27009:30@tpsrv210 path=/diskserver/48/1633520448@castorns.6889849983, opaque=castor.exptime=1521220467&castor.pfn1=/diskserver/48/1633520448@castorns.6889849983&castor.pfn2=0:15511:ae8aa0b2-258
8-42d5-b758-66526f24d8cc&castor.signature=QZLFVGTqrdWgudfJxCSROSDY/Se446SI26b4WjozzPe5olJk38zqr/OEZSSC1bgHTytJTjtjCBmqCvqFRlp7Mg==&castor.txtype=tape, isRW=1, open_mode=4200, file_ptr=0x7f95740033d0
180316 17:14:27 165099 stage.27009:30@tpsrv210 castor2ofs_open: 4200-40600 fn=/diskserver/48/1633520448@castorns.6889849983
180316 17:14:37 165099 stage.27009:30@tpsrv210 castor2ofs_close: use=1 fn=/diskserver/48/1633520448@castorns.6889849983

The nice part of the last line of the logs is that we can see the "use=1", which is a sort of reference count for this file. Now the close does indeed go into the CASTOR plugin, which seems to block the request. The reason could be checksum recomputation or slowness of the diskmanager in replying to the close message. Nevertheless, the request is blocked.

I've looked over the XRootD code and indeed there is a Link->Serialize() call done for open and close:
https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdXeq.cc#L482  - for close
https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdXeq.cc#L1353  - for open

While the link on the server side is blocked, the client is trying to open multiple other files, including the one we see triggering the open retry, namely id 1633530374:

[2018-03-16 17:16:21.366785 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Message kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) has been successfully sent.

This open is never replied to due to the link being blocked by the previous close. After 60 seconds the socket is declared dead and all requests sent through that socket are timed out:

[2018-03-16 17:17:22.007993 +0100][Dump   ][XRootDTransport   ] [tpsrv452.cern.ch:1095 #0.0] Stream inactive since 61 seconds, stream timeout: 60, allocated SIDs: 4, wait barrier: 2018-03-16 17:14:37 +0100
[2018-03-16 17:17:22.008001 +0100][Dump   ][File              ] [0x64008ce0@root://tpsrv441.cern.ch:1095//diskserver/91/1633530291@castorns.6889850068?castor.exptime=1521220636&castor.pfn1=/diskserver/91/
1633530291@castorns.6889850068&castor.pfn2=0:15511:260734cd-4198-4f88-9aca-276ed72935b0&castor.signature=a60v/QBb9YfC0wEfFv22P9xqg4LMXSGIlNaaYWG/+PvtIGH8lSiEVEEKQSrrvocGKcylMkmHKROPNmwWgFsgzw==&castor.txt
ype=tape] Got state response for message kXR_write (handle: 0x00000000, offset: 1630535680, size: 5242880)
[2018-03-16 17:17:22.008032 +0100][Debug  ][AsyncSock         ] [tpsrv452.cern.ch:1095 #0.0] Closing the socket
[2018-03-16 17:17:22.008049 +0100][Debug  ][Poller            ] <[::ffff:137.138.222.81]:42468><--><[::ffff:10.32.133.140]:1095> Removing socket from the poller
[2018-03-16 17:17:22.008093 +0100][Debug  ][PostMaster        ] [tpsrv452.cern.ch:1095 #0] Recovering error for stream #0: [ERROR] Socket error.
[2018-03-16 17:17:22.008106 +0100][Debug  ][PostMaster        ] [tpsrv452.cern.ch:1095 #0] Reporting disconnection to queued message handlers.
[2018-03-16 17:17:22.008118 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Stream event reported for msg kXR_open (file: /diskserver/26/1633531326@castorns.6889850100?castor.exptime=152122055
5&castor.pfn1=/diskserver/26/1633531326@castorns.6889850100&castor.pfn2=0:15511:a665e40e-6c12-406e-abcd-cc770a7b5fd0&castor.signature=AXlRWEpWwCisj0dX1yhcJYywv/nJuBKiXSdNEJ3kjnEdFOT02T/6ysbA6Mi9lhGeWvnucC
kQXdGwa6f+NIM8SA==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat )
[2018-03-16 17:17:22.008130 +0100][Debug  ][XRootD            ] [tpsrv452.cern.ch:1095] Handling error while processing kXR_open (file: /diskserver/26/1633531326@castorns.6889850100?castor.exptime=1521220
555&castor.pfn1=/diskserver/26/1633531326@castorns.6889850100&castor.pfn2=0:15511:a665e40e-6c12-406e-abcd-cc770a7b5fd0&castor.signature=AXlRWEpWwCisj0dX1yhcJYywv/nJuBKiXSdNEJ3kjnEdFOT02T/6ysbA6Mi9lhGeWvnu
cCkQXdGwa6f+NIM8SA==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ): [ERROR] Socket error.
[2018-03-16 17:17:22.008146 +0100][Dump   ][PostMaster        ] [tpsrv452.cern.ch:1095 #0] Sending message kXR_open (file: /diskserver/26/1633531326@castorns.6889850100?castor.exptime=1521220555&castor.pf
n1=/diskserver/26/1633531326@castorns.6889850100&castor.pfn2=0:15511:a665e40e-6c12-406e-abcd-cc770a7b5fd0&castor.signature=AXlRWEpWwCisj0dX1yhcJYywv/nJuBKiXSdNEJ3kjnEdFOT02T/6ysbA6Mi9lhGeWvnucCkQXdGwa6f+N
IM8SA==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x7400a1a0) through substream 0 expecting answer at 0
[2018-03-16 17:17:22.009109 +0100][Debug  ][PostMaster        ] [tpsrv452.cern.ch:1095] Found 1 address(es): [::ffff:10.32.133.140]:1095
[2018-03-16 17:17:22.009149 +0100][Debug  ][AsyncSock         ] [tpsrv452.cern.ch:1095 #0.0] Attempting connection to [::ffff:10.32.133.140]:1095
[2018-03-16 17:17:22.009182 +0100][Debug  ][Poller            ] Adding socket 0x70005d60 to the poller
[2018-03-16 17:17:22.009209 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Stream event reported for msg kXR_open (file: /diskserver/58/1633530558@castorns.6889850058?castor.exptime=152122051
6&castor.pfn1=/diskserver/58/1633530558@castorns.6889850058&castor.pfn2=0:15511:c3ea1292-06c4-4051-ae16-63c9ffa84884&castor.signature=ekXuu3hq3/ZYlcqOf2zVouol9mM8wJZ3pUkaf1k5474mk38cJPk8hqTpBmUpVgQjNBh75W
lm9fEDYJrtryRDKg==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat )
[2018-03-16 17:17:22.009221 +0100][Debug  ][XRootD            ] [tpsrv452.cern.ch:1095] Handling error while processing kXR_open (file: /diskserver/58/1633530558@castorns.6889850058?castor.exptime=1521220
516&castor.pfn1=/diskserver/58/1633530558@castorns.6889850058&castor.pfn2=0:15511:c3ea1292-06c4-4051-ae16-63c9ffa84884&castor.signature=ekXuu3hq3/ZYlcqOf2zVouol9mM8wJZ3pUkaf1k5474mk38cJPk8hqTpBmUpVgQjNBh7
5Wlm9fEDYJrtryRDKg==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ): [ERROR] Socket error.
[2018-03-16 17:17:22.009244 +0100][Dump   ][PostMaster        ] [tpsrv452.cern.ch:1095 #0] Sending message kXR_open (file: /diskserver/58/1633530558@castorns.6889850058?castor.exptime=1521220516&castor.pf
n1=/diskserver/58/1633530558@castorns.6889850058&castor.pfn2=0:15511:c3ea1292-06c4-4051-ae16-63c9ffa84884&castor.signature=ekXuu3hq3/ZYlcqOf2zVouol9mM8wJZ3pUkaf1k5474mk38cJPk8hqTpBmUpVgQjNBh75Wlm9fEDYJrtr
yRDKg==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x58005220) through substream 0 expecting answer at 0
[2018-03-16 17:17:22.009255 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Stream event reported for msg kXR_close (handle: 0x00000000)
[2018-03-16 17:17:22.009265 +0100][Debug  ][XRootD            ] [tpsrv452.cern.ch:1095] Handling error while processing kXR_close (handle: 0x00000000): [ERROR] Socket error.
[2018-03-16 17:17:22.009275 +0100][Error  ][XRootD            ] [tpsrv452.cern.ch:1095] Unable to get the response to request kXR_close (handle: 0x00000000)
[2018-03-16 17:17:22.009295 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Stream event reported for msg kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=152122058
1&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqk
vH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat )
[2018-03-16 17:17:22.009306 +0100][Debug  ][XRootD            ] [tpsrv452.cern.ch:1095] Handling error while processing kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ): [ERROR] Socket error.
[2018-03-16 17:17:22.009309 +0100][Debug  ][File              ] [0x68008a40@root://tpsrv452.cern.ch:1095//diskserver/48/1633520448@castorns.6889849983?castor.exptime=1521220467&castor.pfn1=/diskserver/48/1633520448@castorns.6889849983&castor.pfn2=0:15511:ae8aa0b2-2588-42d5-b758-66526f24d8cc&castor.signature=QZLFVGTqrdWgudfJxCSROSDY/Se446SI26b4WjozzPe5olJk38zqr/OEZSSC1bgHTytJTjtjCBmqCvqFRlp7Mg==&castor.txtype=tape] Close returned from tpsrv452.cern.ch:1095 with: [ERROR] Socket error

As can be seen from the logs above, there are actually 3 open requests which are dropped and retried since the socket is declared dead, and one close request, which is the one from the first transfer that never returned.
This, I think, also matches an observation from Eric, who was suspecting cross-talk in the client ...

The more interesting part is on the server side, where we see the effect of all these recoveries. For example, for file id 1633531326:

180316 17:17:22 time=1521217042.015626 func=open                     level=INFO  logid=81a93368-2935-11e8-8389-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140281188312832 source=XrdxCastor2Ofs:697   ti
dent=stage.27009:35@tpsrv210 path=/diskserver/26/1633531326@castorns.6889850100, opaque=castor.exptime=1521220555&castor.pfn1=/diskserver/26/1633531326@castorns.6889850100&castor.pfn2=0:15511:a665e40e-6c1
2-406e-abcd-cc770a7b5fd0&castor.signature=AXlRWEpWwCisj0dX1yhcJYywv/nJuBKiXSdNEJ3kjnEdFOT02T/6ysbA6Mi9lhGeWvnucCkQXdGwa6f+NIM8SA==&castor.txtype=tape, isRW=1, open_mode=4200, file_ptr=0x7f95ac104810
180316 17:17:22 153326 stage.27009:35@tpsrv210 castor2ofs_open: 4200-40600 fn=/diskserver/26/1633531326@castorns.6889850100

The above is the open that came through the new socket as a result of the recovery. And below, at some point later on, when the blocking close finished and the link mutex was released, we actually see the first open going through:

180316 17:18:59 time=1521217139.965050 func=open                     level=INFO  logid=bc0b1b0c-2935-11e8-a8c6-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140280503043840 source=XrdxCastor2Ofs:697   tident=stage.27009:30@tpsrv210 path=/diskserver/26/1633531326@castorns.6889850100, opaque=castor.exptime=1521220555&castor.pfn1=/diskserver/26/1633531326@castorns.6889850100&castor.pfn2=0:15511:a665e40e-6c12-406e-abcd-cc770a7b5fd0&castor.signature=AXlRWEpWwCisj0dX1yhcJYywv/nJuBKiXSdNEJ3kjnEdFOT02T/6ysbA6Mi9lhGeWvnucCkQXdGwa6f+NIM8SA==&castor.txtype=tape, isRW=1, open_mode=4200, file_ptr=0x7f95740033d0
180316 17:18:59 165099 stage.27009:30@tpsrv210 castor2ofs_open: 4200-40600 fn=/diskserver/26/1633531326@castorns.6889850100
180316 17:19:00 165099 stage.27009:30@tpsrv210 castor2ofs_open: attach use=2 fn=/diskserver/26/1633531326@castorns.6889850100
180316 17:19:00 165099 stage.27009:30@tpsrv210 castor2ofs_close: use=2 fn=/diskserver/26/1633531326@castorns.6889850100
180316 17:19:00 time=1521217140.187768 func=close                    level=ERROR logid=bc0b1b0c-2935-11e8-a8c6-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140280503043840 source=XrdxCastor2Ofs:974   tident=stage.27009:30@tpsrv210 path=/diskserver/26/1633531326@castorns.6889850100 msg="closed via destructor"
180316 17:19:00 165099 castor2ofs_close: stage.27009:30@tpsrv210 Unable to close file - delete via destructor ; connection reset by peer
180316 17:19:00 time=1521217140.187819 func=close                    level=INFO  logid=bc0b1b0c-2935-11e8-a8c6-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140280503043840 source=XrdxCastor2Ofs:1007  tident=stage.27009:30@tpsrv210 msg="send to diskmanager" errc=104 errmsg="Unable to close file - delete via destructor ; connection reset by peer"
180316 17:19:00 165099 stage.27009:30@tpsrv210 castor2ofs_close: use=0 fn=dummy
180316 17:19:00 165099 XrootdXeq: Output file /diskserver/26/1633531326@castorns.6889850100 is already opened by 1 writer; open denied.
180316 17:19:00 165099 stage.27009:30@tpsrv210 XrootdResponse: sending err 3003: Output file /diskserver/26/1633531326@castorns.6889850100 is already opened by 1 writer; open denied.

Surprisingly, or not, we can see that XRootD detected that this is an open of a file already opened for write (notice the use=2) and denied the open. But this happens after the CASTOR plugin is called and is therefore useless, since the damage is already done. I suppose this is the thing that Andy fixed: the check now happens before the plugin call.

@simonmichal
Contributor

I've been able to reproduce the scenario described by Elvin with xrootd 4.8.1 and indeed I observed file corruption. Doing the same test with the 814e8b1 fix in place confirmed that the problem has been addressed and that the patch prevents file corruption.

@simonmichal
Contributor

Just to be on the safe side, I've added an envar to disable the open recovery mechanism (for open+create and open+truncate) in case the request would be retried on the same data node: 6ef8e30
