C++ API to XRootD should allow the "open file" retry logic to be turned off #673

Closed

murrayc3 opened this issue Mar 19, 2018 · 48 comments

@murrayc3
Contributor

The XRootD client will retry a request to open a file if the first attempt takes too long. Unfortunately, this can result in two open requests being sent from the client to the server. This can happen as follows:

  1. The client application calls XrdCl::File::Open().
  2. The XRootD client library sends an "open file" request to the server over a TCP/IP connection.
  3. The XRootD client library times out the "open file" request.
  4. The XRootD client sends a second "open file" request over a different TCP/IP connection.
  5. The XRootD server eventually receives two "open file" requests because the first was just delayed on the network.

Could the XRootD client and its API please be modified so that the calling application can explicitly switch off the "open file" retry logic?

@simonmichal
Contributor

After a quick glance I would say you should increase the value of the stream timeout.
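
For instance, a minimal sketch of how an application could raise that timeout through the XrdCl default environment (equivalent to exporting XRD_STREAMTIMEOUT; the key name should be checked against the client version in use):

#include <XrdCl/XrdClDefaultEnv.hh>
#include <XrdCl/XrdClEnv.hh>

int main()
{
  // Raise the stream timeout before any XrdCl::File is opened, so that a
  // slow but healthy connection is not declared dead and retried too early.
  XrdCl::Env *env = XrdCl::DefaultEnv::GetEnv();
  env->PutInt( "StreamTimeout", 300 ); // seconds; the shipped default is 60
  // ... proceed with XrdCl::File::Open(), writes, Close(), etc.
  return 0;
}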

Just to clarify: the problem is that the server responds to the second open with an error because the file has already been opened by the first one, right?

Let me think about it, maybe there's a better way ;-)

@abh3
Member

abh3 commented Mar 19, 2018

This problem should become moot in 4.8.2. Assuming that multiple opens for reading cause no problems, then in 4.8.2 we make sure that only one open for writing will occur, regardless of how many open requests were sent to the server during recovery. This corrects an optimization in 4.8.1 that allowed the server to have the file open multiple times (without immediate writing) to speed up file access. The new patches won't affect normal opens much but will somewhat slow error-recovery opens. Fortunately, those shouldn't happen often.

I'd close this ticket, but I surmise you might want to test 4.8.2 to make sure it actually works that way before I do.
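
For what it's worth, a purely conceptual sketch of the single-writer guarantee described above (hypothetical names, not the actual XrdXrootd/XrdOfs code): a second open-for-write of a path that already has an active writer is denied instead of being allowed to truncate the file.

#include <mutex>
#include <string>
#include <unordered_map>

class WriterRegistry
{
public:
  // Returns true if the caller may open 'path' for writing.
  bool TryOpenForWrite(const std::string &path)
  {
    std::lock_guard<std::mutex> lock(mMutex);
    if (mWriters[path] > 0)
      return false;            // already opened by a writer: open denied
    ++mWriters[path];
    return true;
  }

  // Must be called when a write-open is closed.
  void CloseWrite(const std::string &path)
  {
    std::lock_guard<std::mutex> lock(mMutex);
    if (mWriters[path] > 0)
      --mWriters[path];
  }

private:
  std::mutex mMutex;
  std::unordered_map<std::string, int> mWriters;
};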

@murrayc3
Contributor Author

To reply to Michal: when the second "open file and truncate" request reaches the XRootD server, it truncates the file and then "goes away". The data transfer started by the first "open file and truncate" unknowingly continues writing data to the now corrupted file. When the file is finally written, there is a hole of zero-valued bytes at the very beginning of the file.

@murrayc3
Contributor Author

To reply to Andy: I'm not sure if XRootD 4.8.2 RC2 will fix the problem. My guess (and I could well be wrong) is that instead of ending up with files with zero-valued bytes at their beginnings, we'll end up with cleanly truncated, zero-length files instead.

@xrootd-dev

xrootd-dev commented Mar 20, 2018 via email

@murrayc3
Contributor Author

Hi Andy,

We have discussed a couple of concepts/ideas on our side of the pond and I am going to try to write them down for you here in this ticket/issue.

Briefly the two concepts/ideas are:

  • Your server-side fix of checking for mutual exclusivity before truncating a file won't work for small files.
  • We're trying to find a TCP/IP option that will guarantee, on the client side, that a connection to the server is truly "dead" before a brand new TCP/IP connection is created.

Going into more detail on your server-side fix in XRootD 4.8.2 RC2: the server will now reject a concurrent "open the file and truncate" request when there is an ongoing file transfer. This is great for relatively long files. Eric has pointed out, however, that very small files will not be protected. Here is the scenario with XRootD 4.8.2 RC2:

  1. The XRootD client sends an "open file and truncate" request to the server.
  2. The corresponding TCP/IP packet gets dropped/lost.
  3. The XRootD client library times out and closes the TCP/IP connection.
  4. The XRootD client library opens a brand new connection to the server.
  5. The XRootD client completes the entire transmission of the small file to the server (open, write, write... and then close).
  6. The lost "open file and truncate" TCP/IP packet gets retransmitted by the TCP/IP stack underneath the XRootD client library.
  7. The server lets the request take the mutual exclusion lock on the file for writing.
  8. The server truncates the file.
  9. The server receives the FIN packet of the close request on the connection made ages ago by the XRootD client library.

We're done; the small file has now been truncated.

Going into more detail about making sure the TCP/IP connection is really closed on the client side before creating a new one: we believe the first request to "open and truncate" the file is being replayed by the TCP/IP stack underneath the XRootD client library, even after the library has called close() on the connection's file descriptor. Eric has looked at some TCP/IP socket options and thinks there is a way to guarantee the connection is really closed before opening a new one. Eric sent the following e-mail to us here at CERN:

Maybe we could play with this (man 7 socket):
 
 
       SO_LINGER
              Sets or gets the SO_LINGER option.  The argument is a linger
              structure.
 
                  struct linger {
                      int l_onoff;    /* linger active */
                      int l_linger;   /* how many seconds to linger for */
                  };
 
              When enabled, a close(2) or shutdown(2) will not return until
              all queued messages for the socket have been successfully sent
              or the linger timeout has been reached.  Otherwise, the call
              returns immediately and the closing is done in the background.
              When the socket is closed as part of exit(2), it always
              lingers in the background.
 
 
Cheers,
Eric
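
For illustration, a minimal sketch of the option Eric is referring to, applied to a plain POSIX socket descriptor before close() (untested; this is not how XrdCl currently closes its sockets):

#include <sys/socket.h>
#include <unistd.h>

// Make close() on 'fd' either flush pending data within 'timeout_seconds'
// or, when timeout_seconds is 0, abort the connection (RST) so the kernel
// does not keep retransmitting the old request in the background.
static int close_with_linger(int fd, int timeout_seconds)
{
  struct linger lin;
  lin.l_onoff  = 1;               // enable lingering on close
  lin.l_linger = timeout_seconds; // 0 => abortive close, no background retransmits
  if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lin, sizeof(lin)) != 0)
    return -1;
  return close(fd);
}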

@abh3
Member

abh3 commented Mar 20, 2018 via email

@simonmichal
Contributor

@abh3 : that sounds promising :-)

I was thinking about issuing a shutdown when there is a close-on-error on the client side, and then waiting for the EOF before opening a new connection (but I would need to do this within a reasonable timeout, so it's imperfect). Issuing the shutdown on your side on the endsess request sounds even better :-)
I will discuss this tomorrow with Steve and Eric.
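
A minimal sketch of that "shutdown, then wait for EOF" idea on a raw POSIX descriptor (hypothetical helper, not the actual XrdCl code): half-close the socket, then drain it until the peer closes its side or the timeout expires, and only then allow a new connection to be opened.

#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Returns true if the peer acknowledged the close (EOF seen), false if we
// gave up after waiting 'timeout_ms' without any data arriving -- which is
// exactly the "imperfect" case mentioned above.
static bool drain_until_eof(int fd, int timeout_ms)
{
  shutdown(fd, SHUT_WR);                     // we will not send anything more
  char buf[4096];
  struct pollfd pfd = { fd, POLLIN, 0 };
  while (poll(&pfd, 1, timeout_ms) > 0)
  {
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n == 0) { close(fd); return true; }  // EOF: peer closed, safe to reconnect
    if (n < 0) break;                        // read error: give up
  }
  close(fd);
  return false;
}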

@jmuf
Contributor

jmuf commented Mar 21, 2018

Just a bit of wider context:
Spurious xrootd-internal retries have been blamed for data loss at CERN (truncated/empty files) since at least last summer, with the LHC experiments struggling to turn this off. I understand that the internal retry mechanism (i.e. drop the TCP connection, open a new connection, resend) is generic and that turning it off just for writes or open+truncate is currently not an option.
I would advocate disabling this mechanism by default for now, and perhaps re-enabling it once it can be safely limited to a subset of idempotent operations (read?).
All LHC experiments have mechanisms for higher-level retries in case "xrdcp" gives some error. What they cannot really cope with is a silently corrupted file and a "successful" exit code.

@ericcano
Contributor

@abh3 : Your solution will solve the real problem (making sure the current connection will not retransmit anything before starting a new connection), but the timing will be very dependent on retransmissions in the TCP stack. As we saw, it sometimes takes minutes before the 1st open gets retransmitted and corrupts the file, so the retry would take that amount of time in those situations (the open will have to go through on the 1st TCP connection anyway before the endsess command goes through).

My proposed direction of investigation (as this linger option has not been tested in any way besides RTFMing) has the drawback that it might not be portable to all client architectures.

Overall, the client must never have 2 paths to the server open at the same time, so both solutions would fit the bill. I have a slight preference for the linger solution as it will allow more aggressive retries (when we recall from tape within the computer center, we could have an open timeout of only tens of seconds, if not seconds, as the usual RTT is very small). In our case, this would prevent wasting tape drive bandwidth. We only have ~3 minutes of in-memory buffer in the tape servers, so blocking for that amount of time will eventually waste drive resources.

@jmuf
Contributor

jmuf commented Mar 21, 2018

Please keep in mind that the "lingering" packet might not only be sent by the local client kernel, but could also come from some intermediate piece of network equipment (routers, multipath routing with one congested leg?). As such, a client-side fix like SO_LINGER might not reliably address the issue?

@ericcano
Contributor

Agreed, but this will already severely cut the time window during which the 1st open can rise from the dead (from TCP retry timeouts down to the practical packet time of flight on the network). In Linux, the default TTL is 64, so a mourning period of 64 s would ensure that an in-flight packet will not pop up. As this number is used both as seconds and as a hop count, we can hardly lower it. I suspect the practical value of this mourning period is much lower, of the order of a few seconds if not less (but that's not fully rigorous).

@esindril
Contributor

A trace of such an incident from the XrdCl client's perspective:

[2018-03-16 17:16:21.366568 +0100][Dump   ][Utility           ] URL: root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.exptime=1521220581&castor.txtype=tape&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==
[2018-03-16 17:16:21.366568 +0100][Dump   ][Utility           ] Protocol:  root
[2018-03-16 17:16:21.366568 +0100][Dump   ][Utility           ] User Name: 
[2018-03-16 17:16:21.366568 +0100][Dump   ][Utility           ] Password:  
[2018-03-16 17:16:21.366568 +0100][Dump   ][Utility           ] Host Name: tpsrv452.cern.ch
[2018-03-16 17:16:21.366568 +0100][Dump   ][Utility           ] Port:      1095
[2018-03-16 17:16:21.366568 +0100][Dump   ][Utility           ] Path:      /diskserver/74/1633530374@castorns.6889850064
[2018-03-16 17:16:21.366640 +0100][Dump   ][Utility           ] URL: root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.exptime=1521220581&castor.txtype=tape&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==
[2018-03-16 17:16:21.366640 +0100][Dump   ][Utility           ] Protocol:  root
[2018-03-16 17:16:21.366640 +0100][Dump   ][Utility           ] User Name: 
[2018-03-16 17:16:21.366640 +0100][Dump   ][Utility           ] Password:  
[2018-03-16 17:16:21.366640 +0100][Dump   ][Utility           ] Host Name: tpsrv452.cern.ch
[2018-03-16 17:16:21.366640 +0100][Dump   ][Utility           ] Port:      1095
[2018-03-16 17:16:21.366640 +0100][Dump   ][Utility           ] Path:      /diskserver/74/1633530374@castorns.6889850064
[2018-03-16 17:16:21.366657 +0100][Debug  ][File              ] [0x7000f0a0@root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape] Sending an open command
[2018-03-16 17:16:21.366680 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Sending message kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat )
[2018-03-16 17:16:21.366699 +0100][Dump   ][PostMaster        ] [tpsrv452.cern.ch:1095 #0] Sending message kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x70005e60) through substream 0 expecting answer at 0
[2018-03-16 17:16:21.366745 +0100][Dump   ][AsyncSock         ] [tpsrv452.cern.ch:1095 #0.0] Wrote a message: kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x70005e60), 335 bytes
[2018-03-16 17:16:21.366762 +0100][Dump   ][AsyncSock         ] [tpsrv452.cern.ch:1095 #0.0] Successfully sent message: kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x70005e60).
[2018-03-16 17:16:21.366785 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Message kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) has been successfully sent.
[2018-03-16 17:17:22.009295 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Stream event reported for msg kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat )
[2018-03-16 17:17:22.009306 +0100][Debug  ][XRootD            ] [tpsrv452.cern.ch:1095] Handling error while processing kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ): [ERROR] Socket error.
[2018-03-16 17:17:22.009309 +0100][Debug  ][File              ] [0x68008a40@root://tpsrv452.cern.ch:1095//diskserver/48/1633520448@castorns.6889849983?castor.exptime=1521220467&castor.pfn1=/diskserver/48/1633520448@castorns.6889849983&castor.pfn2=0:15511:ae8aa0b2-2588-42d5-b758-66526f24d8cc&castor.signature=QZLFVGTqrdWgudfJxCSROSDY/Se446SI26b4WjozzPe5olJk38zqr/OEZSSC1bgHTytJTjtjCBmqCvqFRlp7Mg==&castor.txtype=tape] Close returned from tpsrv452.cern.ch:1095 with: [ERROR] Socket error
[2018-03-16 17:17:22.009317 +0100][Dump   ][PostMaster        ] [tpsrv452.cern.ch:1095 #0] Sending message kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x70005e60) through substream 0 expecting answer at 0
[2018-03-16 17:17:22.013987 +0100][Dump   ][AsyncSock         ] [tpsrv452.cern.ch:1095 #0.0] Wrote a message: kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x70005e60), 335 bytes
[2018-03-16 17:17:22.013999 +0100][Dump   ][AsyncSock         ] [tpsrv452.cern.ch:1095 #0.0] Successfully sent message: kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x70005e60).
[2018-03-16 17:17:22.014008 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Message kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) has been successfully sent.
[2018-03-16 17:17:22.019252 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Got a kXR_ok response to request kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat )
[2018-03-16 17:17:22.019267 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Parsing the response to kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) as OpenInfo
[2018-03-16 17:17:22.019276 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Parsing StatInfo in response to kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat )
[2018-03-16 17:17:22.019326 +0100][Debug  ][File              ] [0x7000f0a0@root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape] Open has returned with status [SUCCESS] 
[2018-03-16 17:17:22.019336 +0100][Debug  ][File              ] [0x7000f0a0@root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape] successfully opened at tpsrv452.cern.ch:1095, handle: 0x2, session id: 6
[2018-03-16 17:17:22.019471 +0100][Debug  ][File              ] [0x7000f0a0@root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape] Sending a write command for handle 0x2 to tpsrv452.cern.ch:1095

Then there is a continuous stream of writes and the close:

[2018-03-16 17:19:18.271335 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Got a kXR_ok response to request kXR_write (handle: 0x02000000, offset: 3046113280, size: 8316)
[2018-03-16 17:19:18.271347 +0100][Dump   ][File              ] [0x7000f0a0@root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011
674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape] Got state response for message kXR_write (handle: 0x02000000, offset: 3046113280, size: 8316)
[2018-03-16 17:19:18.271403 +0100][Debug  ][File              ] [0x7000f0a0@root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011
674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape] Sending a close command for handle 0x2 to tpsrv452.cern.ch:1095
[2018-03-16 17:19:18.285227 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Got a kXR_ok response to request kXR_close (handle: 0x02000000)
[2018-03-16 17:19:18.285248 +0100][Debug  ][File              ] [0x7000f0a0@root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011
674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape] Close returned from tpsrv452.cern.ch:1095 with: [SUCCESS] 
[2018-03-16 17:19:18.285302 +0100][Dump   ][File              ] [0x7000f0a0@root://tpsrv452.cern.ch:1095//diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape] Items in the fly 0, queued for recovery 0

@murrayc3
Contributor Author

These are the corresponding CASTOR logs courtesy of Vlado:

CASTOR file 1633530374 corrupted on tpsrv452:

180316 17:19:00 time=1521217140.404038 func=close                    level=ERROR logid=bc2d2eae-2935-11e8-a8c6-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140280503043840 source=XrdxCastor2Ofs:974   tident=stage.27009:30@tpsrv210 path=/diskserver/74/1633530374@castorns.6889850064 msg="closed via destructor"
180316 17:32:25 time=1521217945.688662 func=close                    level=ERROR logid=94173868-2937-11e8-a9a8-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140280526202624 source=XrdxCastor2Ofs:981   tident=stage.87764:22@tpsrv416 path=/diskserver/74/1633530374@castorns.6889850064 msg="checksum mismatch"

@murrayc3
Contributor Author

murrayc3 commented Mar 21, 2018

Here are the XRootD server logs on tpsrv452 showing the double open from the same client tpsrv210. We clearly see the unwanted open being followed by an immediate close with ERROR (closed socket).

[root@tpsrv452 ~]# zgrep -E '1633530374@castorns|81a998c6-2935-11e8-8389-a4bf0112f566' /var/log/xrootd/server/xrootd.log-20180317.gz | egrep 'func=open|func=close' | grep tpsrv210
180316 17:17:22 time=1521217042.018172 func=open                     level=INFO  logid=81a998c6-2935-11e8-8389-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140281188312832 source=XrdxCastor2Ofs:697   tident=stage.27009:35@tpsrv210 path=/diskserver/74/1633530374@castorns.6889850064, opaque=castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, isRW=1, open_mode=4200, file_ptr=0x7f95ac105d30
180316 17:19:00 time=1521217140.188374 func=open                     level=INFO  logid=bc2d2eae-2935-11e8-a8c6-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140280503043840 source=XrdxCastor2Ofs:697   tident=stage.27009:30@tpsrv210 path=/diskserver/74/1633530374@castorns.6889850064, opaque=castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, isRW=1, open_mode=4200, file_ptr=0x7f95740033d0
180316 17:19:00 time=1521217140.404038 func=close                    level=ERROR logid=bc2d2eae-2935-11e8-a8c6-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140280503043840 source=XrdxCastor2Ofs:974   tident=stage.27009:30@tpsrv210 path=/diskserver/74/1633530374@castorns.6889850064 msg="closed via destructor"
180316 17:19:18 time=1521217158.283782 func=close                    level=INFO  logid=81a998c6-2935-11e8-8389-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140281188312832 source=XrdxCastor2Ofs:1007  tident=stage.27009:35@tpsrv210 msg="send to diskmanager" errc=0 errmsg=""
[root@tpsrv452 ~]# 

@ljanyst
Contributor

ljanyst commented Mar 21, 2018

Holy shit, that's a nasty one!

IMHO the only reliable solution to this kind of problem is to have a variation of Lamport Timestamps. This is how I would see it:

  1. Each client sends a random identifier during login (per channel).
  2. The server keeps a hash table with the identifiers and a 64 bit integer associated with each.
  3. The server purges this hash table once in a while so that it does not grow too large.
  4. The client has a 64 bit counter starting at 0.
  5. Before sending any message, the client increments the counter by 1.
  6. The client embeds the current counter value in every message.
  7. Every time the server receives a message from the client on a given channel it considers the counter.
  8. If the counter value in the message is lower than what the server has, it declares an error to the client.
  9. If the value is higher, the server performs the operation and associates the new counter value with the client id in its hash table.

This works because TCP guarantees ordering per connection. Recovery connections would be different TCP connections but would have the same client/channel ID.
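
A minimal illustration of that counter check on the server side (hypothetical class, not part of any XRootD code; purging of idle entries is omitted):

#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Decides whether a message should be processed or rejected as stale,
// based on the per-client 64-bit counter described above.
class RequestOrdering
{
public:
  bool Accept(const std::string &clientId, uint64_t counter)
  {
    std::lock_guard<std::mutex> lock(mMutex);
    uint64_t &last = mLast[clientId];  // defaults to 0 for a new client
    if (counter <= last)
      return false;                    // replayed or out-of-date request: error out
    last = counter;                    // remember the highest counter seen
    return true;
  }

private:
  std::mutex mMutex;
  std::unordered_map<std::string, uint64_t> mLast;
};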

@simonmichal
Contributor

@abh3 : could you shed some light on how endsess is handled on the server side?

On the client side, when there is a stream timeout, the original connection gets closed and then a new connection is opened. Once the connection is established, a handshake takes place; the last part of the handshake is the kXR_endsess request, which is supposed to terminate the pre-existing connection. The "second open" from Steve/Eric's scenario is guaranteed to happen after the client receives the response to that kXR_endsess request.

I noticed in the XrdXrootdProtocol::do_Endsess():

// Extract out the FD and Instance from the session ID
//
   sp = (XrdXrootdSessID *)Request.endsess.sessid;
   memcpy((void *)&sessID.Pid,  &sp->Pid,  sizeof(sessID.Pid));
   memcpy((void *)&sessID.FD,   &sp->FD,   sizeof(sessID.FD));
   memcpy((void *)&sessID.Inst, &sp->Inst, sizeof(sessID.Inst));

// Trace this request
//
   TRACEP(LOGIN, "endsess " <<sessID.Pid <<':' <<sessID.FD <<'.' <<sessID.Inst);

// If this session id does not refer to us, ignore the request
//
   if (sessID.Pid != myPID) return Response.Send();

// Terminate the indicated session, if possible. This could also be a self-termination.
//
   if ((sessID.FD == 0 && sessID.Inst == 0) 
   ||  !(rc = Link->Terminate(Link, sessID.FD, sessID.Inst))) return -1;

and then further in XrdLink::Terminate(...):

// We can now disable the link and schedule a close
//
   snprintf(buff, sizeof(buff), "ended by %s", ID);
   buff[sizeof(buff)-1] = '\0';
   lp->Poller->Disable(lp, buff);
   lp->opMutex.UnLock();

// Now wait for the link to shutdown. This avoids lock problems.
//
   if (killDone.Wait(int(killWait))) wTime += killWait;
      else wTime = -EPIPE;
   killDone.UnLock();

At first glance it seems to me that the server will terminate the original connection while handling the kXR_endsess request. Could you comment on that?

@simonmichal
Contributor

@jmuf : regarding the data corruption at CERN, is it possible to check whether it was triggered by the FUSE client or by xrdcp?

@jmuf
Contributor

jmuf commented Mar 21, 2018

The corruptions reported e.g. on EOSATLAS around Aug 2017 would have been via "xrdcp" (they got reported here as #552 - the "workaround on the EOS side" mentioned there prevented files from being deleted after such a truncate). The issues seen recently (March 2018) in CASTOR repack would also have been via "xrdcp". The recent (March 2018) ATLAS T0 job description test relied on FUSE for writing the files, and also saw spurious empty files without errors being reported to the application.

@simonmichal
Contributor

@jmuf : the xrootd client will always try to recover at the redirector if one is available, so we are most likely dealing here with two different problems:

  • In the case of EOSATLAS, xrdcp issues an open against the MGM and then gets redirected to an FST, so if a stream timeout occurs the client will always retry at the MGM.

  • In the two remaining cases it's different. CASTOR does a direct copy from data server to data server, so there is no redirector. The FUSE client queries the MGM for the file location and then issues a direct copy to the FST, so from the client's perspective there's no redirector either.

@abh3
Member

abh3 commented Mar 21, 2018 via email

@esindril
Contributor

But given this (endsess) mechanism and the fact that we never get to call any OFS plugin method in the error scenario described earlier, why didn't this already protect us from the double open?

@simonmichal
Contributor

From what I see in XrdXrootdProtocol::do_Endsess(), it can happen that the server actually responds with kXR_wait to a kXR_endsess request (@abh3 : could you confirm?):

https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdXeq.cc#L711-L712

I think the kXR_wait response is not taken into account in the client-side logic:

https://github.com/xrootd/xrootd/blob/master/src/XrdCl/XrdClXRootDTransport.cc#L2012-L2036

If the response is not an error, the client declares the handshake successful (@ljanyst : could you give me another 'holy shit' if you think this might be true? ;-).

This could be the source of our problem.

@ljanyst
Contributor

ljanyst commented Mar 22, 2018

From what I recall, the server was not supposed to stall the login sequence. This, and the fact that you may break third-party plugins by force-closing sockets on running sessions, is why I suggested using Lamport Timestamps in the first place, even though it's a disruptive change to the protocol.

These are just my random thoughts on a problem that I find interesting. Last time I looked at this code was three years ago, so you should definitely double-check on everything I say.

@ljanyst
Contributor

ljanyst commented Mar 22, 2018

As you can see in the code, it only prints an error and succeeds anyway, even if an error has been returned by the server. Since there's a comment there saying that we disregard the error intentionally, I am sure there was a good reason for it. I just don't remember what it was.

@simonmichal
Contributor

Thanks for pointing it out; I somehow did not realize that the return statement was commented out.

@murrayc3
Contributor Author

murrayc3 commented Mar 22, 2018

From what I have seen so far in these discussions, the current XRootD protocol relies on only one session being active per client at a time. If this rule is broken, then a brand new XRootD protocol would need to be defined, which could of course use Lamport Timestamps.

Assuming that the XRootD protocol will not be changed any time soon, keeping only one session active at a time per client must be enforced at all costs. This means the XRootD client must handle a kXR_wait response to a kXR_endsess request. Have I simplified this too far and am I missing something?
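
A sketch of the client-side behaviour I mean, with hypothetical names (this is not the real XrdCl handshake code): a kXR_wait answer to kXR_endsess should delay and re-issue the request rather than let the new session proceed while the old one may still be alive on the server.

#include <unistd.h>

enum class EndSessStatus { Ok, Wait, Error };

struct EndSessReply
{
  EndSessStatus status;
  int           waitSeconds;  // only meaningful when status == Wait
};

// Returns true once the old session is confirmed terminated.
bool TerminateOldSession(EndSessReply (*sendEndSess)())
{
  while (true)
  {
    EndSessReply rsp = sendEndSess();
    if (rsp.status == EndSessStatus::Ok)
      return true;                 // old session gone: safe to resend requests
    if (rsp.status == EndSessStatus::Wait)
    {
      sleep(rsp.waitSeconds);      // back off for as long as the server asked
      continue;                    // re-issue kXR_endsess
    }
    return false;                  // hard error: give up on this connection
  }
}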

@ljanyst
Contributor

ljanyst commented Mar 22, 2018

For what it's worth, I think changing the client would be a reasonable course of action. However, it appears that the current handling of kXR_endsess is not a bug, so I would investigate whether the reasons it's been done this way are still good reasons. :)

@abh3
Member

abh3 commented Mar 22, 2018 via email

@abh3
Member

abh3 commented Mar 22, 2018 via email

@murrayc3
Contributor Author

To Andy, Michal and Lukasz: it is clear that there are some very good ideas on how to improve the session handling of XRootD in the event of an XRootD client re-connecting to the server. However, will it be possible to modify the XRootD client so that we can switch off the connection retry logic altogether?

CASTOR's repack and the file transfer frameworks of the LHC experiments have fully functional policies for retrying failed file transfers. Given the fact that we are currently experiencing corrupted files, switching off the XRootD client reconnect logic would immediately prevent these corrupted files from being created.

@xrootd-dev

xrootd-dev commented Mar 22, 2018 via email

@abh3
Member

abh3 commented Mar 22, 2018 via email

@ericcano
Contributor

@abh3, regarding the phantom session: I extrapolated that if the client gets kXR_wait as a reply to endsess, it should wait and try again, i.e. re-issue the endsess? Would that completely explain the problem?

If so, it is still worrying that the link could not be terminated in XrdXrootdProtocol::do_Endsess() one minute after establishment, and that more than another minute later the initial open finally got delivered. Would this mean the retry timing is dependent on the TCP session dying on its own? Could you kill the phantom/previous connection more drastically ("deep six" it, as you were suggesting) so we can ensure the server will disregard anything coming from the phantom connection? I think this is needed to open up the possibility of aggressive retries in low-latency environments (or simply to have control of our retry timings).

@ljanyst
Contributor

ljanyst commented Mar 23, 2018

The client needs to reissue a request that got a kXR_wait response. As far as I recall, the idea here is not to wait for the TCP connection to die off on its own, but to wait for the plug-in code to stop acting on the requests that came through that channel. Perhaps kXR_waitresp would be a better option here since it would eliminate the wait time estimation guesswork and just notify the client asynchronously that it can proceed. Certainly it would be easier to implement in the client.

@ericcano
Contributor

But in this very example, given that the open has not reached the server, do we have any plugin call in flight? Andy once mentioned something possibly badly handled in the CASTOR plug-in. Would this be a contributor to the problem? At that point the open had not yet made it through on the 1st connection.

@ljanyst
Contributor

ljanyst commented Mar 23, 2018

Perhaps that's the bug that Andy has fixed.

@abh3
Member

abh3 commented Mar 23, 2018 via email

@abh3
Member

abh3 commented Mar 23, 2018 via email

@abh3
Member

abh3 commented Mar 23, 2018 via email

@simonmichal
Contributor

The handling of error/wait responses to the endsess request has been fixed in 814e8b1.

@ljanyst
Contributor

ljanyst commented Mar 27, 2018

@abh3 Out of curiosity, would it be possible for the server to send kXR_waitresp instead of kXR_wait? It would alleviate Eric's concerns about retry latency due to wait-time overestimation, and the client-side implementation would be simpler.

@abh3
Member

abh3 commented Mar 27, 2018 via email

@ljanyst
Contributor

ljanyst commented Mar 27, 2018

Makes sense. Thanks for explaining. I am not trying to suggest that Michal found it problematic, just saying that from the client's perspective kXR_waitresp would be easier.

@xrootd-dev

xrootd-dev commented Mar 28, 2018 via email

@simonmichal
Contributor

I'll complement the thread with Elvin's findings, as I think they are crucial to understanding the problem:

Below I've retraced what happened from both the client and the server perspective.

First there is the open of file id 1633520448 for writing:

[2018-03-16 17:14:27.601768 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Message kXR_open (file: /diskserver/48/1633520448@castorns.6889849983?castor.exptime=1521220467&castor.pfn1=/diskserver/48/1633520448@castorns.6889849983&castor.pfn2=0:15511:ae8aa0b2-2588-42d5-b758-66526f24d8cc&castor.signature=QZLFVGTqrdWgudfJxCSROSDY/Se446SI26b4WjozzPe5olJk38zqr/OEZSSC1bgHTytJTjtjCBmqCvqFRlp7Mg==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) has been successfully sent.

After this the client continues to write and then at 17:14:37 sends a close request for this file:

[2018-03-16 17:14:37.138279 +0100][Debug  ][File              ] [0x68008a40@root://tpsrv452.cern.ch:1095//diskserver/48/1633520448@castorns.6889849983?castor.exptime=1521220467&castor.pfn1=/diskserver/48/
1633520448@castorns.6889849983&castor.pfn2=0:15511:ae8aa0b2-2588-42d5-b758-66526f24d8cc&castor.signature=QZLFVGTqrdWgudfJxCSROSDY/Se446SI26b4WjozzPe5olJk38zqr/OEZSSC1bgHTytJTjtjCBmqCvqFRlp7Mg==&castor.txt
ype=tape] Sending a close command for handle 0x0 to tpsrv452.cern.ch:1095

On the server side we have a perfect match of the previous operations:

180316 17:14:27 time=1521216867.602143 func=open                     level=INFO  logid=19b3d1aa-2935-11e8-a8c6-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140280503043840 source=XrdxCastor2Ofs:697   ti
dent=stage.27009:30@tpsrv210 path=/diskserver/48/1633520448@castorns.6889849983, opaque=castor.exptime=1521220467&castor.pfn1=/diskserver/48/1633520448@castorns.6889849983&castor.pfn2=0:15511:ae8aa0b2-258
8-42d5-b758-66526f24d8cc&castor.signature=QZLFVGTqrdWgudfJxCSROSDY/Se446SI26b4WjozzPe5olJk38zqr/OEZSSC1bgHTytJTjtjCBmqCvqFRlp7Mg==&castor.txtype=tape, isRW=1, open_mode=4200, file_ptr=0x7f95740033d0
180316 17:14:27 165099 stage.27009:30@tpsrv210 castor2ofs_open: 4200-40600 fn=/diskserver/48/1633520448@castorns.6889849983
180316 17:14:37 165099 stage.27009:30@tpsrv210 castor2ofs_close: use=1 fn=/diskserver/48/1633520448@castorns.6889849983

The nice part of the last line of the logs is that we can see the "use=1", which is a sort of reference count for this file. Now the close does indeed go into the CASTOR plugin, which seems to block the request. The reason could be checksum recomputation or slowness of the diskmanager in replying to the close message. Nevertheless, the request is blocked.

I've looked over the XRootD code and indeed there is a Link->Serialize() call done for open and close:
https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdXeq.cc#L482  - for close
https://github.com/xrootd/xrootd/blob/master/src/XrdXrootd/XrdXrootdXeq.cc#L1353  - for open

While the link on the server side is blocked, the client is trying to open multiple other files, including the one we see triggering the open retry, namely id 1633530374:

[2018-03-16 17:16:21.366785 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Message kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) has been successfully sent.

This open is never replied to due to the link being blocked by the previous close. After 60 seconds the socket is declared dead and all requests sent through that socket are timed out:

[2018-03-16 17:17:22.007993 +0100][Dump   ][XRootDTransport   ] [tpsrv452.cern.ch:1095 #0.0] Stream inactive since 61 seconds, stream timeout: 60, allocated SIDs: 4, wait barrier: 2018-03-16 17:14:37 +0100
[2018-03-16 17:17:22.008001 +0100][Dump   ][File              ] [0x64008ce0@root://tpsrv441.cern.ch:1095//diskserver/91/1633530291@castorns.6889850068?castor.exptime=1521220636&castor.pfn1=/diskserver/91/
1633530291@castorns.6889850068&castor.pfn2=0:15511:260734cd-4198-4f88-9aca-276ed72935b0&castor.signature=a60v/QBb9YfC0wEfFv22P9xqg4LMXSGIlNaaYWG/+PvtIGH8lSiEVEEKQSrrvocGKcylMkmHKROPNmwWgFsgzw==&castor.txt
ype=tape] Got state response for message kXR_write (handle: 0x00000000, offset: 1630535680, size: 5242880)
[2018-03-16 17:17:22.008032 +0100][Debug  ][AsyncSock         ] [tpsrv452.cern.ch:1095 #0.0] Closing the socket
[2018-03-16 17:17:22.008049 +0100][Debug  ][Poller            ] <[::ffff:137.138.222.81]:42468><--><[::ffff:10.32.133.140]:1095> Removing socket from the poller
[2018-03-16 17:17:22.008093 +0100][Debug  ][PostMaster        ] [tpsrv452.cern.ch:1095 #0] Recovering error for stream #0: [ERROR] Socket error.
[2018-03-16 17:17:22.008106 +0100][Debug  ][PostMaster        ] [tpsrv452.cern.ch:1095 #0] Reporting disconnection to queued message handlers.
[2018-03-16 17:17:22.008118 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Stream event reported for msg kXR_open (file: /diskserver/26/1633531326@castorns.6889850100?castor.exptime=152122055
5&castor.pfn1=/diskserver/26/1633531326@castorns.6889850100&castor.pfn2=0:15511:a665e40e-6c12-406e-abcd-cc770a7b5fd0&castor.signature=AXlRWEpWwCisj0dX1yhcJYywv/nJuBKiXSdNEJ3kjnEdFOT02T/6ysbA6Mi9lhGeWvnucC
kQXdGwa6f+NIM8SA==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat )
[2018-03-16 17:17:22.008130 +0100][Debug  ][XRootD            ] [tpsrv452.cern.ch:1095] Handling error while processing kXR_open (file: /diskserver/26/1633531326@castorns.6889850100?castor.exptime=1521220
555&castor.pfn1=/diskserver/26/1633531326@castorns.6889850100&castor.pfn2=0:15511:a665e40e-6c12-406e-abcd-cc770a7b5fd0&castor.signature=AXlRWEpWwCisj0dX1yhcJYywv/nJuBKiXSdNEJ3kjnEdFOT02T/6ysbA6Mi9lhGeWvnu
cCkQXdGwa6f+NIM8SA==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ): [ERROR] Socket error.
[2018-03-16 17:17:22.008146 +0100][Dump   ][PostMaster        ] [tpsrv452.cern.ch:1095 #0] Sending message kXR_open (file: /diskserver/26/1633531326@castorns.6889850100?castor.exptime=1521220555&castor.pf
n1=/diskserver/26/1633531326@castorns.6889850100&castor.pfn2=0:15511:a665e40e-6c12-406e-abcd-cc770a7b5fd0&castor.signature=AXlRWEpWwCisj0dX1yhcJYywv/nJuBKiXSdNEJ3kjnEdFOT02T/6ysbA6Mi9lhGeWvnucCkQXdGwa6f+N
IM8SA==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x7400a1a0) through substream 0 expecting answer at 0
[2018-03-16 17:17:22.009109 +0100][Debug  ][PostMaster        ] [tpsrv452.cern.ch:1095] Found 1 address(es): [::ffff:10.32.133.140]:1095
[2018-03-16 17:17:22.009149 +0100][Debug  ][AsyncSock         ] [tpsrv452.cern.ch:1095 #0.0] Attempting connection to [::ffff:10.32.133.140]:1095
[2018-03-16 17:17:22.009182 +0100][Debug  ][Poller            ] Adding socket 0x70005d60 to the poller
[2018-03-16 17:17:22.009209 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Stream event reported for msg kXR_open (file: /diskserver/58/1633530558@castorns.6889850058?castor.exptime=152122051
6&castor.pfn1=/diskserver/58/1633530558@castorns.6889850058&castor.pfn2=0:15511:c3ea1292-06c4-4051-ae16-63c9ffa84884&castor.signature=ekXuu3hq3/ZYlcqOf2zVouol9mM8wJZ3pUkaf1k5474mk38cJPk8hqTpBmUpVgQjNBh75W
lm9fEDYJrtryRDKg==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat )
[2018-03-16 17:17:22.009221 +0100][Debug  ][XRootD            ] [tpsrv452.cern.ch:1095] Handling error while processing kXR_open (file: /diskserver/58/1633530558@castorns.6889850058?castor.exptime=1521220
516&castor.pfn1=/diskserver/58/1633530558@castorns.6889850058&castor.pfn2=0:15511:c3ea1292-06c4-4051-ae16-63c9ffa84884&castor.signature=ekXuu3hq3/ZYlcqOf2zVouol9mM8wJZ3pUkaf1k5474mk38cJPk8hqTpBmUpVgQjNBh7
5Wlm9fEDYJrtryRDKg==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ): [ERROR] Socket error.
[2018-03-16 17:17:22.009244 +0100][Dump   ][PostMaster        ] [tpsrv452.cern.ch:1095 #0] Sending message kXR_open (file: /diskserver/58/1633530558@castorns.6889850058?castor.exptime=1521220516&castor.pf
n1=/diskserver/58/1633530558@castorns.6889850058&castor.pfn2=0:15511:c3ea1292-06c4-4051-ae16-63c9ffa84884&castor.signature=ekXuu3hq3/ZYlcqOf2zVouol9mM8wJZ3pUkaf1k5474mk38cJPk8hqTpBmUpVgQjNBh75Wlm9fEDYJrtr
yRDKg==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ) (0x58005220) through substream 0 expecting answer at 0
[2018-03-16 17:17:22.009255 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Stream event reported for msg kXR_close (handle: 0x00000000)
[2018-03-16 17:17:22.009265 +0100][Debug  ][XRootD            ] [tpsrv452.cern.ch:1095] Handling error while processing kXR_close (handle: 0x00000000): [ERROR] Socket error.
[2018-03-16 17:17:22.009275 +0100][Error  ][XRootD            ] [tpsrv452.cern.ch:1095] Unable to get the response to request kXR_close (handle: 0x00000000)
[2018-03-16 17:17:22.009295 +0100][Dump   ][XRootD            ] [tpsrv452.cern.ch:1095] Stream event reported for msg kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=152122058
1&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqk
vH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat )
[2018-03-16 17:17:22.009306 +0100][Debug  ][XRootD            ] [tpsrv452.cern.ch:1095] Handling error while processing kXR_open (file: /diskserver/74/1633530374@castorns.6889850064?castor.exptime=1521220581&castor.pfn1=/diskserver/74/1633530374@castorns.6889850064&castor.pfn2=0:15511:660c04e5-491a-4d53-abe6-011674bb3fd4&castor.signature=M+zmafoPSQ1Dh9WQvf2j2s+Ppg3atlDffJPwQzvsTAlf1NnoxQjMl2SlFuTFOqEdlAcGqkvH5RvaK+vVhRQ/4g==&castor.txtype=tape, mode: 00, flags: kXR_delete kXR_seqio kXR_async kXR_retstat ): [ERROR] Socket error.
[2018-03-16 17:17:22.009309 +0100][Debug  ][File              ] [0x68008a40@root://tpsrv452.cern.ch:1095//diskserver/48/1633520448@castorns.6889849983?castor.exptime=1521220467&castor.pfn1=/diskserver/48/1633520448@castorns.6889849983&castor.pfn2=0:15511:ae8aa0b2-2588-42d5-b758-66526f24d8cc&castor.signature=QZLFVGTqrdWgudfJxCSROSDY/Se446SI26b4WjozzPe5olJk38zqr/OEZSSC1bgHTytJTjtjCBmqCvqFRlp7Mg==&castor.txtype=tape] Close returned from tpsrv452.cern.ch:1095 with: [ERROR] Socket error

As can be seen from the logs above, there are actually 3 open requests which are dropped and retried since the socket is declared dead, and one close request, which is the one from the first transfer that never returned.
This, I think, also matches an observation from Eric, who was suspecting cross-talk in the client ...

The more interesting part is on the server side, where we see the effect of all these recoveries. For example, for file id 1633531326:

180316 17:17:22 time=1521217042.015626 func=open                     level=INFO  logid=81a93368-2935-11e8-8389-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140281188312832 source=XrdxCastor2Ofs:697   ti
dent=stage.27009:35@tpsrv210 path=/diskserver/26/1633531326@castorns.6889850100, opaque=castor.exptime=1521220555&castor.pfn1=/diskserver/26/1633531326@castorns.6889850100&castor.pfn2=0:15511:a665e40e-6c1
2-406e-abcd-cc770a7b5fd0&castor.signature=AXlRWEpWwCisj0dX1yhcJYywv/nJuBKiXSdNEJ3kjnEdFOT02T/6ysbA6Mi9lhGeWvnucCkQXdGwa6f+NIM8SA==&castor.txtype=tape, isRW=1, open_mode=4200, file_ptr=0x7f95ac104810
180316 17:17:22 153326 stage.27009:35@tpsrv210 castor2ofs_open: 4200-40600 fn=/diskserver/26/1633531326@castorns.6889850100

The above is the open that came through the new socket as a result of the recovery. And below, at some point later on, when the blocking close finished and the link mutex was released, we actually see the first open going through:

180316 17:18:59 time=1521217139.965050 func=open                     level=INFO  logid=bc0b1b0c-2935-11e8-a8c6-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140280503043840 source=XrdxCastor2Ofs:697   tident=stage.27009:30@tpsrv210 path=/diskserver/26/1633531326@castorns.6889850100, opaque=castor.exptime=1521220555&castor.pfn1=/diskserver/26/1633531326@castorns.6889850100&castor.pfn2=0:15511:a665e40e-6c12-406e-abcd-cc770a7b5fd0&castor.signature=AXlRWEpWwCisj0dX1yhcJYywv/nJuBKiXSdNEJ3kjnEdFOT02T/6ysbA6Mi9lhGeWvnucCkQXdGwa6f+NIM8SA==&castor.txtype=tape, isRW=1, open_mode=4200, file_ptr=0x7f95740033d0
180316 17:18:59 165099 stage.27009:30@tpsrv210 castor2ofs_open: 4200-40600 fn=/diskserver/26/1633531326@castorns.6889850100
180316 17:19:00 165099 stage.27009:30@tpsrv210 castor2ofs_open: attach use=2 fn=/diskserver/26/1633531326@castorns.6889850100
180316 17:19:00 165099 stage.27009:30@tpsrv210 castor2ofs_close: use=2 fn=/diskserver/26/1633531326@castorns.6889850100
180316 17:19:00 time=1521217140.187768 func=close                    level=ERROR logid=bc0b1b0c-2935-11e8-a8c6-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140280503043840 source=XrdxCastor2Ofs:974   tident=stage.27009:30@tpsrv210 path=/diskserver/26/1633531326@castorns.6889850100 msg="closed via destructor"
180316 17:19:00 165099 castor2ofs_close: stage.27009:30@tpsrv210 Unable to close file - delete via destructor ; connection reset by peer
180316 17:19:00 time=1521217140.187819 func=close                    level=INFO  logid=bc0b1b0c-2935-11e8-a8c6-a4bf0112f566 unit=ds@tpsrv452.cern.ch:1094 tid=140280503043840 source=XrdxCastor2Ofs:1007  tident=stage.27009:30@tpsrv210 msg="send to diskmanager" errc=104 errmsg="Unable to close file - delete via destructor ; connection reset by peer"
180316 17:19:00 165099 stage.27009:30@tpsrv210 castor2ofs_close: use=0 fn=dummy
180316 17:19:00 165099 XrootdXeq: Output file /diskserver/26/1633531326@castorns.6889850100 is already opened by 1 writer; open denied.
180316 17:19:00 165099 stage.27009:30@tpsrv210 XrootdResponse: sending err 3003: Output file /diskserver/26/1633531326@castorns.6889850100 is already opened by 1 writer; open denied.

Surprisingly, or not, we can see that XRootD detected that this is an open of a file already opened for write (notice the use=2) and denied the open. But this happens after the CASTOR plugin is called and is therefore useless, since the damage is already done. I suppose this is the thing that Andy fixed: the check now happens before the plugin call.

@simonmichal
Contributor

I've been able to reproduce the scenario described by Elvin with xrootd 4.8.1 and indeed I observed file corruption. Doing the same test with the 814e8b1 fix in place confirmed that the problem has been addressed and that the patch prevents file corruption.

@simonmichal
Contributor

Just to be on the safe side, I've added an envar to disable the open recovery mechanism (for open+create and open+truncate) in case the request would be retried on the same data node: 6ef8e30
