
Understanding xrootd -> FTS timeout errors (Third Party Copy - HTTP) #1736

Closed
kreczko opened this issue Jul 12, 2022 · 90 comments

@kreczko

kreczko commented Jul 12, 2022

Dear experts,

We are observing issues between xrootd (both redirector and disk node) and FTS servers like these:

220712 09:41:26 10649 XrdLink: Unable to send to 7dc42260.5721:353@fts-cms-10.cern.ch; connection timed out
220712 09:41:26 10649 XrootdXeq: 7dc42260.5721:353@fts-cms-10.cern.ch disc 2:12:21 (socket error)

which cause most of our transfers to fail (failures outnumber successes by a factor of 10-100).

On the FTS side we see this:

INFO    Tue, 12 Jul 2022 10:14:02 +0200; [1657613642182] DEST http_plugin	CHECKSUM:ENTER	
WARNING Tue, 12 Jul 2022 10:29:52 +0200; Timeout stopped
ERR     Tue, 12 Jul 2022 10:29:52 +0200; Recoverable error: [112] DESTINATION CHECKSUM (Neon): Could not read status line: Connection timed out

In short: the transfer succeeds (we can see the files on disk), but the checksum step always times out.
We've tried many things to make it work, including a custom plugin for the checksum calculation.
We do not observe large delays in the checksum calculation itself - nothing that would explain 15-minute delays!
Both source and destination use davs:// for the copy, i.e. HTTPS via the XRootD path on our side.

Logs

FTS log example:

Full xrootd.log:

Versions, config and operations

We are running both redirector (xrootd.phy.bris.ac.uk:1094) and disk server (io-37-02.acrc.bris.ac.uk:1194) via Docker with --net=host.
FTS servers are reachable from within the containers (and host) via IPv4 and IPv6.

Our config can be found on https://github.com/BristolComputing/xrootd-se/tree/main/etc/xrootd (clustered + config.d).

Installed xrootd versions and plugins:

|libmacaroons.x86_64           |          0.3.0-2.el7 |           epel |
|scitokens-cpp.x86_64          |          0.7.1-1.el7 |           epel |
|voms.x86_64                   |   2.1.0-0.24.rc2.el7 |           epel |
|xrootd.x86_64                 | 1:5.4.3-1.1.osg36.el7 |            osg |
|xrootd-client.x86_64          | 1:5.4.3-1.1.osg36.el7 |            osg |
|xrootd-client-libs.x86_64     | 1:5.4.3-1.1.osg36.el7 |            osg |
|xrootd-cmstfc.x86_64          |    1.5.2-6.osg36.el7 |    osg-contrib |
|xrootd-lcmaps.x86_64          |       99-1.osg36.el7 |            osg |
|xrootd-libs.x86_64            | 1:5.4.3-1.1.osg36.el7 |            osg |
|xrootd-scitokens.x86_64       | 1:5.4.3-1.1.osg36.el7 |            osg |
|xrootd-selinux.noarch         | 1:5.4.3-1.1.osg36.el7 |            osg |
|xrootd-server.x86_64          | 1:5.4.3-1.1.osg36.el7 |            osg |
|xrootd-server-libs.x86_64     | 1:5.4.3-1.1.osg36.el7 |            osg |
|xrootd-voms.x86_64            | 1:5.4.3-1.1.osg36.el7 |            osg |

xrootd-hdfs

Hadoop 3.3.1
Source code repository https://github.com/apache/hadoop.git -r a3b9c37a397ad4188041dd80621bdeefc46885f2
Compiled by ubuntu on 2021-06-15T05:13Z
Compiled with protoc 3.7.1
From source with checksum 88a4ddb2299aca054416d6b7f81ca55
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-3.3.1.jar

xrootd-hdfs 2.2.0
Source code repository https://github.com/uobdic/xrootd-hdfs.git -b kreczko-checksum-debug -r 66d7c97
Compiled by CentOS Linux release 7.9.2009 (Core) on 2022-07-08T13:59Z

Other monitoring

Failures

[plot: failed FTS transfers (monitoring screenshot)]

Successes

[plot: successful FTS transfers (monitoring screenshot)]

99% of the failures are due to the mentioned timeout. Please note the different scales for the y-axis.

@rob-c

rob-c commented Jul 13, 2022

Hi @kreczko,

When I look at the disk node at about the same time (for the file in the fts transfer logs) I see errors like:

220712 09:31:36 11040 XrootdXeq: 7dc42260.6440:419@fts-cms-03.cern.ch disc 0:00:05 (link SSL read error)
Readahead buffer stats for /xrootd/cms/store/mc/RunIISummer20UL16RECOAPV/ST_t-channel_antitop_4f_InclusiveDecays_wtop1p3_TuneCP5_13TeV-powheg-madspin-pythia8/AODSIM/106X_mcRun2_asymptotic_preVFP_v8-v3/70000/69A1EEA6-3EA3-A44C-A1BD-FD19EF6E39B7.root : 0 misses, 0 hits, 0 partial hits, 0 unbuffered, 0 buffered bytes used of 0 read (0.00%)
220712 09:33:31 051 TPC_TempCA: Reloading the list of CAs and CRLs in directory

It looks from the logs like the redirector is passing the connection on to the disk node, which matches the fact that the transfer is running, but then there is some SSL-related error, which would match the FTS client not receiving the checksum.

As a passing thought (and based on the last line of the log), you mention you're using docker, how are you configuring/updating the voms/crl for production?

@kreczko
Author

kreczko commented Jul 13, 2022

Hi @rob-c

Thank you for looking into the logs

As a passing thought (and based on the last line of the log), you mention you're using docker, how are you configuring/updating the voms/crl for production?

voms/crl is mounted from /cvmfs/grid.cern.ch/etc/grid-security/, so it should be always up-to-date.
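One quick sanity check on that assumption (a sketch, not part of this setup: the cvmfs path is from this comment, while the 2-day threshold and the function name are illustrative) is to look for stale CRL files as seen from inside the container:

```shell
# Sketch: count CRL files (*.r0) in the mounted certificates directory
# that are older than 2 days. A non-zero count inside the container
# would suggest the bind mount is serving stale CRLs.
# The 2-day threshold is illustrative, not an xrootd requirement.
stale_crls() {
    # $1: certificates dir, e.g. /cvmfs/grid.cern.ch/etc/grid-security/certificates
    find "$1" -name '*.r0' -mtime +2 2>/dev/null | wc -l | tr -d ' '
}
```

If this prints a non-zero count inside the container while the host's cvmfs copy is current, the mount itself is the problem.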

@kreczko
Author

kreczko commented Jul 13, 2022

@rob-c I tried to extract one full transfer from the logs with more http info:
https://gist.github.com/kreczko/66f0d2fb47865821b8933b88045b87eb#file-one-full-transfer

This is what happens around checksum request, then time out:

220713 14:17:30 124 Xrd_Sched: running chksum inq=0
220713 14:17:30 124 Xrootd_jobXeq: Job chksum /xrootd/cms/store/data/Run2018A/EGamma/NANOAOD/UL2018_MiniAODv1_NanoAODv2_GT36-v1/70000/4DC52ADD-551B-7242-8334-B97E6348B997.root started
220713 14:17:30 124 http_Req:  XrdHttpReq::Data! final=True
220713 14:17:30 124 7dc42260.738:418@fts-cms-06.cern.ch http_Req: PostProcessHTTPReq req: 3 reqstate: 1
220713 14:17:30 124 7dc42260.738:418@fts-cms-06.cern.ch http_Req: Checksum for HEAD /xrootd/cms/store/data/Run2018A/EGamma/NANOAOD/UL2018_MiniAODv1_NanoAODv2_GT36-v1/70000/4DC52ADD-551B-7242-8334-B97E6348B997.root adler32=30032735
220713 14:17:30 124 7dc42260.738:418@fts-cms-06.cern.ch http_Protocol: Sending resp: 200 header len:97
220713 14:17:30 124 http_Protocol: Sending 97 bytes
220713 14:17:30 124 http_Req:  XrdHttpReq request ended.
220713 14:17:30 124 Xrd_Poll: Poller 1 enabled 7dc42260.738:418@fts-cms-06.cern.ch
220713 14:17:30 124 Xrootd_jobSendResult: sent async ok to 7dc42260.738:418@fts-cms-06.cern.ch
220713 14:17:30 124 Xrootd_jobXeq: Job chksum /xrootd/cms/store/data/Run2018A/EGamma/NANOAOD/UL2018_MiniAODv1_NanoAODv2_GT36-v1/70000/4DC52ADD-551B-7242-8334-B97E6348B997.root completed
220713 14:17:39 124 Xrd_Sched: running 7dc42260.94:352@fts-cms-04.cern.ch inq=0
220713 14:17:39 124 http_Protocol:  Cleanup
220713 14:17:39 124 XrdLink: Unable to send to 7dc42260.94:352@fts-cms-04.cern.ch; connection timed out
220713 14:17:39 124 http_Protocol:  SSL_shutdown failed
220713 14:17:39 124 http_Protocol:  Reset
220713 14:17:39 124 http_Req:  XrdHttpReq request ended.

@abh3
Member

abh3 commented Jul 14, 2022 via email

@kreczko
Author

kreczko commented Jul 14, 2022

Hi @abh3,

Of course:

Both redirector (xrootd.phy.bris.ac.uk:1094) and disk server (io-37-02.acrc.bris.ac.uk:1194) use the same config.

@xrootd-dev

xrootd-dev commented Jul 14, 2022 via email

@kreczko
Author

kreczko commented Jul 15, 2022

FYI - the SSL read error occurs on every HTTP connection. I haven't decided whether most HTTP clients close poorly or something at the XRootD level handles TLS incorrectly - but that read error alone might just be normal operation!

yes, we thought as much. Other sites are seeing things like this too.
We are more concerned with

220713 14:17:39 124 XrdLink: Unable to send to 7dc42260.94:352@fts-cms-04.cern.ch; connection timed out

as these sound exactly like the errors FTS is reporting:

INFO    Fri, 15 Jul 2022 13:27:46 +0200; [1657884466840] BOTH http_plugin	TRANSFER:EXIT	davs://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/mc/RunIISummer20UL17RECO/BsToKKMuMu_TuneCP5_13TeV-pythia8-evtgen/AODSIM/106X_mc2017_realistic_v6-v2/70000/C69EC049-4E1D-3240-82A0-E2476AE28F10.root => davs://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/mc/RunIISummer20UL17RECO/BsToKKMuMu_TuneCP5_13TeV-pythia8-evtgen/AODSIM/106X_mc2017_realistic_v6-v2/70000/C69EC049-4E1D-3240-82A0-E2476AE28F10.root
INFO    Fri, 15 Jul 2022 13:27:46 +0200; [1657884466840] DEST http_plugin	CHECKSUM:ENTER	
WARNING Fri, 15 Jul 2022 13:43:36 +0200; Timeout stopped
ERR     Fri, 15 Jul 2022 13:43:36 +0200; Recoverable error: [112] DESTINATION CHECKSUM (Neon): Could not read status line: Connection timed out

The transfer ends at 13:27:46 → FTS requests the checksum → HTTP over XRootD remains silent for 15 minutes → timeout.

Since CERN FTS is working for other sites, it must be something on our end.

@bbockelm
Contributor

You seem to be using the external checksum program, right?

We've seen some issues where a queue of checksum requests develops to the point where there's a standing backlog. Even when the average time to checksum a given file is modest, there are so many files in the queue that a timeout is triggered.

Is that a possibility here?

@kreczko
Author

kreczko commented Jul 15, 2022

Hi @bbockelm,

We previously used the xrootd-hdfs built-in implementation, but due to the timeouts we tried our own solution (since we can now store checksums in xattr). This did not help.

We allow 100 simultaneous checksums and, as far as I can see, neither the redirector nor the server ever comes close to that limit. 2 GB files take around 15 s, so the largest files (~8 GB) should not take much longer than 1 minute.
Nothing close to the 15 min FTS sees.
The checksum log is updated relatively slowly → not many requests per minute.
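Spelling out the arithmetic behind that estimate (a sketch using the ~2 GB / 15 s rate quoted above; the function name is illustrative):

```shell
# Rough checksum-time estimate from the ~2 GB in ~15 s rate observed here.
estimate_seconds() {
    # $1: file size in GB
    awk -v size="$1" 'BEGIN { printf "%d\n", size * 15 / 2 }'
}
```

So even an 8 GB file should finish in about 60 s - nowhere near a 15-minute timeout.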

It is hard to map 1:1, but it looks like we get a request at 13:37:30 and the checksum script returns a result at 13:37:35 (stored in xattr for this example), yet we still see a timeout.

Is there a way to follow a connection through the logs to get a more accurate reading on the situation?
How did you figure out the congestion on your side?

@bbockelm
Contributor

Ok, sounds like we can rule out the 'standing queue' as a possibility! Since I don't use checksum scripts, I've never debugged that myself -- @wyang007 might have more insight.

In that case, I'm a bit befuddled about where to turn next. I assume you have already eliminated the standard network problems? Are there any idle timeouts set in the server that might be terminating the connection unexpectedly?

It's not really a step toward a solution but, in terms of having fewer moving parts, can you reproduce this in a data-server-only setup? It's always hard to track the back-and-forth between the data server and the redirector; it would be nice to have the logs in one place.

@wyang007
Member

I don't have much insight beyond experience from experimenting with this. I use an external script because it lets me clearly see that checksum requests are piling up. My guess from what I see is that when many checksum requests come in, the checksum calculations are actually slow (CPU limit, or the disk prioritizing writes over reads? I don't know). If a checksum is running, xrootd (via HTTP or the xroot protocol) will periodically update FTS. But if a checksum is not in a running state (e.g. it exceeds the max checksums configured in xrootd), xrootd won't send FTS performance markers. In that case, FTS may time out the transfer after 1 minute (and in the FTS logs I saw, all checksum timeouts happened exactly after 1 minute). My only solution was to increase the max allowed checksums.

@kreczko
Author

kreczko commented Jul 19, 2022

@bbockelm

In that case, I'm a bit befuddled where to turn next. I assume that you have already eliminated various standard network problems?

Firewall, connection limits, latency, TCP wrappers - yes, we tested a lot of them.
A month ago we still had some transfer issues, as identified by our perfSONAR (a separate server on the same network), but fixing those upstream did not change the checksum issues.

Are there any idle timeouts in the server you've set that might be terminating the connection unexpectedly?

We are back to using defaults - could you please name any particular settings you would recommend we test?

It's not really a step toward solution but, in terms of having fewer moving parts, can you reproduce in a data-server-only setup?

We can try stand-alone again, but I expect the current redirector (we would use that one, since it is registered everywhere) to struggle with the load (it is a tiny VM meant only for redirecting). I will synchronize the configs and we can test it for a few hours.

@wyang007

My GUESS from what I see is that when many checksum requests coming in, checksum calculation are actually slow (CPU limit or HD's priority to write over read? don't know).

We currently allow 100 simultaneous checksums (a limit we have not yet reached). The data is read over the network and, as far as I can see, we do not hit any hardware limits (CPU, RAM, network, disk).

In that case, FTS may timeout the transfer after 1 minute (and from the FTS logs I saw, all checksum timeout happened exactly after 1 minute).

Where do you see that? In the FTS logs I see a timeout after 15 minutes, not 1.
For very big files (e.g. 8 GB) we would expect the checksum to take a bit longer than 60 s, but for most of the failing files the checksum finishes within 20 s.

@abh3
Member

abh3 commented Jul 19, 2022

Frankly, my take is that you are overestimating what the file system can do. Also, my take is that the FTS checksum timeout of one minute is simply wrong. The combination makes this unworkable. The FTS timeout should be at least 3 minutes. Let me work with the FTS group to see what we can do here.

@kreczko
Author

kreczko commented Jul 19, 2022

@abh3

Frankly, my take is that you are over estimating what the file system can do

Possibly. But before we switched to xrootd-only, we ran DMLite+HDFS on the same hardware, with no checksum issues or general timeouts.

Deletions would occasionally stress the server when done via GridFTP, but worked fine via HTTP - the limitation has always been on the server side, not in the file system.

Both redirector and server are at low load (< 50% of cores) and are otherwise not very busy - nor is the file system.
That is the main reason I cannot understand the current issues.
I hope it is just a misconfiguration on our side, or something with a workaround.

@mpatrascoiu
Contributor

Hello,

I will upload the FTS log file here, as log files older than 5 days can no longer be viewed in the FTS Web Monitoring service.

The FTS log lines show a 15-minute timeout, not a one-minute one.

Something else, mentioned by @wyang007: I don't think the statement about checksums and performance-marker updates is correct. FTS receives performance markers during the transfer part, but not for the checksum. The checksum is a request-and-wait operation.

FTS (in fact, this is all done at the Gfal2 level) requests the checksum to be computed at the end of the transfer part. The checksum computation is included in the copy(..) operation. FTS decides the timeout for this according to the file size. In our case:

INFO    Tue, 12 Jul 2022 10:12:08 +0200; Timeout set to: 7049

All other HTTP operations have absurdly long timeouts, so we never hit an operation timeout. The 15 minute timeout we see here comes from libneon (HTTP library employed by Davix).

I am curious if you are able to reproduce the issue with Davix and/or Gfal2:

$ davix-http -X HEAD -H "Want-Digest: adler32" -H "Authorization: Bearer ${TOKEN}" <surl>

or

$ gfal-copy -vvv --log-file=gfal2.log --checksum-mode target --checksum adler32:<checksum> <src> <dst> 

2022-07-12-0829__cmsxrootd.hep.wisc.edu__xrootd.phy.bris.ac.uk__3394306095__5055e2c2-01ba-11ed-845d-fa163eddde58.log

@kreczko
Author

kreczko commented Jul 21, 2022

@mpatrascoiu thank you for the useful info.

FTS (in fact, this is all done at the Gfal2 level) requests the checksum to be computed at the end of the transfer part. The checksum computation is included in the copy(..) operation

Is there a way to verify this happens? The xrootd-hdfs plugin used to save checksums in separate files (/cksums), but the new plugin stores them in xattr, which should be faster.

From the logs I see

220714 00:00:22 4458 7dc42260.0 ofs_close: use=1 fn=/xrootd/cms/store/mc/Run3Summer21DRPremix/TT_TuneCP5_14TeV-powheg-pythia8/GEN-SIM-DIGI-RAW/120X_mcRun3_2021_realistic_v6_ext1-v5/40000/3d60585b-0bb8-42d5-993b-3115f168d728.root
<snip>
220714 00:00:22 7229 7dc42260.4100:387@fts-cms-09.cern.ch Xrootd_Protocol: 0000 Bridge waiting for resp; sid=0 wt=600 (Job scheduled.)
220714 00:00:22 7229 Xrd_Sched: running chksum inq=0
220714 00:00:22 7229 Xrootd_jobXeq: Job chksum /xrootd/cms/store/mc/Run3Summer21DRPremix/TT_TuneCP5_14TeV-powheg-pythia8/GEN-SIM-DIGI-RAW/120X_mcRun3_2021_realistic_v6_ext1-v5/40000/3d60585b-0bb8-42d5-993b-3115f168d728.root started

# checksum logs
2022-07-14 00:00:23,337 [xrdsum]  DEBUG: No checksum found in metadata (user.xrdsum.adler32) for file /xrootd/cms/store/mc/Run3Summer21DRPremix/TT_TuneCP5_14TeV-powheg-pythia8/GEN-SIM-DIGI-RAW/120X_mcRun3_2021_realistic_v6_ext1-v5/40000/3d60585b-0bb8-42d5-993b-3115f168d728.root: At least one of the attributes provided was not found.
2022-07-14 00:00:23,346 [xrdsum]  DEBUG: Reading /xrootd/cms/store/mc/Run3Summer21DRPremix/TT_TuneCP5_14TeV-powheg-pythia8/GEN-SIM-DIGI-RAW/120X_mcRun3_2021_realistic_v6_ext1-v5/40000/3d60585b-0bb8-42d5-993b-3115f168d728.root in chunks of 134217728 bytes (out of 3682337619)
2022-07-14 00:00:59,638 [xrdsum]  DEBUG: No checksum found in metadata (user.xrdsum.adler32) for file /xrootd/cms/store/mc/Run3Summer21DRPremix/TT_TuneCP5_14TeV-powheg-pythia8/GEN-SIM-DIGI-RAW/120X_mcRun3_2021_realistic_v6_ext1-v5/40000/3d60585b-0bb8-42d5-993b-3115f168d728.root: At least one of the attributes provided was not found.

Which looks like it is doing the right thing.

So there should be no reason for FTS not to get a response quickly.
I will add more debug info.
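As a cross-check on the log above, the chunked read implies a predictable number of reads (ceiling division on the two logged values; the helper name is illustrative):

```shell
# Chunks needed to read the 3682337619-byte file above in
# 134217728-byte (128 MiB) chunks: ceiling division.
chunks_needed() {
    # $1: file size in bytes, $2: chunk size in bytes
    echo $(( ($1 + $2 - 1) / $2 ))
}
```

28 chunks read between 00:00:23 and 00:00:59 (~36 s) is consistent with the sub-minute checksum times reported elsewhere in this thread.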

@kreczko
Author

kreczko commented Jul 21, 2022

Note: for some reason the redirector also calculates checksums from time to time.
Here things take a little longer, but still below 1 minute:

2022-07-21 16:54:25,230 [xrdsum]  DEBUG: No checksum found in metadata (user.xrdsum.adler32) for file /xrootd/cms/store/mc/RunIISummer20UL16RECO/BsToKKMuMu_TuneCP5_13TeV-pythia8-evtgen/AODSIM/106X_mcRun2_asymptotic_v13-v2/70000/4AB754E5-7109-D74C-99F9-91FD76BA0972.root: At least one of the attributes provided was not found.
2022-07-21 16:54:25,230 [xrdsum]  TIMING: HDFS get_xattr took 0.006s for /xrootd/cms/store/mc/RunIISummer20UL16RECO/BsToKKMuMu_TuneCP5_13TeV-pythia8-evtgen/AODSIM/106X_mcRun2_asymptotic_v13-v2/70000/4AB754E5-7109-D74C-99F9-91FD76BA0972.root
2022-07-21 16:54:25,242 [xrdsum]  DEBUG: Reading /xrootd/cms/store/mc/RunIISummer20UL16RECO/BsToKKMuMu_TuneCP5_13TeV-pythia8-evtgen/AODSIM/106X_mcRun2_asymptotic_v13-v2/70000/4AB754E5-7109-D74C-99F9-91FD76BA0972.root in chunks of 134217728 bytes (out of 2832177462)
2022-07-21 16:54:56,894 [xrdsum]  TIMING: HDFS checksum calculation took 31.664s for /xrootd/cms/store/mc/RunIISummer20UL16RECO/BsToKKMuMu_TuneCP5_13TeV-pythia8-evtgen/AODSIM/106X_mcRun2_asymptotic_v13-v2/70000/4AB754E5-7109-D74C-99F9-91FD76BA0972.root
2022-07-21 16:54:56,894 [xrdsum]  TIMING: HDFS checksum took 31.684s for /xrootd/cms/store/mc/RunIISummer20UL16RECO/BsToKKMuMu_TuneCP5_13TeV-pythia8-evtgen/AODSIM/106X_mcRun2_asymptotic_v13-v2/70000/4AB754E5-7109-D74C-99F9-91FD76BA0972.root
2022-07-21 16:54:56,902 [xrdsum]  DEBUG: No checksum found in metadata (user.xrdsum.adler32) for file /xrootd/cms/store/mc/RunIISummer20UL16RECO/BsToKKMuMu_TuneCP5_13TeV-pythia8-evtgen/AODSIM/106X_mcRun2_asymptotic_v13-v2/70000/4AB754E5-7109-D74C-99F9-91FD76BA0972.root: At least one of the attributes provided was not found.
2022-07-21 16:54:56,957 [xrdsum]  TIMING: Storing checksum took 0.063s for /xrootd/cms/store/mc/RunIISummer20UL16RECO/BsToKKMuMu_TuneCP5_13TeV-pythia8-evtgen/AODSIM/106X_mcRun2_asymptotic_v13-v2/70000/4AB754E5-7109-D74C-99F9-91FD76BA0972.root

I cannot find a similar entry on the disk server, but I can see the entries in the xrootd.log.

What are the rules here? Why is the redirector doing any checksumming?
Both machines have the lines

ofs.tpc cksum adler32 fcreds ?gsi =X509_USER_PROXY autorm xfr 40 pgm /etc/xrootd/xrdcp-tpc.sh
xrootd.chksum max 100 adler32 /etc/xrootd/xrdsum.sh

As an experiment, I've now removed these lines from the redirector.

@kreczko
Author

kreczko commented Aug 4, 2022

Since the mentioned change, the redirector no longer calculates checksums → good.
But FTS still seems to be failing. The current example:

INFO    Thu, 04 Aug 2022 13:06:24 +0200; [1659611184764] BOTH http_plugin	TRANSFER:EXIT	davs://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/mc/RunIISummer20UL16RECOAPV/ST_t-channel_antitop_4f_InclusiveDecays_wtop1p45_TuneCP5_fixWidth_13TeV-powheg-madspin-pythia8/AODSIM/106X_mcRun2_asymptotic_preVFP_v8-v2/40001/25D27F31-FF0C-4346-8A5E-286FE74440E9.root => davs://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/mc/RunIISummer20UL16RECOAPV/ST_t-channel_antitop_4f_InclusiveDecays_wtop1p45_TuneCP5_fixWidth_13TeV-powheg-madspin-pythia8/AODSIM/106X_mcRun2_asymptotic_preVFP_v8-v2/40001/25D27F31-FF0C-43
INFO    Thu, 04 Aug 2022 13:06:24 +0200; [1659611184764] DEST http_plugin	CHECKSUM:ENTER	
ERR     Thu, 04 Aug 2022 13:22:14 +0200; Recoverable error: [112] DESTINATION CHECKSUM (Neon): Could not read status line: Connection timed out

which happened only 10 minutes before writing this.

Out of curiosity, I ran gfal-sum on lxplus:

gfal-sum davs://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/mc/RunIISummer20UL16RECOAPV/ST_t-channel_antitop_4f_InclusiveDecays_wtop1p45_TuneCP5_fixWidth_13TeV-powheg-madspin-pythia8/AODSIM/106X_mcRun2_asymptotic_preVFP_v8-v2/40001/25D27F31-FF0C-4346-8A5E-286FE74440E9.root adler32

→ the first query took a while → the checksum had not been calculated at the end of the transfer.

However, the timing information shows that calculating the checksum took 53 seconds → the 15-minute timeout should not have happened:

2022-08-04 12:28:31,463 [xrdsum]  TIMING: HDFS checksum calculation took 53.469s for /xrootd/cms/store/mc/RunIISummer20UL16RECOAPV/ST_t-channel_antitop_4f_InclusiveDecays_wtop1p45_TuneCP5_fixWidth_13TeV-powheg-madspin-pythia8/AODSIM/106X_mcRun2_asymptotic_preVFP_v8-v2/40001/25D27F31-FF0C-4346-8A5E-286FE74440E9.root

Is there a hook in xrootd to force the calculation of a checksum after a successful transfer?
In other words, to trigger a checksum calculation job even if someone copies with gfal-copy --just-copy.

@abh3
Member

abh3 commented Aug 23, 2022

There is no "hook" to automatically calculate the checksum. Among other reasons it's not always clear that the transfer is an actual copy operation. So, xrootd simply requires that the person doing the copy ask for the checksum to be calculated. Anyway, doing it automatically wouldn't necessarily avoid a timeout.

@kreczko
Author

kreczko commented Sep 1, 2022

In GGUS ticket 156902, Bockjoo discovered a possibly related TLS problem:

(base) [bockjoo@cms ~]$ xrdcp -d 1 -f root://xrootd.phy.bris.ac.uk//store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root /dev/null
[2022-08-31 07:46:13.648599 -0400][Info   ][AsyncSock         ] [xrootd.phy.bris.ac.uk:1094.0] TLS hand-shake done.
[2022-08-31 07:46:13.766152 -0400][Error  ][TlsMsg            ] [] TLS error rc=-1 ec=5 (error_syscall) errno=104.
[2022-08-31 07:46:13.766192 -0400][Error  ][AsyncSock         ] [xrootd.phy.bris.ac.uk:1094.0] Socket error while handshaking: [ERROR] TLS error
[2022-08-31 07:46:13.766299 -0400][Error  ][PostMaster        ] [xrootd.phy.bris.ac.uk:1094] elapsed = 0, pConnectionWindow = 120 seconds.
[2022-08-31 07:46:13.766313 -0400][Info   ][PostMaster        ] [xrootd.phy.bris.ac.uk:1094] Attempting reconnection in 120 seconds.
^C
(base) [bockjoo@cms ~]$ xrdcp -d 1 -f root://xrootd.phy.bris.ac.uk//store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root /dev/null
[2022-08-31 07:47:12.718047 -0400][Info   ][AsyncSock         ] [xrootd.phy.bris.ac.uk:1094.0] TLS hand-shake done.
[2022-08-31 07:47:17.637166 -0400][Info   ][AsyncSock         ] [io-37-02.acrc.bris.ac.uk:1194.0] TLS hand-shake done.
[8MB/229.3MB][  3%][=>                                                ][1.143MB/s[144MB/229.3MB][ 62%][===============================>                  ][16MB/s]

It seems the redirection, xrootd.phy → io-37-02, does not always work - it can fail at the TLS handshake.
In our configuration, both redirector and server should accept HTTPS.
If I focus on FTS hosts, I can see 122377 mentions in /var/log/clustered/xrootd.log on xrootd.phy and only 7669 on io-37-02.
That roughly corresponds to the success rate in the FTS monitoring.
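The manual count above can be scripted (a sketch: the fts-cms pattern and log path are from this thread, the function name is illustrative):

```shell
# Count FTS-host mentions in an xrootd log file. A much larger count on
# the redirector than on the disk server suggests connections die before
# the redirection completes.
count_fts_mentions() {
    # $1: log file, e.g. /var/log/clustered/xrootd.log
    grep -c 'fts-cms' "$1" || true
}
```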

Is this something worth focusing on?
I've enabled xrd.trace tls -debug for now.

@kreczko
Author

kreczko commented Sep 8, 2022

After exchanging configs, we removed the xrd.tls and xrd.tlsca from the redirector and replaced them with the http.cadir, http.cert, and http.key settings.
This seems to have decreased our FTS transfer failure rate from ≈94% to ≈38% (in the last two hours).
The mentions of fts-cms in /var/log/clustered/xrootd.log in redirector vs server have improved as well (more getting through).

So the updated config seems to be a step in the right direction.

I still see 15 min timeouts (FTS logs) after a successful transfer (checksum request):

TRANSFER [112] DESTINATION CHECKSUM (Neon): Could not read status line: Connection timed out

and I see (send failure) and (link SSL read error) in the redirector and server logs respectively.
But I also see them for requests that I know went well (e.g. ETF tests for CMS).

I've now also updated to XRootD 5.5.0 (after confirming that the config change went in the right direction).

@abh3
Member

abh3 commented Sep 9, 2022 via email

@kreczko kreczko changed the title Understanding xrootd -> FTS timeout errors Understanding xrootd -> FTS timeout errors (Third Party Copy - HTTP) Sep 29, 2022
@kreczko
Author

kreczko commented Sep 29, 2022

xrd.tls and xrd.tlsca are treated identically to their http counterparts

Yes, I can confirm that. Yesterday I reset everything back to xrd.tls* without any issues.

Our university network team also had a look and identified a common theme:

The resets happen after the redirector sends a FIN to the FTS server. I would expect the FTS server to stop sending data/requests to the redirector at that stage, but it continues to send requests. The redirector resets (RST) any further connection attempt from the FTS server after it has sent the FIN.

In our logs this looks like

# server
220928 14:10:37 5253 XrdLink: Unable to send to 7dc42260.360:500@fts-cms-08.cern.ch; connection timed out
220928 14:10:37 5253 XrootdXeq: 7dc42260.360:500@fts-cms-08.cern.ch disc 2:31:53 (socket error)

# redirector
220928 14:10:53 065 7dc42260.3507:661@fts-cms-09.cern.ch Xrootd_Protocol: redirecting to io-37-02.acrc.bris.ac.uk:1194
220928 14:10:53 065 XrootdXeq: 7dc42260.3507:661@fts-cms-09.cern.ch disc 0:00:00 (send failure)

We've tried to tune TPC with

ofs.tpc logok cksum adler32 fcreds ?gsi =X509_USER_PROXY autorm ttl 60 70 xfr 100 pgm /etc/xrootd/xrdcp-tpc.sh

but we were told that this has nothing to do with HTTP TPC - so we removed it yesterday.
We still get timeouts 80% of the time.

I also had a look at limits.conf and sysctl settings, but AFAIK there is nothing limiting the connections.
Do you have an idea what to check next?

@abh3
Member

abh3 commented Sep 29, 2022

So, the message "connection timed out" appears to be coming from OpenSSL, saying that it took too long to do the SSL handshake. The question is when that message is generated. What we need is all messages issued w.r.t. 7dc42260.360:500@fts-cms-08.cern.ch on the server. So, simply grep the log for this and send it over.

As for what happens in the redirector, it seems totally unrelated. While the server is complaining about 7dc42260.360:500@fts-cms-08.cern.ch, the redirector is complaining about 7dc42260.3507:661@fts-cms-09.cern.ch, which is a completely different client (i.e. fts-cms-08 for the server and fts-cms-09 for the redirector). However, please also grep the redirector log for all messages about 7dc42260.3507:661@fts-cms-09.cern.ch and send that over as well.

@kreczko
Author

kreczko commented Sep 29, 2022

Hi @abh3,

On the server, grepping for 7dc42260.360:500@fts-cms-08.cern.ch:

# grep 7dc42260.360:500@fts-cms-08.cern.ch /var/log/xrootd/clustered/xrootd.log.20220928 
220928 11:38:44 2727 XrootdBridge: 7dc42260.360:500@fts-cms-08.cern.ch login as 7dc42260.0
220928 11:38:44 2727 7dc42260.360:500@fts-cms-08.cern.ch Xrootd_Protocol: rc=0 stat /xrootd/cms/store/data/Run2017G/SingleMuon/MINIAOD/15Feb2022_UL2017-v1/40000/B6823EDA-07F3-C442-946F-74B46C1FA94F.root
220928 11:38:44 2736 7dc42260.360:500@fts-cms-08.cern.ch Xrootd_Protocol: rc=0 stat /xrootd/cms/store/data/Run2017G/SingleMuon/MINIAOD/15Feb2022_UL2017-v1/40000/B6823EDA-07F3-C442-946F-74B46C1FA94F.root
220928 11:38:44 2153 7dc42260.360:500@fts-cms-08.cern.ch Xrootd_Protocol: rc=0 stat /xrootd/cms/store/data/Run2017G/SingleMuon/MINIAOD/15Feb2022_UL2017-v1/40000/B6823EDA-07F3-C442-946F-74B46C1FA94F.root
220928 11:38:44 2153 7dc42260.360:500@fts-cms-08.cern.ch Xrootd_Protocol: rc=0 rm /xrootd/cms/store/data/Run2017G/SingleMuon/MINIAOD/15Feb2022_UL2017-v1/40000/B6823EDA-07F3-C442-946F-74B46C1FA94F.root
220928 11:38:44 2076 7dc42260.360:500@fts-cms-08.cern.ch Xrootd_Protocol: rc=0 stat /xrootd/cms/store/data/Run2017G/SingleMuon/MINIAOD/15Feb2022_UL2017-v1/40000
220928 14:10:37 5253 XrdLink: Unable to send to 7dc42260.360:500@fts-cms-08.cern.ch; connection timed out
220928 14:10:37 5253 XrootdXeq: 7dc42260.360:500@fts-cms-08.cern.ch disc 2:31:53 (socket error)

Note that the transfer of the file mentioned at 11:38 seems to have completed at 11:59:

220928 11:59:19 2694 TPC_PullRequest: event=TRANSFER_SUCCESS, local=/xrootd/cms/store/data/Run2017G/SingleMuon/MINIAOD/15Feb2022_UL2017-v1/40000/B6823EDA-07F3-C442-946F-74B46C1FA94F.root, remote=https://xrootd-vanderbilt.sites.opensciencegrid.org:1094/store/data/Run2017G/SingleMuon/MINIAOD/15Feb2022_UL2017-v1/40000/B6823EDA-07F3-C442-946F-74B46C1FA94F.root, user=7dc42260.0, bytes_transferred=3635831365, tpc_status=200

On the redirector, grepping for 7dc42260.3507:661@fts-cms-09.cern.ch:

# grep 7dc42260.3507:661@fts-cms-09.cern.ch /var/log/xrootd/clustered/xrootd.log.20220928 
220928 14:10:53 065 XrootdBridge: 7dc42260.3507:661@fts-cms-09.cern.ch login as 7dc42260.0
220928 14:10:53 033 cms_Decode: xrootd redirects 7dc42260.3507:661@fts-cms-09.cern.ch to io-37-02.acrc.bris.ac.uk:1194 /xrootd/cms/store/mc/RunIISummer20UL17RECO/NMSSM_XToYHTo2G2Tau_MX-2800_MY-2500_TuneCP5_13TeV-madgraph-pythia8/AODSIM/106X_mc2017_realistic_v6-v2/2550000/612F6A04-59B1-3247-9F49-57EE4F9D6A19.root
220928 14:10:53 065 7dc42260.3507:661@fts-cms-09.cern.ch Xrootd_Protocol: rc=-256 stat /xrootd/cms/store/mc/RunIISummer20UL17RECO/NMSSM_XToYHTo2G2Tau_MX-2800_MY-2500_TuneCP5_13TeV-madgraph-pythia8/AODSIM/106X_mc2017_realistic_v6-v2/2550000/612F6A04-59B1-3247-9F49-57EE4F9D6A19.root
220928 14:10:53 065 7dc42260.3507:661@fts-cms-09.cern.ch Xrootd_Protocol: redirecting to io-37-02.acrc.bris.ac.uk:1194
220928 14:10:53 065 XrootdXeq: 7dc42260.3507:661@fts-cms-09.cern.ch disc 0:00:00 (send failure)

Full logs for that day can be found here:

kreczko (Author) commented Sep 30, 2022

It seems someone has enabled debug for FTS transfers. Thank you!

The gfal checksum operation seems to

  1. find a bearer token (Found token in credential_map)
  2. find a cached ne_session (cached ne_session found ! taken from cache )
  3. send checksum request (Sending request-line and headers:) via Davix
  4. timeout after 950s [start 14:22:32, end 14:38:22] (Aborted request (-1): Could not read status line)

On my end, I do not see this checksum request, only the stat and rm (cleanup) after the timeout.

I looked through other issues, and I do not have http.tlsreuse in any of the configs.
I am running xrootd 5.5.0 on CentOS 7.9 with openssl 1.0.2k-25.el7_9.
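As an aside, the ADLER32 value that the `Want-Digest: ADLER32` header asks the endpoint to report is cheap to compute, so a 15-minute stall points at the connection rather than the digest itself. A minimal sketch that shells out to Python's zlib (the file path is illustrative):

```shell
# Write a small sample file and compute its ADLER32, the digest the
# server is expected to return for Want-Digest: ADLER32.
printf 'hello' > /tmp/demo.root
python3 -c 'import sys, zlib
print(format(zlib.adler32(open(sys.argv[1], "rb").read()) & 0xffffffff, "08x"))' /tmp/demo.root
# prints 062c0215
```

On a real storage element one would run gfal-sum or xrdadler32 against the actual replica instead; this only illustrates what the checksum step has to produce.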

CHECKSUM:ENTER details
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Event triggered: DESTINATION http_plugin CHECKSUM:ENTER 
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Impossible to get string_list parameter HTTP PLUGIN:HEADERS, set to a default value (null), err Key file does not have key “HEADERS” in group “HTTP PLUGIN”
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Impossible to get integer parameter HTTP PLUGIN:OPERATION_TIMEOUT, set to default value 8000, err Key file does not have key “OPERATION_TIMEOUT” in group “HTTP PLUGIN”
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Using client X509 for HTTPS session authorization
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix: ssl: Match common name '1664259124' against ''
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix: Identity match for '': bad
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix: ssl: Match common name '1664259124' against ''
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix: Identity match for '': bad
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; (SEToken) Found token in credential_map[davs://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/mc/Run3Summer22GS/MinBias_TuneCP5_13p6TeV-pythia8/GEN-SIM/124X_mcRun3_2022_realistic_v10-v1/40006/1d37f646-1a9d-4aa7-ad26-58c0d5b631ba.root] (access=write) (needed=read)
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Using bearer token for HTTPS request authorization
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix:  -> checksum
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix: Create HttpRequest for davs://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/mc/Run3Summer22GS/MinBias_TuneCP5_13p6TeV-pythia8/GEN-SIM/124X_mcRun3_2022_realistic_v10-v1/40006/1d37f646-1a9d-4aa7-ad26-58c0d5b631ba.root
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix:  -> executeRequest
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix:  -> negotiateRequest
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix: NEON start internal request
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix: cached ne_session found ! taken from cache 
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix: configure session...
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix: define connection timeout to 30
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix: define operation timeout to 1800
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix: add CA PATH /etc/grid-security/certificates/
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix: disable login/password authentication
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix: enable client cert authentication by callback 
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix: Running pre_send hooks
DEBUG   Fri, 30 Sep 2022 14:22:32 +0200; Davix: > HEAD /xrootd/cms/store/mc/Run3Summer22GS/MinBias_TuneCP5_13p6TeV-pythia8/GEN-SIM/124X_mcRun3_2022_realistic_v10-v1/40006/1d37f646-1a9d-4aa7-ad26-58c0d5b631ba.root HTTP/1.1
> User-Agent: fts_url_copy/3.12.1 gfal2/2.21.0 neon/0.0.29
> TE: trailers
> Host: xrootd.phy.bris.ac.uk:1094
> Want-Digest: ADLER32
> ClientInfo: job-id=9737ef08-40b9-11ed-af88-fa163e36d89b;file-id=3450963706;retry=0
> Authorization: Bearer MDAyNGxvY2F0aW9uIFVLSS1TT1VUSEdSSUQtQlJJUy1IRVAKMDAzNGlkZW50aWZpZXIgMzExOTZhYjctYjIwNC00ZDQ2LWJiZmYtZDdlYmQ4NmU5OTFhCjAwMThjaWQgbmFtZTo3ZGM0MjI2MC4wCjAwNTJjaWQgYWN0aXZpdHk6UkVBRF9NRVRBREFUQSxVUExPQUQsRE9XTkxPQUQsREVMRVRFLE1BTkFHRSxVUERBVEVfTUVUQURBVEEsTElTVAowMDM0Y2lkIGFjdGl2aXR5OkxJU1QsRE9XTkxPQUQsTUFOQUdFLFVQTE9BRCxERUxFVEUKMDBhYmNpZCBwYXRoOi94cm9vdGQvY21zL3N0b3JlL21jL1J1bjNTdW1tZXIyMkdTL01pbkJpYXNfVHVuZUNQNV8xM3A2VGVWLXB5dGhpYTgvR0VOLVNJTS8xMjRYX21jUnVuM18yMDIyX3JlYWxpc3RpY192MTAtdjEvNDAwMDYvMWQzN2Y2NDYtMWE5ZC00YWE3LWFkMjYtNThjMGQ1YjYzMWJhLnJvb3QKMDAyNGNpZCBiZWZvcmU6MjAyMi0wOS0zMFQxNToxNTo0MVoKMDAyZnNpZ25hdHVyZSCzrWofC-zAm7z_c3UBg9cYpdofgECNBIfNceJX15a2pgo
> 

DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: Sending request-line and headers:
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: Request sent; retry is 1.
DEBUG Fri, 30 Sep 2022 14:38:22 +0200; Davix: Aborted request (-1): Could not read status line

Full FTS debug log

kreczko (Author) commented Oct 5, 2022

I've done some more testing and I can see some resemblance to #1252
With http.tlsreuse missing or on, I can replicate the error seen on FTS.
If I set export DAVIX_DISABLE_SESSION_CACHING=true, the command works.
Same is true if I set http.tlsreuse off (with export DAVIX_DISABLE_SESSION_CACHING=false):

| client | server | works? |
| --- | --- | --- |
| `DAVIX_DISABLE_SESSION_CACHING=false` | `http.tlsreuse` missing | no |
| `DAVIX_DISABLE_SESSION_CACHING=false` | `http.tlsreuse on` | no |
| `DAVIX_DISABLE_SESSION_CACHING=true` | `http.tlsreuse` missing | yes |
| `DAVIX_DISABLE_SESSION_CACHING=true` | `http.tlsreuse on` | yes |
| `DAVIX_DISABLE_SESSION_CACHING=false` | `http.tlsreuse off` | yes |

I am pretty sure I also had a version of the config with http.tlsreuse off a few months back that did not work.
It will take a few days to check if FTS is happy now.
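For reference, one client-side cell of the matrix above can be exercised as below; shown as a dry run with the paths abbreviated, since the real endpoints need valid grid credentials (a sketch, not the exact FTS invocation):

```shell
# Client side of the matrix: disable Davix's TLS session cache, then
# run the same gfal-copy as in the manual tests. Echoed as a dry run
# because the grid endpoints require valid credentials.
export DAVIX_DISABLE_SESSION_CACHING=true
CMD="gfal-copy -p -v -f -K adler32 --checksum-mode both --copy-mode pull \
davs://eoscms.cern.ch:443/eos/cms/store/... \
davs://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/..."
echo "$CMD"
```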

  • Command used:
gfal-copy -p -v -f -K adler32 --checksum-mode both --copy-mode pull davs://eoscms.cern.ch:443/eos/cms/store/data/Run2018B/ScoutingMonitor/RAW/v1/000/317/475/00000/F0A5417E-ED68-E811-B824-FA163ED2A73C.root davs://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/user/kreczko/FDCF2C44-F6CA-FB47-8E01-1D145FD01B31.root.DESTINATION+CHECKSUM++Neon
  • gfal version:
gfal-copy --version
gfal2-util version 1.8.0 (gfal2 2.21.0)
	dcap-2.21.0
	file-2.21.0
	gridftp-2.21.0
	http-2.21.0
	lfc-2.21.0
	mock-2.21.0
	rfio-2.21.0
	sftp-2.21.0
	srm-2.21.0
	xrootd-2.21.0
  • xrootd version on client side: 5.4.3
  • xrootd version on storage element (redirector + server): 5.5.0
  • openssl version on storage element (redirector + server): 1:1.0.2k-25

Update (EDIT)

FTS transfers are still failing due to timeout

ccaffy (Contributor) commented Nov 9, 2022

OK thanks. Can you please try with no header at all except the Source one? :)

matyasselmeci (Contributor):

Tried it both ways (once without the Overwrite header, and once with "Overwrite: F"). Same results.

kreczko (Author) commented Nov 10, 2022

> If that's the case, I got the similar issue with a "redirector-server setup". I had to tell Davix to not use the redirection cache because Davix sent a HEAD request on the redirected server instead of the headnode....

That would explain the errors and why I do not see the checksum requests at the redirector.
I still don't understand why ~5% of transfers succeed and manual tests are fine.

Is there anything I can do on the xrootd side (e.g. config)?

abh3 (Member) commented Nov 10, 2022

What do you mean "manual tests are fine"? How do they differ from the ones that fail?

kreczko (Author) commented Nov 10, 2022

> What do you mean "manual tests are fine"? How do they differ from the ones that fail?

Manual tests

By manual tests, I mean myself running

gfal-copy -p -v -f -K adler32 --checksum-mode both --copy-mode pull davs://eoscms.cern.ch:443/eos/cms/store/data/Run2018B/ScoutingMonitor/RAW/v1/000/317/475/00000/F0A5417E-ED68-E811-B824-FA163ED2A73C.root davs://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/user/kreczko/FDCF2C44-F6CA-FB47-8E01-1D145FD01B31.root.DESTINATION+CHECKSUM++Neon

I can make these fail by setting tlsreuse on, etc., as described in #1736 (comment),
and they would fail consistently before the most recent updates.

Most recently I've tried different xrootd versions, which also updated some dependencies - now this works too.

Automated transfers

I compare these to automated FTS transfers that, to the best of my knowledge, execute the same command.
The automated transfers currently fail 95% of the time due to the session cache issue (the transfer is OK, but the checksum request right after fails).
I do not see any correlation with site, protocol, or FTS server used for the successes or the failures.

ccaffy (Contributor) commented Nov 10, 2022

> Tried it both ways (once without the Overwrite header, and once with "Overwrite: F"). Same results.

OK thanks, I could now reproduce on my box... :)
Will fix this ASAP

ccaffy (Contributor) commented Nov 10, 2022

I created the PR to fix the issue: #1826

matyasselmeci (Contributor):

I applied your PR as a patch and verified that it fixed the issue for us. Thanks!

ccaffy (Contributor) commented Nov 14, 2022

That's great news. Thanks for your quick feedback :)

kreczko (Author) commented Nov 14, 2022

@ccaffy Now that the other issue is done, could you please elaborate on

> If that's the case, I got the similar issue with a "redirector-server setup". I had to tell Davix to not use the redirection cache because Davix sent a HEAD request on the redirected server instead of the headnode....

I've summarized what you said and what we observe in this graph:

```mermaid
sequenceDiagram
    autonumber
    davix client->>+redirector: Do a TPC (pull)
    redirector->>+server: Do a TPC (pull)
    server->>+davix client: will do
    loop status update
        server->>davix client: transferred x bytes out of N
    end
    server->>+davix client: copy is done
    note right of davix client: Let's reuse session cache
    davix client->>+server: give me the checksum
    note left of server: unknown connection (not from redirector)
    note right of davix client: timeout after 15min
```

There are two unknowns to me:

  1. Why do the transfers succeed ~5% of the time (or when I run them manually)?
  2. Is there a way to change the behaviour on the server side (tlsreuse, full auth on the server, etc.)?

The random successes or low-volume, successful tests are the most confusing bits.
The reason I am curious about 2. is that disabling the session cache on the client is out of my reach for the FTS.
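Since the matrix above shows that `http.tlsreuse off` on the server made transfers work regardless of the client's cache setting, one server-side option is to disable TLS session reuse in the xrootd HTTP configuration. A minimal fragment (the rest of the config is omitted; this is a sketch of the knob discussed in this thread, not a recommended production setting):

```
# xrootd config fragment: do not resume cached TLS sessions for HTTPS
# clients, forcing a full handshake on each new connection (e.g. the
# checksum HEAD request that follows a TPC transfer).
http.tlsreuse off
```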

My full config is here:

ccaffy (Contributor) commented Nov 14, 2022

Can you ask your davix client to talk to the redirector, instead of the server directly, in order to request the checksum?

kreczko (Author) commented Nov 14, 2022

> Can you ask your davix client to talk to the redirector instead of the server directly in order to ask the checksum?

The davix client only knows about the redirector - that's the point of entry.
But at some point the server responds, and davix uses that for the subsequent checksum request.
The diagram is what I think happens (based on observation and your comment) - but I do not know for sure.

I can confirm that the redirector does not see the checksum request in 95% of the cases (the failures) - so what you said would make sense.

ccaffy (Contributor) commented Nov 14, 2022

OK thanks. Now, if you do export DAVIX_DISABLE_REDIRECT_CACHING=true before launching Davix, what is the result?

kreczko (Author) commented Nov 14, 2022

@ccaffy Where would I set this? On the redirector, server, or both?

EDIT: For the client side I've tested these: #1736 (comment)

ccaffy (Contributor) commented Nov 14, 2022

On the client side, if you set export DAVIX_DISABLE_REDIRECT_CACHING=True, does it work?

kreczko (Author) commented Nov 14, 2022

> On the client side, if you set export DAVIX_DISABLE_REDIRECT_CACHING=True, does it work?

Manual tests work even without it right now:

gfal-copy -p -v -f -K adler32 --checksum-mode both --copy-mode pull davs://<src> davs://xrootd.phy.bris.ac.uk:1094/xrootd/cms/<dst>

With the latest upgrades (see #1736 (comment)) things started to work for the manual tests.

The diff between the logs for unmodified and DAVIX_DISABLE_REDIRECT_CACHING=True: the latter does not contain the line

DEBUG Davix: Add cached redirection <HEAD davs://xrootd.phy.bris.ac.uk:1094 ...

But if I take the line at face-value, it looks like the correct thing. The FTS debug log also contains a line like that.

Anyway, client-side changes are not useful to me: I have no control over what FTS does, and since FTS serves a lot of sites they will not change their settings for just one. We tried fts-cms.cern.ch and lcgfts3.gridpp.rl.ac.uk; the only difference is the checksum-request timeout: 15 min vs 9 min respectively.

We also have dteam enabled, if you want to have a go: davs://xrootd.phy.bris.ac.uk:1094/xrootd/dteam.

ccaffy (Contributor) commented Nov 15, 2022

I can see this from the FTS debug logs before you ask the HEAD request: (SEToken) Found token in credential_map[davs://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/mc/Run3Summer22GS/MinBias_TuneCP5_13p6TeV-pythia8/GEN-SIM/124X_mcRun3_2022_realistic_v10-v1/40006/1d37f646-1a9d-4aa7-ad26-58c0d5b631ba.root] (access=write) (needed=read)

If you have write access, does it imply that you have read access too?

kreczko (Author) commented Nov 15, 2022

Good question. I would think so, but I am not sure how the tokens are segmented.

The only other place this is mentioned when searching is
https://its.cern.ch/jira/browse/DMC-1275?atlOrigin=eyJpIjoiYjM0MTA4MzUyYTYxNDVkY2IwMzVjOGQ3ZWQ3NzMwM2QiLCJwIjoianN3LWdpdGxhYlNNLWludCJ9
which does not seem related.

Given that manual tests (successful and failing) also include such a line, maybe it is a red herring:
manual test, davs vs https.

ccaffy (Contributor) commented Nov 15, 2022

Would you have, by chance, the logs of both servers associated with those manual tests that you did?

kreczko (Author) commented Nov 15, 2022

Manual tests

Logs filtered with grep -10 kreczko:

FTS log examples (full trace)

ccaffy (Contributor) commented Nov 15, 2022

Thanks, it looks to me that your filtered logs do not match the full traces:

From the filtered logs, the first line is:

 221114 17:08:47 8567 cmsrucio.411785:444@fts-cms-007.cern.ch Xrootd_Protocol: redirecting to io-37-02.acrc.bris.ac.uk:1194
$ cat 2022.09.28-xrootd-server.log | grep '221114'
$ cat 2022.09.28-xrootd-redirector.log | grep '221114'
$

kreczko (Author) commented Nov 15, 2022

The FTS logs are old, so grep '221114' should show nothing.

As for the filtered logs: that's with grep -10, so they will show other transfers as well.
If you have a good trick for tracing just one transfer in the logs, please let me know.
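One way to follow a single transfer is to grep for the per-connection client tag (`user.pid:instance@host`) that xrootd prints on each line. A self-contained sketch, with sample lines standing in for the real log file:

```shell
# Isolate one connection's log lines by its client tag. -F treats the
# tag as a fixed string, so its dots and '@' are not regex metacharacters.
cat > /tmp/xrootd-sample.log <<'EOF'
221114 17:09:12 048 cms_Decode: xrootd redirects zkreczko.411786:527@lxplus724.cern.ch to io-37-02.acrc.bris.ac.uk:1194
221114 17:08:47 8567 cmsrucio.411785:444@fts-cms-007.cern.ch Xrootd_Protocol: redirecting to io-37-02.acrc.bris.ac.uk:1194
EOF
grep -F 'zkreczko.411786:527@lxplus724.cern.ch' /tmp/xrootd-sample.log
# prints only the zkreczko cms_Decode line
```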

The first few lines of my manual transfer are

221114 17:09:12 6194  XrdVomsFun: retrieval successful
221114 17:09:12 6194  XrdVomsFun: found VO: cms
221114 17:09:12 6194  XrdVomsFun:  ---> group: '/cms', role: 'NULL', cap: 'NULL'
221114 17:09:12 6194  XrdVomsFun:  ---> fqan: '/cms/Role=NULL/Capability=NULL'
221114 17:09:12 6194 XrootdBridge: zkreczko.411786:527@lxplus724.cern.ch login as zkreczko
221114 17:09:12 048 cms_Receive: xrootd 29 bytes on 510503933
221114 17:09:12 048 cms_Decode: xrootd redirects zkreczko.411786:527@lxplus724.cern.ch to io-37-02.acrc.bris.ac.uk:1194 /xrootd/cms/store/user/kreczko/FDCF2C44-F6CA-FB47-8E01-1D145FD01B31.root.try-https

for redirector and

221114 17:09:13 5633  XrdVomsFun:  ---> group: '/cms', role: 'NULL', cap: 'NULL'
221114 17:09:13 5633  XrdVomsFun:  ---> fqan: '/cms/Role=NULL/Capability=NULL'
221114 17:09:13 5633 XrootdBridge: zkreczko.219896:374@lxplus724.cern.ch login as zkreczko
221114 17:09:13 5633 zkreczko.219896:374@lxplus724.cern.ch Xrootd_Protocol: rc=0 stat /xrootd/cms/store/user/kreczko/FDCF2C44-F6CA-FB47-8E01-1D145FD01B31.root.try-https

for the server.

If you trace kreczko through the logs, it should get you to the right places.

ccaffy (Contributor) commented Nov 15, 2022

OK thanks, let's discuss via Mattermost if you can; the conversation here is very long and not very efficient ;)

kreczko (Author) commented Nov 15, 2022

As suggested, I updated to xrootd 5.5.2:

| package | version | repo |
| --- | --- | --- |
| libmacaroons.x86_64 | 0.3.0-2.el7 | epel |
| scitokens-cpp.x86_64 | 0.7.3-1.el7 | epel |
| voms.x86_64 | 2.1.0-0.24.rc2.el7 | epel |
| xrootd.x86_64 | 1:5.5.2-0.experimental.4761454.e4f58a1a.el7.cern | xrootd-experimental |
| xrootd-client.x86_64 | 1:5.5.2-0.experimental.4761454.e4f58a1a.el7.cern | xrootd-experimental |
| xrootd-client-libs.x86_64 | 1:5.5.2-0.experimental.4761454.e4f58a1a.el7.cern | xrootd-experimental |
| xrootd-cmstfc.x86_64 | 1.5.2-6.osg36.el7 | osg-contrib |
| xrootd-libs.x86_64 | 1:5.5.2-0.experimental.4761454.e4f58a1a.el7.cern | xrootd-experimental |
| xrootd-scitokens.x86_64 | 1:5.5.2-0.experimental.4761454.e4f58a1a.el7.cern | xrootd-experimental |
| xrootd-selinux.noarch | 1:5.5.2-0.experimental.4761454.e4f58a1a.el7.cern | xrootd-experimental |
| xrootd-server.x86_64 | 1:5.5.2-0.experimental.4761454.e4f58a1a.el7.cern | xrootd-experimental |
| xrootd-server-libs.x86_64 | 1:5.5.2-0.experimental.4761454.e4f58a1a.el7.cern | xrootd-experimental |
| xrootd-voms.x86_64 | 1:5.5.2-0.experimental.4761454.e4f58a1a.el7.cern | xrootd-experimental |

Manual tests are working, nagios tests are passing, now waiting on FTS to produce some transfers.

ccaffy (Contributor) commented Nov 17, 2022

After a lot of investigation, it turns out that there is a faulty machine that loses ~80% of the packets between Bristol and anywhere else:

[attached image: network path test showing the packet loss]

@kreczko will follow up on his side and will contact me again before re-doing his tests.

ccaffy (Contributor) commented Jan 9, 2023

Hi @kreczko ,

Can this issue be closed ?

Thanks in advance.

kreczko (Author) commented Jan 9, 2023

Probably.
IT cannot find any issue on their side, I cannot find an issue on the server side, and we cannot find an issue on the xrootd/config side.

I will try to get rid of the redirector next to see if that helps.

ccaffy (Contributor) commented Jan 9, 2023

OK, closing!
Please re-open it if you find something else related to this.

Cheers,
Cedric


10 participants