Understanding xrootd -> FTS timeout errors (Third Party Copy - HTTP) #1736
Hi @kreczko, when I look at the disk node at about the same time (for the file in the FTS transfer logs) I see errors like:
It looks from the logs like the redirector is passing the connection on to the disk node, which matches the fact that the transfer is running, but then there is some SSL-related error, which would match the FTS client not receiving the checksum. As a passing thought (and based on the last line of the log): you mention you're using Docker; how are you configuring/updating the voms/crl for production? |
Hi @rob-c, thank you for looking into the logs.
voms/crl is mounted from
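(A sketch of the kind of container setup meant here, assuming the CA/CRL and VOMS directories are bind-mounted read-only from the host; the paths and image name are illustrative, and only `--net=host` is confirmed in the issue body below:)
```bash
# Illustrative only: bind-mount host-managed CAs/CRLs and VOMS data into the
# container, so fetch-crl running on the host keeps the container's CRLs current.
docker run --net=host \
  -v /etc/grid-security/certificates:/etc/grid-security/certificates:ro \
  -v /etc/grid-security/vomsdir:/etc/grid-security/vomsdir:ro \
  -v /etc/vomses:/etc/vomses:ro \
  <xrootd-image>
```
|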
@rob-c I tried to extract one full transfer from the logs with more http info:
https://gist.github.com/kreczko/66f0d2fb47865821b8933b88045b87eb#file-one-full-transfer
This is what happens around the checksum request, then the timeout:
```
220713 14:17:30 124 Xrd_Sched: running chksum inq=0
220713 14:17:30 124 Xrootd_jobXeq: Job chksum /xrootd/cms/store/data/Run2018A/EGamma/NANOAOD/UL2018_MiniAODv1_NanoAODv2_GT36-v1/70000/4DC52ADD-551B-7242-8334-B97E6348B997.root started
220713 14:17:30 124 http_Req: XrdHttpReq::Data! final=True
220713 14:17:30 124 ***@***.*** http_Req: PostProcessHTTPReq req: 3 reqstate: 1
220713 14:17:30 124 ***@***.*** http_Req: Checksum for HEAD /xrootd/cms/store/data/Run2018A/EGamma/NANOAOD/UL2018_MiniAODv1_NanoAODv2_GT36-v1/70000/4DC52ADD-551B-7242-8334-B97E6348B997.root adler32=30032735
220713 14:17:30 124 ***@***.*** http_Protocol: Sending resp: 200 header len:97
220713 14:17:30 124 http_Protocol: Sending 97 bytes
220713 14:17:30 124 http_Req: XrdHttpReq request ended.
220713 14:17:30 124 Xrd_Poll: Poller 1 enabled ***@***.***
220713 14:17:30 124 Xrootd_jobSendResult: sent async ok to ***@***.***
220713 14:17:30 124 Xrootd_jobXeq: Job chksum /xrootd/cms/store/data/Run2018A/EGamma/NANOAOD/UL2018_MiniAODv1_NanoAODv2_GT36-v1/70000/4DC52ADD-551B-7242-8334-B97E6348B997.root completed
220713 14:17:39 124 Xrd_Sched: running ***@***.*** inq=0
220713 14:17:39 124 http_Protocol: Cleanup
220713 14:17:39 124 XrdLink: Unable to send to ***@***.***; connection timed out
220713 14:17:39 124 http_Protocol: SSL_shutdown failed
220713 14:17:39 124 http_Protocol: Reset
220713 14:17:39 124 http_Req: XrdHttpReq request ended.
```
|
Hi Luke,
Could you send your config file? It seems we have a conflicting response occurring here.
Andy
|
Hi @abh3, Of course:
Both redirector (xrootd.phy.bris.ac.uk:1094) and disk server (io-37-02.acrc.bris.ac.uk:1194) use the same config. |
Yes, we thought as much. Other sites are seeing things like this too,
as these sound exactly like the errors FTS is reporting:
Transfer ends
Since CERN FTS is working for other sites, it must be something on our end. |
You seem to be using the external checksum program, right? We've seen some issues where a queue of checksum requests develops to the point where there's a standing backlog. Even when the average time to checksum a given file is modest, there are so many files in the queue that a timeout is triggered. Is that a possibility here? |
Hi @bbockelm, We've previously been using the xrootd-hdfs built-in implementation, but due to the timeouts we tried our own solution (since we can now store checksums in xattr). This did not help, though. We allow for 100 simultaneous checksums and, as far as I can see, neither redirector nor server ever comes close to that limit. 2 GB files take around 15 s, so the largest files (~8 GB) should not take much longer than 1 minute (see the sanity check below). It is hard to map 1:1, but it looks like we get a request at
Is there a way to follow a connection through the logs to get a more accurate reading on the situation?
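(A quick sanity check on those numbers, assuming checksum time scales linearly with file size: 2 GB / 15 s ≈ 133 MB/s, and 8 GB ÷ 133 MB/s ≈ 60 s, so even the largest files should finish well inside a 15-minute window.)
|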
Ok, sounds like we can rule the 'standing queue' out as a possibility! Since I don't use checksum scripts, I never debugged that on my own -- @wyang007 might have more insight. In that case, I'm a bit befuddled where to turn next. I assume that you have already eliminated various standard network problems? Are there any idle timeouts in the server you've set that might be terminating the connection unexpectedly? It's not really a step toward a solution but, in terms of having fewer moving parts, can you reproduce in a data-server-only setup (a minimal sketch follows)? It's always hard to track the back and forth between the data server and redirector; it would be nice to have the logs in one place.
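(For concreteness, a minimal data-server-only sketch under the assumption of the same `/xrootd` export; the directives are standard xrootd ones, but the port, paths, and library location are illustrative:)
```
# Standalone data server: serve the same export over HTTPS, with no cmsd or
# redirector in the path, so all log lines end up on a single host.
all.export /xrootd
xrd.protocol http:1094 /usr/lib64/libXrdHttp.so
http.cadir /etc/grid-security/certificates
http.cert /etc/grid-security/xrd/xrdcert.pem
http.key /etc/grid-security/xrd/xrdkey.pem
```
|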
I don't think I have much insight, just experience from experimenting with it. I do use an external script, because it allows me to clearly see that checksum requests are piling up. My GUESS from what I see is that when many checksum requests come in, the checksum calculations are actually slow (CPU limit, or the HD prioritising writes over reads? I don't know). If a checksum is running, Xrootd (via HTTP or the xroot protocol) will periodically update FTS. But if a checksum is not in the running state (e.g. it exceeds the max checksums configured in xrootd), xrootd won't update FTS with a performance marker. In that case, FTS may time the transfer out after 1 minute (and in the FTS logs I saw, all checksum timeouts happened after exactly 1 minute). My only solution was to increase the max allowed checksums.
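(The knob in question is xrootd's `xrootd.chksum` directive; a hedged sketch, in which the limit and the script path are examples:)
```
# Run checksums via an external script, allowing up to 100 concurrent jobs;
# requests beyond the limit queue up and, per the observation above, emit no
# performance markers while they wait.
xrootd.chksum max 100 adler32 /usr/local/bin/xrd_adler32.sh
```
|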
Firewall, connection limits, latency, tcp wrappers - yes, we tested a lot of them.
We are back to using defaults - could you please name any particular settings you would recommend we test?
We can try stand-alone again, but I expect the current redirector (we would use that, since it is registered everywhere) to struggle with the load (it is a tiny VM for just redirecting). I will synchronize the configs and we can test it for a few hours.
We currently allow for 100 simultaneous checksums (which we have not yet reached). The data is read over the network and as far as I can see, we do not hit any hardware limitations (CPU, RAM, network, disk).
Where do you see that? From the FTS logs I can see a timeout after |
Frankly, my take is that you are overestimating what the file system can do. Also, the FTS checksum timeout of one minute is totally wrong; the combination is making it unworkable. The FTS timeout should be at least 3 minutes. Let me work with the FTS group to see what we can do here. |
Possibly. But before we switched to xrootd-only, we ran DMLite+HDFS on the same hardware. No checksum issues or general timeouts. Deletions would occasionally stress the server if done via GridFTP, but worked fine via HTTP - the limitation has always been on the server side, not the file system. Both redirector and server are at low load (< 50% of cores) and are otherwise not very busy - nor is the file system. |
Hello, I will upload the FTS log file here, as log files older than 5 days can no longer be viewed in the FTS Web Monitoring service. The FTS log lines show a 15-minute timeout, not a one-minute one. Something else, mentioned by @wyang007 ... I don't think the statement about checksums and Performance Marker updates is correct. FTS receives Performance Markers during the transfer part, but not for the checksum. The checksum is a request-and-wait operation. FTS (in fact, this is all done at the Gfal2 level) requests the checksum to be computed at the end of the transfer part. The checksum computation is included in the copy(..) operation. FTS decides the timeout for this according to the file size. In our case:
All other HTTP operations have absurdly long timeouts, so we never hit an operation timeout. The 15-minute timeout we see here comes from
I am curious if you are able to reproduce the issue with Davix and/or Gfal2:
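(Illustrative commands, not the originals: `gfal-sum` is used verbatim later in this thread, and the curl form assumes a valid grid proxy at the usual `/tmp/x509up_u$(id -u)` path:)
```bash
# Ask the destination directly for an adler32 checksum - the same
# HEAD + Want-Digest request that gfal2/davix issues after a transfer.
gfal-sum davs://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/<file>.root adler32
```
or
```bash
curl -I --capath /etc/grid-security/certificates \
     --cert /tmp/x509up_u$(id -u) \
     -H "Want-Digest: ADLER32" \
     "https://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/<file>.root"
```
|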
@mpatrascoiu thank you for the useful info.
Is there a way to verify this happens? The xrootd-hdfs plugin would save checksums in separate files ( From the logs I see
Which kinda looks like it is doing the right thing, so there should be no reason for FTS not to get a response quickly. |
Note: For some reason the redirector also calculates checksums from time to time (not sure why).
I cannot find a similar entry on the disk server, but I can see the entries in the xrootd.log. What are the rules here? Why is the redirector doing any checksumming?
As an experiment, I've now removed these lines from the redirector; a sketch of the intended split is below.
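(A hedged sketch of one way to confine checksumming to the data servers, using xrootd's standard `if`/`else`/`fi` host conditionals; the redirector host name is from this thread, the script path is an example:)
```
# Only data servers should run the checksum handler; the redirector (manager)
# merely routes requests.
if xrootd.phy.bris.ac.uk
   all.role manager
else
   all.role server
   xrootd.chksum max 100 adler32 /usr/local/bin/xrd_adler32.sh
fi
```
|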
Since the mentioned change, the redirector no longer calculates checksums → good.
which is only 10 min ago as of writing this. Out of curiosity, I ran
```
gfal-sum davs://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/mc/RunIISummer20UL16RECOAPV/ST_t-channel_antitop_4f_InclusiveDecays_wtop1p45_TuneCP5_fixWidth_13TeV-powheg-madspin-pythia8/AODSIM/106X_mcRun2_asymptotic_preVFP_v8-v2/40001/25D27F31-FF0C-4346-8A5E-286FE74440E9.root adler32
```
The first query took a while → the checksum had not been calculated at the end of the transfer. However, the timing information shows that calculating the checksum took 53 seconds → a timeout of 15 minutes should not have happened.
Is there a hook in xrootd to force the calculation of a checksum after a successful transfer? |
There is no "hook" to automatically calculate the checksum. Among other reasons it's not always clear that the transfer is an actual copy operation. So, xrootd simply requires that the person doing the copy ask for the checksum to be calculated. Anyway, doing it automatically wouldn't necessarily avoid a timeout. |
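(In practice the client asks at copy time; for illustration, both of the following request verification as part of the copy. The gfal-copy flags appear verbatim later in this thread, and `--cksum` is xrdcp's equivalent option:)
```bash
# Request end-to-end checksum verification as part of the copy itself.
gfal-copy -K adler32 --checksum-mode both davs://<src> davs://<dst>
xrdcp --cksum adler32 root://<src> root://<dst>
```
|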
In GGUS ticket 156902, Bockjoo discovered a possibly related TLS problem:
It seems like the redirection, xrootd.phy → io02, does not always work - it can fail at the TLS handshake. Is this something worth focusing on? |
After exchanging configs, we removed the `xrd.tls` and `xrd.tlsca` from the redirector and replaced them with the `http.cadir`, `http.cert`, and `http.key` settings.
This seems to have decreased our FTS transfer failure rate from ≈94% to ≈38% (in the last two hours).
The mentions of `fts-cms` in `/var/log/clustered/xrootd.log` in redirector vs server have improved as well (more getting through).
So the [updated config](https://github.com/BristolComputing/xrootd-se/blob/kreczko-resolving-tls/etc/xrootd/config.d/20-https-and-security.cfg) seems to be a step in the right direction.
I still see 15 min timeouts (FTS logs) after a successful transfer (checksum request):
```
TRANSFER [112] DESTINATION CHECKSUM (Neon): Could not read status line: Connection timed out
```
and I see `(send failure)` and `(link SSL read error)` in the redirector and server logs respectively. But I also see them for requests that I know went well (e.g. ETF tests for CMS).
I've now also updated to Xrootd 5.5.0 (after confirming the config change went in the right direction). |
Frankly, I would say something else changed as well with that config update. Mechanically, xrd.tls and xrd.tlsca are treated **identically** to their http counterparts, as this is simply syntactic sugar to have a consistent config file. I should also mention the http versions are deprecated and will be removed at some point in the future.
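(For illustration, the equivalence being described, with example paths:)
```
# These two spellings configure the same TLS machinery:
xrd.tls /etc/grid-security/xrd/xrdcert.pem /etc/grid-security/xrd/xrdkey.pem
xrd.tlsca certdir /etc/grid-security/certificates
# ...equivalent to the deprecated http.* form:
http.cert /etc/grid-security/xrd/xrdcert.pem
http.key /etc/grid-security/xrd/xrdkey.pem
http.cadir /etc/grid-security/certificates
```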
|
Yes, I can confirm that. Yesterday I reset everything back to
Our university network team also had a look and identified a common theme:
In our logs this looks like
We've tried to tune TPC with
but we were told that this has nothing to do with HTTP TPC, so we removed it yesterday. I also had a look at |
So, the message "connection timed out' appears to be coming from OpenSSL saying that it took too long to do the ssl handshake. The question is when that message is generated. So, what we need is all messages issued w.r.t. 7dc42260.360:500@fts-cms-08.cern.ch on the server. So, simply grep the log for this and send it over. As for what happens in the redirector, it seems totlly unrelated. While the server is complaining about 7dc42260.360:500@fts-cms-08.cern.ch the redirector is complaining about 7dc42260.3507:661@fts-cms-09.cern.ch which is a comletely different client (i.e. tsts-cms-08 for the server and fts-cms09 for the redirector). However, please also grep the redirector log for all messages about 7dc42260.3507:661@fts-cms-09.cern.ch and send that over as well. |
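(Concretely, something like the following, using the log path mentioned earlier in this thread:)
```bash
# On the server: every message for the session the server complained about.
grep '7dc42260.360:500@fts-cms-08.cern.ch' /var/log/clustered/xrootd.log

# On the redirector: the (different) session it complained about.
grep '7dc42260.3507:661@fts-cms-09.cern.ch' /var/log/clustered/xrootd.log
```
|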
Hi @abh3, on the server, grepping for
Note that the transfer of the file mentioned at 11:38 seems to be complete at 11:59:
On the redirector, grepping for
Full logs for that day can be found here:
|
Hello @abh3 , Here is the corresponding FTS transfer log file: |
It seems someone has enabled debug for FTS transfers. Thank you! The gfal checksum operation seems to
On my end, I do not see this checksum request, only the stat and rm (cleanup) after the timeout. I looked through other issues, and I do not have

<details><summary>CHECKSUM:ENTER details</summary>

```
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Event triggered: DESTINATION http_plugin CHECKSUM:ENTER
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Impossible to get string_list parameter HTTP PLUGIN:HEADERS, set to a default value (null), err Key file does not have key "HEADERS" in group "HTTP PLUGIN"
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Impossible to get integer parameter HTTP PLUGIN:OPERATION_TIMEOUT, set to default value 8000, err Key file does not have key "OPERATION_TIMEOUT" in group "HTTP PLUGIN"
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Using client X509 for HTTPS session authorization
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: ssl: Match common name '1664259124' against ''
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: Identity match for '': bad
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: ssl: Match common name '1664259124' against ''
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: Identity match for '': bad
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; (SEToken) Found token in credential_map[davs://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/mc/Run3Summer22GS/MinBias_TuneCP5_13p6TeV-pythia8/GEN-SIM/124X_mcRun3_2022_realistic_v10-v1/40006/1d37f646-1a9d-4aa7-ad26-58c0d5b631ba.root] (access=write) (needed=read)
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Using bearer token for HTTPS request authorization
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: -> checksum
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: Create HttpRequest for davs://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/mc/Run3Summer22GS/MinBias_TuneCP5_13p6TeV-pythia8/GEN-SIM/124X_mcRun3_2022_realistic_v10-v1/40006/1d37f646-1a9d-4aa7-ad26-58c0d5b631ba.root
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: -> executeRequest
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: -> negotiateRequest
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: NEON start internal request
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: cached ne_session found ! taken from cache
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: configure session...
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: define connection timeout to 30
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: define operation timeout to 1800
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: add CA PATH /etc/grid-security/certificates/
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: disable login/password authentication
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: enable client cert authentication by callback
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: Running pre_send hooks
DEBUG Fri, 30 Sep 2022 14:22:32 +0200; Davix: > HEAD /xrootd/cms/store/mc/Run3Summer22GS/MinBias_TuneCP5_13p6TeV-pythia8/GEN-SIM/124X_mcRun3_2022_realistic_v10-v1/40006/1d37f646-1a9d-4aa7-ad26-58c0d5b631ba.root HTTP/1.1
> User-Agent: fts_url_copy/3.12.1 gfal2/2.21.0 neon/0.0.29
> TE: trailers
> Host: xrootd.phy.bris.ac.uk:1094
> Want-Digest: ADLER32
> ClientInfo: job-id=9737ef08-40b9-11ed-af88-fa163e36d89b;file-id=3450963706;retry=0
> Authorization: Bearer MDAyNGxvY2F0aW9uIFVLSS1TT1VUSEdSSUQtQlJJUy1IRVAKMDAzNGlkZW50aWZpZXIgMzExOTZhYjctYjIwNC00ZDQ2LWJiZmYtZDdlYmQ4NmU5OTFhCjAwMThjaWQgbmFtZTo3ZGM0MjI2MC4wCjAwNTJjaWQgYWN0aXZpdHk6UkVBRF9NRVRBREFUQSxVUExPQUQsRE9XTkxPQUQsREVMRVRFLE1BTkFHRSxVUERBVEVfTUVUQURBVEEsTElTVAowMDM0Y2lkIGFjdGl2aXR5OkxJU1QsRE9XTkxPQUQsTUFOQUdFLFVQTE9BRCxERUxFVEUKMDBhYmNpZCBwYXRoOi94cm9vdGQvY21zL3N0b3JlL21jL1J1bjNTdW1tZXIyMkdTL01pbkJpYXNfVHVuZUNQNV8xM3A2VGVWLXB5dGhpYTgvR0VOLVNJTS8xMjRYX21jUnVuM18yMDIyX3JlYWxpc3RpY192MTAtdjEvNDAwMDYvMWQzN2Y2NDYtMWE5ZC00YWE3LWFkMjYtNThjMGQ1YjYzMWJhLnJvb3QKMDAyNGNpZCBiZWZvcmU6MjAyMi0wOS0zMFQxNToxNTo0MVoKMDAyZnNpZ25hdHVyZSCzrWofC-zAm7z_c3UBg9cYpdofgECNBIfNceJX15a2pgo
>
```

</details>
|
I've done some more testing and I can see some resemblance to #1252
I am pretty sure I also had a version of the config with
```
gfal-copy -p -v -f -K adler32 --checksum-mode both --copy-mode pull davs://eoscms.cern.ch:443/eos/cms/store/data/Run2018B/ScoutingMonitor/RAW/v1/000/317/475/00000/F0A5417E-ED68-E811-B824-FA163ED2A73C.root davs://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/user/kreczko/FDCF2C44-F6CA-FB47-8E01-1D145FD01B31.root
```
DESTINATION CHECKSUM (Neon)

**Update (EDIT):** FTS transfers are still failing due to timeout. |
OK thanks. Can you please try with no header at all except the Source one? :) |
Tried it both ways (once without the Overwrite header, and once with "Overwrite: F"). Same results. |
That would explain the errors and why I do not see the checksum requests at the redirector. Is there anything I can do on the xrootd side (e.g. config)? |
What do you mean "manual tests are fine"? How do they differ from the ones that fail? |
**Manual tests**
By manual tests, I mean myself running
```
gfal-copy -p -v -f -K adler32 --checksum-mode both --copy-mode pull davs://eoscms.cern.ch:443/eos/cms/store/data/Run2018B/ScoutingMonitor/RAW/v1/000/317/475/00000/F0A5417E-ED68-E811-B824-FA163ED2A73C.root davs://xrootd.phy.bris.ac.uk:1094/xrootd/cms/store/user/kreczko/FDCF2C44-F6CA-FB47-8E01-1D145FD01B31.root
```
DESTINATION CHECKSUM (Neon)
I can make these fail by setting
Most recently I've tried different xrootd versions, which also updated some dependencies - now this works too.

**Automated transfers**
I compare these to automated FTS transfers that, to the best of my knowledge, execute the same command. |
OK thanks, I could now reproduce on my box... :) |
I created the PR to fix the issue: #1826 |
I applied your PR as a patch and verified that it fixed the issue for us. Thanks! |
That's great news. Thanks for your quick feedback :) |
@ccaffy Now that the other issue is done, could you please elaborate on
I've summarized what you said and what we observe in this graph:
```mermaid
sequenceDiagram
    autonumber
    davix client->>+redirector: Do a TPC (pull)
    redirector->>+server: Do a TPC (pull)
    server->>+davix client: will do
    loop status update
        server->>davix client: transferred x bytes out of N
    end
    server->>+davix client: copy is done
    note right of davix client: Let's reuse session cache
    davix client->>+server: give me the checksum
    note left of server: unknown connection (not from redirector)
    note right of davix client: timeout after 15min
```
There are two unknowns to me:
The random successes or low-volume, successful tests are the most confusing bits. My full config is here:
|
Can you ask your davix client to talk to the redirector instead of the server directly in order to ask for the checksum? |
The davix client only knows about the redirector - that's the point of entry. I can certify that the redirector does not see the checksum request in 95% of the cases (the failures) - so what you said would make sense. |
OK thanks. Now, before launching Davix, if you do |
@ccaffy Where would I set this? On the redirector, server, or both? EDIT: For the client side I've tested these: #1736 (comment) |
On the client side, if you set |
Manual tests without it as well right now:
```
gfal-copy -p -v -f -K adler32 --checksum-mode both --copy-mode pull davs://<src> davs://xrootd.phy.bris.ac.uk:1094/xrootd/cms/<dst>
```
With the latest upgrades (see #1736 (comment)) things started to work for the manual tests. The diff between the logs for unmodified and
But if I take the line at face value, it looks like the correct thing. The FTS debug log also contains a line like that. Anyway, client-side changes are not useful to me, as I have no control over what FTS does, and since FTS serves a lot of sites, they will not change their settings for just one. We tried fts-cms.cern.ch and lcgfts3.gridpp.rl.ac.uk, and the only difference is the timeout for the checksum request: 15 min vs 9 min respectively. We also have |
I can see this from the FTS debug logs before you ask the HEAD request: If you have write access, does it imply that you have read access too? |
Good question. I would think so, but I am not sure how the tokens are segmented. The only other place this is mentioned when searching is
Given that manual tests (successful and failing) also include such a line, maybe it is a red herring: |
Would you have, by chance, the logs of both servers associated with those manual tests that you did? |
**Manual tests**
Logs filtered with

**FTS log examples (full trace)**
|
Thanks, it looks to me that your filtered logs do not match the full traces. From the filtered logs, the first line is:
|
The FTS logs are old,
As for the filtered logs: that's with
The first few lines of my manual transfer are
for the redirector and
for the server. If you trace |
OK thanks, let's discuss via Mattermost if you can, the conversation here is very long and it is not very efficient ;) |
As suggested, updated to xrootd 5.5.2:
Manual tests are working, Nagios tests are passing; now waiting on FTS to produce some transfers. |
After a lot of investigation, it turns out that there is a faulty machine that loses ~80% of the packets between Bristol and anywhere else:
@kreczko will follow up on his side and will contact me again before re-doing his tests.
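(For reference, the kind of path check that exposes this sort of loss; the tools are standard, the hostnames are taken from this thread:)
```bash
# mtr shows per-hop packet loss along the route to the FTS host.
mtr --report --report-cycles 100 fts-cms-08.cern.ch

# A plain ping summary gives the end-to-end loss percentage.
ping -c 100 xrootd.phy.bris.ac.uk | tail -2
```
|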
Hi @kreczko, can this issue be closed? Thanks in advance. |
Probably. I will try to get rid of the redirector next to see if that helps. |
OK, closing! Cheers, |
**Original issue description**

Dear experts,
We are observing issues between xrootd (both redirector and disk node) and FTS servers like these:
which cause most of our transfers to fail (failures outnumber successes by a factor of 10-100)
On the FTS side we see this:
In short: transfer succeeds (we can see the files on disk), but the checksum part always times out.
We've tried many things to make it work, including a custom plugin for checksum calculation.
We do not observe huge delays in the checksum calculation - nothing that would explain 🟥 15 minute delays 🟥!
Both source and destination use `davs://` for copy → HTTPS via XROOTD path on our side.

**Logs**
FTS log example:
Full xrootd.log:
**Versions, config and operations**
We are running both redirector (xrootd.phy.bris.ac.uk:1094) and disk server (io-37-02.acrc.bris.ac.uk:1194) via Docker with `--net=host`.
FTS servers are reachable from within the containers (and host) via IPv4 and IPv6.
Our config can be found on https://github.com/BristolComputing/xrootd-se/tree/main/etc/xrootd (clustered + config.d).
Installed xrootd versions and plugins:
xrootd-hdfs
**Other monitoring**

**Failures**

**Successes**
99% of the failures are due to the mentioned timeout. Please note the different scales for the y-axis.