Thanks to Petr Vokac, who is doing extensive testing of the HTTP TPC implementation, we discovered a race condition in the HTTP TPC PULL implementation.
The symptom is that, after a successful transfer, a HEAD request for the checksum returns "00000000", which in the EOS world means no checksum is set. If the request is reissued after a brief moment, the checksum is correct. Below is a snippet from the transfer log that shows this problem, with the important parts highlighted.
...
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: Received: success: Created
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: Davix::BackendRequest::readSegment: want to read 1024 bytes
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: <
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: End of headers.
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: Running post_send hooks
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: Destroy HttpRequest
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: Running destroy hooks.
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: Request ends.
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: add old session to cache httpst-048-bba7ee78.cern.ch:8443
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Copy succeeded using mode 3rd pull
INFO Fri, 23 Oct 2020 02:27:25 +0200; [1603412845359] BOTH http_plugin TRANSFER:EXIT davs://golias100.farm.particle.cz:443/dpm/farm.particle.cz/home/atlas/test/dpmpool20_truncate_10G/14 => davs://eosatlas.cern.ch:443/eos/atlas/atlasscratchdisk/SAM/vokac.dst/test/DPM-10G-PRAGUELCG2-20201023021733.00014
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Event triggered: BOTH http_plugin TRANSFER:EXIT davs://golias100.farm.particle.cz:443/dpm/farm.particle.cz/home/atlas/test/dpmpool20_truncate_10G/14 => davs://eosatlas.cern.ch:443/eos/atlas/atlasscratchdisk/SAM/vokac.dst/test/DPM-10G-PRAGUELCG2-20201023021733.00014
INFO Fri, 23 Oct 2020 02:27:25 +0200; [1603412845359] DEST http_plugin CHECKSUM:ENTER
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Event triggered: DESTINATION http_plugin CHECKSUM:ENTER
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Using client X509 for HTTPS session authorization
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: ssl: Match common name '2117989013' against ''
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: Identity match for '': bad
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: ssl: Match common name '2117989013' against ''
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: Identity match for '': bad
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Impossible to get string parameter BEARER:TOKEN, set to default value (null), err Key file does not have group “BEARER”
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Using bearer token for HTTPS request authorization
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Impossible to get string_list parameter HTTP PLUGIN:HEADERS, set to a default value (null), err Key file does not have key “HEADERS” in group “HTTP PLUGIN”
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Impossible to get integer parameter HTTP PLUGIN:OPERATION_TIMEOUT, set to default value 8000, err Key file does not have key “OPERATION_TIMEOUT” in group “HTTP PLUGIN”
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: -> checksum
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: Create HttpRequest for davs://eosatlas.cern.ch:443/eos/atlas/atlasscratchdisk/SAM/vokac.dst/test/DPM-10G-PRAGUELCG2-20201023021733.00014
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: -> executeRequest
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: creating session keys... httpseosatlas.cern.ch
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: cached ne_session found ! taken from cache
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: configure session...
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: define connection timeout to 30
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: define operation timeout to 1800
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: add CA PATH /etc/grid-security/certificates/
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: disable login/password authentication
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: enable client cert authentication by callback
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: -> negotiateRequest
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: NEON start internal request
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: Running pre_send hooks
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: > HEAD /eos/atlas/atlasscratchdisk/SAM/vokac.dst/test/DPM-10G-PRAGUELCG2-20201023021733.00014 HTTP/1.1
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: < ETag: "588571929254821888:00000000"
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: < Last-Modified: Fri, 23 Oct 2020 00:17:46 GMT
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: <
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: End of headers.
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: Running post_headers hooks
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: <- negotiateRequest
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: NEON Read data flow
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: NEONRequest::readBlock read 0 bytes
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: Running post_send hooks
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: <- executeRequest
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: Destroy HttpRequest
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: Running destroy hooks.
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: Request ends.
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: add old session to cache httpseosatlas.cern.ch
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; <- Gfal::Transfer::FileCopy
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Ping stopped
WARNING Fri, 23 Oct 2020 02:27:25 +0200; Timeout stopped
ERR Fri, 23 Oct 2020 02:27:25 +0200; /builddir/build/BUILD/fts-3.9.5/src/url-copy/UrlCopyProcess.cpp; run; 574; Non recoverable error: [5] [gfalt_copy_file][perform_copy][gfal_http_copy] DESTINATION CHECKSUM MISMATCH Source and destination ADLER32 do not match (822b0001 != 00000000)
What happens on the EOS side is that the curl command that runs on the diskserver completes successfully, but the reply to the client is sent before the actual file on disk is closed (in XRootD terms, before the XrdOfsFile is closed). It is in the close that EOS sends the final size and checksum of the file to the manager node, which serves any future HEAD/GET requests. This leads to a race condition: the file is not yet closed on disk, but the client has already received a successful reply. If the client is fast and immediately queries the manager node for the checksum, we end up in this situation.
Even though this might be an implementation detail of EOS, it would be better in general if the reply to the client were sent only after the file on disk has actually been closed, since the close itself might fail for various reasons.
The problem comes from the fact that the file handler fh at this line https://github.com/xrootd/xrootd/blob/master/src/XrdTpc/XrdTpcTPC.cc#L644 is only closed at the end of TPCHandler::ProcessPullRequest, after the RunCurlBasic method runs. RunCurlBasic already sends the reply to the client, and this leads to the race condition.
@esindril - please review #1316. It ensures that Finalize is called before any of the HTTP responses are sent. There's also a mild cleanup of the logging messages so that a local filesystem error makes it back to the client.
We tried to put HTTP TPC into production at CERN, but it is heavily affected by this problem. If this issue was closed because it has been fixed in XRootD 5, I wonder whether it is going to be fixed in XRootD 4 too.
You can find the FTS log for such a transfer below:
https://novastore.farm.particle.cz:8449/data/fts3/transfers/2020-10-23/golias100.farm.particle.cz__eosatlas.cern.ch/2020-10-23-0027__golias100.farm.particle.cz__eosatlas.cern.ch__7914757__29ca00fe-14c5-11eb-850e-0000c9eadf64

The key lines are the HEAD request issued right after the successful transfer and the server's reply, which reports the "not set" checksum:

DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: > HEAD /eos/atlas/atlasscratchdisk/SAM/vokac.dst/test/DPM-10G-PRAGUELCG2-20201023021733.00014 HTTP/1.1
> Want-Digest: ADLER32
DEBUG Fri, 23 Oct 2020 02:27:25 +0200; Davix: < Digest: adler32=00000000