
XrdTpc curl nss/sssd strange interaction part 1 #1428

Closed
esindril opened this issue Mar 17, 2021 · 6 comments

@esindril
Contributor

While running XrdTpc in production in EOSLHCB, we've noticed some strange behavior during HTTP TPC transfers. As an example, we are running the following configuration:

[root@st-048-e99ba873 ~]# uname -a
Linux st-048-e99ba873.cern.ch 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@st-048-e99ba873 ~]# rpm -qa | grep "curl\|nss\|sssd"
collectd-curl-5.8.1-1.el7.cern8.x86_64
libcurl-7.29.0-59.el7_9.1.x86_64
libsss_nss_idmap-1.16.5-10.el7_9.7.x86_64
sssd-common-pac-1.16.5-10.el7_9.7.x86_64
nss-tools-3.53.1-3.el7_9.x86_64
curl-7.29.0-59.el7_9.1.x86_64
nss-softokn-3.53.1-6.el7_9.x86_64
sssd-ldap-1.16.5-10.el7_9.7.x86_64
sssd-proxy-1.16.5-10.el7_9.7.x86_64
sssd-1.16.5-10.el7_9.7.x86_64
nss-pem-1.0.3-7.el7.x86_64
nss-softokn-freebl-3.53.1-6.el7_9.x86_64
nss-util-3.53.1-1.el7_9.x86_64
sssd-common-1.16.5-10.el7_9.7.x86_64
sssd-ipa-1.16.5-10.el7_9.7.x86_64
openssh-7.4p1-21.el7.x86_64
python-sssdconfig-1.16.5-10.el7_9.7.noarch
jansson-2.10-1.el7.x86_64
sssd-client-1.16.5-10.el7_9.7.x86_64
sssd-ad-1.16.5-10.el7_9.7.x86_64
openssh-server-7.4p1-21.el7.x86_64
nss-3.53.1-3.el7_9.x86_64
openssh-clients-7.4p1-21.el7.x86_64
openssl-libs-1.0.2k-21.el7_9.x86_64
nss-sysinit-3.53.1-3.el7_9.x86_64
sssd-krb5-1.16.5-10.el7_9.7.x86_64
python-pycurl-7.19.0-19.el7.x86_64
openssl-1.0.2k-21.el7_9.x86_64
sssd-krb5-common-1.16.5-10.el7_9.7.x86_64

So this is the latest CentOS 7 with all packages up to date.
We see that sometimes curl fails to connect, throwing the following error:

[root@st-048-e99ba873 ~]# curl -v --capath /etc/grid-security/certificates/ https://eoslhcb.cern.ch
* About to connect() to eoslhcb.cern.ch port 443 (#0)
*   Trying 2001:1458:301:62::100:5...
* Connected to eoslhcb.cern.ch (2001:1458:301:62::100:5) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: /etc/grid-security/certificates/
* NSS error -5938 (PR_END_OF_FILE_ERROR)
* Encountered end of file
* Closing connection 0
curl: (35) Encountered end of file

This is not 100% reproducible; sometimes it works, sometimes it doesn't. I suspect it is related to the load on the machine and the pressure on the NSS subsystem.
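A quick way to quantify how often this happens is to repeat the handshake in a loop and count the failures; a minimal sketch, using the same endpoint and CA path as above and an arbitrary iteration count:

# Repeat the handshake 50 times and count how often curl exits with code 35
# (the "SSL connect error" seen above); HTTP 403 responses still count as success here.
fail=0
for i in $(seq 1 50); do
    curl -s -o /dev/null --capath /etc/grid-security/certificates/ https://eoslhcb.cern.ch
    [ $? -eq 35 ] && fail=$((fail + 1))
done
echo "failed handshakes: ${fail}/50"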

The exact same command run a few seconds later works as expected:

[root@st-048-e99ba873 ~]# curl -v --capath /etc/grid-security/certificates/ https://eoslhcb.cern.ch
* About to connect() to eoslhcb.cern.ch port 443 (#0)
*   Trying 2001:1458:301:62::100:5...
* Connected to eoslhcb.cern.ch (2001:1458:301:62::100:5) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: /etc/grid-security/certificates/
* NSS: client certificate not found (nickname not specified)
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
* 	subject: CN=eoslhcb-qdb-52bedb7c30.cern.ch,OU=computers,DC=cern,DC=ch
* 	start date: Mar 13 10:48:02 2021 GMT
* 	expire date: Apr 17 10:48:02 2022 GMT
* 	common name: eoslhcb-qdb-52bedb7c30.cern.ch
* 	issuer: CN=CERN Grid Certification Authority,DC=cern,DC=ch
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: eoslhcb.cern.ch:443
> Accept: */*
>
< HTTP/1.1 403 token authorization failed
< Connection: Keep-Alive
< Content-Length: 26
<

* Connection #0 to host eoslhcb.cern.ch left intact

This shows up in the FTS logs as the following error:

ERR     Tue, 16 Mar 2021 15:19:25 +0100; /builddir/build/BUILD/fts-3.10.1/src/url-copy/UrlCopyProcess.cpp; run; 599; Recoverable error: [5] [gfalt_copy_file][perform_copy] TRANSFER  [gfal_http_copy] ERROR: Copy failed with mode 3rd pull, with error: [davix2gliberr] Transfer failed: failure: SSL connect error

and in the XRootD logs as the following:

210316 15:19:25 10981 TPC_PullRequest: event=TRANSFER_FAIL, local=/eos/lhcb/grid/user/lhcb/user/c/chaen/finalCTA/CERN-CTA-TEST_CERN-USER_40.TPC.test, remote=https://eosctalhcb.cern.ch:8444/eos/ctalhcb/archivetest/lhcb/user/c/chaen/finalCTA/CERN-CTA-TEST_CERN-USER_40.TPC.test, user=(anonymous), bytes_transferred=0; HTTP library failure: SSL connect error

I read a bit about this type of error coming from NSS, but there seems to be no good solution for it. Most of the time an upgrade of both curl and sssd is suggested, but apparently that is not a bullet-proof solution either. I tried various things like clearing the sssd cache and restarting the daemon, but nothing seems to help.
Has anyone been confronted with this? It is starting to affect 5-10% of the transfers on busy disk nodes in our instances.
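For reference, the cache clearing and daemon restart mentioned above amount to roughly the following standard sssd commands (the exact invocations used are an assumption):

# Invalidate all cached sssd entries, then restart the daemon.
sss_cache -E
systemctl restart sssd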

Thanks,
Elvin

@bbockelm
Contributor

Talking with @esindril - curl connects to the server, starts the handshake, then NSS starts loading certificates. When under load, he's observing it taking up to 10-20s to parse the /etc/grid-security/certificates directory.

The current working theory is that the remote server hangs up the TCP connection because the client is unresponsive, and PR_END_OF_FILE_ERROR is the NSS error reported when the connection is dropped during the handshake. Trying to reproduce the idea in a lab setting.
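One way to check the error signature of this theory (not the slow CA-directory load itself) is to point curl at a listener that accepts the TCP connection and then hangs up without ever answering the handshake; a rough sketch, with port and timeout chosen arbitrarily and assuming the ncat-style nc shipped on CentOS 7:

# Terminal 1: accept a connection on port 8443 and close it after 2 seconds,
# never sending any TLS data (simulates the server-side hang-up).
timeout 2 nc -l 8443

# Terminal 2: the NSS backend should then report the same
# "NSS error -5938 (PR_END_OF_FILE_ERROR)" / "Encountered end of file" as above.
curl -v https://localhost:8443/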

@esindril
Contributor Author

I think anything above 10 sec triggers this. For example:

[root@st-048-e99ba873 fst]# time curl -v --capath /etc/grid-security/certificates/ https://eoslhcb.cern.ch
* About to connect() to eoslhcb.cern.ch port 443 (#0)
*   Trying 2001:1458:301:62::100:5...
* Connected to eoslhcb.cern.ch (2001:1458:301:62::100:5) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* failed to load '/etc/grid-security/certificates//56084620.namespaces' from CURLOPT_CAPATH
* failed to load '/etc/grid-security/certificates//4b3f5558.namespaces' from CURLOPT_CAPATH
* .....
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: /etc/grid-security/certificates/
* NSS error -5938 (PR_END_OF_FILE_ERROR)
* Encountered end of file
* Closing connection 0
curl: (35) Encountered end of file

real	0m13.713s
user	0m0.164s
sys	0m0.231s
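If the theory above is right, the server-side idle/handshake timeout should be close to this ~10 s threshold. One rough way to check it (assuming the server actually drops connections that never send a ClientHello, and with an arbitrary 60 s cap):

# Open a raw TCP connection, send nothing, and time how long the server
# takes to hang up on us; bash's /dev/tcp is enough for this.
time timeout 60 bash -c 'exec 3<>/dev/tcp/eoslhcb.cern.ch/443; cat <&3'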

@simonmichal
Contributor

simonmichal commented Jun 4, 2021

I think this has been solved, hasn't it? Can we close the issue?

@chaen
Contributor

chaen commented Jun 14, 2021

@simonmichal are you sure? How? From which version?

@abh3
Member

abh3 commented Jun 14, 2021

The current 5.2 version converts the directory-based certs to a single file-based cert bundle, which plays much nicer with curl. So I would suggest going to 5.2 to avoid all the problems in this ticket. Mind you, there is still a problem depending on which filesystem you choose to hold the certs and, according to some people, on the CentOS release being used. That problem has been corrected in an upcoming patch or feature release. In any case, I suspect going to 5.2 will solve the problem.
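For older releases, a rough manual equivalent of what 5.2 does is to flatten the CA directory into a single bundle and hand that to curl instead of --capath; a sketch, with the output path chosen arbitrarily and assuming the CAs are the .pem files in that directory:

# Concatenate all CA certificates into one bundle file...
cat /etc/grid-security/certificates/*.pem > /tmp/grid-ca-bundle.pem
# ...and let curl load a single file instead of scanning the whole directory.
curl -v --cacert /tmp/grid-ca-bundle.pem https://eoslhcb.cern.ch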

@abh3
Member

abh3 commented Dec 13, 2021

These issues have been resolved.

abh3 closed this as completed on Dec 13, 2021