
XrdTpc curl nss/sssd strange interaction part 1 #1428

Closed
esindril opened this issue Mar 17, 2021 · 6 comments

@esindril
Contributor

While running XrdTpc in production in EOSLHCB, we've noticed some strange behavior during HTTP TPC transfers. As an example, we are running the following configuration:

[root@st-048-e99ba873 ~]# uname -a
Linux st-048-e99ba873.cern.ch 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@st-048-e99ba873 ~]# rpm -qa | grep "curl\|nss\|sssd"
collectd-curl-5.8.1-1.el7.cern8.x86_64
libcurl-7.29.0-59.el7_9.1.x86_64
libsss_nss_idmap-1.16.5-10.el7_9.7.x86_64
sssd-common-pac-1.16.5-10.el7_9.7.x86_64
nss-tools-3.53.1-3.el7_9.x86_64
curl-7.29.0-59.el7_9.1.x86_64
nss-softokn-3.53.1-6.el7_9.x86_64
sssd-ldap-1.16.5-10.el7_9.7.x86_64
sssd-proxy-1.16.5-10.el7_9.7.x86_64
sssd-1.16.5-10.el7_9.7.x86_64
nss-pem-1.0.3-7.el7.x86_64
nss-softokn-freebl-3.53.1-6.el7_9.x86_64
nss-util-3.53.1-1.el7_9.x86_64
sssd-common-1.16.5-10.el7_9.7.x86_64
sssd-ipa-1.16.5-10.el7_9.7.x86_64
openssh-7.4p1-21.el7.x86_64
python-sssdconfig-1.16.5-10.el7_9.7.noarch
jansson-2.10-1.el7.x86_64
sssd-client-1.16.5-10.el7_9.7.x86_64
sssd-ad-1.16.5-10.el7_9.7.x86_64
openssh-server-7.4p1-21.el7.x86_64
nss-3.53.1-3.el7_9.x86_64
openssh-clients-7.4p1-21.el7.x86_64
openssl-libs-1.0.2k-21.el7_9.x86_64
nss-sysinit-3.53.1-3.el7_9.x86_64
sssd-krb5-1.16.5-10.el7_9.7.x86_64
python-pycurl-7.19.0-19.el7.x86_64
openssl-1.0.2k-21.el7_9.x86_64
sssd-krb5-common-1.16.5-10.el7_9.7.x86_64

So this is the latest CentOS 7 with all packages up to date.
We see that sometimes curl fails to connect, throwing the following error:

[root@st-048-e99ba873 ~]# curl -v --capath /etc/grid-security/certificates/ https://eoslhcb.cern.ch
* About to connect() to eoslhcb.cern.ch port 443 (#0)
*   Trying 2001:1458:301:62::100:5...
* Connected to eoslhcb.cern.ch (2001:1458:301:62::100:5) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: /etc/grid-security/certificates/
* NSS error -5938 (PR_END_OF_FILE_ERROR)
* Encountered end of file
* Closing connection 0
curl: (35) Encountered end of file

This is not 100% reproducible; sometimes it works, sometimes it doesn't. I suspect it is related to the load on the machine and the pressure on the NSS subsystem.
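A quick way to quantify how often this happens is to repeat the handshake in a loop and count the failures; a minimal sketch, using the same endpoint and CA path as above and an arbitrary iteration count:

# Repeat the handshake 50 times and count how often curl exits with code 35
# (the "SSL connect error" seen above); HTTP 403 responses still count as success here.
fail=0
for i in $(seq 1 50); do
    curl -s -o /dev/null --capath /etc/grid-security/certificates/ https://eoslhcb.cern.ch
    [ $? -eq 35 ] && fail=$((fail + 1))
done
echo "failed handshakes: ${fail}/50"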

The exact same command run a few seconds later works as expected:

[root@st-048-e99ba873 ~]# curl -v --capath /etc/grid-security/certificates/ https://eoslhcb.cern.ch
* About to connect() to eoslhcb.cern.ch port 443 (#0)
*   Trying 2001:1458:301:62::100:5...
* Connected to eoslhcb.cern.ch (2001:1458:301:62::100:5) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: /etc/grid-security/certificates/
* NSS: client certificate not found (nickname not specified)
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
* 	subject: CN=eoslhcb-qdb-52bedb7c30.cern.ch,OU=computers,DC=cern,DC=ch
* 	start date: Mar 13 10:48:02 2021 GMT
* 	expire date: Apr 17 10:48:02 2022 GMT
* 	common name: eoslhcb-qdb-52bedb7c30.cern.ch
* 	issuer: CN=CERN Grid Certification Authority,DC=cern,DC=ch
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: eoslhcb.cern.ch:443
> Accept: */*
>
< HTTP/1.1 403 token authorization failed
< Connection: Keep-Alive
< Content-Length: 26
<

* Connection #0 to host eoslhcb.cern.ch left intact

This shows up in the FTS logs as the following error:

ERR     Tue, 16 Mar 2021 15:19:25 +0100; /builddir/build/BUILD/fts-3.10.1/src/url-copy/UrlCopyProcess.cpp; run; 599; Recoverable error: [5] [gfalt_copy_file][perform_copy] TRANSFER  [gfal_http_copy] ERROR: Copy failed with mode 3rd pull, with error: [davix2gliberr] Transfer failed: failure: SSL connect error

and in the XRootD logs as the following:

210316 15:19:25 10981 TPC_PullRequest: event=TRANSFER_FAIL, local=/eos/lhcb/grid/user/lhcb/user/c/chaen/finalCTA/CERN-CTA-TEST_CERN-USER_40.TPC.test, remote=https://eosctalhcb.cern.ch:8444/eos/ctalhcb/archivetest/lhcb/user/c/chaen/finalCTA/CERN-CTA-TEST_CERN-USER_40.TPC.test, user=(anonymous), bytes_transferred=0; HTTP library failure: SSL connect error

I read a bit about this type of error coming from NSS, but there seems to be no good solution for it. Most of the time an upgrade of both curl and sssd is suggested, but apparently that is not a bullet-proof solution either. I tried various things like clearing the sssd cache and restarting the daemon, but nothing seems to help.
Has anyone been confronted with this? It is starting to affect 5-10% of the transfers on busy disk nodes in our instances.
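For reference, the cache clearing and daemon restart mentioned above amount to roughly the following standard sssd commands (the exact invocations used are an assumption):

# Invalidate all cached sssd entries, then restart the daemon.
sss_cache -E
systemctl restart sssd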

Thanks,
Elvin

@bbockelm
Contributor

Talking with @esindril - curl connects to the server, starts the handshake, then NSS starts loading certificates. When under load, he's observing it taking up to 10-20s to parse the /etc/grid-security/certificates directory.

The current working theory is that the remote server hangs up the TCP connection because the client is unresponsive, and PR_END_OF_FILE_ERROR is the NSS error reported when the connection is dropped during the handshake. Trying to reproduce the idea in a lab setting.
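One way to check the error signature of this theory (not the slow CA-directory load itself) is to point curl at a listener that accepts the TCP connection and then hangs up without ever answering the handshake; a rough sketch, with port and timeout chosen arbitrarily and assuming the ncat-style nc shipped on CentOS 7:

# Terminal 1: accept a connection on port 8443 and close it after 2 seconds,
# never sending any TLS data (simulates the server-side hang-up).
timeout 2 nc -l 8443

# Terminal 2: the NSS backend should then report the same
# "NSS error -5938 (PR_END_OF_FILE_ERROR)" / "Encountered end of file" as above.
curl -v https://localhost:8443/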

@esindril
Contributor Author

I think anything above 10 sec triggers this. For example:

[root@st-048-e99ba873 fst]# time curl -v --capath /etc/grid-security/certificates/ https://eoslhcb.cern.ch
* About to connect() to eoslhcb.cern.ch port 443 (#0)
*   Trying 2001:1458:301:62::100:5...
* Connected to eoslhcb.cern.ch (2001:1458:301:62::100:5) port 443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* failed to load '/etc/grid-security/certificates//56084620.namespaces' from CURLOPT_CAPATH
* failed to load '/etc/grid-security/certificates//4b3f5558.namespaces' from CURLOPT_CAPATH
* .....
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: /etc/grid-security/certificates/
* NSS error -5938 (PR_END_OF_FILE_ERROR)
* Encountered end of file
* Closing connection 0
curl: (35) Encountered end of file

real	0m13.713s
user	0m0.164s
sys	0m0.231s
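If the theory above is right, the server-side idle/handshake timeout should be close to this ~10 s threshold. One rough way to check it (assuming the server actually drops connections that never send a ClientHello, and with an arbitrary 60 s cap):

# Open a raw TCP connection, send nothing, and time how long the server
# takes to hang up on us; bash's /dev/tcp is enough for this.
time timeout 60 bash -c 'exec 3<>/dev/tcp/eoslhcb.cern.ch/443; cat <&3'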

@simonmichal
Contributor

simonmichal commented Jun 4, 2021

I think this has been solved, hasn't it? Can we close the issue?

@chaen
Contributor

chaen commented Jun 14, 2021

@simonmichal are you sure? How? From which version?

@abh3
Member

abh3 commented Jun 14, 2021

The current 5.2 version converts the directory-based certs to a single file-based cert bundle, which plays much nicer with curl. So I would suggest going to 5.2 to avoid all the problems in this ticket. Mind you, there is still a problem depending on which filesystem you choose to hold the certs and, according to some people, on the CentOS release being used. That problem has been corrected in an upcoming patch or feature release. In any case, I suspect going to 5.2 will solve the problem.
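For older releases, a rough manual equivalent of what 5.2 does is to flatten the CA directory into a single bundle and hand that to curl instead of --capath; a sketch, with the output path chosen arbitrarily and assuming the CAs are the .pem files in that directory:

# Concatenate all CA certificates into one bundle file...
cat /etc/grid-security/certificates/*.pem > /tmp/grid-ca-bundle.pem
# ...and let curl load a single file instead of scanning the whole directory.
curl -v --cacert /tmp/grid-ca-bundle.pem https://eoslhcb.cern.ch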

@abh3
Member

abh3 commented Dec 13, 2021

These issues have been resolved.

abh3 closed this as completed on Dec 13, 2021