Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP requests issue: http.secretkey directive and/or VOMS cert #1251

Closed
riccardodimaria opened this issue Jul 13, 2020 · 22 comments
Closed

HTTP requests issue: http.secretkey directive and/or VOMS cert #1251

riccardodimaria opened this issue Jul 13, 2020 · 22 comments

Comments

@riccardodimaria
Copy link

riccardodimaria commented Jul 13, 2020

Hello,
while testing XRootD HEAD (compiled as of July 13th, 2020) I notice some issues wrt HTTP requests, both CURL (using Tokens) and davix-get -P grid (using 2 different certificates).

  1. Starting with the latter (davix), the XCache is configured using:
  • /afs/cern.ch/user/r/rdimaria/public/Authfile-testing
  • /afs/cern.ch/user/r/rdimaria/public/xrootd-xcache.cfg-testing

I am using an ESCAPE VO cert, meaning, in my case, 3 attributes in the VO extension.
When trying to read data open to everyone:
davix-get -P grid https://escape-wp2-puppet-xcache-level0-02.cern.ch:1094//xroot://escape-wp2-puppet-mockdata-server.cern.ch:1213//mockdata/mockdata_origin12.txt_1024_0
I get: (Davix::HttpRequest) Error: Result Impossible to get the new redirected destination after 3 attempts
and the only thing I can spot from the logs (/afs/cern.ch/user/r/rdimaria/public/xrootd.log-xrootdticket-davix_escape) is 140654623500032:error:140D9115:SSL routines:ssl_get_prev_session:session id context uninitialized:ssl_sess.c:682:.

Commenting out the http.secretkey directive in the XCache cfg, the same request results in success (clean log at /afs/cern.ch/user/r/rdimaria/public/xrootd.log-xrootdticket-davix_escape-nosecretkey).
However, it seems that xrootd gets restarted after having fetched the file.

If I now request the same file using a CMS proxy, in my case meaning having only the default attribute in the VO extension,
the request succeeds with the http.secretkey directive in the XCache cfg (/afs/cern.ch/user/r/rdimaria/public/xrootd.log-xrootdticket-davix_cms).

  1. Moving to CURL, the XCache is still configured using:
  • /afs/cern.ch/user/r/rdimaria/public/Authfile-testing
  • /afs/cern.ch/user/r/rdimaria/public/xrootd-xcache.cfg-testing

The token I am using is only able to read CMS data.
When trying to make the following request:
curl -L -H "Authorization: Bearer $TOKEN" -k -XGET https://escape-wp2-puppet-xcache-level0-02.cern.ch:1094//cms/cms_origin2.txt
I get: curl: (52) NSS: client certificate not found (nickname not specified)
and the only thing I can spot from the logs (/afs/cern.ch/user/r/rdimaria/public/xrootd.log-xrootdticket-curl) is 200713 15:29:00 18515 anon.0:31@lxplus783 sysXrdHttp: Rejecting plain http with no valid token as we have a secretkey..

Commenting out the http.secretkey directive in the XCache cfg, the same request results in success (clean log at /afs/cern.ch/user/r/rdimaria/public/xrootd.log-xrootdticket-curl-nosecretkey).

IMPORTANT: AFAICT, even though I should perform more in-depth checks, this issue with CURL and davix requests affects also the "production" environment, meaning using 4.12 series (.2 and .3).

Please let me know if you need anything else from me.

Best,
Riccardo

@abh3
Copy link
Member

abh3 commented Jul 13, 2020

Let me comment on predicament 1: This is related to whether or not the server uses a session cache. After a long and tortured discussion on this, it was decided to disable the session cache for xrootd as well as http. You will see that for http this was done by commenting out the code (XrdHttpProtocol.cc:1830) as Fabrizio was very wary of the change. It would appear that self-redirection enabled by the secret key depends on getting information from the previous http connection and it's simply not there. Why that is the case is beyond me.

Now, as for xrootd "restarting"; what do you mean? Does it crash?

@abh3
Copy link
Member

abh3 commented Jul 13, 2020

As for predicament 2: It would seem that the problem is the curl command itself and it's interaction with secretkey. Note that based on your config file, "http.selfhttps2http yes", is in effect and there was an attempt to redirect from https to http but since here was no cert, no "opaque" token was generated. That would explain the subsequent error. In general, redirection from https to http using a bearer token is not workable as the token would likely be presented again using a non-encrypted connection. Finally, I know this is only testing, but including the actual secret key in the config file is bad form as people can actually display the config file and steal the secret key.

@riccardodimaria
Copy link
Author

Predicament 1: I understand now the point, many thanks. I am not really sure how to proceed on this though. I was planning to exploit the "fast unencrypted data access" using both http.selfhttps2http and http.secretkey. Any advice?
Regarding xrootd "restarting", this was my guess since from /afs/cern.ch/user/r/rdimaria/public/xrootd.log-xrootdticket-davix_escape-nosecretkey it can be seen a new XCache initialisation after the request.

However, I am still puzzled by the fact that using a CMS cert with 1 attribute hides this issue.

@riccardodimaria
Copy link
Author

Predicament 2: Many thanks for the explanation. However, in view of Tokens replacing x509, I am not sure how to solve this from the user PoV. Trying to explain, the user will always have only his token while contacting either the central redirector or the cache itself. AFAIU, the best way to set the cluster for a "fast redirection, distributed authentication, fast unencrypted data access" is to have the setup I am using ATM.
Am I missing something in the picture?

@bbockelm
Copy link
Contributor

Hi @riccardodimaria!

I'm fairly unfamiliar with this feature but would recommend against using it - you lose integrity checks and the cost for encryption is fairly minimal (it's in-hardware these days).

Could you post the output of curl with the -v option? That should dump the headers to stderr and illuminate the situation. Particularly, your redirected URL should contain a xrdhttptk. You can redact the secrets in the output - I'm just looking for the presence of headers.

Also to verify - modern versions of curl are supposed to drop the Authorization header on redirect for obvious security reasons.

Finally:

since here was no cert, no "opaque" token was generated

The opaque token is not generated from the certificate but rather from the XrdSecEntity object. The code appears to do some questionable items such as dump the credentials unencrypted into the URL. I don't understand that but it was done in 581b26b by @ffurano so he might have some insight.

Brian

@xrootd-dev
Copy link

xrootd-dev commented Jul 14, 2020 via email

@riccardodimaria
Copy link
Author

Hi @bbockelm ,
many thanks for this.
I am not sure I am following you. Are you advising not having http.selfhttps2http and http.secretkey at all? Meaning no "fast redirection, distributed authentication, fast unencrypted data access"...

The output of curl -v ... is attached. The token is indeed there, as well as:
xrdhttptk=/VOvMeUy9pSIuT/QOsyXYQ==&xrdhttptime=1594740344&xrdhttphost=%5B%3A%3Affff%3A137.138.62.33%5D HTTP/1.1 straight after the token.

The CURL version is:
curl --version curl 7.29.0 (x86_64-redhat-linux-gnu) libcurl/7.29.0 NSS/3.44 zlib/1.2.7 libidn/1.28 libssh2/1.8.0 Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp scp sftp smtp smtps telnet tftp Features: AsynchDNS GSS-Negotiate IDN IPv6 Largefile NTLM NTLM_WB SSL libz unix-sockets

curl.log

@bbockelm
Copy link
Contributor

@riccardodimaria - thank you for the debugging information: this is exactly what was needed!

Are you advising not having http.selfhttps2http and http.secretkey at all? Meaning no "fast redirection, distributed authentication, fast unencrypted data access"...

Yes, I'd advise not to use that at all - simply because we also have fast encrypted data access. Even beyond this apparent bug, there are so many ways that secrets can leak out in unexpected ways I would remain wary.

The underlying problem appears to be in XrdHttpReq::appendOpaque making assumptions that the URL fed into the method will never already contain query parameters embedded in the URL (and I count at least two code paths that already do; you hit one, the other has to do with EOS). Hence, it does not prefix the additional arguments with & and the xrdhttptk key is instead appended to the prior argument (resulting in both being invalid!).

@ffurano - the fix looks easy (simply see if there is an existing ? in the string before deciding whether to add the &): would you be able to address it?

@abh3
Copy link
Member

abh3 commented Jul 15, 2020 via email

@abh3
Copy link
Member

abh3 commented Jul 15, 2020 via email

@bbockelm
Copy link
Contributor

my last measurements at the time were giving a 4-5 times difference in the metadata rate by enabling it for the internal control channels.

This makes very little sense to me unless you are somehow disabling HTTP keepalive (and hence incurring a few extra roundtrip overheads per metadata operation when HTTPS is used). If that's true, the more reasonable approach would be to fix to make HTTP keepalive work - in that case, you would also gain speed because you avoid the TCP handshake as well.

The feature was untouched.

Agreed - there's nothing to do with the TLS layer here. This bug has existed since early on in the 4.x series (whenever we first added macaroon support I suspect ... the second code path is probably 6 months old as we improved EOS support). No one noticed it because, outside @riccardodimaria, no one has tried using both tokens and this secretkey feature.

@rob-c
Copy link

rob-c commented Jul 30, 2020

Hi all,

After a lot of debugging of xrootd(4.11.3 and 4.12.3) and the related plugins for DPM(1.13) it looks like ECDF is potentially hitting this problem in production with our DPM head node.

GGUS: https://ggus.eu/index.php?mode=ticket_info&ticket_id=146771

Our solution has been to restart our (production) xrootd service in isolation once every 5min with the intent to clear cached connections to avoid hitting deletion errors for the ATLAS VO. This appears to have worked as we have not had a flood of errors relating to http requests against our xrootd service in > 24hr now.

Regards,

Rob

@bbockelm
Copy link
Contributor

@rob-c - I'm not seeing any overlap between that ticket and this one. This ticket was tracked down to a missing character in the redirect URL, making it invalid.

In the GGUS ticket you reference, don't get hung up on this line:

200727 18:39:31 8213 XrdAccept: Unable to allocate new link for srm.glite.ecdf.ed.ac.uk; cannot allocate memory

"Cannot allocate memory" is printed for any failure in the prior function. The real error is this:

200727 18:39:31 8213 XrdLink: attempt to reuse active link

From the code, it appears that this is triggered by a new file descriptor being created that Xrootd thinks is already in use. I have no clue what can cause this but doesn't look like this ticket. Appears to be a serious issue from the GGUS ticket - but might be best to open a new ticket.

@rob-c
Copy link

rob-c commented Jul 30, 2020

Hi,

Unfortunately the actual error causing most of the grief for us in trying to support ATLAS is that connections to the http side of xrootd fail due to an authentication issue internally.
The authn-error seems to happen when a worker within xrootd attempts to re-use a http connection which has previously been authenticated using an external x509 cert.

from our xrootd logs:
step1) Incoming http connection requests a keep alive and is authenticated using external x509 using a worker within xroot (this completes according to the logs)
step2) xroot does other things on other workers
step3) Connection is attempted using worker from step1 which should be authenticated with http.secretkey. This then fails due to a authn problem.

Using debug logging I've verified that the rejection is occurring within the xrootd code and not the dpm plugins which is why tracking this down from a DPM perspective took so long as requests from one DPM component never reach the other leading to strange errors.

From what I can gather step3 seems to be failing due to trying to use external (untrusted in this context) x509 credentials to perform an action that only trusts internal connections authenticated by http.secretkey.
(Otherwise there is some undiagnosed bug in the authentication within http(s) handling and I've no idea how to dig deeply enough to diagnose this).

The above accounts for some 90%+ of the deletion problems at our site.

The 2nd issue (in the same GGUS ticket) which seem to be impacting us is also a random issue to do with file descriptors which given is at the level of connection handling unfortunately goes beyond the available manpower/expertise required to accurately diagnose.
It has been suggested this is due to the glibc malloc we observe the same intermittent problem on our headnode using tcmalloc unless this is potentially a problem introduced by both.
Happy to open an issue about this but diagnosing a rare random bug in the connection handling but there is no more manpower at the site to track down something so fundamental.

Both of these errors seem to be strongly mitigated by restarting the service very frequently even though this is a last extreme resort.

@ffurano
Copy link
Contributor

ffurano commented Jul 30, 2020 via email

@rob-c
Copy link

rob-c commented Jul 30, 2020

Hi Fabrizio,

Webdav deletion requests from ATLAS are coming into the site via apache:443 and apache has all dome_unlink requests rejected from xrootd:1094 with a 403 from xrootd.

This persists until a full restart of xrootd.

From the xrootd logs of failing dome_unlink requests the same worker has previously handled http request with a keepalive before the internal dome request fails. (this could be unrelated but to the untrained eye this appears related and hence why I'm reporting it here).

There appears to be at least some demonstrable (bad) cache somewhere within xrootd, or the dpm/dome xrootd plugin somehow, but the logs don't suggest the connection reaches any code higher up than the xrootd http part of the framework so the (incorrect) rejection is coming from core xrootd somehow.

Restarting xrootd on the headnode in isolation fixes this potentially cached credential problem.

Rob

@ffurano
Copy link
Contributor

ffurano commented Jul 30, 2020 via email

@rob-c
Copy link

rob-c commented Jul 30, 2020

I've had the enhanced debugging in xrootd on http.trace debug for a while.

I've attached extracts from the services, dmlite, xrootd and httpd for a failing deletion in dpm.
xrootd.log
dmlite.log
ssl_err_log.log

The httpd log doesn't show much more for the deletion other than the 403 error from xrootd.
The xrootd logs show that the dome_unlink request is received by xrootd (I've included the previous command run by the same worker id as an example, it may be relevant but it may not).
Unfortunately the dmlite logs have a gap of 4min which may be because the log level not being high enough (I think it was set to 1 in this case).

I'll look to increase the logging on dmlite if it's needed. I had it manually set higher for a short period but can't seem to find the appropriate logs it confirmed a request for an unlink from apache was made but then just the error from the 403.

I can send more logs via email if it would help.

@ffurano
Copy link
Contributor

ffurano commented Jul 31, 2020 via email

@rob-c
Copy link

rob-c commented Jul 31, 2020

I'm wanting to give ATLAS a few more days to clear a backlog of broken deletions.
I will attempt to give xrootd a chance to run until failure next week with all of the settings turned up to debug to see if it sheds any more light on this.

(I can report all systems at the site are running and are correctly updated via ntp)

@abh3
Copy link
Member

abh3 commented Nov 13, 2020

I think all of this has been resolved, right? I will wait a couple of more days to see if that is not the case and if no other posts appear, the ticket will be closed.

@abh3
Copy link
Member

abh3 commented Nov 13, 2020

I am closing this ticket as there has been no other requested activity on it. If that is an error, please reopen it with an explanation.

@abh3 abh3 closed this as completed Nov 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants