HTTP requests issue: http.secretkey directive and/or VOMS cert #1251
Let me comment on predicament 1: this is related to whether or not the server uses a session cache. After a long and tortured discussion on this, it was decided to disable the session cache for xrootd as well as http. You will see that for http this was done by commenting out the code (XrdHttpProtocol.cc:1830), as Fabrizio was very wary of the change. It would appear that self-redirection enabled by the secret key depends on getting information from the previous http connection, and it's simply not there. Why that is the case is beyond me. Now, as for xrootd "restarting": what do you mean? Does it crash?
As for predicament 2: it would seem that the problem is the curl command itself and its interaction with secretkey. Note that, based on your config file, "http.selfhttps2http yes" is in effect, and there was an attempt to redirect from https to http, but since there was no cert, no "opaque" token was generated. That would explain the subsequent error. In general, redirection from https to http using a bearer token is not workable, as the token would likely be presented again over a non-encrypted connection. Finally, I know this is only testing, but including the actual secret key in the config file is bad form, as people can display the config file and steal the secret key.
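For reference, a minimal sketch of the two xrootd http directives under discussion (the key below is a placeholder, not the site's actual configuration):

```
# Sketch of the relevant xrootd http directives (placeholder value).
# Shared secret used to validate self-redirects; as noted above, the
# actual key should not be left readable in the config file.
http.secretkey changeme-placeholder
# Redirect clients from https back to plain http after authentication.
http.selfhttps2http yes
```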
Predicament 1: I understand now the point, many thanks. I am not really sure how to proceed on this though. I was planning to exploit the "fast unencrypted data access" using both However, I am still puzzled by the fact that using a CMS cert with 1 attribute hides this issue.
Predicament 2: Many thanks for the explanation. However, in view of tokens replacing x509, I am not sure how to solve this from the user PoV. Trying to explain: the user will always have only his token while contacting either the central redirector or the cache itself. AFAIU, the best way to set up the cluster for "fast redirection, distributed authentication, fast unencrypted data access" is to have the setup I am using ATM. Am I missing something in the picture?
Hi @riccardodimaria! I'm fairly unfamiliar with this feature but would recommend against using it - you lose integrity checks and the cost for encryption is fairly minimal (it's in-hardware these days).
Could you post the output of curl with the -v option? That should dump the headers to stderr and illuminate the situation. Particularly, your redirected URL should contain a xrdhttptk. You can redact the secrets in the output - I'm just looking for the presence of headers.
Also to verify - modern versions of curl are supposed to drop the Authorization header on redirect for obvious security reasons.
Finally:
"since there was no cert, no "opaque" token was generated"
The opaque token is not generated from the certificate but rather from the XrdSecEntity object. The code appears to do some questionable things, such as dumping the credentials unencrypted into the URL. I don't understand that, but it was done in 581b26b by @ffurano, so he might have some insight.
Brian
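To illustrate where to look in the curl -v dump: the redirected URL carries the opaque token as a CGI element named xrdhttptk. A small sketch (the URL below is a made-up example, not taken from this site):

```python
# Sketch: pull the opaque xrdhttptk element out of a redirect URL,
# as seen in the Location: header of a "curl -v" dump.
# The URL here is a hypothetical example value.
from urllib.parse import urlparse, parse_qs

location = "http://data01.example.org:1094//store/file.root?xrdhttptk=ABCDEF123456"
params = parse_qs(urlparse(location).query)
print(params.get("xrdhttptk", ["<missing>"])[0])  # the opaque token
```

If the xrdhttptk element is absent from the Location: header, the redirected request has nothing to present to the target server.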
Hi Brian,
the redirect to HTTP feature is used heavily in the DPM core,
and my last measurements at the time were giving a 4-5 times difference
in the metadata rate by enabling it for the internal control channels.
About it being questionable... I don't know; it does what it does,
and it has been documented for years.
This is to say that it's very important that the feature was not broken
in recent releases. Was this touched only for xrootd 5, or should
DPM sites also be worried for xrootd 4?
Thanks
Fabrizio
Hi @bbockelm, The output of The CURL version is:
@riccardodimaria - thank you for the debugging information: this is exactly what was needed!
Yes, I'd advise not to use that at all - simply because we also have fast encrypted data access. Even beyond this apparent bug, there are so many ways that secrets can leak out in unexpected ways I would remain wary. The underlying problem appears to be in @ffurano - the fix looks easy (simply see if there is an existing
Well, this is an issue for sure. The only solution I see is that the HTTP
engine would have to cache the token for a subsequent reconnection and
somehow convince the client to not resend the token in the clear. However,
many people say that TLS is a trivial overhead. If it is, then this kind
of shenanigans (and they are not trivial to pull off correctly) is hardly
worth considering.
The feature was untouched. The only "touching" was turning off the OpenSSL
connection cache (if you recall the heated discussion on that). It would
appear that is what broke the feature, but only in certain cases. So, we
don't understand the problem, but of course, if you could, look into it!
The R4 version is as-is, so there should be no need to worry about that,
but it likely needs verification.
This makes very little sense to me unless you are somehow disabling HTTP keepalive (and hence incurring a few extra round-trip overheads per metadata operation when HTTPS is used). If that's true, the more reasonable approach would be to fix HTTP keepalive - in that case, you would also gain speed because you avoid the TCP handshake as well.
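The keep-alive point can be demonstrated generically (this is plain Python http.client against a throwaway local server, not xrootd): with a persistent HTTP/1.1 connection, the second request reuses the same TCP socket, skipping a second TCP (and, for HTTPS, TLS) handshake.

```python
# Sketch: two HTTP/1.1 requests over one persistent connection reuse
# the same TCP socket, avoiding a second TCP (and TLS) handshake.
import http.client
import http.server
import threading

class Handler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 keeps connections alive by default
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/first")
conn.getresponse().read()
first_socket = conn.sock          # remember the underlying socket
conn.request("GET", "/second")
conn.getresponse().read()
print(conn.sock is first_socket)  # same socket => the connection was reused
conn.close()
server.shutdown()
```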
Agreed - this has nothing to do with the TLS layer here. This bug has existed since early in the 4.x series (whenever we first added macaroon support, I suspect... the second code path is probably 6 months old, as we improved EOS support). No one noticed it because, apart from @riccardodimaria, no one has tried using both tokens and this secretkey feature.
Hi all, After a lot of debugging of xrootd (4.11.3 and 4.12.3) and the related plugins for DPM (1.13), it looks like ECDF is potentially hitting this problem in production with our DPM head node. GGUS: https://ggus.eu/index.php?mode=ticket_info&ticket_id=146771 Our solution has been to restart our (production) xrootd service in isolation once every 5 min with the intent to clear cached connections to avoid hitting deletion errors for the ATLAS VO. This appears to have worked, as we have not had a flood of errors relating to http requests against our xrootd service in > 24 hr now. Regards, Rob
@rob-c - I'm not seeing any overlap between that ticket and this one. This ticket was tracked down to a missing character in the redirect URL, making it invalid. In the GGUS ticket you reference, don't get hung up on this line:
"Cannot allocate memory" is printed for any failure in the prior function. The real error is this:
From the code, it appears that this is triggered by a new file descriptor being created that Xrootd thinks is already in use. I have no clue what can cause this, but it doesn't look like this ticket. It appears to be a serious issue from the GGUS ticket - but it might be best to open a new ticket.
Hi, Unfortunately the actual error causing most of the grief for us in trying to support ATLAS is that connections to the http side of xrootd fail due to an authentication issue internally. The authn error seems to happen when a worker within xrootd attempts to re-use an http connection which has previously been authenticated using an external x509 cert. From our xrootd logs:
step 1) An incoming http connection requests a keep-alive and is authenticated using an external x509 cert by a worker within xrootd (this completes according to the logs).
step 2) xrootd does other things on other workers.
step 3) A connection is attempted using the worker from step 1, which should be authenticated with http.secretkey. This then fails due to an authn problem.
Using debug logging I've verified that the rejection is occurring within the xrootd code and not the dpm plugins, which is why tracking this down from a DPM perspective took so long: requests from one DPM component never reach the other, leading to strange errors. From what I can gather, step 3 seems to be failing due to trying to use external (untrusted in this context) x509 credentials to perform an action that only trusts internal connections authenticated by http.secretkey. (Otherwise there is some undiagnosed bug in the authentication within http(s) handling, and I've no idea how to dig deeply enough to diagnose this.) The above accounts for some 90%+ of the deletion problems at our site. The second issue (in the same GGUS ticket) which seems to be impacting us is a random issue to do with file descriptors which, being at the level of connection handling, unfortunately goes beyond the available manpower/expertise required to accurately diagnose. It has been suggested this is due to the glibc malloc, but we observe the same intermittent problem on our headnode using tcmalloc, unless this is potentially a problem introduced by both. Happy to open an issue about this, but there is no more manpower at the site to track down something so fundamental as a rare random bug in the connection handling. Both of these errors seem to be strongly mitigated by restarting the service very frequently, even though this is a last extreme resort.
Hi Rob,
I don't understand step 1), as DPM works differently.
External https/davs clients go to Apache, port 443. Then the Apache
plugin contacts dome (= an xrootd daemon with plugins) to get
metadata, but this is just internal gymnastics.
Could you please clarify the workflow of a failing client?
Is someone sending DAV clients to port 1094? That would be wrong.
Thanks
Fabrizio
Hi Fabrizio, Webdav deletion requests from ATLAS are coming into the site via apache:443, and all of apache's dome_unlink requests to xrootd:1094 are rejected with a 403 from xrootd. This persists until a full restart of xrootd. From the xrootd logs of failing dome_unlink requests, the same worker has previously handled an http request with a keepalive before the internal dome request fails (this could be unrelated, but to the untrained eye it appears related, hence why I'm reporting it here). There appears to be at least some demonstrable (bad) cache somewhere within xrootd, or somehow in the dpm/dome xrootd plugin, but the logs don't suggest the connection reaches any code higher up than the xrootd http part of the framework, so the (incorrect) rejection is coming from core xrootd somehow. Restarting xrootd on the headnode in isolation fixes this potentially cached credential problem. Rob
Hi,
I see your points... however the 403 puzzles me.
Would you mind boosting the log level for a short time in order
to catch such an error? It's the first time I've seen such a thing.
You should set:
- the dmlite log level to 4
- add "http.trace all" to the xrootd config
... and then isolate a manageable part of the dmlite.log
and xrootd.log, I will have a look at it.
Fabrizio
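As a concrete sketch of the two changes above (the dmlite directive name is from memory and should be checked against the local /etc/dmlite.conf):

```
# /etc/dmlite.conf -- raise dmlite verbosity (directive name assumed)
LogLevel 4

# xrootd config -- enable full http tracing
http.trace all
```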
I've had the enhanced debugging in xrootd on http.trace debug for a while. I've attached extracts from the services, dmlite, xrootd and httpd, for a failing deletion in dpm:
xrootd.log: https://github.com/xrootd/xrootd/files/5003476/xrootd.log
dmlite.log: https://github.com/xrootd/xrootd/files/5003477/dmlite.log
ssl_err_log.log: https://github.com/xrootd/xrootd/files/5003480/ssl_err_log.log
The httpd log doesn't show much more for the deletion other than the 403 error from xrootd. The xrootd logs show that the dome_unlink request is received by xrootd (I've included the previous command run by the same worker id as an example; it may be relevant, but it may not). Unfortunately the dmlite logs have a gap of 4 min, which may be because the log level was not high enough (I think it was set to 1 in this case). I'll look to increase the logging on dmlite if it's needed. I had it manually set higher for a short period but can't seem to find the appropriate logs; they confirmed a request for an unlink from apache was made, but then just the error from the 403. I can send more logs via email if it would help.
Hi,
unfortunately the xrootd log does not include the responses to those
two requests; these look OK to me. You may want to select a larger snippet.
The dmlite log will have more details, and I see that the log level
is still low. To raise it, you have to set it to 4 in /etc/dmlite.conf.
Please send it through email; I'll be happy to have a look outside of this
thread.
Last question... could you please double check that the
headnode and disk nodes have the correct datetime?
Thanks
Fabrizio
I'm wanting to give ATLAS a few more days to clear a backlog of broken deletions. (I can report all systems at the site are running and are correctly updated via ntp.)
I think all of this has been resolved, right? I will wait a couple more days to see if that is not the case, and if no other posts appear, the ticket will be closed.
I am closing this ticket as there has been no other requested activity on it. If that is an error, please reopen it with an explanation.
Hello,
while testing XRootD HEAD (compiled as of July 13th, 2020) I noticed some issues with HTTP requests, both with CURL (using tokens) and with davix-get -P grid (using 2 different certificates).
I am using an ESCAPE VO cert, meaning, in my case, 3 attributes in the VO extension.
When trying to read data open to everyone:
davix-get -P grid https://escape-wp2-puppet-xcache-level0-02.cern.ch:1094//xroot://escape-wp2-puppet-mockdata-server.cern.ch:1213//mockdata/mockdata_origin12.txt_1024_0
I get:
(Davix::HttpRequest) Error: Result Impossible to get the new redirected destination after 3 attempts
and the only thing I can spot from the logs (/afs/cern.ch/user/r/rdimaria/public/xrootd.log-xrootdticket-davix_escape) is
140654623500032:error:140D9115:SSL routines:ssl_get_prev_session:session id context uninitialized:ssl_sess.c:682:
Commenting out the http.secretkey directive in the XCache cfg, the same request results in success (clean log at /afs/cern.ch/user/r/rdimaria/public/xrootd.log-xrootdticket-davix_escape-nosecretkey).
However, it seems that xrootd gets restarted after having fetched the file.
If I now request the same file using a CMS proxy, in my case meaning having only the default attribute in the VO extension,
the request succeeds with the http.secretkey directive in the XCache cfg (/afs/cern.ch/user/r/rdimaria/public/xrootd.log-xrootdticket-davix_cms).
The token I am using is only able to read CMS data.
When trying to make the following request:
curl -L -H "Authorization: Bearer $TOKEN" -k -XGET https://escape-wp2-puppet-xcache-level0-02.cern.ch:1094//cms/cms_origin2.txt
I get:
curl: (52) NSS: client certificate not found (nickname not specified)
and the only thing I can spot from the logs (/afs/cern.ch/user/r/rdimaria/public/xrootd.log-xrootdticket-curl) is
200713 15:29:00 18515 anon.0:31@lxplus783 sysXrdHttp: Rejecting plain http with no valid token as we have a secretkey.
Commenting out the http.secretkey directive in the XCache cfg, the same request results in success (clean log at /afs/cern.ch/user/r/rdimaria/public/xrootd.log-xrootdticket-curl-nosecretkey).
IMPORTANT: AFAICT, even though I should perform more in-depth checks, this issue with CURL and davix requests affects also the "production" environment, meaning using 4.12 series (.2 and .3).
Please let me know if you need anything else from me.
Best,
Riccardo