
Client fails to clean up session after an authentication failure #1033

Closed
abh3 opened this issue Aug 2, 2019 · 9 comments

abh3 (Member) commented Aug 2, 2019

We start a proxy server that needs to GSI authenticate to the origin. We
start it with a certificate that is about to expire. We wait until after the
expiration time and then do the following:

  1. user 'u7' (as in u7@host) attempts to open a file (or do anything that
    requires login/auth).

  2. The client opens a connection and logs in. All well and good.

  3. The client attempts to authenticate but authentication fails. The
    specific message is rather opaque. The log here contains:
    [2019-07-02 16:05:59.204401 -0700][Debug ][XRootDTransport ]
    [u7@griddev04.slac.stanford.edu:1094 #0.0] Logged in, session:
    01000000682200000700000001000000
    [2019-07-02 16:05:59.204448 -0700][Debug ][XRootDTransport ]
    [u7@griddev04.slac.stanford.edu:1094 #0.0] Authentication is required:
    &P=gsi,v:10400,c:ssl,ca:ba240aa8.0|f5f0dfc2.0
    [2019-07-02 16:05:59.204483 -0700][Debug ][XRootDTransport ]
    [u7@griddev04.slac.stanford.edu:1094 #0.0] Sending authentication data
    [2019-07-02 16:05:59.206272 -0700][Debug ][XRootDTransport ]
    [u7@griddev04.slac.stanford.edu:1094 #0.0] Trying to authenticate using gsi
    [2019-07-02 16:05:59.238792 -0700][Debug ][XRootDTransport ]
    [u7@griddev04.slac.stanford.edu:1094 #0.0] Cannot get credentials for
    protocol gsi: Secgsi: ErrParseBuffer: error getting user proxies: kXGS_init
    [2019-07-02 16:05:59.238898 -0700][Error ][XRootDTransport ]
    [u7@griddev04.slac.stanford.edu:1094 #0.0] No protocols left to try
    [2019-07-02 16:05:59.238929 -0700][Error ][AsyncSock ]
    [u7@griddev04.slac.stanford.edu:1094 #0.0] Socket error while handshaking:
    [FATAL] Auth failed
    [2019-07-02 16:05:59.238951 -0700][Debug ][AsyncSock ]
    [u7@griddev04.slac.stanford.edu:1094 #0.0] Closing the socket
    [2019-07-02 16:05:59.238980 -0700][Debug ][Poller ]
    <[::ffff:134.79.120.194]:43466><--><[::ffff:134.79.129.117]:1094> Removing
    socket from the poller
    [2019-07-02 16:05:59.239066 -0700][Error ][PostMaster ]
    [u7@griddev04.slac.stanford.edu:1094 #0] elapsed = 0, pConnectionWindow =
    120 seconds.
    [2019-07-02 16:05:59.239120 -0700][Error ][PostMaster ]
    [u7@griddev04.slac.stanford.edu:1094 #0] Unable to recover: [FATAL] Auth
    failed.

  4. The client closes the connection and returns auth failed (this has been
    verified by looking at the origin server, u7 no longer has a connection).

  5. Since the request came through the Posix layer, the auth failure is
    translated to "permission denied" (EACCES), as that's the only errno code
    available. This explains why authentication errors may appear as
    permission errors (see the sketch after this list).

  6. The proxy cert is refreshed. We run the test again using user u20, as
    expected, it succeeds. The client creates a socket, does the handshake,
    a login, and a successful authentication and proceeds to open the file.

  7. Now, we run the test again, but using user u7 (the one that failed to
    authenticate).
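
To make step 5 concrete, here is a minimal sketch of the kind of POSIX
translation involved; the enum and function names are hypothetical and this
is not the actual XrdPosix code.

```cpp
#include <cerrno>

// Hypothetical error classes standing in for the client's status codes; the
// real client reports an XrdCl status, whose fields are not reproduced here.
enum class ClientError { None, AuthFailed, NotFound, Timeout };

// Illustrative POSIX translation: an authentication failure has no errno of
// its own, so it collapses to EACCES ("permission denied"), which is why the
// failure in step 5 surfaces as a permission error.
int ToErrno( ClientError e )
{
  switch( e )
  {
    case ClientError::None:       return 0;
    case ClientError::AuthFailed: return EACCES;    // auth failed -> "permission denied"
    case ClientError::NotFound:   return ENOENT;
    case ClientError::Timeout:    return ETIMEDOUT;
  }
  return EIO;                                       // fallback for anything unclassified
}
```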

For the step-7 run, the following sequence appears in the log:

190702 16:06:47 1227 yangw.19150:7@atlint01 XrootdProtocol: 0100 open rat
/tmp/README
190702 16:06:47 1227 yangw.19150:7@atlint01 ofs_open: 0-600 fn=/tmp/README
[2019-07-02 16:06:47.251377 -0700][Debug ][File ]
[0x1600@root://u7@griddev04.slac.stanford.edu:1094//tmp/README?oss.lcl=1&pss.tid=yangw.19150:7@atlint01]
Sending an open command
[2019-07-02 16:06:47.251433 -0700][Debug ][ExDbgMsg ]
[u7@griddev04.slac.stanford.edu:1094] MsgHandler created: 0x1e00 (message:
kXR_open (file: /tmp/README?oss.lcl=1&pss.tid=yangw.19150:7@atlint01, mode:
00, flags: kXR_open_read kXR_async kXR_retstat ) ).

[2019-07-02 16:06:47.251463 -0700][Error ][XRootD ]
[u7@griddev04.slac.stanford.edu:1094] Unable to send the message kXR_open
(file: /tmp/README?oss.lcl=1&pss.tid=yangw.19150:7@atlint01, mode: 00,
flags: kXR_open_read kXR_async kXR_retstat ): [FATAL] Auth failed

[2019-07-02 16:06:47.251478 -0700][Debug ][ExDbgMsg ]
[u7@griddev04.slac.stanford.edu:1094] Destroying MsgHandler: 0x1e00.
190702 16:06:47 1227 Posix_Open: [FATAL] Auth failed open
root://u7@griddev04.slac.stanford.edu:1094//tmp/README?pss.tid=yangw.19150:7@atlint01&oss.lcl=1
190702 16:06:47 1227 ofs_open: yangw.19150:7@atlint01 Unable to open
/tmp/README; permission denied
190702 16:06:47 1227 yangw.19150:7@atlint01 XrootdResponse: 0100 sending err
3010: Unable to open /tmp/README; permission denied

Note that the client never tried to open a socket and do the login/auth
sequence. It should have, since there is no socket for u7. However, it
remembers that the auth failed and refuses to do anything but return
an authentication error. In essence, user u7 is permanently dead.

The 'u7' situation might be reasonable if an actual physical socket to the
server existed. In that case, one can't force re-authentication and the only
choice is to simply fail or force the user to disconnect. However, in this
case the client closed the socket and removed it from the poller, yet it
thinks it still has a session. There was no attempt to connect and log in;
it immediately fails (note that the origin server never received a
connection).

Oddly, this was partly corrected by patch

70569c3#diff-860d6aebc8e74302575c6f5ffa954c68

but that only helps to reload the client creds on a new connection. Prior to
that it would use whatever it did on the previous connection to the server
(which would likely fail). This problem is similar in that the
authentication failed, but the client remembers that failure even though it
still needs to subsequently connect and log in.

For completeness, here is a successful login/auth after the failed one.

[2019-07-02 16:06:31.320314 -0700][Debug ][XRootDTransport ]
[u20@griddev04.slac.stanford.edu:1094 #0.0] Sending out kXR_login request,
username: u20, cgi:
?xrd.cc=us&xrd.tz=-8&xrd.appname=xrootd&xrd.info=&xrd.hostname=rhel6-64n.slac.stanford.edu&xrd.rn=v20190702-5cdfe18,
dual-stack: true, private IPv4: false, private IPv6: false
[2019-07-02 16:06:31.320970 -0700][Debug ][XRootDTransport ]
[u20@griddev04.slac.stanford.edu:1094 #0.0] Logged in, session:
02000000682200001800000002000000
[2019-07-02 16:06:31.321024 -0700][Debug ][XRootDTransport ]
[u20@griddev04.slac.stanford.edu:1094 #0.0] Authentication is required:
&P=gsi,v:10400,c:ssl,ca:ba240aa8.0|f5f0dfc2.0
[2019-07-02 16:06:31.321061 -0700][Debug ][XRootDTransport ]
[u20@griddev04.slac.stanford.edu:1094 #0.0] Sending authentication data
[2019-07-02 16:06:31.321134 -0700][Debug ][XRootDTransport ]
[u20@griddev04.slac.stanford.edu:1094 #0.0] Trying to authenticate using gsi
[2019-07-02 16:06:31.407977 -0700][Debug ][XRootDTransport ]
[u20@griddev04.slac.stanford.edu:1094 #0.0] Sending more authentication data
for gsi
[2019-07-02 16:06:31.453803 -0700][Debug ][XRootDTransport ]
[u20@griddev04.slac.stanford.edu:1094 #0.0] Authenticated with gsi.
[2019-07-02 16:06:31.453918 -0700][Debug ][PostMaster ]
[u20@griddev04.slac.stanford.edu:1094 #0] Stream 0 connected.
[2019-07-02 16:06:31.454005 -0700][Debug ][ExDbgMsg ]
[u20@griddev04.slac.stanford.edu:1094] Moving MsgHandler: 0x4c004180
(message: kXR_open (file:
/tmp/README?oss.lcl=1&pss.tid=yangw.19139:20@atlint01, mode: 00, flags:
kXR_open_read kXR_async kXR_retstat ) ) from out-queu to in-queue.
[2019-07-02 16:06:31.455264 -0700][Debug ][ExDbgMsg ]
[u20@griddev04.slac.stanford.edu:1094] Calling MsgHandler: 0x4c004180
(message: kXR_open (file:
/tmp/README?oss.lcl=1&pss.tid=yangw.19139:20@atlint01, mode: 00, flags:
kXR_open_read kXR_async kXR_retstat ) ) with status: [SUCCESS]

abh3 (Member, Author) commented Aug 2, 2019

OK, I now see what the issue is. Here is the scenario:

  1. The user gets an authentication error. This is an error during the handshake phase, so it's marked as fatal and the socket is disconnected, but the stream persists because, hey, it might need to be reconnected at some future time.
  2. Indeed, at some later time the user attempts another request.
  3. The request gets routed to the stream and the stream eventually calls Stream::EnableLink.
  4. EnableLink notices the socket for the main stream is not connected. It checks whether there was a fatal error at the time of disconnection (there was: an auth error) and whether enough time has passed since the last connection attempt. It hasn't, because by default 1800 seconds (i.e., 30 minutes) must have passed, so it simply returns the last error (auth failed), as sketched below.
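
Here is a paraphrase of that check in code, a minimal sketch only: the member and constant names are hypothetical and do not match the actual XrdCl::Stream implementation.

```cpp
#include <ctime>

// Hypothetical stand-in for the per-stream state described above.
struct StreamSketch
{
  bool   fatalError     = false;  // set when the handshake/authentication failed
  int    lastError      = 0;      // e.g. "auth failed"
  time_t lastDisconnect = 0;      // when the socket was torn down
  int    errorWindow    = 1800;   // seconds; the 30-minute default window

  // Called when a new request needs the link; mirrors step 4 above.
  int EnableLink( time_t now )
  {
    if( fatalError && ( now - lastDisconnect ) < errorWindow )
      return lastError;           // short-circuit: replay "auth failed"
    // ...otherwise a fresh connect + login + authentication would be attempted.
    return 0;
  }
};
```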

We can see that this logic has several undesirable effects, among which the default retry window is far too long. Essentially, all requests by the user who hit a bad certificate during login are prohibited for 30 minutes, even if this was a transient error. Granted, we don't know how transient it really is.

While this may be OK for a regular client, it wreaks havoc with a proxy server, which is essentially unusable for a very long time if a) a certificate is not renewed on time, or b) a transient authentication error occurs.

I suppose the expedient thing here is to reduce the time window, but that really is not a good solution in the long run. I'm looking for alternative suggestions here.

gganis (Member) commented Aug 2, 2019

Regular clients attached to a tty are given the possibility to renew the proxy. In this case that's not possible. Perhaps the classification as 'fatal error' is excessive in this case? What other options do we have?

simonmichal (Contributor) commented:

@abh3 : the length of time that needs to elapse after a fatal error before the client may attempt to reconnect to the server is controlled by XRD_STREAMERRORWINDOW, not XRD_CONNECTIONWINDOW. Have a look at:

http://xrootd.org/doc/xrdcl-docs/www/xrdcldocs.html#x1-410004.2.31
http://xrootd.org/doc/xrdcl-docs/www/xrdcldocs.html#x1-550004.3.3

The default value of StreamErrorWindow, as you correctly pointed out, is 1800 seconds; you just need to tune it.

(Alternatively, we could make the StreamErrorWindow authentication-error aware.)
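
For illustration, a minimal sketch of tuning that value programmatically, assuming the standard XrdCl::DefaultEnv / Env::PutInt interface and the StreamErrorWindow key named in the docs above (setting it to 0 disables the window; the same effect can be had by exporting XRD_STREAMERRORWINDOW in the environment):

```cpp
#include <XrdCl/XrdClDefaultEnv.hh>

int main()
{
  // Shrink (or disable, with 0) the post-fatal-error back-off window before
  // any connection is made; equivalent to XRD_STREAMERRORWINDOW=0 in the
  // process environment.
  XrdCl::Env *env = XrdCl::DefaultEnv::GetEnv();
  env->PutInt( "StreamErrorWindow", 0 );

  // ...proceed to use XrdCl::File / XrdCl::FileSystem as usual...
  return 0;
}
```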

xrootd-dev commented Aug 7, 2019 via email

simonmichal (Contributor) commented:

It should be relatively safe to disable it, e.g. the EOS data servers run with the StreamErrorWindow disabled.

abh3 (Member, Author) commented Feb 18, 2020

I think the solution here is to close the socket when authentication fails, as you won't be able to do anything on that stream anyway.
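
In sketch form (continuing the hypothetical StreamSketch from my earlier comment, so still a sketch, not the actual client code): on an authentication failure, in addition to closing the socket, drop the stream's sticky error state so the next request triggers a fresh connect/login/authenticate cycle instead of replaying "auth failed".

```cpp
#include <ctime>

// Continues the hypothetical StreamSketch defined in the earlier comment.
void OnAuthFailure( StreamSketch &s, time_t now )
{
  // The socket is already closed and removed from the poller (per the log)...
  s.lastDisconnect = now;

  // ...but also forget the failed session instead of pinning the error, so a
  // later request performs a fresh connect + login + authentication.
  s.fatalError = false;
  s.lastError  = 0;
}
```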

abh3 (Member, Author) commented Mar 31, 2020

Hey Michal, any movement on this?

simonmichal (Contributor) commented:

@abh3 : should be fixed in dc16af6, could you give it a try?

simonmichal (Contributor) commented:

Got a confirmation from @abh3 that it works, hence I'm closing the ticket.
