
Client fails to clean up session after an authentication failure #1033

Closed
abh3 opened this issue Aug 2, 2019 · 9 comments

abh3 (Member) commented Aug 2, 2019

We start a proxy server that needs to GSI authenticate to the origin. We
start it with a certificate that is about to expire. We wait until after the
expiration time and then do the following:

  1. user 'u7' (as in u7@host) attempts to open a file (or do anything that
    requires login/auth).

  2. The client opens a connection and logs in. All well and good.

  3. The client attempts to authenticate but authentication fails. The
    specific message is rather opaque. The log here contains:
    [2019-07-02 16:05:59.204401 -0700][Debug ][XRootDTransport ]
    [u7@griddev04.slac.stanford.edu:1094 #0.0] Logged in, session:
    01000000682200000700000001000000
    [2019-07-02 16:05:59.204448 -0700][Debug ][XRootDTransport ]
    [u7@griddev04.slac.stanford.edu:1094 #0.0] Authentication is required:
    &P=gsi,v:10400,c:ssl,ca:ba240aa8.0|f5f0dfc2.0
    [2019-07-02 16:05:59.204483 -0700][Debug ][XRootDTransport ]
    [u7@griddev04.slac.stanford.edu:1094 #0.0] Sending authentication data
    [2019-07-02 16:05:59.206272 -0700][Debug ][XRootDTransport ]
    [u7@griddev04.slac.stanford.edu:1094 #0.0] Trying to authenticate using gsi
    [2019-07-02 16:05:59.238792 -0700][Debug ][XRootDTransport ]
    [u7@griddev04.slac.stanford.edu:1094 #0.0] Cannot get credentials for
    protocol gsi: Secgsi: ErrParseBuffer: error getting user proxies: kXGS_init
    [2019-07-02 16:05:59.238898 -0700][Error ][XRootDTransport ]
    [u7@griddev04.slac.stanford.edu:1094 #0.0] No protocols left to try
    [2019-07-02 16:05:59.238929 -0700][Error ][AsyncSock ]
    [u7@griddev04.slac.stanford.edu:1094 #0.0] Socket error while handshaking:
    [FATAL] Auth failed
    [2019-07-02 16:05:59.238951 -0700][Debug ][AsyncSock ]
    [u7@griddev04.slac.stanford.edu:1094 #0.0] Closing the socket
    [2019-07-02 16:05:59.238980 -0700][Debug ][Poller ]
    <[::ffff:134.79.120.194]:43466><--><[::ffff:134.79.129.117]:1094> Removing
    socket from the poller
    [2019-07-02 16:05:59.239066 -0700][Error ][PostMaster ]
    [u7@griddev04.slac.stanford.edu:1094 #0] elapsed = 0, pConnectionWindow =
    120 seconds.
    [2019-07-02 16:05:59.239120 -0700][Error ][PostMaster ]
    [u7@griddev04.slac.stanford.edu:1094 #0] Unable to recover: [FATAL] Auth
    failed.

  4. The client closes the connection and returns auth failed (this has been
    verified by looking at the origin server, u7 no longer has a connection).

  5. Since the request came through the Posix layer, the auth failure is
    translated to "permission denied" (EACCES), as that's the only errno code
    available. This explains why authentication errors may appear as
    permission errors (see the sketch after this list).

  6. The proxy cert is refreshed. We run the test again using user u20, as
    expected, it succeeds. The client creates a socket, does the handshake,
    a login, and a successful authentication and proceeds to open the file.

  7. Now, we run the test again, but using user u7 (the one that failed to
    authenticate).
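
To make step 5 concrete, here is a minimal sketch of the kind of POSIX
translation involved; the enum and function names are hypothetical and this
is not the actual XrdPosix code.

```cpp
#include <cerrno>

// Hypothetical error classes standing in for the client's status codes; the
// real client reports an XrdCl status, whose fields are not reproduced here.
enum class ClientError { None, AuthFailed, NotFound, Timeout };

// Illustrative POSIX translation: an authentication failure has no errno of
// its own, so it collapses to EACCES ("permission denied"), which is why the
// failure in step 5 surfaces as a permission error.
int ToErrno( ClientError e )
{
  switch( e )
  {
    case ClientError::None:       return 0;
    case ClientError::AuthFailed: return EACCES;    // auth failed -> "permission denied"
    case ClientError::NotFound:   return ENOENT;
    case ClientError::Timeout:    return ETIMEDOUT;
  }
  return EIO;                                       // fallback for anything unclassified
}
```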

For the step-7 run, the following sequence appears in the log:

190702 16:06:47 1227 yangw.19150:7@atlint01 XrootdProtocol: 0100 open rat
/tmp/README
190702 16:06:47 1227 yangw.19150:7@atlint01 ofs_open: 0-600 fn=/tmp/README
[2019-07-02 16:06:47.251377 -0700][Debug ][File ]
[0x1600@root://u7@griddev04.slac.stanford.edu:1094//tmp/README?oss.lcl=1&pss.tid=yangw.19150:7@atlint01]
Sending an open command
[2019-07-02 16:06:47.251433 -0700][Debug ][ExDbgMsg ]
[u7@griddev04.slac.stanford.edu:1094] MsgHandler created: 0x1e00 (message:
kXR_open (file: /tmp/README?oss.lcl=1&pss.tid=yangw.19150:7@atlint01, mode:
00, flags: kXR_open_read kXR_async kXR_retstat ) ).

[2019-07-02 16:06:47.251463 -0700][Error ][XRootD ]
[u7@griddev04.slac.stanford.edu:1094] Unable to send the message kXR_open
(file: /tmp/README?oss.lcl=1&pss.tid=yangw.19150:7@atlint01, mode: 00,
flags: kXR_open_read kXR_async kXR_retstat ): [FATAL] Auth failed

[2019-07-02 16:06:47.251478 -0700][Debug ][ExDbgMsg ]
[u7@griddev04.slac.stanford.edu:1094] Destroying MsgHandler: 0x1e00.
190702 16:06:47 1227 Posix_Open: [FATAL] Auth failed open
root://u7@griddev04.slac.stanford.edu:1094//tmp/README?pss.tid=yangw.19150:7@atlint01&oss.lcl=1
190702 16:06:47 1227 ofs_open: yangw.19150:7@atlint01 Unable to open
/tmp/README; permission denied
190702 16:06:47 1227 yangw.19150:7@atlint01 XrootdResponse: 0100 sending err
3010: Unable to open /tmp/README; permission denied

Note that the client never tried to open a socket and do the login/auth
sequence. It should have, since there is no socket for u7. However, it
remembers that the auth failed and refuses to do anything but return
an authentication error. In essence, user u7 is permanently dead.

The 'u7' situation might be reasonable if an actual physical socket to the
server existed. In that case, one can't force re-authentication and the only
choice is to simply fail or force the user to disconnect. However, in this
case the client closed the socket and removed it from the poller, yet it
thinks it still has a session. There was no attempt to connect and log in;
it immediately fails (note that the origin server never received a
connection).

Oddly, this was partly corrected by patch

70569c3#diff-860d6aebc8e74302575c6f5ffa954c68

but that only helps to reload the client creds on a new connection. Prior to
that it would use whatever it did on the previous connection to the server
(which would likely fail). This problem is similar in that the
authentication failed, but the client remembers that failure even though it
still needs to subsequently connect and log in.

For completeness, here is a successful login/auth after the failed one.

[2019-07-02 16:06:31.320314 -0700][Debug ][XRootDTransport ]
[u20@griddev04.slac.stanford.edu:1094 #0.0] Sending out kXR_login request,
username: u20, cgi:
?xrd.cc=us&xrd.tz=-8&xrd.appname=xrootd&xrd.info=&xrd.hostname=rhel6-64n.slac.stanford.edu&xrd.rn=v20190702-5cdfe18,
dual-stack: true, private IPv4: false, private IPv6: false
[2019-07-02 16:06:31.320970 -0700][Debug ][XRootDTransport ]
[u20@griddev04.slac.stanford.edu:1094 #0.0] Logged in, session:
02000000682200001800000002000000
[2019-07-02 16:06:31.321024 -0700][Debug ][XRootDTransport ]
[u20@griddev04.slac.stanford.edu:1094 #0.0] Authentication is required:
&P=gsi,v:10400,c:ssl,ca:ba240aa8.0|f5f0dfc2.0
[2019-07-02 16:06:31.321061 -0700][Debug ][XRootDTransport ]
[u20@griddev04.slac.stanford.edu:1094 #0.0] Sending authentication data
[2019-07-02 16:06:31.321134 -0700][Debug ][XRootDTransport ]
[u20@griddev04.slac.stanford.edu:1094 #0.0] Trying to authenticate using gsi
[2019-07-02 16:06:31.407977 -0700][Debug ][XRootDTransport ]
[u20@griddev04.slac.stanford.edu:1094 #0.0] Sending more authentication data
for gsi
[2019-07-02 16:06:31.453803 -0700][Debug ][XRootDTransport ]
[u20@griddev04.slac.stanford.edu:1094 #0.0] Authenticated with gsi.
[2019-07-02 16:06:31.453918 -0700][Debug ][PostMaster ]
[u20@griddev04.slac.stanford.edu:1094 #0] Stream 0 connected.
[2019-07-02 16:06:31.454005 -0700][Debug ][ExDbgMsg ]
[u20@griddev04.slac.stanford.edu:1094] Moving MsgHandler: 0x4c004180
(message: kXR_open (file:
/tmp/README?oss.lcl=1&pss.tid=yangw.19139:20@atlint01, mode: 00, flags:
kXR_open_read kXR_async kXR_retstat ) ) from out-queu to in-queue.
[2019-07-02 16:06:31.455264 -0700][Debug ][ExDbgMsg ]
[u20@griddev04.slac.stanford.edu:1094] Calling MsgHandler: 0x4c004180
(message: kXR_open (file:
/tmp/README?oss.lcl=1&pss.tid=yangw.19139:20@atlint01, mode: 00, flags:
kXR_open_read kXR_async kXR_retstat ) ) with status: [SUCCESS]

abh3 (Member, Author) commented Aug 2, 2019

OK, I now see what the issue is. Here is the scenario:

  1. The user gets an authentication error. This is an error during the handshake phase, so it's marked as fatal and the socket is disconnected, but the stream persists because, hey, it might need to be reconnected at some future time.
  2. Indeed, at some later time the user attempts another request.
  3. The request gets routed to the stream and the stream eventually calls Stream::EnableLink.
  4. EnableLink notices the socket for the main stream is not connected. It checks whether there was a fatal error at the time of disconnection (there was: an auth error) and whether enough time has passed since the last connection attempt. It hasn't, because by default 1800 seconds (i.e., 30 minutes) must have passed, so it simply returns the last error (auth failed), as sketched below.
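
Here is a paraphrase of that check in code, a minimal sketch only: the member and constant names are hypothetical and do not match the actual XrdCl::Stream implementation.

```cpp
#include <ctime>

// Hypothetical stand-in for the per-stream state described above.
struct StreamSketch
{
  bool   fatalError     = false;  // set when the handshake/authentication failed
  int    lastError      = 0;      // e.g. "auth failed"
  time_t lastDisconnect = 0;      // when the socket was torn down
  int    errorWindow    = 1800;   // seconds; the 30-minute default window

  // Called when a new request needs the link; mirrors step 4 above.
  int EnableLink( time_t now )
  {
    if( fatalError && ( now - lastDisconnect ) < errorWindow )
      return lastError;           // short-circuit: replay "auth failed"
    // ...otherwise a fresh connect + login + authentication would be attempted.
    return 0;
  }
};
```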

We can see that this logic has several undesirable effects, among which the default retry window is far too long. Essentially, all requests by the user who hit a bad certificate during login are prohibited for 30 minutes, even if this was a transient error. Granted, we don't know how transient it really is.

While this may be OK for a regular client, it wreaks havoc with a proxy server, which is essentially unusable for a very long time if a) a certificate is not renewed on time, or b) a transient authentication error occurs.

I suppose the expedient thing here is to reduce the time window, but that really is not a good solution in the long run. I'm looking for alternative suggestions here.

gganis (Member) commented Aug 2, 2019

Regular clients attached to a tty are given the possibility to renew the proxy. In this case that's not possible. Perhaps the classification as 'fatal error' is excessive in this case? What other options do we have?

simonmichal (Contributor) commented:

@abh3 : the length of time that needs to elapse after a fatal error before the client may attempt to reconnect to the server is controlled by XRD_STREAMERRORWINDOW, not XRD_CONNECTIONWINDOW. Have a look at:

http://xrootd.org/doc/xrdcl-docs/www/xrdcldocs.html#x1-410004.2.31
http://xrootd.org/doc/xrdcl-docs/www/xrdcldocs.html#x1-550004.3.3

The default value of StreamErrorWindow, as you correctly pointed out, is 1800 seconds; you just need to tune it.

(Alternatively, we could make the StreamErrorWindow authentication-error aware.)
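
For illustration, a minimal sketch of tuning that value programmatically, assuming the standard XrdCl::DefaultEnv / Env::PutInt interface and the StreamErrorWindow key named in the docs above (setting it to 0 disables the window; the same effect can be had by exporting XRD_STREAMERRORWINDOW in the environment):

```cpp
#include <XrdCl/XrdClDefaultEnv.hh>

int main()
{
  // Shrink (or disable, with 0) the post-fatal-error back-off window before
  // any connection is made; equivalent to XRD_STREAMERRORWINDOW=0 in the
  // process environment.
  XrdCl::Env *env = XrdCl::DefaultEnv::GetEnv();
  env->PutInt( "StreamErrorWindow", 0 );

  // ...proceed to use XrdCl::File / XrdCl::FileSystem as usual...
  return 0;
}
```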

xrootd-dev commented Aug 7, 2019 via email

simonmichal (Contributor) commented:

It should be relatively safe to disable it, e.g. the EOS data servers run with the StreamErrorWindow disabled.

abh3 (Member, Author) commented Feb 18, 2020

I think the solution here is to close the socket when authentication fails, as you won't be able to do anything on that stream anyway.
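
In sketch form (continuing the hypothetical StreamSketch from my earlier comment, so still a sketch, not the actual client code): on an authentication failure, in addition to closing the socket, drop the stream's sticky error state so the next request triggers a fresh connect/login/authenticate cycle instead of replaying "auth failed".

```cpp
#include <ctime>

// Continues the hypothetical StreamSketch defined in the earlier comment.
void OnAuthFailure( StreamSketch &s, time_t now )
{
  // The socket is already closed and removed from the poller (per the log)...
  s.lastDisconnect = now;

  // ...but also forget the failed session instead of pinning the error, so a
  // later request performs a fresh connect + login + authentication.
  s.fatalError = false;
  s.lastError  = 0;
}
```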

abh3 (Member, Author) commented Mar 31, 2020

Hey Michal, any movement on this?

simonmichal (Contributor) commented:

@abh3 : should be fixed in dc16af6, could you give it a try?

simonmichal (Contributor) commented:

Got a confirmation from @abh3 that it works, hence I'm closing the ticket.
