New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

throttles the number of simultaneous kerberos library invocations during authentication #536

Merged
merged 6 commits into from Jan 10, 2019

Conversation

Projects
None yet
3 participants
@shamsimam
Copy link
Contributor

shamsimam commented Dec 14, 2018

Changes proposed in this PR

  • throttles the number of simultaneous kerberos library invocations during authentication
  • reports a 503 temporarily unavailable when there are too many pending kerberos authentications

Why are we making these changes?

This prevents a router from crashing due to too many concurrent calls to the GSS library. The concurrency level controls the number of threads present in the Thread pool and hence throttles the number of concurrent calls to the GSS library. Using the custom thread pool also offloads the relatively expensive native GSS calls to not block the core.async threads. In addition, the PR also responds with 503s to requests that cause the pending kerberos authentication requests to grow beyond a pre-configured limit.

The Semaphore based solution looks much simpler but can potentially block async threads waiting on the acquire call. As this scenario can lead to poor performance in processing requests through Waiter, the Semaphore based approach was avoided.

Screenshot of Kerberos metrics:
image

@shamsimam shamsimam added the wip label Dec 14, 2018

@shamsimam shamsimam force-pushed the throttle-kerberos-calls branch 7 times, most recently from 1483763 to e3b8033 Dec 17, 2018

@shamsimam shamsimam removed the wip label Dec 24, 2018

@shamsimam shamsimam self-assigned this Dec 24, 2018

@shamsimam shamsimam force-pushed the throttle-kerberos-calls branch 4 times, most recently from 5038a14 to a72f866 Dec 26, 2018

@shamsimam shamsimam requested a review from sradack Dec 26, 2018

{:pre [(not-empty password)
(integer? concurrency-level)
(pos? concurrency-level)]}
(let [thread-pool (Executors/newFixedThreadPool concurrency-level)]

This comment has been minimized.

@scrosby

scrosby Dec 27, 2018

Member

I don't like creating a possibly idle pool of 30+ extra threads. This should be self-scaling, up to the limit you set. I'd also rename the config value max-concurrency-level, not concurrency-level.

What about 'new ThreadPoolExecutor' directly to construct it, that way you can make the pool auto-scale. Set a core pool size of, say, 1 thread, scale to max-concurrency-level maximum, with a 5 minute idle thread timeout. You can also put in a ThreadFactory and name the threads as they're created! :) That'll make stack traces and monitoring work nicer.

This comment has been minimized.

@shamsimam

shamsimam Dec 28, 2018

Contributor

Thanks for the suggestion. Agreed and implemented (except the custom thread factory).

user (first (str/split principal #"@" 2))
response (auth/handle-request-auth request-handler request user principal password)]
(log/debug "added cookies to response")
(if token

This comment has been minimized.

@scrosby

scrosby Dec 27, 2018

Member

I don't know my core-async to review this. Please have someone else do this part.

Show resolved Hide resolved waiter/src/waiter/auth/spnego.clj

@shamsimam shamsimam force-pushed the throttle-kerberos-calls branch from a72f866 to 23d72b3 Dec 28, 2018

@shamsimam

This comment has been minimized.

Copy link
Contributor

shamsimam commented Dec 28, 2018

@scrosby ready for another round of review.

@scrosby
Copy link
Member

scrosby left a comment

Please have someone else review the core-async code.

Show resolved Hide resolved waiter/src/waiter/auth/spnego.clj

@shamsimam shamsimam force-pushed the throttle-kerberos-calls branch from 23d72b3 to 1bf581e Jan 2, 2019

(if (.isEstablished gss-context)
(let [principal (gss-get-principal gss-context)
user (first (str/split principal #"@" 2))
response (auth/handle-request-auth request-handler request user principal password)]

This comment has been minimized.

@sradack

sradack Jan 2, 2019

Contributor

Processing the request should happen outside of the threadpool. The async action should only cover authentication, returning only the principal through the channel, after which processing request can proceed.

This comment has been minimized.

@shamsimam

shamsimam Jan 3, 2019

Contributor

Good catch! Done.

@@ -93,6 +94,11 @@
[classifier & nested-path]
`(counters/counter ~(metric-name (concat ["waiter" classifier "counters"] nested-path))))

(defmacro waiter-gauge

This comment has been minimized.

@sradack

sradack Jan 2, 2019

Contributor

Using gauges is obviously new. We avoided guages in the past, but unsure if we need to continue to avoid them.

This comment has been minimized.

@shamsimam

shamsimam Jan 3, 2019

Contributor

Imo, the metrics we are reporting here are best represented as gauges.

I suspect we can convert a few of our counters to gauges also.

@sradack

This comment has been minimized.

Copy link
Contributor

sradack commented Jan 3, 2019

@shamsimam What are your thoughts on an approach that rejects requests that exceed the limit (with a request limit exceeded error) instead of allowing them to queue up inside Waiter?

@shamsimam
Copy link
Contributor

shamsimam left a comment

What are your thoughts on an approach that rejects requests that exceed the limit (with a request limit exceeded error) instead of allowing them to queue up inside Waiter?

I'm okay adding an additional step to keep the queue size in limit. I think we still need the throttle as the previous issue we ran into was because of the rate at which we were making Kerberos API calls (not sure if there was any sizable queue build-up when the error occurred).

@@ -93,6 +94,11 @@
[classifier & nested-path]
`(counters/counter ~(metric-name (concat ["waiter" classifier "counters"] nested-path))))

(defmacro waiter-gauge

This comment has been minimized.

@shamsimam

shamsimam Jan 3, 2019

Contributor

Imo, the metrics we are reporting here are best represented as gauges.

I suspect we can convert a few of our counters to gauges also.

(if (.isEstablished gss-context)
(let [principal (gss-get-principal gss-context)
user (first (str/split principal #"@" 2))
response (auth/handle-request-auth request-handler request user principal password)]

This comment has been minimized.

@shamsimam

shamsimam Jan 3, 2019

Contributor

Good catch! Done.

@shamsimam

This comment has been minimized.

Copy link
Contributor

shamsimam commented Jan 3, 2019

@sradack added support for failing requests when the kerberos queue length reaches a configured limit.

It is ready for review.

Show resolved Hide resolved waiter/src/waiter/auth/spnego.clj Outdated
Show resolved Hide resolved waiter/src/waiter/auth/spnego.clj Outdated
Show resolved Hide resolved waiter/src/waiter/auth/spnego.clj
Show resolved Hide resolved waiter/src/waiter/auth/spnego.clj
@scrosby

scrosby approved these changes Jan 4, 2019

Copy link
Member

scrosby left a comment

Looks good for me, modulo steven's feedback on the async parts.

@shamsimam

This comment has been minimized.

Copy link
Contributor

shamsimam commented Jan 4, 2019

@sradack any further feedback?

shamsimam added some commits Jan 2, 2019

avoids reflection by explicit type hinting
specifies text/plain content types
changes message to 'Too many Kerberos authentication requests'

@shamsimam shamsimam force-pushed the throttle-kerberos-calls branch from b1d21a7 to 0709e19 Jan 8, 2019

@sradack sradack merged commit da4261b into master Jan 10, 2019

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@sradack sradack deleted the throttle-kerberos-calls branch Jan 10, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment