
Kafka Keycloak authentication issues, possible token expiry #232

Open
msillence opened this issue Mar 18, 2024 · 19 comments

@msillence

We've been running Kafka authenticating with Keycloak for about a month without issue, when last week all authentication suddenly failed.

Clients were getting the error:
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TopicAuthorizationException: Topic authorization failed.

and there were no logs showing anything interesting on Keycloak or Kafka.

Turning on debugging for Kafka, I saw hundreds of messages of the form:
"2024-03-15 10:20:57,439 DEBUG Authorization DENIED due to token expiry - The token expired at: 1710498016000 (2024-03-15T10:20:16 UTC), for token: eyJh**86eA (io.strimzi.kafka.oauth.server.authorizer.KeycloakRBACAuthorizer.deny) [data-plane-kafka-request-handler-7]"

I tried restarting the clients and that made no difference, but I believe that when I restarted all the brokers with the new config the error seemed to stop.

I'm a little confused how restarting the brokers could fix this, yet I also saw this error after the restart.

If this is a non-recoverable error, then perhaps logging it only at DEBUG level is too low?

I have already set "connections.max.reauth.ms" = 120000
and checked on the brokers - I can see that config in /tmp/strimzi.properties.

Our config is:

          {
            "authentication" = {
              "enableOauthBearer" = false
              "enablePlain"       = true
              "jwksEndpointUri"  = "${local.keycloak_base_url}/protocol/openid-connect/certs"
              "tokenEndpointUri" = "${local.keycloak_base_url}/protocol/openid-connect/token"
              "type"             = "oauth"
              "userNameClaim"    = "preferred_username"
              "validIssuerUri"   = local.keycloak_base_url
            }

Keycloak settings - I think these are the relevant ones:

Access Token Lifespan = 5 minutes
SSO Session Max = 7 days

Do you know which timeout would be causing the token expiry (I would have assumed the token lifespan), and why the token isn't auto-refreshing? From the config option oauth.max.token.expiry.seconds it sounded like it should refresh automatically.

@mstruk commented Mar 18, 2024

A full stack trace would help here, and all the log files you can provide.

Do you see any exceptions on your clients indicating that they were unable to obtain a new access token during re-login? Do you see any other errors or warnings in your broker logs? Your connections.max.reauth.ms looks like it's set at 2 minutes. That's quite short. If Keycloak issues tokens with 5 minutes expiry but the broker requires clients to re-authenticate every 2 minutes, then some dynamic might occur where the client (which re-logins based on the 5 minute expiry with some jitter window) authenticates to the broker with a token that has very little validity time left. Just guessing.

A full DEBUG log from broker would be required to try and get to the root of the problem.

@mstruk commented Mar 18, 2024

One thing I just noticed in your configuration ... You have enableOauthBearer = false, and enablePlain = true.

I wonder if it has something to do with your issue. What it means is that there is no re-login happening on the client, since clients don't use SASL_OAUTHBEARER to connect to Kafka, but use SASL_PLAIN and send client id and secret as username and password.

Are you by any chance configuring your clients with access tokens rather than secrets?
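
For context, a minimal sketch of the two client-side variants for OAuth over PLAIN - the bootstrap address, client id and secret below are placeholders, and the "$accessToken:" password form is only an assumption based on the strimzi-kafka-oauth README:

import java.util.Properties;

// Minimal sketch of an OAuth over PLAIN client configuration - all values are placeholders.
public final class OauthOverPlainClientConfig {

    static Properties clientProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "my-cluster-kafka-bootstrap:9094"); // placeholder
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        // Normal case: service account client id and secret as username/password.
        // The broker exchanges the secret for a token against its configured tokenEndpointUri.
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
                + "username=\"kafka-all-access\" "          // client id (placeholder)
                + "password=\"<client-secret>\";");         // client secret (placeholder)
        // Variant to avoid: a pre-obtained access token as the password (the "$accessToken:"
        // prefix is an assumption based on the strimzi-kafka-oauth README). Such a token is
        // never refreshed, so the session starts failing once it expires.
        // props.put("sasl.jaas.config",
        //     "org.apache.kafka.common.security.plain.PlainLoginModule required "
        //         + "username=\"kafka-all-access\" password=\"$accessToken:<token>\";");
        return props;
    }
}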

@msillence (Author)

One thing I just noticed in your configuration ... You have enableOauthBearer = false, and enablePlain = true.

I wonder if it has something to do with your issue. What it means is that there is no re-login happening on the client, since clients don't use SASL_OAUTHBEARER to connect to Kafka, but use SASL_PLAIN and send client id and secret as username and password.

Are you by any chance configuring your clients with access tokens rather than secrets?

The clients are using the secrets from the service account; there are no access tokens in our configuration.

@msillence (Author)

A full stack trace would help here, and all the log files you can provide.

Do you see any exceptions on your clients indicating that they were unable to obtain a new access token during re-login? Do you see any other errors or warnings in your broker logs? Your connections.max.reauth.ms looks like it's set at 2 minutes. That's quite short. If Keycloak issues tokens with 5 minutes expiry but the broker requires clients to re-authenticate every 2 minutes then maybe some dynamic might possibly occur where the client (which re-logins based on the 5 minute expiry with some jitter window) might authenticate to the broker with a token that has very short validity time left. Just guessing.

A full DEBUG log from broker would be required to try and get to the root of the problem.

I was hoping that making the max reauth significantly shorter would avoid any race conditions, compared to setting it to the same value as the token lifespan. Are there any recommendations on what to do here - should I use 4 minutes?

@msillence (Author)

A full stack trace would help here, and all the log files you can provide.

Do you see any exceptions on your clients indicating that they were unable to obtain a new access token during re-login? Do you see any other errors or warnings in your broker logs? Your connections.max.reauth.ms looks like it's set at 2 minutes. That's quite short. If Keycloak issues tokens with 5 minutes expiry but the broker requires clients to re-authenticate every 2 minutes then maybe some dynamic might possibly occur where the client (which re-logins based on the 5 minute expiry with some jitter window) might authenticate to the broker with a token that has very short validity time left. Just guessing.

A full DEBUG log from broker would be required to try and get to the root of the problem.

Sadly, debug wasn't running at the beginning of the issue, and when I turned it on the logs were flooded with these failure messages until it resolved itself. It looks like I have around 30 minutes of these failure messages.
Looking at one example, there are 37 debug messages for the same token (in one second). Searching for that token, it doesn't appear in the logs apart from those deny-token-expired messages.

The messages in the log start and end with the same message; there is no token initiation message.
Could one of the other brokers be using the old token against this restarted broker? Otherwise I would expect to see the token in the logs.

I've left debug on for KeycloakRBACAuthorizer; would you need anything else in terms of debugging?

I'm not against supplying the logs, but there are some 11 million log messages just for KeycloakRBACAuthorizer in this time period, and from my initial analysis I'm not sure I've captured what we need to diagnose this - this time.

The client stack trace was:

figaro-connector-58d9799f94-jt249 figaro-connector {"timestamp":"2024-03-15T10:09:54.039Z","severity":"ERROR","message":"[Worker clientId=connect-1, groupId=1] Uncaught exception in herder work thread, exiting: ","logging.googleapis.com/sourceLocation":{"function":"org.apache.kafka.connect.runtime.distributed.DistributedHerder.run"},"logging.googleapis.com/insertId":"54","_exception":{"class":"org.apache.kafka.connect.errors.ConnectException","message":"Failed to list offsets for topic partitions.","stackTrace":"org.apache.kafka.connect.errors.ConnectException: Failed to list offsets for topic partitions.
	at org.apache.kafka.connect.util.TopicAdmin.retryEndOffsets(TopicAdmin.java:726) ~[connect-runtime-3.5.1.jar:?]
	at org.apache.kafka.connect.util.KafkaBasedLog.readEndOffsets(KafkaBasedLog.java:514) ~[connect-runtime-3.5.1.jar:?]
	at org.apache.kafka.connect.util.KafkaBasedLog.readToLogEnd(KafkaBasedLog.java:466) ~[connect-runtime-3.5.1.jar:?]
	at org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:281) ~[connect-runtime-3.5.1.jar:?]
	at org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:266) ~[connect-runtime-3.5.1.jar:?]
	at org.apache.kafka.connect.runtime.Worker.start(Worker.java:208) ~[connect-runtime-3.5.1.jar:?]
	at org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:150) ~[connect-runtime-3.5.1.jar:?]
	at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:362) [connect-runtime-3.5.1.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: org.apache.kafka.connect.errors.ConnectException: Not authorized to get the end offsets for topic 'as400_connect_offsets' on brokers at poa-kafka-external-bootstrap.strimzi.svc.cluster.local:9094
	at org.apache.kafka.connect.util.TopicAdmin.endOffsets(TopicAdmin.java:672) ~[connect-runtime-3.5.1.jar:?]
	at org.apache.kafka.connect.util.TopicAdmin.lambda$retryEndOffsets$6(TopicAdmin.java:718) ~[connect-runtime-3.5.1.jar:?]
	at org.apache.kafka.connect.util.RetryUtil.retryUntilTimeout(RetryUtil.java:87) ~[connect-runtime-3.5.1.jar:?]
	at org.apache.kafka.connect.util.RetryUtil.retryUntilTimeout(RetryUtil.java:56) ~[connect-runtime-3.5.1.jar:?]
	at org.apache.kafka.connect.util.TopicAdmin.retryEndOffsets(TopicAdmin.java:717) ~[connect-runtime-3.5.1.jar:?]
	... 12 more
Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TopicAuthorizationException: Topic authorization failed.
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073) ~[?:?]
	at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:165) ~[kafka-clients-3.5.1.jar:?]
	at org.apache.kafka.connect.util.TopicAdmin.endOffsets(TopicAdmin.java:665) ~[connect-runtime-3.5.1.jar:?]
	at org.apache.kafka.connect.util.TopicAdmin.lambda$retryEndOffsets$6(TopicAdmin.java:718) ~[connect-runtime-3.5.1.jar:?]
	at org.apache.kafka.connect.util.RetryUtil.retryUntilTimeout(RetryUtil.java:87) ~[connect-runtime-3.5.1.jar:?]
	at org.apache.kafka.connect.util.RetryUtil.retryUntilTimeout(RetryUtil.java:56) ~[connect-runtime-3.5.1.jar:?]
	at org.apache.kafka.connect.util.TopicAdmin.retryEndOffsets(TopicAdmin.java:717) ~[connect-runtime-3.5.1.jar:?]

@mstruk commented Mar 18, 2024

Another approach would be to replicate the issue in a smaller setup - only one client.

Setting io.strimzi logger to DEBUG would be ideal in terms of logs.

@msillence (Author)

It happened again today. I can find an example of a token whose first log entry is at 09:09:48, when it was working, and which then expired at 09:11:56.
I've changed connections.max.reauth.ms to 4 minutes.
I'm not sure how I will extract the logs in any meaningful way, as they are so big even with just this one class enabled at debug.

first entry for this token:

"2024-03-19 09:09:48,936 DEBUG Got grants for 'OAuthKafkaPrincipal(User:service-account-kafka-all-access, groups: null, session: 1113386382, token: eyJh**NIZA)': [{"scopes":["Describe","Alter","IdempotentWrite","ClusterAction","Create","DescribeConfigs","AlterConfigs"],"rsid":"c817583a-9a9d-4acf-9c3c-111f51e06cf4","rsname":"Cluster:"},{"scopes":["Write","Describe","Read","Alter","Delete","Create","DescribeConfigs","AlterConfigs"],"rsid":"cc0009ab-6f78-4737-993e-adf09e67dd1e","rsname":"Topic:"},{"scopes":["Describe","Read","Delete"],"rsid":"781a7395-fe5b-4a4a-a372-f6a4bf6d0797","rsname":"Group:"},{"scopes":["Write","Describe"],"rsid":"6282daff-d792-457b-8eb8-b2760b88d82a","rsname":"TransactionalId:"}] (io.strimzi.kafka.oauth.server.authorizer.KeycloakRBACAuthorizer) [data-plane-kafka-request-handler-7]"

last working entry:
textPayload: "2024-03-19 09:11:55,534 DEBUG Authorization GRANTED - cluster: poa, user: OAuthKafkaPrincipal(User:service-account-kafka-all-access, groups: null, session: 1113386382, token: eyJh**NIZA), operation: READ, resource: TOPIC:mm2-configs.source.internal"

then failure:
textPayload: "2024-03-19 09:11:56,036 DEBUG Authorization DENIED due to token expiry - The token expired at: 1710839516000 (2024-03-19T09:11:56 UTC), for token: eyJh**NIZA (io.strimzi.kafka.oauth.server.authorizer.KeycloakRBACAuthorizer.deny) [data-plane-kafka-request-handler-5]"

There are a lot of failures for a lot of tokens.

Shouldn't the broker refresh the token when it gets this error?

@mstruk commented Mar 19, 2024

Shouldn't the broker refresh the token when it gets this error?

The way it works is that when the token expires, the broker should automatically require the Kafka client to re-authenticate (provided that connections.max.reauth.ms is set to a value greater than zero). That by itself should ensure that the situation you're seeing does not happen - requests with an expired token don't get through to being processed. But then, you are using SASL_PLAIN, where there is no notion of token and session expiration time. It didn't occur to me right away, but of course it makes perfect sense. The connections.max.reauth.ms should always be set to less than the time left until token expiry. Another possibility is that connections.max.reauth.ms is ignored and no re-authentication is ever triggered, resulting in a session with an expired token.

When in this OAuth over PLAIN mode, the token is obtained by the broker in the client's name at authentication time, but there is no mechanism to refresh it automatically - it relies on re-authentication to trigger a fresh session start, as if authenticating for the first time.

So, one option to try is to set connections.max.reauth.ms to less than token expiry.

Another option to try is to catch the exception on the client and close your current KafkaProducer / KafkaConsumer and create a new instance, which will start a new session. See: README
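
A minimal sketch of that pattern, assuming a plain consumer loop - the bootstrap address, topic, group id and the "retry forever" policy below are placeholders rather than your actual setup:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.AuthenticationException;
import org.apache.kafka.common.errors.AuthorizationException;

// Minimal sketch: recreate the consumer when an auth error suggests the session token expired.
public final class RecreateOnAuthFailure {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "my-cluster-kafka-bootstrap:9094"); // placeholder
        // plus security.protocol, sasl.mechanism and sasl.jaas.config
        // as in the OAuth over PLAIN sketch earlier in this thread (placeholders)
        props.put("group.id", "example-group");                            // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("example-topic"));                      // placeholder
        while (true) {
            try {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> System.out.println(r.value()));       // process records
            } catch (AuthenticationException | AuthorizationException e) {
                // The session is likely holding an expired token: drop the client and start a
                // fresh session, which obtains a new token when it authenticates again.
                consumer.close();
                consumer = new KafkaConsumer<>(props);
                consumer.subscribe(List.of("example-topic"));
            }
        }
    }
}

The point is only that a fresh client instance re-authenticates from scratch; production code would bound the retries and back off rather than loop forever.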

Yet another option is to enable OAUTHBEARER mode rather than using PLAIN.
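
For that last option, a minimal sketch of the client-side OAUTHBEARER settings, assuming the oauth.* options and callback handler class from the strimzi-kafka-oauth README - the endpoint, client id and secret are placeholders:

import java.util.Properties;

// Minimal sketch of client-side OAUTHBEARER settings - all values are placeholders.
public final class OauthBearerClientConfig {

    static Properties clientProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "my-cluster-kafka-bootstrap:9094"); // placeholder
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "OAUTHBEARER");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule required "
                + "oauth.token.endpoint.uri=\"https://<keycloak>/realms/<realm>/protocol/openid-connect/token\" "
                + "oauth.client.id=\"kafka-all-access\" "       // placeholder
                + "oauth.client.secret=\"<client-secret>\";");  // placeholder
        // Strimzi's login callback handler fetches the token and re-obtains one on
        // client re-login, which is the refresh behaviour PLAIN mode cannot provide.
        props.put("sasl.login.callback.handler.class",
            "io.strimzi.kafka.oauth.client.JaasClientOauthLoginCallbackHandler");
        return props;
    }
}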

@msillence (Author)

Should the connections.max.reauth.ms be set on the broker or client?

@msillence (Author)

One thing I note is that I'm using the new KRaft mode; from the readme it should be using KeycloakAuthorizer, but from the logging it looks like I'm using KeycloakRBACAuthorizer.

I see there is also oauth.max.token.expiry.seconds, but it says this would only be used to force more frequent refreshes than necessary.

@mstruk commented Mar 19, 2024

Should the connections.max.reauth.ms be set on the broker or client?

That's a broker-side setting. If using the Strimzi Operator, use maxSecondsWithoutReauthentication.

@mstruk commented Mar 19, 2024

One thing I note is that I'm using the new KRaft mode; from the readme it should be using KeycloakAuthorizer, but from the logging it looks like I'm using KeycloakRBACAuthorizer.

You should see both in the log because KeycloakAuthorizer instantiates and delegates to KeycloakRBACAuthorizer. At the start of Kafka broker log there is a full dump of the strimzi.properties file. In there the authorizer installed should be KeycloakAuthorizer.

I see there is also the oauth.max.token.expiry.seconds but it says this would only be used to force more frequent refreshes than necessary.

That's a client-side setting for OAUTHBEARER which would not have any effect for you since your clients use PLAIN.

@msillence (Author)

One thing I note is that I'm using the new KRaft mode; from the readme it should be using KeycloakAuthorizer, but from the logging it looks like I'm using KeycloakRBACAuthorizer.

You should see both in the log because KeycloakAuthorizer instantiates and delegates to KeycloakRBACAuthorizer. At the start of Kafka broker log there is a full dump of the strimzi.properties file. In there the authorizer installed should be KeycloakAuthorizer.

Great, that is as you say.

I see there is also the oauth.max.token.expiry.seconds but it says this would only be used to force more frequent refreshes than necessary.

That's a client-side setting for OAUTHBEARER which would not have any effect for you since your clients use PLAIN.

OK thanks.

We have another 3 deployments of Strimzi and they have not experienced this issue as far as I know, but this is the only deployment using KRaft.

We are using the operator; I'll add maxSecondsWithoutReauthentication to the authentication section:

            "authentication" = {
              "enableOauthBearer" = false
              "enablePlain"       = true
              # see https://login.<schema-nickname>.platform-of-apps.jhc.uk/auth/realms/platform/.well-known/openid-configuration
              "jwksEndpointUri"  = "${local.keycloak_base_url}/protocol/openid-connect/certs"
              "tokenEndpointUri" = "${local.keycloak_base_url}/protocol/openid-connect/token"
              "type"             = "oauth"
              "userNameClaim"    = "preferred_username"
              "validIssuerUri"   = local.keycloak_base_url
              "maxSecondsWithoutReauthentication" = 120
            }

@mstruk commented Mar 19, 2024

We have another 3 deployments of Strimzi and they have not experienced this issue as far as I know, but this is the only deployment using KRaft.

That's an interesting observation.

@msillence (Author)

What's the difference between maxSecondsWithoutReauthentication and connections.max.reauth.ms?
FYI, no token timeouts so far since adding it.

@mstruk commented Mar 20, 2024

The maxSecondsWithoutReauthentication option is part of the Strimzi Kafka CR. It is translated into the Kafka broker's configuration property connections.max.reauth.ms (multiplied by 1000 to convert seconds into milliseconds). The operator sets it on a per-listener basis.

@msillence (Author)

The issue is back even with the maxSecondsWithoutReauthentication setting - there must be some race condition here.

@mstruk commented Mar 26, 2024

Have you tried disabling PLAIN mode, and only using OAUTHBEARER on your clients?
Can you reproduce it with small timeouts and a small number of clients, relatively quickly?

I can only troubleshoot further if I can reproduce it locally, so you have to be able to reliably reproduce it and provide instructions on how to do so. Ideally, you would put together a reproducer - an example setup that anyone can start up locally and see the failures happen.

@msillence (Author)

It might be that this only happens under load, and that there isn't a simple, quiet setup that reproduces it.
A distinct possibility is that Keycloak is overloaded; something happens where Keycloak fails to respond, and that results in the token not being replaced.
