-
Notifications
You must be signed in to change notification settings - Fork 564
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Kip 368 (SASL reauthentication) #13822
Conversation
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/37992#018ae277-6921-4461-937f-2137dc806f2b |
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/38068#018aed6f-ef00-4c36-b28b-b07e601cb990 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, just some minor comments.
src/v/config/configuration.cc
Outdated
@@ -2513,7 +2513,14 @@ configuration::configuration() | |||
"The sample period for the CPU profiler", | |||
{.needs_restart = needs_restart::no, .visibility = visibility::user}, | |||
100ms, | |||
{.min = 1ms}) {} | |||
{.min = 1ms}) | |||
, connections_max_reauth_ms( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Should this be
kafka_conn...
? - Is this specific to SASL in a way that makes sense to express in the name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Named to match the kafka equivalent, but the more I poke around that doesn't seem necessary or expected at all. It is indeed specific to SASL, so maybe something like kafka_sasl_max_reauth_ms
src/v/config/configuration.cc
Outdated
, connections_max_reauth_ms( | ||
*this, | ||
"connections_max_reauth_ms", | ||
"If gtz, the maximum time between client reauthentication", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's write this out like we might expect to see in documentation (e.g. avoiding short hand gtz).
"The maximum time between Kafka client reauthentication. <maybe a short sentence about what/why>. Reauthentication is disabled if the value is 0".
src/v/config/configuration.cc
Outdated
"If gtz, the maximum time between client reauthentication", | ||
{.needs_restart = needs_restart::yes, .visibility = visibility::user}, | ||
0ms, | ||
{.min = 0ms}) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also common to use std::nullopt to disable a feature, but I don't have a strong opinion about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, sort of aping the kafka version. I think null is more clear in this case though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, sort of aping the kafka version. I think null is more clear in this case though.
ahh yeh makes sense. sometimes i go ping product and get their opinion on stuff like this.
vlog( | ||
klog.trace, | ||
"SASL session_lifetime_ms: {}", | ||
sasl() ? sasl()->session_lifetime_ms() : -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is connection oriented option, is it necessary to log it on its own on a per-request granularity? perhaps there is another request-level trace message we could expand? no specific suggestion right now--maybe this is the best we can do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Convenient during development as a countdown, but in retrospect I don't think we need a trace log for this at all.
src/v/kafka/server/requests.cc
Outdated
@@ -249,6 +249,14 @@ process_result_stages process_request( | |||
request_context&& ctx, | |||
ss::smp_service_group g, | |||
const session_resources& sres) { | |||
auto& key = ctx.header().key; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need for a mutable reference here, key is a small integer just grab a copy?
src/v/kafka/server/requests.cc
Outdated
unlikely(ctx.sasl() && ctx.sasl()->complete() && ctx.sasl()->expired()) | ||
&& key != sasl_handshake_handler::api::key | ||
&& key != sasl_authenticate_handler::api::key) { | ||
throw sasl_session_expired_exception("Session expired"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the information is handy, it'd probably worth adding some context to this like how long the expiration was set to, and maybe the client id if isn't already logged at a higher level (or maybe higher level is where that extra context should go).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
src/v/kafka/server/server.cc
Outdated
sasl_authenticate_response_data data{ | ||
.error_code = error_code::none, | ||
.error_message = std::nullopt, | ||
.auth_bytes = std::move(result.value()), | ||
.session_lifetime_ms = ctx.sasl()->session_lifetime_ms(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's generally preferred to maintain std::chrono types as long as possible, so having session_lifetime_ms
return a std::chrono::duration and then applying the serialization logic for the response here (ie std::chrono::duration_cast<ms>(session_lifetime()).count()
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
src/v/kafka/server/server.cc
Outdated
// Only initialise sasl state if sasl is enabled | ||
auto sasl = authn_method == config::broker_authn_method::sasl | ||
? std::make_optional<security::sasl_server>( | ||
security::sasl_server::sasl_state::initial) | ||
security::sasl_server::sasl_state::initial, | ||
conn_max_reauth_ms) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an argument for making this max parameter be a config::binding<ms>
is that since the configuration is locked in when the connection is created, changes to the max reauth only take affect for new connections. it's probably not a big deal, but in principle someone might want to lower the max reauth but wouldn't have a way to forcefully shutdown a client connection without restarting redpanda.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Config is currently set up to require a restart, but again that's just to match what kafka offers. I figure the reasoning there is that a live misconfiguration could be fairly disruptive to connected clients, but the scenario you're describing makes sense from a usability standpoint. Not sure about this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this one.
I think it is probably ok as-is. What I'd do is just give a little thought to the possible scenarios that might occur and if any of those scenarios have the property that we wish it were live updatable in the context of an incident.
src/v/kafka/server/requests.cc
Outdated
// This is a client-driven reauthentication | ||
if (unlikely( | ||
ctx.sasl() && ctx.sasl()->complete() | ||
&& key == sasl_handshake_handler::api::key)) { | ||
vlog( | ||
klog.debug, | ||
"SASL reauthentication detected - resetting authn server"); | ||
ctx.sasl()->reset(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could combine this with the conditional above?
if (sasl.complete()) {
if (handshake()) {
reset()
} else if (expired && !authenticate()) {
....
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. Had a separate state in a previous iteration I think.
assert ( | ||
any(reauths[n.name] > 0 for n in self.redpanda.nodes) | ||
), f"Expected client reauth on some broker...Reauths: {json.dumps(reauths, indent=1)}" | ||
|
||
exps = {} | ||
for node in self.redpanda.nodes: | ||
exps[node.name] = self.redpanda.count_log_node( | ||
node, "Session expired") | ||
|
||
assert ( | ||
all(exps[n.name] == 0 for n in self.redpanda.nodes) | ||
), f"Client should reauth before session expiry...Expirations: {json.dumps(exps, indent=1)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these two separate tests, in that exps
could be empty because full session expiry and connection shutdown isn't tested explicitly, so the final assert condition is trivially true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way this works is that the SASLAuthenticateRequest
returns a lifetime field for the time to session expiry, then librdkafka
will always try to reauthenticate once 90% of that lifetime has elapsed.
So this is asserting (albeit obliquely) that the broker didn't lie about session expiry and didn't tear tings down prematurely.
I don't see an obvious way to explicitly test this type of connection shutdown, since configuring connections_max_reauth_ms
--> lifetime_ms
in response --> client proactively reauths @ T + 0.9 * lifetime_ms
.
Thanks for the review @dotnwat. Several decisions in this PR were made for compatibility with the corresponding kafka config (naming, semantics, etc). I may have made that up as a goal though...can you confirm one way or the other? In a general sense, not necessarily for this feature. |
I think this is a valid reason for making decisions most of the time. |
be39aa6
to
9cac240
Compare
force push to add a kerberos test |
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/38670#018b1c00-a378-4738-af66-f3c83bdaedc4 |
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/38670#018b1bfd-8cdc-4a38-8ecf-6815c57e5a15 |
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/38670#018b1c00-a371-474a-8b83-f61de76c4c9b |
force push to take up ducktape version revert (merge conflict) |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40256#018b8c90-a295-4926-ad3a-480972393398: "rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=off" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40256#018b8c90-a28c-486b-b5c0-fc9042843fd3: "rptest.tests.cluster_features_test.FeaturesSingleNodeTest.test_get_features" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40256#018b8c90-a28f-41e1-80c0-2e4af4861d18: "rptest.tests.cluster_features_test.FeaturesSingleNodeUpgradeTest.test_upgrade" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
d51d432
force push for a last bit of cleanup in ducktape tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40323#018b9151-268c-4172-8844-d7635d91f304: "rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves.cleanup_policy=delete" |
SCRAM, GSSAPI, OIDC Reauth tests are added alongside other integration tests for the associated authn mechanism. sasl_reauth_test.py contains tests more generic to reauth configuration along with some utility functions for accessing sasl metrics. This commit also adds a `KrbClient.produce` which sticks a little bit of python onto the Kerberos node and executes it. This bit of python configures a GSSAPI-authenticated producer publishes a specified number of records. This commit also makes some minor changes to KeycloakService and PythonLibrdkafka to account for token lifetime (configuration and challenge tracking). Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
force push for OIDC test timing. Not necessarily fixed; needs a longer repeat run |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40345#018b9223-3b08-4c15-b08e-88bff54fade0: "rptest.tests.e2e_shadow_indexing_test.EndToEndShadowIndexingTest.test_recover_after_delete_records" |
new failures detected in https://buildkite.com/redpanda/redpanda/builds/40345#018b9223-3b0a-4b91-a8e4-d0bd7e2a4b4e: "rptest.tests.memory_stress_test.MemoryStressTest.test_fetch_with_many_partitions.memory_share_for_fetch=0.7" |
/ci-repeat 1 |
/ci-repeat 1 |
"new" CI Failures seem to have issues opened already (according to pandatriage) |
/ci-repeat 1 |
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/40424#018b978c-0af1-4156-a63a-38e47d7c4bda |
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/40424#018b978c-0ae8-4aab-aff7-1bbaab348f37 |
This PR implements most of KIP-368.
This PR also bumps kafka client ducktape dependencies to a version that supports reauthentication.
Fixes https://github.com/redpanda-data/core-internal/issues/775
TODO:
CI not currently supported. Needs a manual smoke test. Results to followUpdate: Some challenges here:Had some issues setting up ADDS last week - as yet unresolvedDucktape/kerberos bits are working fine, but we don't have a client available that supports both GSSAPI and kip-368Backports Required
Release Notes
Features
connections_max_reauth_ms > 0
.